Targeted scraping of Taobao product names and prices (based on Song Tian's course)
2022-07-22 10:15:00 【塞班呢】
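Song Tian's original code can no longer scrape Taobao, because Taobao has since upgraded its anti-scraping measures.
The fix: replace the cookie in the request headers with the cookie from your own Taobao session (every user's cookie value is different).
For the detailed steps, see the post "通过requests库re库进行淘宝商品爬虫爬取(对中国大学mooc嵩天老师爬虫进行修改)" on Omann's CSDN blog, which modifies Song Tian's China University MOOC scraper using the requests and re libraries.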
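Because the cookie is personal and expires, it can be convenient to keep it out of the script. The following is a minimal sketch, not part of the original post: the helper `build_headers` and the environment variable name `TAOBAO_COOKIE` are hypothetical, and the shortened header set is only an assumption about what a request needs.

```python
import os

# Hypothetical variable name, chosen for this sketch only.
COOKIE_ENV_VAR = "TAOBAO_COOKIE"


def build_headers(cookie=None):
    """Return request headers that carry the caller's own Taobao cookie."""
    if cookie is None:
        # Fall back to the environment so the cookie never lives in the source file.
        cookie = os.environ.get(COOKIE_ENV_VAR, "")
    if not cookie:
        raise ValueError(f"set {COOKIE_ENV_VAR} or pass cookie= explicitly")
    return {
        "user-agent": "Mozilla/5.0",
        "accept-language": "zh-CN,zh;q=0.9",
        "cookie": cookie,
    }
```

The returned dict can be passed as `headers=` to `requests.get` in the same way as the hard-coded `header` dict in the script.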
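# -*- coding: utf-8 -*-
"""
Created on Mon Oct 4 00:06:08 2021
@author: saiban
"""
# Song Tian's original code can no longer scrape Taobao because its anti-scraping measures were upgraded.
# We need to replace the referer and cookie in the headers with the values from our own Taobao session.
import requests
import re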
def getHtmlText(url):  # fetch one page and return its text
    try:
        header = {
            'authority': 's.taobao.com',
            'cache-control': 'max-age=0',
            'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'same-origin',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-user': '?1',
            'sec-fetch-dest': 'document',
            'accept-language': 'zh-CN,zh;q=0.9',
            # Replace this value with the cookie from your own logged-in session.
            'cookie': 'cna=w7y7GYch4kMCAasijBL1tcnw; xlly_s=1; t=d77b430a77fdf76d9c17f2806b57c2ff; hng=CN%7Czh-CN%7CCNY%7C156; thw=cn; _m_h5_tk=5ccf9d80baa24976d1bd97719fb7d377_1633327797887; _m_h5_tk_enc=2d2f907d5f3a86f04dbbf2aa73897ecd; _samesite_flag_=true; cookie2=16bd808d5d4c077a1e5f8e3abc00787e; _tb_token_=e3e5e0fd3333d; sgcookie=E1004cVmZPbUKd%2Bo2Y6ewiI7lmCD2rFerQ9K0Rx3PgQSSoYp%2FWW8LOvfo4oThh7eNLIEFm5uGhkQ9IUsgsWnv4%2BYBvCt6Z2xxPQJp498ChIaTCg%3D; unb=2202345247497; uc3=nk2=F5RBx%2BGr84TAocRa&lg2=URm48syIIVrSKA%3D%3D&vt3=F8dCujaCTG0Yn%2BEGbMY%3D&id2=UUphyItuGYNeDyMxrA%3D%3D; csg=fa72b214; lgc=tb4216148421; cancelledSubSites=empty; cookie17=UUphyItuGYNeDyMxrA%3D%3D; dnk=tb4216148421; skt=a19b5b0102a7b5ee; existShop=MTYzMzM1MjE0MQ%3D%3D; uc4=nk4=0%40FY4KoqHi383HYtpSM0RDmlOwk4iA8Tg%3D&id4=0%40U2grE1hEVww3EVoATgMbl4PMiyTEeIZt; tracknick=tb4216148421; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; sg=17e; _nk_=tb4216148421; cookie1=W8743JHOqTkZp4234GIqb8W2j3pRPRi2%2Ftn1Y16wf2Y%3D; enc=lEu3iTdRRzo2bKJ%2FSRTJ7W3KJmqkZoqJ8qTWcN7Fqxv4oVm4619kntnz84TzJb6SnF8AjFC43wovrgqFDVvISLE2T0wQC8D4h3ZzkSjIpSs%3D; mt=ci=0_1; uc1=existShop=false&pas=0&cookie21=Vq8l%2BKCLjhZM&cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D&cookie14=Uoe3dP4mSshcOw%3D%3D&cookie15=WqG3DMC9VAQiUQ%3D%3D; JSESSIONID=338CCA02A39CF025F71F1BF78290B3CC; tfstk=c_YCB2ijKJ2I0Vnz8BGaUpf5nnb5aLH1sD6pO3g3IVJs9xRcBsmQutMG732Gz_C1.; l=eBxAgJEmghiX7hpyBO5Cnurza77OFIRbzPVzaNbMiInca6OA9FiEjNCLQg5JWdtjgt5xNFtzh0NBGRE6SuzLRxGjL77kRs5mpI96Re1..; isg=BD09yvnBAAYXt6RpoStY_zgXTJk32nEsV3r_rv-C1hTDNl9owymX_Iuk4Gpws4nk',
        }
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("Request failed")
        return ""  # return an empty page so the caller can still run its regexes
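def parsePage(ilt, html):  # parse each fetched page
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for p, t in zip(plt, tlt):
            # split(':', 1) keeps titles that contain a colon intact;
            # strip('"') removes the surrounding quotes without calling eval on scraped text
            price = p.split(':', 1)[1].strip('"')
            title = t.split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except Exception:
        print("Parsing failed")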
def printGoodList(ilt):  # print the collected results
    tplt = '{:4}\t{:8}\t{:16}\t'
    print(tplt.format('No.', 'Price', 'Product name'))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))
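def main():
    goods = 's书包'  # search keyword
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        try:
            # 44 * i is the offset of page i, assuming 44 results per page
            url = start_url + '&pnum=' + str(44 * i)
            html = getHtmlText(url)
            parsePage(infolist, html)
        except Exception:
            continue
    printGoodList(infolist)
if __name__ == '__main__':
    main()  # the code above only defines main(); this call actually runs the program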
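The quote-stripping step in parsePage can also be done with `json.loads`, which decodes a quoted JSON string (including escaped quotes) without evaluating it as Python. This is a sketch under assumptions rather than the author's method: the helper name `extract_items` and the sample string are made up for illustration, and it assumes the page still embeds `"raw_title"` and `"view_price"` as JSON string fields.

```python
import json
import re


def extract_items(html):
    """Return [price, title] pairs found in a search-result page."""
    # Capture each value together with its quotes so json.loads can decode it.
    prices = re.findall(r'"view_price":("[\d.]*")', html)
    # The title pattern allows backslash-escaped characters inside the string.
    titles = re.findall(r'"raw_title":("(?:[^"\\]|\\.)*")', html)
    return [[json.loads(p), json.loads(t)] for p, t in zip(prices, titles)]


# Made-up sample resembling the embedded data; not real Taobao output.
sample = (
    '{"raw_title":"书包: 大容量","view_price":"59.00"},'
    '{"raw_title":"Kids \\"mini\\" bag","view_price":"128.50"}'
)
print(extract_items(sample))
```

Titles that contain a colon or an escaped quote come through intact, which a plain `split(':')` would truncate.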
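A note on the site "Convert curl command syntax to Python requests, Ansible URI, browser fetch, MATLAB, Node.js, R, PHP, Strest, Go, Dart, Java, JSON, Elixir, and Rust code": what we copy from the browser's Inspect (developer tools) panel is a curl command, and this site converts curl syntax into Python, Node.js, PHP, R, Go, Rust, Elixir, Java, MATLAB, Ansible URI, Strest, Dart, JSON, and other formats. That is how the header dict in the script above was generated.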
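To make the curl-to-Python mapping concrete, here is a rough sketch of the part that matters for this scraper: turning the `-H 'name: value'` options of a copied curl command into a headers dict. It is not the converter site's implementation; the function name `headers_from_curl` and the sample command are hypothetical, and it assumes bash-style quoting in the copied command.

```python
import shlex


def headers_from_curl(command):
    """Collect the -H/--header values of a curl command into a dict."""
    tokens = shlex.split(command)  # honours the shell quoting around each header
    headers = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip().lower()] = content.strip()
    return headers


# Made-up command in the shape the browser produces; the cookie is a placeholder.
curl_cmd = (
    "curl 'https://s.taobao.com/search?q=test' "
    "-H 'accept-language: zh-CN,zh;q=0.9' "
    "-H 'cookie: a=1; b=2'"
)
print(headers_from_curl(curl_cmd))
```

The resulting dict has the same shape as the `header` dict passed to `requests.get` in the script.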
Saiban is learning web scraping... and is worn out.