Learning notes: crawling Maoyan (Cat's Eye) Top 100 movie information
2022-07-20 10:11:00 【p_ hat-trick】
Get the page source
Look up any User-Agent string online and send it in the request headers, so that the request looks like it comes from a browser.
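import requests

try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }
    response = requests.get(url, headers=headers)  # url is the page to fetch
except requests.RequestException:
    response = None  # the request failed (the original notes omit this branch)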
Parsing with a regular expression
# Raw strings (r'...') keep \d from being treated as a string escape.
pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                     r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                     r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
- The parenthesized groups are what re.findall returns, so wrap only the information you need. Use the non-greedy match .*? so that a match does not run too far and swallow the fields that follow.
- yield: used instead of return. It turns the parsing function into a generator that hands back one movie record at a time, which keeps the returned data tidy.
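The two points above can be sketched together as follows. This is only an illustration: SAMPLE_HTML is a hypothetical, simplified stand-in for one entry on the page (not copied from the live site), and parse_one_page and the dictionary keys are names assumed here.

```python
import re

# Hypothetical, simplified markup for a single <dd> entry (an assumption,
# not the live page source).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" alt="Example Movie">
  <p class="name"><a href="/films/1">Example Movie</a></p>
  <p class="star">
      Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as in the notes; re.S lets "." match newlines as well.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)


def parse_one_page(html):
    """Yield one dict per movie; findall returns one tuple per <dd> match,
    with one item per parenthesized group."""
    for index, image, title, star, release, integer, fraction in PATTERN.findall(html):
        yield {
            'index': index,
            'image': image,
            'title': title,
            'actor': star.strip(),       # drop surrounding whitespace/newlines
            'time': release.strip(),
            'score': integer + fraction,  # e.g. "9." + "5"
        }


movies = list(parse_one_page(SAMPLE_HTML))
print(movies[0]['title'], movies[0]['score'])
```

Because parse_one_page is a generator, the caller can loop over it directly and handle each record as it is produced, without building the whole list first.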
Writing to a file
json.dumps converts a dictionary to a string. (Note: json.loads does the reverse and converts a string back to a dictionary.)
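A minimal sketch of this step, assuming one JSON string per line in a text file. The names write_to_file and read_records, the file name result.txt, and the sample record are all hypothetical; ensure_ascii=False is an added choice here so that Chinese titles are stored as readable text.

```python
import json
import os
import tempfile


def write_to_file(record, path):
    """Append one dict to `path` as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    """Read the file back, turning each non-empty line into a dict."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record written to a temporary directory, for demonstration only.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_to_file({'index': '1', 'title': '示例电影', 'score': '9.5'}, demo_path)
restored = read_records(demo_path)
```

Opening the file in append mode ('a') means each call adds a line, so records from later pages do not overwrite earlier ones.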
Crawling across pages
Each page of the Maoyan Top 100 board shows only 10 movies, and the URL changes as follows when you turn the page:
http://maoyan.com/board/4?
Next page ->
http://maoyan.com/board/4?offset=10
In other words, each increase of 10 in offset moves forward by one page. Use this pattern to crawl:
for i in range(10):
    main(offset=i * 10)
    print(str(i*10 + 1) + " to " + str(i*10 + 10) + " got!")
    time.sleep(1)  # Avoid access restrictions caused by high access frequency
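The offset pattern can also be written as a small helper that builds the URL for each page. This is a sketch only: page_url and BASE_URL are names assumed here, and it assumes that offset=0 returns the same first page as the URL with no offset.

```python
from urllib.parse import urlencode

BASE_URL = 'http://maoyan.com/board/4'


def page_url(page, per_page=10):
    """Return the board URL for a zero-based page number."""
    # Page 0 -> offset=0, page 1 -> offset=10, ..., page 9 -> offset=90
    return BASE_URL + '?' + urlencode({'offset': page * per_page})


urls = [page_url(i) for i in range(10)]
print(urls[0])
print(urls[-1])
```

Ten pages of 10 movies each cover the full Top 100, which is why the loop runs over range(10).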
Although most of this is copied directly from the examples in the book, it still gave me a real sense of achievement!