Scrapy: from installation to scraping images from jandan.net
Download address for prebuilt Windows wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs/
pip install wheel
pip install lxml
pip install pyopenssl
pip install Twisted
pip install pywin32
pip install scrapy
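On Windows, if pip install Twisted or pywin32 fails to compile, download the matching prebuilt .whl from the address above and install it locally. The filename below is hypothetical; it must match your Python version and architecture:
pip install Twisted-18.7.0-cp36-cp36m-win_amd64.whl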
scrapy startproject jandan  # create the project
cd jandan
cd jandan
items.py: defines the data items we store
pipelines.py: the item pipeline file
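For orientation, the layout generated by scrapy startproject looks like this (standard Scrapy scaffolding):

jandan/
    scrapy.cfg            # deploy configuration
    jandan/
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py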
Because jandan.net has anti-scraping measures, we need to take some countermeasures.
settings.py:
ROBOTSTXT_OBEY = False    # do not obey robots.txt
DOWNLOAD_DELAY = 2        # download delay
DOWNLOAD_TIMEOUT = 15     # download timeout
COOKIES_ENABLED = False   # disable cookies

DOWNLOADER_MIDDLEWARES = {
    # random request header
    'jandan.middlewares.RandomUserAgent': 100,
    # random proxy IP
    'jandan.middlewares.RandomProxy': 200,
}

# user-agent pool
USER_AGENTS = [
    # Maxthon
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    # Firefox
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    # Chrome
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
]

# proxy IP list
PROXIES = [
    # proxies without a password
    {"ip_port": "119.177.90.103:9999", "user_passwd": ""},
    {"ip_port": "101.132.122.230:3128", "user_passwd": ""},
    # proxy with a password
    # {"ip_port": "123.139.56.238:9999", "user_passwd": "root:admin"}
]

# item pipeline (uncomment to enable)
ITEM_PIPELINES = {
    'jandan.pipelines.JandanPipeline': 300,
}

IMAGES_STORE = "images"
middlewares.py:

import random
import base64

from jandan.settings import USER_AGENTS
from jandan.settings import PROXIES


class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)


class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if not proxy["user_passwd"]:
            # proxy without credentials
            request.meta["proxy"] = "http://" + proxy["ip_port"]
        else:
            # base64 encoding takes a bytes object; in Python 3 str is unicode,
            # so encode first; the result is bytes again
            base64_userpasswd = base64.b64encode(proxy["user_passwd"].encode())
            request.meta["proxy"] = "http://" + proxy["ip_port"]
            # the header value is built by string concatenation, so decode back
            request.headers["Proxy-Authorization"] = "Basic " + base64_userpasswd.decode()
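As a quick sanity check of the Basic auth encoding above, you can run this in a Python shell, using the commented-out credential pair from settings.py as input:

>>> import base64
>>> base64.b64encode("root:admin".encode())
b'cm9vdDphZG1pbg=='
>>> "Basic " + base64.b64encode("root:admin".encode()).decode()
'Basic cm9vdDphZG1pbg=='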
items.py:

import scrapy


class JandanItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()
scrapy genspider -t crawl dj jandan.net  # create a CrawlSpider-based spider
This automatically creates dj.py under spiders/. The page content is generated by JavaScript, so BeautifulSoup is used to locate the elements and extract the data, as in the sketch below.
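The post does not show the spider itself, so here is a minimal sketch of what spiders/dj.py could look like. The start URL, the pagination pattern, and the view_img_link class are assumptions about jandan.net's markup at the time and may need adjusting against the live page:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

from jandan.items import JandanItem


class DjSpider(CrawlSpider):
    name = "dj"
    allowed_domains = ["jandan.net"]
    start_urls = ["http://jandan.net/ooxx"]  # assumed picture-board entry point

    # follow pagination links such as /ooxx/page-42 (assumed URL pattern)
    rules = (
        Rule(LinkExtractor(allow=r"ooxx/page-\d+"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # the image markup is produced by JavaScript, so locate it with BeautifulSoup
        soup = BeautifulSoup(response.text, "lxml")
        for a in soup.select("a.view_img_link"):  # hypothetical class name
            url = a.get("href", "")
            if url.startswith("//"):  # normalize protocol-relative links
                url = "http:" + url
            item = JandanItem()
            item["url"] = url
            item["name"] = url.split("/")[-1].split(".")[0]
            yield item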
pipelines.py
import json
import os

import requests
from scrapy.conf import settings


class JandanPipeline(object):
    # save items to a JSON file
    # def __init__(self):
    #     self.filename = open("jandan.json", "wb")
    #     self.num = 0
    #
    # def process_item(self, item, spider):
    #     text = json.dumps(dict(item), ensure_ascii=False) + "\n"
    #     self.filename.write(text.encode("utf-8"))
    #     self.num += 1
    #     return item
    #
    # def close_spider(self, spider):
    #     self.filename.close()
    #     print("Saved " + str(self.num) + " items in total")

    # download the images to local disk
    def process_item(self, item, spider):
        if 'url' in item:
            dir_path = settings["IMAGES_STORE"]
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            su = "." + item["url"].split(".")[-1]
            path = item["name"] + su
            new_path = '%s/%s' % (dir_path, path)
            if not os.path.exists(new_path):
                with open(new_path, 'wb') as handle:
                    response = requests.get(item["url"], stream=True)
                    for block in response.iter_content(1024):
                        if not block:
                            break
                        handle.write(block)
        return item
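A note on from scrapy.conf import settings: it worked in the Scrapy versions this post was written against, but it has since been removed. In current Scrapy the idiomatic way is to receive settings through from_crawler; a minimal sketch:

class JandanPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler; stash the setting on the instance
        pipeline = cls()
        pipeline.dir_path = crawler.settings.get("IMAGES_STORE", "images")
        return pipeline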
scrapy crawl dj  # start the spider
scrapy shell "https://hr.tencent.com/position.php?&start=0"  # send a request and drop into an interactive shell
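Inside the shell a response object has already been fetched, so you can experiment with selectors before writing them into the spider; the XPath below is only an illustration:

>>> response.status                              # HTTP status of the fetched page
>>> response.xpath('//title/text()').extract_first()
>>> view(response)                               # open the downloaded page in a browser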
Here is my GitHub address; I update the projects regularly:
https://github.com/bjptw/workspace
Reposted from: https://www.cnblogs.com/bjp9528/p/9318013.html