Crawler Study Notes (10) - Scrapy Framework (5): Downloader Middleware, User-Agent/IP Proxy Pools, and the settings File
1. Downloader Middleware

- A downloader middleware is a framework of hooks into Scrapy's request/response processing.
- It is a lightweight, low-level system for globally modifying Scrapy's requests and responses.
- In the Scrapy framework, a downloader middleware is a class that implements a set of special methods.
1.1 Built-in Middleware
- Scrapy's built-in middlewares are declared in the DOWNLOADER_MIDDLEWARES_BASE setting; you can inspect it with the command `scrapy settings --get=DOWNLOADER_MIDDLEWARES_BASE`.
- User-defined middlewares are registered in DOWNLOADER_MIDDLEWARES. This setting is a dict: the keys are middleware class paths and the values are the middleware order, an integer from 0 to 1000; the smaller the number, the higher the priority, i.e. the closer the middleware sits to the engine. Example: "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100.
- The meaning of each built-in middleware can be seen in the output of `scrapy settings --get=DOWNLOADER_MIDDLEWARES_BASE`.
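For example (a sketch only: the custom class path and the order values below are illustrative, not taken from this post), registering a middleware in settings.py looks like this, and assigning None disables a middleware, including the built-in ones:

```python
# settings.py (illustrative sketch)
DOWNLOADER_MIDDLEWARES = {
    # the smaller the number, the closer the middleware runs to the engine
    'myproject.middlewares.MyCustomDownloaderMiddleware': 543,
    # assigning None disables a middleware, e.g. a built-in one from
    # DOWNLOADER_MIDDLEWARES_BASE
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```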
1.2 Downloader Middleware API
- process_request(request, spider): called for every request that passes through the middleware.
- process_response(request, response, spider): called for every response that passes through the middleware.
- process_exception(request, exception, spider): called when an exception is raised while processing a request.
- from_crawler(cls, crawler): class method Scrapy uses to create the middleware instance; it receives the crawler, giving access to settings and signals.

Note: pay close attention to each method's return value; what you return determines where the request or response goes next.
```python
from scrapy import signals


class BaiduDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: the request continues to the next middleware's process_request
        # - or return a Response object: the remaining process_request methods and the
        #   downloader are skipped; the response goes into the process_response chain
        # - or return a Request object: the new request is handed back to the engine
        #   to be scheduled
        # - or raise IgnoreRequest: the process_exception() methods of the installed
        #   downloader middlewares will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object: continues to the next middleware's process_response
        # - return a Request object: handed back to the engine to be scheduled
        # - or raise IgnoreRequest: the errback of the request is called
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue calling the other middlewares' process_exception
        # - return a Response object: stops the process_exception() chain
        # - return a Request object: stops the chain; the request goes back to the engine
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```

2. Custom Middleware
2.1 User-Agent Pool
2.1.1 Add a user-agent list to the settings file
```python
# For reference only; add your own entries as needed
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
    "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
    "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
    "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
    "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
```

2.1.2 Set the User-Agent in the middlewares file
```python
import random

from .settings import user_agent_list


class User_AgentDownloaderMiddleware(object):  # the class name is up to you
    def process_request(self, request, spider):
        # pick a random User-Agent for every request
        request.headers["User-Agent"] = random.choice(user_agent_list)
        return None  # pass the request on to the next middleware
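```

A hedged variation (not from the original post): instead of importing user_agent_list directly from settings.py, the middleware can read it through from_crawler, which also picks up per-spider setting overrides:

```python
import random


class RandomUserAgentMiddleware(object):  # hypothetical name, shown for illustration
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # getlist() returns an empty list if the setting is not defined
        return cls(crawler.settings.getlist("user_agent_list"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```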
2.2 IP Proxy Pool

2.2.1 IP proxy pool in the settings file
```python
# Examples only: the proxies listed here are almost certainly no longer usable.
# Free public proxies are also not recommended; they are unreliable and unsafe.
IPPOOL = [
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "120.204.85.29:3128"},
    {"ipaddr": "219.228.126.86:8123"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "218.82.33.225:53853"},
    {"ipaddr": "223.167.190.17:42789"},
]
```

2.2.2 Set the proxy in the middlewares file
```python
import random

from .settings import IPPOOL


class MyproxyDownloaderMiddleware(object):  # the class name is up to you
    # Goal: rotate over several proxies by setting them through request.meta
    def process_request(self, request, spider):
        proxyip = random.choice(IPPOOL)
        # e.g. http://61.129.70.131:8080
        request.meta["proxy"] = "http://" + proxyip["ipaddr"]
        return None  # pass the request on to the next middleware
```

2.3 Mini case: crawling Douban movie information
For the full spider implementation, see Crawler Study Notes (6), Scrapy Framework (1); here we only add the user-agent and IP proxy code.
The middlewares file

Rebuild the two classes yourself (a sketch follows below):
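The code for this step was shown as an image in the original post and did not survive here. As a minimal sketch, assuming the user_agent_list and IPPOOL defined above live in settings.py, the middlewares file can simply reuse the two classes from sections 2.1.2 and 2.2.2:

```python
# middlewares.py (sketch only; the names mirror the classes defined earlier in this post)
import random

from .settings import user_agent_list, IPPOOL


class User_AgentDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(user_agent_list)
        return None


class MyproxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        proxyip = random.choice(IPPOOL)
        request.meta["proxy"] = "http://" + proxyip["ipaddr"]
        return None
```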
The settings file

① Add the user-agent pool and IP proxy pool shown above.
② Register the downloader middlewares (see the sketch below).
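The original settings screenshot is not reproduced here. As a hedged sketch (the project/module name and the order values are assumptions), enabling both custom middlewares could look like this:

```python
# settings.py (sketch; 'db250' is an assumed project name, orders are illustrative)
DOWNLOADER_MIDDLEWARES = {
    'db250.middlewares.User_AgentDownloaderMiddleware': 543,
    'db250.middlewares.MyproxyDownloaderMiddleware': 544,
}
```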
3. The Scrapy settings file
```python
BOT_NAME = 'baidu'                    # project name ('baidu' here)
SPIDER_MODULES = ['baidu.spiders']    # spider modules
NEWSPIDER_MODULE = 'baidu.spiders'    # module where the genspider command creates new spiders
```

3.1 Basic configuration
```python
# 1. Project name: the default USER_AGENT is built from it, and it is also used as the log name
#BOT_NAME = 'db250'

# 2. Spider module paths
#SPIDER_MODULES = ['db250.spiders']
#NEWSPIDER_MODULE = 'db250.spiders'

# 3. Client User-Agent request header
#USER_AGENT = 'db250 (+http://www.yourdomain.com)'

# 4. Whether to obey robots.txt; usually we do not
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# 5. Whether cookies are enabled (handled via cookiejar); enabled by default
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# 6. Telnet console for inspecting and operating the running crawler:
#    connect with `telnet ip port`, then issue commands
#TELNETCONSOLE_ENABLED = False
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]

# 7. Default request headers Scrapy sends with HTTP requests
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# 8. Retrying failed requests: Scrapy ships with the
#    scrapy.downloadermiddlewares.retry.RetryMiddleware middleware. To tune it, configure:
#    RETRY_ENABLED: whether retrying is enabled
#    RETRY_TIMES: number of retries
#    RETRY_HTTP_CODES: HTTP codes that trigger a retry; default 500, 502, 503, 504, 408.
#    Other problems such as network connection timeouts are also retried automatically.
```
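For example (the values here are illustrative, not from the original post), the retry parameters could be tuned like this in settings.py:

```python
# Illustrative retry configuration; adjust the values to your needs
RETRY_ENABLED = True
RETRY_TIMES = 3                                 # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]    # responses with these codes are retried
```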
3.2 Concurrency and delay

```python
# 1. Maximum number of concurrent requests the downloader processes in total (default 16)
#CONCURRENT_REQUESTS = 32

# 2. Maximum number of concurrent requests per domain (default 8)
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

# 3. Maximum number of concurrent requests per IP (default 0 = unlimited). Two notes:
#    I.  if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored: concurrency is limited
#        per IP rather than per domain
#    II. it also affects DOWNLOAD_DELAY: if non-zero, the download delay applies per IP
#        rather than per domain
#CONCURRENT_REQUESTS_PER_IP = 16

# 4. If AutoThrottle is not enabled, this is a fixed delay (in seconds) between requests
#    to the same website
#DOWNLOAD_DELAY = 3
```

3.3 AutoThrottle (automatic speed limiting)
```python
# 1. Enable AutoThrottle (default False)
#AUTOTHROTTLE_ENABLED = True

# 2. Initial delay
#AUTOTHROTTLE_START_DELAY = 5

# 3. Minimum delay
#DOWNLOAD_DELAY = 3

# 4. Maximum delay
#AUTOTHROTTLE_MAX_DELAY = 10

# 5. Average number of concurrent requests to aim for; it should not be higher than
#    CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP. Raising it increases
#    throughput and the load on the target site; lowering it makes the crawler more
#    "polite". At any given moment the actual concurrency may be above or below this
#    value: it is a target the crawler tries to reach, not a hard limit.
AUTOTHROTTLE_TARGET_CONCURRENCY = 16.0

# 6. Debugging
#AUTOTHROTTLE_DEBUG = True
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
```

3.4 Crawl depth and crawl order
```python
# 1. Maximum crawl depth allowed; the current depth can be read from meta.
#    0 means no depth limit.
#DEPTH_LIMIT = 3

# 2. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO.
#    LIFO, depth-first:
#DEPTH_PRIORITY = 0
#SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
#SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
#    FIFO, breadth-first:
#DEPTH_PRIORITY = 1
#SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
#SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 3. Scheduler queue
#SCHEDULER = 'scrapy.core.scheduler.Scheduler'
#from scrapy.core.scheduler import Scheduler

# 4. URL deduplication
#DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
```
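As a small illustration (the callback shown is hypothetical), the current depth mentioned above can be read from the response meta inside a spider callback:

```python
# Inside a spider callback: Scrapy's DepthMiddleware stores the current depth
# under the 'depth' key of the request/response meta.
def parse(self, response):
    depth = response.meta.get('depth', 0)
    self.logger.info('Parsed %s at depth %d', response.url, depth)
```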
3.5 HTTP cache

```python
# 1. Whether HTTP caching is enabled
#HTTPCACHE_ENABLED = True

# 2. Cache policy: cache every request; later requests are served directly from the cache
#HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

# 3. Cache policy: RFC 2616 caching based on HTTP response headers such as
#    Cache-Control and Last-Modified
#HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# 4. Cache expiration time
#HTTPCACHE_EXPIRATION_SECS = 0

# 5. Cache directory
#HTTPCACHE_DIR = 'httpcache'

# 6. HTTP status codes that are never cached
#HTTPCACHE_IGNORE_HTTP_CODES = []

# 7. Cache storage backend
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

See the official documentation for details:
Downloader middleware: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
Settings: https://docs.scrapy.org/en/latest/topics/settings.html
Spider middleware: https://docs.scrapy.org/en/latest/topics/spider-middleware.html