A Beginner Scrapes Phone Specs and Prices from ZOL (Zhongguancun Online) Detail Pages
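Background:
- The goal is to scrape the price range and specs of every phone model from its detail page.
- I had no scraping experience, and tutorials copied from the web kept throwing errors. So I decided to look into the scraping tools myself, follow the official documentation, work out the patterns in the page source, and design an approach that fit my needs.
- The options I found were scrapy and bs4 (BeautifulSoup, strictly speaking an HTML parsing library rather than a crawling framework). bs4 felt easier to pick up and was enough for this job, so I chose it.
- If you are interested, have a look at scrapy as well; it seems to be very strong at recursive crawling.
- Enough talk, let's start scraping.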
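The approach is simple and has three main steps:
- Step 1: scrape the product ids from all list pages. (Looking at the site, the first 104 pages contain every valid product id and the pages after 104 are empty, so a for loop over pages 1 to 104 collects them all.)
- Step 2: substitute each product id into the detail-page URL to get all detail-page links.
- Step 3: loop over those links in the same way, fetch each detail page's source, parse out the required fields, and save them to a CSV file.
Why this works: comparing list pages with detail pages shows that the number in a list-page link such as /cell_phone/index375437.shtml is the same number that appears in the detail-page URL. That number can therefore be pulled out as the product id and placed into the detail URL, which gives the detail link http://detail.zol.com.cn/cell_phone/index375437.shtml .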
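Steps 1 and 2 come down to a small string transformation that can be tried without touching the network. The following is only a sketch, written for Python 3 (the full listing in this post is Python 2). The names `extract_product_id`, `detail_url` and `detail_urls` are made up for this example, and the regular expression assumes that hrefs look like the `/cell_phone/index375437.shtml` sample.

```python
import re

# Pattern assumed from the sample href /cell_phone/index375437.shtml
ID_PATTERN = re.compile(r"/cell_phone/index(\d+)\.shtml")


def extract_product_id(href):
    """Return the numeric product id found in a list-page href, or None."""
    if not href:
        return None
    match = ID_PATTERN.search(href)
    return match.group(1) if match else None


def detail_url(product_id):
    """Build the detail-page URL for one product id."""
    return "http://detail.zol.com.cn/cell_phone/index" + str(product_id) + ".shtml"


def detail_urls(hrefs):
    """Keep each id once (first-seen order) and map the ids to detail URLs."""
    seen = []
    for href in hrefs:
        pid = extract_product_id(href)
        if pid is not None and pid not in seen:
            seen.append(pid)
    return [detail_url(pid) for pid in seen]


if __name__ == "__main__":
    sample = ["/cell_phone/index375437.shtml", None, "/about.html",
              "/cell_phone/index375437.shtml"]
    print(detail_urls(sample))
```

Missing hrefs and links that are not detail links are skipped, which mirrors the drop-NA, filter and de-duplicate sequence that the full script performs with pandas.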
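Note: there was of course some debugging and parsing work along the way, which took real effort, but all of it has been resolved.
The detail data collected here has 16 fields (title, brand name (for example ZTE), Chinese name, alias (including the English name), release date, screen size, manufacturer's guide price, price range, RAM, storage, core count, main screen, front camera, rear camera, battery capacity, battery type) plus 1 more (the detail-page URL, for cross-checking).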
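Since some pages lack some of these fields, it helps to decide up front how a missing value is written out. The sketch below is an illustration only, in Python 3 with the standard `csv` module instead of the pandas call used in the full script. `FIELDS` is a hypothetical four-column subset and `rows_to_csv` is a name invented for this example.

```python
import csv
import io

# Hypothetical subset of the columns, chosen for this example only
FIELDS = ["url", "title", "showdate", "price_range"]


def rows_to_csv(rows, fieldnames=FIELDS):
    """Serialise dict rows to CSV text; fields a row lacks become empty strings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    # The second row has no release date or price range
    print(rows_to_csv([
        {"url": "u1", "title": "t1", "showdate": "d1", "price_range": "1259-1329"},
        {"url": "u2", "title": "t2"},
    ]))
```

`restval=""` plays the same role as the `except: field = ''` fallbacks in the script: a field that could not be parsed ends up as an empty cell rather than aborting the row.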
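The scraping code follows. To make testing easier, test code that scrapes a single page is included at the end.
Reading the code alongside the page source makes the parsing easier to follow. References: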
https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
http://www.jb51.net/article/99453.htm
Scraping code
```python
# -*- coding: utf-8 -*-
# Authoritative version: scrape specs, prices and other details from ZOL phone detail pages.
# Python 2 script (urllib2, unicode). Selectors follow the page layout at the time of writing.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import os
import re
import urllib2

import pandas as pd
from bs4 import BeautifulSoup

os.chdir('/Users/wyy/Downloads/')  # the author's working directory; change as needed
print(os.getcwd())

# Browser-like User-Agent so the requests are less likely to be blocked as a crawler
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                         'AppleWebKit/601.7.7 (KHTML, like Gecko) '
                         'Version/9.1.2 Safari/601.7.7'}


def fetch_soup(url):
    """Download url and parse it with the lxml parser."""
    request = urllib2.Request(url=url, headers=HEADERS)
    return BeautifulSoup(urllib2.urlopen(request).read(), "lxml")


def safe(func, *args):
    """Call func(*args); return '' if the page lacks the field."""
    try:
        return func(*args)
    except Exception:
        return ''


# ---------- Step 1: crawl the list pages and collect every href ----------
all_pkg = []
for i in range(1, 105):  # pages 1-104 hold every valid product id
    url = ('http://detail.zol.com.cn/cell_phone_index/'
           'subcate57_0_list_1_0_9_2_0_' + str(i) + '.html')
    for a in fetch_soup(url).find_all('a'):
        all_pkg.append([a.get('href')])

all_pkg1 = pd.DataFrame(all_pkg, columns=['url1'])
# axis=0 works on rows; how='any' drops a row if any value is NA ('all' would
# require every value to be NA); the original frame is left unchanged
t1 = all_pkg1.dropna(axis=0, how='any')
t2 = t1.loc[t1['url1'].str.contains('/cell_phone/index')]  # keep detail links only
t3 = t2.drop_duplicates().copy()  # 4913 rows (product ids) after de-duplication
# t3.to_csv('phone_url.csv', encoding='utf-8')
# Tips: [x for x in li if x != ''] drops empty strings from a list;
#       ' rtrt3434'.strip() removes surrounding whitespace.


# ---------- Step 2: turn each href into a product id ----------
def map1(row):
    """Pull the numeric id out of an href such as /cell_phone/index375437.shtml."""
    return re.findall(r'index(\d+)\.shtml', str(row.url1))[0]

t3['url2'] = t3.apply(map1, axis=1)  # axis=1 applies the function row by row
t3.head()
#                               url1     url2
# 0  /cell_phone/index1164015.shtml  1164015
# 1   /cell_phone/index375437.shtml   375437
# 2  /cell_phone/index1164296.shtml  1164296
# 3  /cell_phone/index1175015.shtml  1175015
# 4  /cell_phone/index1158842.shtml  1158842
# t4 = pd.read_csv('phone_id.csv')


# ---------- Step 3: fetch every detail page and parse the fields ----------
def brand_name(soup):
    """Brand name (e.g. 中興) from the inline script line var manuName = '...'."""
    script = soup.find_all(text=re.compile("manuName"))[0]
    # split on ';', keep the manuName assignment, strip spaces, drop the quotes
    return [x.split('=')[1].strip() for x in script.split(';')
            if re.findall('manuName', x)][0][1:-1]


def heading(soup, tag, idx):
    """Text of the <h1>/<h2> that is the idx-th child of the page-title div."""
    div = soup.find_all("div", {"class": "page-title clearfix"})[0]
    return re.findall(r"<%s>(.+?)</%s>" % (tag, tag), unicode(div.contents[idx]))[0]


def guide_price_of(soup):
    """Manufacturer's guide price; tries both class variants, '' if neither exists."""
    for cls in ("price-type price-retain", "price-type price-"):
        try:
            tag = soup.find_all("b", {"class": cls})[0]
            return re.findall(r'\d+', str(tag.contents[0]))[-1]
        except Exception:
            pass
    return ''


def param(soup, cls, idx):
    """First text node of the idx-th <span> matched by the class string cls."""
    tag = soup.find_all("span", {"class": cls})[idx]
    return [x.split('>')[0].strip() for x in tag][0]


def em_text(soup, url, idx):
    """Fallback layout: <em> text of the idx-th tag whose href matches url, e.g.
    <a href="http://detail.zol.com.cn/cell_phone/index1144202.shtml" style="width: 228.15px"><em>6.44英寸</em></a>
    """
    return re.findall(r"<em>(.+?)</em>", unicode(soup.find_all(href=re.compile(url))[idx]))[0]


pag = []
for i in t3['url2']:
    url = 'http://detail.zol.com.cn/cell_phone/index' + str(i) + '.shtml'
    pag.append([url, fetch_soup(url)])
# pd.DataFrame(pag, columns=['url', 'pag']).to_csv('phone_zol_origin.csv', encoding='utf-8')

data = []
for url, soup in pag:
    extend = ''  # only the fallback layout provides this field
    Name = safe(brand_name, soup)                                   # brand, e.g. 中興
    showdate = safe(lambda: soup.find_all("span", {"class": "showdate"})[0].contents[0])  # release date
    price_range = safe(lambda: re.findall(r'\d+', str(
        soup.find_all("span", {"class": "merchant-price-range"})[0].contents[0]))[-2:])   # price range
    c_name = safe(heading, soup, 'h1', 1)       # full Chinese name
    other_name = safe(heading, soup, 'h2', 4)   # alias / English name
    title = safe(lambda: soup.title.contents[0])
    guide_price = guide_price_of(soup)
    try:
        # Primary layout; any missing field switches the page to the fallback rules
        ROM = soup.find_all("span", {"class": "price-status"})[0].contents[0][1:-1]  # storage
        screen_size = param(soup, "param-value low", 0)      # screen size
        RAM = param(soup, "param-value highest", 0)          # RAM
        core_num = param(soup, "param-value low", 1)         # core count (not stored)
        main_screen = param(soup, "param-value middle", 0)   # screen resolution
        camera_back = param(soup, "param-value middle", 1)   # rear camera
        e_capacity = param(soup, "param-value middle", 2)    # battery capacity
        camera_front = param(soup, "param-value", 3)         # front camera
        battery_type = param(soup, "param-value", 5)         # battery type (not stored)
    except Exception:
        # Some pages use a different layout; there battery_type is replaced by extend
        screen_size = safe(em_text, soup, url, 1)    # e.g. 6.44英寸
        main_screen = safe(em_text, soup, url, 3)    # e.g. 342ppi
        e_capacity = safe(em_text, soup, url, 5)     # e.g. 4850mAh
        camera_front = safe(em_text, soup, url, 7)   # e.g. 1600萬
        camera_back = safe(em_text, soup, url, 9)    # e.g. 500萬
        ROM = safe(em_text, soup, url, 12)           # e.g. 64GB
        extend = safe(em_text, soup, url, 13)        # new field: expandable storage, e.g. 可擴展
        RAM = safe(em_text, soup, url, 15)
    data.append([url, Name, title, c_name, other_name, showdate, screen_size,
                 guide_price, price_range, ROM, RAM, main_screen, camera_front,
                 camera_back, e_capacity, extend])  # core_num, battery_type omitted

clean = pd.DataFrame(data, columns=[
    'url', 'Name', 'title', 'c_name', 'other_name', 'showdate', 'screen_size',
    'guide_price', 'price_range', 'ROM', 'RAM', 'main_screen', 'camera_front',
    'camera_back', 'e_capacity', 'extend'])  # 'core_num', 'battery_type' omitted
# clean.to_csv('2017.7.30phone_zol.csv', encoding='gbk')
```
Test code
```python
# -*- coding: utf-8 -*-
# Python 2 test script: fetch one detail page and try every parsing rule on it.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import os
import re
import urllib2

import pandas as pd
from bs4 import BeautifulSoup

os.chdir('/Users/wyy/Downloads/')  # the author's working directory; change as needed
print(os.getcwd())

# -------------- test on a single page --------------
url = 'http://detail.zol.com.cn/cell_phone/index1174169.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                         'AppleWebKit/601.7.7 (KHTML, like Gecko) '
                         'Version/9.1.2 Safari/601.7.7'}
request = urllib2.Request(url=url, headers=headers)
soup = BeautifulSoup(urllib2.urlopen(request).read(), "lxml")

# Wanted: title + brand + model + alias + prices + release date + url
#         + specs (screen, size, cameras, memory)
soup.find_all("span", {"class": "showdate"})  # release date
# <span class="showdate">\u4e0a\u5e02\u65f6\u95f4\uff1a2017\u5e7406\u670822\u65e5</span>
# print u'\u4e0a\u5e02\u65f6\u95f4\uff1a2017\u5e7406\u670822\u65e5'
# 上市時間:2017年06月22日
soup.find_all("span", {"class": "merchant-price-range"})  # price range
# <span class="merchant-price-range"><a href="/1175/1174169/price.shtml">¥1259<i> 至 </i>1329</a></span>
soup.find_all("div", {"class": "page-title clearfix"})  # Chinese model name + alias
# <div class="page-title clearfix">
# <h1>中興小鮮5(4GB RAM/全網通)</h1><span class="num"><a href="/series/57/22380_1.html">( 系列共2款 )</a></span> <h2>別名:ZTE V0840,中興 V0840</h2><div class="subtitle">2.5D弧面玻璃,后置雙攝,指紋解鎖,23種語言實時翻譯</div>
# <!-- 當排行是 1-10 的情況,會給a加一個class lt10 -->
# </div>
soup.find_all("b", {"class": "price-type price-retain"})  # manufacturer's guide price
# <b chart-data='[["7.22","7.23","7.24","7.25","7.28"],[1399,1399,1399,1399,1399],1399,1399]' class="price-type price-retain">1399<i class="icon"></i></b>
soup.find_all("span", {"class": "price-status"})  # ROM
# <span class="price-status">[\u5317\u4eac 32GB\u5382\u5546\u6307\u5bfc\u4ef7]</span>
# print u'[\u5317\u4eac 32GB\u5382\u5546\u6307\u5bfc\u4ef7]'
# [北京 32GB廠商指導價]
soup.title  # page title
# <title>【中興小鮮5 4GB RAM/全網通】報價_參數_圖片_論壇_ZTE ZTE V0840,中興 V0840中興手機報價-ZOL中關村在線</title>
soup.find_all(text=re.compile("manuName"))  # manuName = '中興'; manuId = '642'
# var pageType = 'Detail';
# var subPageType = 'Detail';
# var subcateId = '57';
# var manuId = '642';
# var proId = '1174169';
# var seriesId = '22380';
# var subcateName = '手機';
# var manuName = '中興';
# var pv_subcatid = subcateId;
# var requestTuanFlag = '0';
# var ewImg = 'http://qr.fd.zol-img.com.cn/qrcode/qrcodegen.php?sizeNum=2&logotype=pure&url=http%3A%2F%2Fwap.zol.com.cn%2F1175%2F1174169%2Findex.html%3Ffrom%3Dqrcode&token=66fde63e76';
# var tplType = '';
# var dataFrom = '0';
soup.find_all("span", {"class": "param-value middle"})
# labels in order: 主屏分辨率 (resolution), 后置攝像頭 (rear camera), 電池容量 (battery capacity)
soup.find_all("span", {"class": "param-value low"})
# labels in order: 主屏尺寸 (screen size), 核心數 (core count)
soup.find_all("span", {"class": "param-value highest"})
# label: 內存 (RAM)
soup.find_all("span", {"class": "param-value"})
# labels in order: 主屏尺寸, 主屏分辨率, 后置攝像頭, 前置攝像頭 (front camera),
#                  電池容量, 電池類型 (battery type), 核心數, 內存

# Caveats: discontinued models have no reference price; unreleased models have no
# price range; sometimes there is no alias and the English name sits inside the
# Chinese name; sometimes there is no manufacturer's guide price (only a reference
# price, e.g. for new or discontinued models).


def safe(func, *args):
    """Call func(*args); return '' if the page lacks the field."""
    try:
        return func(*args)
    except Exception:
        return ''


def brand_name(soup):
    """Brand name from the inline script line var manuName = '...'."""
    script = soup.find_all(text=re.compile("manuName"))[0]
    # split on ';', keep the manuName assignment, strip spaces, drop the quotes
    return [x.split('=')[1].strip() for x in script.split(';')
            if re.findall('manuName', x)][0][1:-1]


def heading(soup, tag, idx):
    """Text of the <h1>/<h2> that is the idx-th child of the page-title div."""
    div = soup.find_all("div", {"class": "page-title clearfix"})[0]
    return re.findall(r"<%s>(.+?)</%s>" % (tag, tag), unicode(div.contents[idx]))[0]


def guide_price_of(soup):
    """Manufacturer's guide price; tries both class variants, '' if neither exists."""
    for cls in ("price-type price-retain", "price-type price-"):
        try:
            tag = soup.find_all("b", {"class": cls})[0]
            return re.findall(r'\d+', str(tag.contents[0]))[-1]
        except Exception:
            pass
    return ''


def param(soup, cls, idx):
    """First text node of the idx-th <span> matched by the class string cls."""
    tag = soup.find_all("span", {"class": cls})[idx]
    return [x.split('>')[0].strip() for x in tag][0]


def em_text(soup, url, idx):
    """Fallback layout: <em> text of the idx-th tag whose href matches url, e.g.
    <a href="http://detail.zol.com.cn/cell_phone/index1144202.shtml" style="width: 228.15px"><em>6.44英寸</em></a>
    """
    return re.findall(r"<em>(.+?)</em>", unicode(soup.find_all(href=re.compile(url))[idx]))[0]


# Fields that only one of the two layouts provides start out empty
core_num = battery_type = extend = ''
Name = safe(brand_name, soup)
showdate = safe(lambda: soup.find_all("span", {"class": "showdate"})[0].contents[0])
price_range = safe(lambda: re.findall(r'\d+', str(
    soup.find_all("span", {"class": "merchant-price-range"})[0].contents[0]))[-2:])
c_name = safe(heading, soup, 'h1', 1)       # full Chinese name
other_name = safe(heading, soup, 'h2', 4)   # alias / English name
title = soup.title.contents[0]
guide_price = guide_price_of(soup)
try:
    # Primary layout; any missing field switches the page to the fallback rules
    ROM = soup.find_all("span", {"class": "price-status"})[0].contents[0][1:-1]
    screen_size = param(soup, "param-value low", 0)
    RAM = param(soup, "param-value highest", 0)
    core_num = param(soup, "param-value low", 1)
    main_screen = param(soup, "param-value middle", 0)
    camera_back = param(soup, "param-value middle", 1)
    e_capacity = param(soup, "param-value middle", 2)
    camera_front = param(soup, "param-value", 3)
    battery_type = param(soup, "param-value", 5)
except Exception:
    # Some pages use a different layout; there battery_type is replaced by extend
    screen_size = safe(em_text, soup, url, 1)    # e.g. 6.44英寸
    main_screen = safe(em_text, soup, url, 3)    # e.g. 342ppi
    e_capacity = safe(em_text, soup, url, 5)     # e.g. 4850mAh
    camera_front = safe(em_text, soup, url, 7)   # e.g. 1600萬
    camera_back = safe(em_text, soup, url, 9)    # e.g. 500萬
    ROM = safe(em_text, soup, url, 12)           # e.g. 64GB
    extend = safe(em_text, soup, url, 13)        # expandable storage, e.g. 可擴展
    RAM = safe(em_text, soup, url, 15)           # e.g. 3GB

rows = [[url, Name, title, c_name, other_name, showdate, screen_size, guide_price,
         price_range, ROM, RAM, core_num, main_screen, camera_front, camera_back,
         e_capacity, battery_type, extend]]
clean = pd.DataFrame(rows, columns=[
    'url', 'Name', 'title', 'c_name', 'other_name', 'showdate', 'screen_size',
    'guide_price', 'price_range', 'ROM', 'RAM', 'core_num', 'main_screen',
    'camera_front', 'camera_back', 'e_capacity', 'battery_type', 'extend'])
# clean.to_csv('2017.7.30phone_zol.csv', encoding='gbk')
```
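Two of the denser one-liners can be checked offline against the markup recorded in the test code's comments. The following is a sketch in Python 3 using only `re`, not the script's own code path (which goes through BeautifulSoup). `brand_from_script` and `price_range_from_html` are names invented for this example, and the input strings are the sample snippets shown above.

```python
import re


def brand_from_script(script_text):
    """Return the value assigned to manuName in the inline JavaScript, or ''."""
    match = re.search(r"manuName\s*=\s*'([^']*)'", script_text)
    return match.group(1) if match else ""


def price_range_from_html(span_html):
    """Return the last two integers found in the price-range markup.

    The href in the markup also contains digits, which is why only the
    last two matches are kept. Assumption: the span shows two prices.
    """
    return re.findall(r"\d+", span_html)[-2:]


if __name__ == "__main__":
    script = "var subcateName = '手機'; var manuName = '中興'; var pv_subcatid = subcateId;"
    span = ('<span class="merchant-price-range">'
            '<a href="/1175/1174169/price.shtml">¥1259<i> 至 </i>1329</a></span>')
    print(brand_from_script(script))
    print(price_range_from_html(span))
```

If a page showed a single price instead of a range, the `[-2:]` slice would pick up a number from the href, so a stricter pattern would be needed for that case.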