During the summer break I put the bit of Python data-collection knowledge I picked up last semester to the test and wrote a scraper for Douban Books. Here is a summary.

What I set out to do:

1. Log in.
2. Fetch the Douban Books tag (category) directory.
3. Enter each category, scrape the first page of books for the title, author, translator, publication date and other details, store them in MySQL, and download the cover images.
 
Step 1
 
First, even a scraper should play by the rules, so let's look at Douban's robots.txt:
 
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
Now the pages I want to scrape:
 
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
https://book.douban.com/tag/?icn=index-nav
https://book.douban.com/tag/[tag name]
https://book.douban.com/subject/[book ID]/
Good: none of these paths are disallowed by the robots.txt, so I can write the code with a clear conscience.
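As an extra, programmatic sanity check (not part of the original write-up), the standard library's urllib.robotparser can answer the same question; note that robots.txt rules apply per host, so the book pages are checked against book.douban.com's own file:

import urllib.robotparser

# Minimal sketch: load the robots.txt of the host we plan to crawl and ask
# whether a generic user agent ("*") may fetch the pages we care about.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()

for url in ("https://book.douban.com/tag/?view=type&icn=index-sorttags-all",
            "https://book.douban.com/tag/?icn=index-nav"):
    print(url, rp.can_fetch("*", url))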
 
Step 2
 
Since I am writing this anyway, I might as well do it properly, so the first thing is to log in to Douban.
I use a cookie-based login: I first capture the headers and cookies of a normal login session with HttpFox, a handy Firefox extension.
 
 
import requests

def login(url):
    # Read the saved cookie string and turn it into a dict for requests
    cookies = {}
    with open("C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") as file:
        raw_cookies = file.read()
    for line in raw_cookies.split(';'):
        key, value = line.strip().split('=', 1)
        cookies[key] = value
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/60.0.3112.78 Safari/537.36'}
    s = requests.get(url, cookies=cookies, headers=headers)
    return s
Here I paste the headers straight into the program and read the cookies from a file. Note that the raw cookie string has to be turned into a dictionary before being passed to requests.get(), which returns the page response.
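For illustration only, here is what a saved cookie string might look like and how the loop above turns it into a dictionary (the names and values are made up):

raw_cookies = 'bid=abc123; ll="108288"; dbcl2="12345:XYZ"'   # hypothetical cookie string
cookies = {}
for line in raw_cookies.split(';'):
    key, value = line.strip().split('=', 1)
    cookies[key] = value
print(cookies)
# {'bid': 'abc123', 'll': '"108288"', 'dbcl2': '"12345:XYZ"'}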
 
Step 3
 
First open the Douban Books tag directory:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
and scrape the category links from that page:
 
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/tag/?icn=index-nav"
web = requests.get(url)
soup = BeautifulSoup(web.text, "lxml")
# Every tag name sits in a table cell of the tag listing
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
urls = []
for tag in tags:
    tag = tag.get_text()
    helf = "https://book.douban.com/tag/"
    url = helf + str(tag)
    urls.append(url)
# Save the category links, one per line, for the crawler to read later
with open("channel.txt", "w") as file:
    for link in urls:
        file.write(link + '\n')
 
The code above uses a CSS selector. You don't need to know CSS: open the page in your browser, open the developer tools, right-click the element you want in the Elements panel, and choose Copy -> Copy selector. (I use Chrome; right-clicking an element on the rendered page and choosing Inspect jumps straight to the corresponding node.) Paste the copied selector into select(). One caveat: any :nth-child(*) fragments in the copied selector must be removed, otherwise the call will raise an error; a small helper for that clean-up is sketched below.
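A minimal sketch of that clean-up step (the copied selector below is a hypothetical example of Chrome's output, not taken from the post):

import re

def clean_selector(raw_selector):
    # Drop ':nth-child(...)' fragments so the selector matches every row,
    # not only the element that was right-clicked
    return re.sub(r':nth-child\(\d+\)', '', raw_selector)

copied = "#content > div > div.article > div:nth-child(2) > div > table > tbody > tr:nth-child(1) > td > a"
print(clean_selector(copied))
# -> #content > div > div.article > div > div > table > tbody > tr > td > a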
  
That gives us the file of category links:
 
 
Step 4
 
Next, let's work out how to scrape each book page.
Using the CSS-selector approach described above, we can pull out the title, author, translator, number of ratings, rating, plus the cover image URL and the synopsis:
 
title = bookSoup.select('#wrapper > h1 > span')[0].contents[0]
title = deal_title(title)
author = get_author(bookSoup.select("#info > a")[0].contents[0])
translator = bookSoup.select("#info > span > a")[0].contents[0]
person = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[0].contents[0]
scor = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > strong")[0].contents[0]
coverUrl = bookSoup.select("#mainpic > a > img")[0].attrs['src']
brief = get_brief(bookSoup.select('#link-report > div > div > p'))
A few things to note:

- File names may not contain the characters :?<>"|\/* so the title is cleaned with a regular expression (see the quick check after the code):

import re

def deal_title(raw_title):
    # Replace characters that are illegal in Windows file names with '~'
    # (the backslash has to be escaped inside the character class)
    r = re.compile(r'[/\\*?"<>|:]')
    return r.sub('~', raw_title)
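A quick check of the cleaning function on a hypothetical title:

print(deal_title('Python: Cookbook/中文版'))
# -> Python~ Cookbook~中文版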
Then download the cover:

from urllib.request import urlretrieve

path = "C:/Users/lenovo/OneDrive/projects/Scraping/covers/" + title + ".png"
urlretrieve(coverUrl, path)
 
Two helper functions used above tidy up the author field and assemble the synopsis:

def get_author(raw_author):
    # The author field can span several lines; strip each piece and join them
    parts = raw_author.split('\n')
    return ''.join(map(str.strip, parts))

def get_brief(line_tags):
    # Concatenate the text of every <p> tag in the synopsis block
    brief = line_tags[0].contents
    for tag in line_tags[1:]:
        brief += tag.contents
    brief = '\n'.join(brief)
    return brief
The publisher, publication date, ISBN, and list price can be obtained more concisely as follows:
 
info = bookSoup.select('#info')
infos = list(info[0].strings)
# In the #info block each value immediately follows its label string
publish = infos[infos.index('出版社:') + 1]
ISBN = infos[infos.index('ISBN:') + 1]
Ptime = infos[infos.index('出版年:') + 1]
price = infos[infos.index('定價:') + 1]
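To see why the label-plus-one indexing works, here is a made-up sketch of what infos looks like (real pages contain extra whitespace strings, so treat it as illustrative only):

infos = ['出版社:', ' Some Press', '出版年:', ' 2016-1', '定價:', ' 39.00元', 'ISBN:', ' 9787000000000']
publish = infos[infos.index('出版社:') + 1]   # ' Some Press'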
Step 5
 
First create the database and the table:
 
CREATE TABLE `allbooks` (
  `title` char(255) NOT NULL,
  `scor` char(255) DEFAULT NULL,
  `author` char(255) DEFAULT NULL,
  `price` char(255) DEFAULT NULL,
  `time` char(255) DEFAULT NULL,
  `publish` char(255) DEFAULT NULL,
  `person` char(255) DEFAULT NULL,
  `yizhe` char(255) DEFAULT NULL,
  `tag` char(255) DEFAULT NULL,
  `brief` mediumtext,
  `ISBN` char(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
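The post shows only the table definition; assuming the database itself is the DOUBAN_DB used in the code below, a one-time setup along these lines would create it (credentials are placeholders):

import pymysql

# Hypothetical one-time setup for the DOUBAN_DB database
conn = pymysql.connect(host='your host', user='your user', password='your password', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS DOUBAN_DB DEFAULT CHARACTER SET utf8")
conn.close()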
Then the rows can be inserted conveniently with cursor.executemany():
 
import pymysql

connection = pymysql.connect(host='your host', user='your user', password='your password', charset='utf8')
with connection.cursor() as cursor:
    sql = "USE DOUBAN_DB;"
    cursor.execute(sql)
    sql = '''INSERT INTO allbooks (title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
             VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
    cursor.executemany(sql, data)
    connection.commit()
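For reference, data is a list of rows, one per book, with the fields in the same order as the column list in the INSERT statement; a made-up example:

data = [
    ['Book A', '8.9', 'Author A', '39.00元', '2016-1', 'Press A', '1200', '', 'programming', 'Synopsis...', '9787000000000'],
    ['Book B', '9.1', 'Author B', '59.00元', '2015-6', 'Press B', '860', 'Translator B', 'programming', 'Synopsis...', '9787000000001'],
]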
Step 6
 
At this point we have all the building blocks; what's left is to assemble the complete program.
One more thing to watch: add a random delay between requests so the IP doesn't get banned.
The full code follows and is also kept up to date on GitHub (my GitHub link); stars are welcome.
 
"""
Created on Sat Aug 12 13:29:17 2017@author: Throne
"""
import requests                       
from bs4 
import BeautifulSoup         
import time          
import pymysql       
import random        
from urllib.request 
import urlretrieve     
import re             connection = pymysql.connect(host=
'localhost',user=
'root',password=
'',charset=
'utf8')
with connection.cursor() 
as cursor:sql = 
"USE DOUBAN_DB;"cursor.execute(sql)
connection.commit()
def deal_title(raw_title):r = re.compile(
'[/\*?"<>|:]')
return r.sub(
'~',raw_title)
def get_brief(line_tags):brief = line_tags[
0].contents
for tag 
in line_tags[
1:]:brief += tag.contentsbrief = 
'\n'.join(brief)
return brief
def get_author(raw_author):parts = raw_author.split(
'\n')
return ''.join(map(str.strip,parts))
def login(url):cookies = {}
with open(
"C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") 
as file:raw_cookies = file.read();
for line 
in raw_cookies.split(
';'):key,value = line.split(
'=',
1)cookies[key] = valueheaders = {
'User-Agent':
'''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'''}s = requests.get(url, cookies=cookies, headers=headers)
return s   
def crawl():channel = []
with open(
'C:/Users/lenovo/OneDrive/projects/Scraping/channel.txt') 
as file:channel = file.readlines()
for url 
in channel:data = [] web_data = login(url.strip())soup = BeautifulSoup(web_data.text.encode(
'utf-8'),
'lxml')tag = url.split(
"?")[
0].split(
"/")[-
1]books = soup.select(
'''#subject_list > ul > li > div.info > h2 > a''')
for book 
in books:bookurl = book.attrs[
'href']book_data = login(bookurl)bookSoup = BeautifulSoup(book_data.text.encode(
'utf-8'),
'lxml')info = bookSoup.select(
'#info')infos = list(info[
0].strings)
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') + 
1]translator = bookSoup.select(
"#info > span > a")[
0].contents[
0]author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') + 
1]Ptime = infos[infos.index(
'出版年:') + 
1]price = infos[infos.index(
'定價:') + 
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except :
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') + 
1]translator = 
""author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') + 
1]Ptime = infos[infos.index(
'出版年:') + 
1]price = infos[infos.index(
'定價:') + 
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except:
continuefinally:path = 
"C:/Users/lenovo/OneDrive/projects/Scraping/covers/"+title+
".png"urlretrieve(coverUrl,path);data.append([title,scor,author,price,Ptime,publish,person,translator,tag,brief,ISBN])            
with connection.cursor() 
as cursor:sql = 
'''INSERT INTO allbooks (
title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''cursor.executemany(sql, data)connection.commit()
del datatime.sleep(random.randint(
0,
9)) start = time.clock()
crawl()
end = time.clock()
with connection.cursor() 
as cursor:print(
"Time Usage:", end -start)count = cursor.execute(
'SELECT * FROM allbooks')print(
"Total of books:", count)
if connection.open:connection.close() 
Results
 
 
 
 
This article is original work; please contact the author before reposting.
 
 
Reference blog: http://www.jianshu.com/p/6c060433facf?appinstall=0