Python Crawler Basics (2): Toolkits: the download packages requests and urllib, and the parsing packages BeautifulSoup (bs4) and lxml.etree.xpath

Published: 2025/4/5 python 28 豆豆

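1. HTML download packages

1.1 The urllib package

1.1.1 urllib error 1

1.2 The Requests package

1.2.1 requests error 1

2. HTML parsing packages

2.1 The BeautifulSoup (bs4) package

2.1.1 BeautifulSoup_object.find(): extracting a tag

2.1.2 BeautifulSoup_object.find_all(): extracting tags

2.1.3 BeautifulSoup.select(): extracting tags

2.1.4 BeautifulSoup_object: getting tag text and attribute values

2.1.5 BeautifulSoup_object: getting tags at the same level (sibling nodes)

2.1.6 BeautifulSoup_object: getting descendant and ancestor nodes

2.1.7 BeautifulSoup_object: deleting, inserting, and replacing nodes

2.1.8 bs4 error 1

2.2 The lxml.etree.HTML package

2.2.1 lxml.etree.xpath: extracting tags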

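1. HTML download packages

1.1 The urllib package

urllib.parse.quote(content): a URL may contain only a subset of ASCII characters. Other characters (such as Chinese characters) do not conform to the standard, so they have to be percent-encoded first.
urllib.request.Request: urlopen() on its own can issue the most basic HTTP request, but if you need to add headers or other information, build the request with the Request class.

Signature: urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers: the request headers, as a dict. They are used to impersonate a browser. The default User-Agent is Python-urllib; you can impersonate a browser such as Chrome instead:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

method: 'GET', 'POST', 'PUT'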

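    import urllib.parse
    import urllib.request

    content = 'Python'  # example search term (placeholder); use the entry you want to look up

    # Visit and download the HTML page
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(content)  # request URL
    # Request headers: impersonate a browser so the crawler is less likely to be blocked
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    # Build the request object from the URL and the headers
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)  # send the request and get the response
    text = response.read().decode('utf-8')  # read the response and get the text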
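The percent-encoding step described above can be checked offline. The following is a minimal sketch using only the standard library; the sample string '中文' is an assumption chosen for illustration and does not come from the original post.

```python
from urllib.parse import quote, unquote

content = '中文'  # hypothetical search term, for illustration only

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte
encoded = quote(content)
url = 'https://baike.baidu.com/item/' + encoded

# unquote() reverses the encoding
decoded = unquote(encoded)

print(encoded)
print(url)
print(decoded)
```

Because quote() works on the UTF-8 bytes, each Chinese character becomes three %XX groups.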

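The urllib and urllib2 modules offer similar functionality. Put simply, urllib2 is an enhanced urllib, although urllib has some functions that urllib2 lacks. For simple downloads urllib is more than enough. If you need HTTP authentication, cookies, or extensions that handle your own protocols, urllib2 is probably the better choice. In Python 2.x the two standard-library modules, urllib and urllib2, cannot replace each other. In Python 3.x, urllib2 was merged into urllib (as urllib.request and urllib.error), which is worth keeping in mind.
————————————————
Copyright notice: the paragraph above is an original article by the CSDN blogger "IoneFine" and is licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/jiduochou963/article/details/87564467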

1.1.1 urllib error 1

After building the URL with urllib.parse.quote(content), sending the request fails with: Failed to establish a new connection: [Errno 61] Connection refused

Cause: the server had not been started. Oops.
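A refused connection surfaces in urllib as urllib.error.URLError, so a crawler can catch it instead of crashing. The sketch below is one possible way to do that; the helper name fetch_or_none and the 3-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=3):
    """Return the response body as text, or None if the request fails.

    Hypothetical helper: the name and the timeout value are illustrative.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as exc:
        # A server that is not running ends up here ("Connection refused")
        print('request failed:', exc.reason)
        return None
```

Calling fetch_or_none() against an address where no server is listening prints the reason and returns None, which lets the caller decide whether to retry or skip the page.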

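1.2 The Requests package

Requests is an HTTP library written in Python, built on urllib3, and released as open source under the Apache2 License. It is more convenient than urllib, and it supports Python 3.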
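To give a feel for the Requests API, here is a small sketch that builds a GET request without sending it, so it can be inspected offline. The shortened User-Agent and the query parameter q are assumptions for illustration only, not something the original post uses.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # shortened, illustrative User-Agent

# Build the request without sending it, to inspect what would be transmitted
request = requests.Request(
    'GET',
    'https://baike.baidu.com/item/',
    params={'q': '中文'},  # hypothetical query parameter, for illustration only
    headers=headers,
)
prepared = request.prepare()

print(prepared.method)
print(prepared.url)  # the non-ASCII parameter is percent-encoded automatically

# To actually send it (requires network access):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()
#     text = response.text
```

In everyday use the shorter form requests.get(url, headers=headers, timeout=10) does the build-and-send steps in one call.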

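1.2.1 requests error 1

Calling json() on a requests response object raises:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Cause: when crawling some pages, the content is transmitted with Unicode character encoding, so the body may not be decoded the way json() expects. More generally, this message means the body does not begin with a valid JSON value, for example because it is empty or is an HTML page.

Fix:

For example, decode the response bytes yourself in the crawler:

    import requests

    # url: the address being crawled (defined elsewhere)
    reps = requests.get(url=url)
    text = reps.content.decode("utf-8")
    # or use this statement instead:
    # text = reps.content.decode("unicode_escape")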
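The decode-then-parse idea can be sketched with the standard library alone. The helper name parse_json_body is hypothetical, and using 'utf-8-sig' (which also strips a byte-order mark) is an assumption about what a misbehaving server might send.

```python
import json


def parse_json_body(raw):
    """Decode response bytes and parse them as JSON; return None on failure.

    Hypothetical helper for illustration.
    """
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte-order mark if present
    text = raw.decode('utf-8-sig').strip()
    if not text:
        # An empty body is what produces "Expecting value: line 1 column 1 (char 0)"
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The body is not JSON at all, e.g. an HTML error page
        return None


print(parse_json_body('{"name": "中文"}'.encode('utf-8')))
print(parse_json_body(b''))
print(parse_json_body(b'<html></html>'))
```

With requests, the bytes to pass in would be the response's content attribute.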

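2. HTML parsing packages

2.1 The BeautifulSoup (bs4) package

Official documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

BS4 is short for Beautiful Soup. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

It is a toolbox: it parses a document into a soup, automatically converting input documents to Unicode and output documents to UTF-8.

Tag object. A tag in the HTML. BeautifulSoup parses out the contents of a Tag, which you access as soup.<tag name>, for example soup.title.
BeautifulSoup object. The whole HTML document; it can be treated as a Tag object.
NavigableString object. The text inside a tag.
Comment object. A special NavigableString that holds an HTML comment.

A BeautifulSoup object can be created from a string, an online page, or an HTML file.

To convert a bs4.element.Tag to a string, cast it with str() (Python is a joy!).

To convert a str back into a bs4.element.Tag, parse the string again with BeautifulSoup:

    # br_soup = BeautifulSoup(str(br), 'lxml')

    # print(type(br_soup))

source: https://blog.csdn.net/xtingjie/article/details/73442317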

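    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Three possible sources of HTML: a string, an online page, or an HTML file.
    # Only one is needed at a time, so the other two are commented out.
    html = '<div>text1</div>'
    # html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    # html = open('c:\\aa.html')

    bsObj = BeautifulSoup(html, 'html.parser')  # 'html.parser' is the parser; 'lxml' also works
    # Calling BeautifulSoup(...) is similar to calling a constructor in C++
    e = bsObj.find('div')
    print(e.text)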
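The object kinds listed above, and the Tag-to-string round trip, can be seen in a few lines. This is a small sketch on a made-up HTML snippet (the markup is an assumption for illustration), using the built-in 'html.parser' so that no extra parser is required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<div id="box">text1<!-- note --><b>bold</b></div>'  # illustrative markup
soup = BeautifulSoup(html, 'html.parser')

div = soup.div  # a Tag, accessed as soup.<tag name>
kinds = [type(node).__name__ for node in div.contents]
print(kinds)  # the three direct children: text, comment, tag

# Tag -> str
div_text = str(div)

# str -> Tag: parse the string again, then pick the tag out of the new soup
div_again = BeautifulSoup(div_text, 'html.parser').div
print(type(div_text).__name__, type(div_again).__name__)
```

Note that re-parsing yields a new BeautifulSoup object; the Tag itself is obtained from it, here with .div.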

2.1.1 BeautifulSoup_object.find(): extracting a tag

find() returns only the first matching tag beneath the current tag, as a single Tag. If nothing matches, it returns None.

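2.1.2 BeautifulSoup_object.find_all(): extracting tags

find_all() returns every matching tag beneath the current tag, as a list of tags. For example:

    title = soup.find_all('div', class_='basicInfo_item name')

--> Note: only the class attribute takes the trailing underscore (class_), because class is a reserved word in Python.

find_all() supports nested queries: it can be called not only on the bs4 object but also on a Tag.

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))

find_all(name, attrs, recursive, text, **kwargs)

name parameter. Finds all tags named name; text strings are ignored automatically. E.g. find_all(name='title') or find_all('title').
Keyword arguments. Search on an attribute with the given name; the value can be a string, a regular expression, a list, or True. E.g. find_all(attrs={'id': 'link2'}) or find_all(id='link2'); find_all(href=re.compile('elsie')); combined with a tag name: find_all('div', class_='abcd'). Passing several keyword arguments filters on several attributes at once: find_all(href=re.compile('elsie'), id='link1').
attrs parameter. Takes a dict, for searching tags with special attributes that cannot be written as keyword arguments. E.g. find_all(attrs={'data-foo': 'value'}).
text parameter. Searches the string content of the document; accepts a string, a regular expression, a list, or True. E.g. find_all(text='Elsie'); find_all(text=['Tillie', 'Elsie', 'Lacie']); find_all(text=re.compile('link')). Newer bs4 releases call this parameter string; text still works as an alias.
Mixed with other parameters. find_all('a', text='Elsie')

    # Read the response and get the text
    text = response.read().decode('utf-8')
    # Parse the HTML page
    soup = BeautifulSoup(text, 'lxml')  # create the soup object from the HTML source
    intro_tag = soup.find_all('div', class_="lemma-summary")  # the encyclopedia summary block
    name_tag = soup.find_all('dt', class_="basicInfo-item name")  # all matching dt tags, as a list
    value_tag = soup.find_all('dd', class_="basicInfo-item value")  # all matching dd tags, as a list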
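The filters described above can be tried on a small document. The snippet below is a sketch on made-up markup modelled on the "three sisters" example from the Beautiful Soup documentation; the data-foo attribute on the third link is an assumption added for illustration.

```python
import re

from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3" data-foo="value">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                                  # first match only
missing = soup.find('table')                            # no match: None
by_class = soup.find_all('a', class_='sister')          # tag name + class
by_id = soup.find_all(id='link2')                       # keyword argument
by_regex = soup.find_all(href=re.compile('elsie'))      # regular expression
by_attrs = soup.find_all(attrs={'data-foo': 'value'})   # dict for special names
by_text = soup.find_all('a', string='Elsie')            # tag name + text

print(first['id'], missing, len(by_class))
print([tag['id'] for tag in by_id + by_regex + by_attrs + by_text])
```

data-foo has to go through attrs because data-foo=... is not a valid Python keyword argument.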

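2.1.3 BeautifulSoup.select(): extracting tags

select() takes a CSS selector and returns a list of tags.

By tag name. E.g. soup.select('title')
By class name. E.g. soup.select('.sister')
By id. E.g. soup.select('#link1')
By combination. The tag name, class name, and id are written the same way as above, with a space between the two parts (a descendant selector). E.g. soup.select('p #link1')
By child tag. soup.select('head>title'). Note: the original post states that child selection does not support attribute filters or combined selectors; that limitation applied to older bs4 versions, while releases that use the Soup Sieve backend (4.7.0 and later) accept them.
By attribute. An attribute condition can be added in square brackets. The attribute and the tag belong to the same node, so there must be no space between them, otherwise nothing will match.

E.g. soup.select('a[href="http://example.com/elsie"]'). Attribute selectors can also be used in combined selectors.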
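Each selector form listed above is shown below on a made-up page (the markup is an assumption for illustration). This is a sketch, not an exhaustive tour of CSS selectors.

```python
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('title')        # tag name
by_class = soup.select('.sister')    # class name
by_id = soup.select('#link1')        # id
descendant = soup.select('p #link1') # space: #link1 anywhere inside a <p>
child = soup.select('head > title')  # '>': direct child only
by_attr = soup.select('a[href="http://example.com/elsie"]')  # no space before '['
first_only = soup.select_one('.sister')  # like select(), but only the first match

print(by_tag[0].get_text())
print(len(by_class), by_id[0]['href'], first_only['id'])
```

select() always returns a list, even for a single match; select_one() returns the Tag itself or None.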

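2.1.4 BeautifulSoup_object: getting tag text and attribute values

    <a class="lemma-album layout-right nslog: 10000206" href="url"> hello, world<img class="picture" src="url"></img> </a>

tag.get_text() --> returns the text contained in the current tag, including the text of its child nodes: "hello, world". tag.string returns the text of the current node, but if the node has more than one child, .string is ambiguous and returns None.
tag.get('href'), tag.get('class') or tag['id'], tag.attrs['id'] --> return attribute values of this tag; they cannot reach the attributes of child tags. tag.get() returns None for a missing attribute, while tag['id'] raises KeyError. For the attributes of child tags, see bs4_object.select().

--> How do you get the text of a tag without the text of its child nodes?

The contents attribute returns the direct children of the current tag as a list. Index into the list to pick the tag node or text you want.

    # contents returns a list such as [潘建偉, '人物履歷']
    print(i.find(class_='title-text').contents[1])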
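The difference between get_text(), .string, and .contents is easiest to see side by side. The sketch below uses a simplified version of the <a> sample above; the trimmed class list and the self-closing <img/> are assumptions made to keep the example short.

```python
from bs4 import BeautifulSoup

html = ('<a class="lemma-album layout-right" href="url">'
        'hello, world<img class="picture" src="url"/></a>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.a

full_text = tag.get_text()   # text of the tag and of all its descendants
only_string = tag.string     # None here: <a> has two children (text and <img>)
own_text = tag.contents[0]   # first direct child, which is the text node

href = tag.get('href')
classes = tag['class']       # class is multi-valued, so this is a list
missing = tag.get('id')      # missing attribute: get() returns None
child_src = tag.img['src']   # go through the child tag to read its attributes

# Text directly inside the tag, excluding text that belongs to child tags
direct_text = ''.join(tag.find_all(string=True, recursive=False))

print(full_text, only_string, classes, child_src)
```

Filtering the direct children for strings, as in the last statement, is one common way to answer the "text without child text" question when the position in contents is not known in advance.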

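2.1.5 BeautifulSoup_object: getting tags at the same level (sibling nodes)

next_sibling and next_siblings: the next sibling of the current node, and a generator over all the siblings after it.
find_next_siblings() and find_next_sibling(): the former returns all matching siblings after the node, the latter returns the first one.
find_all_next() and find_next(): the former returns all matching nodes after the node, the latter returns the first one.
previous_sibling and previous_siblings: the previous sibling of the current node, and a generator over all the siblings before it.
find_previous_siblings() and find_previous_sibling(): the former returns all matching siblings before the node, the latter returns the first one before it.
find_all_previous() and find_previous(): the former returns all matching nodes before the node, the latter returns the first one.
BeautifulSoup(sibling_html, 'html.parser') parses sibling markup correctly, whereas 'lxml' may parse it abnormally.

    sibling_soup = BeautifulSoup(sibling_html, 'html.parser')
    br = sibling_soup.p
    # Walk through <p> and every sibling after it
    while br is not None:
        print(br)
        br = br.next_sibling

    for tag in soup.select('div .col-md-4'):
        if tag.get_text() == 'Total':
            result = tag.next_sibling.get_text()

--> Check whether each br among the returned siblings is actually a tag, because some sibling nodes are blank text.

    for br in i.next_siblings:  # all siblings after the "人物履歷" heading tag
        # print(br)
        if isinstance(br, bs4.element.Tag):  # is br a tag?
            attrs = ' '.join(br.get('class', []))  # class is a list; join it with spaces
            if attrs == 'para':
                br_text_list.append(br.get_text())
            elif re.match('para-title level', attrs):  # next heading: stop
                break
            else:
                continue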
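The "collect the paragraphs after a heading until the next heading" pattern above can be run on a small stand-alone document. The markup, class names, and variable names below are assumptions for illustration, not the Baidu Baike page structure.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''<div>
<h2 class="title">History</h2>
<p class="para">first</p>
<p class="para">second</p>
<h2 class="title">Other</h2>
<p class="para">third</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
heading = soup.find('h2')

raw_next = heading.next_sibling            # the newline text node, not the <p>
next_tag = heading.find_next_sibling('p')  # skips text nodes, first sibling <p>

paragraphs = []
for node in heading.next_siblings:
    if not isinstance(node, Tag):  # skip whitespace between tags
        continue
    if node.name == 'h2':          # reached the next heading: stop
        break
    paragraphs.append(node.get_text())

print(repr(raw_next), next_tag.get_text(), paragraphs)
```

The raw next_sibling being a newline string is exactly why the type check is needed before touching attributes such as class.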

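2.1.6 BeautifulSoup_object: getting descendant and ancestor nodes

children attribute: a generator over the direct children. descendants attribute: recurses through all children and yields every descendant node.
parent attribute: the parent of an element. parents attribute: all of its ancestors.
find_parents() and find_parent(): the former returns all matching ancestors, the latter returns the nearest matching one (the direct parent when no filter is given).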
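A short sketch of these navigation attributes on a made-up, nested snippet (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><div id="outer"><p><b>bold</b> text</p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

child_count = len(list(p.children))          # direct children: <b> and ' text'
descendant_count = len(list(p.descendants))  # <b>, 'bold', ' text'

parent_name = p.parent.name                      # the enclosing <div>
ancestor_names = [tag.name for tag in p.parents] # ends at the soup object itself

nearest_div = soup.b.find_parent('div')          # nearest matching ancestor
matching = [tag.name for tag in soup.b.find_parents(['p', 'div'])]

print(child_count, descendant_count, parent_name)
print(ancestor_names, nearest_div['id'], matching)
```

The last entry yielded by parents is the BeautifulSoup object, whose name is the special string '[document]'.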

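2.1.7 BeautifulSoup_object: deleting, inserting, and replacing nodes

Reference: the official Beautiful Soup documentation (Chinese), https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#clear

Tag.clear() removes the contents of the current tag:

    from bs4 import BeautifulSoup

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    tag = soup.a
    tag.clear()
    tag
    # <a href="http://example.com/"></a>

PageElement.extract() removes the current tag from the document tree and returns it as the result of the call:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    i_tag = soup.i.extract()
    a_tag
    # <a href="http://example.com/">I linked to </a>
    i_tag
    # <i>example.com</i>
    print(i_tag.parent)
    # None

This method effectively produces two document trees: one rooted at the BeautifulSoup object that parsed the original document, and another rooted at the tag that was removed and returned. The removed tag can itself have extract() called on it again.

Tag.decompose() removes the current node from the document tree and destroys it completely:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    soup.i.decompose()
    a_tag
    # <a href="http://example.com/">I linked to </a>

PageElement.replace_with() removes a piece of content from the document tree and puts a new tag or text node in its place:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup, 'html.parser')
    a_tag = soup.a
    new_tag = soup.new_tag("b")
    new_tag.string = "example.net"
    a_tag.i.replace_with(new_tag)
    a_tag
    # <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or text node that was replaced, which can then be inspected or added back somewhere else in the document tree.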
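Since the replaced node is returned, it can be put back elsewhere. The following sketch shows that with Tag.append() and Tag.insert(); the extra <p id="target"> element and the 'See: ' prefix are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

markup = ('<div><a href="http://example.com/">I linked to <i>example.com</i></a>'
          '<p id="target"></p></div>')
soup = BeautifulSoup(markup, 'html.parser')

new_tag = soup.new_tag('b')
new_tag.string = 'example.net'

# replace_with() returns the node that was taken out of the tree
old_tag = soup.a.i.replace_with(new_tag)

target = soup.find(id='target')
target.append(old_tag)      # insert the old <i> as the last child of <p>
target.insert(0, 'See: ')   # insert a text node at position 0

print(soup.a)
print(target)
```

append() always adds at the end, while insert() takes the position among the tag's children as its first argument.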

2.1.8 bs4 error 1

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Cause: BeautifulSoup(html, 'lxml') cannot be used because lxml is not installed, so bs4 has no lxml parser available.

Fix: pip3 install lxml
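If installing lxml is not an option, one possible workaround is to fall back to the parser that ships with Python. This is a sketch under that assumption; the helper name make_soup is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred='lxml'):
    """Parse with the preferred parser, falling back to the built-in one.

    Hypothetical helper for illustration.
    """
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The requested parser is not installed; 'html.parser' is always available
        return BeautifulSoup(html, 'html.parser')


soup = make_soup('<div>text1</div>')
print(soup.div.get_text())
```

The two parsers can build slightly different trees for malformed HTML, so code that relies on exact structure is safer with one parser pinned.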

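2.2 The lxml.etree.HTML package

2.2.1 lxml.etree.xpath: extracting tags

/ --> roughly like find(): it selects direct children. // --> roughly like find_all(): it selects descendants at any depth. Both are followed by a tag name. [@ ] --> the @ is followed by an attribute name.

    # html_et is the element returned by etree.HTML(text)

    # Select tags by class attribute, then extract the href attribute values
    link_list = html_et.xpath('//div[@class="main-content"]//a/@href')
    # Select tags by class attribute, then extract the text
    text_list = html_et.xpath('//div[@class="lemma-summary"]//text()')
    # Fuzzy match with starts-with
    ele = html_et.xpath('//input[starts-with(@class, "tag")]')  # matches class="tagyou"
    # Fuzzy match on the ending. ends-with() is XPath 2.0 and lxml implements XPath 1.0,
    # so emulate it with substring(); the 2 is len("tag") - 1
    ele = html_et.xpath('//input[substring(@class, string-length(@class) - 2) = "tag"]')  # matches class="youtag"
    # Fuzzy match with contains
    ele = html_et.xpath('//input[contains(@class, "tag")]')  # matches class="youtagyou"
    # Fuzzy match: any attribute with the given value
    ele = html_et.xpath('//input[@*="tag"]')
    # Locate an element by index
    ele = html_et.xpath('/a/b/input[4]')
    # Index positions can shift when elements change (input[4] becomes input[3]),
    # so use last() to locate the last element
    ele = html_et.xpath('/a/b/input[last()]')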
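The expressions above can be tried on a small document. The markup and the name attributes below are assumptions made for illustration; appending /@name to each query is just a convenient way to see which element matched.

```python
from lxml import etree

text = '''
<html><body>
<div class="main-content">
  <a href="/item/a">A</a>
  <a href="/item/b">B</a>
</div>
<form>
  <input class="tagyou" name="first"/>
  <input class="youtag" name="second"/>
  <input class="youtagyou" name="third"/>
</form>
</body></html>
'''
html_et = etree.HTML(text)

links = html_et.xpath('//div[@class="main-content"]//a/@href')
starts = html_et.xpath('//input[starts-with(@class, "tag")]/@name')
contains = html_et.xpath('//input[contains(@class, "tag")]/@name')
# XPath 1.0 stand-in for ends-with(@class, "tag")
ends = html_et.xpath(
    '//input[substring(@class, string-length(@class) - 2) = "tag"]/@name')
last = html_et.xpath('//form/input[last()]/@name')

print(links, starts, contains, ends, last)
```

xpath() returns a list in every case here: strings for attribute and text() queries, elements otherwise.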

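Before using lxml, make sure the HTML has been decoded from UTF-8, i.e. code = html.decode('utf-8', 'ignore'); otherwise parsing may go wrong.

--> The character encodings (charset) used by HTML page source include GB2312, GBK, UTF-8, ISO-8859-1, and others.

    from lxml import etree

    def get_summary(response):  # illustrative wrapper; the original shows only the body
        # Read the response and get the text
        text = response.read().decode('utf-8')
        # Build an _Element object
        html = etree.HTML(text)
        # Match data with xpath; the result is a list of strings
        sen_list = html.xpath('//div[contains(@class,"lemma-summary")]//text()')
        # sen_list = html.xpath('//div[@class="lemma-summary"]//text()')
        # Filter the data: drop bare newlines
        sen_list_after_filter = [item.strip('\n') for item in sen_list if item != '\n']
        # Join the list of strings into one string and return it
        return ''.join(sen_list_after_filter)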
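When the page's charset is not known in advance, one option is to try a few candidate encodings before handing the text to lxml. The sketch below assumes the candidates 'utf-8' and 'gbk' and a hypothetical helper name decode_html; the sample page is made up and encoded as GBK on purpose.

```python
from lxml import etree


def decode_html(raw, encodings=('utf-8', 'gbk')):
    """Try each candidate encoding in turn; fall back to lossy UTF-8.

    Hypothetical helper for illustration.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', 'ignore')


# A made-up page, encoded as GBK to simulate a non-UTF-8 site
raw = '<div class="lemma-summary">\n<p>中文摘要</p>\n</div>'.encode('gbk')

text = decode_html(raw)
html = etree.HTML(text)
parts = html.xpath('//div[contains(@class, "lemma-summary")]//text()')
summary = ''.join(part.strip() for part in parts if part.strip())

print(summary)
```

Trying UTF-8 first is deliberate: it rejects most byte sequences from other encodings, whereas GBK accepts many inputs and could silently produce wrong characters.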
