[python-thirdLib] BeautifulSoup: a third-party Python library for parsing HTML

Published: 2023/12/9

From: http://www.crifan.com/python_third_party_lib_html_parser_beautifulsoup/

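Background

When writing crawlers or parsing web pages in Python, for example:

How to fetch static web pages, fetch dynamic web pages, and simulate logging in to a website with Python, C#, and other languages

you often need to parse HTML and similar web content.

For simple extraction of content from HTML, Python's built-in regular expression module, re, is enough.

For complex HTML, though, and especially for invalid or buggy HTML, it is better to use a dedicated HTML parsing library.

Among Python's dedicated HTML parsing libraries, one that is pleasant to use is BeautifulSoup.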
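
As a rough illustration of the "simple extraction" case where the re module is enough, here is a minimal Python 3 sketch. The sample markup is the demo HTML used later in this article; the function name extract_h1_with_regex and the pattern are assumptions made up for this example, and the pattern only works because the markup is well formed and its exact shape is known in advance.

```python
import re

# Well-formed sample markup (the same snippet used in the demo later on).
DEMO_HTML = '<div class="icon_col"><h1 class="h1user">crifan</h1></div>'


def extract_h1_with_regex(html):
    """Return the text inside <h1 class="h1user">, or None if absent."""
    # Non-greedy match so we stop at the first closing </h1>.
    match = re.search(r'<h1\s+class="h1user">(.+?)</h1>', html)
    return match.group(1) if match else None


print(extract_h1_with_regex(DEMO_HTML))
```

A pattern like this breaks as soon as the attribute order, quoting, or nesting changes, which is the situation where a real parser is the better choice.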

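Introduction to BeautifulSoup

A Python library dedicated to parsing HTML/XML.

Its features:

It can parse HTML even when the code is buggy or otherwise problematic.

It is very powerful.

The BeautifulSoup home page is:

http://www.crummy.com/software/BeautifulSoup/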
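
To make the tolerance for broken markup concrete, here is a minimal sketch, assuming Python 3 with bs4 installed and the standard-library "html.parser" backend. The broken snippet and the function name text_of_unclosed_bold are invented for this illustration; other parser backends may build a slightly different tree from the same input.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: the <b> tag is never closed.
BROKEN_HTML = '<div class="icon_col"><b>crifan</div>'


def text_of_unclosed_bold(html):
    """Parse possibly malformed HTML and return the text of the first <b>."""
    soup = BeautifulSoup(html, "html.parser")
    bold = soup.find("b")
    return None if bold is None else bold.get_text()


print(text_of_unclosed_bold(BROKEN_HTML))
```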

BeautifulSoup versions

BeautifulSoup has two main versions:

BeautifulSoup 3

The earlier one is the 3.x series.

Online documentation for BeautifulSoup 3

The latest available online documentation is:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

The Chinese version is:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html

Downloading BeautifulSoup 3

Many versions are available at:

http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/

for example 3.0.6, the version I usually use:

BeautifulSoup-3.0.6.py

http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py

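BeautifulSoup 4: abbreviated as bs4

The latest version of BeautifulSoup, v4, was renamed to bs4.

Note:

With bs4, BeautifulSoup is imported like this:

from bs4 import BeautifulSoup

After that you can use BeautifulSoup directly, just as in 3.x.

For details, see:

[Solved] In Python 3, bs4 (Beautifulsoup 4) is already installed, yet this error still occurs: ImportError: No module named BeautifulSoup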

Online documentation for bs4

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Downloading bs4

http://www.crummy.com/software/BeautifulSoup/bs4/download/

offers the bs4 releases for download. For example, the latest version at the time of writing is:

beautifulsoup4-4.1.3.tar.gz

http://www.crummy.com/software/BeautifulSoup/bs4/download/beautifulsoup4-4.1.3.tar.gz

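Using BeautifulSoup

How to install BeautifulSoup

3.0.6 and earlier: no installation needed, just put the file in the same directory as your Python file

Versions up to 3.0.6 need no installation, which makes them the simplest to use. Download the version you want, for example:

http://www.crummy.com/software/BeautifulSoup/bs3/download//3.x/BeautifulSoup-3.0.6.py

This gives you BeautifulSoup-3.0.6.py, which you then rename to: BeautifulSoup.py

Then put it in the same directory as your current Python file. For example, my current Python file is:

D:\tmp\tmp_dev_root\python\beautifulsoup_demo\beautifulsoup_demo.py

so the file goes into

D:\tmp\tmp_dev_root\python\beautifulsoup_demo\

next to beautifulsoup_demo.py.

After 3.0.6: BeautifulSoup must be installed before it can be used

Installing a third-party Python module comes down to entering the unpacked directory and running:

setup.py install

(equivalently, python setup.py install, if .py files are not associated with the Python interpreter).

For a detailed explanation, see:

[Summary] Ways to install third-party libraries and packages in Python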
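
Because BeautifulSoup 3 is imported as BeautifulSoup and version 4 as bs4, it can help to check which module name is actually importable before writing the import line. The following is a small Python 3 sketch using only the standard library; the function name find_installed_soup is an assumption made up for this example.

```python
import importlib.util


def find_installed_soup():
    """Return the importable Beautiful Soup module name, or None."""
    # "bs4" is the BeautifulSoup 4 package; "BeautifulSoup" is the 3.x module.
    for module_name in ("bs4", "BeautifulSoup"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return None


print("importable Beautiful Soup module:", find_installed_soup())
```

If this reports bs4, the import to use is from bs4 import BeautifulSoup; a plain import of BeautifulSoup would raise the ImportError mentioned earlier.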

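How to use BeautifulSoup

Simply import it in your Python file, here beautifulsoup_demo.py.

For sample HTML code, you can use, for example, the one from:

[Tutorial] Fetching a web page and extracting the needed information from it: Python version

Related reference documentation:

For the 3.x version:

find(name, attrs, recursive, text, **kwargs)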

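Extracting a piece of content from HTML with BeautifulSoup

The simplest, most basic use is extracting a piece of content from HTML, which is done with the corresponding find function.

The complete code corresponds to part 1 of the full example further below.

The output is: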

D:\tmp\tmp_dev_root\python\beautifulsoup_demo>beautifulsoup_demo.py
type(soup)= <type 'instance'>
soup=
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
h1userSoup= <h1 class="h1user">crifan</h1>
h1userUnicodeStr= crifan
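
For readers on bs4, the same extraction can be sketched as follows. This assumes Python 3 with bs4 installed; the helper name extract_h1_text and the explicit "html.parser" argument are choices made for this illustration and are not part of the original demo.

```python
from bs4 import BeautifulSoup

DEMO_HTML = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""


def extract_h1_text(html):
    """Return the string inside <h1 class="h1user">, or None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    # Same call shape as the 3.x find(name, attrs, ...) shown above.
    h1 = soup.find(name="h1", attrs={"class": "h1user"})
    return None if h1 is None else h1.string


print(extract_h1_text(DEMO_HTML))
```

find returns None when nothing matches, so checking the result before reading .string avoids an AttributeError on pages that lack the tag.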

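Modifying/changing/replacing a piece of content in the original HTML with BeautifulSoup

If you need to change a value in the original HTML, see the explanation on the official site:

Modifying attribute values

It later turned out that, with the BeautifulSoup 3 used here, only the attribute values of a Tag can be changed, not the value of the Tag itself.

The complete example code is: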

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Tutorial] BeautifulSoup, a third-party Python library for parsing HTML
http://www.crifan.com/python_third_party_lib_html_parser_beautifulsoup

Author:     Crifan Li
Version:    2013-02-01
Contact:    admin at crifan dot com
"""

# Python 2 with BeautifulSoup 3.x (BeautifulSoup.py in the same directory)
from BeautifulSoup import BeautifulSoup

def beautifulsoupDemo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">crifan</h1>
</div>
</body>
</html>
"""
    soup = BeautifulSoup(demoHtml)
    print "type(soup)=", type(soup)  # type(soup)= <type 'instance'>
    print "soup=", soup

    print '{0:=^80}'.format(" 1. extract content ")
    # method 1: positional arguments, no parameter names
    #h1userSoup = soup.find("h1", {"class":"h1user"})
    # method 2: named parameters
    h1userSoup = soup.find(name="h1", attrs={"class":"h1user"})
    # more can be found at:
    # http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#find%28name,%20attrs,%20recursive,%20text,%20**kwargs%29
    print "h1userSoup=", h1userSoup  # h1userSoup= <h1 class="h1user">crifan</h1>
    h1userUnicodeStr = h1userSoup.string
    print "h1userUnicodeStr=", h1userUnicodeStr  # h1userUnicodeStr= crifan

    print '{0:=^80}'.format(" 2. demo change tag value and property ")
    print '{0:-^80}'.format(" 2.1 can NOT change tag value ")
    print "old tag value=", soup.body.div.h1.string  # old tag value= crifan
    changedToString = u"CrifanLi"
    soup.body.div.h1.string = changedToString
    # The attribute reads back as changed, but the document itself is not updated
    print "changed tag value=", soup.body.div.h1.string  # changed tag value= CrifanLi
    print "After changed tag value, new h1=", soup.body.div.h1  # After changed tag value, new h1= <h1 class="h1user">crifan</h1>

    print '{0:-^80}'.format(" 2.2 can change tag property ")
    soup.body.div.h1['class'] = "newH1User"
    print "changed tag property value=", soup.body.div.h1  # changed tag property value= <h1 class="newH1User">crifan</h1>

if __name__ == "__main__":
    beautifulsoupDemo()
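
The limitation on changing a tag's value applies to the BeautifulSoup 3 API used in the demo. For comparison, here is a hedged sketch of the same two changes under bs4, whose documentation describes assigning to a tag's .string as replacing the tag's contents. It assumes Python 3 with bs4 installed; the helper name rename_user and the "html.parser" choice are made up for this illustration.

```python
from bs4 import BeautifulSoup

DEMO_HTML = '<div class="icon_col"><h1 class="h1user">crifan</h1></div>'


def rename_user(html, new_text, new_class):
    """Replace the <h1> text and its class attribute, return the new tag."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    h1.string = new_text      # in bs4 this replaces the tag's contents
    h1["class"] = new_class   # attributes are changed dict-style, as in 3.x
    return str(h1)


print(rename_user(DEMO_HTML, "CrifanLi", "newH1User"))
```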

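Summary

More usage notes and lessons learned have been partly collected in:

[Summary] Lessons learned from using the third-party Python library BeautifulSoup

[Collected] Things to note when using BeautifulSoup, the HTML processing library for Python

When time permits, they will be consolidated into:

BeautifulSoup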
