In Full Detail! Everything You Need to Know About BeautifulSoup

Author: 派森醬 2021-10-05 21:03:54

BeautifulSoup uses the NavigableString class to wrap the strings inside a tag; a NavigableString is a string that can be navigated.


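A Detailed Look at Scraping with BeautifulSoup

The previous article was the first part of this introduction to scraping with BeautifulSoup. It covered installation and a brief overview, and gave a quick tour of locating tags and reading their contents. Today's article goes deeper into BeautifulSoup's syntax and usage.

1. BeautifulSoup Objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. The official BeautifulSoup documentation groups all of these objects into four kinds: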

Tag
NavigableString
BeautifulSoup
Comment

Let's look at each of the four objects in turn:

Tag

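A Tag object represents a tag in an XML or HTML document; put simply, it is one of the individual tags in the HTML, and it corresponds to the tag in the original HTML or XML document. A Tag has many methods and attributes. In BeautifulSoup you access one as soup.<tag>, where <tag> is an HTML tag name such as a or title; the result is the first matching tag in full, including its attributes and content. For example, each of the following is a Tag: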

<title>BeautifulSoup Explained in Detail</title>
<p class="title">Hello</p>
<p class="con">Python Technology</p>

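In the HTML above, title and p are tags; an opening tag, a closing tag, and the content between them together make up a Tag. The code for getting a tag is as follows:

from bs4 import BeautifulSoup

# Create a soup object from a local file
soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
# Get the first a tag
a = soup.a  # Tag
print('The a tag is:', a)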
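As a minimal sketch of the same access pattern, the snippet below parses an inline HTML string instead of test.html (the string and the variable names are illustrative assumptions, not part of the original example). It shows that soup.a returns the first matching Tag and that asking for a tag that does not exist gives None.

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for test.html (an assumption for this sketch)
html = (
    '<title>Demo</title>'
    '<p class="title">Hello</p>'
    '<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'
)
soup = BeautifulSoup(html, "html.parser")

a = soup.a            # the first <a> tag, a bs4.element.Tag
missing = soup.table  # there is no <table>, so this is None

print(type(a))
print(a)
print(missing)
```

Checking for None before using the result is a reasonable habit when a tag may be absent from the page.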

Beyond that, the most important attributes of a Tag are name and attrs.

name

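The name attribute returns a tag's name in the document tree. To get the name of the title tag, just use soup.title.name; for an ordinary tag inside the document, the value is simply the name of the tag itself.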
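A short sketch of reading and changing name, using a made-up one-line document (the markup and variable names are assumptions for illustration). Assigning to .name renames the tag in the generated markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>Demo</title>", "html.parser")
tag = soup.title

original_name = tag.name   # the tag's name as a plain string
tag.name = "h1"            # rename the tag in place
renamed_markup = str(tag)  # markup now uses the new name

print(original_name)
print(renamed_markup)
```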

attrs

attrs is short for attributes, which are an important part of a web page's tags. A single tag (Tag) can have many attributes, for example:

<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>

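The example above has three attributes: href, with the value "https://www.baidu.com"; class, with the value "xiaodu"; and id, with the value "l1". Tag attributes are accessed the same way as a Python dictionary. In the code below, soup.p.attrs returns a dictionary holding the attributes and values of the first p paragraph, and the remaining lines read individual attribute values of the a tag.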

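# Get all attributes of the first p tag as a dictionary
print(soup.p.attrs)

# Get individual attribute values (class is multi-valued, so it is a list)
print(soup.a['class'])
# ['xiaodu']
print(soup.a.get('id'))
# l1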

Every tag in BeautifulSoup can have many attributes, which are available through ".attrs". A tag's attributes can be modified, deleted, or added.
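Modifying, adding, and deleting attributes can be sketched with ordinary dictionary syntax. The example below reuses the a tag from above as an inline string; the new values "l2" and "search" are made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'
soup = BeautifulSoup(markup, "html.parser")
a = soup.a

a["id"] = "l2"         # modify an existing attribute
a["title"] = "search"  # add a new attribute (hypothetical value)
del a["class"]         # delete an attribute

print(a.attrs)
print(a.get("class"))  # .get() returns None for a missing attribute
```

Using .get() rather than square brackets avoids a KeyError when an attribute may be missing.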

NavigableString

NavigableString is also known as a navigable string. Strings are usually contained inside tags, and BeautifulSoup uses the NavigableString class to wrap them. A NavigableString behaves like a Python Unicode string, and it additionally supports some of the features described in navigating and searching the document tree. The following code shows the type of a NavigableString.

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
print(type(tag.string))

The output is as follows:

<class 'bs4.element.NavigableString'>
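To make the "string plus navigation" idea concrete, here is a small sketch on a made-up one-line document (the markup and the replacement text are assumptions). It shows that a NavigableString is a str subclass, that it knows its parent tag, and that it can be replaced in place with replace_with().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>Demo</title>", "html.parser")
text = soup.title.string

is_str = isinstance(text, str)  # NavigableString subclasses str
plain = str(text)               # a plain Python string with no tree links
parent_name = text.parent.name  # the string still knows which tag holds it

text.replace_with("New title")  # swap the string inside the tag
updated_markup = str(soup.title)

print(is_str, plain, parent_name)
print(updated_markup)
```

Converting to a plain str is useful when the text should outlive the parsed tree, since a NavigableString keeps a reference to the whole document.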

BeautifulSoup

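A BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it supports most of the methods described under navigating and searching the document tree. The code below prints the type of the soup object, which is the BeautifulSoup type.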

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(type(soup))

The output is as follows:

<class 'bs4.BeautifulSoup'>

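Because a BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes of its own. Being able to look at its .name is sometimes convenient, though, so the BeautifulSoup object is given a special soup.name whose value is [document]; printing soup.name therefore outputs [document].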
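A minimal sketch of that special name, using a made-up one-line document (an assumption for illustration). It also checks that the soup object can be treated as a Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

soup_name = soup.name                # the special name of the document root
behaves_as_tag = isinstance(soup, Tag)

print(soup_name)
print(behaves_as_tag)
```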

Comment

Comment objects are a special type of NavigableString used for handling comments. The sample code below reads the content of a comment:

from bs4 import BeautifulSoup


def mark():
    markup = "<b><!-- hello comment code --></b>"
    soup = BeautifulSoup(markup, "html.parser")
    print(type(soup))
    comment = soup.b.string
    print(type(comment))
    print(comment)


if __name__ == '__main__':
    mark()

The output is as follows:

<class 'bs4.BeautifulSoup'> 
<class 'bs4.element.Comment'> 
 hello comment code
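Because a Comment is also a NavigableString, comments can show up when you collect a page's strings. One possible way to find and strip them is sketched below; the extra p tag in the markup and the variable names are assumptions added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!-- hello comment code --></b><p>visible text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every string node that is a Comment
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# Remove the comments from the tree
for c in comments:
    c.extract()
cleaned = str(soup)

print(comment_texts)
print(cleaned)
```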

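2. Navigating the Document Tree

Now that the four objects have been covered, the rest of the article looks at navigating the document tree, searching the document tree, and commonly used BeautifulSoup functions. In BeautifulSoup, a tag (Tag) may contain several strings or other tags; these are called the tag's children.

We'll keep using the following HTML document (the test.html file opened in the code samples) for the explanations: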

<!DOCTYPE html> 
<html lang="en"> 
<head> 
    <title>BeautifulSoup Explained in Detail</title>
</head>
<body>
<p class="title">Hello</p>
<p class="con">Python Technology</p>
 
<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a> 
 
</body> 
</html>

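Child nodes

A Tag may contain several strings or other Tags; all of these are the Tag's child nodes. Beautiful Soup provides many attributes for working with and iterating over child nodes.

For example, to get the child nodes of a tag: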

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.contents)

The output is as follows:

['\n', <title>BeautifulSoup Explained in Detail</title>, '\n']

Note: string nodes in Beautiful Soup do not support these attributes, because a string has no child nodes.
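Since .contents mixes tags and strings, it often helps to separate the two. The sketch below does so with isinstance checks while iterating over .children; the div markup and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<div><p>one</p><p>two</p>tail text</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list; .children iterates over the same nodes
child_count = len(div.contents)
child_tags = [child.name for child in div.children if isinstance(child, Tag)]
child_strings = [str(child) for child in div.children
                 if isinstance(child, NavigableString)]

print(child_count)
print(child_tags)
print(child_strings)
```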

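Node content

If a tag has only one child node and you need that child's content, use the string attribute to output it. When a tag has more than one child, as head does here because of the surrounding newline strings, .string is None: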

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
print(soup.head.string) 
 
print(soup.title.string)

The output is as follows:

None
BeautifulSoup Explained in Detail
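When .string is None because a tag has several children, stripped_strings or get_text() can still pull out the text. A small sketch on a cut-down version of the head element (the shortened title is an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = "<head>\n<title>Demo</title>\n</head>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # one child, so the string is returned
head_text = soup.head.string     # three children ('\n', <title>, '\n'), so None

# Alternatives that work for tags with several children
pieces = list(soup.head.stripped_strings)
joined = soup.head.get_text(strip=True)

print(title_text, head_text)
print(pieces, joined)
```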

Parent node

Use the parent attribute to locate a node's parent; to get the parent's tag name, use parent.name. For example:

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
p = soup.p 
print(p.parent) 
print(p.parent.name) 
 
content = soup.head.title.string 
print(content.parent) 
print(content.parent.name)

The output is as follows:

<body>
<p class="title">Hello</p>
<p class="con">Python Technology</p>
<a class="xiaodu" href="https://www.baidu.com" id="l1">ddd</a>
</body>
body
<title>BeautifulSoup Explained in Detail</title>
title

Sibling nodes

Sibling nodes are nodes on the same level as the current node. The next_sibling attribute returns the node's next sibling, and previous_sibling does the opposite and returns the previous sibling; if there is no such node, the result is None.

print(soup.p.next_sibling)
print(soup.p.previous_sibling)
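One thing that tends to surprise people: in a document with line breaks between tags, the immediate sibling of a tag is usually a newline string, not the next tag. The sketch below, on a cut-down body (the markup is an assumption for illustration), contrasts next_sibling with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

html = '<body>\n<p class="title">Hello</p>\n<p class="con">Python</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

raw_next = first_p.next_sibling               # the newline string after the tag
next_p = first_p.find_next_sibling("p")       # the next <p> tag
prev_p = first_p.find_previous_sibling("p")   # no earlier <p>, so None

print(repr(raw_next))
print(next_p)
print(prev_p)
```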

Next and previous elements

The next_element attribute returns the next node in parse order, and the previous_element attribute returns the previous one. For example:

print(soup.p.next_element) 
print(soup.p.previous_element)
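The difference from siblings is easiest to see side by side. In this sketch (inline markup without line breaks, an assumption for illustration), the next element of a tag is the string inside it, while its next sibling is the following tag.

```python
from bs4 import BeautifulSoup

html = '<p class="title">Hello</p><a id="l1">ddd</a>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

next_elem = p.next_element          # the string parsed right after <p> opens
next_sib = p.next_sibling           # the <a> tag on the same level
prev_elem = soup.a.previous_element # the string parsed just before <a>

print(repr(next_elem))
print(next_sib)
print(repr(prev_elem))
```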

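3. Searching the Document Tree

BeautifulSoup defines many search methods, such as find() and find_all(), of which find_all() is the most commonly used. The other methods mirror tree navigation and cover parents, children, siblings, and so on. The following code uses find_all():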

# coding=utf-8 
from bs4 import BeautifulSoup 
soup = BeautifulSoup(open('test.html','rb'), "html.parser") 
tag = soup.title 
 
urls = soup.find_all('p') 
for u in urls: 
    print(u)

The output is as follows:

<p class="title">Hello</p>
<p class="con">Python Technology</p>

find_all() returns every element in the document that matches what you are looking for.
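find_all() also accepts filters beyond a tag name. The sketch below shows a few common ones (class_, an attribute keyword, and limit) next to find(), which returns only the first match. The inline markup is a shortened stand-in for test.html and is an assumption for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title">Hello</p>'
    '<p class="con">Python</p>'
    '<a href="https://www.baidu.com" class="xiaodu" id="l1">ddd</a>'
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                          # first match only
con_paragraphs = soup.find_all("p", class_="con") # filter by CSS class
by_id = soup.find_all(id="l1")                    # filter by attribute
limited = soup.find_all("p", limit=1)             # stop after one result
no_table = soup.find("table")                     # None when nothing matches
no_tables = soup.find_all("table")                # empty list when nothing matches

print(first_p)
print(con_paragraphs, by_id, limited)
print(no_table, no_tables)
```

Note the different "not found" results: find() gives None, while find_all() gives an empty list, so only the former needs a None check before use.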

Summary

That wraps up the BeautifulSoup basics and usage as far as I (阿醬) understand them. Please bear with any mistakes, and let's keep moving forward together.


Editor: 武曉燕  Source: Python技術
