Four Common Ways to Locate Elements in Python Web Scraping, Compared: Which Do You Prefer?

Author: 陳熹, 2021-02-19 11:04:29

When scraping data with Python, a key task is extracting the data from the page you requested, and correctly locating the data you want is the first step of that task.

This article compares several commonly used ways of locating page elements in Python scrapers:

Traditional BeautifulSoup operations
CSS selectors via BeautifulSoup (similar to PyQuery)
XPath
Regular expressions

The reference page is Dangdang's overall bestseller list for books:

http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1

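The example task is to get the titles of the 20 books on the first page. First, check whether the site has anti-scraping measures in place, that is, whether a plain request returns the content we want to parse: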

import requests 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
print(response)

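A careful look shows that all the data we need is in the returned content, so no special handling for anti-scraping measures is required.

Inspecting the page elements shows that each book entry sits in an li, and that these li elements belong to the ul whose class is bang_list clearfix bang_list_mode.

Further inspection also shows where the title sits inside each entry. This structure is the foundation for every parsing method that follows.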
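The "is the data actually in the response?" check can be made repeatable by testing the returned text for markers you expect to see. The following is only a minimal sketch: the helper name `contains_markers` and both sample strings are hypothetical stand-ins for a real response body, and the markers are simply the class names described above.

```python
def contains_markers(page_text, markers):
    """Return True if every expected marker string occurs in the page text."""
    return all(marker in page_text for marker in markers)


# Hypothetical stand-ins for a real response body.
good_page = (
    '<ul class="bang_list clearfix bang_list_mode">'
    '<li><div class="name"><a title="Sample Book">Sample Book</a></div></li>'
    '</ul>'
)
blocked_page = '<html><body>Access denied</body></html>'

expected = ['bang_list', 'class="name"']
print(contains_markers(good_page, expected))
print(contains_markers(blocked_page, expected))
```

A check like this only tells you that the markers are present; it does not replace inspecting the page by hand at least once.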

1. Traditional BeautifulSoup operations

The classic BeautifulSoup approach starts with from bs4 import BeautifulSoup, then uses soup = BeautifulSoup(html, "lxml") to convert the text into a structured tree, and parses that tree with the find family of methods. The code is as follows:

import requests 
from bs4 import BeautifulSoup 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
 
def bs_for_parse(response): 
    soup = BeautifulSoup(response, "lxml") 
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li') # locate the ul, then collect its 20 li elements 
    for li in li_list: 
        title = li.find('div', class_='name').find('a')['title'] # parse each entry to get the book title 
        print(title) 
 
if __name__ == '__main__': 
    bs_for_parse(response)

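All 20 titles are retrieved successfully. Some titles are rather long; they can be trimmed with a regular expression or other string methods, which this article does not cover in detail.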
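As one possible way to shorten long titles, the sketch below keeps only the text before the first opening parenthesis. It assumes that the extra wording sits in parentheses (ASCII or full-width); that rule, the helper name `shorten_title`, and the sample titles are all hypothetical and would need adjusting to the real data.

```python
import re


def shorten_title(title):
    """Keep the text before the first opening parenthesis (ASCII or full-width)."""
    head = re.split(r'[(（]', title, maxsplit=1)[0]
    return head.strip()


# Hypothetical sample titles, not taken from the real page.
samples = [
    'Example Book (Deluxe Edition, with bonus poster)',
    '示例书名（新版 附赠书签）',
    'Plain Title',
]
for sample in samples:
    print(shorten_title(sample))
```

Note that a title beginning with a parenthesis would come back empty under this rule, so a real cleanup step may want a fallback to the original string.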

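2. CSS selectors via BeautifulSoup

This approach is essentially PyQuery's CSS selectors carried over to another module, and the usage is similar. For the detailed CSS selector syntax, see: http://www.w3school.com.cn/cssref/css_selectors.asp

Because it is built on BeautifulSoup, the imports and the conversion of the text into a structured tree are the same: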

import requests 
from bs4 import BeautifulSoup 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
         
def css_for_parse(response): 
    soup = BeautifulSoup(response, "lxml")  
    print(soup) 
 
if __name__ == '__main__': 
    css_for_parse(response)

The next step is to call soup.select with the appropriate CSS syntax to get the content you want. As before, this depends on carefully inspecting and analyzing the page elements:

import requests 
from bs4 import BeautifulSoup 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
         
def css_for_parse(response): 
    soup = BeautifulSoup(response, "lxml") 
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li') 
    for li in li_list: 
        title = li.select('div.name > a')[0]['title'] 
        print(title) 
 
if __name__ == '__main__': 
    css_for_parse(response)
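Indexing with `[0]` raises an IndexError if an li has no matching link. As a hedged variation, the sketch below uses `select_one`, which returns None when nothing matches, and skips such entries. The inline `SAMPLE` markup is a hypothetical imitation of the structure described in this article, and it is parsed with Python's built-in "html.parser" rather than "lxml" only so that the example needs no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the article.
SAMPLE = """
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="http://product.dangdang.com/1.html" target="_blank" title="Sample Book A">Sample Book A</a></div></li>
  <li><div class="pic">no name block here</div></li>
  <li><div class="name"><a href="http://product.dangdang.com/2.html" target="_blank" title="Sample Book B">Sample Book B</a></div></li>
</ul>
"""


def css_titles(page_text):
    """Collect title attributes, skipping entries without a matching link."""
    soup = BeautifulSoup(page_text, "html.parser")
    titles = []
    for li in soup.select('ul.bang_list.clearfix.bang_list_mode > li'):
        link = li.select_one('div.name > a')  # None when nothing matches
        if link is not None and link.has_attr('title'):
            titles.append(link['title'])
    return titles


print(css_titles(SAMPLE))
```

Returning a list instead of printing inside the loop also makes the function easier to reuse and to test.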

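3. XPath

XPath is the XML Path Language, a language for addressing parts of an XML document. If you use Chrome, installing the XPath Helper extension is recommended; it makes writing XPath much faster.

Earlier scraping articles were mostly based on XPath, so readers should be fairly familiar with it. The code is given directly: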

import requests 
from lxml import html 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
 
def xpath_for_parse(response): 
    selector = html.fromstring(response) 
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li") 
    for book in books: 
        title = book.xpath('div[@class="name"]/a/@title')[0] 
        print(title) 
 
if __name__ == '__main__': 
    xpath_for_parse(response)
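One thing to keep in mind with XPath is that `@class='...'` compares the whole attribute string, so it stops matching if the class names appear in a different order. The sketch below shows a common, more tolerant idiom based on `contains` and `normalize-space`. The `SAMPLE` markup, with its classes deliberately reordered, is hypothetical and only serves to illustrate the difference.

```python
from lxml import html

# Hypothetical markup; the class names are deliberately in a different order.
SAMPLE = (
    '<html><body>'
    '<ul class="clearfix bang_list bang_list_mode">'
    '<li><div class="name"><a href="http://product.dangdang.com/1.html" title="Sample Book A">A</a></div></li>'
    '<li><div class="name"><a href="http://product.dangdang.com/2.html" title="Sample Book B">B</a></div></li>'
    '</ul>'
    '</body></html>'
)

EXACT = "//ul[@class='bang_list clearfix bang_list_mode']/li/div[@class='name']/a/@title"
TOLERANT = (
    "//ul[contains(concat(' ', normalize-space(@class), ' '), ' bang_list ')]"
    "/li/div[@class='name']/a/@title"
)


def xpath_titles(page_text, expression):
    """Evaluate an XPath expression and return the matched strings as a list."""
    tree = html.fromstring(page_text)
    return [str(value) for value in tree.xpath(expression)]


print(xpath_titles(SAMPLE, EXACT))     # exact string comparison on @class
print(xpath_titles(SAMPLE, TOLERANT))  # matches on a single class token
```

On the real page the exact form used in this article works as long as the class attribute is written exactly as shown; the tolerant form is just a safeguard against small markup changes.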

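4. Regular expressions

If you are not familiar with HTML, the previous methods can be hard going. Here is a general-purpose alternative: regular expressions. You only need to notice what distinctive pattern surrounds the text you want, then write a rule that captures it. The module used is re.

First, look again at the raw response and note what is distinctive about the text before and after the content we need: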

import requests 
import re 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
print(response)

Look at a few book entries and the answer becomes clear: <div class="name"><a href="http://product.dangdang.com/xxxxxxxx.html" target="_blank" title="xxxxxxx">

The title is embedded in the string above, and the number at the end of the URL changes from book to book.

With that analysis done, the regular expression can be written:

import requests 
import re 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
 
def re_for_parse(response): 
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">' # raw string; literal dots escaped 
    for title in re.findall(reg, response): 
        print(title) 
 
if __name__ == '__main__': 
    re_for_parse(response)
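Two details are easy to overlook when writing such a pattern: an unescaped `.` matches any character, not just a literal dot, and a raw string (`r'...'`) keeps Python from interpreting backslashes such as `\d` itself. The sketch below compares a loose and a strict pattern on a hypothetical sample; the second sample line is invented purely to show a false match.

```python
import re

# Hypothetical sample; the second link is NOT a ".html" product URL.
SAMPLE = (
    '<div class="name"><a href="http://product.dangdang.com/101.html" '
    'target="_blank" title="Sample Book A">Sample Book A</a></div>\n'
    '<div class="name"><a href="http://product.dangdang.com/102xhtml" '
    'target="_blank" title="Not A Product Page">x</a></div>\n'
)

# Unescaped dots: "." matches any character, so "102xhtml" also matches.
LOOSE = re.compile(
    r'<div class="name"><a href="http://product.dangdang.com/\d+.html" '
    r'target="_blank" title="(.*?)">'
)
# Escaped dots: only a literal ".html" suffix matches.
STRICT = re.compile(
    r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" '
    r'target="_blank" title="(.*?)">'
)

print(LOOSE.findall(SAMPLE))
print(STRICT.findall(SAMPLE))
```

The strict pattern is also more brittle in another way: it depends on the exact attribute order and spacing in the page source, so it needs re-checking whenever the markup changes.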

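As you can see, the regex version is the shortest to write, but it requires real fluency with regular expression rules. Regex for the win!

Of course, every method has scenarios it suits best. In practice we still need to analyze the page structure to decide how to locate elements efficiently. The complete code for the four methods covered in this article follows; try running it yourself to get a better feel for each one.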

import requests 
from bs4 import BeautifulSoup 
from lxml import html 
import re 
 
url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1' 
response = requests.get(url).text 
 
def bs_for_parse(response): 
    soup = BeautifulSoup(response, "lxml") 
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li') 
    for li in li_list: 
        title = li.find('div', class_='name').find('a')['title'] 
        print(title) 
 
def css_for_parse(response): 
    soup = BeautifulSoup(response, "lxml") 
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li') 
    for li in li_list: 
        title = li.select('div.name > a')[0]['title'] 
        print(title) 
 
def xpath_for_parse(response): 
    selector = html.fromstring(response) 
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li") 
    for book in books: 
        title = book.xpath('div[@class="name"]/a/@title')[0] 
        print(title) 
 
def re_for_parse(response): 
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">' # raw string; literal dots escaped 
    for title in re.findall(reg, response): 
        print(title) 
 
if __name__ == '__main__': 
    # bs_for_parse(response) 
    # css_for_parse(response) 
    # xpath_for_parse(response) 
    re_for_parse(response)

Editor in charge: 趙寧寧. Source: 早起Python
