
A Simple Guide to Web Scraping with Python and Beautiful Soup

Author: Ayush Sharma, 2021-12-16 15:09:45


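The Beautiful Soup library for Python makes it easy to extract HTML content from web pages.

Today we'll look at how to use Beautiful Soup to extract content from an HTML page and then convert that content into a Python list or dictionary.

What is web scraping, and why do I need it?

The answer is simple: not every website offers an API for fetching its content. You might want to get recipes from your favorite cooking site or photos from a travel blog. Without an API, extracting the HTML (that is, scraping) may be the only way to get that content. I'll show you how to do it in Python.

Not all websites like being scraped, and some explicitly prohibit it. Check with the website owner whether scraping is allowed.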
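One machine-readable hint about a site's wishes is its robots.txt file. As a minimal sketch, Python's standard urllib.robotparser module can report whether a URL is permitted for a given user agent. The rules, the example.com URLs, and the "my-scraper" user-agent name below are made up for illustration so the example runs offline. A robots.txt check does not replace the site's terms of use or the owner's permission.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used here so the example runs offline.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(rules_text, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)

allowed = is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/technology")
blocked = is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/private/notes")
print(allowed, blocked)
```

To check a live site instead, RobotFileParser also provides set_url() and read(), which download and parse the site's own robots.txt.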

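How do I scrape a website with Python?

Scraping with Python involves three basic steps:

1. Fetch the HTML content with the requests library
2. Analyze the HTML structure and identify the tags that hold the content we need
3. Extract those tags with Beautiful Soup and put the data into a Python list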

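Installing the libraries

First, install the libraries we need. The requests library fetches HTML content from a website, and Beautiful Soup parses the HTML and converts it into Python objects. To install both for Python 3, run: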

pip3 install requests beautifulsoup4

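Fetching the HTML

In this example, I'll scrape the Technology section of the notes.ayushsharma.in website. If you go to that page, you'll see a list of articles, each with a title, an excerpt, and a publication date. Our goal is to build a list of articles containing that information.

The full URL of the page is: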

https://notes.ayushsharma.in/technology

We can use requests to fetch the HTML content of this page:

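#!/usr/bin/python3
import requests

url = 'https://notes.ayushsharma.in/technology'

# Send a GET request; data is a requests Response object
data = requests.get(url)

# data.text holds the response body (the HTML source) as a string
print(data.text)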

The data variable holds the response, and data.text contains the page's HTML source.
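The request above assumes the server answers promptly and successfully. As a hedged sketch, a small wrapper can add a timeout and turn HTTP error statuses into exceptions. The function name fetch_html, the 10-second default, and the get parameter (there only so the function can be exercised without a network) are illustrative choices, not part of the original script.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the page HTML as a string, raising on HTTP error statuses."""
    # A timeout stops the call from waiting forever on an unresponsive server.
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text

# Example call (needs network access):
# html_source = fetch_html('https://notes.ayushsharma.in/technology')
```

Failing early here means the parsing step never runs against an error page by mistake.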

Extracting content from the HTML

To extract content from data, we need to identify which tags hold the content we need.

If you browse through the HTML, you'll find this section near the top:

<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml can be used to define custom
          variables for reusing content. I feel like I've been living under a rock all this time. But to err over and
          over again is human.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>

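This block repeats throughout the page, once per article. We can see that .card-title contains the article title, .card-text contains the excerpt, and .card-footer > small contains the publication date.

Let's use Beautiful Soup to extract this content.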
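To see how those selectors map onto the markup, here is a small offline sketch that parses a shortened copy of the card shown above. It uses select_one(), which returns the first match or None, and the SAMPLE string is abbreviated for illustration rather than fetched from the site.

```python
from bs4 import BeautifulSoup

# A shortened copy of one article card, so the example runs without network access.
SAMPLE = """
<div class="col">
  <a href="/2021/08/using-variables-in-jekyll-to-define-custom-content" class="post-card">
    <div class="card">
      <div class="card-body">
        <h5 class="card-title">Using variables in Jekyll to define custom content</h5>
        <small class="card-text text-muted">I recently discovered that Jekyll's config.yml
          can be used to define custom variables for reusing content.</small>
      </div>
      <div class="card-footer text-end">
        <small class="text-muted">Aug 2021</small>
      </div>
    </div>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
card = soup.select_one('a.post-card')

title = card.select_one('.card-title').get_text(strip=True)
# The excerpt wraps across lines in the HTML, so collapse runs of whitespace.
excerpt = ' '.join(card.select_one('.card-text').get_text().split())
pub_date = card.select_one('.card-footer > small').get_text(strip=True)
link = card['href']  # tag attributes are read like dictionary keys

print(title, pub_date, link, sep='\n')
```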

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://notes.ayushsharma.in/technology'
data = requests.get(url)

my_data = []

html = BeautifulSoup(data.text, 'html.parser')
articles = html.select('a.post-card')  # one <a class="post-card"> per article

for article in articles:
    # select() returns a list of matches; take the first one for each field
    title = article.select('.card-title')[0].get_text()
    excerpt = article.select('.card-text')[0].get_text()
    pub_date = article.select('.card-footer small')[0].get_text()

    my_data.append({"title": title, "excerpt": excerpt, "pub_date": pub_date})

pprint(my_data)

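The code above extracts the article information and stores it in the my_data variable. I used pprint to pretty-print the output, but you can leave it out of your own code. Save the code above in a file named fetch.py, then run it: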

python3 fetch.py

If everything went well, you should see:

[{'excerpt': "I recently discovered that Jekyll's config.yml can be used to "
             "define custom variables for reusing content. I feel like I've "
             'been living under a rock all this time. But to err over and over '
             'again is human.',
  'pub_date': 'Aug 2021',
  'title': 'Using variables in Jekyll to define custom content'},
 {'excerpt': "In this article, I'll highlight some ideas for Jekyll "
             'collections, blog category pages, responsive web-design, and '
             'netlify.toml to make static website maintenance a breeze.',
  'pub_date': 'Jul 2021',
  'title': 'The evolution of ayushsharma.in: Jekyll, Bootstrap, Netlify, '
           'static websites, and responsive design.'},
 {'excerpt': "These are the top 5 lessons I've learned after 5 years of "
             'Terraform-ing.',
  'pub_date': 'Jul 2021',
  'title': '5 key best practices for sane and usable Terraform setups'},

... (truncated)

That's all there is to it! In these 22 lines of code, we built a web scraper in Python. You can find the source code in my example repository.
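One caveat about the scraper: select(...)[0] raises an IndexError if a card lacks one of the expected elements. The following is a sketch of a more tolerant variant, not the author's code. The helper names text_or_none and extract_articles and the sample markup are hypothetical, and the choice to skip cards without a title is an assumption about what counts as a usable article.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

def extract_articles(html_text):
    """Build a list of article dicts, skipping cards that have no title."""
    soup = BeautifulSoup(html_text, 'html.parser')
    results = []
    for card in soup.select('a.post-card'):
        title = text_or_none(card, '.card-title')
        if title is None:
            continue  # not a usable article card
        results.append({
            'title': title,
            'excerpt': text_or_none(card, '.card-text'),
            'pub_date': text_or_none(card, '.card-footer small'),
        })
    return results

# Made-up markup: the second card has no date, the third has no title.
sample = (
    '<a class="post-card"><h5 class="card-title">First</h5>'
    '<small class="card-text">One</small>'
    '<div class="card-footer"><small>Aug 2021</small></div></a>'
    '<a class="post-card"><h5 class="card-title">Second</h5>'
    '<small class="card-text">Two</small></a>'
    '<a class="post-card"><small class="card-text">No title</small></a>'
)
print(extract_articles(sample))
```

Missing fields come back as None instead of stopping the whole run, which tends to matter once a page has more than a handful of cards.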

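Summary

With the website content in a Python list, we can now do some cool things with it. We can return it as JSON to another application, or convert it to HTML with custom styling. Feel free to copy and paste the code above and experiment with your favorite website.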
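As a rough illustration of both ideas, the sketch below serializes a list shaped like my_data to JSON with the standard json module and renders it as a simple HTML list. The two sample records are abbreviated from the article's output, and the to_html function and its markup are hypothetical choices, not a prescribed format.

```python
import json
from html import escape

# Sample records in the same shape as my_data (excerpts abbreviated).
my_data = [
    {'title': 'Using variables in Jekyll to define custom content',
     'excerpt': "I recently discovered that Jekyll's config.yml can be used "
                "to define custom variables for reusing content.",
     'pub_date': 'Aug 2021'},
    {'title': '5 key best practices for sane and usable Terraform setups',
     'excerpt': "These are the top 5 lessons I've learned after 5 years of "
                "Terraform-ing.",
     'pub_date': 'Jul 2021'},
]

# Serialize to JSON for another application to consume.
as_json = json.dumps(my_data, indent=2)

def to_html(records):
    """Render records as an HTML list; escape() neutralizes markup in the text."""
    items = [
        '<li><strong>{}</strong> ({}): {}</li>'.format(
            escape(r['title']), escape(r['pub_date']), escape(r['excerpt']))
        for r in records
    ]
    return '<ul>\n' + '\n'.join(items) + '\n</ul>'

print(as_json)
print(to_html(my_data))
```

Escaping matters because scraped text is untrusted input: without it, any markup inside a title or excerpt would be interpreted by the browser.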

Have fun, and keep coding.

Editor: Pang Guiyu. Source: Linux China (Linux中国)
