Scraping a Joke Website with Python and BeautifulSoup
Using BeautifulSoup to scrape a well-known joke site
First, let's look at the site we want to scrape: http://xiaohua.zol.com.cn/
1. Preparation
1.1 Python 3. This post is written for Python 3; if it is not on your machine, install it first.
1.2 The Requests library, often described as an upgraded urllib: it covers the same functionality behind a much simpler API. Install it with:
- pip install requests
1.3 The BeautifulSoup library, a Python library for pulling data out of HTML and XML files. It gives you idiomatic ways to navigate, search, and modify the parse tree through the parser of your choice. Install it with:
- pip install beautifulsoup4
1.4 lxml, the parser backend BeautifulSoup will use. (If you are not on Anaconda, you may find that installing this package with pip fails on Windows.) Install it with:
- pip install lxml
1.5 PyCharm, a powerful Python IDE. After downloading the official edition you can activate it for free through a license server (similar products in the series work the same way); see http://www.cnblogs.com/hanggegege/p/6763329.html for details.
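Before moving on, a quick sanity check (a minimal sketch) confirms that the packages import correctly and that BeautifulSoup can use the lxml parser:

```python
# Verify that requests, bs4 and lxml are installed and usable.
import requests
from bs4 import BeautifulSoup

# Parse a tiny HTML fragment with the lxml backend.
soup = BeautifulSoup('<html><body><p>ok</p></body></html>', 'lxml')
print(soup.p.text)  # -> ok
```

If this prints `ok` without an ImportError or a "Couldn't find a tree builder" error, the environment is ready.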
2. Scraping walkthrough and analysis
- from bs4 import BeautifulSoup
- import os
- import requests
Import the required libraries; the os library will be used later to save the scraped content.
Next we click into the "***笑話" section and notice an "全部笑話" (All Jokes) tab, which lets us scrape every past joke efficiently!
Let's use the requests library to look at this page's source code:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/'
- headers = {'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
- all_html=requests.get(all_url,headers = headers)
- print(all_html.text)
headers is the request header; without one, most sites will reject the request and the scrape fails.
Part of the output looks like this:
Analyzing the source shows we still cannot get every joke's content directly from this page, so we need an indirect route.
Opening one joke to read the full text, the URL becomes http://xiaohua.zol.com.cn/detail58/57681.html; opening a few more, we see they all follow the pattern http://xiaohua.zol.com.cn/detail?/?.html. We will use this as our way in and scrape everything from there.
Our goal is to collect every URL of the form http://xiaohua.zol.com.cn/detail?/?.html, then scrape the content behind each one.
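As a sketch, the href suffixes we are after can be recognized with a regular expression (the pattern below assumes the /detail<number>/<number>.html shape observed above; the second href in the sample list is a made-up example):

```python
import re

# Matches hrefs like /detail58/57681.html (assumed shape of the detail pages).
DETAIL_RE = re.compile(r'^/detail\d+/\d+\.html$')

hrefs = ['/detail58/57681.html', '/new/5.html', '/detail60/59523.html']
matches = [h for h in hrefs if DETAIL_RE.match(h)]
print(matches)  # -> ['/detail58/57681.html', '/detail60/59523.html']
```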
Flipping to an arbitrary page of "全部笑話" (All Jokes), say http://xiaohua.zol.com.cn/new/5.html, and pressing F12 to inspect the source, we find from the layout that:
each joke sits in its own <li> tag, and the URL of its full-text page hides in an href attribute; grabbing the href gives us the joke's URL.
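This extraction step can be tried offline first on a stripped-down fragment (the HTML below is a hand-made stand-in that only mimics the class names seen on the real page):

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for one <li> on the listing page.
html = '''
<ul>
  <li class="article-summary">
    <a target="_blank" class="all-read" href="/detail58/57681.html">查看全文</a>
  </li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
hrefs = []
for li in soup.find_all('li', class_='article-summary'):
    for a in li.find_all('a', target='_blank', class_='all-read'):
        hrefs.append(a['href'])
print(hrefs)  # -> ['/detail58/57681.html']
```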
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/'
- headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
- all_html = requests.get(all_url, headers=headers)
- #print(all_html.text)
- soup1 = BeautifulSoup(all_html.text, 'lxml')
- list1 = soup1.find_all('li', class_='article-summary')
- for i in list1:
-     #print(i)
-     soup2 = BeautifulSoup(i.prettify(), 'lxml')
-     list2 = soup2.find_all('a', target='_blank', class_='all-read')
-     for b in list2:
-         href = b['href']
-         print(href)
With the code above, we successfully obtain the URL suffix of every joke on the page:
In other words, we just need to loop over all the page numbers to collect every joke.
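Generating the listing-page URLs is a one-liner; a sketch for the first few pages:

```python
# Build the listing-page URLs http://xiaohua.zol.com.cn/new/<n>.html
base = 'http://xiaohua.zol.com.cn/new/'
page_urls = [base + str(n) + '.html' for n in range(1, 6)]
print(page_urls[0])   # -> http://xiaohua.zol.com.cn/new/1.html
print(page_urls[-1])  # -> http://xiaohua.zol.com.cn/new/5.html
```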
Here is the code above after some tidying:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             print(href)
- Gethref(all_url)
The following code turns those suffixes into complete joke URLs:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         print(url)
- GetTrueUrl(Gethref(all_url))
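String concatenation with the site root works here because every href is an absolute path; urllib.parse.urljoin from the standard library is a more robust way to do the same join:

```python
from urllib.parse import urljoin

# Join the site root and an href suffix into a full URL.
root = 'http://xiaohua.zol.com.cn'
href = '/detail58/57681.html'
full = urljoin(root, href)
print(full)  # -> http://xiaohua.zol.com.cn/detail58/57681.html
```

Unlike plain `+`, urljoin also handles cases such as a root with a trailing slash or a relative href without a leading one.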
After a quick look at a joke page's HTML, we can now grab the content of every joke on a listing page:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         print(content.text)
- GetText(GetTrueUrl(Gethref(all_url)))
The result looks like this:
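The content-extraction step can likewise be checked offline; the fragment below is a hand-made stand-in that only reuses the article-text and article-title class names assumed above:

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for a joke detail page.
html = '''
<h1 class="article-title">Sample title</h1>
<div class="article-text">Sample joke body.</div>
'''
soup = BeautifulSoup(html, 'lxml')
title = soup.find('h1', class_='article-title')
content = soup.find('div', class_='article-text')
print(title.text)    # -> Sample title
print(content.text)  # -> Sample joke body.
```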
Now let's start saving the jokes; this is where the os library comes in.
The following code fetches and saves all the jokes on one page!
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- os.makedirs('/home/lei/zol', exist_ok=True)  # don't crash if the folder already exists
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         title = soup.find('h1', class_='article-title')
-         SaveText(title.text, content.text)
- def SaveText(TextTitle, text):
-     os.chdir('/home/lei/zol/')
-     f = open(str(TextTitle).strip() + '.txt', 'w')  # strip stray whitespace; note the '.' in '.txt'
-     f.write(text)
-     f.close()
- GetText(GetTrueUrl(Gethref(all_url)))
The result:
(My machine runs Linux; adjust the paths to suit your own system.)
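Joke titles can contain whitespace or characters that are not legal in filenames, so a small sanitizer (a hypothetical helper, not part of the original code) keeps the save step from producing broken names:

```python
import re

def safe_filename(title):
    # Collapse runs of whitespace, then replace characters that are
    # illegal in filenames on common filesystems with underscores.
    title = re.sub(r'\s+', ' ', title).strip()
    return re.sub(r'[\\/:*?"<>|]', '_', title)

print(safe_filename('  a/b : joke?  '))  # -> a_b _ joke_
```

Calling `safe_filename(title.text)` before building the `.txt` path would make the saving code more robust.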
Our goal is not just the jokes on a single page; the next step is to walk through every page we need!
Observation shows the All Jokes listing pages follow the pattern http://xiaohua.zol.com.cn/new/ + page number + .html, so next we loop over the first 100 pages and download them all!
Here is the code revised once more:
- from bs4 import BeautifulSoup
- import os
- import requests
- num = 1
- url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
- os.makedirs('/home/lei/zol', exist_ok=True)  # don't crash if the folder already exists
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         title = soup.find('h1', class_='article-title')
-         SaveText(title.text, content.text)
- def SaveText(TextTitle, text):
-     os.chdir('/home/lei/zol/')
-     f = open(str(TextTitle).strip() + '.txt', 'w')  # strip stray whitespace; note the '.' in '.txt'
-     f.write(text)
-     f.close()
- while num <= 100:
-     url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
-     GetText(GetTrueUrl(Gethref(url)))
-     num = num + 1
Done! Now just wait for the files to finish downloading!
The result: