Scraping a Joke Website with Python and BeautifulSoup
Using BeautifulSoup to scrape a well-known joke site
First, let's look at the site we want to scrape: http://xiaohua.zol.com.cn/
1. Preparation
1.1 Python 3. This post is written for Python 3; if it is not on your machine, install it first.
1.2 The Requests library, often described as an upgraded urllib: it covers the same functionality behind a much simpler API. Install it with:
- pip install requests
1.3 The BeautifulSoup library, a Python library for pulling data out of HTML and XML files. It gives you idiomatic ways to navigate, search, and modify the parse tree through the parser of your choice. Install it with:
- pip install beautifulsoup4
1.4 lxml, the parser backend BeautifulSoup will use. (If you are not on Anaconda, you may find that installing this package with pip fails on Windows.) Install it with:
- pip install lxml
1.5 PyCharm, a powerful Python IDE. After downloading the official edition you can activate it for free through a license server (similar products in the series work the same way); see http://www.cnblogs.com/hanggegege/p/6763329.html for details.
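Before moving on, a quick sanity check (a minimal sketch) confirms that the packages import correctly and that BeautifulSoup can use the lxml parser:

```python
# Verify that requests, bs4 and lxml are installed and usable.
import requests
from bs4 import BeautifulSoup

# Parse a tiny HTML fragment with the lxml backend.
soup = BeautifulSoup('<html><body><p>ok</p></body></html>', 'lxml')
print(soup.p.text)  # -> ok
```

If this prints `ok` without an ImportError or a "Couldn't find a tree builder" error, the environment is ready.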
2. Scraping walkthrough and analysis
- from bs4 import BeautifulSoup
- import os
- import requests
Import the required libraries; the os library will be used later to save the scraped content.
Next we click into the "***笑話" section and notice an "全部笑話" (All Jokes) tab, which lets us scrape every past joke efficiently!
Let's use the requests library to look at this page's source code:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/'
- headers = {'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
- all_html=requests.get(all_url,headers = headers)
- print(all_html.text)
headers is the request header; without one, most sites will reject the request and the scrape fails.
Part of the output looks like this:
Analyzing the source shows we still cannot get every joke's content directly from this page, so we need an indirect route.
Opening one joke to read the full text, the URL becomes http://xiaohua.zol.com.cn/detail58/57681.html; opening a few more, we see they all follow the pattern http://xiaohua.zol.com.cn/detail?/?.html. We will use this as our way in and scrape everything from there.
Our goal is to collect every URL of the form http://xiaohua.zol.com.cn/detail?/?.html, then scrape the content behind each one.
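As a sketch, the href suffixes we are after can be recognized with a regular expression (the pattern below assumes the /detail<number>/<number>.html shape observed above; the second href in the sample list is a made-up example):

```python
import re

# Matches hrefs like /detail58/57681.html (assumed shape of the detail pages).
DETAIL_RE = re.compile(r'^/detail\d+/\d+\.html$')

hrefs = ['/detail58/57681.html', '/new/5.html', '/detail60/59523.html']
matches = [h for h in hrefs if DETAIL_RE.match(h)]
print(matches)  # -> ['/detail58/57681.html', '/detail60/59523.html']
```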
Flipping to an arbitrary page of "全部笑話" (All Jokes), say http://xiaohua.zol.com.cn/new/5.html, and pressing F12 to inspect the source, we find from the layout that:
each joke sits in its own <li> tag, and the URL of its full-text page hides in an href attribute; grabbing the href gives us the joke's URL.
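This extraction step can be tried offline first on a stripped-down fragment (the HTML below is a hand-made stand-in that only mimics the class names seen on the real page):

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for one <li> on the listing page.
html = '''
<ul>
  <li class="article-summary">
    <a target="_blank" class="all-read" href="/detail58/57681.html">查看全文</a>
  </li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
hrefs = []
for li in soup.find_all('li', class_='article-summary'):
    for a in li.find_all('a', target='_blank', class_='all-read'):
        hrefs.append(a['href'])
print(hrefs)  # -> ['/detail58/57681.html']
```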
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/'
- headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
- all_html = requests.get(all_url, headers=headers)
- #print(all_html.text)
- soup1 = BeautifulSoup(all_html.text, 'lxml')
- list1 = soup1.find_all('li', class_='article-summary')
- for i in list1:
-     #print(i)
-     soup2 = BeautifulSoup(i.prettify(), 'lxml')
-     list2 = soup2.find_all('a', target='_blank', class_='all-read')
-     for b in list2:
-         href = b['href']
-         print(href)
With the code above, we successfully obtain the URL suffix of every joke on the page:
In other words, we just need to loop over all the page numbers to collect every joke.
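Generating the listing-page URLs is a one-liner; a sketch for the first few pages:

```python
# Build the listing-page URLs http://xiaohua.zol.com.cn/new/<n>.html
base = 'http://xiaohua.zol.com.cn/new/'
page_urls = [base + str(n) + '.html' for n in range(1, 6)]
print(page_urls[0])   # -> http://xiaohua.zol.com.cn/new/1.html
print(page_urls[-1])  # -> http://xiaohua.zol.com.cn/new/5.html
```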
Here is the code above after some tidying:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             print(href)
- Gethref(all_url)
The following code turns those suffixes into complete joke URLs:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         print(url)
- GetTrueUrl(Gethref(all_url))
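String concatenation with the site root works here because every href is an absolute path; urllib.parse.urljoin from the standard library is a more robust way to do the same join:

```python
from urllib.parse import urljoin

# Join the site root and an href suffix into a full URL.
root = 'http://xiaohua.zol.com.cn'
href = '/detail58/57681.html'
full = urljoin(root, href)
print(full)  # -> http://xiaohua.zol.com.cn/detail58/57681.html
```

Unlike plain `+`, urljoin also handles cases such as a root with a trailing slash or a relative href without a leading one.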
After a quick look at a joke page's HTML, we can now grab the content of every joke on a listing page:
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         print(content.text)
- GetText(GetTrueUrl(Gethref(all_url)))
The result looks like this:
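The content-extraction step can likewise be checked offline; the fragment below is a hand-made stand-in that only reuses the article-text and article-title class names assumed above:

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for a joke detail page.
html = '''
<h1 class="article-title">Sample title</h1>
<div class="article-text">Sample joke body.</div>
'''
soup = BeautifulSoup(html, 'lxml')
title = soup.find('h1', class_='article-title')
content = soup.find('div', class_='article-text')
print(title.text)    # -> Sample title
print(content.text)  # -> Sample joke body.
```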
Now let's start saving the jokes; this is where the os library comes in.
The following code fetches and saves all the jokes on one page!
- from bs4 import BeautifulSoup
- import os
- import requests
- all_url = 'http://xiaohua.zol.com.cn/new/5.html'
- os.makedirs('/home/lei/zol', exist_ok=True)  # don't crash if the folder already exists
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         title = soup.find('h1', class_='article-title')
-         SaveText(title.text, content.text)
- def SaveText(TextTitle, text):
-     os.chdir('/home/lei/zol/')
-     f = open(str(TextTitle).strip() + '.txt', 'w')  # strip stray whitespace; note the '.' in '.txt'
-     f.write(text)
-     f.close()
- GetText(GetTrueUrl(Gethref(all_url)))
The result:
(My machine runs Linux; adjust the paths to suit your own system.)
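Joke titles can contain whitespace or characters that are not legal in filenames, so a small sanitizer (a hypothetical helper, not part of the original code) keeps the save step from producing broken names:

```python
import re

def safe_filename(title):
    # Collapse runs of whitespace, then replace characters that are
    # illegal in filenames on common filesystems with underscores.
    title = re.sub(r'\s+', ' ', title).strip()
    return re.sub(r'[\\/:*?"<>|]', '_', title)

print(safe_filename('  a/b : joke?  '))  # -> a_b _ joke_
```

Calling `safe_filename(title.text)` before building the `.txt` path would make the saving code more robust.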
Our goal is not just the jokes on a single page; the next step is to walk through every page we need!
Observation shows the All Jokes listing pages follow the pattern http://xiaohua.zol.com.cn/new/ + page number + .html, so next we loop over the first 100 pages and download them all!
Here is the code revised once more:
- from bs4 import BeautifulSoup
- import os
- import requests
- num = 1
- url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
- os.makedirs('/home/lei/zol', exist_ok=True)  # don't crash if the folder already exists
- def Gethref(url):
-     list_href = []
-     headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
-     html = requests.get(url, headers=headers)
-     soup_first = BeautifulSoup(html.text, 'lxml')
-     list_first = soup_first.find_all('li', class_='article-summary')
-     for i in list_first:
-         soup_second = BeautifulSoup(i.prettify(), 'lxml')
-         list_second = soup_second.find_all('a', target='_blank', class_='all-read')
-         for b in list_second:
-             href = b['href']
-             list_href.append(href)
-     return list_href
- def GetTrueUrl(liebiao):
-     url_list = []  # renamed from 'list' to avoid shadowing the built-in
-     for i in liebiao:
-         url = 'http://xiaohua.zol.com.cn' + str(i)
-         url_list.append(url)
-     return url_list
- def GetText(url):
-     for i in url:
-         html = requests.get(i)
-         soup = BeautifulSoup(html.text, 'lxml')
-         content = soup.find('div', class_='article-text')
-         title = soup.find('h1', class_='article-title')
-         SaveText(title.text, content.text)
- def SaveText(TextTitle, text):
-     os.chdir('/home/lei/zol/')
-     f = open(str(TextTitle).strip() + '.txt', 'w')  # strip stray whitespace; note the '.' in '.txt'
-     f.write(text)
-     f.close()
- while num <= 100:
-     url = 'http://xiaohua.zol.com.cn/new/' + str(num) + '.html'
-     GetText(GetTrueUrl(Gethref(url)))
-     num = num + 1
Done! Now just wait for the files to finish downloading!
The result: