A Competent Data Analyst's Notes on Python Web Crawlers (Scrapy Automatic Crawler)

Author: whenif 2017-02-23 18:41:03


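A competent data analyst needs a complete body of technical knowledge that spans data acquisition, data storage, data extraction, data analysis, data mining, and data visualization.

This article continues from the previous installment, "A Competent Data Analyst's Notes on Python Web Crawlers (Comprehensive Practical Cases)".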

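V. Comprehensive Practical Cases

3. Crawling with the Scrapy Framework

(1) Getting to Know Scrapy

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows (note: the diagram comes from the internet):

For details on using Scrapy, refer to its official documentation.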

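(2) Scrapy Automatic Crawler

In the earlier cases we always crawled data by constructing URLs in a loop. There is another approach: set an initial URL, collect the new links found on the current page, and keep crawling from those links until the crawled pages yield no new links.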
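The link-following idea can be sketched without any network access. The snippet below is a minimal illustration, not Scrapy's implementation: the `SITE` dictionary and the `fetch_links` helper are hypothetical stand-ins for downloading a page and extracting its links.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/article/1", "/article/2", "/about"],
    "/article/1": ["/article/2", "/article/3"],
    "/article/2": ["/"],
    "/article/3": [],
    "/about": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return SITE.get(url, [])


def crawl(start_url):
    """Visit every page reachable from start_url exactly once."""
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])  # URLs waiting to be visited
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order  # the loop ends once no page yields a new link


visited = crawl("/")
print(visited)
```

The `seen` set is what makes the crawl terminate: a page that only links back to already-known URLs adds nothing to the queue.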

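(a) Requirements

Use an automatic crawler to crawl article links and content from Qiushibaike, and store each article's leading content and its link in a MySQL database.

(b) Analysis

A. How do we extract article links from the home page?

Open the home page, view its source, and search for the text of any article on it. You will find a link such as "/article/118123230"; following it leads to the article content we want. The automatic crawler's link rule should therefore require that a link contain "article".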
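As a rough illustration of that rule, the sketch below filters a list of links with a regular expression. The link list is made up for the example (only "/article/118123230" comes from the page analysis above), and `filter_article_links` is a hypothetical helper, not part of Scrapy.

```python
import re

# Keep only links that contain "article/".
ARTICLE_PATTERN = re.compile(r"article/")


def filter_article_links(hrefs):
    """Return the links that match the article pattern, in their original order."""
    return [href for href in hrefs if ARTICLE_PATTERN.search(href)]


# Hypothetical links as they might appear in a home page's source.
hrefs = ["/article/118123230", "/hot/", "/users/123/", "/article/118123231"]
article_links = filter_article_links(hrefs)
print(article_links)
```

Using `search` rather than `match` means the pattern may appear anywhere in the link, so it works for both relative and absolute URLs.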

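B. How do we extract the article content and link from a detail page?

Content

Open a detail page and inspect the article content as follows:

The analysis shows that the div tag whose class attribute has the value content uniquely identifies the article content. The expression is:

"//div[@class='content']/text()"

Link

Open any detail page, copy its URL, view the page source, and search for that URL as follows:

The following XPath expression extracts the article link.

"//link[@rel='canonical']/@href"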
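To see what the two expressions select, here is a small sketch using only the standard library. The `PAGE` string is a hypothetical, well-formed stand-in for a detail page. Because `xml.etree.ElementTree` supports only a subset of XPath, the trailing `/text()` and `/@href` steps are replaced by `.text` and `.get("href")`; Scrapy's own selectors accept the full expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a detail page.
PAGE = """<html>
  <head>
    <link rel="canonical" href="http://www.qiushibaike.com/article/118123230" />
  </head>
  <body>
    <div class="author">someone</div>
    <div class="content">sample article text</div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Counterpart of //div[@class='content']/text()
content = root.find(".//div[@class='content']").text.strip()

# Counterpart of //link[@rel='canonical']/@href
link = root.find(".//link[@rel='canonical']").get("href")

print(content)
print(link)
```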

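(3) Project Source Code

A. Create the crawler project

Open CMD, change to the directory where the crawler project will be stored, and run:

scrapy startproject qsbkauto

B. Project structure

spiders/qsbkspd.py: the spider file
items.py: project entities, the containers for the content to extract, such as product titles and comment counts on Dangdang
pipelines.py: project pipelines, mainly for post-processing the data, such as writing it to Excel or a database
settings.py: project settings; by default, for example, pipelines are disabled and the robots protocol is obeyed
scrapy.cfg: project configuration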

C. Create the spider

Enter the newly created project directory and run the following command, where the last argument is the domain name:

scrapy genspider -t crawl qsbkspd qiushibaike.com

D. Define the items

import scrapy


class QsbkautoItem(scrapy.Item):
    # Field names must match the keys the spider assigns in parse_item.
    link = scrapy.Field()     # article link
    content = scrapy.Field()  # article content

E. Write the spider

qsbkspd.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from qsbkauto.items import QsbkautoItem


class QsbkspdSpider(CrawlSpider):
    name = 'qsbkspd'
    allowed_domains = ['qiushibaike.com']

    # Follow every link whose URL contains "article/" and parse it with parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'article/'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Replaces start_urls so the first request carries a browser-like User-Agent.
        i_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"}
        yield Request('http://www.qiushibaike.com/', headers=i_headers)

    def parse_item(self, response):
        i = QsbkautoItem()
        # Both XPath expressions come from the page analysis above.
        i["content"] = response.xpath("//div[@class='content']/text()").extract()
        i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
        return i

pipelines.py

import time

import MySQLdb


class QsbkautoPipeline(object):
    def exeSQL(self, sql, params=None):
        '''
        Connect to the MySQL database and execute one SQL statement.
        @sql: the SQL statement, with %s placeholders
        @params: the values bound to the placeholders
        '''
        con = MySQLdb.connect(
            host='localhost',  # host
            user='root',       # user name
            passwd='xxxx',     # password
            db='spdRet',       # database name
            charset='utf8',
            local_infile=1
            )
        try:
            cur = con.cursor()
            cur.execute(sql, params)
            con.commit()
        finally:
            con.close()

    def process_item(self, item, spider):
        link_url = item['link'][0]
        content_header = item['content'][0][0:10]  # first 10 characters of the article
        curr_date = time.strftime('%Y-%m-%d', time.localtime(time.time()))
        content_header = curr_date + '__' + content_header
        if len(link_url) and len(content_header):  # skip empty values
            try:
                # Bind the values as parameters so quotes in the content cannot break the statement.
                sql = "insert into qiushi(content,link) values(%s,%s)"
                self.exeSQL(sql, (content_header, link_url))
            except Exception as er:
                print("Insert failed with the following error:")
                print(er)
        return item
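
The stored header and the insert step can be tried without a MySQL server. The sketch below is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL, the two-column `qiushi` schema is assumed (the article does not show the table definition), and `build_header` is a hypothetical helper. Note that `sqlite3` uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3
import time


def build_header(content, date_str):
    """Prefix the first 10 characters of the content with the date."""
    return date_str + "__" + content[:10]


# In-memory database standing in for the MySQL database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE qiushi (content TEXT, link TEXT)")  # assumed schema

link = "http://www.qiushibaike.com/article/118123230"
header = build_header("It's a sample article body", time.strftime("%Y-%m-%d"))

# Placeholders keep the single quote in the content from breaking the statement.
con.execute("INSERT INTO qiushi (content, link) VALUES (?, ?)", (header, link))
con.commit()

rows = con.execute("SELECT content, link FROM qiushi").fetchall()
print(rows)
con.close()
```

A statement assembled by string concatenation would fail on this sample because of the apostrophe in the content; bound parameters avoid that.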

settings.py

Disable ROBOTSTXT_OBEY
Set USER_AGENT
Enable ITEM_PIPELINES
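
A settings fragment covering these three changes might look like the sketch below. The User-Agent string and the priority value 300 are assumptions for illustration; the pipeline path follows from the project name qsbkauto and the class name QsbkautoPipeline.

```python
# settings.py (fragment)

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Browser-like User-Agent sent with every request (example value).
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36")

# Enable the pipeline; lower numbers run earlier when several pipelines are listed.
ITEM_PIPELINES = {
    "qsbkauto.pipelines.QsbkautoPipeline": 300,
}
```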

F. Run the spider

scrapy crawl qsbkspd --nolog

G. Results

[This article is an original work by 豈安科技, a 51CTO column contributor. To reprint it, please contact the original author through the WeChat official account (bigsec).]


Editor: 趙寧寧 Source: 51CTO column
