
Getting Started with Scrapy Crawlers in Python: A Code Walkthrough

Author: 大蟲 2017-11-29 15:21:53


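1. Content Analysis

Next we create a crawler project, using Tuchong (tuchong.com) as the example site and scraping its images. Under "Discover" > "Tags" in the top menu, the site groups images into categories. Clicking a tag, for example "美女", opens https://tuchong.com/tags/美女/. We use this page as the crawler's entry point and analyze it:

Opening the page shows a grid of galleries (photo sets). Clicking a gallery opens its images full screen, and scrolling down loads more galleries; there is no numbered pagination. In Chrome, right-click and choose "Inspect" to open the developer tools and look at the page source. The content section looks like this: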

<div class="content">
    <div class="widget-gallery">
        <ul class="pagelist-wrapper">
            <li class="gallery-item...

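Each li.gallery-item is evidently the entry point of one gallery, stored under ul.pagelist-wrapper, with div.widget-gallery as the container. An XPath selector for these entries would be //div[@class="widget-gallery"]/ul/li. Following the usual page logic, we would find the link inside each li.gallery-item and then descend one level to scrape the images.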
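As a rough illustration of that selector, the sketch below applies it with the standard library's ElementTree instead of Scrapy. The markup is a hypothetical, well-formed stand-in for the rendered page; the `<a>` elements and the example.com links are assumptions, not taken from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the rendered markup shown above.
RENDERED = """
<div class="content">
  <div class="widget-gallery">
    <ul class="pagelist-wrapper">
      <li class="gallery-item"><a href="https://example.com/1/">one</a></li>
      <li class="gallery-item"><a href="https://example.com/2/">two</a></li>
    </ul>
  </div>
</div>
"""


def gallery_links(markup):
    """Return the href of the link inside each gallery <li>."""
    root = ET.fromstring(markup)
    # ElementTree paths are relative to the root element, so the leading
    # "//" of the XPath above becomes ".//" here.
    items = root.findall(".//div[@class='widget-gallery']/ul/li")
    return [li.find("a").get("href") for li in items]


links = gallery_links(RENDERED)
print(links)
```

In a Scrapy callback the equivalent call would be response.xpath(...) with the selector written exactly as in the text.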

However, requesting the same page with an HTTP debugging tool such as Postman returns:

<div class="content">
    <div class="widget-gallery"></div>
</div>

There is no gallery content at all, so the page must load it with an Ajax request: the galleries are fetched and inserted into div.widget-gallery only when a browser loads the page. The developer tools show the following XHR request URL:

https://tuchong.com/rest/tags/美女/posts?page=1&count=20&order=weekly&before_timestamp=

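The parameters are straightforward: page is the page number, count is the number of galleries per page, order is the sort order, and before_timestamp is empty. Tuchong is a feed-style site, so before_timestamp is presumably a time value, with different times returning different content. We drop it here and simply crawl backwards from the latest page without regard to time.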
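The URL for any page can be assembled from these parameters. This is a minimal sketch using only the standard library; the helper name posts_url is hypothetical, and before_timestamp is left out as described above.

```python
from urllib.parse import quote, urlencode


def posts_url(tag, page, count=20, order="weekly"):
    """Build the XHR URL for one page of galleries under a tag."""
    query = urlencode({"page": page, "count": count, "order": order})
    # quote() percent-encodes the non-ASCII tag name as UTF-8.
    return "https://tuchong.com/rest/tags/%s/posts?%s" % (quote(tag), query)


# URLs for the first three pages of the tag used in this article.
urls = [posts_url("美女", page) for page in range(1, 4)]
for url in urls:
    print(url)
```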

The response is JSON, which makes scraping easier. It looks like this:

{
  "postList": [
    {
      "post_id": "15624611",
      "type": "multi-photo",
      "url": "https://weishexi.tuchong.com/15624611/",
      "site_id": "443122",
      "author_id": "443122",
      "published_at": "2017-10-28 18:01:03",
      "excerpt": "10月18日",
      "favorites": 4052,
      "comments": 353,
      "rewardable": true,
      "parent_comments": "165",
      "rewards": "2",
      "views": 52709,
      "title": "微風不燥  秋意正好",
      "image_count": 15,
      "images": [
        {
          "img_id": 11585752,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 5016,
          "height": 3840
        },
        {
          "img_id": 11585737,
          "user_id": 443122,
          "title": "",
          "excerpt": "",
          "width": 3840,
          "height": 5760
        },
        ...
      ],
      "title_image": null,
      "tags": [
        {
          "tag_id": 131,
          "type": "subject",
          "tag_name": "人像",
          "event_type": "",
          "vote": ""
        },
        {
          "tag_id": 564,
          "type": "subject",
          "tag_name": "美女",
          "event_type": "",
          "vote": ""
        }
      ],
      "favorite_list_prefix": [],
      "reward_list_prefix": [],
      "comment_list_prefix": [],
      "cover_image_src": "https://photo.tuchong.com/443122/g/11585752.webp",
      "is_favorite": false
    }
  ],
  "siteList": {...},
  "following": false,
  "coverUrl": "https://photo.tuchong.com/443122/ft640/11585752.webp",
  "tag_name": "美女",
  "tag_id": "564",
  "url": "https://tuchong.com/tags/%E7%BE%8E%E5%A5%B3/",
  "more": true,
  "result": "SUCCESS"
}

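The property names make their meaning fairly obvious. We only care about the postList property: each element of this array is one gallery. The gallery fields we need are:

url: address of the page for viewing a single gallery
post_id: gallery ID, presumably unique across the site; it can be used to check whether a gallery has already been scraped
site_id: ID of the author's site, needed to build the image source URLs
title: title
excerpt: summary text
type: gallery type. Two have been seen so far: multi-photo is a pure photo set, and text is an article-style page mixing text and images. The two have different content structures and need different scraping approaches; this example scrapes only the pure photo type and discards text
tags: the gallery's tags; there can be several
image_count: number of images
images: list of images, an array of objects; the img_id property of each object is needed

Judging from the image viewing pages, image URLs almost always have the form https://photo.tuchong.com/{site_id}/f/{img_id}.jpg, which is easy to assemble from the information above.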
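To make the URL pattern concrete, here is a small sketch that parses a trimmed copy of the sample response and builds the image URLs for one gallery. The helper name image_urls is hypothetical, and the sample keeps only the fields the function reads.

```python
import json

# Trimmed copy of the sample response above (two images, most fields omitted).
SAMPLE = """
{"postList": [{"post_id": "15624611", "type": "multi-photo",
               "site_id": "443122",
               "images": [{"img_id": 11585752}, {"img_id": 11585737}]}]}
"""


def image_urls(post):
    """Map each img_id of a gallery to its image URL."""
    template = "https://photo.tuchong.com/{site_id}/f/{img_id}.jpg"
    return {
        img["img_id"]: template.format(site_id=post["site_id"], img_id=img["img_id"])
        for img in post.get("images", [])
    }


post = json.loads(SAMPLE)["postList"][0]
urls = image_urls(post)
for img_id, url in urls.items():
    print(img_id, url)
```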

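2. Creating the Project

Open the cmder command-line tool and run workon scrapy to enter the virtual environment created earlier. The prompt is now prefixed with (Scrapy), showing that the environment is active; the relevant paths are added to the PATH environment variable for convenient development and use.
Run scrapy startproject tuchong to create the project tuchong.
Change into the project's main directory and run scrapy genspider photo tuchong.com to create a spider named photo (the name must differ from the project name) that crawls the tuchong.com domain (this needs to be changed later; a rough address is enough for now). A single project can contain multiple spiders.

After these steps the project contains a number of automatically generated files and settings. The directory structure is: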

(PROJECT)
│  scrapy.cfg
│
└─tuchong
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  photo.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

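scrapy.cfg: basic configuration
items.py: structure definitions for scraped items
middlewares.py: middleware definitions; no changes needed in this example
pipelines.py: pipeline definitions, used to process data after it is scraped
settings.py: global settings
spiders\photo.py: the spider itself, which defines how to scrape the required data

3. Main Code

In items.py, create a TuchongItem class and define the attributes it needs. Each attribute is declared as a scrapy.Field, and its value can be a string, a number, a list, a dict, and so on: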

import scrapy


class TuchongItem(scrapy.Item):
    post_id = scrapy.Field()
    site_id = scrapy.Field()
    title = scrapy.Field()
    type = scrapy.Field()
    url = scrapy.Field()
    image_count = scrapy.Field()
    images = scrapy.Field()
    tags = scrapy.Field()
    excerpt = scrapy.Field()
    ...

The values of these attributes are assigned in the spider.

The file spiders\photo.py was generated by the command scrapy genspider photo tuchong.com. Its initial contents are:

import scrapy


class PhotoSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['tuchong.com']
    start_urls = ['http://tuchong.com/']

    def parse(self, response):
        pass

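name is the spider's name. allowed_domains lists the permitted domains (links outside them are dropped; several may be given). start_urls lists the addresses where crawling starts (several may be given).

The parse function is the default callback for processing a response. Its response parameter holds the downloaded response: the raw page content is in response.body (as bytes) and the decoded text is in response.text. The default code needs a small change so that it requests multiple pages in a loop. To do this we override start_requests and build the request for each page in a loop. The modified code is: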

import json

import scrapy

from ..items import TuchongItem


class PhotoSpider(scrapy.Spider):
    name = 'photo'
    # allowed_domains = ['tuchong.com']
    # start_urls = ['http://tuchong.com/']

    def start_requests(self):
        url = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'
        # Crawl 10 pages of 20 galleries each,
        # yielding Request objects that use parse as the callback
        for page in range(1, 11):
            yield scrapy.Request(url=url % ('美女', page), callback=self.parse)

    # Callback: fill a TuchongItem from each scraped gallery
    def parse(self, response):
        # response.text is the decoded body; it replaces the deprecated body_as_unicode()
        body = json.loads(response.text)
        items = []
        for post in body['postList']:
            item = TuchongItem()
            item['type'] = post['type']
            item['post_id'] = post['post_id']
            item['site_id'] = post['site_id']
            item['title'] = post['title']
            item['url'] = post['url']
            item['excerpt'] = post['excerpt']
            item['image_count'] = int(post['image_count'])
            item['images'] = {}
            # Convert images into a dict of {img_id: img_url}
            for img in post.get('images', []):
                img_id = img['img_id']
                url = 'https://photo.tuchong.com/%s/f/%s.jpg' % (item['site_id'], img_id)
                item['images'][img_id] = url
            item['tags'] = []
            # Reduce tags to a list of tag_name values
            for tag in post.get('tags', []):
                item['tags'].append(tag['tag_name'])
            items.append(item)
        return items

After these steps the scraped data is held in TuchongItem objects, structured data that is easy to process and save.

As mentioned earlier, not every scraped entry is wanted. In this example we only need galleries with type="multi-photo", and galleries with too few images are not wanted either. Filtering the scraped items, and deciding how to save them, is handled in pipelines.py. The file already contains a TuchongPipeline class with a process_item method; we modify that method so that it returns only the items that meet the conditions and raises a DropItem exception for the rest.
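The pipeline filter described above could be sketched as follows. This is an illustration under stated assumptions, not the author's exact code: MIN_IMAGE_COUNT is a hypothetical threshold (the article does not say how few images is "too few"), and the fallback DropItem class exists only so the sketch runs where Scrapy is not installed. In a real project the exception comes from scrapy.exceptions.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""

MIN_IMAGE_COUNT = 3  # assumed threshold, not taken from the article


class TuchongPipeline:
    def process_item(self, item, spider):
        # Discard galleries that are not pure photo sets.
        if item['type'] != 'multi-photo':
            raise DropItem('unsupported type: %s' % item['url'])
        # Discard galleries with too few images.
        if int(item['image_count']) < MIN_IMAGE_COUNT:
            raise DropItem('too few images: %s' % item['url'])
        # Returning the item passes it on to the next pipeline stage.
        return item
```

Raising DropItem is what makes Scrapy count the item as dropped and log a warning for it.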

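Of course, doing the same filtering directly in parse without a pipeline works just as well; the pipeline simply gives a clearer structure, and the more capable FilesPipeline and ImagesPipeline are also available. process_item is called for every scraped item, and the open_spider and close_spider methods can also be defined to handle actions when the spider opens and closes.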
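To show how the three hooks fit together, here is a minimal sketch of a pipeline that writes every item to a JSON Lines file. The class name JsonLinesPipeline and the file name galleries.jl are hypothetical; only the method names open_spider, process_item and close_spider are Scrapy's.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes every item to a JSON Lines file."""

    def __init__(self, path='galleries.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item; dict(item) also works for scrapy.Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.file.close()
```

Like any pipeline, it only runs once it is listed in ITEM_PIPELINES.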

Note: a pipeline must be registered in the project before it is used. Add the following to settings.py:

ITEM_PIPELINES = {
    'tuchong.pipelines.TuchongPipeline': 300,  # pipeline path: priority (lower numbers run first)
}

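In addition, most sites publish a robots.txt exclusion file to keep crawlers out. Scrapy follows it when ROBOTSTXT_OBEY = True; setting ROBOTSTXT_OBEY = False makes Scrapy ignore it. Yes, it seems to be no more than a gentlemen's agreement. If a site detects crawlers by browser User-Agent or IP address, more advanced Scrapy features are needed, which this article does not cover.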
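Put together, the relevant part of settings.py might look like the fragment below. It is only a sketch: the setting names are Scrapy's, but the DOWNLOAD_DELAY value is an assumption added here to keep the request rate modest and is not part of the original project.

```python
# settings.py (fragment)

# Register the pipeline; lower numbers run first.
ITEM_PIPELINES = {
    'tuchong.pipelines.TuchongPipeline': 300,
}

# True: fetch robots.txt first and skip disallowed URLs.
# False: ignore robots.txt entirely.
ROBOTSTXT_OBEY = False

# Assumed value: seconds to wait between consecutive requests.
DOWNLOAD_DELAY = 1
```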

4. Running

Go back to the cmder command line, change into the project directory, and run:

scrapy crawl photo

The terminal prints all crawl results and debug messages, and ends with the spider's run statistics, for example:

[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 491,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 27, 7, 20, 24, 414201),
 'item_dropped_count': 5,
 'item_dropped_reasons_count/DropItem': 5,
 'item_scraped_count': 15,
 'log_count/DEBUG': 18,
 'log_count/INFO': 8,
 'log_count/WARNING': 5,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 11, 27, 7, 20, 23, 867300)}

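Pay most attention to the ERROR and WARNING entries. The warnings here are in fact DropItem exceptions raised for items that did not meet the conditions.

5. Saving the Results

In most cases the scraped results need to be saved. By default the attributes defined in items.py can be written to a file; just add the -o {filename} option on the command line: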

scrapy crawl photo -o output.json # write a JSON file
scrapy crawl photo -o output.csv  # write a CSV file

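Note: feed exports run after the item pipelines, so items for which TuchongPipeline raises DropItem are not written to the file. The filtering can also be done in parse by returning only the items you need.

Saving to a database requires extra code. For example, the process_item method in pipelines.py can be extended as follows: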

...
    def process_item(self, item, spider):
        ...
        else:
            print(item['url'])
            self.myblog.add_post(item) # myblog is a database class that handles the database operations
        return item
...

To avoid inserting duplicates into the database, check item['post_id'] and skip the item if it already exists.
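One way to implement that check is to make post_id the primary key and let the database reject repeats. The sketch below uses the standard library's sqlite3 module. PostStore is a hypothetical stand-in for the myblog class, whose real implementation the article does not show, and the table layout is an assumption.

```python
import sqlite3


class PostStore:
    """Hypothetical stand-in for the article's myblog database class."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS posts ('
            'post_id TEXT PRIMARY KEY, title TEXT, url TEXT)'
        )

    def add_post(self, item):
        """Insert the post; return False if its post_id is already stored."""
        cur = self.conn.execute(
            'INSERT OR IGNORE INTO posts (post_id, title, url) VALUES (?, ?, ?)',
            (item['post_id'], item['title'], item['url']),
        )
        self.conn.commit()
        # rowcount is 1 for a new row and 0 when the insert was ignored.
        return cur.rowcount == 1


store = PostStore()
post = {'post_id': '15624611', 'title': 'demo',
        'url': 'https://weishexi.tuchong.com/15624611/'}
first = store.add_post(post)
second = store.add_post(post)
print(first, second)
```

Letting the primary key enforce uniqueness avoids a separate SELECT before every insert.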

Editor: 龐桂玉 Source: 36大數據
