
Python Web Scraping in Practice: Collecting Taobao Product Information and Exporting It to an Excel Spreadsheet

Author: 青燈教育Python學院 2020-11-06 08:28:44


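Contents

Preface

I. Analyzing the structure of the Taobao URL
II. Inspecting the page source and extracting information with the re library
1. Inspecting the source
2. Extracting information with re
III. Writing the functions
IV. Writing the main block
V. Complete code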

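Preface

This article uses Python's requests library and re regular expressions to scrape Taobao product information (product name, price, item location, and number of buyers), then writes the results to an Excel spreadsheet with the xlsxwriter library. The original post showed a screenshot of the finished spreadsheet at this point.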


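I. Analyzing the structure of the Taobao URL

1. Our first requirement is to enter a product name and get back the matching information.

To see how that works, pick any product and look at its search URL. Here we search for a schoolbag (书包); the resulting page has this URL: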

https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306

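The URL alone may not reveal much, but comparing it with the search page (a screenshot in the original post) gives a clue.

The value of the q parameter is the name of the item we searched for, in percent-encoded form.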
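To check that the q value really is the search keyword, it can be decoded with the standard library. This is a minimal sketch; the variable names are illustrative and do not come from the article's code.

```python
from urllib.parse import quote, unquote

# Value of the q parameter copied from the URL above.
encoded = "%E4%B9%A6%E5%8C%85"

# Decode the percent-encoded UTF-8 bytes back into text.
keyword = unquote(encoded)
print(keyword)

# Going the other way: encode a keyword before placing it in a URL.
round_trip = quote(keyword)
print(round_trip == encoded)
```

The decoded text is the keyword typed into the search box, so any other product name can be substituted for q in the same way.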

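2. Our second requirement is to crawl as many result pages as the user asks for.

So let's look at how the URLs of the next few pages are put together.

From this we can conclude that pagination is controlled by the s parameter at the end of the URL: s = 44 × (page number − 1). Page 1 therefore corresponds to s = 0, page 2 to s = 44, page 3 to s = 88, and so on.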
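The pagination rule can be sketched as a small helper. The names page_offset and search_url are hypothetical, and the URL built here carries only q and s; whether Taobao accepts such a reduced URL is an assumption, since the article's own URLs include several more parameters.

```python
from urllib.parse import urlencode

ITEMS_PER_PAGE = 44  # step of the s parameter observed in the article


def page_offset(page):
    """Return the s value for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return ITEMS_PER_PAGE * (page - 1)


def search_url(keyword, page):
    """Build a simplified search URL containing only q and s."""
    params = {"q": keyword, "s": page_offset(page)}
    # urlencode percent-encodes the keyword as UTF-8.
    return "https://s.taobao.com/search?" + urlencode(params)


print([page_offset(p) for p in (1, 2, 3)])
print(search_url("书包", 2))
```

Keeping the offset calculation in one place makes the 44-per-page assumption easy to change if the site's page size differs.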

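II. Inspecting the page source and extracting information with the re library

1. Inspecting the source

The fields we need all appear in the page source: raw_title, view_price, item_loc, and view_sales.

2. Extracting information with re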

a = re.findall(r'"raw_title":"(.*?)"', html)
b = re.findall(r'"view_price":"(.*?)"', html)
c = re.findall(r'"item_loc":"(.*?)"', html)
d = re.findall(r'"view_sales":"(.*?)"', html)
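To see what these patterns return, they can be run against a small string. The sample below is a hypothetical fragment that only imitates the key/value layout the patterns expect; the product values, and the "人付款" suffix on view_sales, are made up for illustration.

```python
import re

# Hypothetical fragment imitating the data embedded in the page source.
sample = (
    '{"raw_title":"Canvas backpack","view_price":"59.00",'
    '"item_loc":"Zhejiang Hangzhou","view_sales":"120人付款"}'
    '{"raw_title":"Leather satchel","view_price":"129.00",'
    '"item_loc":"Guangdong Guangzhou","view_sales":"35人付款"}'
)

# The lazy group (.*?) stops at the first closing quote after each key.
titles = re.findall(r'"raw_title":"(.*?)"', sample)
prices = re.findall(r'"view_price":"(.*?)"', sample)

print(titles)
print(prices)
```

Each call returns a list with one entry per product, in page order, which is why the four lists can later be combined by index.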

III. Writing the functions

I wrote three functions. The first fetches the HTML page:

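def GetHtml(url):
    # headers is the module-level dict defined in the main block
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # use the encoding detected from the content
    return r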

The second builds the list of page URLs to crawl:

def Geturls(q, x):
    # Page 1: the search URL from section I with q set to the keyword
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
                                                 "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    # Pages 2..x: the s parameter advances by 44 per page
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls

The third extracts the product information we need and writes it to the Excel sheet:

def GetxxintoExcel(html):
    global count  # module-level count of data rows already written to the sheet
    a = re.findall(r'"raw_title":"(.*?)"', html)  # (.*?) lazily matches any characters up to the next quote
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))  # collect one tuple per product
        except IndexError:
            break
    for i in range(len(x)):
        # worksheet.write(row, col, data): row 0 holds the header, so data starts at row 1
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)  # the next page continues below the rows just written
    print("Done")
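The index loop with its IndexError guard can also be expressed with zip, which stops at the shortest list. The sketch below is an alternative formulation of the extraction step only, not the author's code; extract_rows and FIELDS are hypothetical names, and the sample string is made up.

```python
import re

FIELDS = ("raw_title", "view_price", "item_loc", "view_sales")


def extract_rows(html):
    """Return a list of (title, price, location, sales) tuples."""
    columns = [re.findall(r'"%s":"(.*?)"' % name, html) for name in FIELDS]
    # zip stops at the shortest list, mirroring the IndexError guard above.
    return list(zip(*columns))


# Second product is deliberately incomplete to show the truncation.
sample = (
    '"raw_title":"A","view_price":"1.00","item_loc":"X","view_sales":"5"'
    '"raw_title":"B","view_price":"2.00"'
)
rows = extract_rows(sample)
print(rows)
```

Separating extraction from writing keeps the parsing testable without a workbook, at the cost of one extra function call in the main loop.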

IV. Writing the main block

if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        # The cookie is unique to each user, so paste your own here. Because of the
        # anti-crawling mechanism, crawling too fast may force you to refresh it later.
        "cookie": "",
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Name')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)  # pause between pages
    workbook.close()  # do not open the Excel file before the program ends; it is saved in the current directory

V. Complete code

import re
import time

import requests
import xlsxwriter


def GetxxintoExcel(html):
    global count  # module-level count of data rows already written
    a = re.findall(r'"raw_title":"(.*?)"', html)
    b = re.findall(r'"view_price":"(.*?)"', html)
    c = re.findall(r'"item_loc":"(.*?)"', html)
    d = re.findall(r'"view_sales":"(.*?)"', html)
    x = []
    for i in range(len(a)):
        try:
            x.append((a[i], b[i], c[i], d[i]))
        except IndexError:
            break
    for i in range(len(x)):
        # Row 0 holds the header, so data rows start at row 1
        worksheet.write(count + i + 1, 0, x[i][0])
        worksheet.write(count + i + 1, 1, x[i][1])
        worksheet.write(count + i + 1, 2, x[i][2])
        worksheet.write(count + i + 1, 3, x[i][3])
    count = count + len(x)
    print("Done")


def Geturls(q, x):
    # Page 1
    url = "https://s.taobao.com/search?q=" + q + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm" \
                                                 "=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306"
    urls = []
    urls.append(url)
    if x == 1:
        return urls
    # Pages 2..x: s advances by 44 per page
    for i in range(1, x):
        url = "https://s.taobao.com/search?q=" + q + "&commend=all&ssid=s5-e&search_type=item" \
              "&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306" \
              "&bcoffset=3&ntoffset=3&p4ppushleft=1%2C48&s=" + str(i * 44)
        urls.append(url)
    return urls


def GetHtml(url):
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r


if __name__ == "__main__":
    count = 0
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "cookie": "",  # paste your own cookie here
    }
    q = input("Product to search for: ")
    x = int(input("Number of pages to crawl: "))
    urls = Geturls(q, x)
    workbook = xlsxwriter.Workbook(q + ".xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.set_column('A:A', 70)
    worksheet.set_column('B:B', 20)
    worksheet.set_column('C:C', 20)
    worksheet.set_column('D:D', 20)
    worksheet.write('A1', 'Name')
    worksheet.write('B1', 'Price')
    worksheet.write('C1', 'Location')
    worksheet.write('D1', 'Buyers')
    for url in urls:
        html = GetHtml(url)
        GetxxintoExcel(html.text)
        time.sleep(5)
    workbook.close()


Editor in charge: 姜華. Source: 今日頭條 (Toutiao)
