Blowing Up on GitHub: Scrape Almost Anything with This All-in-One Open-Source Crawler Toolbox
A developer in China recently open-sourced InfoSpider on GitHub, a crawler toolbox that bundles a long list of data sources into a single tool, and it took off almost overnight!
Most websites now ship anti-crawling mechanisms, so for anyone who likes to scrape a bit of data for analysis, collecting it keeps getting harder. That is exactly what makes this toolbox such a find.
How popular is it?
Within days of release it climbed to fourth place on GitHub's weekly trending list, with 1.3K stars and 172 forks. The author has open-sourced all of the project code and usage documentation, and also published a video walkthrough on Bilibili.
Project code:
https://github.com/kangvcar/InfoSpider
Project documentation:
https://infospider.vercel.app
Project video demo:
https://www.bilibili.com/video/BV14f4y1R7oF/
In this age of information overload everyone holds a pile of accounts, and with many accounts comes a familiar problem: your personal data is scattered across all kinds of companies, forming data silos, so multi-dimensional data can never be combined. This project fuses that scattered data and analyzes it, giving you a more direct, deeper view of your own information.
InfoSpider is a crawler toolbox that bundles many data sources into one tool. It aims to help users take back their own data safely and quickly: the code is open source, the workflow is transparent, and it provides data-analysis features that generate chart files from the user's data.
Currently supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.
According to its creator, InfoSpider has the following features:
- Safe and reliable: the project is open source with concise, fully visible code, and it runs locally, so your data never leaves your machine.
- Easy to use: a GUI is provided; just click the data source you want and follow the prompts.
- Clean structure: every data source is independent and easily portable, and all crawler scripts live in the project's Spiders directory.
- Rich data sources: more than 24 data sources are currently supported, with more being added.
- Unified data format: all scraped data is stored as JSON, which makes later analysis straightforward (a loading sketch follows the run steps below).
- Comprehensive personal data: the project scrapes as much of your personal data as it can; you can trim the results during post-processing as needed.
- Data analysis: visual analysis of your personal data is provided, though currently only for some sources.
InfoSpider is also very simple to use: install Python 3 and the Chrome browser, run python3 main.py, click a data source button in the window that opens, choose a save path when prompted, enter your account credentials, and the data is scraped automatically. You can then browse the results in the chosen download directory.
Dependency installation
- Install Python 3 and the Chrome browser
- Install the chromedriver build that matches your Chrome version (a quick smoke test follows this list)
- Install the dependencies with ./install_deps.sh (on Windows, just pip install -r requirements.txt)
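Before running anything, it is worth confirming that the chromedriver on your PATH really matches your Chrome build. Here is a minimal smoke test (my own sketch, not part of InfoSpider) that launches headless Chrome through Selenium and prints both version strings; on a mismatch, webdriver.Chrome() raises SessionNotCreatedException instead:

```python
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_argument('--headless')          # no visible window needed for the check
driver = webdriver.Chrome(options=option)  # fails here if driver and browser versions diverge
print('Chrome:      ', driver.capabilities.get('browserVersion'))
print('chromedriver:', driver.capabilities.get('chrome', {}).get('chromedriverVersion'))
driver.quit()
```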
Running the tool
- Enter the tools directory
- Run python3 main.py
- Click a data source button in the window that opens and choose a save path when prompted
- Enter your username and password in the browser that pops up; scraping starts automatically and the browser closes by itself when it finishes
The downloaded data (xxx.json) and the analysis charts (xxx.html) appear in the corresponding directory.
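Because every spider writes plain JSON, the output is easy to pull into any analysis workflow. As a minimal sketch (not part of the project), this loads the favourited-items file that the bundled Taobao spider writes and prints a few records; substitute whichever xxx.json your chosen data source produced:

```python
import json

# 'shoucang_item.json' is the name the Taobao spider gives its
# favourited-items output; other spiders use other file names.
with open('shoucang_item.json', 'r', encoding='utf-8') as f:
    items = json.load(f)

print(f'{len(items)} items scraped')
for item in items[:5]:  # peek at the first few records
    print(item['title'], item['price'])
```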
The author sees the project's greatest potential in fusing multi-dimensional data and analyzing it, making your personal data work as hard as possible for you.
Of course, if you want to practice and learn web scraping yourself, the author has open-sourced all the crawler code, which makes excellent hands-on material.
As an example, here is the Taobao spider:
```python
import json
import os
import random
import sys
import time

import numpy as np
import requests
from lxml import etree
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver import ActionChains, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from tkinter.filedialog import askdirectory
from tqdm import trange


# Easing functions that shape the slider-captcha drag trajectory
# so it accelerates and decelerates like a human hand.
def ease_out_quad(x):
    return 1 - (1 - x) * (1 - x)


def ease_out_quart(x):
    return 1 - pow(1 - x, 4)


def ease_out_expo(x):
    if x == 1:
        return 1
    else:
        return 1 - pow(2, -10 * x)


# Split a total drag distance over `seconds` into per-step offsets
# following the named easing function.
def get_tracks(distance, seconds, ease_func):
    tracks = [0]
    offsets = [0]
    for t in np.arange(0.0, seconds, 0.1):
        ease = globals()[ease_func]
        offset = round(ease(t / seconds) * distance)
        tracks.append(offset - offsets[-1])
        offsets.append(offset)
    return offsets, tracks


# Drag the slider-captcha knob along the generated track.
def drag_and_drop(browser, offset=26.5):
    knob = browser.find_element_by_id('nc_1_n1z')
    offsets, tracks = get_tracks(offset, 12, 'ease_out_expo')
    ActionChains(browser).click_and_hold(knob).perform()
    for x in tracks:
        ActionChains(browser).move_by_offset(x, 0).perform()
    ActionChains(browser).pause(0.5).release().perform()


# Build a requests session from a raw "k1=v1; k2=v2" cookie string.
def gen_session(cookie):
    session = requests.session()
    cookie_dict = {}
    for pair in cookie.split(';'):
        try:
            cookie_dict[pair.split('=')[0]] = pair.split('=')[1]
        except IndexError:
            cookie_dict[''] = pair
    requests.utils.add_dict_to_cookiejar(session.cookies, cookie_dict)
    return session


class TaobaoSpider(object):
    def __init__(self, cookies_list):
        self.path = askdirectory(title='Select a folder to save the data')
        if str(self.path) == "":
            sys.exit(1)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        option = ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # Skip loading images to speed up page loads
        option.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
        option.add_argument('--headless')
        self.driver = webdriver.Chrome(options=option)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        for cookie in cookies_list:
            self.driver.add_cookie(cookie_dict=cookie)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        self.wait = WebDriverWait(self.driver, 20)  # 20-second timeout

    # Simulate scrolling down the page like a human reader.
    def swipe_down(self, second):
        for i in range(int(second / 0.1)):
            # Alternate scroll targets to mimic up-and-down browsing
            if i % 2 == 0:
                js = "var q=document.documentElement.scrollTop=" + str(300 + 400 * i)
            else:
                js = "var q=document.documentElement.scrollTop=" + str(200 * i)
            self.driver.execute_script(js)
            time.sleep(0.1)
        js = "var q=document.documentElement.scrollTop=100000"
        self.driver.execute_script(js)
        time.sleep(0.1)

    # Crawl the "items I have bought" order list; pn is the number of pages to crawl.
    def crawl_good_buy_data(self, pn=3):
        self.driver.get("https://buyertrade.taobao.com/trade/itemlist/list_bought_items.htm")
        for page in trange(pn):
            data_list = []
            # Wait until the order container on this page has loaded
            self.wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#tp-bought-root > div.js-order-container')))
            # Parse the page source with pyquery
            html = self.driver.page_source
            doc = pq(html)
            good_items = doc('#tp-bought-root .js-order-container').items()
            # Walk through every order on this page
            for item in good_items:
                # Purchase time and order number
                good_time_and_id = item.find('.bought-wrapper-mod__head-info-cell___29cDO').text().replace('\n', "").replace('\r', "")
                # Seller name
                good_merchant = item.find('.bought-wrapper-mod__seller-container___3dAK3').text().replace('\n', "").replace('\r', "")
                # Item name
                good_name = item.find('.sol-mod__no-br___3Ev-2').text().replace('\n', "").replace('\r', "")
                # Item price
                good_price = item.find('.price-mod__price___cYafX').text().replace('\n', "").replace('\r', "")
                # Only purchase time, order number, seller, item name and price are
                # collected here; extract other fields the same way if you need them
                data_list.append(good_time_and_id)
                data_list.append(good_merchant)
                data_list.append(good_name)
                data_list.append(good_price)
            json_str = json.dumps(data_list)
            with open(self.path + os.sep + 'user_orders.json', 'a') as f:
                f.write(json_str)
            # Accounts usually get flagged as bots when they skip human-like
            # behaviour, so scroll for a random 1-3 seconds before paging
            swipe_time = random.randint(1, 3)
            self.swipe_down(swipe_time)
            # Wait for the "next page" button, then click it
            next_btn = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.pagination-next')))
            next_btn.click()
            time.sleep(2)

    # Favourited items; `page` is the number of pages to crawl (30 items per page).
    def get_choucang_item(self, page=3):
        url = 'https://shoucang.taobao.com/nodejs/item_collect_chunk.htm?ifAllTag=0&tab=0&tagId=&categoryCount=0&type=0&tagName=&categoryName=&needNav=false&startRow={}'
        pn = 0
        json_list = []
        for i in trange(page):
            self.driver.get(url.format(pn))
            pn += 30
            html_str = self.driver.page_source
            if html_str == '':
                break
            if '登錄' in html_str:  # bounced to the login page
                raise Exception('Login required')
            obj_list = etree.HTML(html_str).xpath('//li')
            for obj in obj_list:
                item = {}
                item['title'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]//text()')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]/a/@href')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('./div[@class="price-container"]//text()')])
                if item['price'] == '':
                    item['price'] = '失效'  # listing no longer available
                json_list.append(item)
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'shoucang_item.json', 'w') as f:
            f.write(json_str)

    # Browsing history ("footprints"); `page` is the number of pages to crawl.
    def get_footmark_item(self, page=3):
        url = 'https://www.taobao.com/markets/footmark/tbfoot'
        self.driver.get(url)
        item_num = 0
        json_list = []
        for i in trange(page):
            html_str = self.driver.page_source
            # Only parse items appended since the previous scroll
            obj_list = etree.HTML(html_str).xpath('//div[@class="item-list J_redsList"]/div')[item_num:]
            for obj in obj_list:
                item_num += 1
                item = {}
                item['date'] = ''.join([i.strip() for i in obj.xpath('./@data-date')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./a/@href')])
                item['name'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="title"]//text()')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="price-box"]//text()')])
                json_list.append(item)
            # Scroll to the bottom so the next chunk of items loads
            self.driver.execute_script('window.scrollTo(0,1000000)')
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'footmark_item.json', 'w') as f:
            f.write(json_str)

    # Saved delivery addresses.
    def get_addr(self):
        url = 'https://member1.taobao.com/member/fresh/deliver_address.htm'
        self.driver.get(url)
        html_str = self.driver.page_source
        obj_list = etree.HTML(html_str).xpath('//tbody[@class="next-table-body"]/tr')
        data_list = []
        for obj in obj_list:
            item = {}
            item['name'] = obj.xpath('.//td[1]//text()')
            item['area'] = obj.xpath('.//td[2]//text()')
            item['detail_area'] = obj.xpath('.//td[3]//text()')
            item['youbian'] = obj.xpath('.//td[4]//text()')  # postal code
            item['mobile'] = obj.xpath('.//td[5]//text()')
            data_list.append(item)
        json_str = json.dumps(data_list)
        with open(self.path + os.sep + 'address.json', 'w') as f:
            f.write(json_str)


if __name__ == '__main__':
    # taobao_cookies.json holds the cookies of a logged-in session,
    # a list of dicts as returned by Selenium's get_cookies()
    cookie_list = json.loads(open('taobao_cookies.json', 'r').read())
    t = TaobaoSpider(cookie_list)
    t.crawl_good_buy_data()
    # t.get_addr()
    # t.get_choucang_item()
    # t.get_footmark_item()
```
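One thing this snippet assumes is a taobao_cookies.json file from an earlier login. A simple way to bootstrap it (a hedged sketch of my own, not code shipped with the repo) is to log in by hand once and dump the Selenium session cookies, which are already in the list-of-dicts shape that add_cookie() expects:

```python
import json
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://login.taobao.com/')  # log in manually in this window
time.sleep(60)                           # give yourself a minute to finish
with open('taobao_cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)   # list of cookie dicts for add_cookie()
driver.quit()
```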
A repository this good deserves your support: go give its creator a star!