Blowing Up on GitHub: Scrape Almost Anything with This All-in-One Open-Source Crawler Toolbox
A developer in China recently open-sourced InfoSpider on GitHub, a crawler toolbox that bundles a long list of data sources into a single tool, and it took off almost overnight!
Most websites now ship anti-crawling mechanisms, so for anyone who likes to scrape a bit of data for analysis, collecting it keeps getting harder. That is exactly what makes this toolbox such a find.
How popular is it?
Within days of release it climbed to fourth place on GitHub's weekly trending list, with 1.3K stars and 172 forks. The author has open-sourced all of the project code and usage documentation, and also published a video walkthrough on Bilibili.
Project code:
https://github.com/kangvcar/InfoSpider
Project documentation:
https://infospider.vercel.app
Project video demo:
https://www.bilibili.com/video/BV14f4y1R7oF/
In this age of information overload everyone holds a pile of accounts, and with many accounts comes a familiar problem: your personal data is scattered across all kinds of companies, forming data silos, so multi-dimensional data can never be combined. This project fuses that scattered data and analyzes it, giving you a more direct, deeper view of your own information.
InfoSpider is a crawler toolbox that bundles many data sources into one tool. It aims to help users take back their own data safely and quickly: the code is open source, the workflow is transparent, and it provides data-analysis features that generate chart files from the user's data.
Currently supported data sources include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.
According to its creator, InfoSpider has the following features:
- Safe and reliable: the project is open source with concise, fully visible code, and it runs locally, so your data never leaves your machine.
- Easy to use: a GUI is provided; just click the data source you want and follow the prompts.
- Clean structure: every data source is independent and easily portable, and all crawler scripts live in the project's Spiders directory.
- Rich data sources: more than 24 data sources are currently supported, with more being added.
- Unified data format: all scraped data is stored as JSON, which makes later analysis straightforward (a loading sketch follows the run steps below).
- Comprehensive personal data: the project scrapes as much of your personal data as it can; you can trim the results during post-processing as needed.
- Data analysis: visual analysis of your personal data is provided, though currently only for some sources.
InfoSpider is also very simple to use: install Python 3 and the Chrome browser, run python3 main.py, click a data source button in the window that opens, choose a save path when prompted, enter your account credentials, and the data is scraped automatically. You can then browse the results in the chosen download directory.
Dependency installation
- Install Python 3 and the Chrome browser
- Install the chromedriver build that matches your Chrome version (a quick smoke test follows this list)
- Install the dependencies with ./install_deps.sh (on Windows, just pip install -r requirements.txt)
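Before running anything, it is worth confirming that the chromedriver on your PATH really matches your Chrome build. Here is a minimal smoke test (my own sketch, not part of InfoSpider) that launches headless Chrome through Selenium and prints both version strings; on a mismatch, webdriver.Chrome() raises SessionNotCreatedException instead:

```python
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_argument('--headless')          # no visible window needed for the check
driver = webdriver.Chrome(options=option)  # fails here if driver and browser versions diverge
print('Chrome:      ', driver.capabilities.get('browserVersion'))
print('chromedriver:', driver.capabilities.get('chrome', {}).get('chromedriverVersion'))
driver.quit()
```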
Running the tool
- Enter the tools directory
- Run python3 main.py
- Click a data source button in the window that opens and choose a save path when prompted
- Enter your username and password in the browser that pops up; scraping starts automatically and the browser closes by itself when it finishes
The downloaded data (xxx.json) and the analysis charts (xxx.html) appear in the corresponding directory.
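Because every spider writes plain JSON, the output is easy to pull into any analysis workflow. As a minimal sketch (not part of the project), this loads the favourited-items file that the bundled Taobao spider writes and prints a few records; substitute whichever xxx.json your chosen data source produced:

```python
import json

# 'shoucang_item.json' is the name the Taobao spider gives its
# favourited-items output; other spiders use other file names.
with open('shoucang_item.json', 'r', encoding='utf-8') as f:
    items = json.load(f)

print(f'{len(items)} items scraped')
for item in items[:5]:  # peek at the first few records
    print(item['title'], item['price'])
```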
The author sees the project's greatest potential in fusing multi-dimensional data and analyzing it, making your personal data work as hard as possible for you.
Of course, if you want to practice and learn web scraping yourself, the author has open-sourced all the crawler code, which makes excellent hands-on material.
As an example, here is the Taobao spider:
```python
import json
import os
import random
import sys
import time

import numpy as np
import requests
from lxml import etree
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver import ActionChains, ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from tkinter.filedialog import askdirectory
from tqdm import trange


# Easing functions that shape the slider-captcha drag trajectory
# so it accelerates and decelerates like a human hand.
def ease_out_quad(x):
    return 1 - (1 - x) * (1 - x)


def ease_out_quart(x):
    return 1 - pow(1 - x, 4)


def ease_out_expo(x):
    if x == 1:
        return 1
    else:
        return 1 - pow(2, -10 * x)


# Split a total drag distance over `seconds` into per-step offsets
# following the named easing function.
def get_tracks(distance, seconds, ease_func):
    tracks = [0]
    offsets = [0]
    for t in np.arange(0.0, seconds, 0.1):
        ease = globals()[ease_func]
        offset = round(ease(t / seconds) * distance)
        tracks.append(offset - offsets[-1])
        offsets.append(offset)
    return offsets, tracks


# Drag the slider-captcha knob along the generated track.
def drag_and_drop(browser, offset=26.5):
    knob = browser.find_element_by_id('nc_1_n1z')
    offsets, tracks = get_tracks(offset, 12, 'ease_out_expo')
    ActionChains(browser).click_and_hold(knob).perform()
    for x in tracks:
        ActionChains(browser).move_by_offset(x, 0).perform()
    ActionChains(browser).pause(0.5).release().perform()


# Build a requests session from a raw "k1=v1; k2=v2" cookie string.
def gen_session(cookie):
    session = requests.session()
    cookie_dict = {}
    for pair in cookie.split(';'):
        try:
            cookie_dict[pair.split('=')[0]] = pair.split('=')[1]
        except IndexError:
            cookie_dict[''] = pair
    requests.utils.add_dict_to_cookiejar(session.cookies, cookie_dict)
    return session


class TaobaoSpider(object):
    def __init__(self, cookies_list):
        self.path = askdirectory(title='Select a folder to save the data')
        if str(self.path) == "":
            sys.exit(1)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        }
        option = ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # Skip loading images to speed up page loads
        option.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
        option.add_argument('--headless')
        self.driver = webdriver.Chrome(options=option)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        for cookie in cookies_list:
            self.driver.add_cookie(cookie_dict=cookie)
        self.driver.get('https://i.taobao.com/my_taobao.htm')
        self.wait = WebDriverWait(self.driver, 20)  # 20-second timeout

    # Simulate scrolling down the page like a human reader.
    def swipe_down(self, second):
        for i in range(int(second / 0.1)):
            # Alternate scroll targets to mimic up-and-down browsing
            if i % 2 == 0:
                js = "var q=document.documentElement.scrollTop=" + str(300 + 400 * i)
            else:
                js = "var q=document.documentElement.scrollTop=" + str(200 * i)
            self.driver.execute_script(js)
            time.sleep(0.1)
        js = "var q=document.documentElement.scrollTop=100000"
        self.driver.execute_script(js)
        time.sleep(0.1)

    # Crawl the "items I have bought" order list; pn is the number of pages to crawl.
    def crawl_good_buy_data(self, pn=3):
        self.driver.get("https://buyertrade.taobao.com/trade/itemlist/list_bought_items.htm")
        for page in trange(pn):
            data_list = []
            # Wait until the order container on this page has loaded
            self.wait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '#tp-bought-root > div.js-order-container')))
            # Parse the page source with pyquery
            html = self.driver.page_source
            doc = pq(html)
            good_items = doc('#tp-bought-root .js-order-container').items()
            # Walk through every order on this page
            for item in good_items:
                # Purchase time and order number
                good_time_and_id = item.find('.bought-wrapper-mod__head-info-cell___29cDO').text().replace('\n', "").replace('\r', "")
                # Seller name
                good_merchant = item.find('.bought-wrapper-mod__seller-container___3dAK3').text().replace('\n', "").replace('\r', "")
                # Item name
                good_name = item.find('.sol-mod__no-br___3Ev-2').text().replace('\n', "").replace('\r', "")
                # Item price
                good_price = item.find('.price-mod__price___cYafX').text().replace('\n', "").replace('\r', "")
                # Only purchase time, order number, seller, item name and price are
                # collected here; extract other fields the same way if you need them
                data_list.append(good_time_and_id)
                data_list.append(good_merchant)
                data_list.append(good_name)
                data_list.append(good_price)
            json_str = json.dumps(data_list)
            with open(self.path + os.sep + 'user_orders.json', 'a') as f:
                f.write(json_str)
            # Accounts usually get flagged as bots when they skip human-like
            # behaviour, so scroll for a random 1-3 seconds before paging
            swipe_time = random.randint(1, 3)
            self.swipe_down(swipe_time)
            # Wait for the "next page" button, then click it
            next_btn = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.pagination-next')))
            next_btn.click()
            time.sleep(2)

    # Favourited items; `page` is the number of pages to crawl (30 items per page).
    def get_choucang_item(self, page=3):
        url = 'https://shoucang.taobao.com/nodejs/item_collect_chunk.htm?ifAllTag=0&tab=0&tagId=&categoryCount=0&type=0&tagName=&categoryName=&needNav=false&startRow={}'
        pn = 0
        json_list = []
        for i in trange(page):
            self.driver.get(url.format(pn))
            pn += 30
            html_str = self.driver.page_source
            if html_str == '':
                break
            if '登錄' in html_str:  # bounced to the login page
                raise Exception('Login required')
            obj_list = etree.HTML(html_str).xpath('//li')
            for obj in obj_list:
                item = {}
                item['title'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]//text()')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./div[@class="img-item-title"]/a/@href')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('./div[@class="price-container"]//text()')])
                if item['price'] == '':
                    item['price'] = '失效'  # listing no longer available
                json_list.append(item)
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'shoucang_item.json', 'w') as f:
            f.write(json_str)

    # Browsing history ("footprints"); `page` is the number of pages to crawl.
    def get_footmark_item(self, page=3):
        url = 'https://www.taobao.com/markets/footmark/tbfoot'
        self.driver.get(url)
        item_num = 0
        json_list = []
        for i in trange(page):
            html_str = self.driver.page_source
            # Only parse items appended since the previous scroll
            obj_list = etree.HTML(html_str).xpath('//div[@class="item-list J_redsList"]/div')[item_num:]
            for obj in obj_list:
                item_num += 1
                item = {}
                item['date'] = ''.join([i.strip() for i in obj.xpath('./@data-date')])
                item['url'] = ''.join([i.strip() for i in obj.xpath('./a/@href')])
                item['name'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="title"]//text()')])
                item['price'] = ''.join([i.strip() for i in obj.xpath('.//div[@class="price-box"]//text()')])
                json_list.append(item)
            # Scroll to the bottom so the next chunk of items loads
            self.driver.execute_script('window.scrollTo(0,1000000)')
        json_str = json.dumps(json_list)
        with open(self.path + os.sep + 'footmark_item.json', 'w') as f:
            f.write(json_str)

    # Saved delivery addresses.
    def get_addr(self):
        url = 'https://member1.taobao.com/member/fresh/deliver_address.htm'
        self.driver.get(url)
        html_str = self.driver.page_source
        obj_list = etree.HTML(html_str).xpath('//tbody[@class="next-table-body"]/tr')
        data_list = []
        for obj in obj_list:
            item = {}
            item['name'] = obj.xpath('.//td[1]//text()')
            item['area'] = obj.xpath('.//td[2]//text()')
            item['detail_area'] = obj.xpath('.//td[3]//text()')
            item['youbian'] = obj.xpath('.//td[4]//text()')  # postal code
            item['mobile'] = obj.xpath('.//td[5]//text()')
            data_list.append(item)
        json_str = json.dumps(data_list)
        with open(self.path + os.sep + 'address.json', 'w') as f:
            f.write(json_str)


if __name__ == '__main__':
    # taobao_cookies.json holds the cookies of a logged-in session,
    # a list of dicts as returned by Selenium's get_cookies()
    cookie_list = json.loads(open('taobao_cookies.json', 'r').read())
    t = TaobaoSpider(cookie_list)
    t.crawl_good_buy_data()
    # t.get_addr()
    # t.get_choucang_item()
    # t.get_footmark_item()
```
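One thing this snippet assumes is a taobao_cookies.json file from an earlier login. A simple way to bootstrap it (a hedged sketch of my own, not code shipped with the repo) is to log in by hand once and dump the Selenium session cookies, which are already in the list-of-dicts shape that add_cookie() expects:

```python
import json
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://login.taobao.com/')  # log in manually in this window
time.sleep(60)                           # give yourself a minute to finish
with open('taobao_cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)   # list of cookie dicts for add_cookie()
driver.quit()
```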
A repository this good deserves your support: go give its creator a star!