
Web Scraping Sites With Session Cookie Authentication Using NodeJS Request


  Author | Lokesh Joshi

  Translator | Zhang Zhegang (張哲剛)

  Reviewer | Noe

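Introduction

  NodeJS currently has a huge number of libraries that cover almost any routine task. Web scraping has a low entry barrier, which attracts many freelancers and development teams. Not surprisingly, the NodeJS library ecosystem already contains nearly everything needed for parsing.

  This article looks at the core structure of a working scraping application on NodeJS. It also walks through an example that collects data from a few hundred pages of https://outlet.scotch-soda.com, a site that sells branded clothing and accessories. The code samples resemble real scraping applications, one of which was used for scraping Yelp.

  Because of the limited scope of the article, some production components have been removed from the example, such as the database, containerization, proxy connections, and process management tools (for example, pm2). The article also does not dwell on obvious matters such as linting.

  The basic structure of the project is kept intact, however. The example uses the most popular libraries (Axios, Cheerio, Lodash), extracts authorization keys with Puppeteer, and scrapes the data and writes it to a file using NodeJS streams.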

Terms

  This article uses the following terms: the NodeJS application is the server application; the website outlet.scotch-soda.com is the web resource, and its server is the web server. In broad terms, the web resource and its pages are first explored in Chrome or Firefox; then a server application is run that sends HTTP requests to the web server and receives responses with the corresponding data.

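Getting an Authorization Cookie

  The content of outlet.scotch-soda.com is available only to authorized users. In this example, authorization is performed through a Chromium browser controlled by the server application, and the cookies are received from it. These cookies are included in the HTTP headers of every HTTP request sent to the web server, which gives the application access to the restricted content. When scraping large resources with tens or hundreds of thousands of pages, the received cookies have to be refreshed several times.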

  The application will have the following structure:

  cookie-manager.js: file with the CookieManager class, which is responsible for obtaining the cookie;

  cookie-storage.js: file with the cookie variable;

  index.js: the point where CookieManager is called;

  .env: environment variable file.

  /project_root
  |__ /src
  |   |__ /helpers
  |      |__ cookie-manager.js
  |      |__ cookie-storage.js
  |__ .env
  |__ index.js

  Main directory and file structure

  Add the following code to the application:

// index.js

// including environment variables in .env
require('dotenv').config();

const cookieManager = require('./src/helpers/cookie-manager');
const { setLocalCookie } = require('./src/helpers/cookie-storage');

// IIFE - application entry point
(async () => {
  // CookieManager call point
  // login/password values are stored in the .env file
  const cookie = await cookieManager.fetchCookie(
    process.env.LOGIN,
    process.env.PASSWORD,
  );

  if (cookie) {
    // if the cookie was received, assign it as the value of a storage variable
    setLocalCookie(cookie);
  } else {
    console.log('Warning! Could not fetch the Cookie after 3 attempts. Aborting the process...');
    // close the application with an error if it is impossible to receive the cookie
    process.exit(1);
  }
})();

  In cookie-manager.js:

// cookie-manager.js

// 'user-agents' generates 'User-Agent' values for HTTP headers
// 'puppeteer-extra' is a wrapper around the 'puppeteer' library
const _ = require('lodash');
const UserAgent = require('user-agents');
const puppeteerXtra = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// hide from the web server that the client is a bot
puppeteerXtra.use(StealthPlugin());

// plain timer-based pause; page.waitForTimeout was deprecated
// and is not available in newer Puppeteer releases
const sleep = (ms) => new Promise((resolve) => { setTimeout(resolve, ms); });

class CookieManager {
  // this.browser & this.page - Chromium window and page instances
  constructor() {
    this.browser = null;
    this.page = null;
    this.cookie = null;
  }

  // getter
  getCookie() {
    return this.cookie;
  }

  // setter
  setCookie(cookie) {
    this.cookie = cookie;
  }

  async fetchCookie(username, password) {
    // give 3 attempts to authorize and receive cookies
    const attemptCount = 3;

    try {
      // instantiate Chromium window
      this.browser = await puppeteerXtra.launch({
        args: ['--window-size=1920,1080'],
        headless: process.env.NODE_ENV === 'PROD',
      });

      // Chromium instantiates a blank page and sets the 'User-Agent' header
      this.page = await this.browser.newPage();
      await this.page.setUserAgent((new UserAgent()).toString());

      for (let i = 0; i < attemptCount; i += 1) {
        try {
          // Chromium asks the web server for the authorization page
          // and waits for the DOM
          await this.page.goto(process.env.LOGIN_PAGE, { waitUntil: ['domcontentloaded'] });

          // Chromium waits for the country selection confirmation button,
          // presses it and pauses for 1 second
          await this.page.waitForSelector('#changeRegionAndLanguageBtn', { timeout: 5000 });
          await this.page.click('#changeRegionAndLanguageBtn');
          await sleep(1000);

          // Chromium waits for the block with the username and password inputs
          await this.page.waitForSelector('div.login-box-content', { timeout: 5000 });
          await sleep(1000);

          // Chromium enters username/password and clicks the 'Log in' button
          await this.page.type('input.email-input', username);
          await sleep(1000);
          await this.page.type('input.password-input', password);
          await sleep(1000);
          await this.page.click('button[value="Log in"]');
          await sleep(3000);

          // Chromium waits for target content to load on the 'div.main' selector
          await this.page.waitForSelector('div.main', { timeout: 5000 });

          // get the cookies and glue them into a string
          // of the form <key>=<value> [; <key>=<value>]
          this.setCookie(
            _.join(
              _.map(
                await this.page.cookies(),
                ({ name, value }) => _.join([name, value], '='),
              ),
              '; ',
            ),
          );

          // when the cookie has been received, break the loop
          if (this.cookie) break;
        } catch (err) {
          // a failed attempt (e.g. a selector timeout) must not cancel the remaining ones
          console.log('CookieManager: attempt %d failed: %s', i + 1, err.message);
        }
      }

      // return cookie to call point (in index.js); null if every attempt failed
      return this.getCookie();
    } finally {
      // close page and browser instances
      if (this.page) await this.page.close();
      if (this.browser) await this.browser.close();
    }
  }
}

// export singleton
module.exports = new CookieManager();

  The values of some variables come from the .env file.

  // .env

  NODE_ENV=DEV

  LOGIN_PAGE=https://outlet.scotch-soda.com/de/en/login

  LOGIN=tyrell.wellick@ecorp.com

  PASSWORD=i*m_on@kde

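  For example, the headless option passed to the puppeteerXtra.launch method resolves to a boolean that depends on the value of the process.env.NODE_ENV variable. During development the variable is set to DEV, so headless is false and Puppeteer knows that it should render the running Chromium on the monitor.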
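
  The way the headless flag is derived can be tried in isolation. The following is a minimal sketch: buildLaunchOptions is a hypothetical helper that is not part of the project, and its env parameter stands in for process.env.

```javascript
// Hypothetical helper (not part of the project): derive the launch options
// from environment values. `env` stands in for process.env so the function
// can be tried without a .env file.
const buildLaunchOptions = (env) => ({
  args: ['--window-size=1920,1080'],
  // the comparison yields a boolean: true only when NODE_ENV is exactly 'PROD'
  headless: env.NODE_ENV === 'PROD',
});

console.log(buildLaunchOptions({ NODE_ENV: 'DEV' }).headless);
console.log(buildLaunchOptions({ NODE_ENV: 'PROD' }).headless);
```

  Any value other than PROD, including a missing variable, leaves the browser window visible.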

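  The page.cookies method returns an array of objects. Each object describes one cookie and includes the name and value properties. A chain of Lodash functions extracts the key-value pair of every cookie and produces a string of the form <key>=<value>; <key>=<value>.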
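
  The gluing step can be sketched without a browser. This example uses plain array methods instead of Lodash, and the cookie names and values are made up for illustration; only the name and value properties are taken from the text above.

```javascript
// Sample input shaped like the array returned by page.cookies();
// the cookie names and values are invented for this illustration.
const sampleCookies = [
  { name: 'sid', value: 'abc123', domain: 'example.com' },
  { name: 'region', value: 'de', domain: 'example.com' },
];

// Glue every cookie into "<key>=<value>" and join the pairs with "; "
const toCookieHeader = (cookies) => cookies
  .map(({ name, value }) => `${name}=${value}`)
  .join('; ');

console.log(toCookieHeader(sampleCookies));
```

  An empty cookie array produces an empty string, which is falsy, so a check such as if (this.cookie) treats it as "no cookie received".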

The cookie-storage.js file:

// cookie-storage.js

// cookie storage variable
let localProcessedCookie = null;

// getter
const getLocalCookie = () => localProcessedCookie;

// setter
const setLocalCookie = (cookie) => {
  localProcessedCookie = cookie;

  // lock the getLocalCookie function;
  // its scope with the localProcessedCookie value will be saved
  // after the setLocalCookie function completes
  return getLocalCookie;
};

module.exports = {
  setLocalCookie,
  getLocalCookie,
};

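  The idea behind a closure is to keep access to a variable's value after the function in whose scope the variable was used has finished. Normally, when a function returns, it leaves the call stack and the garbage collector frees the variables from its scope.

  In the example above, the value of localProcessedCookie remains in memory after setLocalCookie completes. This means the value can be obtained anywhere in the code for as long as the application is running.

  When setLocalCookie is called, it returns the getLocalCookie function. When the setLocalCookie call finishes, NodeJS sees that getLocalCookie is a closure, so the garbage collector keeps every variable reachable from the returned getter. Because localProcessedCookie is within the scope of getLocalCookie, it continues to live and stays bound to the cookie. Strictly speaking, localProcessedCookie is declared at module level, so both functions close over it, and it also stays alive because NodeJS caches the loaded module.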
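
  The same effect can be observed with a variable that is truly local to a function. The sketch below is an illustration only: createStorage is a hypothetical factory, not a file from the project.

```javascript
// Hypothetical factory: `secret` is local to createStorage,
// yet it stays reachable through the two returned functions.
const createStorage = () => {
  let secret = null;

  const get = () => secret;
  const set = (value) => {
    secret = value;
    // hand back the getter, as setLocalCookie does
    return get;
  };

  return { get, set };
};

const storage = createStorage();
const getter = storage.set('sid=abc123');

// createStorage returned long ago, but the value is still available
console.log(getter());
console.log(storage.get());
```

  Each call to createStorage creates its own secret, so two storages never share a value, whereas the module-level variable in cookie-storage.js is shared by every file that requires it.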

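URL Builder

  The application needs an initial list of URLs to start crawling. In production, crawling usually starts from the main page of the web resource and, over a number of iterations, builds up a collection of links to landing pages. A single web resource often has thousands or tens of thousands of such links.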
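
  The iterative link collection described above can be sketched as a breadth-first walk. This is a simplified illustration under stated assumptions: siteLinks is an invented in-memory site map, and collectLinks is a hypothetical function; a real crawler would fetch each page and extract its anchors instead of reading a lookup table.

```javascript
// Hypothetical site map: each page path lists the paths it links to.
const siteLinks = {
  '/': ['/women', '/men'],
  '/women': ['/women/clothing', '/'],
  '/men': ['/men/clothing'],
  '/women/clothing': [],
  '/men/clothing': ['/women'],
};

// Breadth-first walk from the start page, limited to maxDepth iterations.
// getLinks(page) returns the links found on a page.
const collectLinks = (start, getLinks, maxDepth) => {
  const seen = new Set([start]);
  let frontier = [start];

  for (let depth = 0; depth < maxDepth && frontier.length; depth += 1) {
    const next = [];
    for (const page of frontier) {
      for (const link of getLinks(page)) {
        // visit every link only once, even if many pages point to it
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }

  return [...seen];
};

console.log(collectLinks('/', (page) => siteLinks[page] || [], 2));
```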

  In this example, the crawler receives only eight links as input. They point to the pages with the main product category catalogs:

  https://outlet.scotch-soda.com/women/clothing

  https://outlet.scotch-soda.com/women/footwear

  https://outlet.scotch-soda.com/women/accessories/all-womens-accessories

  https://outlet.scotch-soda.com/men/clothing

  https://outlet.scotch-soda.com/men/footwear

  https://outlet.scotch-soda.com/men/accessories/all-mens-accessories

  https://outlet.scotch-soda.com/kids/girls/clothing/all-girls-clothing

  https://outlet.scotch-soda.com/kids/boys/clothing/all-boys-clothing

  Long link strings like these hurt the readability of the code. To avoid that, let's create a small URL builder from the following files:

  categories.js: file with the route parameters;

  target-builder.js: file that builds the collection of URLs.

  /project_root
  |__ /src
  |   |__ /constants
  |   |  |__ categories.js
  |   |__ /helpers
  |      |__ cookie-manager.js
  |      |__ cookie-storage.js
  |      |__ target-builder.js
  |__ .env
  |__ index.js

  Add the following code:

  // .env

  MAIN_PAGE=https://outlet.scotch-soda.com

// index.js

// import builder function
const getTargetUrls = require('./src/helpers/target-builder');

(async () => {
  // here the process of getting the cookie

  // get an array of url links and determine its length L
  const targetUrls = getTargetUrls();
  const { length: L } = targetUrls;

})();

// categories.js

module.exports = [
  'women/clothing',
  'women/footwear',
  'women/accessories/all-womens-accessories',
  'men/clothing',
  'men/footwear',
  'men/accessories/all-mens-accessories',
  'kids/girls/clothing/all-girls-clothing',
  'kids/boys/clothing/all-boys-clothing',
];

// target-builder.js

const path = require('path');
const categories = require('../constants/categories');

// static fragment of route parameters
const staticPath = 'global/en';

// create URL object from main page address
const url = new URL(process.env.MAIN_PAGE);

// add the full string of route parameters to the URL object
// and return the full url string;
// path.posix keeps forward slashes on every operating system
const addPath = (dynamicPath) => {
  url.pathname = path.posix.join(staticPath, dynamicPath);

  return url.href;
};

// collect a URL link from each element of the array with categories
module.exports = () => categories.map((category) => addPath(category));

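  These three snippets build the eight links given at the start of this section (each with the static global/en segment prepended) and demonstrate the built-in URL and path libraries. Some may feel this is using a sledgehammer to crack a nut, and that interpolation would be much simpler.

  The canonical NodeJS methods for working with routes and URL request parameters are used here for two reasons:

  1. Interpolation reads well only with a small number of components;

  2. To build good habits, these methods should be used every day.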
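
  The difference between the two approaches shows up as soon as the inputs are slightly untidy. The sketch below is an illustration only: example.com is a placeholder address, and buildUrl is a hypothetical helper built on the URL class and path.posix.join from the NodeJS standard library.

```javascript
const path = require('path');

// example.com is a placeholder base address, not the site from the article
const base = 'https://example.com/';
const staticPart = 'global/en';
const category = 'women/clothing';

// Naive interpolation: the trailing slash of the base produces a double slash
const interpolated = `${base}/${staticPart}/${category}`;

// URL + path.posix.join: duplicate slashes are collapsed, the leading slash
// is added, and unsafe characters in the path are percent-encoded
const buildUrl = (baseAddress, ...segments) => {
  const url = new URL(baseAddress);
  url.pathname = path.posix.join(...segments);
  return url.href;
};

console.log(interpolated);
console.log(buildUrl(base, staticPart, category));
console.log(buildUrl(base, 'global/en/', '/kids/new arrivals'));
```

  The first line printed contains "//global", while both buildUrl calls return a clean address; in the last one the space is encoded as %20.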

Crawling and Scraping

  Add two files to the logical core of the server application:

  ·crawler.js: contains the crawler class, which sends requests to the web server and receives the page markup;

  ·parser.js: contains the parser class with methods for scraping the markup and extracting the target data.

  /project_root
  |__ /src
  |   |__ /constants
  |   |  |__ categories.js
  |   |__ /helpers
  |   |  |__ cookie-manager.js
  |   |  |__ cookie-storage.js
  |   |  |__ target-builder.js
  |   |__ crawler.js
  |   |__ parser.js
  |__ .env
  |__ index.js

  First, add a loop to index.js that passes the URL links to the crawler one by one and receives the parsed data:

// index.js

// import the crawler class
const Crawler = require('./src/crawler');

(async () => {
  // getting Cookie process
  // and url-links array...
  const { length: L } = targetUrls;

  // create the crawler only after the cookie has been stored:
  // its constructor reads the cookie through getLocalCookie()
  const crawler = new Crawler();

  // run a loop through the length of the array of url links
  for (let i = 0; i < L; i += 1) {
    // call the run method of the crawler for each link
    // and return parsed data
    const result = await crawler.run(targetUrls[i]);

    // do smth with parsed data...
  }
})();

  The crawler code:

// crawler.js

require('dotenv').config();
const cheerio = require('cheerio');
const axios = require('axios').default;
const UserAgent = require('user-agents');

const Parser = require('./parser');
// getLocalCookie - closure function, returns localProcessedCookie
const { getLocalCookie } = require('./helpers/cookie-storage');

module.exports = class Crawler {
  constructor() {
    // create a class variable and bind it to the newly created Axios object
    // with the necessary headers
    this.axios = axios.create({
      headers: {
        cookie: getLocalCookie(),
        'user-agent': (new UserAgent()).toString(),
      },
    });
  }

  async run(url) {
    console.log('IScraper: working on %s', url);

    try {
      // do HTTP request to the web server
      const { data } = await this.axios.get(url);
      // create a cheerio object with nodes from html markup
      const $ = cheerio.load(data);

      // if the parsed document contains nodes, run Parser
      // and return the result of parsing to index.js
      if ($('body').children().length) {
        const p = new Parser($);

        return p.parse();
      }
      console.log('IScraper: could not fetch or handle the page content from %s', url);
      return null;
    } catch (e) {
      console.log('IScraper: could not fetch the page content from %s (%s)', url, e.message);

      return null;
    }
  }
};

  The parser's task is to select data from the cheerio object it receives and to build the following structure for each URL link:

[
  {
    "Title":"Graphic relaxed-fit T-shirt | Women",
    "CurrentPrice":25.96,
    "Currency":"€",
    "isNew":false
  },
  {
    // 36 such elements in total for every url-link
  }
]

  The parser code:

// parser.js

require('dotenv').config();
const _ = require('lodash');

module.exports = class Parser {
  constructor(content) {
    // this.$ - the cheerio object parsed from the markup
    // this.$$ - the 'li' element currently being processed
    this.$ = content;
    this.$$ = null;
  }

  // The crawler calls the parse method;
  // it extracts all 'li' elements from the content block
  // and selects the target data in the map loop
  parse() {
    return this.$('#js-search-result-items')
      .children('li')
      .map((i, el) => {
        this.$$ = this.$(el);

        const Title = this.getTitle();
        const CurrentPrice = this.getCurrentPrice();

        // if either key value is missing, the object is rejected:
        // cheerio's map skips elements for which null is returned
        if (!Title || !CurrentPrice) return null;

        return {
          Title,
          CurrentPrice,
          Currency: this.getCurrency(),
          isNew: this.isNew(),
        };
      })
      .toArray();
  }

  // next - private methods, which are used in the 'parse' method
  getTitle() {
    // collapse runs of whitespace inside the product name
    return _.replace(this.$$.find('.product__name').text().trim(), /\s{2,}/g, ' ');
  }

  getCurrentPrice() {
    // take the last space-separated token and swap the decimal comma for a dot
    return _.toNumber(
      _.replace(
        _.last(_.split(this.$$.find('.product__price').text().trim(), ' ')),
        ',',
        '.',
      ),
    );
  }

  getCurrency() {
    // the first space-separated token is the currency sign
    return _.head(_.split(this.$$.find('.product__price').text().trim(), ' '));
  }

  isNew() {
    return /new/.test(_.toLower(this.$$.find('.content-asset p').text().trim()));
  }
};

  Running the crawler and the parser yields eight arrays of objects, which are passed back to the for loop in index.js.
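
  The price handling in getCurrentPrice and getCurrency can be tried without cheerio or Lodash. The following is a minimal sketch; parsePrice is a hypothetical function, and the input format (currency sign, a space, then an amount with a decimal comma) is an assumption inferred from the parser code rather than taken from the site.

```javascript
// Assumed input: price text such as "€ 25,96", i.e. a currency sign,
// a space, then the amount with a decimal comma.
const parsePrice = (text) => {
  const parts = text.trim().split(' ');

  return {
    // the first token is the currency sign
    Currency: parts[0],
    // the last token is the amount; swap the decimal comma for a dot
    CurrentPrice: Number(parts[parts.length - 1].replace(',', '.')),
  };
};

console.log(parsePrice(' € 25,96 '));
```

  Text that does not end in a number gives NaN for CurrentPrice, which is falsy, so a check like the one in parse would reject such an item.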

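Writing to a File with Streams

  To write to a file, use a writable stream. A stream is a JS object that provides methods for working with chunks of data that arrive sequentially. All streams inherit from the EventEmitter class, so they can react to events that occur in the runtime environment. You may have come across something like this:

myServer.on('request', (request, response) => {
  // something puts into response
});

// or

myObject.on('data', (chunk) => {
  // do something with data
});

  These are good examples of NodeJS streams, despite their not very original names: myServer and myObject. In this example they listen for certain events, the arrival of an HTTP request (the request event) and the arrival of a piece of data (the data event), and then get on with some useful work. Streaming is valuable because it works with data in small chunks and needs very little memory.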
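
  The link between streams and EventEmitter can be checked directly. The sketch below uses only the NodeJS standard library; the bare emitter named myObject stands in for the object from the snippet above and is an illustration, not project code.

```javascript
const { EventEmitter } = require('events');
const { Readable, Writable } = require('stream');

// Streams are event emitters, so they expose .on() and .emit()
console.log(new Readable({ read() {} }) instanceof EventEmitter);
console.log(new Writable({ write(chunk, enc, cb) { cb(); } }) instanceof EventEmitter);

// A bare emitter standing in for myObject
const myObject = new EventEmitter();
const received = [];

myObject.on('data', (chunk) => {
  // do something with data: here the chunks are simply collected
  received.push(chunk);
});

// emit() invokes the listeners synchronously, in registration order
myObject.emit('data', 'first chunk');
myObject.emit('data', 'second chunk');

console.log(received);
```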

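  In this case, the for loop receives eight arrays of data in sequence, and they are written to the file in sequence, without waiting for the full collection to accumulate and without any accumulator. Because the example code knows exactly when the next portion of parsed data arrives in the for loop, no listeners are needed: the data can be written immediately with the write method built into the stream.

Where to Write

  /project_root
  |__ /data
  |   |__ data.json
  ...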

// index.js

const fs = require('fs');
const path = require('path');
const _ = require('lodash');

// use constants to simplify work with addresses
const resultDirPath = path.join('./', 'data');
const resultFilePath = path.join(resultDirPath, 'data.json');

// check if the data directory exists; create it if necessary
// if the data.json file already exists, its contents are erased;
// if it does not exist, an empty file is created
if (!fs.existsSync(resultDirPath)) fs.mkdirSync(resultDirPath);
fs.writeFileSync(resultFilePath, '');

(async () => {
  // getting Cookie process,
  // url-links array and crawler instance...

  // create a stream object for writing
  // and add an opening square bracket as the first line
  const writer = fs.createWriteStream(resultFilePath);
  writer.write('[\n');

  // becomes true once the first object has been written
  let hasWritten = false;

  // run a loop through the length of the url-links array
  for (let i = 0; i < L; i += 1) {
    const result = await crawler.run(targetUrls[i]);

    // if an array with parsed data is received, determine its length l
    if (!_.isEmpty(result)) {
      const { length: l } = result;

      // using the write method, add the next portion
      // of the incoming data to data.json
      for (let j = 0; j < l; j += 1) {
        // put the separating comma before every object except the first
        writer.write(`${hasWritten ? ',\n' : ''} ${JSON.stringify(result[j])}`);
        hasWritten = true;
      }
    }
  }

  // close the array and finish the stream
  writer.end('\n]\n');
})();

  The nested for loop solves one problem: to get a valid JSON file in the output, there must be no comma after the last object in the resulting array. Since the final URL may return no data, the loop does not try to guess which object is the last one. Instead it writes the separating comma before every object except the first, and the closing bracket is written when the stream ends.

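  If you create data/data.json beforehand and keep it open while the code runs, you can watch in real time as the writable stream appends new pieces of data one after another.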
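
  The comma handling needed for valid JSON can be checked without touching the file system. The following is a minimal sketch under stated assumptions: writeBatches is a hypothetical function, and its write parameter is any function that accepts a string, such as a bound write method of a stream. Here the output is collected in memory and parsed back.

```javascript
// Turn batches of parsed objects into the text of one JSON array,
// one object per line; empty or missing batches are skipped.
const writeBatches = (batches, write) => {
  let isFirst = true;

  write('[\n');
  for (const batch of batches) {
    for (const item of batch || []) {
      // the separator goes before every object except the first,
      // so no comma can be left after the last one
      write(`${isFirst ? '' : ',\n'} ${JSON.stringify(item)}`);
      isFirst = false;
    }
  }
  write('\n]\n');
};

// Collect the output in memory instead of a file to check that it parses
let output = '';
writeBatches(
  [[{ Title: 'A', CurrentPrice: 1.5 }], null, [{ Title: 'B', CurrentPrice: 2 }], []],
  (text) => { output += text; },
);

console.log(JSON.parse(output).length);
```

  The sample input deliberately ends with an empty batch and contains a null one, the two cases in which a "comma after everything but the last item" rule is easy to get wrong.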

Conclusion

  The output is a JSON array of the following form:

[
{"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":false},
{"Title":"Printed mercerised T-shirt | Women","CurrentPrice":29.97,"Currency":"€","isNew":true},
{"Title":"Slim-fit camisole | Women","CurrentPrice":32.46,"Currency":"€","isNew":false},
{"Title":"Graphic relaxed-fit T-shirt | Women","CurrentPrice":25.96,"Currency":"€","isNew":true},
...
{"Title":"Piped-collar polo | Boys","CurrentPrice":23.36,"Currency":"€","isNew":false},
{"Title":"Denim chino shorts | Boys","CurrentPrice":45.46,"Currency":"€","isNew":false}
]

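  The application's processing time, including authorization, was about 20 seconds.

  The complete open source code of the project is available on GitHub, together with the package.json file that lists the dependencies.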

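About the Translator

  Zhang Zhegang (張哲剛) is a 51CTO community editor and systems operations engineer, one of the earlier hardware reviewers and internet practitioners in China, and a former Alibaba employee. He has more than ten years of IT project management experience and a cross-disciplinary skill set, has taken part in the architecture design of several websites and the development of e-government systems, and has led operations for a prefecture-level admissions examination management platform.

  Original title: Web Scraping Sites With Session Cookie Authentication Using NodeJS Request

  Link:

  https://hackernoon.com/web-scraping-sites-with-session-cookie-authentication-using-nodejs-request

Editor in charge: Zhang Jie (張潔)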