Building a Web Crawler with TypeScript
Install TypeScript globally:
- npm install -g typescript
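The current version is 2.0.3, which no longer needs the typings command. The version bundled with VS Code is 1.8, however, so some configuration is required; this article shows how to handle it.
Test the tsc command: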
- tsc
Create a folder for the project:
- mkdir test-typescript-spider
Enter the folder:
- cd test-typescript-spider
Initialize the project:
- npm init
Install the superagent and cheerio modules:
- npm i --save superagent cheerio
Install the corresponding type declaration modules:
- npm i --save @types/superagent
- npm i --save @types/cheerio
Install TypeScript inside the project (this step is required):
- npm i --save typescript
Open the project folder in VS Code. Create a tsconfig.json file in that folder and paste in the following configuration:
- {
- "compilerOptions": {
- "target": "ES6",
- "module": "commonjs",
- "noEmitOnError": true,
- "noImplicitAny": true,
- "experimentalDecorators": true,
- "sourceMap": false,
- // "sourceRoot": "./",
- "outDir": "./out"
- },
- "exclude": [
- "node_modules"
- ]
- }
In VS Code, open "File" > "Preferences" > "Workspace Settings" and add the following to settings.json (without this setting, VS Code asks which version of TypeScript to use whenever the project is opened):
- {
- "typescript.tsdk": "node_modules/typescript/lib"
- }
Create api.ts and paste in the following code:
- import superagent = require('superagent');
- import cheerio = require('cheerio');
- export const remote_get = function(url: string) {
- const promise = new Promise<superagent.Response>(function (resolve, reject) {
- superagent.get(url)
- .end(function (err, res) {
- if (!err) {
- resolve(res);
- } else {
- console.log(err)
- reject(err);
- }
- });
- });
- return promise;
- }
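The remote_get function above wraps a callback-style call in a Promise so that callers can use await. The same pattern can be sketched without any network access. In this minimal sketch, fake_get and fake_get_async are hypothetical names invented for illustration; they are not part of superagent.

```typescript
// Node-style callback signature: error first, then the result.
type Callback<T> = (err: Error | null, result?: T) => void;

// Hypothetical stand-in for a callback-based request function.
function fake_get(url: string, done: Callback<string>): void {
    if (url.length === 0) {
        done(new Error('empty url'));
    } else {
        done(null, 'body of ' + url);
    }
}

// Wrap the callback API in a Promise: resolve on success, reject on error.
function fake_get_async(url: string): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        fake_get(url, (err, result) => {
            if (err) {
                reject(err);
            } else {
                resolve(result as string);
            }
        });
    });
}
```

Inside an async function, `const body = await fake_get_async(url);` then reads like synchronous code, and a rejection can be caught with an ordinary try/catch.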
Create app.ts and write some test code:
- import api = require('./api');
- const go = async () => {
- let res = await api.remote_get('http://www.baidu.com/');
- console.log(res.text);
- }
- go();
Run:
- tsc
Then:
- node out/app
Check that the output is correct.
Now try to scrape the article links on the http://cnodejs.org/ home page.
Change app.ts to:
- import api = require('./api');
- import cheerio = require('cheerio');
- const go = async () => {
- const res = await api.remote_get('http://cnodejs.org/');
- const $ = cheerio.load(res.text);
- let urls: string[] = [];
- let titles: string[] = [];
- $('.topic_title_wrapper').each((index, element) => {
- titles.push($(element).find('.topic_title').first().text().trim());
- urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
- })
- console.log(titles, urls);
- }
- go();
The output shows that the article titles and links have been retrieved.
Now try going one level deeper and fetching the article content:
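Gluing a site root and an href together by hand is easy to get slightly wrong: if both sides carry a slash, the result contains a double slash. Assuming the href values on the page start with a slash (as the later listings in this article imply), a small helper like the following keeps exactly one slash at the join. The name join_url and the sample path are hypothetical.

```typescript
// Join a site root and a path, leaving exactly one slash between them.
function join_url(base: string, path: string): string {
    const trimmed_base = base.replace(/\/+$/, '');  // drop trailing slashes
    const trimmed_path = path.replace(/^\/+/, '');  // drop leading slashes
    return trimmed_base + '/' + trimmed_path;
}

// '/topic/abc' is a made-up path used only for illustration.
console.log(join_url('http://cnodejs.org/', '/topic/abc'));
// prints "http://cnodejs.org/topic/abc"
```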
- import api = require('./api');
- import cheerio = require('cheerio');
- const go = async () => {
- const res = await api.remote_get('http://cnodejs.org/');
- const $ = cheerio.load(res.text);
- $('.topic_title_wrapper').each(async (index, element) => {
- let url = ('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
- const res_content = await api.remote_get(url);
- const $_content = cheerio.load(res_content.text);
- console.log($_content('.topic_content').first().text());
- })
- }
- go();
The requests hit the server so quickly that many of them come back with 503 errors.
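The likely cause is that a callback-based iterator does not wait for an async callback: every callback starts its request immediately, so all requests are in flight at once. The sketch below simulates this with a timer instead of real HTTP and counts how many simulated requests overlap. All names here (make_tracker, run_with_callbacks, run_with_loop) are hypothetical, and Array.prototype.forEach stands in for cheerio's each.

```typescript
// Build a simulated request function that records how many calls overlap.
function make_tracker() {
    let in_flight = 0;
    let max_in_flight = 0;
    return {
        request(): Promise<void> {
            in_flight++;
            max_in_flight = Math.max(max_in_flight, in_flight);
            return new Promise<void>(resolve => setTimeout(() => {
                in_flight--;
                resolve();
            }, 10));
        },
        max: () => max_in_flight,
    };
}

// Callback style: forEach ignores the promise each callback produces,
// so all five requests start before any of them finishes.
async function run_with_callbacks(): Promise<number> {
    const tracker = make_tracker();
    const pending: Promise<void>[] = [];
    [1, 2, 3, 4, 5].forEach(() => {
        pending.push(tracker.request());
    });
    await Promise.all(pending);
    return tracker.max();
}

// Loop style: each await completes before the next request starts.
async function run_with_loop(): Promise<number> {
    const tracker = make_tracker();
    for (let i = 0; i < 5; i++) {
        await tracker.request();
    }
    return tracker.max();
}
```

In this simulation run_with_callbacks resolves to 5 and run_with_loop resolves to 1, which is why a plain for loop with await is the simpler way to throttle a crawler.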
Solution:
Add a helper.ts file:
- export const wait_seconds = function (seconds: number) {
- return new Promise(resolve => setTimeout(resolve, seconds * 1000));
- }
Change api.ts to:
- import superagent = require('superagent');
- import cheerio = require('cheerio');
- export const get_index_urls = async function () {
- const res = await remote_get('http://cnodejs.org/');
- const $ = cheerio.load(res.text);
- let urls: string[] = [];
- $('.topic_title_wrapper').each((index, element) => {
- urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
- });
- return urls;
- }
- export const get_content = async function (url: string) {
- const res = await remote_get(url);
- const $ = cheerio.load(res.text);
- return $('.topic_content').first().text();
- }
- export const remote_get = function (url: string) {
- const promise = new Promise<superagent.Response>(function (resolve, reject) {
- superagent.get(url)
- .end(function (err, res) {
- if (!err) {
- resolve(res);
- } else {
- console.log(err)
- reject(err);
- }
- });
- });
- return promise;
- }
Change app.ts to:
- import api = require('./api');
- import helper = require('./helper');
- import cheerio = require('cheerio');
- const go = async () => {
- let urls = await api.get_index_urls();
- for (let i = 0; i < urls.length; i++) {
- await helper.wait_seconds(1);
- let text = await api.get_content(urls[i]);
- console.log(text);
- }
- }
- go();
The output shows that the program now waits one second before requesting the next content page.
Now try saving the scraped data to a database. Install the mongoose module:
- npm i --save mongoose
- npm i --save @types/mongoose
Then define the schema. First create a models folder:
- mkdir models
Create index.ts in the models folder:
- import * as mongoose from 'mongoose';
- mongoose.connect('mongodb://127.0.0.1/cnodejs_data', {
- server: { poolSize: 20 }
- }, function (err) {
- if (err) {
- process.exit(1);
- }
- });
- // models
- // The path must match the file name Article.ts exactly on case-sensitive file systems
- export const Article = require('./Article');
Create IArticle.ts in the models folder:
- interface IArticle {
- title: string;
- url: string;
- text: string;
- }
- export = IArticle;
Create Article.ts in the models folder:
- import mongoose = require('mongoose');
- import IArticle = require('./IArticle');
- interface IArticleModel extends IArticle, mongoose.Document { }
- const ArticleSchema = new mongoose.Schema({
- title: { type: String },
- url: { type: String },
- text: { type: String },
- });
- const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);
- export = Article;
Change api.ts to:
- import superagent = require('superagent');
- import cheerio = require('cheerio');
- import models = require('./models');
- const Article = models.Article;
- export const get_index_urls = async function () {
- const res = await remote_get('http://cnodejs.org/');
- const $ = cheerio.load(res.text);
- let urls: string[] = [];
- $('.topic_title_wrapper').each((index, element) => {
- urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
- });
- return urls;
- }
- export const fetch_content = async function (url: string) {
- const res = await remote_get(url);
- const $ = cheerio.load(res.text);
- let article = new Article();
- article.text = $('.topic_content').first().text();
- article.title = $('.topic_full_title').first().text().replace('置顶', '').replace('精华', '').trim();
- article.url = url;
- console.log('Fetched: ' + article.title);
- // Wait for the save to finish so that a failure reaches the caller's try/catch
- await article.save();
- }
- export const remote_get = function (url: string) {
- return new Promise<superagent.Response>((resolve, reject) => {
- superagent.get(url)
- .end(function (err, res) {
- if (!err) {
- resolve(res);
- } else {
- reject(err);
- }
- });
- });
- }
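The title cleanup in fetch_content removes the first occurrence of each label wherever it appears in the string. A slightly stricter variant strips the labels only from the start of the title, so a title that happens to contain one of those words further on is left alone. This is a minimal sketch assuming the page prefixes pinned and featured topics with the labels 置顶 and 精华; the names TITLE_LABELS and clean_title are hypothetical.

```typescript
// Labels the page is assumed to put in front of some titles.
const TITLE_LABELS = ['置顶', '精华'];

// Strip any leading labels (in any order) and surrounding whitespace.
function clean_title(raw: string): string {
    let title = raw.trim();
    let stripped = true;
    while (stripped) {
        stripped = false;
        for (const label of TITLE_LABELS) {
            if (title.indexOf(label) === 0) {
                title = title.slice(label.length).trim();
                stripped = true;
            }
        }
    }
    return title;
}
```

For example, `clean_title(' 置顶 精华 Hello ')` yields `'Hello'`.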
Change app.ts to:
- import api = require('./api');
- import helper = require('./helper');
- import cheerio = require('cheerio');
- (async () => {
- try {
- let urls = await api.get_index_urls();
- for (let i = 0; i < urls.length; i++) {
- await helper.wait_seconds(1);
- await api.fetch_content(urls[i]);
- }
- } catch (err) {
- console.log(err);
- }
- console.log('Done!');
- })();
Run:
- tsc
- node out/app
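Check the output, then look in the database: the articles have been saved.
Addendum: an improved version of remote_get that retries on error and supports a proxy server. It drops the superagent library in favor of request and is for reference only:
- import request = require('request');
- import { wait_seconds } from './helper';
- // config is assumed to be the project's own settings object, e.g. with config.retries = 3
- // Note: this counter is shared by all calls, so it is only reliable when requests run one at a time
- let current_retry = config.retries || 0;
- export const remote_get = async function (url: string, proxy?: string) {
- // Wait a moment before every request
- await wait_seconds(2);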
- if (!proxy) {
- proxy = '';
- }
- const promise = new Promise<string>(function (resolve, reject) {
- console.log('get: ' + url + ', using proxy: ' + proxy);
- let options: request.CoreOptions = {
- headers: {
- 'Cookie': '',
- 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
- 'Referer': 'https://www.baidu.com/'
- },
- encoding: 'utf-8',
- method: 'GET',
- proxy: proxy,
- timeout: 3000,
- }
- request(url, options, async function (err, response, body) {
- console.log('got:' + url);
- if (!err) {
- body = body.toString();
- current_retry = config.retries || 0;
- console.log('bytes:' + body.length);
- resolve(body);
- } else {
- console.log(err);
- if (current_retry <= 0) {
- current_retry = config.retries || 0;
- reject(err);
- } else {
- console.log('retry...(' + current_retry + ')')
- current_retry--;
- try {
- let body = await remote_get(url, proxy);
- resolve(body);
- } catch (e) {
- reject(e);
- }
- }
- }
- });
- });
- return promise;
- }
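Because the listing above keeps its retry counter in a module-level variable, overlapping calls would share and disturb each other's count. One alternative is to keep the counter local to each call. The following is a minimal sketch of that idea, independent of any HTTP library; with_retries is a hypothetical helper, not part of request or superagent.

```typescript
// Run an async operation, retrying up to `retries` more times if it rejects.
// The attempt counter is local to each call, so concurrent calls do not interfere.
async function with_retries<T>(operation: () => Promise<T>, retries: number): Promise<T> {
    let last_error: unknown;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await operation();
        } catch (err) {
            last_error = err;
            console.log('attempt ' + (attempt + 1) + ' failed');
        }
    }
    // Every attempt failed: surface the most recent error to the caller.
    throw last_error;
}
```

A call such as `with_retries(() => remote_get(url), 3)` would then make at most four attempts before rejecting, assuming remote_get itself no longer retries internally.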
Also, merging IArticle.ts and Article.ts into a single file may be cleaner. Here is how I wrote another model:
- import mongoose = require('mongoose');
- interface IProxyModel {
- uri: string;
- ip: string;
- port: string;
- info: string;
- }
- export interface IProxy extends IProxyModel, mongoose.Document { }
- const ProxySchema = new mongoose.Schema({
- uri: { type: String },
- ip: { type: String },
- port: { type: String },
- info: { type: String },
- });
- export const Proxy = mongoose.model<IProxy>("Proxy", ProxySchema);
Import it like this:
- import { IProxy, Proxy } from './models';
Proxy can then be used for operations such as new, find and where:
- let x = new Proxy();
- let xx = await Proxy.find({});
- let xxx = await Proxy.where('aaa',123).exec();
IProxy, in turn, is used for passing entity objects around, for example:
- function xxx(p:IProxy){
- }
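To make the idea of passing entity objects around more concrete, here is a minimal sketch that runs without mongoose. ProxyFields repeats the four fields of the model above as a plain interface, and describe_proxy is a hypothetical function. Since TypeScript checks shapes structurally, a function typed this way should accept a loaded IProxy document as well as a plain object.

```typescript
// Plain shape of a proxy record; repeats the fields of the model above.
interface ProxyFields {
    uri: string;
    ip: string;
    port: string;
    info: string;
}

// Hypothetical helper: formats any object that has the proxy fields.
function describe_proxy(p: ProxyFields): string {
    return p.ip + ':' + p.port + ' (' + p.info + ')';
}

// The values below are made up for illustration.
console.log(describe_proxy({ uri: 'http://10.0.0.1:8080', ip: '10.0.0.1', port: '8080', info: 'test' }));
// prints "10.0.0.1:8080 (test)"
```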