Building a Crawler Tool from Scratch with TypeScript

Author: maomin9761 2021-01-27 07:24:38

Preface

Today we will build a crawler tool in TypeScript. Which site should we target? After searching online and ruling out a few candidates, we settled on this one.

https://www.hanju.run/

It is a video site, and our main goal is to crawl the playback links of the videos on it. Let's begin with the first step.

Step 1

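As the saying goes, the first step is always the hardest. For this project, though, it is the other way round. You need to do the following:

1. Create a project folder.

2. Initialize the project:

npm init -y

3. Install typescript locally:

npm install typescript -D

4. Generate the TypeScript configuration file. Because typescript is installed locally rather than globally, run the compiler through npx:

npx tsc --init

5. Install ts-node locally so that TypeScript files can be run directly from the command line:

npm install -D ts-node

6. Create a src folder inside the project folder.

Then create a crawler.ts file inside the src folder.

7. Add a shortcut start command to package.json: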

"scripts": { 
    "dev-t": "ts-node ./src/crawler.ts" 
  }

Step 2

Now for the hands-on part. The crawler.ts file created above is where all the work happens.

We first need to import the following dependencies:

import superagent from "superagent"; 
import cheerio from "cheerio"; 
import fs from "fs"; 
import path from "path";

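So we install the dependencies as follows.

superagent fetches the HTML content of a remote URL.

npm install superagent

cheerio lets us read the content of page nodes using jQuery-style syntax.

npm install cheerio

The remaining two dependencies, fs and path, are built into Node and can be imported directly.

With the dependencies installed, you may find the imports flagged with red errors. The reason is that superagent and cheerio are written in JavaScript rather than TypeScript, while our environment is TypeScript. The compiler therefore needs a "translation" for them, which is called a type definition file (with the .d.ts suffix). We can install the type definitions with the following commands.

npm install -D @types/superagent

npm install -D @types/cheerio

Next, let's go through the source code carefully.

1. With the two dependencies installed, we create a Crawler class and instantiate it.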

import superagent from "superagent"; 
import cheerio from "cheerio"; 
import fs from "fs"; 
import path from "path"; 
 
class Crawler { 
  constructor() { 
     
  } 
} 
 
const crawler = new Crawler();

2. We decide on the URL to crawl and assign it to a private field. We then wrap the request in a getRawHtml method that fetches the content of that URL.

getRawHtml uses the async/await keywords to fetch the page content asynchronously and return it.

import superagent from "superagent"; 
import cheerio from "cheerio"; 
import fs from "fs"; 
import path from "path"; 
 
class Crawler { 
  private url = "https://www.hanju.run/play/39221-4-0.html"; 
 
  async getRawHtml() { 
    const result = await superagent.get(this.url); 
    return result.text; 
  } 
 
  async initSpiderProcess() { 
    const html = await this.getRawHtml(); 
  } 
 
  constructor() { 
    this.initSpiderProcess(); 
  } 
} 
 
const crawler = new Crawler();
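Because initSpiderProcess is started from the constructor without being awaited, a failed request would surface as an unhandled promise rejection. The following is a minimal sketch of the same fetch step with explicit error handling. `Fetcher`, `FetchResult`, and `fetchPage` are hypothetical names that do not appear in the original code, and the fetcher is injected so the example runs without network access; in the crawler it would presumably wrap `superagent.get(url)`.

```typescript
// Hypothetical helper types: not part of the original crawler.
type Fetcher = (url: string) => Promise<string>;

type FetchResult =
  | { ok: true; html: string }
  | { ok: false; error: string };

// Await the injected fetcher and turn a rejection into a plain value,
// so callers never have to deal with an unhandled promise rejection.
async function fetchPage(url: string, fetcher: Fetcher): Promise<FetchResult> {
  try {
    const html = await fetcher(url);
    return { ok: true, html };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { ok: false, error };
  }
}

// Stand-in fetchers used only for illustration (no network involved).
const fakeFetcher: Fetcher = async (url) => `<title>${url}</title>`;
const failingFetcher: Fetcher = async () => {
  throw new Error("network down");
};

fetchPage("https://example.com", fakeFetcher).then((r) => console.log(r.ok));
fetchPage("https://example.com", failingFetcher).then((r) => console.log(r.ok));
```

Returning a tagged result instead of throwing keeps the decision about what to do on failure (retry, log, skip) with the caller.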

3. Use cheerio's built-in methods to read the content of the relevant nodes.

We fetch the page content asynchronously with getRawHtml and pass it, as a string, to the getJsonInfo method. After calling cheerio.load(html), we can use jQuery-style syntax to read the nodes we need. We extract the video's title and link from the page and add them to an object as key-value pairs. Note: we define an interface here to describe the types of those key-value pairs.

import superagent from "superagent"; 
import cheerio from "cheerio"; 
import fs from "fs"; 
import path from "path"; 
 
interface Info { 
  name: string; 
  url: string; 
} 
 
class Crawler { 
  private url = "https://www.hanju.run/play/39221-4-0.html"; 
 
  getJsonInfo(html: string) { 
    const $ = cheerio.load(html); 
    const info: Info[] = []; 
    const scpt: string = String($(".play>script:nth-child(1)").html()); 
    const url = unescape( 
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "") 
    ); 
    const name: string = String($("title").html()); 
    info.push({ 
      name, 
      url, 
    }); 
    const result = { 
      time: new Date().getTime(), 
      data: info, 
    }; 
    return result; 
  } 
 
  async getRawHtml() { 
    const result = await superagent.get(this.url); 
    return result.text; 
  } 
 
  async initSpiderProcess() { 
    const html = await this.getRawHtml(); 
    const info = this.getJsonInfo(html); 
  } 
 
  constructor() { 
    this.initSpiderProcess(); 
  } 
} 
 
const crawler = new Crawler();
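The chained split calls in getJsonInfo throw a TypeError as soon as the page script has fewer statements than expected. A more defensive variant might look like the sketch below. `extractPlayUrl` is a hypothetical helper, the sample script is an assumed shape rather than the site's real markup, and `decodeURIComponent` is swapped in for the deprecated `unescape` on the assumption that the payload is ordinary percent-encoding.

```typescript
// Hypothetical helper: follows the same idea as the split chain in getJsonInfo,
// but returns null instead of throwing when the script has an unexpected shape.
function extractPlayUrl(scriptText: string): string | null {
  const statement = scriptText.split(";")[3]; // fourth statement, as in the article
  if (statement === undefined) return null;

  // Capture the quoted argument of the first call, e.g. unescape("...").
  const match = /\(\s*"([^"]*)"\s*\)/.exec(statement);
  if (match === null) return null;

  try {
    // Assumption: the payload is plain percent-encoding, so
    // decodeURIComponent can stand in for the deprecated unescape.
    return decodeURIComponent(match[1]);
  } catch {
    return null; // malformed escape sequence
  }
}

// Assumed sample input; the real page script may differ.
const sampleScript =
  'var a=1;var b=2;var c=3;var now=unescape("https%3A%2F%2Fexample.com%2Findex.m3u8");';
console.log(extractPlayUrl(sampleScript)); // https://example.com/index.m3u8
console.log(extractPlayUrl("var a=1;")); // null
```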

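4. First create a data folder in the project root. We then store the extracted content in a url.json file inside that folder (the file itself is generated automatically).

We wrap this in a getJsonContent method. It uses path.resolve to build the file path, fs.readFileSync to read the file's content, and fs.writeFileSync to write content to the file. Note: we define two more interfaces, objJson and InfoResult.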

import superagent from "superagent"; 
import cheerio from "cheerio"; 
import fs from "fs"; 
import path from "path"; 
 
interface objJson { 
  [propName: number]: Info[]; 
} 
 
interface Info { 
  name: string; 
  url: string; 
} 
 
interface InfoResult { 
  time: number; 
  data: Info[]; 
} 
 
class Crawler { 
  private url = "https://www.hanju.run/play/39221-4-0.html"; 
 
  getJsonInfo(html: string) { 
    const $ = cheerio.load(html); 
    const info: Info[] = []; 
    const scpt: string = String($(".play>script:nth-child(1)").html()); 
    const url = unescape( 
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "") 
    ); 
    const name: string = String($("title").html()); 
    info.push({ 
      name, 
      url, 
    }); 
    const result = { 
      time: new Date().getTime(), 
      data: info, 
    }; 
    return result; 
  } 
 
  async getRawHtml() { 
    const result = await superagent.get(this.url); 
    return result.text; 
  } 
 
  getJsonContent(info: InfoResult) { 
    const filePath = path.resolve(__dirname, "../data/url.json"); 
    let fileContent: objJson = {}; 
    if (fs.existsSync(filePath)) { 
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8")); 
    } 
    fileContent[info.time] = info.data; 
    fs.writeFileSync(filePath, JSON.stringify(fileContent)); 
  } 
 
  async initSpiderProcess() { 
    const html = await this.getRawHtml(); 
    const info = this.getJsonInfo(html); 
    this.getJsonContent(info); 
  } 
 
  constructor() { 
    this.initSpiderProcess(); 
  } 
} 
 
const crawler = new Crawler();
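The read, merge, and write cycle in getJsonContent can also be tried in isolation. The sketch below is one possible standalone version: `appendRecord` and `UrlStore` are hypothetical names, and, unlike the article's code, it creates the parent folder itself with `fs.mkdirSync(..., { recursive: true })` so the data folder does not have to exist beforehand. It writes to a temporary directory purely for demonstration.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface Info {
  name: string;
  url: string;
}

// Keys are millisecond timestamps; JSON stores object keys as strings.
type UrlStore = Record<string, Info[]>;

// Hypothetical helper: read the existing store (if any), add one entry,
// and write the result back, creating the parent folder when it is missing.
function appendRecord(filePath: string, time: number, data: Info[]): UrlStore {
  let store: UrlStore = {};
  if (fs.existsSync(filePath)) {
    store = JSON.parse(fs.readFileSync(filePath, "utf-8")) as UrlStore;
  }
  store[String(time)] = data;
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, JSON.stringify(store, null, 2));
  return store;
}

// Demo in a temporary folder so the project tree is left untouched.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "crawler-demo-"));
const demoFile = path.join(demoDir, "data", "url.json");
appendRecord(demoFile, 1, [{ name: "first", url: "https://example.com/1.m3u8" }]);
const demoStore = appendRecord(demoFile, 2, [
  { name: "second", url: "https://example.com/2.m3u8" },
]);
console.log(Object.keys(demoStore)); // [ '1', '2' ]
```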

5. Run the command

npm run dev-t

6. Check the generated file

{ 
  "1610738046569": [ 
    { 
      "name": "《復仇者聯盟4：終局之戰》HD1080P中字m3u8在線觀看-韓劇網", 
      "url": "https://wuxian.xueyou-kuyun.com/20190728/16820_302c7858/index.m3u8" 
    } 
  ], 
  "1610738872042": [ 
    { 
      "name": "《鋼鐵俠2》HD高清m3u8在線觀看-韓劇網", 
      "url": "https://www.yxlmbbs.com:65/20190920/54uIR9hI/index.m3u8" 
    } 
  ], 
  "1610739069969": [ 
    { 
      "name": "《鋼鐵俠2》中英特效m3u8在線觀看-韓劇網", 
      "url": "https://tv.youkutv.cc/2019/11/12/mjkHyHycfh0LyS4r/playlist.m3u8" 
    } 
  ] 
}
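Each run adds one timestamp key to the file, so a consumer usually wants the newest entry. Keys this large are not treated as array indices, which means their order in the object follows insertion rather than numeric value; sorting or taking the maximum explicitly is the safer route. A minimal sketch, with `latestEntry` and the sample values being hypothetical:

```typescript
interface Info {
  name: string;
  url: string;
}

type UrlStore = Record<string, Info[]>;

// Hypothetical helper: return the entry stored under the largest timestamp key.
function latestEntry(store: UrlStore): Info[] | undefined {
  const times = Object.keys(store)
    .map(Number)
    .filter((t) => !Number.isNaN(t));
  if (times.length === 0) return undefined;
  const newest = Math.max(...times);
  return store[String(newest)];
}

// Sample data with made-up names and URLs, deliberately out of order.
const sampleStore: UrlStore = {
  "1610738046569": [{ name: "first run", url: "https://example.com/a.m3u8" }],
  "1610739069969": [{ name: "third run", url: "https://example.com/c.m3u8" }],
  "1610738872042": [{ name: "second run", url: "https://example.com/b.m3u8" }],
};
console.log(latestEntry(sampleStore)?.[0].name); // third run
```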

Interim Conclusion

Is this really the end?

No, no, no!

It really is not.

The code above is one big lump, and it smells.

We will refactor it using the Composite pattern and the Singleton pattern in turn.

Optimization 1: Composite Pattern

The Composite Pattern, also called the part-whole pattern, treats a group of similar objects as a single object. It composes objects into a tree structure to represent part-whole hierarchies. It is a structural design pattern, and it creates a tree of object groups.

The pattern introduces a class that contains a group of its own objects and provides ways to modify that group.

In short, it lets you handle complex elements the same way you handle simple ones.
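To make the textbook definition concrete, here is a small tree-shaped sketch in which a single page and a group of pages answer the same `urls()` call. All names (`CrawlTask`, `PageTask`, `TaskGroup`) and URLs are hypothetical and are not taken from the crawler code.

```typescript
// Hypothetical names throughout; this is not code from the crawler.
interface CrawlTask {
  // Every node, leaf or group, answers the same question.
  urls(): string[];
}

// Leaf: a single page.
class PageTask implements CrawlTask {
  constructor(private readonly url: string) {}

  urls(): string[] {
    return [this.url];
  }
}

// Composite: holds children that are themselves CrawlTasks.
class TaskGroup implements CrawlTask {
  private readonly children: CrawlTask[] = [];

  add(task: CrawlTask): this {
    this.children.push(task);
    return this;
  }

  urls(): string[] {
    const all: string[] = [];
    for (const child of this.children) {
      all.push(...child.urls());
    }
    return all;
  }
}

const season = new TaskGroup()
  .add(new PageTask("https://example.com/ep1"))
  .add(new PageTask("https://example.com/ep2"));
const site = new TaskGroup()
  .add(season)
  .add(new PageTask("https://example.com/about"));
console.log(site.urls().length); // 3
```

The caller of `site.urls()` does not need to know whether it holds one page or a nested group, which is the point of the pattern.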

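First, we create a combination folder under src and then create two files in it: crawler.ts and urlAnalyzer.ts. Note that the code that follows does not build a tree of objects. It passes an analyzer object into the crawler, which is closer to plain object composition than to the textbook Composite pattern.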

crawler.ts

crawler.ts is mainly responsible for fetching the page content and writing the result to the file.

import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the .ts extension; TypeScript rejects "./urlAnalyzer.ts" by default.
import UrlAnalyzer from "./urlAnalyzer";

// Contract for every analyzer: turn raw HTML into the JSON text to store.
export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");

  // Fetch the raw HTML of the target page.
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  // Overwrite the output file with the serialized content.
  writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }

  // Fetch, analyze, then persist.
  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analyze(html, this.filePath);
    this.writeFile(fileContent);
  }

  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}
const url = "https://www.hanju.run/play/39257-1-1.html";

const analyzer = new UrlAnalyzer();
new Crawler(analyzer, url);

urlAnalyzer.ts

urlAnalyzer.ts holds the concrete logic for extracting content from the page's nodes.

import cheerio from "cheerio"; 
import fs from "fs"; 
import { Analyzer } from "./crawler"; 
 
interface objJson { 
  [propName: number]: Info[]; 
} 
interface InfoResult { 
  time: number; 
  data: Info[]; 
} 
interface Info { 
  name: string; 
  url: string; 
} 
 
export default class UrlAnalyzer implements Analyzer { 
  private getJsonInfo(html: string) { 
    const $ = cheerio.load(html); 
    const info: Info[] = []; 
    const scpt: string = String($(".play>script:nth-child(1)").html()); 
    const url = unescape( 
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "") 
    ); 
    const name: string = String($("title").html()); 
    info.push({ 
      name, 
      url, 
    }); 
    const result = { 
      time: new Date().getTime(), 
      data: info, 
    }; 
    return result; 
  } 
 
  private getJsonContent(info: InfoResult, filePath: string) { 
    let fileContent: objJson = {}; 
    if (fs.existsSync(filePath)) { 
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8")); 
    } 
    fileContent[info.time] = info.data; 
    return fileContent; 
  } 
 
  public analyze(html: string, filePath: string) { 
    const info = this.getJsonInfo(html); 
    console.log(info); 
    const fileContent = this.getJsonContent(info, filePath); 
    return JSON.stringify(fileContent); 
  } 
}
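The benefit of passing the analyzer in through the Analyzer interface is that the crawler no longer cares how the HTML is interpreted. As a sketch, a second analyzer can be dropped in without touching the crawler class. `TitleAnalyzer` is a hypothetical example, and unlike UrlAnalyzer it ignores the existing file content and keeps only the page title.

```typescript
// Same contract as the Analyzer interface exported from crawler.ts.
interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

// Hypothetical alternative analyzer: keeps only the page title.
class TitleAnalyzer implements Analyzer {
  public analyze(html: string, _filePath: string): string {
    // A regex is enough for this illustration; the article's code uses cheerio.
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    const title = match ? match[1].trim() : "";
    return JSON.stringify({ time: Date.now(), title });
  }
}

const titleAnalyzer: Analyzer = new TitleAnalyzer();
const output = titleAnalyzer.analyze(
  "<html><title> Demo page </title></html>",
  "unused.json"
);
console.log(JSON.parse(output).title); // Demo page
```

With this in place, the last two lines of crawler.ts would only need to construct a `TitleAnalyzer` instead of a `UrlAnalyzer`.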

You can define a shortcut start command in package.json.

"scripts": { 
  "dev-c": "ts-node ./src/combination/crawler.ts" 
},

Then run npm run dev-c to start it.

Optimization 2: Singleton Pattern

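The Singleton Pattern is one of the simplest design patterns and is well known from Java. It is a creational pattern that provides a controlled way to create an object.

The pattern involves a single class that is responsible for creating its own object while making sure that only one object is ever created. The class provides a way to access that one object directly, without callers having to instantiate the class themselves.

Examples:

1. A class has only one head teacher.
2. Windows is multi-process and multi-threaded, so when a file is being operated on, several processes or threads may end up working on it at the same time. All handling of the file therefore has to go through a single instance.
3. Device managers are often designed as singletons. For example, a computer with two printers has to make sure that the two printers do not print the same file.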
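Stripped of any crawler details, the mechanism can be sketched in a few lines: a private constructor prevents outside instantiation, and a static method hands out the one shared instance. `RequestCounter` is a hypothetical class used only for illustration.

```typescript
// Hypothetical example class used only to show the mechanism.
class RequestCounter {
  private static instance: RequestCounter | undefined;
  private count = 0;

  // A private constructor blocks `new RequestCounter()` outside the class.
  private constructor() {}

  // Create the instance on first use, then keep returning the same one.
  static getInstance(): RequestCounter {
    if (RequestCounter.instance === undefined) {
      RequestCounter.instance = new RequestCounter();
    }
    return RequestCounter.instance;
  }

  increment(): number {
    this.count += 1;
    return this.count;
  }
}

const first = RequestCounter.getInstance();
const second = RequestCounter.getInstance();
first.increment();
console.log(first === second); // true
console.log(second.increment()); // 2
```

Both variables point at the same object, so the count incremented through `first` is visible through `second`.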

As before, we create a singleton folder under src and then create two files in it: crawler1.ts and urlAnalyzer.ts.

These two files play the same roles as above; only the way the code is written differs.

crawler1.ts

import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the .ts extension; TypeScript rejects "./urlAnalyzer.ts" by default.
import UrlAnalyzer from "./urlAnalyzer";

export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  private writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }

  private async initSpiderProcess() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analyze(html, this.filePath);
    // analyze already returns a JSON string, so write it as-is;
    // stringifying it again would store a quoted string instead of an object.
    this.writeFile(fileContent);
  }

  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}
const url = "https://www.hanju.run/play/39257-1-1.html";

// The analyzer can only be obtained through getInstance.
const analyzer = UrlAnalyzer.getInstance();
new Crawler(analyzer, url);

urlAnalyzer.ts

import cheerio from "cheerio"; 
import fs from "fs"; 
import { Analyzer } from "./crawler1"; 
 
interface objJson { 
  [propName: number]: Info[]; 
} 
interface InfoResult { 
  time: number; 
  data: Info[]; 
} 
interface Info { 
  name: string; 
  url: string; 
} 
export default class UrlAnalyzer implements Analyzer { 
  static instance: UrlAnalyzer; 
 
  static getInstance() { 
    if (!UrlAnalyzer.instance) { 
      UrlAnalyzer.instance = new UrlAnalyzer(); 
    } 
    return UrlAnalyzer.instance; 
  } 
 
  private getJsonInfo(html: string) { 
    const $ = cheerio.load(html); 
    const info: Info[] = []; 
    const scpt: string = String($(".play>script:nth-child(1)").html()); 
    const url = unescape( 
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/\"/g, "") 
    ); 
    const name: string = String($("title").html()); 
    info.push({ 
      name, 
      url, 
    }); 
    const result = { 
      time: new Date().getTime(), 
      data: info, 
    }; 
    return result; 
  } 
 
  private getJsonContent(info: InfoResult, filePath: string) { 
    let fileContent: objJson = {}; 
    if (fs.existsSync(filePath)) { 
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8")); 
    } 
    fileContent[info.time] = info.data; 
    return fileContent; 
  } 
 
  public analyze(html: string, filePath: string) { 
    const info = this.getJsonInfo(html); 
    console.log(info); 
    const fileContent = this.getJsonContent(info, filePath); 
    return JSON.stringify(fileContent); 
  } 
 
  private constructor() {} 
}

You can define a shortcut start command in package.json.

"scripts": { 
    "dev-s": "ts-node ./src/singleton/crawler1.ts", 
 },

Then run npm run dev-s to start it.

Conclusion

This time it really is the end. Thanks for reading, and I hope it helps.

Full source code:

https://github.com/maomincoding/TsCrawler

Editor: 姜華 Source: 前端歷劫之路
