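Quickly Scraping Web Page Data with .NET
Introduction
Today we look at how to use DotnetSpider to quickly implement web page data scraping. DotnetSpider is an open-source (MIT License) distributed web crawler framework for .NET that is lightweight, flexible, high-performance, and cross-platform.
Note: for your own safety, only develop web crawlers within the limits permitted by national law.
Scraping requirements
As the example for this article, we scrape the title, summary, and URL of each article on the first page of the cnblogs (博客园) 10-day recommended ranking, and save the scraped data to a txt file.
- Request URL: https://www.cnblogs.com/aggsite/topdiggs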
Create a console application
Create a console application named DotnetSpiderExercise.
Install the DotnetSpider NuGet package
Search in the NuGet Package Manager for: DotnetSpider
Add the Serilog logging component
Search in the NuGet Package Manager for: Serilog.AspNetCore
Add RecommendedRankingModel
namespace DotnetSpiderExercise
{
    public class RecommendedRankingModel
    {
        /// <summary>
        /// Article title
        /// </summary>
        public string ArticleTitle { get; set; }

        /// <summary>
        /// Article summary
        /// </summary>
        public string ArticleSummary { get; set; }

        /// <summary>
        /// Article URL
        /// </summary>
        public string ArticleUrl { get; set; }
    }
}
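The spider later in this article writes each model to a txt file as three labelled lines followed by a separator. As a side sketch, that formatting step could be kept in a small pure helper so it can be checked without running a crawl. The names `RankingRecord` and `RankingTextFormatter`, the English labels, and the short separator are all hypothetical; `RankingRecord` is only a self-contained stand-in for RecommendedRankingModel.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for RecommendedRankingModel so the sketch compiles alone.
public sealed class RankingRecord
{
    public string ArticleTitle { get; set; } = "";
    public string ArticleSummary { get; set; } = "";
    public string ArticleUrl { get; set; } = "";
}

// Hypothetical helper: turns records into the text that would be written to the file.
public static class RankingTextFormatter
{
    public const string Separator = "==========";

    public static string Format(IEnumerable<RankingRecord> records)
    {
        var sb = new StringBuilder();
        foreach (var r in records)
        {
            // One labelled line per property, then a separator line.
            sb.Append("Title: ").Append(r.ArticleTitle).Append('\n');
            sb.Append("Summary: ").Append(r.ArticleSummary).Append('\n');
            sb.Append("URL: ").Append(r.ArticleUrl).Append('\n');
            sb.Append(Separator).Append('\n');
        }
        return sb.ToString();
    }
}
```

The resulting string could then be passed to `File.WriteAllText("RecommendedRanking.txt", text)`, which keeps the file I/O in one place.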
Add RecommendedRankingSpider
All of the scraping business logic lives in this class.
using DotnetSpider.DataFlow.Parser;
using DotnetSpider.DataFlow;
using DotnetSpider.Downloader;
using DotnetSpider.Http;
using DotnetSpider.Scheduler.Component;
using DotnetSpider.Selector;
using DotnetSpider;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Serilog;
using DotnetSpider.Scheduler;
using Microsoft.Extensions.Hosting;
using System.Reflection;
// Needed explicitly when implicit usings are disabled in the project
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

namespace DotnetSpiderExercise
{
    public class RecommendedRankingSpider : Spider
    {
        public RecommendedRankingSpider(IOptions<SpiderOptions> options,
            DependenceServices services,
            ILogger<Spider> logger) : base(options, services, logger)
        {
        }

        public static async Task RunAsync()
        {
            var builder = Builder.CreateDefaultBuilder<RecommendedRankingSpider>();
            builder.UseSerilog();
            builder.UseDownloader<HttpClientDownloader>();
            builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
            await builder.Build().RunAsync();
        }

        protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
        {
            // Add the custom parser
            AddDataFlow(new Parser());
            // Use the console storage
            AddDataFlow(new ConsoleStorage());
            // Add the request to crawl: cnblogs 10-day recommended ranking
            await AddRequestsAsync(new Request("https://www.cnblogs.com/aggsite/topdiggs")
            {
                // 10-second request timeout
                Timeout = 10000
            });
        }

        class Parser : DataParser
        {
            public override Task InitializeAsync()
            {
                return Task.CompletedTask;
            }

            protected override Task ParseAsync(DataFlowContext context)
            {
                var recommendedRankingList = new List<RecommendedRankingModel>();

                // Parse the page: each <article class="post-item"> is one ranked article
                var number = 1;
                var recommendedList = context.Selectable.SelectList(Selectors.XPath(".//article[@class='post-item']"));
                foreach (var news in recommendedList)
                {
                    var articleTitle = news.Select(Selectors.XPath(".//a[@class='post-item-title']"))?.Value;
                    // Strip newlines and spaces from the summary text
                    var articleSummary = news.Select(Selectors.XPath(".//p[@class='post-item-summary']"))?.Value?.Replace("\n", "").Replace(" ", "");
                    var articleUrl = news.Select(Selectors.XPath(".//a[@class='post-item-title']/@href"))?.Value;

                    Console.WriteLine($"Article {number} title: {articleTitle}");

                    recommendedRankingList.Add(new RecommendedRankingModel
                    {
                        ArticleTitle = articleTitle,
                        ArticleSummary = articleSummary,
                        ArticleUrl = articleUrl
                    });
                    number++;
                }

                // Save the scraped articles to a txt file, one separated block per article
                using (StreamWriter sw = new StreamWriter("RecommendedRanking.txt"))
                {
                    foreach (RecommendedRankingModel model in recommendedRankingList)
                    {
                        string line = $"Title: {model.ArticleTitle}\r\nSummary: {model.ArticleSummary}\r\nURL: {model.ArticleUrl}";
                        sw.WriteLine(line + "\r\n ========================================================================================== \r\n");
                    }
                }

                return Task.CompletedTask;
            }
        }
    }
}
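To see what the three XPath expressions in the parser select without sending a request, the same queries can be tried offline. The sketch below does not use DotnetSpider; it uses the XPath extension methods from System.Xml.XPath on a small fragment. The markup in `SampleMarkup` is a hypothetical, simplified stand-in inferred from the XPath expressions above (the real page is not guaranteed to look like this), and `XPathSketch` and `Extract` are names made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathSketch
{
    // Hypothetical sample markup; well-formed XML so XDocument can parse it.
    public const string SampleMarkup =
        "<div id='post_list'>" +
        "<article class='post-item'>" +
        "<a class='post-item-title' href='https://example.com/p/1'>First post</a>" +
        "<p class='post-item-summary'> Summary one </p>" +
        "</article>" +
        "<article class='post-item'>" +
        "<a class='post-item-title' href='https://example.com/p/2'>Second post</a>" +
        "<p class='post-item-summary'>Summary two</p>" +
        "</article>" +
        "</div>";

    // Returns (title, summary, url) for every article node found in the markup.
    public static List<(string Title, string Summary, string Url)> Extract(string markup)
    {
        var root = XDocument.Parse(markup).Root
                   ?? throw new InvalidOperationException("Document has no root element.");

        var results = new List<(string Title, string Summary, string Url)>();
        foreach (var article in root.XPathSelectElements(".//article[@class='post-item']"))
        {
            var link = article.XPathSelectElement(".//a[@class='post-item-title']");
            var summary = article.XPathSelectElement(".//p[@class='post-item-summary']");

            results.Add((
                link?.Value ?? "",
                // Same cleanup as in the parser: remove newlines and spaces.
                (summary?.Value ?? "").Replace("\n", "").Replace(" ", ""),
                // The parser selects ".../@href"; here the attribute is read from the element.
                link?.Attribute("href")?.Value ?? ""));
        }
        return results;
    }
}
```

Calling `XPathSketch.Extract(XPathSketch.SampleMarkup)` yields one tuple per `article` element. Note that removing every space suits Chinese summaries but also joins English words together, which is worth keeping in mind if the target page contains English text.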
Run the scraper from Program
using System;
using System.Threading.Tasks;

namespace DotnetSpiderExercise
{
    public class Program
    {
        static async Task Main(string[] args)
        {
            Console.WriteLine("Web page data scraping started...");
            await RecommendedRankingSpider.RunAsync();
            Console.WriteLine("Web page data scraping finished...");
        }
    }
}
Comparing the scraped data with the page data
Scraped data
[Image]
Page data
[Image]
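Project source code
For more of the project's practical features, visit the project's open-source repository, and don't forget to give the project a Star.
- GitHub source: https://github.com/dotnetcore/DotnetSpider
- GitHub wiki: https://github.com/dotnetcore/DotnetSpider/wiki
- Source code for this article's example: https://github.com/YSGStudyHards/DotNetExercises/tree/master/DotnetSpiderExercise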