Using IronPython to Build a More Flexible Web Crawler
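For various reasons we often need to collect information from other websites. The relevant technology on .NET is mature: WebRequest fetches pages and supports both a custom Referer header and cookies. Pages are usually parsed with regular expressions, however, so whenever the target site's structure changes, the code has to be modified, recompiled, and redeployed.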
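With IronPython, the fetching and parsing logic can live in a Python script. If the target page structure changes, only the script needs editing, with no recompilation of the application. C# handles the interaction and UI, while Python wraps the parts that are expected to change often.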
After installing IronPython and VS.NET 2010, download SgmlReader (see the reference link). This component converts loosely formatted HTML into well-formed XML and can even add DTD validation.
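We use scraping a Baidu Tieba page as the example. Create a new Console project and reference the IronPython, Microsoft.Dynamic, Microsoft.Scripting, and SgmlReaderDll assemblies. Copy Html.dtd from SgmlReader into the project directory; without it, SgmlReader looks up the DTD on the network based on the doctype. Then create a file named baidu.py. Finally, add the following commands to the build events in the project properties so that both files are copied to the target directory: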
- copy $(ProjectDir)\*.py $(TargetDir)
- copy $(ProjectDir)\*.dtd $(TargetDir)
In baidu.py, first reference the required .NET assemblies:
- import clr, sys
- clr.AddReference("SgmlReaderDll")
- clr.AddReference("System.Xml")
Then import the classes we need:
- from Sgml import *
- from System.Net import *
- from System.IO import TextReader,StreamReader
- from System.Xml import *
- from System.Text.UnicodeEncoding import UTF8
Use SgmlReader to write a function that converts HTML to XML. Note that the SystemLiteral property must be set; otherwise SgmlReader fetches the DTD from the network, which wastes time.
- def fromHtml(textReader):
-     sgmlReader = SgmlReader()
-     # point at the local DTD so nothing is fetched from the network
-     sgmlReader.SystemLiteral = "html.dtd"
-     sgmlReader.WhitespaceHandling = WhitespaceHandling.All
-     sgmlReader.CaseFolding = CaseFolding.ToLower
-     sgmlReader.InputStream = textReader
-     doc = XmlDocument()
-     doc.PreserveWhitespace = True
-     doc.XmlResolver = None
-     doc.Load(sgmlReader)
-     return doc
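To see why the conversion step is needed at all, here is a minimal sketch in standard CPython (not IronPython, and not using SgmlReader). It only illustrates the idea of "well-formed": a strict XML parser rejects typical loose HTML such as an unclosed `<br>`. The helper name `is_well_formed` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical loose HTML: <br> is never closed, so a strict parser rejects it.
loose_html = "<p>line one<br>line two</p>"
# The same content after tidying: every tag is closed.
tidy_xml = "<p>line one<br/>line two</p>"

print(is_well_formed(loose_html))
print(is_well_formed(tidy_xml))
```

SgmlReader performs this kind of tidying for real pages, which is what lets the rest of the script use ordinary XML APIs.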
Use WebRequest to write a page-fetching function that supports cookies and page encodings:
- def getWebData(url, method, data = None, cookie = None, encoding = "UTF-8"):
-     req = WebRequest.Create(url)
-     req.Method = method
-     if cookie != None:
-         req.CookieContainer = cookie
-     if data != None:
-         # data is a byte array holding the request body
-         stream = req.GetRequestStream()
-         stream.Write(data, 0, data.Length)
-         stream.Close()
-     rsp = req.GetResponse()
-     # decode the response with the page's own encoding, e.g. "GBK"
-     reader = StreamReader(rsp.GetResponseStream(), UTF8.GetEncoding(encoding))
-     return reader
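The encoding parameter matters because Baidu Tieba pages are served as GBK rather than UTF-8. The following sketch, in standard CPython rather than IronPython, shows what goes wrong when the bytes are decoded with the wrong encoding. The function name `decode_page` and the sample text are assumptions made for the illustration.

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Decode response bytes, replacing anything the codec cannot decode."""
    return raw_bytes.decode(encoding, errors="replace")


text = "百度贴吧"
# Simulate the bytes a GBK-encoded page would send over the wire.
gbk_bytes = text.encode("gbk")

right = decode_page(gbk_bytes, "gbk")    # matches the original text
wrong = decode_page(gbk_bytes, "utf-8")  # garbled: wrong codec

print(right == text)
print(wrong == text)
```

This is why the C# host later passes "GBK" when it constructs the scraper.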
Write a class to hold the scrape results. This class does not need to be defined in the C# project; the C# 4.0 dynamic keyword can use it directly.
- class Post:
-     def __init__(self, hit, comments, title, link, author):
-         self.hit = hit
-         self.comments = comments
-         self.title = title
-         self.link = link
-         self.author = author
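Define the class that does the main work. __init__ is roughly the equivalent of a constructor; we pass in the encoding and initialize the cookie container and the list of parsed results. [] is a Python list, roughly equivalent to C#'s List.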
- class BaiDu:
-     def __init__(self, encoding):
-         self.cc = CookieContainer()
-         self.encoding = encoding
-         self.posts = []
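Next, define the scraping methods of the BaiDu class. getPosts calls getWebData to fetch the page and fromHtml to convert it to XML; the rest is ordinary XML handling, the same as in .NET, and should be self-explanatory.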
-     def getPosts(self, url):
-         reader = getWebData(url, "GET", None, self.cc, self.encoding)
-         doc = fromHtml(reader)
-         trs = doc.SelectNodes("html//table[@id='thread_list_table']/tbody/tr")
-         self.parsePosts(trs)
-     def parsePosts(self, trs):
-         for tr in trs:
-             tds = tr.SelectNodes("td")
-             hit = tds[0].InnerText
-             comments = tds[1].InnerText
-             # ChildNodes[1] is the link element inside the title cell
-             title = tds[2].ChildNodes[1].InnerText
-             # .Value gives the href string rather than the XmlAttribute object
-             link = tds[2].ChildNodes[1].Attributes["href"].Value
-             author = tds[3].InnerText
-             post = Post(hit, comments, title, link, author)
-             self.posts.append(post)
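The row-parsing logic can be tried without IronPython or a network connection. The sketch below uses CPython's xml.etree.ElementTree on a small hand-written sample; the sample markup, the `parse_posts` name, and the dictionary result are assumptions for illustration and do not reflect the real Tieba page. It looks up the link with find("a") instead of ChildNodes[1], since ElementTree does not expose whitespace text as child nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed sample shaped like the table the script expects.
SAMPLE = """<html><body>
<table id="thread_list_table"><tbody>
<tr><td>120</td><td>8</td><td><a href="/p/1">First thread</a></td><td>alice</td></tr>
<tr><td>45</td><td>2</td><td><a href="/p/2">Second thread</a></td><td>bob</td></tr>
</tbody></table>
</body></html>"""


def parse_posts(xml_text):
    """Extract one dictionary per table row, mirroring parsePosts."""
    root = ET.fromstring(xml_text)
    rows = root.findall(".//table[@id='thread_list_table']/tbody/tr")
    posts = []
    for tr in rows:
        tds = tr.findall("td")
        anchor = tds[2].find("a")  # the title link inside the third cell
        posts.append({
            "hit": tds[0].text,
            "comments": tds[1].text,
            "title": anchor.text,
            "link": anchor.get("href"),
            "author": tds[3].text,
        })
    return posts


for post in parse_posts(SAMPLE):
    print(post["title"], post["link"], post["author"])
```

If the site changes its table id or column order, only this kind of lookup code needs to change, which is the point of keeping it in a script.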
The C# code creates a script runtime environment, enables debugging, and executes baidu.py. Finally, it creates an instance of the BaiDu class and references that instance with the dynamic keyword.
- Dictionary<string, object> options = new Dictionary<string, object>();
- options["Debug"] = true;
- ScriptEngine engine = Python.CreateEngine(options);
- ScriptScope scope = engine.ExecuteFile("baidu.py");
- dynamic baidu = engine.Operations.Invoke(scope.GetVariable("BaiDu"), "GBK");
Next, call the method of the BaiDu Python class to fetch the scrape results, then print them:
- baidu.getPosts("http://tieba.baidu.com/f?kw=seo");
- dynamic posts = baidu.posts;
- foreach (dynamic post in posts)
- {
-     Console.WriteLine("{0} (replies: {1}) (views: {2}) [author: {3}]",
-         post.title,
-         post.comments,
-         post.hit,
-         post.author);
- }
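The host-side steps above (run a script file, pull a class out of the resulting scope, instantiate it) have a rough analog in standard CPython, which may help when experimenting with the idea outside .NET. This sketch uses runpy.run_path in place of the IronPython hosting API; the stub script, the `load_scraper` helper, and the `demo` function are assumptions made for the illustration, not part of the original project.

```python
import os
import runpy
import tempfile

# A stub standing in for baidu.py; only the constructor is reproduced.
SCRIPT = '''
class BaiDu(object):
    def __init__(self, encoding):
        self.encoding = encoding
        self.posts = []
'''


def load_scraper(script_path, encoding):
    """Run the script file, fetch its BaiDu class, and instantiate it."""
    scope = runpy.run_path(script_path)   # like engine.ExecuteFile
    scraper_class = scope["BaiDu"]        # like scope.GetVariable
    return scraper_class(encoding)        # like Operations.Invoke


def demo():
    # Write the stub to a temporary file so it can be loaded from disk.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(SCRIPT)
        path = f.name
    try:
        return load_scraper(path, "GBK")
    finally:
        os.remove(path)


scraper = demo()
print(scraper.encoding)
```

Because the script is read at run time, editing the file changes behavior on the next load without rebuilding the host.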
Original article: http://www.cnblogs.com/onlytiancai/archive/2011/02/22/1960859.html