RAG crawler letting you down? Try Crawl4AI, an open-source, efficient smart crawler built for AI (with hands-on results)
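Recently, the sales team kept reporting the same problem: during customer demos, our AI system's knowledge-base crawler performed poorly. After entering a customer's web page URL, it often fetched nothing at all, so the knowledge base could not be updated. As the technical lead, I found this a headache at first. My understanding of crawlers was still stuck in the era of Scrapy and Selenium, tools I considered both complex and time-consuming, so I simply turned down the sales team's request. For a while the sales team regarded our crawler feature as close to useless, until I discovered a genuinely handy crawler tool: Crawl4AI.
Since we adopted Crawl4AI, the sales team reports that content we previously could not crawl is now handled with ease. It copes with dynamic content and anti-crawling mechanisms, and it can also use large language models to convert data into Markdown suited to AI processing. Today I'll walk through this powerful tool in detail.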
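Why choose Crawl4AI?
- Built for LLMs: the Markdown that Crawl4AI generates is optimized for RAG (retrieval-augmented generation) and fine-tuning applications, and is concise and smart.
- Lightning fast: Crawl4AI is 6x faster than traditional crawlers, delivering real-time, cost-efficient results.
- Flexible browser control: supports session management, proxies, and custom hooks for seamless data access.
- Smart extraction: uses advanced algorithms to reduce reliance on expensive models and improve extraction efficiency.
- Open source and deployable: fully open source, no API key required, with Docker and cloud integration.
- Active community support: backed by a vibrant developer community, with a continuously updated GitHub repository.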
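Core features
1. Markdown generation
- Clean Markdown: generates clearly structured, accurately formatted Markdown documents.
- Optimized Markdown: heuristic filtering removes noise and irrelevant parts, making the output easier for AI to process.
- Citations and references: automatically converts page links into a numbered reference list.
- Custom strategies: users can create their own Markdown generation strategies as needed.
- BM25 algorithm: applies BM25-based filtering to extract core information and remove irrelevant content.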
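To make the BM25 filtering idea concrete, here is a minimal pure-Python sketch. It is not Crawl4AI's implementation: the helper names (`tokenize`, `bm25_scores`, `filter_blocks`), the sample text blocks, and the 0.5 threshold are all assumptions chosen for illustration. It scores each text block against a query with the standard Okapi BM25 formula and keeps only the blocks that score high enough.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, blocks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per block for the given query."""
    docs = [tokenize(block) for block in blocks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of blocks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_blocks(query, blocks, threshold=0.5):
    """Keep blocks whose BM25 score reaches the threshold, in original order."""
    scores = bm25_scores(query, blocks)
    return [blk for blk, s in zip(blocks, scores) if s >= threshold]


# Hypothetical page blocks: two relevant ones and one piece of boilerplate.
blocks = [
    "Sentosa has free beaches and nature trails for visitors.",
    "Accept cookies to continue. Subscribe to our newsletter.",
    "Tanjong Beach is a quiet beach at the southern end of Sentosa.",
]
kept = filter_blocks("free beach Sentosa", blocks)
print(kept)
```

In this sketch the cookie banner shares no terms with the query, so it scores zero and is dropped, while the two content blocks are kept. A real filter would also need stemming and a tuned threshold.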
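2. Structured data extraction
- LLM-driven: supports all mainstream large language models (open source and proprietary) for structured data extraction.
- Chunking strategies: chunks content by topic, regular expression, or sentence so that extraction is precise.
- Cosine similarity: finds the content chunks relevant to a user query for semantic extraction.
- CSS-based extraction: fast schema-based data extraction using XPath and CSS selectors.
- Custom schemas: supports extracting structured JSON data from repeating patterns.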
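The chunking and cosine-similarity bullets above can be illustrated with a small self-contained sketch. It is an assumption-laden toy, not Crawl4AI's code: it splits text into sentence chunks with a regular expression and ranks them against a query using cosine similarity over plain word counts, where a production system would normally use embeddings. The function names and sample text are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Regex-based chunking: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(text):
    """Bag-of-words term counts; a simple stand-in for real embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, text, k=1):
    """Return the k sentence chunks most similar to the query."""
    q = vectorize(query)
    chunks = split_sentences(text)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical page text made of three short sentences.
text = (
    "Fort Siloso Skywalk is 11 storeys high. "
    "Quayside Isle is near Sentosa Cove. "
    "Tanjong Beach is quiet and remote."
)
best = top_chunks("how high is the skywalk", text)
print(best)
```

The first sentence shares three words with the query ("skywalk", "is", "high"), so it ranks above the other two, which share only "is".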
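3. Browser integration
- Managed browser: use your own browser with full control over the crawl, avoiding bot detection.
- Remote browser control: remote, large-scale data extraction via the Chrome DevTools Protocol.
- Browser profile management: create and manage persistent profiles that store authentication state, cookies, and settings.
- Proxy support: connects seamlessly to authenticated proxies for secure access.
- Multi-browser support: compatible with Chromium, Firefox, and WebKit.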
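4. Dynamic content handling
- JavaScript execution: executes JavaScript and waits for asynchronous or synchronous content to load, so dynamic content is captured in full.
- Lazy-load handling: waits for images to load fully so that no content is missed.
- Full-page scanning: simulates scrolling to load content, suited to infinite-scroll pages.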
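5. Deployment and scaling
- Dockerized setup: an optimized Docker image with an integrated FastAPI server for quick deployment.
- Cloud deployment: ready-to-use deployment configurations for mainstream cloud platforms, supporting large-scale production environments.
- Secure authentication: built-in JWT token authentication to protect the API.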
How to use Crawl4AI?
Installation
pip3 install crawl4ai # install the crawl4ai library
crawl4ai-setup # set up the browser
After installation, run crawl4ai-doctor to verify that it succeeded:
# crawl4ai-doctor
[INIT].... → Running Crawl4AI health check...
[INIT].... → Crawl4AI 0.5.0.post2
[TEST].... ? Testing crawling capabilities...
[EXPORT].. ? Exporting PDF and taking screenshot took 1.31s
[FETCH]... ↓ https://crawl4ai.com... | Status: True | Time: 4.04s
[SCRAPE].. ◆ https://crawl4ai.com... | Time: 0.06s
[COMPLETE] ● https://crawl4ai.com... | Status: True | Total: 4.10s
[COMPLETE] ● ? Crawling test passed!
If you run into any browser-related errors, you can run:
python -m playwright install --with-deps chromium
Running a simple crawler
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        run_config = CrawlerRunConfig(
            word_count_threshold=10,       # minimum number of words per content block
            exclude_external_links=True,   # remove external links
            remove_overlay_elements=True,  # remove popups/modals
            process_iframes=True           # process iframe content
        )
        result = await crawler.arun(
            url="https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
            config=run_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
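As you can see, the content Crawl4AI captures corresponds almost one-to-one with the original page: text, images, and links are all extracted completely and accurately. By comparison, a traditional Scrapy crawler often struggles with dynamic content and complex page structures.
Using an LLM for data extraction
Crawl4AI offers several data extraction strategies, including traditional CSS/XPath-based methods and LLM-based smart extraction. Here is an example that uses the LLM extraction strategy: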
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    LLMConfig,
    LLMExtractionStrategy,
)
from pydantic import BaseModel, Field

INSTRUCTION_TO_LLM = """Extract every title, its content, and the title's image link from the crawled content."""

class Sentosa(BaseModel):
    name: str = Field(..., description="Title")
    content: str = Field(..., description="Content")
    link: str = Field(..., description="Link")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="api_key"),  # replace "api_key" with your own key
    schema=Sentosa.model_json_schema(),
    extraction_type="schema",
    instruction=INSTRUCTION_TO_LLM,
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    input_format="markdown",
    extra_args={"temperature": 0.0, "max_tokens": 3600},
)

browser_cfg = BrowserConfig(headless=True, verbose=True)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        run_config = CrawlerRunConfig(
            word_count_threshold=10,       # minimum number of words per content block
            exclude_external_links=True,   # remove external links
            remove_overlay_elements=True,  # remove popups/modals
            process_iframes=True,          # process iframe content
            extraction_strategy=llm_strategy
        )
        result = await crawler.arun(
            url="https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
The output is as follows:
[
{
"name": "Top Free Things to Do in Sentosa",
"content": "Explore the best free activities and attractions in Sentosa, from beautiful beaches to scenic nature trails.",
"link": "https://www.sentosa.com.sg/en/get-inspired/sentosa-guides/top-free-things-to-do",
"error": false
},
{
"name": "Stroll along the Sentosa Boardwalk",
"content": "Before we get to do all the exciting stuff waiting at Sentosa, we have to get there first. And what better way to do that than to stroll along the Sentosa Boardwalk? With its picturesque view of the city backdrop across the sea, it’ll be a waste not to snap a picture and share it with all your friends to see!",
"link": "https://www.sentosa.com.sg/-/media/sentosa/article-listing/articles/13-free-things-to-do/13tipsgallery14.jpg?revision=622c8081-1f61-427f-a8f1-e4da1d457ca6",
"error": false
},
{
"name": "Chill at Tanjong Beach",
"content": "For those looking to have a relaxing time at the beach indulging in their favourite book and music, Tanjong Beach is definitely the place to get away from the hectic pace of city life. Its tranquil atmosphere is thanks to its remote location at the southern end of Sentosa beachfront coastline.",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/tanjong-beach/",
"error": false
},
{
"name": "Explore the southernmost tip of Asia",
"content": "For some adventuring at the beach, cross the suspension bridge at Palawan Beach to the southernmost tip of Asia. While you’re there, be sure to climb up the watchtower to enjoy a panoramic view of the South China sea as ships sail pass.",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/southernmost-point-of-continental-asia/",
"error": false
},
{
"name": "Play some beach sports at Siloso Beach",
"content": "The beach isn’t just for sightseeing and chilling, it’s where all sorts of people gather to play beach sports such as Beach Volleyball, Ultimate Frisbee and Football! So call up your friends and family and head down to Siloso Beach to engage in one of the most fun beach activities you can do!",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/siloso-beach/",
"error": false
},
{
"name": "Unleash and discover magical experiences at Sensoryscape",
"content": "Calling all my nature lovers, those who love scenic views or mesmerising interactive projections! Sensoryscape is the perfect place for you, and it's completely free too! So come on down for relaxing vibes during the day and stay for an enchanting night experience. As night falls, watch how this calming place transforms into the ImagiNite experience. Immerse yourself in the various sensory gardens like Symphony Streams enchanting underwater world, interactive projections at Palate Playground, dancing light beams at Lookout Loop, glowing giant flower stalks at Glow Garden and more. Don’t forget to download the ImagiNite App and witness the light shows and projections come to life! End your day in the most magical way possible at Sensoryscape where one can experience a blooming merge between vibrant reefs and lush ridges.",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/sensoryscape/",
"error": false
},
{
"name": "Walk along Fort Siloso Skywalk",
"content": "Singapore has its own fair share of well-known Skywalks such as the OCBC Skyway and Henderson Waves, but the tallest one yet is Fort Siloso Skywalk, at 11 storeys high! It boasts beautiful views of Western Sentosa, Mount Faber and Keppel Harbour. Be sure to take plenty of photos but please don’t drop your phones while doing so!",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/fort-siloso-skywalk/",
"error": false
},
{
"name": "Go back in time at Fort Siloso",
"content": "Singapore’s only preserved coastal fort is a treasure trove of WWII memorabilia. You can learn of its rich history by walking along its two trails: the Heritage Trail and the Gun Trail. Alternatively, you can go back in time with the Surrender Chambers immersive show where you get to relive Singapore’s epoch-making events.",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/fort-siloso/",
"error": false
},
{
"name": "Go hiking at Sentosa Nature Discovery",
"content": "Nature lovers are sure to love this particular activity because in this 1.8-kilometre trek through a rainforest, you’ll get to get up and close with the birds, insects, plants of Sentosa. If you’re really observant, there’s over 20 different species of birds and even other animals like geckos and squirrels to see!",
"link": "https://www.sentosa.com.sg/en/things-to-do/attractions/sentosa-nature-discovery/",
"error": false
},
{
"name": "Bring your date to Quayside Isle",
"content": "Situated near Sentosa Cove, taking a leisurely walk along Quayside Isle’s cobbled pavements in the evening is the perfect way to wrap up a date. The neatly arranged yachts also lend some charm to the scene of the setting sun. With its quiet atmosphere and scenic view, it’s a great way to unwind and end the day.",
"link": "",
"error": false
},
{
"name": "Experience free live music & events",
"content": "From outdoor movie nights to beach concerts, Sentosa is always alive with free entertainment. Keep an eye on Sentosa’s event calendar for upcoming music gigs, pop-up markets, and cultural performances.",
"link": "https://www.sentosa.com.sg/en/things-to-do/events/live-music-performance/",
"error": false
},
{
"name": "Hike through Sentosa’s nature trails",
"content": "Escape the crowds and explore Sentosa’s lush greenery with these hidden gems: Imbiah Trail – Spot unique wildlife and discover ancient rock formations. Coastal Trail – A scenic, seaside route with panoramic ocean views.",
"link": "",
"error": false
}
]
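Some items in the output above come back with an empty "link", so a post-processing step is usually worthwhile before the data goes into a knowledge base. The sketch below assumes that `result.extracted_content` is a JSON string shaped like the list above. The sample data is abridged from that output, and `split_by_link` is a hypothetical helper, not part of Crawl4AI.

```python
import json

# Abridged sample in the same shape as the output above; in a real run this
# string would come from result.extracted_content (an assumption of this sketch).
extracted_content = json.dumps([
    {
        "name": "Chill at Tanjong Beach",
        "content": "Tanjong Beach is definitely the place to get away.",
        "link": "https://www.sentosa.com.sg/en/things-to-do/attractions/tanjong-beach/",
        "error": False,
    },
    {
        "name": "Bring your date to Quayside Isle",
        "content": "A leisurely walk along Quayside Isle's cobbled pavements.",
        "link": "",
        "error": False,
    },
])


def split_by_link(raw_json):
    """Parse the JSON string and separate usable items from incomplete ones."""
    items = json.loads(raw_json)
    usable = [it for it in items if not it.get("error") and it.get("link")]
    incomplete = [it for it in items if it.get("error") or not it.get("link")]
    return usable, incomplete


usable, incomplete = split_by_link(extracted_content)
print([it["name"] for it in usable])
print([it["name"] for it in incomplete])
```

Items in `incomplete` could be logged for manual review or re-extracted with a more specific instruction, depending on how strict the knowledge base needs to be.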
Handling dynamic content
Crawl4AI can handle content that is loaded dynamically via JavaScript. Here is an example of configuring the crawler to execute JavaScript:
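import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # scroll to the bottom to trigger loading
        wait_for="css:.content-loaded"  # wait until an element matching this selector appears
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

asyncio.run(main())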
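Summary
Crawl4AI not only addresses the pain points of traditional crawler tools, but its smart, modular design also greatly improves the efficiency and accuracy of data collection. Whether it is handling dynamic content, dealing with anti-crawling mechanisms, or generating Markdown suited to AI processing, Crawl4AI handles it with ease. If crawlers are giving you a headache too, give Crawl4AI a try. I think you will be pleasantly surprised!
This article is reposted from the WeChat official account AI 博物院. Author: longyunfeigu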
