Python crawler: an up-to-date Bilibili danmaku and comment scraper, with the Bingbing you asked for!

Author: 蘿卜大雜燴 2022-12-26 00:00:05

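I recently wanted to scrape the danmaku (bullet comments) and regular comments from Bilibili, and found that nearly every tutorial I could find online no longer works. Scraping and anti-scraping are an arms race, after all: each time one side improves, the other answers, and watching programmers at both ends of the network try to outwit each other is a show in its own right.

Of course, a scraper usually has a clear goal when it pulls data from a site. Mine, for example, is Bingbing. Enough talk, let's get started!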

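Getting the danmaku data

One note first: although the tutorials online no longer work as a whole, some of their steps are still useful. For example, they tell us that the danmaku data can be fetched from the following endpoint.

https://comment.bilibili.com/xxxx.xml

Opened in a browser, it looks like this:

The data is very clean, so the next step is to work out how to get the URL of this XML file, that is, how to obtain the ID 324768988.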
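
To make the endpoint concrete, here is a minimal sketch of building the URL and pulling the comment text out of the XML. It assumes each danmaku sits in a `<d>` element; the screenshot of the response is not reproduced in this text, so treat that layout as an assumption. The names `danmaku_url`, `parse_danmaku` and `SAMPLE` are made up for this illustration, and the attribute values in the sample are invented.

```python
import xml.etree.ElementTree as ET


def danmaku_url(cid):
    """Build the XML endpoint URL for a given numeric ID."""
    return "https://comment.bilibili.com/{}.xml".format(cid)


def parse_danmaku(xml_text):
    """Return the text of every <d> element (assumed to hold one danmaku each)."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]


# Hand-written sample that mimics the assumed layout; not real Bilibili data.
SAMPLE = """<i>
  <d p="1.5,1,25,16777215,0,0,abc,1">first comment</d>
  <d p="3.0,1,25,16777215,0,0,def,2">second comment</d>
</i>"""

print(danmaku_url(324768988))
print(parse_danmaku(SAMPLE))
```

With a real response, `parse_danmaku(response.content)` should work the same way, since `ET.fromstring` also accepts bytes.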

Next, we search the page's full source and find the following:

In other words, the ID we need can be read from a script tag, so let's write a function that extracts the script contents.

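import requests
from lxml import etree

# Both functions are methods of the spider class (note the self parameter)
def getHTML_content(self):
    # Fetch the video page and decode it to text
    response = requests.get(self.BVurl, headers=self.headers)
    html_str = response.content.decode()
    html = etree.HTML(html_str)
    # Re-serialize the parsed tree as a str so malformed markup is normalized
    result = etree.tostring(html, encoding="unicode")
    return result

def get_script_list(self, html_str):
    # Collect the text of every <script> element on the page
    html = etree.HTML(html_str)
    script_list = html.xpath("//script/text()")
    return script_list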

With all the script contents in hand, we can parse out the data we need.

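script_list = self.get_script_list(html_content)
# Parse the script data to find the cid
find_script_text = None
for script in script_list:
    if '[{"cid":' in script:
        find_script_text = script
if find_script_text is None:
    raise ValueError("no script containing a cid was found")
# The cid sits between '[{"cid":' and ',"page":'
final_text = find_script_text.split('[{"cid":')[1].split(',"page":')[0]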
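
As an alternative to chained `split` calls, the same marker can be matched with a regular expression, which also makes the "nothing found" case explicit. This is only a sketch: `extract_cid`, `CID_PATTERN` and the sample script strings are hypothetical, and the real page's script contents may differ.

```python
import re

# Matches the same '[{"cid":' marker and captures the digits that follow it.
CID_PATTERN = re.compile(r'\[\{"cid":(\d+)')


def extract_cid(script_list):
    """Return the first cid found in the script texts as a string, or None."""
    for script in script_list:
        match = CID_PATTERN.search(script)
        if match:
            return match.group(1)
    return None


# Made-up script contents for illustration only.
sample_scripts = [
    "window.foo = 1;",
    'window.state = {"pages":[{"cid":324768988,"page":1,"part":"demo"}]};',
]
print(extract_cid(sample_scripts))
```

Returning `None` instead of raising lets the caller decide whether a missing cid is an error.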

Finally, we wrap all of this code in a class, and the data collection for danmaku scraping is done.

spider = BiliSpider("BV16p4y187hc")
spider.run()

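The result:

Getting the comment data

Comment data is somewhat more involved, because it splits into main comments and the reply comments posted under them.

Using the browser's developer tools, we capture all the requests the page makes and search for reply, which gives the following:

Let's look at the main request first. Cleaned up and opened in a browser, it looks like this:

It can also be requested directly with requests.

Inspecting the response shows that the replies field holds the main comments, and that changing the next parameter in the URL pages through the results, returning different data each time.

Also note the rpid field, which is needed for the reply comments.

Now for the reply comments. They can likewise be fetched directly with requests, and the root parameter in the URL is the rpid mentioned above.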
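
The link between the two requests can be sketched as follows: each rpid from a main response becomes the root of a reply request. The parameter names (next, type, oid, mode, plat, pn, ps, root) are taken from the request URLs used in this article; the jsonp and _ parameters are left out on the assumption that they are optional, which has not been verified here. The helper names and the sample response are made up for illustration.

```python
from urllib.parse import urlencode

API_BASE = "https://api.bilibili.com/x/v2/reply"


def main_url(oid, next_page):
    """URL for one page of main comments; next_page drives pagination."""
    params = {"next": next_page, "type": 1, "oid": oid, "mode": 3, "plat": 1}
    return "{}/main?{}".format(API_BASE, urlencode(params))


def reply_url(oid, root_rpid, page=1, page_size=10):
    """URL for the replies under the main comment whose rpid is root_rpid."""
    params = {"pn": page, "type": 1, "oid": oid, "ps": page_size, "root": root_rpid}
    return "{}/reply?{}".format(API_BASE, urlencode(params))


def root_ids(main_response):
    """Collect the rpid of every main comment in a decoded JSON response."""
    replies = main_response.get("data", {}).get("replies") or []
    return [reply["rpid"] for reply in replies]


# Made-up response fragment with only the fields used here.
sample = {"data": {"replies": [{"rpid": 111}, {"rpid": 222}]}}
print(main_url(972516426, 0))
for rpid in root_ids(sample):
    print(reply_url(972516426, rpid))
```

Building the query string with `urlencode` keeps the parameters readable and avoids hand-editing long URLs.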

With these relationships sorted out, we can write the code:

import csv
import os
import time

import requests


def get_data(data):
    # Flatten each comment into [rpid, likes, user name, user level, message]
    data_list = []
    comment_data_list = data["data"]["replies"]
    for i in comment_data_list:
        data_list.append([i['rpid'], i['like'], i['member']['uname'], i['member']['level_info']['current_level'], i['content']['message']])
    return data_list


def save_data(data_type, data):
    # Append rows to <data_type>_data.csv, writing the header only for a new file
    path = data_type + "_data.csv"
    is_new_file = not os.path.exists(path)
    with open(path, "a+", encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        if is_new_file:
            # Columns: rpid, like count, user, level, comment text
            writer.writerow(["rpid", "點贊數量", "用戶", "等級", "評論內容"])
        # csv.writer quotes commas and line breaks, so they no longer corrupt rows
        writer.writerows(data)


for i in range(1000):
    url = "https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid=972516426&mode=3&plat=1&_=1632192192097".format(str(i))
    print(url)
    d = requests.get(url)
    data = d.json()
    if not data['data']['replies']:
        break  # no more main comments
    m_data = get_data(data)
    save_data("main", m_data)
    for j in m_data:
        # root is the rpid of the main comment; only the first page (pn=1) of replies is fetched
        reply_url = "https://api.bilibili.com/x/v2/reply/reply?jsonp=jsonp&pn=1&type=1&oid=972516426&ps=10&root={}&_=1632192668665".format(str(j[0]))
        print(reply_url)
        r = requests.get(reply_url)
        r_data = r.json()
        time.sleep(5)  # pause after every request
        if not r_data['data']['replies']:
            continue  # this comment has no replies; move on to the next one
        reply_data = get_data(r_data)
        save_data("reply", reply_data)
    time.sleep(5)

While it crawls:

And with that, we have scraped over a thousand comments for a single Bingbing video!

Visualization

Now let's do some simple visualization.

First, a look at the scraped data as a whole:

The data contains some null values, so let's handle those:

df_new = df.dropna(axis=0,subset = ["用戶"])

Now we can plot. GO!

pyecharts is still our first choice, since it's easy to write.

Comment popularity

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker

# Top 20 comments by like count ("點贊數量" = likes, "評論內容" = comment text)
df1 = df.sort_values(by="點贊數量", ascending=False).head(20)

c1 = (
    Bar()
    .add_xaxis(df1["評論內容"].to_list())
    .add_yaxis("點贊數量", df1["點贊數量"].to_list(), color=Faker.rand_color())
    .set_global_opts(
        # Title reads "Comment popularity Top20"
        title_opts=opts.TitleOpts(title="評論熱度Top20"),
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_="inside")],
    )
    .render_notebook()
)

Level distribution

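from pyecharts import options as opts
from pyecharts.charts import Pie

# Count users per level, highest level first ("等級" = level)
pie_data = df_new["等級"].value_counts().sort_index(ascending=False)
c2 = (
    Pie()
    .add(
        "",
        # Take the labels from the data itself so each count is paired with its own level
        [list(z) for z in zip(pie_data.index.astype(str).tolist(), pie_data.tolist())],
        radius=["40%", "75%"],
    )
    .set_global_opts(
        # Title reads "Level distribution"
        title_opts=opts.TitleOpts(title="等級分布"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    .render_notebook()
)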
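
One thing to watch with a pie like this is that each count has to be paired with its own level label, even when some levels have no users at all. A small pandas-free sketch of that pairing, using `collections.Counter` on made-up levels (the function name and sample values are hypothetical):

```python
from collections import Counter


def level_distribution(levels):
    """Return [label, count] pairs, highest level first, for the levels present."""
    counts = Counter(levels)
    return [[str(level), counts[level]] for level in sorted(counts, reverse=True)]


# Made-up levels; note that level 3 does not occur at all.
sample_levels = [6, 5, 5, 4, 6, 6, 2]
print(level_distribution(sample_levels))
# [['6', 3], ['5', 2], ['4', 1], ['2', 1]]
```

Because the labels come from the data, a missing level simply drops out rather than shifting the remaining counts onto the wrong labels.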

Comment word cloud

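import jieba
import numpy as np
from PIL import Image
from wordcloud import STOPWORDS, WordCloud

# Placeholder: the original never defines font; point it at a font file with Chinese glyphs
font = "path/to/chinese-font.ttf"

def wordcloud(data, name, pic=None):
    # Join all comments into one string (str(data) on a Series gives a truncated repr)
    text = ' '.join(str(item) for item in data)
    words = ' '.join(jieba.cut(text, cut_all=False))
    # Use the image as a mask only when one is given
    img_array = np.array(Image.open(pic)) if pic else None
    wc = WordCloud(width=2000, height=1800, background_color='white', font_path=font, mask=img_array,
                   stopwords=STOPWORDS, contour_width=3, contour_color='steelblue')
    wc.generate(words)
    wc.to_file(name + '.png')

# "評論內容" = comment text
wordcloud(df_new["評論內容"], "冰冰", '1.PNG')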

That's it for today's post. If you like Bingbing, give it a like!

Editor: 武曉燕 Source: 蘿卜大雜燴
