Scraping Mafengwo travel data with Python to find out where to go this summer

Author: Xu Lin, 2018-08-15 08:52:49


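It is the height of the summer holidays, and my WeChat Moments feed has been flooded with everyone's travel footprints. I am genuinely amazed by the friends who have visited nearly every province in the country. That gave me the idea of writing something travel-related. The data for this piece comes from a travel-guide site that is very friendly to crawlers: Mafengwo.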


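1. Getting the city IDs

Every city, attraction, and other item on Mafengwo has its own five-digit numeric ID. Our first step is to collect the IDs of the cities (municipalities plus prefecture-level cities) so we can use them in the later analysis.

The city IDs come from two pages: the destination page and each province's city-list page. We first get the province codes from the destination page, then go into each province's city list to get the city codes.

The city lists are loaded dynamically, so this step needs Selenium. Part of the code is as follows: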

import time
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

def find_cat_url(url):
    # Fetch the destination page and return the absolute URL of every province link.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = Request(url, headers=headers)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    bs = bsObj.find('div', attrs={'class': 'hot-list clearfix'}).find_all('dt')
    cat_url = []
    for dt in bs:
        for a in dt.find_all('a'):
            cat_url.append('http://www.mafengwo.cn' + a.attrs['href'])
    return cat_url

def find_city_url(url_list):
    # Page through each province's city list and collect city names and IDs.
    city_name_list = []
    city_url_list = []
    for cat_url in url_list:
        driver = webdriver.Chrome()
        driver.maximize_window()
        url = cat_url.replace('travel-scenic-spot/mafengwo', 'mdd/citylist')
        driver.get(url)
        while True:
            try:
                time.sleep(2)
                bs = BeautifulSoup(driver.page_source, 'html.parser')
                # '目的地' means "destination"; it must match the page markup exactly.
                url_set = bs.find_all('a', attrs={'data-type': '目的地'})
                city_name_list += [a.text.replace('\n', '').split()[0] for a in url_set]
                city_url_list += [a.attrs['data-id'] for a in url_set]
                # Scroll down so the "next page" button is in view, then click it.
                js = "var q=document.documentElement.scrollTop=800"
                driver.execute_script(js)
                time.sleep(2)
                # Selenium 4 API; older versions used find_element_by_class_name.
                driver.find_element(By.CLASS_NAME, 'pg-next').click()
            except Exception:
                # No "next page" button left (or the click failed): stop paging.
                break
        driver.quit()
    return city_name_list, city_url_list

url = 'http://www.mafengwo.cn/mdd/'
url_list = find_cat_url(url)
city_name_list, city_url_list = find_city_url(url_list)
city = pd.DataFrame({'city': city_name_list, 'id': city_url_list})
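
To make the extraction step easier to follow without a browser, here is a minimal sketch that pulls (name, ID) pairs out of a static HTML fragment. It uses only the standard library's HTMLParser instead of BeautifulSoup, and the class name CityLinkParser, the helper parse_cities, and the sample fragment with its IDs are all made up for illustration; the real page markup may differ.

```python
from html.parser import HTMLParser


class CityLinkParser(HTMLParser):
    """Collect (name, id) pairs from <a data-type="目的地" data-id="..."> links."""

    def __init__(self):
        super().__init__()
        self.cities = []          # finished (name, id) pairs
        self._current_id = None   # data-id of the link being read, if any
        self._text = []           # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("data-type") == "目的地":
            self._current_id = attrs.get("data-id")
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            words = "".join(self._text).split()
            if words:
                # Keep only the first token of the link text, like the scraper does.
                self.cities.append((words[0], self._current_id))
            self._current_id = None


def parse_cities(page_source):
    parser = CityLinkParser()
    parser.feed(page_source)
    return parser.cities


# Hypothetical fragment: names and IDs are invented for this example.
SAMPLE = """
<a data-type="目的地" data-id="10001">
  Alpha
  <span>alpha</span>
</a>
<a data-type="其他" data-id="99999">Ignored</a>
<a data-type="目的地" data-id="10002">Beta</a>
"""

print(parse_cities(SAMPLE))
```

Links whose data-type is anything other than '目的地' are skipped, which is the same filter the Selenium loop applies to each page it loads.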

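2. Getting city information

The data for each city comes from the following pages:

(a) The food page

(b) The attractions page

(c) The tags page

We wrap the data collection for each city in a function and pass in the city ID obtained earlier. Part of the code is as follows: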

import pandas as pd

# get_static_url_content(url) is not shown in this excerpt; it is expected to
# fetch a static page and return it as a parsed BeautifulSoup object.

def get_city_info(city_name, city_code):
    # Collect base statistics, top attractions and top foods for one city.
    this_city_base = get_city_base(city_name, city_code)
    this_city_jd = get_city_jd(city_name, city_code)
    this_city_jd['city_name'] = city_name
    this_city_jd['total_city_yj'] = this_city_base['total_city_yj']
    try:
        this_city_food = get_city_food(city_name, city_code)
        this_city_food['city_name'] = city_name
        this_city_food['total_city_yj'] = this_city_base['total_city_yj']
    except Exception:
        # The food ranking could not be parsed for this city: use an empty frame.
        this_city_food = pd.DataFrame()
    return this_city_base, this_city_food, this_city_jd

def get_city_base(city_name, city_code):
    # Tags page: tag names, tag counts, and the category prefix of each tag link.
    url = 'http://www.mafengwo.cn/xc/' + str(city_code) + '/'
    bsObj = get_static_url_content(url)
    tag_box = bsObj.find('div', {'class': 'm-tags'}).find('div', {'class': 'bd'})
    node = tag_box.find_all('a')
    tag = [k.text.split()[0] for k in node]
    tag_count = [int(k.text) for k in tag_box.find_all('em')]
    par = [k.attrs['href'][1:3] for k in node]  # e.g. '/jd/...' -> 'jd'
    tag_all_count = sum(tag_count)
    tag_jd_count = sum(tag_count[i] for i in range(len(tag_count)) if par[i] == 'jd')
    tag_cy_count = sum(tag_count[i] for i in range(len(tag_count)) if par[i] == 'cy')
    tag_gw_yl_count = sum(tag_count[i] for i in range(len(tag_count)) if par[i] in ['gw', 'yl'])
    # Travel notes page: total number of travel notes (yj) for the city.
    # The original URL ended with a stray space, removed here.
    url = 'http://www.mafengwo.cn/yj/' + str(city_code) + '/2-0-1.html'
    bsObj = get_static_url_content(url)
    total_city_yj = int(bsObj.find('span', {'class': 'count'}).find_all('span')[1].text)
    return {'city_name': city_name, 'tag_all_count': tag_all_count, 'tag_jd_count': tag_jd_count,
            'tag_cy_count': tag_cy_count, 'tag_gw_yl_count': tag_gw_yl_count,
            'total_city_yj': total_city_yj}

def get_city_food(city_name, city_code):
    # Food page: ranked foods and their mention counts.
    url = 'http://www.mafengwo.cn/cy/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    rank = bsObj.find('ol', {'class': 'list-rank'})
    food = [k.text for k in rank.find_all('h3')]
    food_count = [int(k.text) for k in rank.find_all('span', {'class': 'trend'})]
    return pd.DataFrame({'food': food[0:len(food_count)], 'food_count': food_count})

def get_city_jd(city_name, city_code):
    # Attractions page: top attractions and their review counts.
    url = 'http://www.mafengwo.cn/jd/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    node = bsObj.find('div', {'class': 'row-top5'}).find_all('h3')
    jd = [k.text.split('\n')[2] for k in node]
    node = bsObj.find_all('span', {'class': 'rev-total'})
    # Strip the ' 条点评' ("reviews") suffix before converting to int.
    jd_count = [int(k.text.replace(' 条点评', '')) for k in node]
    return pd.DataFrame({'jd': jd[0:len(jd_count)], 'jd_count': jd_count})
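
The functions above each build their own URL from the city ID. As a small sketch of one way to keep those four URL patterns in one place, the helper below returns them as a dictionary. The names BASE_URL and city_page_urls are hypothetical, and 12345 is a made-up ID, not a real city.

```python
BASE_URL = "http://www.mafengwo.cn"


def city_page_urls(city_code):
    """Return the per-city page URLs that the scraping functions request."""
    code = str(city_code)
    return {
        "tags": f"{BASE_URL}/xc/{code}/",
        "travel_notes": f"{BASE_URL}/yj/{code}/2-0-1.html",
        "food": f"{BASE_URL}/cy/{code}/gonglve.html",
        "attractions": f"{BASE_URL}/jd/{code}/gonglve.html",
    }


# 12345 is a placeholder ID used only for illustration.
for name, page_url in city_page_urls(12345).items():
    print(name, page_url)
```

Building the strings in a single function makes it harder for stray whitespace or a typo to slip into just one of them.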

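3. Data analysis

PART 1: City data

First, let's look at the top 10 cities by number of travel notes:

The top 10 by travel-note count largely matches the popular cities we already know. Next, we use each city's travel-note count to build a heat map of travel destinations across the country:

Does this look familiar? If the footprint map you posted on Moments closely matches this figure, then Mafengwo's data agrees with your own experience.

Finally, let's see what impressions people have of each city. The method is to extract the attributes from the tags. We divide them into three groups (leisure, food, and attractions) and look at the cities that left the deepest impression in each group:

For Mafengwo users, Xiamen clearly leaves a very deep impression: it has plenty of travel notes, and a large number of usable tags can be extracted from them. Chongqing, Xi'an, and Chengdu, unsurprisingly, left a deep impression on food lovers. Part of the code is as follows: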

from pyecharts import Bar, Grid  # pyecharts 0.5.x API

# city_aggregate (built elsewhere) holds one row per city with the
# cy_point, jd_point and xx_point tag scores.
top_cy = city_aggregate.sort_values('cy_point', ascending=False)[0:15]
top_jd = city_aggregate.sort_values('jd_point', ascending=False)[0:15]
top_xx = city_aggregate.sort_values('xx_point', ascending=False)[0:15]

bar1 = Bar("Food tag ranking")
bar1.add("Food tag score", top_cy['city_name'], top_cy['cy_point'],
         is_splitline_show=False, xaxis_rotate=30)
bar2 = Bar("Attraction tag ranking", title_top="30%")
bar2.add("Attraction tag score", top_jd['city_name'], top_jd['jd_point'],
         legend_top="30%", is_splitline_show=False, xaxis_rotate=30)
bar3 = Bar("Leisure tag ranking", title_top="67.5%")
bar3.add("Leisure tag score", top_xx['city_name'], top_xx['xx_point'],
         legend_top="67.5%", is_splitline_show=False, xaxis_rotate=30)

# Stack the three charts vertically in one 800px-high grid.
grid = Grid(height=800)
grid.add(bar1, grid_bottom="75%")
grid.add(bar2, grid_bottom="37.5%", grid_top="37.5%")
grid.add(bar3, grid_top="75%")
grid.render('city_tag_ranking.html')
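
The grouping of tags into leisure, food, and attractions rests on the two-letter prefix of each tag link. The sketch below shows that idea on plain Python data. The function name group_tag_counts and the sample tags are made up, and treating the combined 'gw' and 'yl' prefixes as the "leisure" group is our assumption about how the three groups map onto the prefixes, not something the excerpt states.

```python
from collections import defaultdict

# Two-letter URL prefix -> group name.
# Assumption: 'gw' and 'yl' together form the "leisure" group.
PREFIX_TO_GROUP = {"jd": "attractions", "cy": "food", "gw": "leisure", "yl": "leisure"}


def group_tag_counts(tags):
    """Sum tag counts per group. `tags` is a list of (href, count) pairs."""
    totals = defaultdict(int)
    for href, count in tags:
        prefix = href[1:3]  # '/jd/...' -> 'jd'
        totals["all"] += count
        group = PREFIX_TO_GROUP.get(prefix)
        if group is not None:
            totals[group] += count
    return dict(totals)


# Made-up tags for illustration only.
sample_tags = [
    ("/jd/10001/", 120),
    ("/cy/10001/", 80),
    ("/gw/10001/", 30),
    ("/yl/10001/", 10),
    ("/ot/10001/", 5),   # unknown prefix: counted in "all" only
]
print(group_tag_counts(sample_tags))
```

Tags with an unrecognized prefix still count toward the overall total, so each group's share of all tags can be computed afterwards.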

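PART 2: Attraction data

We extracted the number of reviews for each attraction and compared it with the travel-note count of its city. This gives an absolute and a relative value for each attraction's reviews, from which we compute two scores: popularity and representativeness. The top 15 attractions are as follows:

Mafengwo users really do have a soft spot for Xiamen: Gulangyu is the most popular attraction. In terms of city representativeness, Xitang Ancient Town and Yamdrok Lake rank near the top. If you worry that the attractions in the upper chart will be too crowded this summer, try the lower chart instead for scenic places with fewer visitors.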
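
The article does not give the exact scoring formulas, so the following is only a sketch of one plausible reading. Popularity is taken to be the review count, and representativeness the review count divided by the city's travel-note count, each scaled so that the largest value becomes 100. The function name score_attractions, the scaling, and the sample numbers are all assumptions; only the column names (jd, jd_count, total_city_yj, rq_point, db_point) come from the code in this article.

```python
def score_attractions(rows):
    """Add 'rq_point' (popularity) and 'db_point' (representativeness) to each row.

    Each row is a dict with 'jd', 'jd_count' and 'total_city_yj' (assumed non-zero).
    Scaling by the maximum is an assumption, not the author's formula.
    """
    ratios = [r["jd_count"] / r["total_city_yj"] for r in rows]
    max_count = max(r["jd_count"] for r in rows)
    max_ratio = max(ratios)
    scored = []
    for row, ratio in zip(rows, ratios):
        scored.append({
            **row,
            "rq_point": 100 * row["jd_count"] / max_count,  # absolute value
            "db_point": 100 * ratio / max_ratio,            # relative value
        })
    return scored


# Made-up numbers for illustration only.
rows = [
    {"jd": "Spot A", "jd_count": 5000, "total_city_yj": 20000},
    {"jd": "Spot B", "jd_count": 1000, "total_city_yj": 2000},
    {"jd": "Spot C", "jd_count": 2500, "total_city_yj": 10000},
]
scored = score_attractions(rows)
by_popularity = sorted(scored, key=lambda r: r["rq_point"], reverse=True)
by_representativeness = sorted(scored, key=lambda r: r["db_point"], reverse=True)
print([r["jd"] for r in by_popularity])
print(by_representativeness[0]["jd"])
```

Under this reading, an attraction in a little-visited city can top the representativeness ranking while sitting low on popularity, which is the contrast between the two charts.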

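PART 3: Food data

Finally, let's look at the data everyone cares about most: food. The processing is similar to the attraction data in PART 2. We look at the most popular snacks and at the snacks most representative of their cities.

Surprisingly, Mafengwo users' love for Xiamen runs so deep that shacha noodles beat hot pot, roast duck, and roujiamo to become the most popular snack.

In terms of city representativeness, seafood shows up very often, which matches what everyone (well, I) would expect. Part of the code for PART 2 and PART 3 is as follows: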

from pyecharts import Bar, Grid  # pyecharts 0.5.x API

# city_jd_com (built elsewhere) holds one row per attraction with its
# rq_point (popularity) and db_point (representativeness) scores.
top_rq = city_jd_com.sort_values('rq_point', ascending=False)[0:15]
top_db = city_jd_com.sort_values('db_point', ascending=False)[0:15]

bar1 = Bar("Attraction popularity ranking")
bar1.add("Attraction popularity score", top_rq['jd'], top_rq['rq_point'],
         is_splitline_show=False, xaxis_rotate=30)
bar2 = Bar("Attraction representativeness ranking", title_top="55%")
bar2.add("Attraction representativeness score", top_db['jd'], top_db['db_point'],
         is_splitline_show=False, xaxis_rotate=30, legend_top="55%")

# Popularity chart on top, representativeness chart below.
grid = Grid(height=800)
grid.add(bar1, grid_bottom="60%")
grid.add(bar2, grid_top="60%", grid_bottom="10%")
grid.render('attraction_ranking.html')

All the code used in this article is on GitHub. Feel free to grab it:

http://github.com/shujusenlin/mafengwo_data

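About the author: Xu Lin writes a Zhihu column under the same name and currently works at the Vipshop Product Technology Center in Shanghai. A Columbia University statistics graduate and self-described "data dog", he works on data mining and analysis and enjoys using R and Python to play with unusual data.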

Editor: Wei Liyan. Source: Shuju Senlin (數據森麟)
