
Python Data Visualization: A Look at Data Analyst Jobs

Author: 法纳斯特 2018-12-03 16:50:23


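This post analyzes data analyst job postings from BOSS直聘 (BOSS Zhipin) and 拉勾网 (Lagou) to get a picture of the field and of the skills a data analysis job requires.

Learning with attitude

Frankly, pyspider really is an excellent crawler framework; it lets us scrape a page quickly and conveniently.

That convenience comes with limits, though: complex pages are hard to crawl.

For this crawl, pyspider worked for BOSS Zhipin. It did not work for Lagou, because Lagou loads its data via Ajax.

The URL that Lagou requests job data from never changes. What changes is the form data, which varies with the page number, and the request method is POST. I found no way to loop over the pages from inside pyspider.

Perhaps I simply don't know pyspider well enough yet. So in the end I crawled Lagou the usual way, with a script written in PyCharm.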


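1. Page analysis

From the BOSS Zhipin index pages I collect the job title, salary, location, years of experience and education requirement, plus the company's name, type, status (funding stage) and size.

Originally I wanted to parse the detail pages as well, which would also give the job description and the required skills.

That takes too many requests, so I gave up. There are 10 index pages with 30 jobs each, and every detail page needs its own request, which adds up to 300 requests.

By the second page (60 requests) I was already getting a "requests too frequent" warning.

Fetching only the index pages takes just 10 requests, which is basically no problem. I also didn't want to fiddle with proxy IPs, so I kept things simple.

When I collect the data for data mining positions later, I'll see whether slowing the requests down gets through.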
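Slowing the requests down could look roughly like the sketch below. This is not the author's code: `fetch_politely`, its parameters and the delay values are assumptions made for illustration, and `fetch` stands for whatever function actually issues the request.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch, min_delay and max_delay are illustrative assumptions; tune the
    delays to whatever the target site tolerates.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the helper easy to try out without actually waiting; whether a given delay is enough to avoid the warning can only be found by experiment.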

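From the Lagou index pages I collect the job title, location, salary, years of experience and education requirement, the company's name, type, status and size, and also the job skills and job benefits.

The page uses Ajax requests, so I wrote the code in PyCharm, which is familiar ground.

2. Data collection

01 Getting BOSS Zhipin data with pyspider

Installing pyspider is easy: run pip3 install pyspider on the command line.

I had not yet installed PhantomJS, which pyspider uses to handle JavaScript-rendered pages.

So I downloaded its exe file from the website and put it in the folder that contains the Python exe.

Finally, run pyspider all on the command line to start pyspider.

Open http://localhost:5000/ in a browser, create a project, give it a name and enter the start URL; the result is shown in the figure below.

Then write the code in pyspider's script editor, using the feedback shown on the left to correct it.

The full script is as follows.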

#!/usr/bin/env python 
# -*- encoding: utf-8 -*- 
# Project: BOSS 
 
from pyspider.libs.base_handler import * 
import pymysql 
import random 
import time 
import re 
 
count = 0 
 
class Handler(BaseHandler): 
    # Add request headers; without them the site returns a 403 error
    crawl_config = {'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}}

    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='boss_job', charset='utf8mb4')

    def add_Mysql(self, id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people):
        # Write one record to the database
        try:
            cursor = self.db.cursor()
            # Parameterized query: pymysql escapes the values, so quotes in the scraped text cannot break the statement
            sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
            print(sql)
            cursor.execute(sql, (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people))
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()
 
    @every(minutes=24 * 60) 
    def on_start(self): 
        # For HTTPS (encrypted) requests, pass validate_cert=False; otherwise pyspider reports a 599/SSL error
        self.crawl('https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&scity=100010000&industry=&position=', callback=self.index_page, validate_cert=False) 
 
    @config(age=10 * 24 * 60 * 60) 
    def index_page(self, response): 
        time.sleep(random.randint(2, 5))
        for i in response.doc('li > div').items():
            # Use the module-level counter as the record id
            global count
            count += 1
            # Job title
            job_title = i('.job-title').text()
            print(job_title)
            # Salary
            job_salary = i('.red').text()
            print(job_salary)
            # Location
            city_result = re.search('(.*?)<em class=', i('.info-primary > p').html())
            job_city = city_result.group(1).split(' ')[0]
            print(job_city)
            # Required experience
            experience_result = re.search('<em class="vline"/>(.*?)<em class="vline"/>', i('.info-primary > p').html())
            job_experience = experience_result.group(1)
            print(job_experience)
            # Required education: what remains after removing location and experience
            job_education = i('.info-primary > p').text().replace(' ', '').replace(city_result.group(1).replace(' ', ''), '').replace(experience_result.group(1).replace(' ', ''),'')
            print(job_education)
            # Company name
            company_name = i('.info-company a').text()
            print(company_name)
            # Company type
            company_type_result = re.search('(.*?)<em class=', i('.info-company p').html())
            company_type = company_type_result.group(1)
            print(company_type)
            # Company status (funding stage)
            company_status_result = re.search('<em class="vline"/>(.*?)<em class="vline"/>', i('.info-company p').html())
            if company_status_result:
                company_status = company_status_result.group(1)
            else:
                # Stored value meaning "no information"
                company_status = '无信息'
            print(company_status)
            # Company size
            company_people = i('.info-company p').text().replace(company_type, '').replace(company_status,'')
            print(company_people + '\n')
            # Write the record to the database
            self.add_Mysql(count, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people)
        # Follow the link to the next page, if there is one
        next_url = response.doc('.next').attr.href
        if next_url and next_url != 'javascript:;':
            self.crawl(next_url, callback=self.index_page, validate_cert=False)
        else:
            print("The Work is Done")
        # Detail-page crawling; disabled because of the request limit
        #for each in response.doc('.name > a').items():
            #url = each.attr.href
            #self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)
 
    @config(priority=2) 
    def detail_page(self, response): 
        # Detail-page parsing; unused because of the request limit
        message_job = response.doc('div > .info-primary > p').text()
        # The patterns match the site's Chinese labels for city, experience and education
        city_result = re.findall('城市：(.*?)经验', message_job)
        experience_result = re.findall('经验：(.*?)学历', message_job)
        education_result = re.findall('学历：(.*)', message_job)

        message_company = response.doc('.info-company > p').text().replace(response.doc('.info-company > p > a').text(),'')
        status_result = re.findall(r'(.*?)\d', message_company.split(' ')[0])
        people_result = message_company.split(' ')[0].replace(status_result[0], '') 
 
        return { 
            "job_title": response.doc('h1').text(), 
            "job_salary": response.doc('.info-primary .badge').text(), 
            "job_city": city_result[0], 
            "job_experience": experience_result[0], 
            "job_education": education_result[0], 
            "job_skills": response.doc('.info-primary > .job-tags > span').text(), 
            "job_detail": response.doc('div').filter('.text').eq(0).text().replace('\n', ''), 
            "company_name": response.doc('.info-company > .name > a').text(), 
            "company_status": status_result[0], 
            "company_people": people_result, 
            "company_type": response.doc('.info-company > p > a').text(), 
        }

The BOSS Zhipin data analyst postings collected this way are shown below.

02 Getting Lagou data with PyCharm

import requests 
import pymysql 
import random 
import time 
import json 
 
count = 0 
# Request URL and request headers
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Cookie': 'your Cookie value',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Connection': 'keep-alive',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=sug&fromSearch=true&suginput=shuju'
}
 
# Connect to the database
db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='lagou_job', charset='utf8mb4')


def add_Mysql(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare):
    # Write one record to the database
    try:
        cursor = db.cursor()
        # Parameterized query: pymysql escapes the values, so quotes in the scraped text cannot break the statement
        sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
        print(sql)
        cursor.execute(sql, (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare))
        print(cursor.lastrowid)
        db.commit()
    except Exception as e:
        print(e)
        db.rollback()
 
 
def get_message(): 
    for i in range(1, 31):
        print('Page ' + str(i))
        time.sleep(random.randint(10, 20))
        data = {
            'first': 'false',
            'pn': i,
            # Search keyword ("data analysis"); it stays in Chinese because the site expects it
            'kd': '数据分析'
        }
        response = requests.post(url=url, data=data, headers=headers)
        result = json.loads(response.text)
        job_messages = result['content']['positionResult']['result']
        for job in job_messages:
            global count
            count += 1
            # Job title
            job_title = job['positionName']
            print(job_title)
            # Salary
            job_salary = job['salary']
            print(job_salary)
            # Location
            job_city = job['city']
            print(job_city)
            # Required experience
            job_experience = job['workYear']
            print(job_experience)
            # Required education
            job_education = job['education']
            print(job_education)
            # Company name
            company_name = job['companyShortName']
            print(company_name)
            # Company type
            company_type = job['industryField']
            print(company_type)
            # Company status (funding stage)
            company_status = job['financeStage']
            print(company_status)
            # Company size
            company_people = job['companySize']
            print(company_people)
            # Job skills
            if len(job['positionLables']) > 0:
                job_tips = ','.join(job['positionLables'])
            else:
                job_tips = 'None'
            print(job_tips)
            # Job benefits
            job_welfare = job['positionAdvantage']
            print(job_welfare + '\n\n')
            # Write the record to the database
            add_Mysql(count, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people, job_tips, job_welfare)
 
 
if __name__ == '__main__': 
    get_message()

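The Lagou data analyst postings collected this way are shown below.

The databases used here were created separately beforehand. I have done that many times before, so I won't paste that code or go into detail.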
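For readers who have not set up such a table before, the sketch below shows one way the `job` table for the Lagou data could be defined. The column names come from the Lagou insert statement; the column types and lengths, the `JOB_COLUMNS` list and the `build_create_table_sql` helper are guesses made for illustration, not the author's schema.

```python
# Column names follow the Lagou insert statement; the types are assumptions.
JOB_COLUMNS = [
    ('id', 'INT PRIMARY KEY'),
    ('job_title', 'VARCHAR(255)'),
    ('job_salary', 'VARCHAR(64)'),
    ('job_city', 'VARCHAR(64)'),
    ('job_experience', 'VARCHAR(64)'),
    ('job_education', 'VARCHAR(64)'),
    ('company_name', 'VARCHAR(255)'),
    ('company_type', 'VARCHAR(255)'),
    ('company_status', 'VARCHAR(64)'),
    ('company_people', 'VARCHAR(64)'),
    ('job_tips', 'VARCHAR(255)'),
    ('job_welfare', 'VARCHAR(255)'),
]


def build_create_table_sql(table, columns):
    """Return a MySQL CREATE TABLE statement for (name, type) pairs."""
    body = ',\n    '.join('{} {}'.format(name, sql_type) for name, sql_type in columns)
    return 'CREATE TABLE IF NOT EXISTS {} (\n    {}\n) DEFAULT CHARSET=utf8mb4'.format(table, body)
```

The resulting statement could be run once with `cursor.execute(build_create_table_sql('job', JOB_COLUMNS))` on a pymysql connection before starting the crawler. The BOSS Zhipin table would be the same minus the last two columns.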

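3. Data visualization

01 City distribution map

This shows where the jobs are. Most positions are in the eastern regions, with some in central China.

02 City distribution heat map

Beijing-Tianjin-Hebei, the Yangtze River Delta and the Pearl River Delta are about equally dense, and the Chengdu-Chongqing area also has a little demand.

The four first-tier cities, Beijing, Shanghai, Guangzhou and Shenzhen, account for most of the openings.

03 Salary by work experience

Judging by the quartiles and medians in the box plot, salary rises steadily with years of experience.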
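A box plot like this needs numeric salaries, whereas the crawlers store strings. The sketch below assumes salaries look like `15k-25k` (a format assumed here, not confirmed by the article), converts each one to the midpoint of its range and computes the quartiles per experience level. The function names and the idea of using the midpoint are illustrative choices, not the author's method.

```python
import re
from collections import defaultdict
from statistics import quantiles


def salary_midpoint(text):
    """Return the midpoint, in thousands, of a string such as '15k-25k', or None."""
    match = re.fullmatch(r'(\d+)[kK]-(\d+)[kK]', text.strip())
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2


def quartiles_by_group(rows):
    """Map each experience label to (Q1, median, Q3) of its salary midpoints."""
    groups = defaultdict(list)
    for experience, salary in rows:
        midpoint = salary_midpoint(salary)
        if midpoint is not None:
            groups[experience].append(midpoint)
    # quantiles() needs at least two data points, so smaller groups are skipped
    return {
        label: tuple(quantiles(values, n=4))
        for label, values in groups.items()
        if len(values) >= 2
    }
```

Here `rows` would be `(job_experience, job_salary)` pairs read back from the `job` table.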

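On BOSS Zhipin, the group with less than one year of experience has a maximum of over 40,000, which is clearly unreasonable.

So I checked the database. That posting actually requires more than 3 years of experience, but the tag it carries says less than 1 year.

In other words, the accuracy of the data provided by the source matters a great deal.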
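Suspicious records like this one can also be flagged automatically before plotting. The sketch below uses the usual box-plot rule (anything above Q3 + 1.5 × IQR) to list values worth checking by hand; the helper name, the threshold and the sample numbers in the usage note are illustrative and are not taken from the article's data.

```python
from statistics import quantiles


def flag_high_outliers(values, k=1.5):
    """Return the values above Q3 + k * IQR, the upper fence of a box plot."""
    q1, _, q3 = quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [value for value in values if value > upper_fence]
```

For example, with made-up monthly salaries in thousands, `flag_high_outliers([6, 7, 8, 8, 9, 10, 40])` returns `[40]`. A flagged value is only a prompt to look at the original posting, as was done here, not proof that the record is wrong.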

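04 Salary by education

Overall, master's > bachelor's > junior college, although some junior college and bachelor's postings also pay well.

After all, ability matters more and more as a career goes on; a degree is an important plus.

05 Salary by company status

Nothing stands out in this data, so treat it as a primer on these concepts.

A company can go all the way from an angel round to a public listing, and the road is a bumpy one.

06 Salary by company size

Normally, the larger the company, the higher the salary should be.

Big-company salaries are hard not to know about, after all.

The chart does not show that gap, though. What it does show is that the highest salary at the smallest companies is not high. Could it be that they are short of money early on?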

07 Top 10 company types

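Data analyst jobs are concentrated in the internet industry; finance, real estate, education, healthcare and gaming also appear.

Most of the demand is in the tertiary (service) sector.

08 Job skills

This is the main point of this post: these skills are what I will focus on learning.

They include data mining, SQL, BI, data operations, SPSS, databases, MySQL and so on.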
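A ranking like this can be produced by counting tags. The sketch below assumes the input is the `job_tips` column as written by the Lagou crawler, that is, tags joined with commas and the string `'None'` for postings without tags; the `count_skills` helper is hypothetical and not part of the original script.

```python
from collections import Counter


def count_skills(job_tips_values):
    """Count skill tags across comma-joined job_tips strings."""
    counter = Counter()
    for tips in job_tips_values:
        if tips == 'None':
            # The crawler stores the string 'None' when a posting has no tags
            continue
        counter.update(tag.strip() for tag in tips.split(',') if tag.strip())
    return counter
```

Calling `count_skills(values).most_common(20)` then gives the tags to plot, most frequent first.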

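09 Job benefits

Most of the emphasis here is on social insurance and the housing fund (五险一金), plenty of perks, a good team atmosphere, room for promotion, and being led by industry experts.

A company that had all of these would be out of this world.

But you and I both know it doesn't exist, and even if it did, it would be somebody else's company~

4. Summary

Finally, here are the salary top 20 charts for BOSS Zhipin and Lagou, as encouragement.

01 BOSS Zhipin salary TOP20

02 Lagou salary TOP20

After all, we can't just be a salted fish (咸鱼, a slacker). If we have to be one, let's be a salted fish with a dream!!!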

Editor: 未丽燕 Source: 法纳斯得
