
With These 5 Short Code Snippets, Data Visualization Is Easy (Python + Matplotlib)

Author: 文摘菌 2018-04-24 16:01:46


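This article covers Matplotlib, a powerful Python visualization library. Five short code snippets are enough to produce scatter plots, line plots, histograms, bar plots, and box plots, and each snippet is only about ten lines long, so it hardly gets simpler than this!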

A 大數據文摘 (Big Data Digest) production

Compiled and translated by: 傅一洋, 吳雙, 龍牧雪


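Data visualization is a major part of a data scientist's work. In the early stages of a project, you will usually carry out exploratory data analysis (EDA) to gain understanding of and insight into the data. Visualization really helps make the relationships in the data clearer and easier to understand, especially for large, high-dimensional datasets.

At the end of a project, it is equally important to present the final results in a clear, concise, and compelling way, because the audience is often a non-technical client, and only then can they readily understand the results.

Matplotlib is a popular Python library that makes data visualization easy. However, setting up the data, parameters, and figure every time you plot for a new project is tedious. In this article we look at 5 data visualization methods and write some quick, simple functions for them with Python's Matplotlib.

First, take a look at this large chart, which guides you to the right visualization for a given situation:

Figure: Choosing the appropriate data visualization technique for the situation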

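Scatter plot

Scatter plots are well suited to showing the relationship between two variables, because the raw distribution of the data can be read directly from the plot. By giving the groups different colors, you can also easily see how different groups of data relate, as in the first figure below. What if you want to visualize the relationship among three variables? No problem! Just use one more parameter (such as point size) to encode the third variable, as in the second figure below.

Figure: Scatter plot with groups distinguished by color

Figure: Adding a new dimension: circle size

Now for the code. First import Matplotlib's pyplot submodule under the alias plt. Call plt.subplots() to create a new figure. Pass the x-axis and y-axis data as x_data and y_data, then pass these arrays and the other arguments to ax.scatter() to draw the scatter plot. We can also set the point size, color, and alpha transparency, and even put the y-axis on a logarithmic scale. Finally, set the title and axis labels for the plot. This function handles the plotting end to end!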

import matplotlib.pyplot as plt
import numpy as np

def scatterplot(x_data, y_data, x_label="", y_label="", title="", color = "r", yscale_log=False):

    # Create the plot object
    _, ax = plt.subplots()

    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points
    ax.scatter(x_data, y_data, s = 10, color = color, alpha = 0.75)

    # Optionally switch the y-axis to a logarithmic scale
    if yscale_log:
        ax.set_yscale('log')

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
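
The idea of encoding groups by color and a third variable by point size can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the names group and weight are hypothetical, and the Agg backend is chosen just so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data only, with a fixed seed so the sketch is repeatable
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 60)
y = 2.0 * x + rng.normal(0, 2, 60)
group = rng.integers(0, 2, 60)    # hypothetical two-group label
weight = rng.uniform(1, 5, 60)    # hypothetical third variable

_, ax = plt.subplots()
for g, color in [(0, "#539caf"), (1, "#7663b0")]:
    mask = group == g
    # s is the marker area in points^2, so the third variable is scaled into it
    ax.scatter(x[mask], y[mask], s=20 * weight[mask], color=color,
               alpha=0.75, label=f"group {g}")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Color = group, size = third variable")
ax.legend(loc="best")
```

Calling ax.scatter() once per group keeps one legend entry per color, which is usually easier to read than a single call with a color array.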

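Line plot

When one variable changes substantially as another variable changes (that is, they have high covariance), a line plot is the best way to see the relationship clearly. For example, the figure below shows clearly that the percentage of women among bachelor's degree recipients in different majors has changed a great deal over time.

If we drew this as a scatter plot, the points would cluster together and look very messy, making it hard to see what the data means. A line plot suits this perfectly, because it essentially gives an overall picture of how the two variables (percentage of women and time) vary together. Again, different colors can be used to distinguish multiple groups of data.

Figure: Percentage of bachelor's degrees conferred to women (USA)

The code is similar to the scatter plot code, with only a few minor parameter changes.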

def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()

    # Plot the line through the data, set the linewidth (lw), color and
    # transparency (alpha) of the line
    ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)

    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
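
Drawing several groups as differently colored lines on one figure might look like the sketch below. The series names and values are invented for illustration and are not the degree data from the figure; the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(1970, 2011, 10)  # 1970, 1980, ..., 2010

# Hypothetical percentages, made up purely for this sketch
series = {
    "Series A": [20.0, 28.0, 35.0, 41.0, 45.0],
    "Series B": [60.0, 58.0, 55.0, 57.0, 59.0],
}
colors = ["#539caf", "#7663b0"]

_, ax = plt.subplots()
for (name, values), color in zip(series.items(), colors):
    # One ax.plot() call per group, each with its own color and legend label
    ax.plot(years, values, lw=2, color=color, alpha=1, label=name)

ax.set_xlabel("Year")
ax.set_ylabel("Percentage")
ax.set_title("Two groups on one line plot")
ax.legend(loc="best")
```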

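Histogram

Histograms are well suited to viewing (or discovering) the distribution of data. The figure below is a histogram of the proportion of the population at each IQ level. The central expected value and the median are clearly visible, and so is the fact that the data follows a normal distribution. Using a histogram (rather than a scatter plot) clearly shows the relative differences between the frequencies of the bins. Binning (discretizing the data) also helps reveal the "bigger-picture distribution"; with undiscretized data points there may be so much noise that the true distribution is hard to see.

Figure: Normally distributed IQ

Below is the code for creating a histogram with Matplotlib. Two parameters deserve attention. The first is n_bins, which controls how finely the histogram is discretized. On the one hand, more bins give more detailed information but may introduce noise that obscures the overall distribution; on the other hand, fewer bins give more of a "bird's-eye view" of the data, a fuller picture of the whole when fine detail is not needed. The second is cumulative, a Boolean that controls whether the histogram is cumulative, which amounts to choosing between a view shaped like the probability density function (PDF) and one shaped like the cumulative distribution function (CDF).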

def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
    _, ax = plt.subplots()
    # ax.hist() takes the number of bins through its bins argument
    ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
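
To see what the bin count and the cumulative option do to the numbers behind the plot, the following sketch uses np.histogram() on synthetic, IQ-like data (mean 100, standard deviation 15, both chosen only for illustration). It computes the same kind of counts that a histogram draws, without plotting anything.

```python
import numpy as np

# Synthetic data only, with a fixed seed so the sketch is repeatable
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

# Few bins: a coarse, bird's-eye view of the distribution
coarse_counts, coarse_edges = np.histogram(data, bins=5)

# Many bins: more detail, but each bin holds fewer points and is noisier
fine_counts, fine_edges = np.histogram(data, bins=50)

# A cumulative histogram is the running total of the per-bin counts
cumulative_counts = np.cumsum(fine_counts)

print(len(coarse_edges), len(fine_edges))  # 6 51 (always one more edge than bins)
print(cumulative_counts[-1])               # 1000 (every point lands in some bin)
```

The last cumulative value equals the sample size, which is why a cumulative histogram of counts always ends at the total number of observations.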

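What if you want to compare the distributions of two variables in your data? Some might think you have to make two separate histograms and place them side by side to compare. But there is a better way: overlay the histograms with different transparency. In the figure below, the uniform distribution's transparency is set to 0.5 so that the normal distribution behind it remains visible. This lets the user view both distributions on the same figure.

Figure: Overlaid histograms

The code for overlaid histograms needs to set the following:

Set the horizontal range so that it accommodates both distributions;
Compute and set the bin width from this range and the desired number of bins;
Give one of the variables higher transparency so that both distributions show on one figure.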

# Overlay 2 histograms to compare them
def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    # Set the bounds for the bins so that the two distributions are fairly compared
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins

    # With n_bins == 0, build explicit bin edges from the shared range
    if n_bins == 0:
        bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
    else:
        bins = n_bins

    # Create the plot; range keeps both histograms on the same bins when
    # bins is an integer (it has no effect when bins is a sequence of edges)
    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, range = data_range, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, range = data_range, color = data2_color, alpha = 0.75, label = data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
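
The key step, giving both distributions the same bin edges, can also be sketched with NumPy alone. This is one possible approach rather than the article's exact method: it uses np.histogram_bin_edges() on the combined data, and the normal and uniform samples are synthetic.

```python
import numpy as np

# Synthetic data only, with a fixed seed so the sketch is repeatable
rng = np.random.default_rng(1)
normal = rng.normal(0, 1, 500)
uniform = rng.uniform(-3, 3, 500)

# Compute one set of edges that spans both samples
combined = np.concatenate([normal, uniform])
edges = np.histogram_bin_edges(combined, bins=10)

# Counting both samples against the same edges makes the bars comparable
counts_normal, _ = np.histogram(normal, bins=edges)
counts_uniform, _ = np.histogram(uniform, bins=edges)

print(len(edges))  # 11 (ten bins need eleven edges)
```

The resulting edges array can be passed as the bins argument of ax.hist() for both datasets, so that each bar of one histogram lines up with a bar of the other.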

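Bar plot

Bar plots are suited to visualizing categorical data with few categories (fewer than 10). With too many categories the bars crowd together and the plot becomes cluttered and hard to interpret. Bar plots suit categorical data for two reasons: the differences between categories are easy to read from the height of the bars, and the categories are easy to tell apart and can even be given different colors. We cover three types of bar plot below: regular, grouped, and stacked. See the code for details.

The regular bar plot is shown in the figure below. In the barplot() function, x_data gives the x-axis positions and y_data gives the bar heights; error_data is passed as yerr to draw a standard-deviation error bar centered on the top of each bar.

The grouped bar plot, shown in the figure below, lets you compare multiple categorical variables. The figure contains two relationships: first, score versus group (G1, G2, and so on), and second, the comparison between genders, distinguished by color. In the code, y_data_list is a list of sublists, each sublist representing one group. We assign x positions to each list, loop over the sublists, give each a different color, and draw the grouped bar plot.

Stacked bar plots are suited to visualizing categorical data that contains subcategories. The figure below uses a stacked bar plot to show daily server load statistics. Stacking in different colors lets you compare the servers and see which server was worked the most on each day and exactly how much load it carried. The code follows the same pattern as the grouped bar plot, looping over each group, except that this time each new bar is stacked on top of the old ones instead of being drawn beside them.

Below is the code for the three types of bar plot: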

def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""): 
    _, ax = plt.subplots() 
    # Draw bars, position them in the center of the tick mark on the x-axis 
    ax.bar(x_data, y_data, color = '#539caf', align = 'center') 
    # Draw error bars to show standard deviation, set ls to 'none' 
    # to remove line between points 
    ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2) 
    ax.set_ylabel(y_label) 
    ax.set_xlabel(x_label) 
    ax.set_title(title)
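
As a rough illustration of where the bar heights and error bars might come from, the sketch below computes a mean and standard deviation per category and draws them. The category names and sample values are invented. It passes yerr directly to ax.bar(), which Matplotlib also supports, instead of making a separate ax.errorbar() call; the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical raw measurements per category, made up for this sketch
samples = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 9.0, 8.0],
    "C": [2.0, 2.5, 3.5],
}

names = list(samples)
means = [np.mean(values) for values in samples.values()]
stds = [np.std(values) for values in samples.values()]

_, ax = plt.subplots()
# Bar height = mean, error bar = one standard deviation above and below
ax.bar(names, means, yerr=stds, capsize=4, color="#539caf", align="center")
ax.set_xlabel("Category")
ax.set_ylabel("Mean value")
ax.set_title("Bar plot with standard-deviation error bars")
```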

def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Running total of the bar heights already drawn at each x position
    bottom = np.zeros(len(x_data))
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Each category starts at the top of everything stacked below it
        ax.bar(x_data, y_data_list[i], color = colors[i], bottom = bottom, align = 'center', label = y_data_names[i])
        bottom = bottom + np.asarray(y_data_list[i], dtype = float)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
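
When more than two series are stacked, each bar should start at the sum of all the series beneath it, not just at the height of the previous one. The sketch below works this out with np.cumsum() on a small made-up load table (rows are hypothetical servers, columns are days).

```python
import numpy as np

# Hypothetical loads: one row per server, one column per day
loads = np.array([
    [3, 4, 2],
    [1, 2, 2],
    [2, 1, 3],
])

# Top of each stacked segment = running total down the rows
tops = np.cumsum(loads, axis=0)

# Bottom of each segment = its top minus its own height
bottoms = tops - loads

print(bottoms.tolist())   # [[0, 0, 0], [3, 4, 2], [4, 6, 4]]
print(tops[-1].tolist())  # [6, 7, 7] (total load per day)
```

Each row of bottoms is what would be passed as the bottom argument of ax.bar() for the corresponding series.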

def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.8
    # Width of each individual bar
    n_series = len(y_data_list)
    ind_width = total_width / n_series
    # Offsets of the bar centers, so each cluster is centered about the x tick mark
    alteration = (np.arange(n_series) - (n_series - 1) / 2) * ind_width
    # Convert to an array so the offsets can be added to the numeric x positions
    x_data = np.asarray(x_data)

    # Draw bars, one category at a time
    for i in range(0, n_series):
        # Shift the bar along the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
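
The placement of bars within a cluster comes down to a small piece of arithmetic: divide the total cluster width among the series, then shift each bar center so the cluster as a whole is centered on the tick. The helper name cluster_offsets below is hypothetical and exists only for this sketch.

```python
import numpy as np

def cluster_offsets(n_series, total_width=0.8):
    """Return (offsets of the bar centers, width of one bar) for one cluster."""
    ind_width = total_width / n_series
    # Centers are spaced ind_width apart and symmetric around zero
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * ind_width
    return offsets, ind_width

offsets, width = cluster_offsets(4)
print(np.round(offsets, 3).tolist())  # [-0.3, -0.1, 0.1, 0.3]
print(round(width, 3))                # 0.2
```

Adding offsets[i] to the numeric x positions of series i places the bars side by side, with the default center alignment of ax.bar() keeping the cluster centered on each tick mark.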

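Box plot

The histograms covered earlier are great for visualizing the distribution of a variable. But what if you want to visualize more information about it? Perhaps you want a clear view of the standard deviation, or the median differs greatly from the mean: are there many outliers, or is the distribution itself skewed toward one end?

This is where box plots come in: they can show all of the information above. The bottom and top of the box are the first and third quartiles (the 25th and 75th percentiles of the data), and the line inside the box is the second quartile (the median). The whiskers (the T-shaped dashed lines) extending above and below the box show the upper and lower range of the data.

Because a box is drawn for each group or variable, box plots are easy to set up. x_data is a list of groups or variables, and each value in x_data corresponds to a column of values (a column vector) in y_data. Matplotlib's boxplot() function draws one box for each column of y_data; after that, we only need to set the box plot's styling parameters.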

def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""): 
    _, ax = plt.subplots() 
  
    # Draw boxplots, specifying desired style 
    ax.boxplot(y_data 
               # patch_artist must be True to control box fill 
               , patch_artist = True 
               # Properties of median line 
               , medianprops = {'color': median_color} 
               # Properties of box 
               , boxprops = {'color': base_color, 'facecolor': base_color} 
               # Properties of whiskers 
               , whiskerprops = {'color': base_color} 
               # Properties of whisker caps 
               , capprops = {'color': base_color}) 
  
    # By default, the tick label starts at 1 and increments by 1 for 
    # each box drawn. This sets the labels to the ones we want 
    ax.set_xticklabels(x_data) 
    ax.set_ylabel(y_label) 
    ax.set_xlabel(x_label) 
    ax.set_title(title)
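
The numbers a box plot encodes can be worked out by hand with np.percentile(). The sketch below uses a small made-up sample and follows Matplotlib's default whisker rule, under which whiskers reach the furthest data points within 1.5 times the interquartile range of the box and anything beyond is drawn as an outlier.

```python
import numpy as np

# Made-up sample with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Box edges and median line
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (Matplotlib's default whis=1.5)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1 9
print(outliers.tolist())          # [30]
```

Here the mean (7.5) sits well above the median (5.5) because of the single outlier, which is exactly the kind of situation a box plot makes visible.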

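Those are 5 quick and easy data visualizations with Matplotlib that you can put to use! Wrapping functionality into functions always makes code easier to write and to read. We hope this article was helpful and that you learned something from it. If you enjoyed it, give it a like!

Original article: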

https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f

[This article is an original translation by 大數據文摘 (Big Data Digest), a 51CTO column contributor. WeChat official account: 大數據文摘 (id: BigDataDigest)]


Editor: 趙寧寧. Source: 51CTO專欄 (51CTO Column)
