Six Ways to Create a Correlation Matrix in Python

Author: Anonymous  2023-09-24 14:52:21

Big Data / Data Analysis

The correlation matrix is a basic tool of data analysis: it shows how different variables relate to one another. Python offers many ways to compute one, and this article summarizes them.

Pandas

A Pandas DataFrame can create a correlation matrix directly with its corr method. Since most people in data science already use Pandas to load their data, this is usually one of the quickest and simplest ways to check how the columns correlate.

import pandas as pd
import seaborn as sns

data = sns.load_dataset('mpg')
correlation_matrix = data.corr(numeric_only=True)
correlation_matrix

If you work in statistics or analytics, you may be asking "where are the p-values?" We cover that in the last section.

Numpy

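NumPy also provides a function for computing the correlation matrix that we can call directly. Because it returns an ndarray without row or column labels, the output is not as easy to read as the pandas result.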

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# rowvar=False: each column (feature) is a variable, each row an observation
np.corrcoef(iris["data"], rowvar=False)
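Since the readability gap comes from the missing labels, one option is to wrap the ndarray back into a DataFrame. The following is a minimal sketch, not part of the original article: it uses synthetic data so that it runs offline, and the names `frame`, `raw`, and `labeled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, chosen so the example runs offline)
rng = np.random.default_rng(42)
x = rng.normal(size=200)
frame = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=200),  # strongly tied to x
    "z": rng.normal(size=200),                       # independent noise
})

# rowvar=False tells NumPy that each column, not each row, is a variable
raw = np.corrcoef(frame.to_numpy(), rowvar=False)

# Re-attach the column names so the result reads like DataFrame.corr() output
labeled = pd.DataFrame(raw, index=frame.columns, columns=frame.columns)
print(labeled.round(3))
```

On complete data like this, the labeled result should agree with `frame.corr()`, which computes Pearson correlation by default.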

For a better visualization, we can pass the correlation matrix directly to the sns.heatmap() function.

import seaborn as sns

data = sns.load_dataset('mpg')
# numeric_only=True skips the text columns, which corr() cannot process
correlation_matrix = data.corr(numeric_only=True)

sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm')

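The annot=True argument writes the correlation value into each cell. A common hack is to call sns.set_context('talk') to make the output more readable.

That context is intended for images used in slide presentations, and its larger fonts make the heatmap easier to read.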
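Because a correlation matrix is symmetric, half of the cells in the heatmap repeat the other half. A common variation, which the original article does not cover, is to hide the upper triangle. The sketch below builds a boolean mask with NumPy on synthetic data; `frame`, `mask`, and `lower_only` are illustrative names, and the commented-out heatmap call assumes seaborn is installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, chosen so the example runs offline)
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = frame.corr()

# True marks a cell to hide; k=1 starts above the diagonal, so the
# diagonal itself stays visible
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# The same idea as a table: hidden cells become NaN
lower_only = corr.mask(mask)
print(lower_only.round(2))

# With seaborn available, the mask can be passed to the heatmap directly:
# sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm')
```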

Statsmodels

Statsmodels, the statistical analysis library, can of course plot a correlation matrix as well.

import statsmodels.api as sm

# Compute the matrix once so the axis labels match its (numeric) columns
corr = data.corr(numeric_only=True)
fig = sm.graphics.plot_corr(
    corr,
    xnames=corr.columns.tolist())

plotly

By default, plotly draws the 1.0 diagonal from the lower left to the upper right. This is the opposite of most other tools, so pay special attention to it if you use plotly.

import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

import plotly.figure_factory as ff

# numeric_only=True skips the text columns, which corr() cannot process
correlation_matrix = data.corr(numeric_only=True)

fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    colorscale='Blues')

fig.show()
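If the more conventional orientation is preferred, with the diagonal running from the upper left to the lower right, one option is to reverse the row order of the matrix before handing it to plotly, since plotly places the first row at the bottom. This is a sketch on synthetic data rather than something the original article shows; `frame` and `flipped` are illustrative names. Reversing the y-axis on the figure itself, for example with `fig.update_yaxes(autorange='reversed')`, should have a similar effect.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, chosen so the example runs offline)
rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = frame.corr()

# Reverse the rows only; columns keep their order. The 1.0 values now sit
# on the anti-diagonal of the array, which plotly renders as a diagonal
# running from upper left to lower right.
flipped = corr.iloc[::-1]
print(flipped.round(2))

# These would replace the z, x, y arguments of create_annotated_heatmap:
z = flipped.values
x = list(flipped.columns)
y = list(flipped.index)
```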

Pandas + Matplotlib for Better Visualization

The same kind of plot can also be produced directly with sns.pairplot(data). The two approaches give similar figures, but seaborn needs only a single line:

sns.pairplot(data[['mpg','weight','horsepower','acceleration']])

So here we show how to do the same thing with Matplotlib:

import matplotlib.pyplot as plt

pd.plotting.scatter_matrix(
    data, alpha=0.2,
    figsize=(6, 6),
    diagonal='hist')

plt.show()

P-values for Correlations

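Suppose you want a simple matrix that includes p-values, which is what many other tools (SPSS, Stata, R, SAS, and so on) produce by default. How do you get that in Python?

This is where the scientific computing library scipy comes in. The following function implements it: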

from scipy.stats import pearsonr
import pandas as pd
import seaborn as sns

def corr_full(df, numeric_only=True, rows=['corr', 'p-value', 'obs']):
    """
    Generates a correlation matrix with correlation coefficients,
    p-values, and observation count.

    Args:
    - df:                  Input dataframe
    - numeric_only (bool): Whether to consider only numeric columns for
                           correlation. Default is True.
    - rows:                Determines the information to show.
                           Default is ['corr', 'p-value', 'obs'].

    Returns:
    - formatted_table: The correlation matrix with the specified rows.
    """

    # Calculate Pearson correlation coefficients
    corr_matrix = df.corr(numeric_only=numeric_only)

    # Calculate the p-values using scipy's pearsonr
    pvalue_matrix = df.corr(
        numeric_only=numeric_only,
        method=lambda x, y: pearsonr(x, y)[1])

    # Calculate the non-null observation count for each column
    obs_count = df.apply(lambda x: x.notnull().sum())

    # Approximate the observation count for each pair of columns as the
    # smaller of the two per-column counts
    obs_matrix = pd.DataFrame(
        index=corr_matrix.columns, columns=corr_matrix.columns)
    for col1 in corr_matrix.columns:
        for col2 in corr_matrix.columns:
            obs_matrix.loc[col1, col2] = min(obs_count[col1], obs_count[col2])

    # Create a multi-index dataframe to store the formatted correlations
    formatted_table = pd.DataFrame(
        index=pd.MultiIndex.from_product([corr_matrix.columns, rows]),
        columns=corr_matrix.columns
    )

    # Assign values to the appropriate cells in the formatted table
    for col1 in corr_matrix.columns:
        for col2 in corr_matrix.columns:
            if 'corr' in rows:
                formatted_table.loc[
                    (col1, 'corr'), col2] = corr_matrix.loc[col1, col2]

            if 'p-value' in rows:
                # Skip p-values on the diagonal: a column always
                # correlates perfectly with itself
                if col1 != col2:
                    formatted_table.loc[
                        (col1, 'p-value'), col2] = f"({pvalue_matrix.loc[col1, col2]:.4f})"
            if 'obs' in rows:
                formatted_table.loc[
                    (col1, 'obs'), col2] = obs_matrix.loc[col1, col2]

    return (formatted_table.fillna('')
            .style.set_properties(**{'text-align': 'center'}))

Calling the function directly returns the following result:

df = sns.load_dataset('mpg')
result = corr_full(df, rows=['corr', 'p-value'])
result
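The function above estimates the observation count for a pair of columns as the smaller of the two per-column counts. When missing values fall on different rows in each column, the number of rows where both values are present can be lower than that. A minimal sketch of an exact count follows, using a small made-up DataFrame; `present` and `pairwise_obs` are illustrative names, not part of the original function.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values on different rows of "a" and "b"
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 3.0, 5.0, 7.0],
    "c": [1.0, 1.5, 2.0, 2.5, 3.0],
})

# 1 where a value is present, 0 where it is missing
present = df.notna().astype(int)

# Entry (i, j) of the product counts the rows where both column i and
# column j are present
pairwise_obs = present.T @ present
print(pairwise_obs)
```

Here columns `a` and `b` each have four non-null values, but only three rows have both, so the exact count for that pair is 3 where the minimum-based estimate would report 4.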

Summary

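We have covered several ways to create a correlation matrix in Python; pick whichever is most convenient for you. The default output of most Python tools does not include p-values or observation counts, so if you need those statistics you can use the function provided in the final section. For a thorough and complete correlation analysis, having p-values and observation counts for reference is very helpful.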

Editor: Hua Xuan. Source: DeepHub IMBA
