Understanding the Pseudo-Log Transformation: A Go-To Method for Visualizing Skewed Data
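Skewed data is data whose distribution is highly uneven: when a variable is plotted as a histogram, most data points either cluster on the left with a long tail stretching to the right (right skew), or the reverse (left skew), or follow a more complex skewed pattern. Skewed data poses a real challenge for visualization, especially for heatmaps. The usual remedy is a logarithmic transformation. The classic logarithm, however, cannot handle zero or negative values, whereas the pseudo-log transformation can, which makes it better suited to processing and visualizing such data.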
Why use the pseudo-logarithm?
The classic logarithm is undefined for zero and negative values, which limits its use in many applications. The pseudo-logarithm removes this limitation: it is defined for all real numbers, behaves like a signed logarithm for values of large magnitude, and transitions smoothly to zero as its argument approaches zero.
The base-10 pseudo-logarithm (pseudo-log10) is defined as:
pseudo-log10(x) = log10(x/2 + sqrt(x^2/4 + 1)) = asinh(x/2) / ln(10)
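In the code and figure below, values on the x-axis are mapped to the y-axis by the pseudo-log10 transformation, drawn as a blue line. The classic log10 transformation is drawn as a black line for comparison.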
import numpy as np
import matplotlib.pyplot as plt

# Pseudo-log10 transformation
def pseudo_log10(x):
    return np.log(x / 2 + np.sqrt(x * x / 4 + 1)) / np.log(10)

# Define the data
x = np.linspace(-15, 15, 400)
y_pseudo_log10 = pseudo_log10(x)
# x[200:] is the positive half of the grid, where the classic log10 is defined
y_log10 = np.log10(x[200:])

# Create the figure
plt.figure(figsize=(10, 6))

# Plot the pseudo-log10 curve and the classic log10 curve
plt.plot(x, y_pseudo_log10, label='Pseudo-log10 Transformation', color='blue')
plt.plot(x[200:], y_log10, label='Classic log10 Transformation', color='black')

# Add the legend
plt.legend()

# Add the title and axis labels
plt.title('Comparison of Pseudo-log10 and Classic log10 Transformations')
plt.xlabel('x')
plt.ylabel('Transformed value')

# Show the figure
plt.grid(True)
plt.show()
The figure illustrates several useful properties of the pseudo-log transformation:
- pseudo-log10(x) is defined for all real numbers.
- pseudo-log10(0) = 0
- If x ≫ 0, then pseudo-log10(x) ≈ log10(x)
- If x ≪ 0, then pseudo-log10(x) ≈ −log10(|x|)
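The properties listed above can be checked numerically. The following is a minimal sketch, assuming the arcsinh form of the transformation, which is mathematically equal to ln(x/2 + sqrt(x^2/4 + 1)) / ln(10); the sample values are arbitrary choices made for illustration.

```python
import numpy as np

def pseudo_log10(x):
    # arcsinh(x/2) equals ln(x/2 + sqrt(x^2/4 + 1)); dividing by ln(10) converts to base 10
    return np.arcsinh(np.asarray(x, dtype=float) / 2) / np.log(10)

# Property: zero maps to zero
zero_value = float(pseudo_log10(0.0))

# Property: the transformation is odd, so negative inputs mirror positive ones
xs = np.array([0.5, 3.0, 1e3, 1e6])
is_odd = bool(np.allclose(pseudo_log10(-xs), -pseudo_log10(xs)))

# Property: for large |x| the result approaches the signed log10
big = np.array([1e3, 1e6, 1e9])
gap_pos = np.abs(pseudo_log10(big) - np.log10(big))
gap_neg = np.abs(pseudo_log10(-big) + np.log10(big))

print("pseudo_log10(0):", zero_value)
print("odd symmetry holds:", is_odd)
print("largest gap to log10(x) for x >> 0:", gap_pos.max())
print("largest gap to -log10(|x|) for x << 0:", gap_neg.max())
```

The arcsinh form is used here because it avoids subtracting two nearly equal numbers when x is large and negative; for the plotting range used in this article, both forms give the same curve.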
Similarly, the pseudo-logarithm for any base b (pseudo-log_b) can be defined as:
pseudo-log_b(x) = log_b(x/2 + sqrt(x^2/4 + 1)) = asinh(x/2) / ln(b)
pseudo-log_b(x) has the following properties:
- pseudo-log_b(0) = 0
- If x ≫ 0, then pseudo-log_b(x) ≈ log_b(x)
- If x ≪ 0, then pseudo-log_b(x) ≈ −log_b(|x|)
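A base-b version can be sketched in the same way. The helper names `pseudo_log` and `inverse_pseudo_log` below are hypothetical, introduced only for this illustration. The inverse is obtained by solving y = asinh(x/2) / ln(b) for x, and is handy when transformed values need to be mapped back to original units, for example to label a color bar.

```python
import numpy as np

def pseudo_log(x, base=10.0):
    # Hypothetical helper: base-b pseudo-logarithm, asinh(x/2) / ln(b)
    return np.arcsinh(np.asarray(x, dtype=float) / 2) / np.log(base)

def inverse_pseudo_log(y, base=10.0):
    # Hypothetical helper: solve y = asinh(x/2) / ln(b) for x
    return 2 * np.sinh(np.asarray(y, dtype=float) * np.log(base))

values = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
for base in (2.0, np.e, 10.0):
    transformed = pseudo_log(values, base)
    recovered = inverse_pseudo_log(transformed, base)
    # The round trip should reproduce the original values for every base
    print(f"base={base:.3f}", np.round(transformed, 3), np.allclose(recovered, values))
```

Changing the base only rescales the output by a constant factor, so the shape of the curve is the same for every b; the base mainly decides how the transformed values read (powers of 2, of e, or of 10).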
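Pseudo-logarithms in data visualization
Log transformation is a common way to handle data spread over a wide range. It compresses the data into a more regular distribution that is easier to visualize. Let us first look at how the log and pseudo-log transformations affect a distribution (no suitable real dataset was at hand, so the data below is generated):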
import numpy as np
import matplotlib.pyplot as plt

# Draw 1000 samples from a right-skewed distribution
data = np.random.exponential(scale=2, size=1000)

# Pseudo-log10 transformation
def pseudo_log(x):
    return np.log(x / 2 + np.sqrt(x * x / 4 + 1)) / np.log(10)

# Apply both transformations (base 10 in both cases so the panels are comparable)
log_data = np.log10(data)
pseudo_log_data = pseudo_log(data)

# Plot the original, log-transformed, and pseudo-log-transformed data
plt.figure(figsize=(12, 8))

# Original data
plt.subplot(3, 1, 1)
plt.hist(data, bins=50, color='blue', alpha=0.7, label='Original Data')
plt.legend()
plt.title('Original Data')

# Log-transformed data
plt.subplot(3, 1, 2)
plt.hist(log_data, bins=50, color='green', alpha=0.7, label='Log-Transformed Data')
plt.legend()
plt.title('Log-Transformed Data')

# Pseudo-log-transformed data
plt.subplot(3, 1, 3)
plt.hist(pseudo_log_data, bins=50, color='red', alpha=0.7, label='Pseudo-Log-Transformed Data')
plt.legend()
plt.title('Pseudo-Log-Transformed Data')

plt.tight_layout()
plt.show()
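The effect seen in the histograms can also be put into a number. The sketch below is one possible check, not part of the original comparison: it computes the sample skewness (third standardized moment) before and after the pseudo-log transformation. The helper `sample_skewness`, the seed, and the use of `np.random.default_rng` are assumptions made for this illustration.

```python
import numpy as np

def sample_skewness(values):
    # Hypothetical helper: third standardized moment (biased estimator)
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Same kind of right-skewed sample as in the histogram example (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.exponential(scale=2, size=1000)

# Pseudo-log10 transformation in its arcsinh form
pseudo_log_data = np.arcsinh(data / 2) / np.log(10)

raw_skew = sample_skewness(data)
pseudo_skew = sample_skewness(pseudo_log_data)
print(f"skewness of raw data:        {raw_skew:.3f}")
print(f"skewness of pseudo-log data: {pseudo_skew:.3f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is what makes the bottom histogram easier to read than the top one.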
Effect of the pseudo-log transformation on heatmap visualization:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate heavily skewed data spanning a wide range:
# the first 10 rows are positive, the last 10 rows are negative
np.random.seed(42)
data = np.concatenate([np.random.lognormal(mean=0, sigma=2.0, size=(10, 20)),
                       -np.random.lognormal(mean=0, sigma=2.0, size=(10, 20))])

# Function to plot the heatmaps
def plot_heatmaps(data):
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Linear color scale
    sns.heatmap(data, ax=axes[0], cmap='viridis', cbar_kws={'label': 'population density / km2'})
    axes[0].set_title('Linear color scale')

    # Log10 transformation: undefined for data <= -1, so those cells become NaN
    # and are left blank by seaborn
    with np.errstate(invalid='ignore', divide='ignore'):
        log_data = np.log10(data + 1)
    sns.heatmap(log_data, ax=axes[1], cmap='viridis', cbar_kws={'label': 'log10(population density / km2)'})
    axes[1].set_title('log10 transformation')

    # Pseudo-log10 transformation, using the same formula as defined earlier
    pseudo_log_data = np.log(data / 2 + np.sqrt(data**2 / 4 + 1)) / np.log(10)
    sns.heatmap(pseudo_log_data, ax=axes[2], cmap='viridis', cbar_kws={'label': 'pseudo-log(population density / km2)'})
    axes[2].set_title('pseudo-log transformation')

    plt.tight_layout()
    plt.show()

plot_heatmaps(data)
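To see why the middle panel cannot show the negative half of the data, it helps to count how many cells each transformation leaves undefined. This is a small sketch that regenerates the same simulated array with the same seed and random calls as the heatmap code; the variable names are chosen here for illustration.

```python
import numpy as np

# Rebuild the simulated array used for the heatmaps
np.random.seed(42)
positive = np.random.lognormal(mean=0, sigma=2.0, size=(10, 20))
negative = -np.random.lognormal(mean=0, sigma=2.0, size=(10, 20))
data = np.concatenate([positive, negative])

# log10(x + 1) is undefined wherever x < -1; NumPy returns NaN there
with np.errstate(invalid="ignore", divide="ignore"):
    shifted_log = np.log10(data + 1)

# The pseudo-log10 (arcsinh form) is defined for every real input
pseudo = np.arcsinh(data / 2) / np.log(10)

undefined_cells = int(np.isnan(shifted_log).sum())
finite_pseudo_cells = int(np.isfinite(pseudo).sum())
print(f"log10(x + 1): {undefined_cells} of {data.size} cells undefined")
print(f"pseudo-log10: {finite_pseudo_cells} of {data.size} cells finite")
```

Every undefined cell is drawn blank in the log10 heatmap, while the pseudo-log heatmap assigns a color to all cells, with negative values on the low end of the scale.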
The figure below is an example generated with R, taken from https://www.databrewer.co/blog/pseudo-log-transformation:
Skewed data is unevenly distributed and is hard to visualize, particularly in heatmaps. The classic log transformation cannot handle zeros or negative values; the pseudo-log transformation can, because it is defined for all real numbers and passes smoothly through zero. This article compared the two transformations and showed, on generated example data, that the pseudo-log transformation compresses skewed distributions and produces clearer visualizations.