Common MapReduce Data Mining Algorithms: Mean and Variance
Mean and variance with map-reduce
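The formulas for the mean and variance of a set of numbers are familiar to most readers, but how should the map and reduce functions be designed? A good starting point is the formulas themselves. Suppose there are n numbers a1, a2, ..., an. The mean is m = (a1 + a2 + ... + an) / n, and the variance is S = [(a1 - m)^2 + (a2 - m)^2 + ... + (an - m)^2] / n.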
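Expanding the variance formula gives S = [(a1^2 + ... + an^2) + n*m^2 - 2*m*(a1 + a2 + ... + an)] / n. Based on this, the map side takes (key, a1) as input and emits (1, (n1, sum1, var1)), where n1 is the number of values handled by that worker, sum1 is the sum of those values (for example a1 + a2 + a3 + ...), and var1 is the sum of their squares (for example a1^2 + a2^2 + ...).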
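As a rough illustration of the map-side summary, the sketch below computes the triple (n1, sum1, var1) for one chunk of numbers and checks that the expanded formula gives the same variance as the definition. The function names `partial_stats`, `variance_from_stats` and `variance_by_definition` are made up for this example and are not part of mrjob.

```python
def partial_stats(chunk):
    """Summarize one worker's chunk as (count, sum, sum of squares)."""
    n1 = 0
    sum1 = 0.0
    var1 = 0.0  # named var1 to match the text; it is a sum of squares
    for a in chunk:
        n1 += 1
        sum1 += a
        var1 += a * a
    return n1, sum1, var1


def variance_from_stats(n, total, sq_total):
    """Population variance via the expanded formula."""
    m = total / n
    return (sq_total + n * m * m - 2 * m * total) / n


def variance_by_definition(values):
    """Population variance straight from the definition."""
    n = len(values)
    m = sum(values) / n
    return sum((a - m) ** 2 for a in values) / n


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    stats = partial_stats(data)
    print(stats)                          # (4, 10.0, 30.0)
    print(variance_from_stats(*stats))    # 1.25
    print(variance_by_definition(data))   # 1.25
```

The triple is all a worker needs to pass on: the individual values never have to leave the map side.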
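On receiving these records, the reduce side adds n1, n2, ... to get n and adds sum1, sum2, ... to get sum, so the mean is m = sum / n. It then adds var1, var2, ... to get var, so the final variance is S = (var + n*m^2 - 2*m*sum) / n. The reducer outputs (1, (m, S)).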
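The reduce step can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not the mrjob job itself; `combine_stats` is a hypothetical helper name, and the two input triples stand for workers that saw [1, 2] and [3, 4].

```python
def combine_stats(partials):
    """Merge (n_i, sum_i, var_i) triples and return (mean, variance)."""
    n = 0
    total = 0.0
    sq_total = 0.0
    for n_i, sum_i, var_i in partials:
        n += n_i
        total += sum_i
        sq_total += var_i
    m = total / n
    s = (sq_total + m * m * n - 2 * m * total) / n
    return m, s


if __name__ == "__main__":
    # Two hypothetical workers: one saw [1, 2], the other saw [3, 4].
    partials = [(2, 3.0, 5.0), (2, 7.0, 25.0)]
    print(combine_stats(partials))  # (2.5, 1.25)
```

Because sum = m*n, the same expression simplifies to S = var/n - m^2, which is the familiar "mean of squares minus square of the mean" form.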
The code is an mrjob-based implementation (https://pythonhosted.org/mrjob/; see Machine Learning in Action, Chapter 15).

    from mrjob.job import MRJob
    from mrjob.step import MRStep


    class MRmean(MRJob):
        def __init__(self, *args, **kwargs):
            super(MRmean, self).__init__(*args, **kwargs)
            # Running totals for the input seen by this mapper task.
            self.inCount = 0
            self.inSum = 0
            self.inSqSum = 0

        def map(self, key, val):  # needs exactly 2 arguments
            if False:
                yield  # never runs; only makes this method a generator
            inVal = float(val)
            self.inCount += 1
            self.inSum += inVal  # running sum of the elements
            self.inSqSum += inVal * inVal  # running sum of the squares

        def map_final(self):
            mn = self.inSum / self.inCount
            mnSq = self.inSqSum / self.inCount
            # Mapper output; note that here mn = sum1/n1 and mnSq = var1/n1.
            yield (1, [self.inCount, mn, mnSq])

        def reduce(self, key, packedValues):
            cumVal = 0.0
            cumSumSq = 0.0
            cumN = 0.0
            for valArr in packedValues:  # unpack each mapper's output
                nj = float(valArr[0])
                cumN += nj
                cumVal += nj * float(valArr[1])
                cumSumSq += nj * float(valArr[2])
            mean = cumVal / cumN
            var = (cumSumSq - 2 * mean * cumVal + cumN * mean * mean) / cumN
            yield (mean, var)  # emit mean and var (reducer output)

        def steps(self):
            # MRStep replaces the self.mr(...) helper from older mrjob releases.
            return [MRStep(mapper=self.map, mapper_final=self.map_final,
                           reducer=self.reduce)]


    if __name__ == '__main__':
        MRmean.run()
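Note that the job above sends per-mapper means rather than raw sums, so the reducer has to weight each value by its count nj before adding. The following plain-Python sketch mimics that exchange without mrjob, using deliberately unequal splits; `mapper_summary` and `reducer_merge` are hypothetical names introduced only for this illustration.

```python
def mapper_summary(values):
    """What one mapper would emit: [count, mean, mean of squares]."""
    count = len(values)
    mean = sum(values) / count
    mean_sq = sum(v * v for v in values) / count
    return [count, mean, mean_sq]


def reducer_merge(summaries):
    """Re-weight each mapper's means by its count, as the reducer does."""
    cum_n = 0.0
    cum_val = 0.0
    cum_sum_sq = 0.0
    for n_j, mean_j, mean_sq_j in summaries:
        cum_n += n_j
        cum_val += n_j * mean_j
        cum_sum_sq += n_j * mean_sq_j
    mean = cum_val / cum_n
    var = (cum_sum_sq - 2 * mean * cum_val + cum_n * mean * mean) / cum_n
    return mean, var


if __name__ == "__main__":
    splits = [[1.0, 2.0, 3.0], [4.0]]  # unequal split sizes on purpose
    summaries = [mapper_summary(s) for s in splits]
    # Should agree, up to floating-point rounding, with the mean and
    # variance of [1, 2, 3, 4] computed in one pass.
    print(reducer_merge(summaries))
```

Averaging the per-mapper means without the nj weights would give the wrong answer whenever the splits differ in size, which is why the count travels with each record.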