Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 (64-bit).
Continuing from the previous post, this one covers the eigen decomposition step. It is fairly involved, and Java's limited support for matrix operations does not make the code any easier to follow, so let's take it slowly.
1. Prelude:
The eigen decomposition operates on the triDiag matrix computed in the previous post:
[[0.315642761491587, 0.9488780991876485, 0.0], [0.9488780991876485, 2.855117440373572, 0.0], [0.0, 0.0, 0.0]]
From the source code:
EigenDecomposition decomp = new EigenDecomposition(triDiag);
Matrix eigenVects = decomp.getV();
Vector eigenVals = decomp.getRealEigenvalues();
The eigenVects and eigenVals obtained here are the results of the eigen decomposition; their values can be inspected in debug mode.
The decomposition can also be run online at http://www.yunsuanzi.com/cgi-bin/symmetric_eig_decomp.py. Comparing its output with the Java result, the columns come out in a different order, and some signs differ as well. Re-running the decomposition in MATLAB, however, gives exactly the same result as the Java code, so the web page's answer appears to be wrong.
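The eigenvalues can also be sanity-checked by hand. Since the last row and column of triDiag are all zeros, its nonzero eigenvalues are those of the leading 2x2 block. A minimal plain-Java sketch (no Mahout dependency; values copied from the matrix above) applies the closed-form formula for a symmetric 2x2 matrix:

```java
public class EigenCheck {
    public static void main(String[] args) {
        // Leading 2x2 block of triDiag; the zero third row/column
        // only contributes a zero eigenvalue.
        double a = 0.315642761491587;   // triDiag[0][0]
        double b = 0.9488780991876485;  // triDiag[0][1] == triDiag[1][0]
        double d = 2.855117440373572;   // triDiag[1][1]

        // Eigenvalues of the symmetric matrix [[a, b], [b, d]]:
        // lambda = (trace +- sqrt(trace^2 - 4*det)) / 2
        double trace = a + d;
        double det = a * d - b * b;
        double disc = Math.sqrt(trace * trace - 4 * det);
        double lambda1 = (trace + disc) / 2;
        double lambda2 = (trace - disc) / 2;

        // Sum and product of the eigenvalues must reproduce trace and determinant.
        System.out.println("lambda1 = " + lambda1 + ", lambda2 = " + lambda2);
    }
}
```

These two values (plus the trivial zero) should match the entries of eigenVals seen in the debugger, up to ordering.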
Moving on through the source:
for (int row = 0; row < i; row++) {
  Vector realEigen = null;
  // the eigenvectors live as columns of V, in reverse order. Weird but true.
  Vector ejCol = eigenVects.viewColumn(i - row - 1);
  int size = Math.min(ejCol.size(), state.getBasisSize());
  for (int j = 0; j < size; j++) {
    double d = ejCol.get(j);
    Vector rowJ = state.getBasisVector(j);
    if (realEigen == null) {
      realEigen = rowJ.like();
    }
    realEigen.assign(rowJ, new PlusMult(d));
  }
  realEigen = realEigen.normalize();
  state.setRightSingularVector(row, realEigen);
  double e = eigenVals.get(row) * state.getScaleFactor();
  if (!isSymmetric) {
    e = Math.sqrt(e);
  }
  log.info("Eigenvector {} found with eigenvalue {}", row, e);
  state.setSingularValue(row, e);
}
log.info("LanczosSolver finished.");
endTime(TimingSection.FINAL_EIGEN_CREATE);
}
As can be seen, the value of realEigen (for row = 0) is the product of the transpose of column (rank - 1 - row) of eigenVects with the transposed basis vectors; in other words, it is the linear combination of the basis vectors weighted by the entries of that column. For example:
realEigen(0), as seen in the debugger, is:
{0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425}
The same value computed in Excel is:
0.011804489
0.00170371
0.002100736
0.014221147
0.096541512
0.002566682
0.002614706
0.000175314
0.00175959
0.004940636
0.000788125
0.00287348
0.995128632
Within rounding error the two agree. Note also that realEigen's indices run opposite to eigenVects' column indices: realEigen starts from index 0, while the corresponding column of eigenVects starts from the last one (column rank - 1) and walks backwards.
Next comes normalize, which updates realEigen by dividing the original values by the square root of realEigen's dot product with itself (its Euclidean norm). The result is then stored in the state's singularVectors. The value e is even simpler: take the corresponding entry of eigenVals, multiply by scaleFactor, and, when the input is not symmetric, take the square root; e is then stored in the state's singularValues. For reference, the definitions of singularVectors and singularValues in state are:
protected final Map<Integer, Double> singularValues;
protected Map<Integer, Vector> singularVectors;
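The loop above can be reduced to plain arrays: each right singular vector is the linear combination of the Lanczos basis vectors weighted by one column of eigenVects, then normalized to unit length. A stripped-down sketch without the Mahout types (arrays stand in for `Vector`; the names are illustrative):

```java
public class CombineAndNormalize {
    // Accumulate basis rows weighted by one eigenvector column, mirroring
    // realEigen.assign(rowJ, new PlusMult(d)) in the loop above, then normalize.
    static double[] combine(double[][] basis, double[] ejCol) {
        double[] realEigen = new double[basis[0].length];
        for (int j = 0; j < ejCol.length; j++) {
            for (int k = 0; k < realEigen.length; k++) {
                realEigen[k] += ejCol[j] * basis[j][k];
            }
        }
        return normalize(realEigen);
    }

    // Divide by the square root of the vector's dot product with itself.
    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) {
            norm += x * x;
        }
        norm = Math.sqrt(norm);
        for (int k = 0; k < v.length; k++) {
            v[k] /= norm;
        }
        return v;
    }

    public static void main(String[] args) {
        double[][] basis = {{1, 0, 0}, {0, 2, 0}};
        double[] ejCol = {0.5, 0.5};
        System.out.println(java.util.Arrays.toString(combine(basis, ejCol)));
    }
}
```

With this toy input the unnormalized combination is 0.5*[1,0,0] + 0.5*[0,2,0] = [0.5, 1, 0], which normalize then scales to unit length.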
2. Writing out the state's singular* variables:
Once the loop above finishes, control returns to line 203 of DistributedLanczosSolver:
Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
serializeOutput(state, outputEigenVectorPath);
This first builds an output path, then serializes state (which was updated inside the solve function) and writes it out.
The definition of serializeOutput:
public void serializeOutput(LanczosState state, Path outputPath) throws IOException {
  int numEigenVectors = state.getIterationNumber();
  log.info("Persisting {} eigenVectors and eigenValues to: {}", numEigenVectors, outputPath);
  Configuration conf = getConf() != null ? getConf() : new Configuration();
  FileSystem fs = FileSystem.get(outputPath.toUri(), conf);
  SequenceFile.Writer seqWriter =
      new SequenceFile.Writer(fs, conf, outputPath, IntWritable.class, VectorWritable.class);
  try {
    IntWritable iw = new IntWritable();
    for (int i = 0; i < numEigenVectors; i++) {
      // Persist eigenvectors sorted by eigenvalues in descending order
      NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
          "eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
      Writable vw = new VectorWritable(v);
      iw.set(i);
      seqWriter.append(iw, vw);
    }
  } finally {
    Closeables.closeQuietly(seqWriter);
  }
}
The key lines are:
NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
"eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
These write the singularValue and singularVector entries from state into the output file:
singularVectors:
{0={0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425},
1={0:-0.2883450858059115,1:-0.29170231535763447,2:-0.29157035465385267,3:-0.28754185317979386,4:-0.26018076078737895,5:-0.2914154866344813,6:-0.2913995247546756,7:-0.2922103132689348,8:-0.2916837423401091,9:-0.29062644748002026,10:-0.2920066313645422,11:-0.2913135151887795,12:0.03848561950058266},
2={0:0.01671441233225078,1:0.0935655369363106,2:0.09132650234523473,3:-0.0680324702834075,4:-0.9461123439509093,5:0.10210271255992123,6:0.10042714365337412,7:0.11137954332150339,8:0.10331974823993555,9:0.10621406378767596,10:0.10586960137353602,11:0.09262650242313884,12:0.09059904726143547}}
singularValue:
{0=0.0, 1=23.01314740985974, 2=2536.4018057098874}
Reading the generated file hdfs://ubuntu:9000/svd/output1/rawEigenvectors/p* should show results consistent with the above (not verified here).
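The only subtlety in serializeOutput is the index reversal `numEigenVectors - 1 - i`: the solver stores vectors and values in ascending eigenvalue order, so the reversal writes them out descending. A tiny sketch of just that mapping, using the singularValue entries listed above:

```java
public class DescendingWrite {
    // Reorder stored singular values so that output index i receives the
    // (n - 1 - i)-th stored value, as serializeOutput's loop does.
    static double[] outputOrder(double[] stored) {
        int n = stored.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = stored[n - 1 - i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Stored ascending, as in the singularValue map shown above.
        double[] stored = {0.0, 23.01314740985974, 2536.4018057098874};
        double[] out = outputOrder(stored);
        for (int i = 0; i < out.length; i++) {
            System.out.println("eigenVector" + i + ", eigenvalue = " + out[i]);
        }
    }
}
```

So eigenVector0 in the output file carries the largest eigenvalue, 2536.4018057098874.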
Control then returns to line 153 of DistributedLanczosSolver and execution continues;
3. The verification job:
Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
return new EigenVerificationJob().run(inputPath,
rawEigenVectorPath,
outputPath,
outputTmpPath,
maxError,
minEigenvalue,
inMemory,
getConf() != null ? new Configuration(getConf()) : new Configuration());
This first builds the path for the raw eigenvectors, then calls EigenVerificationJob's run method directly, so the whole analysis now moves on to EigenVerificationJob.
Note: what is rawEigen? From the analysis above, rawEigen is simply the singularVectors and singularValue values held in state.
Share, grow, be happy.
Please credit this blog when reposting: http://blog.csdn.net/fansy1990