Mahout source code analysis: DistributedLanczosSolver (4), what is rawEigen

 

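Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64bit.

Continuing from the previous post: the eigendecomposition itself is, well, too complicated, and I am too restless to sit down and analyze it (I could say Java's support for matrix operations is lacking; fine, that is blaming an external cause).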

1. Prelude:

The eigendecomposition is applied to the triDiag matrix. The result obtained for this matrix in the previous post was:

[[0.315642761491587, 0.9488780991876485, 0.0], [0.9488780991876485, 2.855117440373572, 0.0], [0.0, 0.0, 0.0]]
According to the source code:

EigenDecomposition decomp = new EigenDecomposition(triDiag);

    Matrix eigenVects = decomp.getV();
    Vector eigenVals = decomp.getRealEigenvalues();
The eigenVects and eigenVals obtained here are the result of the eigendecomposition; their values can be inspected in the debugger.


An online eigendecomposition tool is available at http://www.yunsuanzi.com/cgi-bin/symmetric_eig_decomp.py, and I ran the same matrix through it for comparison.


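At first glance the two results are the same, only with the columns in a different order. Well, the signs also seem to differ in places. And yes, they really do differ, so what now? I tried MATLAB, and the MATLAB result is exactly the same as the one computed in Java.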


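So it looks like that web page is not up to the job and got it wrong (although an eigenvector is only determined up to sign, so a sign flip of an entire column would not by itself be an error).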
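Since eigenvectors are only fixed up to sign, comparing two tools column by column can mislead; checking the residual A·v − λ·v is more robust. Below is a minimal sketch of that check for the non-zero 2x2 block of the triDiag matrix above, using the closed-form eigenvalues of a symmetric 2x2 matrix. The class and method names are hypothetical, and this is not how Mahout's EigenDecomposition works internally. The trailing all-zero row and column of triDiag simply contribute a third eigenvalue 0.

```java
/** Hypothetical helper, not part of Mahout: closed-form eigenpairs of a symmetric 2x2 matrix. */
final class Sym2x2Eigen {

  /** Eigenvalues of [[a, b], [b, c]], smallest first. */
  static double[] eigenvalues(double a, double b, double c) {
    double mean = (a + c) / 2.0;
    double radius = Math.hypot((a - c) / 2.0, b);
    return new double[] {mean - radius, mean + radius};
  }

  /** Unit eigenvector of [[a, b], [b, c]] for eigenvalue lambda (assumes b != 0). */
  static double[] eigenvector(double a, double b, double lambda) {
    // First row of (A - lambda I) v = 0 gives (a - lambda) x + b y = 0,
    // so v is proportional to (b, lambda - a).
    double x = b;
    double y = lambda - a;
    double norm = Math.hypot(x, y);
    return new double[] {x / norm, y / norm};
  }

  /** Largest absolute entry of A v - lambda v. */
  static double residual(double a, double b, double c, double lambda, double[] v) {
    double r0 = a * v[0] + b * v[1] - lambda * v[0];
    double r1 = b * v[0] + c * v[1] - lambda * v[1];
    return Math.max(Math.abs(r0), Math.abs(r1));
  }

  public static void main(String[] args) {
    // Entries of the non-zero block of triDiag from the text.
    double a = 0.315642761491587, b = 0.9488780991876485, c = 2.855117440373572;
    for (double lambda : eigenvalues(a, b, c)) {
      double[] v = eigenvector(a, b, lambda);
      double[] flipped = {-v[0], -v[1]};
      System.out.println("lambda = " + lambda
          + ", residual(v) = " + residual(a, b, c, lambda, v)
          + ", residual(-v) = " + residual(a, b, c, lambda, flipped));
    }
  }
}
```

Both v and −v give a residual at rounding level, which is why a sign difference alone says nothing about which tool is right.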

Reading on:

for (int row = 0; row < i; row++) {
      Vector realEigen = null;
      // the eigenvectors live as columns of V, in reverse order.  Weird but true.
      Vector ejCol = eigenVects.viewColumn(i - row - 1);
      int size = Math.min(ejCol.size(), state.getBasisSize());
      for (int j = 0; j < size; j++) {
        double d = ejCol.get(j);
        Vector rowJ = state.getBasisVector(j);
        if (realEigen == null) {
          realEigen = rowJ.like();
        }
        realEigen.assign(rowJ, new PlusMult(d));
      }
      realEigen = realEigen.normalize();
      state.setRightSingularVector(row, realEigen);
      double e = eigenVals.get(row) * state.getScaleFactor();
      if (!isSymmetric) {
        e = Math.sqrt(e);
      }
      log.info("Eigenvector {} found with eigenvalue {}", row, e);
      state.setSingularValue(row, e);
    }
    log.info("LanczosSolver finished.");
    endTime(TimingSection.FINAL_EIGEN_CREATE);
  }
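As you can see, realEigen (for row = 0) is the product of the transposed column (rank - 1 - row) of eigenVects with the basis vectors, that is, the sum of the basis vectors, each weighted by the matching entry of that column.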


The value of realEigen(0) (from the debugger) is:

{0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425}
The value computed in Excel is:

0.011804489 0.00170371 0.002100736 0.014221147 0.096541512 0.002566682 0.002614706 0.000175314 0.00175959 0.004940636 0.000788125 0.00287348 0.995128632
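So the two agree within rounding error. Note also that the indices of realEigen run opposite to the column indices of eigenVects: realEigen starts at index zero, while eigenVects is read starting from the last column (index rank - 1).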
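To put a number on "within rounding error", the two vectors quoted above can be compared entry by entry. This is only a small sketch with a hypothetical class name; the arrays are copied from the debugger output and the Excel row in the text.

```java
/** Hypothetical check: largest entrywise gap between the debugger values and the Excel values. */
final class RoundingCheck {

  /** realEigen(0) as shown by the debugger. */
  static final double[] DEBUG = {
      0.01180448947054423, 0.001703710024210367, 0.002100735590662567,
      0.014221147454610283, 0.09654151173375553, 0.0025666815984826535,
      0.0026147055494762234, 1.753144283209579E-4, 0.0017595900141802873,
      0.0049406361794682024, 7.881250692924197E-4, 0.002873479530226361,
      0.9951286321096425};

  /** The same vector as computed in Excel, rounded to 9 decimal places. */
  static final double[] EXCEL = {
      0.011804489, 0.00170371, 0.002100736, 0.014221147, 0.096541512,
      0.002566682, 0.002614706, 0.000175314, 0.00175959, 0.004940636,
      0.000788125, 0.00287348, 0.995128632};

  /** Returns max_k |x[k] - y[k]| for two arrays of equal length. */
  static double maxAbsDiff(double[] x, double[] y) {
    if (x.length != y.length) {
      throw new IllegalArgumentException("length mismatch");
    }
    double max = 0.0;
    for (int k = 0; k < x.length; k++) {
      max = Math.max(max, Math.abs(x[k] - y[k]));
    }
    return max;
  }

  public static void main(String[] args) {
    System.out.println("max abs difference = " + maxAbsDiff(DEBUG, EXCEL));
  }
}
```

The gap stays below 1e-9, which is what rounding to nine decimal places would produce.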

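Next comes normalize, which returns realEigen divided by its Euclidean norm (the square root of the dot product of realEigen with itself); the result is assigned back to realEigen. Then comes the assignment: realEigen is stored in state's singularVectors. The value e is even easier to understand: take the corresponding entry of eigenVals, multiply it by scaleFactor, and, when the input matrix is not symmetric, take the square root. Finally e is stored in state's singularValues. Here are the definitions of singularVectors and singularValues in state: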

protected final Map<Integer, Double> singularValues;
  protected Map<Integer, Vector> singularVectors;
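The three steps of the loop described above (weighted sum of the basis vectors, normalization, rescaling of the eigenvalue) can be sketched with plain arrays. This is an illustration under stated assumptions, not Mahout code: the class and method names are hypothetical, and the toy basis in main is made up.

```java
/** Hypothetical plain-array sketch of the final loop in the solver; not Mahout code. */
final class RitzVectorSketch {

  /** Returns sum_j coeffs[j] * basis[j], mirroring realEigen.assign(rowJ, new PlusMult(d)). */
  static double[] combine(double[] coeffs, double[][] basis) {
    int size = Math.min(coeffs.length, basis.length);
    double[] out = new double[basis[0].length];
    for (int j = 0; j < size; j++) {
      for (int k = 0; k < out.length; k++) {
        out[k] += coeffs[j] * basis[j][k];
      }
    }
    return out;
  }

  /** Divides v by its Euclidean norm sqrt(v . v) and returns the result as a new array. */
  static double[] normalize(double[] v) {
    double dot = 0.0;
    for (double x : v) {
      dot += x * x;
    }
    double norm = Math.sqrt(dot);
    double[] out = new double[v.length];
    for (int k = 0; k < v.length; k++) {
      out[k] = v[k] / norm;
    }
    return out;
  }

  /** Rescales the eigenvalue; the square root is taken only for a non-symmetric input. */
  static double singularValue(double eigenValue, double scaleFactor, boolean isSymmetric) {
    double e = eigenValue * scaleFactor;
    return isSymmetric ? e : Math.sqrt(e);
  }

  public static void main(String[] args) {
    // Made-up toy data: two orthonormal basis vectors in R^3 and one column of V.
    double[][] basis = {{1.0, 0.0, 0.0}, {0.0, 1.0, 0.0}};
    double[] ejCol = {3.0, 4.0};
    double[] realEigen = normalize(combine(ejCol, basis));
    System.out.println(java.util.Arrays.toString(realEigen));
    System.out.println(singularValue(4.0, 4.0, false));
  }
}
```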

2. Writing out state's singular* variables:

Once the code above has finished, execution returns to line 203 of DistributedLanczosSolver:

Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    serializeOutput(state, outputEigenVectorPath);
First an output path is set up, then state is serialized to it; state was updated inside the solve function.

Here is the definition of serializeOutput:

public void serializeOutput(LanczosState state, Path outputPath) throws IOException {
    int numEigenVectors = state.getIterationNumber();
    log.info("Persisting {} eigenVectors and eigenValues to: {}", numEigenVectors, outputPath); 
    Configuration conf = getConf() != null ? getConf() : new Configuration();
    FileSystem fs = FileSystem.get(outputPath.toUri(), conf);
    SequenceFile.Writer seqWriter =
        new SequenceFile.Writer(fs, conf, outputPath, IntWritable.class, VectorWritable.class);
    try {
      IntWritable iw = new IntWritable();
      for (int i = 0; i < numEigenVectors; i++) {
        // Persist eigenvectors sorted by eigenvalues in descending order\
        NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
            "eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
        Writable vw = new VectorWritable(v);
        iw.set(i);
        seqWriter.append(iw, vw);
      }
    } finally {
      Closeables.closeQuietly(seqWriter);
    }
  }
The key part of the above is:

NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
            "eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
This writes the singularValues and singularVectors held in state to the file:

singularVectors:
{0={0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425},
1={0:-0.2883450858059115,1:-0.29170231535763447,2:-0.29157035465385267,3:-0.28754185317979386,4:-0.26018076078737895,5:-0.2914154866344813,6:-0.2913995247546756,7:-0.2922103132689348,8:-0.2916837423401091,9:-0.29062644748002026,10:-0.2920066313645422,11:-0.2913135151887795,12:0.03848561950058266},
2={0:0.01671441233225078,1:0.0935655369363106,2:0.09132650234523473,3:-0.0680324702834075,4:-0.9461123439509093,5:0.10210271255992123,6:0.10042714365337412,7:0.11137954332150339,8:0.10331974823993555,9:0.10621406378767596,10:0.10586960137353602,11:0.09262650242313884,12:0.09059904726143547}}

singularValue:
{0=0.0, 1=23.01314740985974, 2=2536.4018057098874}
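The index reversal in serializeOutput (output index i reads state index numEigenVectors - 1 - i) is what puts the largest value first. Here is a minimal sketch that only builds the names and writes no file. The class name is hypothetical, and it assumes the number of entries equals the map size, whereas the real method uses state.getIterationNumber().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the index reversal in serializeOutput; builds names only. */
final class OutputOrderSketch {

  /** Builds the NamedVector names in output order from the stored singular values. */
  static List<String> names(Map<Integer, Double> singularValues) {
    int n = singularValues.size(); // assumption: stands in for state.getIterationNumber()
    List<String> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      // Output index i reads state index n - 1 - i, so the last stored entry comes out first.
      out.add("eigenVector" + i + ", eigenvalue = " + singularValues.get(n - 1 - i));
    }
    return out;
  }

  public static void main(String[] args) {
    // The singularValue map shown in the text.
    Map<Integer, Double> singularValues = new TreeMap<>();
    singularValues.put(0, 0.0);
    singularValues.put(1, 23.01314740985974);
    singularValues.put(2, 2536.4018057098874);
    for (String name : names(singularValues)) {
      System.out.println(name);
    }
  }
}
```

With the map from the text, eigenVector0 carries the value stored under key 2 and eigenVector2 carries the 0.0 stored under key 0.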

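Reading the generated files at hdfs://ubuntu:9000/svd/output1/rawEigenvectors/p* should show the same values as above (not verified).

Execution then returns to line 153 of DistributedLanczosSolver and continues from there.

3. The job: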

Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    return new EigenVerificationJob().run(inputPath,
                                          rawEigenVectorPath,
                                          outputPath,
                                          outputTmpPath,
                                          maxError,
                                          minEigenvalue,
                                          inMemory,
                                          getConf() != null ? new Configuration(getConf()) : new Configuration());
  
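First a path is set up, then the run method of EigenVerificationJob is called directly, so the analysis now moves on to EigenVerificationJob.

Note: what is rawEigen? From the analysis above, rawEigen is simply the contents of state's singularVectors and singularValues.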


Share, grow, enjoy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990


