
Mahout Source Code Analysis: Decision Forest Trilogy, Part 3: TestForest (2)

2014-11-24 09:12:37


estPath.toUri(), conf);

Job job = new Job(conf, "decision forest classifier");

log.info("Configuring the job...");
configureJob(job);

log.info("Running the job...");
if (!job.waitForCompletion(true)) {
  throw new IllegalStateException("Job failed!");
}

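First the paths of the dataset and the model are registered up front so that the Job's Mapper can read them. Then comes configureJob, and after that the job is run directly with job.waitForCompletion(true). Here is what configureJob does: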

[java]

job.setJarByClass(Classifier.class);

FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, mappersOutputPath);

job.setOutputKeyClass(DoubleWritable.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(CMapper.class);
job.setNumReduceTasks(0); // no reducers

job.setInputFormatClass(CTextInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

These are mostly routine settings. The Mapper is CMapper, and there is no Reducer. Let's see how CMapper works:

The main code of the setup function is just three lines:

[java]

dataset = Dataset.load(conf, new Path(files[0].getPath()));
converter = new DataConverter(dataset);
forest = DecisionForest.load(conf, new Path(files[1].getPath()));

These lines set dataset, converter, and forest respectively. In effect they just read the files back from their paths.

The map function:

[java]

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
  // On the first record only, write the name of the file this split comes from.
  if (first) {
    FileSplit split = (FileSplit) context.getInputSplit();
    Path path = split.getPath(); // current split path
    lvalue.set(path.getName());
    lkey.set(key.get());
    context.write(lkey, lvalue);
    first = false;
  }

  String line = value.toString();
  if (!line.isEmpty()) {
    // Key = the instance's original label, value = the forest's prediction as text.
    Instance instance = converter.convert(line);
    double prediction = forest.classify(dataset, rng, instance);
    lkey.set(dataset.getLabel(instance));
    lvalue.set(Double.toString(prediction));
    context.write(lkey, lvalue);
  }
}

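First, I do not yet know what the block under the first if is for. Working that out probably requires looking at the output files (the source code deletes them, but they are easy to get hold of: just set a breakpoint before the deletion). That will have to wait for the next analysis.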
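One plausible reading, and it is only an assumption that nothing above confirms, is that the first record lets whoever reads the mapper output know which input file the predictions that follow belong to. The sketch below simulates that in plain Java without Hadoop. MapperRecord and MapperOutputReader are hypothetical names made up for illustration; they are not Mahout classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one (DoubleWritable, Text) pair written by the mapper. */
final class MapperRecord {
  final double key;
  final String value;

  MapperRecord(double key, String value) {
    this.key = key;
    this.value = value;
  }
}

/** Hypothetical reader; assumes the first record carries the input file name. */
final class MapperOutputReader {

  /** Returns the value of the first record, assumed to be the file name. */
  static String fileName(List<MapperRecord> records) {
    if (records.isEmpty()) {
      throw new IllegalArgumentException("no records");
    }
    return records.get(0).value;
  }

  /** Parses the prediction strings of every record after the first one. */
  static List<Double> predictions(List<MapperRecord> records) {
    List<Double> result = new ArrayList<>();
    for (MapperRecord r : records.subList(1, records.size())) {
      result.add(Double.parseDouble(r.value));
    }
    return result;
  }

  public static void main(String[] args) {
    List<MapperRecord> records = Arrays.asList(
        new MapperRecord(0.0, "test.data"), // first record: (offset, file name)
        new MapperRecord(1.0, "1.0"),       // later records: (label, prediction)
        new MapperRecord(0.0, "1.0"));
    System.out.println(fileName(records) + " -> " + predictions(records));
  }
}
```

If this reading is right, a consumer only needs to treat the first record of each mapper output specially and can parse everything after it as (label, prediction) pairs.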

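Next, the function checks whether the input line is empty. If it is not, converter turns the line into an Instance, and the forest loaded in setup classifies that Instance to decide which class it belongs to. The key is set to the instance's original label and the value to the forest's prediction. (I don't see why the double has to be converted to a String here; wouldn't writing a DoubleWritable directly work? Perhaps it makes things easier for the analyzer.) The most important operation here is really the forest.classify function: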

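Here is just a brief description; a detailed analysis will come next time. The forest obtained earlier contains many trees (the number is configurable). Each tree classifies the Instance and yields its own result, and the result that occurs most often is taken as the final answer. OK, my eyes are going on strike...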
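The voting idea described above can be sketched in a few lines of plain Java. This is a minimal illustration, not Mahout's implementation: the class MajorityVote and its vote method are hypothetical, and the tie-breaking rule (the smaller label wins) is an arbitrary choice made for this sketch, since how forest.classify handles ties is not covered here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: picks the label predicted by the most trees. */
final class MajorityVote {

  /**
   * Returns the most frequent value in treePredictions.
   * Ties are broken in favour of the smaller label (assumption of this sketch).
   */
  static double vote(double[] treePredictions) {
    if (treePredictions.length == 0) {
      throw new IllegalArgumentException("no predictions");
    }
    // Count how many trees voted for each label.
    Map<Double, Integer> counts = new HashMap<>();
    for (double p : treePredictions) {
      counts.merge(p, 1, Integer::sum);
    }
    // Pick the label with the highest count.
    double best = Double.NaN;
    int bestCount = 0;
    for (Map.Entry<Double, Integer> e : counts.entrySet()) {
      int count = e.getValue();
      double label = e.getKey();
      if (count > bestCount || (count == bestCount && label < best)) {
        best = label;
        bestCount = count;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Five trees: three vote for class 1, so class 1 wins.
    System.out.println(vote(new double[] {1, 0, 1, 2, 1})); // prints 1.0
  }
}
```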
