Twenty Newsgroups Classification Task, Part 2: seq2sparse - JAVA

seq2sparse corresponds to the Mahout class org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles. The job monitoring page from yesterday's run shows that this step consists of 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the help for SparseVectorsFromSequenceFiles produces the following:

[java]

Usage:
 [--minSupport --analyzerName --chunkSize
 --output --input --minDF --maxDFSigma
 --maxDFPercent --weight --norm
 --minLLR --numReducers --maxNGramSize
 --overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors to be used,
                                      expressed in times the standard deviation (sigma) of
                                      the document frequencies of these vectors. Can be
                                      used to remove really high frequency terms.
                                      Expressed as a double value. Good value to be
                                      specified is 3.0. In case the value is less then 0
                                      no vectors will be filtered out. Default is -1.0.
                                      Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can be used
                                      to remove really high frequency terms. Expressed as
                                      an integer between 0 and 100. Default is 99. If
                                      maxDFSigma is also set, it will override this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a float or
                                      "INF" if you want to use the Infinite norm. Must be
                                      greater or equal to 0. The default is not to
                                      normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood Ratio(Float)
                                      Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks. Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to create
                                      (2 = bigrams, 3 = trigrams, etc) Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAc
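To make the option list more concrete, here is a minimal sketch of how an argument list for this step might be put together in Java. The class name Seq2SparseArgs, the method buildArgs, and the paths 20news-seq and 20news-vectors are hypothetical names chosen for illustration; the option values are example choices, not the settings used in the run described above. The sketch only assembles and prints the arguments and does not call Mahout itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles a seq2sparse argument list from the
// long option names shown in the help output. It does not invoke Mahout.
class Seq2SparseArgs {

    static String[] buildArgs(String input, String output) {
        List<String> args = new ArrayList<>();
        args.add("--input");
        args.add(input);            // directory of SequenceFiles (assumed path)
        args.add("--output");
        args.add(output);           // directory for the sparse vectors (assumed path)
        args.add("--weight");
        args.add("TFIDF");          // the help lists TF or TFIDF
        args.add("--maxDFPercent");
        args.add("99");             // the documented default, written out explicitly
        args.add("--norm");
        args.add("2");              // example value: normalize with the 2-norm
        args.add("--namedVector");  // flag without a value
        args.add("--overwrite");    // flag without a value
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        String[] args = buildArgs("20news-seq", "20news-vectors");
        System.out.println(String.join(" ", args));
        // Assumption: with Mahout and Hadoop on the classpath, the same array
        // could be handed to the driver class, for example
        //   ToolRunner.run(new SparseVectorsFromSequenceFiles(), args);
    }
}
```

Keeping the arguments in a list makes it easy to add or drop individual options, such as --maxDFSigma in place of --maxDFPercent, before the job is launched.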
