Hadoop sampler with multiple input paths, sampling only one file (MultipleInputs + getSample(conf.getInputFormat())) - Linux Programming Basics


2014-11-24 08:24:39

Tags: Hadoop, sampling input paths, single file, MultipleInputs, getSample, conf.getInputFormat

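I had been working on the sampler and thought that job was done, but now I have hit another problem. My input consists of two files, and the design calls for sampling only the larger one for now (later the two files will be sampled separately). When there is a single input file to sample, the code that uses the sampler is: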

Path input = new Path(args[0]);
input = input.makeQualified(input.getFileSystem(conf));

InputSampler.IntervalSampler sampler = new InputSampler.IntervalSampler(0.4, 5);

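// IntervalSampler(freq, maxSplitsSampled): records are kept at a frequency of 0.4,
// and at most 5 input splits are read. These arguments do not set the number of partitions.

// Prototype: K[] getSample(InputFormat<K,V> inf, JobConf job)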
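
To make the meaning of the 0.4 argument concrete, here is a minimal plain-Java sketch of interval sampling. It is an illustration under assumptions, not Hadoop's own class: `IntervalSketch` and `sample` are hypothetical names, a `List` stands in for the records of one split, and the second constructor argument (the cap on how many splits are read) is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of interval sampling; not Hadoop's implementation.
class IntervalSketch {
    // Keeps a record whenever the ratio kept/seen is still below freq,
    // so the kept records are spread at regular intervals over the input.
    static <K> List<K> sample(List<K> records, double freq) {
        List<K> samples = new ArrayList<>();
        long seen = 0;
        long kept = 0;
        for (K key : records) {
            ++seen;
            if ((double) kept / seen < freq) {
                ++kept;
                samples.add(key);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            records.add(i);
        }
        // With freq = 0.4, four of the ten records are kept.
        System.out.println(sample(records, 0.4)); // prints [1, 3, 6, 8]
    }
}
```

The number of reduce partitions is configured separately on the job; the sampler only decides which keys end up in the sample.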

String skewuri_out = args[2] + "/sample_list"; // holds the sampling result, not the partitioning result
FileSystem fs = FileSystem.get(URI.create(skewuri_out), conf);
FSDataOutputStream fs_out = fs.create(new Path(skewuri_out));

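final InputFormat inf = conf.getInputFormat(); // the InputFormat configured on the JobConf
Object[] p = sampler.getSample(inf, conf); // the sampled keys; the variable has to be declared as Object[], switching it to a Writable array type did not work, and I do not know why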
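
A plausible explanation for the `Object[]` requirement, offered as an assumption rather than a confirmed diagnosis: if `getSample` collects keys in a list and returns `(K[]) list.toArray()`, the array it returns is an `Object[]` at runtime, and assigning it to a more specific array type fails with a `ClassCastException`. The sketch below reproduces that behavior in plain Java; `SampleArrayDemo` and its methods are hypothetical stand-ins, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the generic-array cast pitfall; not Hadoop code.
class SampleArrayDemo {
    // Hypothetical stand-in for a sampler's getSample method.
    @SuppressWarnings("unchecked")
    static <K> K[] getSample(List<K> keys) {
        // The cast is unchecked: the runtime type of the array stays Object[].
        return (K[]) keys.toArray();
    }

    static List<String> someKeys() {
        List<String> keys = new ArrayList<>();
        keys.add("a");
        keys.add("b");
        return keys;
    }

    // Assigning to String[] makes the compiler insert a cast that fails at runtime.
    static boolean typedAssignmentFails() {
        try {
            String[] typed = getSample(someKeys());
            return typed == null; // not reached when the cast fails
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Assigning to Object[] needs no cast, so it works.
    static int objectAssignmentLength() {
        Object[] p = getSample(someKeys());
        return p.length;
    }

    public static void main(String[] args) {
        System.out.println(typedAssignmentFails());   // prints true
        System.out.println(objectAssignmentLength()); // prints 2
    }
}
```

If this is indeed the cause, keeping the variable as `Object[]` and casting individual elements where they are used would avoid the failure.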

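But this is where the problem appears. Suppose I write two Mapper classes, Map1class and Map2class, and each one processes the data from a different input path. For now both inputs are declared with the same input format, so this can be done with MultipleInputs: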

MultipleInputs.addInputPath(conf, new Path(args[0]), Definemyself.class,Map1class.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), Definemyself.class,Map2class.class);

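// Definemyself.class is my own class; it extends FileInputFormat and implements the WritableComparable interface.

// Extending FileInputFormat is required for sampling. It implements WritableComparable because I want the data serialized as a whole during the join. I cannot explain this serialization well myself; think of it as something like a struct in C: it is handled as one unit and can be printed with toString().

The declaration is: public class Definemyself extends FileInputFormat implements WritableComparable{...}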

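This problem has been bothering me since last night. Last week I was dreaming about sampling, and this week I am still dreaming about sampling. At noon I went out to eat with my husband so that we could talk the problem through properly. My theory is that since the framework provides MultipleInputs, and JobConf also lets you call getInputFormat(), there must be a way to use the two together. Otherwise the design would contradict itself, and only a fool would build a system like that.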

