Resolving Hadoop's "mapred.ReduceTask: java.net.ConnectException: Connection timed out" error

2015-02-03 10:49:47
Node 191 in the cluster ran into a fault, and the following appeared in the log:

[plain]
2013-11-08 08:32:13,908 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201311061017_18902_r_000000_0 copy failed: attempt_201311061017_18902_m_000003_0 from node-192
2013-11-08 08:32:13,921 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at sun.net.NetworkClient.doConnect(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.<init>(Unknown Source)
    at sun.net.www.http.HttpClient.New(Unknown Source)
    at sun.net.www.http.HttpClient.New(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1631)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1588)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1488)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1399)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1331)
The Hadoop code shows how a TaskTracker decides which hostname it uses:

[java]
localFs = FileSystem.getLocal(fConf);
// An explicitly configured slave.host.name takes precedence
if (fConf.get("slave.host.name") != null) {
  this.localHostname = fConf.get("slave.host.name");
}
// Otherwise the hostname is looked up via DNS for the configured interface and nameserver
if (localHostname == null) {
  this.localHostname =
    DNS.getDefaultHost
    (fConf.get("mapred.tasktracker.dns.interface","default"),
     fConf.get("mapred.tasktracker.dns.nameserver","default"));
}
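The precedence in the excerpt above can be sketched as a small standalone program. This is only an illustration, not Hadoop's code: the class `HostnameSelection` and the method `chooseHostname` are hypothetical, a `java.util.Properties` object stands in for the TaskTracker configuration, and falling back to the local canonical hostname is an assumption about what `DNS.getDefaultHost` does when both arguments are "default".

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Properties;

public class HostnameSelection {
    // Returns the slave.host.name override if set; otherwise the local
    // canonical hostname (assumed stand-in for DNS.getDefaultHost).
    public static String chooseHostname(Properties conf) {
        String override = conf.getProperty("slave.host.name");
        if (override != null) {
            return override;
        }
        try {
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (UnknownHostException e) {
            // Sketch-only fallback so the example never throws
            return "localhost";
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("without override: " + chooseHostname(conf));
        conf.setProperty("slave.host.name", "node-191");
        System.out.println("with override:    " + chooseHostname(conf));
    }
}
```

Whichever name comes out of this selection has to resolve correctly on the other nodes as well, which is what the next step checks.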
Pinging that hostname from the node:

[plain]
ping node-191
PING node-128-191.localhost (220.250.64.228) 56(84) bytes of data.
64 bytes from 220.250.64.228: icmp_seq=1 ttl=247 time=14.8 ms
64 bytes from 220.250.64.228: icmp_seq=2 ttl=247 time=14.3 ms
64 bytes from 220.250.64.228: icmp_seq=3 ttl=247 time=14.4 ms

The address that comes back is not 191's IP at all.
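The same check can be made from Java, which goes through the JVM's resolver just as the TaskTracker and reducers do. The following is a minimal sketch: the class `ResolveCheck` and its methods are hypothetical names, and the expected IP has to be supplied by whoever runs it, since the real address of 191 is not given here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    // Returns the IP address the local resolver gives for host,
    // or null if the name cannot be resolved.
    public static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // True only if host resolves to exactly the expected address.
    public static boolean resolvesTo(String host, String expectedIp) {
        return expectedIp.equals(resolve(host));
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java ResolveCheck <hostname> <expected-ip>");
            return;
        }
        String actual = resolve(args[0]);
        String verdict = args[1].equals(actual) ? "ok" : "expected " + args[1];
        System.out.println(args[0] + " -> " + actual + " (" + verdict + ")");
    }
}
```

Running it with `node-191` and the address 191 is supposed to have, on each node in turn, would show which nodes resolve the name incorrectly.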
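
A look at the hosts file on that node showed that 191's hostname was not configured there either.

That explains the problem: with no hosts entry, the name resolves to an unrelated address, so connections to it time out.

Adding 191's hostname to the hosts file on every node in the cluster and restarting the TaskTracker fixed it.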
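To confirm on each node that the entry is actually in place, the hosts file can be scanned for the hostname. The sketch below assumes the usual hosts format (an address followed by one or more names, with `#` starting a comment) and the default path `/etc/hosts`. The class `HostsFileCheck` and the method `lookup` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class HostsFileCheck {
    // Returns the address mapped to hostname in hosts-format lines,
    // or null if no line lists that name.
    public static String lookup(List<String> lines, String hostname) {
        for (String raw : lines) {
            // Drop any trailing comment, then split "address name [alias...]"
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].equalsIgnoreCase(hostname)) {
                    return fields[0];
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String hostname = args.length > 0 ? args[0] : "node-191";
        String path = args.length > 1 ? args[1] : "/etc/hosts"; // assumed location
        try {
            String address = lookup(Files.readAllLines(Paths.get(path)), hostname);
            System.out.println(address == null
                ? hostname + " is missing from " + path
                : hostname + " -> " + address);
        } catch (IOException e) {
            System.err.println("cannot read " + path + ": " + e.getMessage());
        }
    }
}
```

The first matching line wins, so a stale duplicate entry higher up in the file would still be reported rather than the newly added one.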