In Hadoop, what is the relationship between the number of map tasks, the number of reduce tasks, and the number of machine nodes?

Exactly as the title asks.

I searched around and found an answer that I think is quite good.

It is based on the paper published by Google:

MapReduce: Simplified Data Processing on Large Clusters
http://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf

Here is a short excerpt from section 3.5, Task Granularity. In the text below, M is the number of map tasks and R is the number of reduce tasks.

Furthermore, R is often constrained by users because the output of each reduce task ends up in a separate output file. In practice, we tend to choose M so that each individual task is roughly 16 MB to 64 MB of input data (so that the locality optimization described above is most effective), and we make R a small multiple of the number of worker machines we expect to use. We often perform MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines.
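Plugging the quoted figures back into the rule of thumb: R = 5,000 across 2,000 workers is a 2.5x multiple (a small multiple, as advertised), and M = 200,000 tasks at 16 MB to 64 MB apiece works out to roughly 3 to 13 TB of total input.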

To sum it up:
The number of map tasks is chosen so that the input is split into pieces of roughly 16 MB to 64 MB each; since a GFS chunk is 64 MB, each map task then reads at most one chunk, the locality optimization kicks in, and little data has to move across the network.
The number of reduce tasks is usually a small multiple of the number of machine nodes.
As for the number of machine nodes? If you've got the money, go wild: the more the merrier.
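In Hadoop itself, these two knobs are turned differently: the map count is not set directly but falls out of the input splits (one map task per split, which by default is one HDFS block), while the reduce count is set explicitly on the job. Here is a minimal sketch against the org.apache.hadoop.mapreduce API; the 10-node cluster size and the 2x multiple are assumed example values, not anything from the quoted answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TaskCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "task-count-example");
        job.setJarByClass(TaskCountExample.class);

        // The map task count is NOT set directly: Hadoop creates one map
        // task per input split, and a split defaults to one HDFS block.
        // We can only cap the split size, e.g. at 64 MB to stay inside
        // the 16-64 MB per-task range the paper recommends.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // The reduce task count IS set directly. Following the "small
        // multiple of the worker machines" rule of thumb:
        int workerNodes = 10; // hypothetical cluster size for illustration
        job.setNumReduceTasks(2 * workerNodes);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The asymmetry mirrors the paper: M is steered indirectly through split size so that input stays local to the data, while R is a deliberate choice, since each reduce task produces its own output file.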

Author: 曾凌恒
Link: http://www.zhihu.com/question/26816539/answer/34133965