Materialize Chinese Community - MapReduce: Basic Concepts and Principles

MapReduce: Basic Concepts and Principles

~~情~非~ 2024-07-19 09:20:39

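What Is MapReduce

MapReduce is a programming model proposed by Google for processing and generating large data sets. Hadoop MapReduce is an implementation of this model and is widely used in distributed computing environments. It divides a data processing job into two phases: the Map phase and the Reduce phase.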

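The Core Idea of MapReduce

Map: the input data is split into smaller subsets, and each subset is processed independently to produce intermediate key-value pairs.
Reduce: the intermediate key-value pairs are grouped by key, and each group is aggregated to produce the final output.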
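
To make the two steps concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all; the class name `CoreIdeaSketch` and its methods are hypothetical, and the `run` method only stands in for the grouping that a real framework performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of Map and Reduce; not Hadoop code.
class CoreIdeaSketch {
    // Map: turn one input record (a line of text) into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: collapse all values that share one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: group intermediate pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {Hadoop=1, Hello=2, MapReduce=1}
        System.out.println(run(List.of("Hello Hadoop", "Hello MapReduce")));
    }
}
```

Because `map` looks at one line at a time and `reduce` looks at one key at a time, both can in principle run on many machines in parallel, which is the point of the model.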

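MapReduce Execution Flow

Input: a large data set stored in HDFS.
Map phase: the Mapper function processes the input data and emits intermediate key-value pairs.
Shuffle and Sort: the framework sorts and groups the intermediate key-value pairs so that all values for the same key end up together.
Reduce phase: the Reducer function processes each group of intermediate key-value pairs and produces the final output.
Output: the results are written back to HDFS.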
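
The Shuffle and Sort step is the least visible part of the flow, so a rough sketch may help. The plain-Java code below is an illustration only: `ShuffleSketch` and its methods are hypothetical names, and everything happens in memory rather than across a network. The `partition` formula mirrors the one used by Hadoop's default `HashPartitioner`, although real jobs hash the `Writable` key (for example `Text`), so actual partition numbers will differ from the `String` hashes used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of Shuffle and Sort; not Hadoop code.
class ShuffleSketch {
    // Non-negative hash modulo the number of reducers, so one key always goes to one reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Route each pair to its reducer. A TreeMap per reducer keeps keys sorted
    // and collects all values for the same key into one list.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> pairs, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            int p = partition(pair.getKey(), numReducers);
            partitions.get(p)
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("Hello", 1), Map.entry("Hadoop", 1),
                Map.entry("Hello", 1), Map.entry("MapReduce", 1));
        List<TreeMap<String, List<Integer>>> partitions = shuffle(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reducer " + i + " receives " + partitions.get(i));
        }
    }
}
```

Whichever reducer a key lands on, both occurrences of `Hello` arrive together as a single group, which is what lets the Reduce phase sum them.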

MapReduce Programming Model

Mapper

The Mapper class processes the input data and emits intermediate key-value pairs.

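import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating a new object per token.
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // value holds one line of input text; key is not used here.
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one); // emit (word, 1)
        }
    }
}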
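
One detail of this Mapper is easy to miss: `StringTokenizer` with its default delimiters splits on whitespace only, so punctuation and letter case stay attached to each token. The plain-Java sketch below shows the effect without needing Hadoop. `TokenizeSketch` and `normalizedTokens` are hypothetical names, and the normalization rule (lowercase, split on anything that is not a letter or digit) is just one possible choice, not something the original program does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper for exploring tokenization outside Hadoop.
class TokenizeSketch {
    // Same splitting rule as TokenizerMapper: default StringTokenizer delimiters (whitespace).
    static List<String> rawTokens(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    // Assumed alternative: lowercase, then split on runs of non-letter, non-digit characters.
    static List<String> normalizedTokens(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Hello, hello Hadoop.";
        System.out.println(rawTokens(line));        // prints [Hello,, hello, Hadoop.]
        System.out.println(normalizedTokens(line)); // prints [hello, hello, hadoop]
    }
}
```

With the original rule, `Hello,` and `hello` would be counted as two different words. Whether that matters depends on the data being counted.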

Reducer

The Reducer class aggregates the intermediate key-value pairs and produces the final output.

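import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    // Called once per distinct key with all of the values for that key.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}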

Driver

The Driver class configures the MapReduce job and submits it.

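import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner because summing is associative and commutative.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0]: input path; args[1]: output path (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}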
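
The Driver registers `IntSumReducer` as the combiner as well as the reducer. That only works because pre-summing partial counts on the map side does not change the final total. The plain-Java sketch below illustrates why, and contrasts it with an average, where reusing the reducer as a combiner would give a wrong answer. `CombinerSketch` and the sample numbers are hypothetical.

```java
import java.util.List;

// Hypothetical illustration of when a reducer can safely double as a combiner.
class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Counts for one key emitted by two hypothetical map tasks.
        List<Integer> task1 = List.of(1, 1, 1);
        List<Integer> task2 = List.of(1);

        int withoutCombiner = sum(List.of(1, 1, 1, 1));
        int withCombiner = sum(List.of(sum(task1), sum(task2)));
        System.out.println(withoutCombiner + " " + withCombiner); // prints 4 4

        // A mean of partial means is not the overall mean.
        double trueMean = mean(List.of(2.0, 4.0, 6.0, 10.0));
        double meanOfMeans = mean(List.of(mean(List.of(2.0, 4.0, 6.0)), mean(List.of(10.0))));
        System.out.println(trueMean + " " + meanOfMeans); // prints 5.5 7.0
    }
}
```

For word counting the combiner is purely an optimization: it reduces the number of pairs sent over the network during the shuffle without affecting the result.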

MapReduce Program Execution Example

Prepare the Data
- Create an input file named input.txt with the following content:
```
Hello Hadoop
Hello MapReduce
```

Upload the Data to HDFS

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put input.txt /user/hadoop/input/

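Compile and Run the MapReduce Program

Save each class above in its own file named after the class (TokenizerMapper.java, IntSumReducer.java, and WordCount.java), because Java allows only one public top-level class per source file.

Compile the Java code and package it into a jar:

mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes TokenizerMapper.java IntSumReducer.java WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .

Run the MapReduce job (the output directory must not already exist, or the job will fail):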

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

View the Output

hdfs dfs -cat /user/hadoop/output/part-r-00000

The output should look like this, sorted by key, with each word and its count separated by a tab:

Hadoop	1
Hello	2
MapReduce	1
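
If the results need to be consumed by another program, the output lines can be parsed back into a map. The sketch below assumes the job used the default text output format, which writes each key and value separated by a tab. `OutputParserSketch` is a hypothetical helper that works on lines already read into memory, for example from a local copy of part-r-00000.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading word-count output lines of the form "word<TAB>count".
class OutputParserSketch {
    static Map<String, Integer> parse(List<String> lines) {
        // LinkedHashMap preserves the order in which lines appear in the file.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("Expected a tab-separated line: " + line);
            }
            counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hadoop\t1", "Hello\t2", "MapReduce\t1");
        System.out.println(parse(lines)); // prints {Hadoop=1, Hello=2, MapReduce=1}
    }
}
```

A job with several reducers writes several part-r-NNNNN files, so a complete reader would parse each file and merge the resulting maps.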

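Summary

The core idea of MapReduce is to split a large data set into small chunks and process them in a distributed fashion through two phases, Map and Reduce.
The Map phase turns the input data into intermediate key-value pairs.
The Reduce phase aggregates the intermediate key-value pairs and produces the final result.
The example code is a simple word-frequency program that uses the MapReduce model to count the words in a text file.

With the basic concepts, programming model, and execution flow of MapReduce in hand, you can build distributed data processing applications that scale to large data sets.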
