Fun with MapReduce (part I)

In this post, I show you how to create a java mapreduce job. There are other programming languages you can use, but Hadoop was originally made for java so let’s start with the native implementation.

Dependencies
You’ll need the hadoop library dependencies. Download a release and add it to your classpath or you can use maven to inject the dependency into your project.
For maven users:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.204.0</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <version>1.8.6</version>
  <scope>provided</scope>
</dependency>

The jackson package contained some Exception classes that hadoop needed.

And now the code…

public class WordCount {

	public static class Map extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
		private final static IntWritable one = new IntWritable(1);
		private Text skill = new Text();
		
		public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
						Reporter reporter) throws IOException {
			String line = value.toString();
			StringTokenizer tokenizer = new StringTokenizer(line);
			while( tokenizer.hasMoreTokens() ) {
				skill.set(tokenizer.nextToken());
				output.collect(skill, one);
			}
		}
	}
	
	public static class Reduce extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
		public void reduce(Text key, Iterator<IntWritable> values,
							OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
			int sum = 0;
			while( values.hasNext() ) {
				sum += values.next().get();
			}
			output.collect(key, new IntWritable(sum));
		}
	}
}

This is the typical example of a mapreduce program, which just counts the frequency of words in a file. I could have created these Mapper and Reducer classes in their own class instead of encapsulating them here. It doesn’t matter.
The Mapper class has one implementation which tokenizes a file of strings and collects their count in the OutputCollector. The Reducer takes the output of the Mapper threads and consolidates, in this case, summing the final count of each word.

Once you have the MapReduce program, you still need some code to set it up.

public static void main(String[] args) throws Exception {
	JobConf conf = new JobConf(WordCount.class);
	conf.setJobName("wordcount");
	
	conf.setOutputKeyClass(Text.class);
	conf.setOutputValueClass(IntWritable.class);
	
	conf.setMapperClass(Map.class);
	conf.setCombinerClass(Reduce.class);
	conf.setReducerClass(Reduce.class);
	
	conf.setInputFormat(TextInputFormat.class);			
	conf.setOutputFormat(TextOutputFormat.class);
	
	conf.set("fs.s3n.awsAccessKeyId", "[INSERT YOUR ACCESS KEY]");
	conf.set("fs.s3n.awsSecretAccessKey", "[INSERT YOUR SECRET KEY]");

	FileInputFormat.addInputPath(conf, new Path("s3n://[S3_BUCKET]/data/someFile.txt"));
	FileOutputFormat.setOutputPath(conf, new Path("s3n://[S3_BUCKET]/results"));
	
	JobClient.runJob(conf);
}

Here you set the Mapper and Reducer classes and their input and output types. You can read more about that online.

The part that I want to direct your attention to is the awsAccessKeyId and awsSecretAccessKey.
When you create your AWS account, you’ll find these two values in your Account’s Security Credentials. The page changes once in awhile so just look for Security Credentials, Access Credentials, or Access Keys in your account.

The next part demonstrates how you can read and write to your Amazon S3 service’s folder. S3 is Amazon’s cloud storage service. Elastic MapReduce jobs typically read and write to this. I’ll explain S3 and the configuration in more detail in the next post.

Once you have this code, you want to compile and package it into a jar.

Advertisements
Tagged

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: