Monthly Archives: August 2013

Fun with MapReduce (part III)

In this post, we walk through the steps of running your mapreduce job on Amazon’s Elastic MapReduce.

Let’s recap.
In part I, you wrote your java MapReduce job.
In part II, you set up your S3 cloud storage and upload your input file and MapReduce jar.
Now, let’s run it.

Step 1
Go to your Elastic MapReduce console and click the “Create New Job Flow” button

Step 2
On the 1st page of the screen that comes up, you’re going to enter a job name, leave the default “Amazon Distribution” Hadoop Version, and select “Run your own application”.
Then you’re going to select from the dropdown “Custom JAR”

Step 3
On the 2nd page of the screen, enter the following

JAR Location: this should be of the form “bucket/job/your.jar”
Leave out the “s3n://” protocol.

JAR Arguments: the first argument should be the class with the main() method that you’re running.
In our case, it would be “WordCount”. If we had put it under a packet, say “com.kodingnotes”, then your first argument would be “com.kodingnotes.WordCount”
You may supply additional arguments and they will be given to the main(String[] args).

*Note, this step differs slightly from the video I linked to in part II and will link to again at the end of this post. In the video, their first argument is not the name of the MapReduce job class to be run. They must have specified that as part of the Manifest file in the jar although it’s not alluded to. For clarity, specify the class name first.

Step 4
First select the size of the Master instance machine. This machine will direct the work of the Mappers and Reducers. We can leave the default “Small (m1.small)” instance for our small job.

Next, select the number and types of the Core instance machines. These are the actual instances that will run the Mappers and Reducers. You will want to instantiate an appropriate number for your job. The larger (and more distributed) the job, the more instances and possibly larger the instance size. That’s really all I can tell you until I get more experience with larger jobs myself.

Leave the Task instance count as 0.

Step 5
You can specify a EC2 key pair (if you already created one) and only if you want to be able to ssh into the machine while it’s running (or after if you choose to keep it alive). I didn’t see any need for that so I did not select a key pair.

For VPC Subnet Id, just leave the default.

In the S3 Log path, specify your path to the log folder you created in your bucket. Here, you’ll need to specify the “s3n://” protocol.
So your entry might look like “s3n://bucket-of-awesome-posts/logs/”

Don’t Enable Debugging.

Don’t Keep Alive (unless you want to, but you’ll be charged for it as long as it’s alive).

Step 6
No Bootstrap Actions.
(easy enough right?)

Step 7
Confirm and create the job flow!
Now your job is running!

You can monitor the progress of the job in the Elastic MapReduce console. Once it’s done, check our your S3 bucket’s result folder.


Oh, and here’s that video again. This post covers the video from about 4:30min in and onwards.


Fun with MapReduce (part II)

In this post, I describe how you can use Amazon’s S3 cloud storage to set up your environment before you run the mapreduce job you just created.

Step 1
Go to your S3 console and create a bucket. Bucket names are universal (meaning if I name my bucket “bucket-of-awesome-posts”, then you cannot create one of the same name).

Step 2
Create the following 4 folders in the bucket
– data: this will hold the input data for your mapreduce job (e.g. the text file used by our WordCount program)
– job: this will hold the WordCount app’s jar.
– logs: AWS EMR will automatically log your program’s logs here, under a subfolder named after the mapreduce Job ID (which you can find in the EMR console)
– results: this is where your mapreduce results will go

Now let’s remember the Part I post where we had these two lines:

FileInputFormat.addInputPath(conf, new Path("s3n://[S3_BUCKET]/data/someFile"));
FileOutputFormat.setOutputPath(conf, new Path("s3n://[S3_BUCKET]/results"));

[S3_BUCKET] refers to the bucket you created in step 1. Notice the input path uses the bucket/data subfolder, which you create in step 2. The output path uses the bucket/result subfolder. You may also specify additional subfolders to place under /result programmatically and the subfolders will be automatically created.

Step 3
Upload your files to your S3 folders.
– Upload a text file as input to your bucket/data folder
– Upload your jar to the bucket/job folder

That’s it for now. Look for part III where we run the job.

By the way, here’s a video of the steps in this post.
This post covers the first 4:30min of the video. Hopefully you’ll be better able to follow along now that you’ve ready through this post at your own speed. The remaining part of the video is covered in part III but feel free to skip ahead too.


Fun with MapReduce (part I)

In this post, I show you how to create a java mapreduce job. There are other programming languages you can use, but Hadoop was originally made for java so let’s start with the native implementation.

You’ll need the hadoop library dependencies. Download a release and add it to your classpath or you can use maven to inject the dependency into your project.
For maven users:



The jackson package contained some Exception classes that hadoop needed.

And now the code…

public class WordCount {

	public static class Map extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
		private final static IntWritable one = new IntWritable(1);
		private Text skill = new Text();
		public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
						Reporter reporter) throws IOException {
			String line = value.toString();
			StringTokenizer tokenizer = new StringTokenizer(line);
			while( tokenizer.hasMoreTokens() ) {
				output.collect(skill, one);
	public static class Reduce extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
		public void reduce(Text key, Iterator<IntWritable> values,
							OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
			int sum = 0;
			while( values.hasNext() ) {
				sum +=;
			output.collect(key, new IntWritable(sum));

This is the typical example of a mapreduce program, which just counts the frequency of words in a file. I could have created these Mapper and Reducer classes in their own class instead of encapsulating them here. It doesn’t matter.
The Mapper class has one implementation which tokenizes a file of strings and collects their count in the OutputCollector. The Reducer takes the output of the Mapper threads and consolidates, in this case, summing the final count of each word.

Once you have the MapReduce program, you still need some code to set it up.

public static void main(String[] args) throws Exception {
	JobConf conf = new JobConf(WordCount.class);
	conf.set("fs.s3n.awsAccessKeyId", "[INSERT YOUR ACCESS KEY]");
	conf.set("fs.s3n.awsSecretAccessKey", "[INSERT YOUR SECRET KEY]");

	FileInputFormat.addInputPath(conf, new Path("s3n://[S3_BUCKET]/data/someFile.txt"));
	FileOutputFormat.setOutputPath(conf, new Path("s3n://[S3_BUCKET]/results"));

Here you set the Mapper and Reducer classes and their input and output types. You can read more about that online.

The part that I want to direct your attention to is the awsAccessKeyId and awsSecretAccessKey.
When you create your AWS account, you’ll find these two values in your Account’s Security Credentials. The page changes once in awhile so just look for Security Credentials, Access Credentials, or Access Keys in your account.

The next part demonstrates how you can read and write to your Amazon S3 service’s folder. S3 is Amazon’s cloud storage service. Elastic MapReduce jobs typically read and write to this. I’ll explain S3 and the configuration in more detail in the next post.

Once you have this code, you want to compile and package it into a jar.


Fun with Elastic MapReduce (intro)

You’ve probably heard of the term “MapReduce” thrown around a bunch. Maybe your colleagues are talking more about it. Maybe it’s showing up a lot more on resumes from young candidates. Maybe it was mentioned as a solution on a project.
So what is it? And how can you get in on the action?

MapReduce is simply a programming paradigm that allows for massively parallel processing. As the name implies, it has a Map() part that does filtering and sorting followed by a Reduce() part that summarizes. Because these two parts are distinct from one another and even from other instances of themselves, they can be executed independently and parallelized across different machines. Here’s the wiki definition.

MapReduce itself is not a library or piece of code; it’s just a programming paradigm or way of programming. Google started it and has their own implementation. The one that commoners like you and I can use is the open source Hadoop project.

But what if you didn’t want to set up your own cluster of Hadoop nodes?
Well, that’s where Elastic MapReduce comes to play. Amazon makes setting up and running your MapReduce jobs very easily (after you spent many hours learning their environment that is). But that’s where I can help.

In this next few posts, I will show you
– part I: how to implement a mapreduce job
– part II: set up your S3 cloud storage for the input,output, and code
– part III: set up your Hadoop environment and run your mapreduce job

Tagged , ,