Fun with MapReduce (part III)

In this post, we walk through the steps of running your MapReduce job on Amazon’s Elastic MapReduce.

Let’s recap.
In part I, you wrote your Java MapReduce job.
In part II, you set up your S3 cloud storage and uploaded your input file and MapReduce jar.
Now, let’s run it.

Step 1
Go to your Elastic MapReduce console and click the “Create New Job Flow” button

Step 2
On the first page of the dialog that comes up, enter a job name, leave the Hadoop Version at the default “Amazon Distribution”, and select “Run your own application”.
Then select “Custom JAR” from the dropdown.

Step 3
On the 2nd page of the screen, enter the following

JAR Location: this should be of the form “bucket/job/your.jar”
Leave out the “s3n://” protocol.

JAR Arguments: the first argument should be the class with the main() method that you’re running.
In our case, it would be “WordCount”. If we had put it under a package, say “com.kodingnotes”, then the first argument would be “com.kodingnotes.WordCount”.
You may supply additional arguments, and they will be passed to main(String[] args).
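To make the argument handling concrete, here is a minimal sketch of what the receiving end might look like. It assumes the job takes exactly two extra arguments, an input path and an output path, and that the class name you typed first is consumed by the launcher rather than handed to main(). Both are assumptions on my part, and WordCountArgs and JobPaths are hypothetical names, not the driver from part I.

```java
import java.util.Arrays;

// Hypothetical driver skeleton: only the argument handling is shown,
// the actual Hadoop job setup from part I is omitted.
public class WordCountArgs {

    // Holds the two paths this sketch assumes are passed as extra JAR arguments.
    static final class JobPaths {
        final String input;
        final String output;

        JobPaths(String input, String output) {
            this.input = input;
            this.output = output;
        }
    }

    // Validates the arguments that reach main(String[] args).
    static JobPaths parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "Expected <input path> <output path>, got " + Arrays.toString(args));
        }
        return new JobPaths(args[0], args[1]);
    }

    public static void main(String[] args) {
        JobPaths paths = parse(args);
        System.out.println("input  = " + paths.input);
        System.out.println("output = " + paths.output);
    }
}
```

Failing fast on a wrong argument count is worth the few extra lines, since a typo in the JAR Arguments field otherwise only shows up after the instances have started.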

*Note: this step differs slightly from the video I linked to in part II and link to again at the end of this post. In the video, the first argument is not the name of the MapReduce job class to run. They must have specified the class in the jar’s Manifest file, although the video never mentions it. For clarity, specify the class name first.
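If you are not sure whether your own jar names a main class in its Manifest, you can check locally before uploading. The sketch below uses the standard java.util.jar classes to read the Main-Class entry; MainClassCheck is a hypothetical helper name, and the advice it prints simply restates the note above.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class MainClassCheck {

    // Returns the Main-Class manifest entry of the jar, or null if there is none.
    static String mainClassOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null; // jar has no manifest at all
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the local path to the jar you plan to upload.
        String mainClass = mainClassOf(args[0]);
        System.out.println(mainClass == null
            ? "No Main-Class entry: pass the class name as the first JAR argument"
            : "Main-Class: " + mainClass);
    }
}
```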

Step 4
First select the size of the Master instance machine. This machine will direct the work of the Mappers and Reducers. We can leave the default “Small (m1.small)” instance for our small job.

Next, select the number and types of the Core instance machines. These are the actual instances that will run the Mappers and Reducers. You will want to start an appropriate number for your job. The larger (and more distributed) the job, the more instances you will need, and possibly larger instance sizes. That’s really all I can tell you until I get more experience with larger jobs myself.

Leave the Task instance count as 0.

Step 5
You can specify an EC2 key pair (if you have already created one), but you only need it if you want to be able to ssh into the machine while it’s running (or afterwards, if you choose to keep it alive). I didn’t see any need for that, so I did not select a key pair.

For VPC Subnet Id, just leave the default.

In the S3 Log path, specify your path to the log folder you created in your bucket. Here, you’ll need to specify the “s3n://” protocol.
So your entry might look like “s3n://bucket-of-awesome-posts/logs/”
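The two S3 fields in this walkthrough want different forms of the same kind of path, which is easy to mix up: JAR Location without the protocol, the log path with it. Here is a small sketch that normalizes a path into each form. S3Paths is a hypothetical helper, the jar path in main() is made up by combining the examples above, and the trailing slash on the log path is just copied from my example entry rather than something I know to be required.

```java
public class S3Paths {

    static final String SCHEME = "s3n://";

    // Form used for the JAR Location field in this walkthrough: no protocol.
    static String withoutScheme(String path) {
        return path.startsWith(SCHEME) ? path.substring(SCHEME.length()) : path;
    }

    // Form used for the S3 Log path field: protocol present, trailing slash.
    static String asLogPath(String path) {
        String full = path.startsWith(SCHEME) ? path : SCHEME + path;
        return full.endsWith("/") ? full : full + "/";
    }

    public static void main(String[] args) {
        // Hypothetical paths for illustration only.
        System.out.println(withoutScheme("s3n://bucket-of-awesome-posts/job/your.jar"));
        System.out.println(asLogPath("bucket-of-awesome-posts/logs"));
    }
}
```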

Don’t Enable Debugging.

Don’t Keep Alive (unless you want to, but you’ll be charged for it as long as it’s alive).

Step 6
No Bootstrap Actions.
(easy enough right?)

Step 7
Confirm and create the job flow!
Now your job is running!

You can monitor the progress of the job in the Elastic MapReduce console. Once it’s done, check out your S3 bucket’s result folder.


Oh, and here’s that video again. This post covers the video from about the 4:30 mark onward.

