Fun with Elastic MapReduce (intro)

You’ve probably heard the term “MapReduce” thrown around a lot. Maybe your colleagues are talking about it more. Maybe it’s showing up more often on resumes from young candidates. Maybe it was mentioned as a solution on a project.
So what is it? And how can you get in on the action?

MapReduce is simply a programming paradigm that allows for massively parallel processing. As the name implies, it has a Map() part that does filtering and sorting, followed by a Reduce() part that summarizes. Because these two parts are distinct from one another, and each instance of a part is independent of the other instances, they can be executed independently and parallelized across different machines. Here’s the wiki definition.
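To make the Map() and Reduce() split concrete, here is a minimal single-machine sketch of the classic word count in Python. It is only an illustration of the paradigm, not how Hadoop or Elastic MapReduce is actually invoked; the function names (map_words, shuffle, reduce_counts, word_count) are made up for this example, and the shuffle step stands in for the grouping that a real framework does between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: summarize all the values emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    # Each line is mapped on its own, so these calls could run on different machines.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    # Each word is reduced on its own, so these calls could be spread out as well.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Notice that map_words only ever sees one line and reduce_counts only ever sees one word. That independence is what lets a real cluster hand the pieces to different machines.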

MapReduce itself is not a library or piece of code; it’s just a programming paradigm, a way of programming. Google started it and has its own implementation. The one that commoners like you and me can use is the open source Hadoop project.

But what if you don’t want to set up your own cluster of Hadoop nodes?
Well, that’s where Elastic MapReduce comes into play. Amazon makes setting up and running your MapReduce jobs very easy (after you’ve spent many hours learning their environment, that is). But that’s where I can help.

In the next few posts, I will show you
– part I: how to implement a MapReduce job
– part II: how to set up your S3 cloud storage for the input, output, and code
– part III: how to set up your Hadoop environment and run your MapReduce job
