You’ve probably heard the term “MapReduce” thrown around a bunch. Maybe your colleagues are talking about it more. Maybe it’s showing up a lot more on resumes from young candidates. Maybe it was pitched as a solution on a project.
So what is it? And how can you get in on the action?
MapReduce is simply a programming paradigm that allows for massively parallel processing. As the name implies, it has a Map() part that does filtering and sorting, followed by a Reduce() part that summarizes. Because these two parts are distinct from each other, and even from other instances of themselves, they can be executed independently and parallelized across different machines. Here’s the wiki definition.
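To make that concrete, here’s a minimal word-count sketch in plain Python (no Hadoop required). The map step emits (word, 1) pairs, the pairs get sorted and grouped by word, and the reduce step sums each word’s counts. The function names here are my own, just for illustration; a real Hadoop job would spread the map and reduce calls across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Emit a (word, 1) pair for every word. Each document can be
    # mapped independently, which is what makes the work parallelizable.
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(word, counts):
    # Summarize all counts for one word. Each word's group can be
    # reduced independently of every other word's.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: run over every document (on a cluster, these run on different machines).
pairs = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle/sort: group the pairs by word so each reduce call sees one word's values.
pairs.sort(key=itemgetter(0))

# Reduce: one call per distinct word.
counts = dict(reduce_phase(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0)))

print(counts)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Notice that nothing in `map_phase` depends on any other document, and nothing in `reduce_phase` depends on any other word; that independence is exactly what lets the framework scale the job out.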
MapReduce itself is not a library or piece of code; it’s just a programming paradigm, a way of programming. Google started it and has its own implementation. The one that commoners like you and me can use is the open source Hadoop project.
But what if you didn’t want to set up your own cluster of Hadoop nodes?
Well, that’s where Elastic MapReduce comes into play. Amazon makes setting up and running your MapReduce jobs very easy (after you’ve spent many hours learning their environment, that is). But that’s where I can help.
In the next few posts, I will show you:
– part I: how to implement a MapReduce job
– part II: how to set up your S3 cloud storage for the input, output, and code
– part III: how to set up your Hadoop environment and run your MapReduce job