This presentation was made during the HUG London Meetup: SQL and NoSQL on Hadoop – A look at performance.
Speakers: Alex Bordei, Techie Product Manager at Bigstep; Calin Burloiu, Big Data Engineer at Avira; and Radu Pastia, Big Data Team Leader at Avira.
We worked with Avira to show how much throughput can be squeezed from a Hadoop connector. Together we benchmarked Couchdoop for performance and talked about the behavior you can expect and the tweaks that can improve the performance of your big data setup.
If you have any questions, we will be glad to provide you with any additional information.
6. Building a connector – The Right Way
[Diagram of the MapReduce pipeline: InputFormat (InputSplit, RecordReader) → Mapper → Partitioner → Reducer → OutputFormat (RecordWriter)]
10. The InputFormat: From Input to Mapper
[Diagram: importing with --range 2014-09-01;2014-09-20 and --number_of_mappers 4. The range is expanded into the individual days 2014-09-01 through 2014-09-20; Input Split 1 covers 2014-09-01 through 2014-09-05; Record Reader 1 reads each day's records in turn and emits (date-key; record) pairs to the Mapper.]
16. The InputFormat: From Input to Mapper
[Same diagram as slide 10, repeated.]
Editor's Notes
Hi guys! My name is Radu and I would like to show you how to write a connector for Hadoop in MapReduce, and how easy it actually is.
Quickly about myself. I am a software developer in the Big Data team at Orange Romania (just started, actually) and I have been working with Hadoop for about two years. Before this I was working with backends, data processing, and batch jobs, and my passion for this kind of thing is what eventually got me to Hadoop.
Now let’s jump straight in: why are Hadoop connectors an important topic? Because Hadoop is very often paired with another system that is better suited for real-time operations, and no matter the setup, you will eventually need to transfer data between the two. We’ll use MapReduce to do this in an optimal way. Let’s start!
First, avoid the pitfalls! You might be tempted to connect to the other system || either in the mapper || or in the reducer, since you’re already handling the data within these objects. || This is not a good idea!
These classes are not supposed to handle IO by themselves. If you do this, you will lose some of the features that come straight out of the MapReduce framework, the classes will be harder to test, and the code will be less reusable.
So then, how do you build a Hadoop connector in MapReduce? What else is there besides the Mapper and the Reducer? We have the InputFormat with its InputSplit and RecordReader; we have the Partitioner; and we have the OutputFormat with its RecordWriter. I’m pretty sure the colors already gave it away: we’ll use the InputFormat to import data and the OutputFormat to export it. And now I’ll show you how to do each one.
Let’s start with importing. Our data source will probably be some type of NoSQL DB, so the first thing we’ll have to do is think about how to find all the data, all the keys, and how to partition them so that we can query different partitions in parallel. There are several ways to do this, and in the next slides I’ll assume that our data store allows us to easily get all records from a given date. Next we’ll need to define our configuration parameters and make sure we get them into the Configuration object. Finally we’ll have to implement the InputFormat with the InputSplit and the RecordReader.
About configuration parameters. We can of course use the Hadoop ToolRunner class to handle them, but I recommend you also check out the Apache Commons CLI library because it provides a few nice extra features. You end up with a command line like this; notice how we’re importing 20 days’ worth of data and specifying 4 mappers. This means we’ll get four processes importing the data in parallel.
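As a rough sketch, the option parsing with Commons CLI could look something like the following. The option names match the command line above; the helper class and the Configuration property names ("import.range", "import.mappers") are made up for this example.

```java
// Sketch: parse the import options with Apache Commons CLI and hand them to
// the job through the Hadoop Configuration.
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;

public class ImportOptions {
  public static void parseInto(String[] args, Configuration conf) throws ParseException {
    Options options = new Options();
    options.addOption(Option.builder().longOpt("range").hasArg().required()
        .desc("date range to import, e.g. 2014-09-01;2014-09-20").build());
    options.addOption(Option.builder().longOpt("number_of_mappers").hasArg()
        .desc("number of parallel map tasks").build());

    CommandLine cmd = new DefaultParser().parse(options, args);
    conf.set("import.range", cmd.getOptionValue("range"));
    conf.setInt("import.mappers",
        Integer.parseInt(cmd.getOptionValue("number_of_mappers", "4")));
  }
}
```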
The first class that we are going to look at is the InputFormat. We can have it split our input data into as many Input Splits as we want. Then, we’ll use it to create Record Readers that actually connect to the data source and provide us with the data.
Let’s see how this whole process works. So we are importing 20 days of data. || First, our range is expanded to the actual days. || Now, we want 4 mappers, that means 4 input splits, so we’ll create input splits with 5 days each. || Next, each input split will get a record reader that reads in succession each record from each of the five days. || Finally, the mapper is called for each record.
Here’s what the code looks like. We’ll extend the base Hadoop class and override the getSplits method. Inside, we create InputSplit objects, set the partitions in each one, and add them to the list of InputSplits until we’ve covered all partitions.
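A minimal sketch of such a getSplits implementation, assuming day-based partitions; the class names (DateInputFormat, DateInputSplit, DateRecordReader) and property names are hypothetical, not the actual connector code:

```java
// Sketch of the InputFormat: expand the configured date range into one
// partition per day and group the days into as many InputSplits as there
// are requested mappers.
import java.io.IOException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DateInputFormat extends InputFormat<Text, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    Configuration conf = context.getConfiguration();
    List<String> days = expandRange(conf.get("import.range"));
    int numMappers = conf.getInt("import.mappers", 1);

    // e.g. 20 days and 4 mappers -> 4 splits of 5 days each
    int perSplit = (int) Math.ceil((double) days.size() / numMappers);
    List<InputSplit> splits = new ArrayList<>();
    for (int i = 0; i < days.size(); i += perSplit) {
      DateInputSplit split = new DateInputSplit();
      split.setDataPartitions(
          new ArrayList<>(days.subList(i, Math.min(i + perSplit, days.size()))));
      splits.add(split);
    }
    return splits;
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new DateRecordReader(); // sketched further below
  }

  // Turn "2014-09-01;2014-09-20" into the list of days in between, inclusive.
  private static List<String> expandRange(String range) {
    String[] parts = range.split(";");
    List<String> days = new ArrayList<>();
    for (LocalDate d = LocalDate.parse(parts[0]);
         !d.isAfter(LocalDate.parse(parts[1])); d = d.plusDays(1)) {
      days.add(d.toString());
    }
    return days;
  }
}
```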
Next, the InputSplit class. Once all input splits have been constructed, map tasks are fired up throughout the Hadoop cluster and each task gets an input split. This is why the InputSplits need to be serialized, so we need to implement the Writable interface. We’ll need to store the data partitions, and we’ll need to override four methods of the base class: getLength, getLocations and two more for the serialization.
Let’s look at an example implementation to make things clearer. Storing the dataPartitions: we use an ArrayList as a class member, with a proper setter and getter. The length reported to the framework can be the size of this array if we can’t otherwise determine the precise data size. The getLocations method is used by the framework to select where to run tasks so that data locality is achieved. If the data store we’re connecting to is on a different cluster, as it usually is, we’ll simply return an empty array.
Last, the serialization. This can be easily implemented by leveraging the Writable classes built into Hadoop. Here, we load our data partitions into an ArrayWritable and call the write method on that object. Similarly, we deserialize the data by calling the readFields method on a new ArrayWritable object.
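Putting those pieces together, a sketch of the custom InputSplit might look like this; the DateInputSplit name is hypothetical, and the serialization follows the ArrayWritable approach just described:

```java
// Sketch of the custom InputSplit: stores the data partitions (days),
// reports a rough length and no locality hints, and serializes itself
// through Hadoop's ArrayWritable.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DateInputSplit extends InputSplit implements Writable {
  private List<String> dataPartitions = new ArrayList<>();

  public List<String> getDataPartitions() { return dataPartitions; }
  public void setDataPartitions(List<String> partitions) { this.dataPartitions = partitions; }

  @Override
  public long getLength() {
    // We cannot know the real data size up front, so report the partition count.
    return dataPartitions.size();
  }

  @Override
  public String[] getLocations() {
    // The data store lives on a different cluster, so no locality hints.
    return new String[0];
  }

  @Override
  public void write(DataOutput out) throws IOException {
    Text[] values = new Text[dataPartitions.size()];
    for (int i = 0; i < values.length; i++) {
      values[i] = new Text(dataPartitions.get(i));
    }
    new ArrayWritable(Text.class, values).write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    ArrayWritable array = new ArrayWritable(Text.class);
    array.readFields(in);
    dataPartitions = new ArrayList<>();
    for (String day : array.toStrings()) {
      dataPartitions.add(day);
    }
  }
}
```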
To finish our import, the last piece we need is the RecordReader. This is where we’ll actually connect to the data source. We can override the initialize method to fire up our database client and to load the partitions from the input split. Here, we create a queue out of all the partitions, which will then be queried one after the other. Then we have to override the method that iterates over the data and the methods that make it available to the mapper; this is pretty straightforward and specific to the data source, so we won’t go into more detail.
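A rough sketch of such a RecordReader, with the actual data-store query left as a placeholder since it depends entirely on the database being used:

```java
// Sketch of the RecordReader: initialize() queues the partitions from the
// split; nextKeyValue() walks through each partition's records in turn.
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Iterator;
import java.util.Queue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DateRecordReader extends RecordReader<Text, Text> {
  private Queue<String> partitions;
  private Iterator<String[]> currentRecords; // (key, value) pairs from the store
  private final Text currentKey = new Text();
  private final Text currentValue = new Text();
  private int totalPartitions;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    DateInputSplit dateSplit = (DateInputSplit) split;
    partitions = new ArrayDeque<>(dateSplit.getDataPartitions());
    totalPartitions = partitions.size();
    // A real implementation would also open the database client here.
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    while (currentRecords == null || !currentRecords.hasNext()) {
      String day = partitions.poll();
      if (day == null) {
        return false; // all partitions consumed
      }
      currentRecords = queryPartition(day);
    }
    String[] record = currentRecords.next();
    currentKey.set(record[0]);
    currentValue.set(record[1]);
    return true;
  }

  @Override public Text getCurrentKey() { return currentKey; }
  @Override public Text getCurrentValue() { return currentValue; }

  @Override
  public float getProgress() {
    return totalPartitions == 0 ? 1f
        : (totalPartitions - partitions.size()) / (float) totalPartitions;
  }

  @Override public void close() { /* close the database client here */ }

  // Placeholder: the real connector would query the data store for one day's
  // records here and return them as (key, value) pairs.
  private Iterator<String[]> queryPartition(String day) throws IOException {
    return Collections.emptyIterator();
  }
}
```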
That’s it! We now have our data reaching the mapper. Here, we can use a simple identity mapper to save the unaltered data or we can even run a full MapReduce job before sending it to the output.
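For illustration, a map-only import driver relying on the identity behaviour of the base Mapper could be wired up roughly like this; class names refer to the sketches above and the output path is purely illustrative:

```java
// Minimal driver sketch: identity mapper, zero reducers, custom InputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImportDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ImportOptions.parseInto(args, conf);        // from the CLI sketch above

    Job job = Job.getInstance(conf, "nosql-import");
    job.setJarByClass(ImportDriver.class);
    job.setInputFormatClass(DateInputFormat.class);
    job.setMapperClass(Mapper.class);           // identity mapper
    job.setNumReduceTasks(0);                   // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Illustrative output location; a custom OutputFormat could be used instead.
    FileOutputFormat.setOutputPath(job, new Path("imported-data"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```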
A few words on exporting. This is done in a very similar way to the import, but it’s even simpler. Still, specific to the export is that we must first decide what operation we want to perform on the data we are exporting. We can of course simply add, or store, the data, but we could also replace or even delete existing records. When exporting we don’t have to deal with splits anymore, so we just have to implement an OutputFormat and a RecordWriter.
The OutputFormat simply provides a RecordWriter. Since this class does not have an initialize method, we’ll use the constructor to connect to the data store. Then, the write method can be used to perform the operations on the data.
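A sketch of that export side, with the data-store client and the actual store/replace/delete logic left as placeholders; the class and property names are hypothetical:

```java
// Sketch of the export: the OutputFormat only hands out a RecordWriter, and
// the RecordWriter's constructor is where the data-store connection would be
// opened (there is no initialize method on RecordWriter).
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class StoreOutputFormat extends OutputFormat<Text, Text> {
  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) {
    return new StoreRecordWriter(
        context.getConfiguration().get("export.operation", "store"));
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // Nothing to validate for an external data store in this sketch.
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException {
    // Borrow a no-op committer; the external store needs no commit step here.
    return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
  }

  public static class StoreRecordWriter extends RecordWriter<Text, Text> {
    private final String operation;

    StoreRecordWriter(String operation) {
      this.operation = operation;
      // A real implementation would open the database client here.
    }

    @Override
    public void write(Text key, Text value) throws IOException {
      // Placeholder: perform the chosen operation (store / replace / delete)
      // on the record via the database client.
      System.out.printf("%s %s -> %s%n", operation, key, value);
    }

    @Override
    public void close(TaskAttemptContext context) {
      // Close the database client here.
    }
  }
}
```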
That’s all there is to exporting! Now that we know how to implement both import and export we could even use them together in the same job to move data between two different databases.