Building a Hadoop Connector
pastiaro.wordpress.com 
@rpastia
Building a connector – The Wrong Way
Mapper → Reducer

Building a connector – The Right Way
InputFormat (InputSplit, RecordReader) → Mapper → Partitioner → Reducer → OutputFormat (RecordWriter)
The InputFormat: From Input to Mapper
--range 2014-09-01;2014-09-20 --number_of_mappers 4

The range expands into the individual days 2014-09-01 through 2014-09-20. With four mappers requested, the days are dealt out into four input splits of five days each.

Input Split 1: 2014-09-01, 2014-09-02, …, 2014-09-05

Record Reader 1 reads the days of its split in succession and emits one key-value pair per record:
(2014-09-01-A; record A), (2014-09-01-B; record B), (2014-09-01-…; record …)
(2014-09-02-A; record A), (2014-09-02-B; record B), (2014-09-02-…; record …)
…
(2014-09-05-A; record A), (2014-09-05-B; record B), (2014-09-05-…; record …)

Each pair is then handed to the Mapper.
Editor's Notes

  1. Hi guys! My name is Radu and I would like to show you how to write a connector for Hadoop in MapReduce, and how easy it actually is.
  2. Quickly about myself. I am a software developer in the Big Data team at Orange Romania (just started, actually) and I have been working with Hadoop for about two years. Before this I worked on backends, data processing and batch jobs, and my passion for this kind of work is what eventually led me to Hadoop.
  3. Now let’s jump straight in: why do Hadoop connectors matter? Because Hadoop is very often paired with another system that is better suited for real-time operations, and no matter the setup you will eventually need to transfer data between the two. We’ll use MapReduce to do this in an optimal way. Let’s start!
  4. First, avoid the pitfalls! You might be tempted to connect to the other system either in the mapper or in the reducer, since you’re already handling the data within these objects. This is not a good idea!
  5. These classes are not supposed to handle IO by themselves. If you do this, you will lose some of the features that come straight out of the MapReduce framework, the classes will be harder to test, and the code will be less reusable.
  6. So then, how do you build a Hadoop connector in MapReduce? What else is there besides the Mapper and the Reducer? We have the InputFormat with its InputSplit and RecordReader; we have the Partitioner; and we have the OutputFormat with its RecordWriter. I’m pretty sure the colors already gave it away: we’ll use the InputFormat to import data and the OutputFormat to export it. Now I’ll show you how to build each one, starting with how the pieces plug into a job (see the driver sketch below).
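The deck itself doesn’t show a driver, so here is a rough sketch of how these pieces could be wired together. DateRangeInputFormat and NoSqlOutputFormat are hypothetical names for the connector classes sketched under the later notes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConnectorJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "nosql-connector");
    job.setJarByClass(ConnectorJob.class);

    // The connector logic lives in the formats, not in the mapper or reducer.
    job.setInputFormatClass(DateRangeInputFormat.class);  // hypothetical, sketched below
    job.setOutputFormatClass(NoSqlOutputFormat.class);    // hypothetical, sketched below

    // The base Mapper class is an identity mapper: data passes through unaltered.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```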
  7. Let’s start with importing. Our data source will probably be some type of NoSQL DB, so the first thing we’ll have to do is think about how to find all the data, all the keys, and how to partition them so that we can query different partitions in parallel. There are several ways to do this; in the next slides I’ll assume that our data store allows us to easily get all records for a given date. Next we’ll need to define our configuration parameters and make sure we get them into the Configuration object. Finally we’ll have to implement the InputFormat with the InputSplit and the RecordReader.
  8. About configuration parameters. We can of course use the Hadoop ToolRunner class to handle them, but I recommend you also check out the Apache Commons CLI library because it provides a few nice extra features. You end up with a command line like this; notice how we’re importing 20 days’ worth of data and specifying 4 mappers. This means we’ll get four processes importing the data in parallel. A sketch of the parsing step follows.
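A minimal sketch of that parsing step, assuming Commons CLI 1.3+. The connector.range and connector.mappers property names are made up for this example; the InputFormat sketched later reads the same keys:

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;

public class ConnectorOptions {
  public static Configuration parse(String[] args) throws Exception {
    Options options = new Options();
    options.addOption(Option.builder().longOpt("range").hasArg().required()
        .desc("date range, e.g. 2014-09-01;2014-09-20").build());
    options.addOption(Option.builder().longOpt("number_of_mappers").hasArg()
        .desc("how many import tasks run in parallel").build());

    CommandLine cmd = new DefaultParser().parse(options, args);

    // Hand the parsed values to the tasks through the Configuration object.
    // The property names are arbitrary; the InputFormat must use the same ones.
    Configuration conf = new Configuration();
    conf.set("connector.range", cmd.getOptionValue("range"));
    conf.setInt("connector.mappers",
        Integer.parseInt(cmd.getOptionValue("number_of_mappers", "1")));
    return conf;
  }
}
```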
  9. The first class that we are going to look at is the InputFormat. We can have it split our input data into as many Input Splits as we want. Then, we’ll use it to create Record Readers that actually connect to the data source and provide us with the data.
  10. Let’s see how this whole process works. We are importing 20 days of data. First, our range is expanded into the actual days. We want 4 mappers, which means 4 input splits, so we’ll create input splits with 5 days each. Next, each input split gets a record reader that reads each record from each of its five days in succession. Finally, the mapper is called for each record.
  11. Here’s what the code looks like. We’ll extend the base Hadoop class and override the getSplits method. Inside we create InputSplit objects, set the partitions in each one and then add them to the list of InputSplits, until we’ve covered all partitions.
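The slide’s code isn’t preserved in this transcript, so here is a rough reconstruction of what such an InputFormat could look like. It uses the hypothetical DateRangeSplit class described in the next notes and the configuration keys from the CLI sketch above:

```java
import java.io.IOException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DateRangeInputFormat extends InputFormat<Text, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    String[] range = context.getConfiguration().get("connector.range").split(";");
    int mappers = context.getConfiguration().getInt("connector.mappers", 1);

    // Expand the range into individual days: these are our partitions.
    List<String> days = new ArrayList<>();
    for (LocalDate d = LocalDate.parse(range[0]);
         !d.isAfter(LocalDate.parse(range[1])); d = d.plusDays(1)) {
      days.add(d.toString());
    }

    // One split per mapper, each holding a contiguous chunk of days.
    int perSplit = (days.size() + mappers - 1) / mappers;  // ceiling division
    List<InputSplit> splits = new ArrayList<>();
    for (int start = 0; start < days.size(); start += perSplit) {
      DateRangeSplit split = new DateRangeSplit();
      split.setPartitions(new ArrayList<>(
          days.subList(start, Math.min(start + perSplit, days.size()))));
      splits.add(split);
    }
    return splits;
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new DateRangeRecordReader();  // sketched under note 15
  }
}
```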
  12. Next, the InputSplit class. Once all input splits have been constructed, map tasks are fired up throughout the Hadoop cluster and each task gets an input split. The InputSplits therefore need to be serializable, so we implement the Writable interface. We’ll need to store the data partitions, and we’ll need to override four methods of the base class: getLength, getLocations and two more for the serialization.
  13. Let’s look at an example implementation to make things clearer. Storing the dataPartitions: we use an ArrayList as a class member, with a proper setter and getter. The length reported to the framework can be the size of this array if we can’t otherwise determine the precise data size. The getLocations method is used by the framework to select where to run tasks so that data locality is achieved. If the data store we’re connecting to is on a different cluster, as it usually is, we simply return an empty array.
  14. Last, the serialization. This is easily implemented by leveraging the Writable classes built into Hadoop. Here, we load our data partitions into an ArrayWritable and call the write method on that object. Similarly, we deserialize by calling the readFields method on a new ArrayWritable object.
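Putting notes 12 through 14 together, a sketch of the whole split class might look like this (DateRangeSplit is a hypothetical name):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class DateRangeSplit extends InputSplit implements Writable {

  private List<String> partitions = new ArrayList<>();

  public void setPartitions(List<String> partitions) { this.partitions = partitions; }
  public List<String> getPartitions() { return partitions; }

  // We can't easily know the real byte size up front, so report the
  // number of partitions instead.
  @Override
  public long getLength() { return partitions.size(); }

  // The data store lives on a different cluster, so there is no data
  // locality to exploit: return an empty array.
  @Override
  public String[] getLocations() { return new String[0]; }

  // Serialization: lean on Hadoop's built-in Writable classes.
  @Override
  public void write(DataOutput out) throws IOException {
    new ArrayWritable(partitions.toArray(new String[0])).write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    ArrayWritable array = new ArrayWritable(Text.class);
    array.readFields(in);
    partitions = new ArrayList<>();
    for (String day : array.toStrings()) {
      partitions.add(day);
    }
  }
}
```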
  15. To finish our import, the last piece we need is the RecordReader. This is where we actually connect to the data source. We can override the initialize method to fire up our database client and to load the partitions from the input split. Here, we create a queue of all the partitions, which will then be queried one after the other. Then we have to override the method that iterates over the data and the methods that make it available to the mapper; this is pretty straightforward and specific to the data source, so we won’t go into more detail beyond the skeleton below.
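A skeleton of that reader. The actual data-source call (queryDay) is left as a hypothetical stub since it depends entirely on your store:

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Iterator;
import java.util.Queue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DateRangeRecordReader extends RecordReader<Text, Text> {

  private Queue<String> partitions;    // days still waiting to be queried
  private Iterator<String[]> records;  // records of the day currently being read
  private int totalPartitions;
  private final Text key = new Text();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    // Fire up the database client here, e.g.:
    // client = new NoSqlClient(context.getConfiguration());  // hypothetical
    partitions = new ArrayDeque<>(((DateRangeSplit) split).getPartitions());
    totalPartitions = partitions.size();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    // Drain the current day, then move on to the next one in the queue.
    while (records == null || !records.hasNext()) {
      if (partitions.isEmpty()) {
        return false;  // all partitions consumed
      }
      records = queryDay(partitions.poll());
    }
    String[] record = records.next();  // {id, payload}
    key.set(record[0]);
    value.set(record[1]);
    return true;
  }

  // Hypothetical data-source call; entirely specific to your store.
  private Iterator<String[]> queryDay(String day) {
    return Collections.emptyIterator();  // replace with a real query
  }

  @Override
  public Text getCurrentKey() { return key; }

  @Override
  public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    if (totalPartitions == 0) return 1f;
    return (totalPartitions - partitions.size()) / (float) totalPartitions;
  }

  @Override
  public void close() {
    // client.close();  // hypothetical
  }
}
```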
  16. That’s it! We now have our data reaching the mapper. Here, we can use a simple identity mapper to save the unaltered data or we can even run a full MapReduce job before sending it to the output.
  17. A few words on exporting. This is done in a very similar way to the import, but it’s even simpler. Specific to the export, though, is that we must first decide what operation we want to perform on the data we are exporting. We can of course simply add, or store, the data, but we could also replace or even delete existing records. When exporting we don’t have to deal with splits anymore, so we just have to implement an OutputFormat and a RecordWriter.
  18. The OutputFormat simply provides a RecordWriter. Since this class does not have an initialize method, we’ll use the constructor to connect to the data store. Then, the write method can be used to perform the operations on the data; a sketch of both classes follows.
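A sketch of both classes together, again with hypothetical NoSql* names and the actual client calls left as comments. The no-op committer is an assumption that fits a connector writing straight to an external store:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoSqlOutputFormat extends OutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) {
    return new NoSqlRecordWriter(context);
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // Validate configuration here (e.g. that connection settings exist).
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    // We write straight to the external store, so a do-nothing committer is fine.
    return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
  }

  public static class NoSqlRecordWriter extends RecordWriter<Text, Text> {

    NoSqlRecordWriter(TaskAttemptContext context) {
      // RecordWriter has no initialize(): open the DB connection here.
      // client = new NoSqlClient(context.getConfiguration());  // hypothetical
    }

    @Override
    public void write(Text key, Text value) throws IOException {
      // Perform the chosen operation: insert, replace or delete.
      // client.put(key.toString(), value.toString());  // hypothetical
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
      // client.close();  // hypothetical
    }
  }
}
```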
  19. That’s all there is to exporting! Now that we know how to implement both import and export, we could even use them together in the same job to move data between two different databases.