pastiaro.wordpress.com
@rpastia
Building a connector – The Wrong Way 
Mapper 
Reducer
Building a connector – The Right Way 
Mapper
Partitioner
Reducer
Input Format
Input Split
Record Reader
Output Format
Record Writer
The InputFormat: From Input to Mapper 
--range 2014-09-01;2014-09-20 
--number_of_mappers 4 
Input: 2014-09-01, 2014-09-02, 2014-09-03, 2014-09-04, 2014-09-05, 2014-09-06, …, 2014-09-20
Input Split 1: 2014-09-01, 2014-09-02, ..., 2014-09-05
Record Reader 1:
(2014-09-01-A; record A)
(2014-09-01-B; record B)
(2014-09-01-…; record …)
(2014-09-02-A; record A)
(2014-09-02-B; record B)
(2014-09-02-…; record …)
(2014-09-05-A; record A)
(2014-09-05-B; record B)
(2014-09-05-…; record …)
Mapper
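The flow on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Couchdoop's code and not the Hadoop API: the function names (`date_range`, `make_splits`, `read_records`) and the `fake_fetch` lookup are hypothetical, and the even, contiguous division of dates among mappers is an assumption read off the slide (20 dates, 4 mappers, split 1 covering 2014-09-01 to 2014-09-05).

```python
from datetime import date, timedelta


def date_range(start, end):
    """Return every date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def make_splits(dates, number_of_mappers):
    """Divide the dates into at most number_of_mappers contiguous splits."""
    base, extra = divmod(len(dates), number_of_mappers)
    splits, pos = [], 0
    for i in range(number_of_mappers):
        # The first `extra` splits take one more date when the division is uneven.
        size = base + (1 if i < extra else 0)
        if size:
            splits.append(dates[pos:pos + size])
        pos += size
    return splits


def read_records(split, fetch):
    """Play the record reader's role: yield (key; record) pairs for one split."""
    for day in split:
        for suffix, record in fetch(day):
            yield f"{day.isoformat()}-{suffix}", record


def fake_fetch(day):
    """Hypothetical stand-in for the data store lookup: two records per date."""
    return [("A", "record A"), ("B", "record B")]


# --range 2014-09-01;2014-09-20 --number_of_mappers 4
dates = date_range(date(2014, 9, 1), date(2014, 9, 20))
splits = make_splits(dates, 4)
first_pairs = list(read_records(splits[0], fake_fetch))

print(len(splits), splits[0][0], splits[0][-1])
print(first_pairs[:2])
```

Each split is handed to its own record reader, so the mapper only ever sees (key; record) pairs and never needs to know how the date range was divided.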
Radu Pastia - Couchdoop - Connecting Hadoop with Couchbase