SlideShare a Scribd company logo
Ohai Hadoop!
Build your first MapReduce with
Hadoop & Ruby
Tweet@_swanand
GitHub@swanandp
StackOverflow@18678
Work@Kaverisoft
Make { DispatchTrack }
mailto:swanand@pagnis.
in
Who am I?
Ruby, Coffeescript,
Java, Rails, Sinatra,
Android, TextMate,
Emacs, Minitest,
MySQL, Cassandra,
Hadoop, Mountain
Lion, Curl, Zsh, GMail,
Solarized, Oscar
Wilde, Robert Jordan,
Quentin Tarantino,
Charlize Theron
● MapReduce! Wait, what?
● Enter the Hadoop. *gong*
● Convention over Configuration? You wish.
● Instant Gratification. Now you're talkin'
● Further Reading. Go forth and read!
Tell 'em what you're going to tell 'em
MapReduce! Wait, what?
● Map: Given a set of values (or key-values),
output another set of values (or key-values)
● [K1, V1] -> map -> [K2, V2]
● Map each value into a new value
MapReduce! Wait, what?
● Reduce: Given a set of values for a key,
come up with a summarized version
● K1[V1, V2 ... Vn] -> reduce -> K1[Y]
● Reduce given values into 1 value
MapReduce! Wait, what?
MapReduce! Um.. hmm..
Q: What is the single biggest takeaway from
mapping?
A: Map operation is stateless i.e. one iteration
doesn't depend on previous iteration.
Q: What is the single biggest takeaway from
reducing?
A: Reduce represents an operation for a
particular key.
Enter the Hadoop. *gong*
"The really interesting thing I want you to notice,
here, is that as soon as you think of map and
reduce as functions that everybody can use, and
they use them, you only have to get one
supergenius to write the hard code to run map and
reduce on a global massively parallel array of
computers, and all the old code that used to work
fine when you just ran a loop still works only it's a
zillion times faster which means it can be used to
tackle huge problems in an instant."
- Joel Spolsky
MapReduce! Oh, yeah!
1. Convert raw data into readable format
2. Iterate over data chunks, convert each
chunk into meaningful key, value pairs
3. Do this for all your data using massive
parallelization
4. Group all the keys and their respective
values
5. Take values for a key and convert into
desired meaningful format
6. Step 2 is called mapper
7. Step 5 is called reducer
Enter the Hadoop. *gong*
Same process has now become:
1. Put data into Hadoop
2. Define your mapper
3. Define your reducer
4. Run your jobs
5. Read processed data from Hadoop
Other advantages:
● Encapsulations over common problems
like large files, process management, disk
/ node failure
Top Level Descriptor
job has_many tasks
HDFS Boss core-site.xml
HDFS Slaves slaves
MapReduce Boss mapred-site.xml
MapReduce Slave mapred-site.xml
User's window into Hadoop, through the
command hadoop
Convention over Configuration? You wish.
Job
Task
NameNode
DataNode
JobTracker
TaskTracker
Client
Convention over Configuration? You wish.
● Configuration in XML & Shell scripts. Yuck!
● Respite:
○ Option for specifying a configuration directory
○ Shell script configuration is mostly ENV variables
● Which means:
○ Configuration can be written in YML or JSON or
Ruby and exported in XML
○ ENV variables can be set using rake, thor or just
plain Ruby
● Caveats:
○ No standard wrapper to do this (Go write one!)
Convention over Configuration? You wish.
● Default mappers and reducers are defined in
Java
● Other languages supported using Streaming
API
● Streaming API makes use of STDIN and
STDOUT to read and output data and
executable binaries for processing
● Caveats
○ No dependency management, we are on our own
Instant Gratification. Now you're talkin'
GOAL:
1. Take a couple of books in txt format
2. Find out the total usage of each character in
the english alphabet.
3. Establish that e is the most used.
4. Why this example?
a. Perfect use case for MapReduce.
b. Algorithm is simple.
c. Results are simple to analyze.
d. Txt formatted books are easily available in Project
Gutenberg.
● Official Documentation
● Wiki: http://wiki.apache.org/hadoop/
● Hadoop examples that ship with Hadoop
● http://www.bigfastblog.com/map-reduce-
with-ruby-using-hadoop
● http://www.youtube.com/watch?
v=d2xeNpfzsYI
Further Reading and Watching
Questions?
Thank you!

More Related Content

What's hot

PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018
Quentin Adam
 
SnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for AndroidSnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for Android
nhachicha
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
Jeremy Zawodny
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
Jeremy Zawodny
 
AGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServerAGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServer
Ng'eno Victor
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
andrew311
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2
John Ashmead
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
Ogibayashi
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
Databricks
 
Universal Serverless with AWS Fargate
Universal Serverless with AWS FargateUniversal Serverless with AWS Fargate
Universal Serverless with AWS Fargate
Eka Cahya Pratama
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
Cesar Cardenas Desales
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short version
Alex Pinkin
 
Pre fosdem2020 uber
Pre fosdem2020 uberPre fosdem2020 uber
Pre fosdem2020 uber
Giedrius Jaraminas
 
Monitoring with ElasticSearch
Monitoring with ElasticSearch Monitoring with ElasticSearch
Monitoring with ElasticSearch
Kris Buytaert
 
Event Pipe - Lambda Architecture
Event Pipe - Lambda ArchitectureEvent Pipe - Lambda Architecture
Event Pipe - Lambda Architecture
Bahadir Cambel
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Ontico
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo db
Lawrence Mwai
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
DataWorks Summit
 
Bringing spatial love to your python application
Bringing spatial love to your python applicationBringing spatial love to your python application
Bringing spatial love to your python application
Shekhar Gulati
 
Scylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces WasmScylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces Wasm
ScyllaDB
 

What's hot (20)

PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018
 
SnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for AndroidSnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for Android
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 
AGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServerAGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServer
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Universal Serverless with AWS Fargate
Universal Serverless with AWS FargateUniversal Serverless with AWS Fargate
Universal Serverless with AWS Fargate
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short version
 
Pre fosdem2020 uber
Pre fosdem2020 uberPre fosdem2020 uber
Pre fosdem2020 uber
 
Monitoring with ElasticSearch
Monitoring with ElasticSearch Monitoring with ElasticSearch
Monitoring with ElasticSearch
 
Event Pipe - Lambda Architecture
Event Pipe - Lambda ArchitectureEvent Pipe - Lambda Architecture
Event Pipe - Lambda Architecture
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo db
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
 
Bringing spatial love to your python application
Bringing spatial love to your python applicationBringing spatial love to your python application
Bringing spatial love to your python application
 
Scylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces WasmScylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces Wasm
 

Viewers also liked

Jarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraJarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraKirolPsikologia
 
Intro to Open Cloud Initiative
Intro to Open Cloud InitiativeIntro to Open Cloud Initiative
Intro to Open Cloud InitiativeJohn Mark Walker
 
Lasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraLasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraKirolPsikologia
 
Building Vibrant Open Source Communities
Building Vibrant Open Source CommunitiesBuilding Vibrant Open Source Communities
Building Vibrant Open Source Communities
John Mark Walker
 
Kitasato Flask 97
Kitasato Flask 97Kitasato Flask 97
Kitasato Flask 97
Daimy
 
Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...
John Mark Walker
 
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
John Mark Walker
 
Australia Presentation
Australia PresentationAustralia Presentation
Australia PresentationDaimy
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.org
John Mark Walker
 

Viewers also liked (10)

Jarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraJarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaiera
 
Intro to Open Cloud Initiative
Intro to Open Cloud InitiativeIntro to Open Cloud Initiative
Intro to Open Cloud Initiative
 
Lasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraLasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerra
 
Building Vibrant Open Source Communities
Building Vibrant Open Source CommunitiesBuilding Vibrant Open Source Communities
Building Vibrant Open Source Communities
 
Kitasato Flask 97
Kitasato Flask 97Kitasato Flask 97
Kitasato Flask 97
 
GlusterFS Community Preso
GlusterFS Community PresoGlusterFS Community Preso
GlusterFS Community Preso
 
Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...
 
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
 
Australia Presentation
Australia PresentationAustralia Presentation
Australia Presentation
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.org
 

Similar to MapReduce with Hadoop and Ruby

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataWSO2
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
Naukri.com
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective Audience
Chandra Sekhar
 
New Analytics Toolbox
New Analytics ToolboxNew Analytics Toolbox
New Analytics Toolbox
Robbie Strickland
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
DataStax Academy
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
Big Data Montreal
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
David Lauzon
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
Asad Masood Qazi
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
Holden Karau
 

Similar to MapReduce with Hadoop and Ruby (20)

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of data
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective Audience
 
New Analytics Toolbox
New Analytics ToolboxNew Analytics Toolbox
New Analytics Toolbox
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 

MapReduce with Hadoop and Ruby

  • 1. Ohai Hadoop! Build your first MapReduce with Hadoop & Ruby
  • 2. Tweet@_swanand GitHub@swanandp StackOverflow@18678 Work@Kaverisoft Make { DispatchTrack } mailto:swanand@pagnis. in Who am I? Ruby, Coffeescript, Java, Rails, Sinatra, Android, TextMate, Emacs, Minitest, MySQL, Cassandra, Hadoop, Mountain Lion, Curl, Zsh, GMail, Solarized, Oscar Wilde, Robert Jordan, Quentin Tarantino, Charlize Theron
  • 3. ● MapReduce! Wait, what? ● Enter the Hadoop. *gong* ● Convention over Configuration? You wish. ● Instant Gratification. Now you're talkin' ● Further Reading. Go forth and read! Tell 'em what you're going to tell 'em
  • 4. MapReduce! Wait, what? ● Map: Given a set of values (or key-values), output another set of values (or key-values) ● [K1, V1] -> map -> [K2, V2] ● Map each value into a new value
  • 5. MapReduce! Wait, what? ● Reduce: Given a set of values for a key, come up with a summarized version ● K1[V1, V2 ... Vn] -> reduce -> K1[Y] ● Reduce given values into 1 value
  • 7. MapReduce! Um.. hmm.. Q: What is the single biggest takeaway from mapping? A: Map operation is stateless i.e. one iteration doesn't depend on previous iteration. Q: What is the single biggest takeaway from reducing? A: Reduce represents an operation for a particular key.
  • 8. Enter the Hadoop. *gong* "The really interesting thing I want you to notice, here, is that as soon as you think of map and reduce as functions that everybody can use, and they use them, you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant." - Joel Spolsky
  • 9. MapReduce! Oh, yeah! 1. Convert raw data into readable format 2. Iterate over data chunks, convert each chunk into meaningful key, value pairs 3. Do this for all your data using massive parallelization 4. Group all the keys and their respective values 5. Take values for a key and convert into desired meaningful format 6. Step 2 is called mapper 7. Step 5 is called reducer
  • 10. Enter the Hadoop. *gong* Same process has now become: 1. Put data into Hadoop 2. Define your mapper 3. Define your reducer 4. Run your jobs 5. Read processed data from Hadoop Other advantages: ● Encapsulations over common problems like large files, process management, disk / node failure
  • 11. Top Level Descriptor job has_many tasks HDFS Boss core-site.xml HDFS Slaves slaves MapReduce Boss mapred-site.xml MapReduce Slave mapred-site.xml User's window into Hadoop, through the command hadoop Convention over Configuration? You wish. Job Task NameNode DataNode JobTracker TaskTracker Client
  • 12. Convention over Configuration? You wish. ● Configuration in XML & Shell scripts. Yuck! ● Respite: ○ Option for specifying a configuration directory ○ Shell script configuration is mostly ENV variables ● Which means: ○ Configuration can be written in YML or JSON or Ruby and exported in XML ○ ENV variables can be set using rake, thor or just plain Ruby ● Caveats: ○ No standard wrapper to do this (Go write one!)
  • 13. Convention over Configuration? You wish. ● Default mappers and reducers are defined in Java ● Other languages supported using Streaming API ● Streaming API makes use of STDIN and STDOUT to read and output data and executable binaries for processing ● Caveats ○ No dependency management, we are on our own
  • 14. Instant Gratification. Now you're talkin' GOAL: 1. Take a couple of books in txt format 2. Find out the total usage of each character in the english alphabet. 3. Establish that e is the most used. 4. Why this example? a. Perfect use case for MapReduce. b. Algorithm is simple. c. Results are simple to analyze. d. Txt formatted books are easily available in Project Gutenberg.
  • 15. ● Official Documentation ● Wiki: http://wiki.apache.org/hadoop/ ● Hadoop examples that ship with Hadoop ● http://www.bigfastblog.com/map-reduce- with-ruby-using-hadoop ● http://www.youtube.com/watch? v=d2xeNpfzsYI Further Reading and Watching