SlideShare a Scribd company logo
1 of 17
Download to read offline
Ohai Hadoop!
Build your first MapReduce with
Hadoop & Ruby
Tweet@_swanand
GitHub@swanandp
StackOverflow@18678
Work@Kaverisoft
Make { DispatchTrack }
mailto:swanand@pagnis.
in
Who am I?
Ruby, Coffeescript,
Java, Rails, Sinatra,
Android, TextMate,
Emacs, Minitest,
MySQL, Cassandra,
Hadoop, Mountain
Lion, Curl, Zsh, GMail,
Solarized, Oscar
Wilde, Robert Jordan,
Quentin Tarantino,
Charlize Theron
● MapReduce! Wait, what?
● Enter the Hadoop. *gong*
● Convention over Configuration? You wish.
● Instant Gratification. Now you're talkin'
● Further Reading. Go forth and read!
Tell 'em what you're going to tell 'em
MapReduce! Wait, what?
● Map: Given a set of values (or key-values),
output another set of values (or key-values)
● [K1, V1] -> map -> [K2, V2]
● Map each value into a new value
MapReduce! Wait, what?
● Reduce: Given a set of values for a key,
come up with a summarized version
● K1[V1, V2 ... Vn] -> reduce -> K1[Y]
● Reduce given values into 1 value
MapReduce! Wait, what?
MapReduce! Um.. hmm..
Q: What is the single biggest takeaway from
mapping?
A: Map operation is stateless i.e. one iteration
doesn't depend on previous iteration.
Q: What is the single biggest takeaway from
reducing?
A: Reduce represents an operation for a
particular key.
Enter the Hadoop. *gong*
"The really interesting thing I want you to notice,
here, is that as soon as you think of map and
reduce as functions that everybody can use, and
they use them, you only have to get one
supergenius to write the hard code to run map and
reduce on a global massively parallel array of
computers, and all the old code that used to work
fine when you just ran a loop still works only it's a
zillion times faster which means it can be used to
tackle huge problems in an instant."
- Joel Spolsky
MapReduce! Oh, yeah!
1. Convert raw data into readable format
2. Iterate over data chunks, convert each
chunk into meaningful key, value pairs
3. Do this for all your data using massive
parallelization
4. Group all the keys and their respective
values
5. Take values for a key and convert into
desired meaningful format
6. Step 2 is called mapper
7. Step 5 is called reducer
Enter the Hadoop. *gong*
Same process has now become:
1. Put data into Hadoop
2. Define your mapper
3. Define your reducer
4. Run your jobs
5. Read processed data from Hadoop
Other advantages:
● Encapsulations over common problems
like large files, process management, disk
/ node failure
Top Level Descriptor
job has_many tasks
HDFS Boss core-site.xml
HDFS Slaves slaves
MapReduce Boss mapred-site.xml
MapReduce Slave mapred-site.xml
User's window into Hadoop, through the
command hadoop
Convention over Configuration? You wish.
Job
Task
NameNode
DataNode
JobTracker
TaskTracker
Client
Convention over Configuration? You wish.
● Configuration in XML & Shell scripts. Yuck!
● Respite:
○ Option for specifying a configuration directory
○ Shell script configuration is mostly ENV variables
● Which means:
○ Configuration can be written in YML or JSON or
Ruby and exported in XML
○ ENV variables can be set using rake, thor or just
plain Ruby
● Caveats:
○ No standard wrapper to do this (Go write one!)
Convention over Configuration? You wish.
● Default mappers and reducers are defined in
Java
● Other languages supported using Streaming
API
● Streaming API makes use of STDIN and
STDOUT to read and output data and
executable binaries for processing
● Caveats
○ No dependency management, we are on our own
Instant Gratification. Now you're talkin'
GOAL:
1. Take a couple of books in txt format
2. Find out the total usage of each character in
the english alphabet.
3. Establish that e is the most used.
4. Why this example?
a. Perfect use case for MapReduce.
b. Algorithm is simple.
c. Results are simple to analyze.
d. Txt formatted books are easily available in Project
Gutenberg.
● Official Documentation
● Wiki: http://wiki.apache.org/hadoop/
● Hadoop examples that ship with Hadoop
● http://www.bigfastblog.com/map-reduce-
with-ruby-using-hadoop
● http://www.youtube.com/watch?
v=d2xeNpfzsYI
Further Reading and Watching
Questions?
Thank you!

More Related Content

What's hot

PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018Quentin Adam
 
SnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for AndroidSnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for Androidnhachicha
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistJeremy Zawodny
 
AGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServerAGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServerNg'eno Victor
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2John Ashmead
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_enOgibayashi
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
Universal Serverless with AWS Fargate
Universal Serverless with AWS FargateUniversal Serverless with AWS Fargate
Universal Serverless with AWS FargateEka Cahya Pratama
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsCesar Cardenas Desales
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short versionAlex Pinkin
 
Monitoring with ElasticSearch
Monitoring with ElasticSearch Monitoring with ElasticSearch
Monitoring with ElasticSearch Kris Buytaert
 
Event Pipe - Lambda Architecture
Event Pipe - Lambda ArchitectureEvent Pipe - Lambda Architecture
Event Pipe - Lambda ArchitectureBahadir Cambel
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей РыбакOntico
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo dbLawrence Mwai
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkRDataWorks Summit
 
Bringing spatial love to your python application
Bringing spatial love to your python applicationBringing spatial love to your python application
Bringing spatial love to your python applicationShekhar Gulati
 
Scylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces WasmScylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces WasmScyllaDB
 

What's hot (20)

PostgreSQL is the new NoSQL - at Devoxx 2018
PostgreSQL is the new NoSQL  - at Devoxx 2018PostgreSQL is the new NoSQL  - at Devoxx 2018
PostgreSQL is the new NoSQL - at Devoxx 2018
 
SnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for AndroidSnappyDB - NoSQL database for Android
SnappyDB - NoSQL database for Android
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
Fusion-io and MySQL at Craigslist
Fusion-io and MySQL at CraigslistFusion-io and MySQL at Craigslist
Fusion-io and MySQL at Craigslist
 
AGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServerAGES Presentation on Web, Python, Django and GeoServer
AGES Presentation on Web, Python, Django and GeoServer
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Universal Serverless with AWS Fargate
Universal Serverless with AWS FargateUniversal Serverless with AWS Fargate
Universal Serverless with AWS Fargate
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 
SOLR Power FTW: short version
SOLR Power FTW: short versionSOLR Power FTW: short version
SOLR Power FTW: short version
 
Pre fosdem2020 uber
Pre fosdem2020 uberPre fosdem2020 uber
Pre fosdem2020 uber
 
Monitoring with ElasticSearch
Monitoring with ElasticSearch Monitoring with ElasticSearch
Monitoring with ElasticSearch
 
Event Pipe - Lambda Architecture
Event Pipe - Lambda ArchitectureEvent Pipe - Lambda Architecture
Event Pipe - Lambda Architecture
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo db
 
Scalable Data Science with SparkR
Scalable Data Science with SparkRScalable Data Science with SparkR
Scalable Data Science with SparkR
 
Bringing spatial love to your python application
Bringing spatial love to your python applicationBringing spatial love to your python application
Bringing spatial love to your python application
 
Scylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces WasmScylla Summit 2022: ScyllaDB Embraces Wasm
Scylla Summit 2022: ScyllaDB Embraces Wasm
 

Viewers also liked

Jarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraJarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraKirolPsikologia
 
Intro to Open Cloud Initiative
Intro to Open Cloud InitiativeIntro to Open Cloud Initiative
Intro to Open Cloud InitiativeJohn Mark Walker
 
Lasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraLasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraKirolPsikologia
 
Building Vibrant Open Source Communities
Building Vibrant Open Source CommunitiesBuilding Vibrant Open Source Communities
Building Vibrant Open Source CommunitiesJohn Mark Walker
 
Kitasato Flask 97
Kitasato Flask 97Kitasato Flask 97
Kitasato Flask 97Daimy
 
Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...John Mark Walker
 
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?John Mark Walker
 
Australia Presentation
Australia PresentationAustralia Presentation
Australia PresentationDaimy
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgJohn Mark Walker
 

Viewers also liked (10)

Jarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaieraJarrera eta Ikaskuntza 2011 3. gaia amaiera
Jarrera eta Ikaskuntza 2011 3. gaia amaiera
 
Intro to Open Cloud Initiative
Intro to Open Cloud InitiativeIntro to Open Cloud Initiative
Intro to Open Cloud Initiative
 
Lasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerraLasterketa plana prestatzeko tailerra
Lasterketa plana prestatzeko tailerra
 
Building Vibrant Open Source Communities
Building Vibrant Open Source CommunitiesBuilding Vibrant Open Source Communities
Building Vibrant Open Source Communities
 
Kitasato Flask 97
Kitasato Flask 97Kitasato Flask 97
Kitasato Flask 97
 
GlusterFS Community Preso
GlusterFS Community PresoGlusterFS Community Preso
GlusterFS Community Preso
 
Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...Open Source and Cloud - The Two Great Tastes...
Open Source and Cloud - The Two Great Tastes...
 
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
FOSS vs. Web Services Lightning Talk: Is FOSS Necessary?
 
Australia Presentation
Australia PresentationAustralia Presentation
Australia Presentation
 
The Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.orgThe Future of GlusterFS and Gluster.org
The Future of GlusterFS and Gluster.org
 

Similar to MapReduce with Hadoop and Ruby

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataWSO2
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsHolden Karau
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache SparkNaukri.com
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceChandra Sekhar
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...DataStax Academy
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMHolden Karau
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data Omid Vahdaty
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Holden Karau
 

Similar to MapReduce with Hadoop and Ruby (20)

New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of data
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Big data beyond the JVM - DDTX 2018
Big data beyond the JVM -  DDTX 2018Big data beyond the JVM -  DDTX 2018
Big data beyond the JVM - DDTX 2018
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 
Hadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective AudienceHadoop And Big Data - My Presentation To Selective Audience
Hadoop And Big Data - My Presentation To Selective Audience
 
New Analytics Toolbox
New Analytics ToolboxNew Analytics Toolbox
New Analytics Toolbox
 
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?Are general purpose big data systems eating the world?
Are general purpose big data systems eating the world?
 

Recently uploaded

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxFIDO Alliance
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdfMuhammad Subhan
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Paige Cruz
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligencePrecisely
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 

Recently uploaded (20)

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 

MapReduce with Hadoop and Ruby

  • 1. Ohai Hadoop! Build your first MapReduce with Hadoop & Ruby
  • 2. Tweet@_swanand GitHub@swanandp StackOverflow@18678 Work@Kaverisoft Make { DispatchTrack } mailto:swanand@pagnis. in Who am I? Ruby, Coffeescript, Java, Rails, Sinatra, Android, TextMate, Emacs, Minitest, MySQL, Cassandra, Hadoop, Mountain Lion, Curl, Zsh, GMail, Solarized, Oscar Wilde, Robert Jordan, Quentin Tarantino, Charlize Theron
  • 3. ● MapReduce! Wait, what? ● Enter the Hadoop. *gong* ● Convention over Configuration? You wish. ● Instant Gratification. Now you're talkin' ● Further Reading. Go forth and read! Tell 'em what you're going to tell 'em
  • 4. MapReduce! Wait, what? ● Map: Given a set of values (or key-values), output another set of values (or key-values) ● [K1, V1] -> map -> [K2, V2] ● Map each value into a new value
  • 5. MapReduce! Wait, what? ● Reduce: Given a set of values for a key, come up with a summarized version ● K1[V1, V2 ... Vn] -> reduce -> K1[Y] ● Reduce given values into 1 value
  • 7. MapReduce! Um.. hmm.. Q: What is the single biggest takeaway from mapping? A: Map operation is stateless i.e. one iteration doesn't depend on previous iteration. Q: What is the single biggest takeaway from reducing? A: Reduce represents an operation for a particular key.
  • 8. Enter the Hadoop. *gong* "The really interesting thing I want you to notice, here, is that as soon as you think of map and reduce as functions that everybody can use, and they use them, you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant." - Joel Spolsky
  • 9. MapReduce! Oh, yeah! 1. Convert raw data into readable format 2. Iterate over data chunks, convert each chunk into meaningful key, value pairs 3. Do this for all your data using massive parallelization 4. Group all the keys and their respective values 5. Take values for a key and convert into desired meaningful format 6. Step 2 is called mapper 7. Step 5 is called reducer
  • 10. Enter the Hadoop. *gong* Same process has now become: 1. Put data into Hadoop 2. Define your mapper 3. Define your reducer 4. Run your jobs 5. Read processed data from Hadoop Other advantages: ● Encapsulations over common problems like large files, process management, disk / node failure
  • 11. Top Level Descriptor job has_many tasks HDFS Boss core-site.xml HDFS Slaves slaves MapReduce Boss mapred-site.xml MapReduce Slave mapred-site.xml User's window into Hadoop, through the command hadoop Convention over Configuration? You wish. Job Task NameNode DataNode JobTracker TaskTracker Client
  • 12. Convention over Configuration? You wish. ● Configuration in XML & Shell scripts. Yuck! ● Respite: ○ Option for specifying a configuration directory ○ Shell script configuration is mostly ENV variables ● Which means: ○ Configuration can be written in YML or JSON or Ruby and exported in XML ○ ENV variables can be set using rake, thor or just plain Ruby ● Caveats: ○ No standard wrapper to do this (Go write one!)
  • 13. Convention over Configuration? You wish. ● Default mappers and reducers are defined in Java ● Other languages supported using Streaming API ● Streaming API makes use of STDIN and STDOUT to read and output data and executable binaries for processing ● Caveats ○ No dependency management, we are on our own
  • 14. Instant Gratification. Now you're talkin' GOAL: 1. Take a couple of books in txt format 2. Find out the total usage of each character in the english alphabet. 3. Establish that e is the most used. 4. Why this example? a. Perfect use case for MapReduce. b. Algorithm is simple. c. Results are simple to analyze. d. Txt formatted books are easily available in Project Gutenberg.
  • 15. ● Official Documentation ● Wiki: http://wiki.apache.org/hadoop/ ● Hadoop examples that ship with Hadoop ● http://www.bigfastblog.com/map-reduce- with-ruby-using-hadoop ● http://www.youtube.com/watch? v=d2xeNpfzsYI Further Reading and Watching