SlideShare a Scribd company logo
1 of 81
Streaming API,
Spark & Ruby
AGENDA
BIG DATA
Ruby in BIG DATA
Why we should know?
Insights from SAS
Data
Information
Knowledge
Data Science Field
Computer
Science
Maths &
Statitics
Subject
Matter
Expertise
Where it all began?
PROBLEM
How to store ?
How to process ?
Solution
HADOOP
HADOOP Ecosystem
STORAGE & PROCESS
• Distributed Storage: HDFS A distributed file
system where commodity hardware can be
used to form clusters and store the huge data
in distributed fashion.
• Distributed Processing: MapReduce Paradigm
It can easily scale to multiple nodes(1,500–2,000
nodes in a cluster), with just configuration
change.
Case Study
Daywise Analysis of
Rubyconfindia Tweets
Twitter API
Sample Data
--- !ruby/object:Twitter::Tweet
attrs:
:created_at: Tue Mar 08 11:00:57 +0000 2016
:id: 707159160945811457
:id_str: '707159160945811457'
:text: 'Once in a life time to meet Matz at the
awesome #kochi https://t.co/6oCIagsHCg
#ruby #india https://t.co/YRlpABApkP'
Tweet Count
CLOUDERA QUICKSTART
STORAGE LAYER
HDFS
• It is a distributed file system
• Streaming Data Access: Write once, read many
times
• Able to run on commodity Hardware
• Fault tolerance
• Replication: 3 nodes by default, configurable
• Block based: 64-256MB, configurable
Replication Factor 2
Name Node: Stores Meta Data
Meta Data:
/data/pristine/catalina.log.> 1, 2, 4
/data/pristine/myfile. >3,5
Data Node 1 Data Node 2 Data Node 3
1 2 45 5 2 3 4 1 3
HDFS Command Line
Copying File to HDFS
HDFS NameNode UI
PROCESS LAYER
YARN / MapReduce 2.0
• YARN: A framework for job
scheduling and cluster resource
management.
• MapReduce: Distributed
processing paradigm
Map Function
Input:
(input_key, value)
Output:
bunch of
(intermediate_key, value)
System applies the map
function in parallel to all
inputs
Reduce Function
Input:
(intermediate_key, value)
Output:
bunch of (values)
System will group all pairs
with the same
intermediate key and
apply the reduce function
OUTPUT RESULT
SHUFFLE
STAGE
FILE CHUNKS
Leveraging Ruby
Mapper.rb
#!/usr/bin/env ruby
STDIN.read.split("---
!ruby/object:Twitter::Tweet").each do |t|
date = t.match(/:created_at: .{30}/).to_s.split
puts "#{date[1]}" if date[1]
end
Reducer.rb
#!/usr/bin/env ruby
STDIN.readlines.group_by{|i| i.strip}
.map{|i,j| "#{i} #{j.count}" }
.each{|i| puts i}
HADOOP STREAMING API
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-
streaming.jar –input /user/rubyconf/tweets.txt –output
/user/rubyconf/daywise –mapper mapper.rb –reducer
reducer.rb –file mapper.rb –file reducer.rb
YARN Resource Manager
Job Details
YARN Resource Manager
DAYWISE TWEETS
TREND
SPARK
SPARK Benefits
• SPEED
• Ease of Use
• Runs Everywhere
• Source: http://spark.apache.org/
• Source: http://spark.apache.org/
gem ruby-spark
• gem install ruby-spark
• ruby-spark build
• ruby-spark shell
Ruby with SPARK
require 'ruby-spark'
# Configuration
Spark.config do
set_app_name "RubySpark"
set 'spark.ruby.serializer', 'oj'
set 'spark.ruby.serializer.batch_size', 100
end
# Start Apache Spark
Spark.start
# Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt”)
# Collect all created days from dates
days = rdd.map(lambda {|t| date = t.match(/:created_at:
.{30}/).to_s.split; date[1] if date[1]})
# Creating key value pair
pairrdd = days.map(lambda { |x| [x,1] })
# Final output by using reducer
daywise = pairrdd.reduce_by_key( lambda{|x,y|
x+y}).collect_as_hash
Expertise & Learnings
Remember
We can use Ruby with
HADOOP Streaming API
& SPARK
SPARK is more
generalized distributed
computing model
Thank You
@_manoharaa

More Related Content

What's hot

Big Data - Linked In_DEEPU
Big Data - Linked In_DEEPUBig Data - Linked In_DEEPU
Big Data - Linked In_DEEPU
Deepu M
 
Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and Hive
DataWorks Summit
 
Kissy component system
Kissy component systemKissy component system
Kissy component system
yiming he
 
Hadoopを業務で使ってみた
Hadoopを業務で使ってみたHadoopを業務で使ってみた
Hadoopを業務で使ってみた
Tatsuya Sasaki
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
Jacky Chu
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
elliando dias
 
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
BMonica1
 

What's hot (20)

Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)Spark Gotchas and Lessons Learned (2/20/20)
Spark Gotchas and Lessons Learned (2/20/20)
 
Big Data - Linked In_DEEPU
Big Data - Linked In_DEEPUBig Data - Linked In_DEEPU
Big Data - Linked In_DEEPU
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
A Hadoop Primer
A Hadoop PrimerA Hadoop Primer
A Hadoop Primer
 
Cloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and HiveCloud Friendly Hadoop and Hive
Cloud Friendly Hadoop and Hive
 
Open Source Databases And Gis
Open Source Databases And GisOpen Source Databases And Gis
Open Source Databases And Gis
 
マーケティングのためのHadoop利用
マーケティングのためのHadoop利用マーケティングのためのHadoop利用
マーケティングのためのHadoop利用
 
Java/Scala Lab 2016. Александр Конопко: Машинное обучение в Spark.
Java/Scala Lab 2016. Александр Конопко: Машинное обучение в Spark.Java/Scala Lab 2016. Александр Конопко: Машинное обучение в Spark.
Java/Scala Lab 2016. Александр Конопко: Машинное обучение в Spark.
 
Kissy component system
Kissy component systemKissy component system
Kissy component system
 
Hadoopを業務で使ってみた
Hadoopを業務で使ってみたHadoopを業務で使ってみた
Hadoopを業務で使ってみた
 
Spatial Data processing with Hadoop
Spatial Data processing with HadoopSpatial Data processing with Hadoop
Spatial Data processing with Hadoop
 
Apache pig
Apache pigApache pig
Apache pig
 
Hadoop
HadoopHadoop
Hadoop
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
 
B.MONICA II M.SC COMPUTER SCIENCE
B.MONICA II M.SC COMPUTER SCIENCEB.MONICA II M.SC COMPUTER SCIENCE
B.MONICA II M.SC COMPUTER SCIENCE
 
Raster package jacob
Raster package jacobRaster package jacob
Raster package jacob
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Apache pig
Apache pigApache pig
Apache pig
 
Apache PIG introduction
Apache PIG introductionApache PIG introduction
Apache PIG introduction
 
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
Hadoop foundation for analytics,B Monica II M.sc computer science ,BON SECOUR...
 

Viewers also liked

Viewers also liked (6)

40 Days_Upward Bound
40 Days_Upward Bound40 Days_Upward Bound
40 Days_Upward Bound
 
Women in IT - Agile Tour 2015
Women in IT - Agile Tour 2015Women in IT - Agile Tour 2015
Women in IT - Agile Tour 2015
 
Why are you struggling with agile in india mumbai
Why are you struggling with agile in india mumbaiWhy are you struggling with agile in india mumbai
Why are you struggling with agile in india mumbai
 
Scrumban Lightning talk
Scrumban Lightning talkScrumban Lightning talk
Scrumban Lightning talk
 
Meetup Spark 2.0
Meetup Spark 2.0Meetup Spark 2.0
Meetup Spark 2.0
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 

Similar to Streaming API, Spark and Ruby

AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
Mark Kromer
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
Supriya .
 

Similar to Streaming API, Spark and Ruby (20)

Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Spark!
Spark!Spark!
Spark!
 
Unit 2
Unit 2Unit 2
Unit 2
 

Recently uploaded

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 

Recently uploaded (20)

Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 

Streaming API, Spark and Ruby