26. • Distributed Storage: HDFS
A distributed file system in which clusters built from commodity hardware store huge volumes of data in a distributed fashion.
• Distributed Processing: the MapReduce paradigm
It scales easily to many nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
32. --- !ruby/object:Twitter::Tweet
attrs:
:created_at: Tue Mar 08 11:00:57 +0000 2016
:id: 707159160945811457
:id_str: '707159160945811457'
:text: 'Once in a life time to meet Matz at the
awesome #kochi https://t.co/6oCIagsHCg
#ruby #india https://t.co/YRlpABApkP'
38. • It is a distributed file system
• Streaming data access: write once, read many times
• Able to run on commodity hardware
• Fault tolerant
• Replication: 3 replicas by default, configurable
• Block based: 64–256 MB blocks, configurable
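As a rough sketch of what block-based storage plus replication means for capacity (plain Ruby, not part of HDFS; 128 MB blocks and a replication factor of 3 are common defaults, both configurable):

```ruby
# Sketch: how HDFS-style block splitting and replication multiply storage.
BLOCK_SIZE_MB = 128
REPLICATION   = 3

file_size_mb = 1024                                  # a 1 GB file
blocks       = (file_size_mb.to_f / BLOCK_SIZE_MB).ceil
raw_storage  = blocks * BLOCK_SIZE_MB * REPLICATION  # upper bound; the last block may be partial

puts "#{blocks} blocks, up to #{raw_storage} MB of raw cluster storage"
```

So a 1 GB file becomes 8 blocks, and with 3 replicas each it occupies up to 3 GB of raw cluster storage.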
40. Name Node: stores metadata
Metadata (file → block IDs):
/data/pristine/catalina.log → 1, 2, 4
/data/pristine/myfile → 3, 5
[Diagram: the blocks themselves are spread, with replicas, across Data Node 1, Data Node 2, and Data Node 3.]
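The NameNode's bookkeeping can be imagined as two lookups, file → block IDs and block ID → Data Nodes. This is a toy model in plain Ruby, not the real on-disk format, and the node placements below are hypothetical; only the file-to-block mapping comes from the slide:

```ruby
# Toy model of NameNode metadata: which blocks make up each file,
# and which Data Nodes hold a replica of each block (placements invented).
file_to_blocks = {
  "/data/pristine/catalina.log" => [1, 2, 4],
  "/data/pristine/myfile"       => [3, 5]
}

block_to_nodes = {
  1 => ["dn1", "dn3"], 2 => ["dn1", "dn2"], 3 => ["dn2", "dn3"],
  4 => ["dn1", "dn3"], 5 => ["dn1", "dn2"]
}

# To read a file, a client asks the NameNode for its block list,
# then streams each block from one of the Data Nodes holding it.
nodes_for_file = file_to_blocks["/data/pristine/myfile"]
                   .flat_map { |b| block_to_nodes[b] }.uniq
```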
49. • YARN: A framework for job
scheduling and cluster resource
management.
• MapReduce: Distributed
processing paradigm
50. Map function
Input: (input_key, value)
Output: a bunch of (intermediate_key, value) pairs
The system applies the map function in parallel to all inputs.

Reduce function
Input: (intermediate_key, value)
Output: a bunch of (values)
The system groups all pairs with the same intermediate key and applies the reduce function.

[Diagram: FILE CHUNKS → map → SHUFFLE STAGE → reduce → OUTPUT RESULT]
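The map → shuffle → reduce flow above can be sketched in plain Ruby with the classic word count. No cluster is involved; this just mirrors the three stages in sequence:

```ruby
chunks = ["ruby spark ruby", "spark hadoop"]  # stand-ins for file chunks

# MAP: each input chunk emits (intermediate_key, value) pairs
mapped = chunks.flat_map { |chunk| chunk.split.map { |word| [word, 1] } }

# SHUFFLE: group all pairs that share the same intermediate key
shuffled = mapped.group_by { |word, _| word }

# REDUCE: fold each group's values into a single result
counts = shuffled.map { |word, pairs| [word, pairs.sum { |_, n| n }] }.to_h
# counts => {"ruby"=>2, "spark"=>2, "hadoop"=>1}
```

On a real cluster the map calls run in parallel across chunks, and the shuffle moves pairs over the network so that all pairs with one key land on one reducer.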
75. # Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt")

# Collect the created day from each tweet's date
days = rdd.map(lambda { |t|
  date = t.match(/:created_at: .{30}/).to_s.split
  date[1] if date[1]
})

# Create key-value pairs
pairrdd = days.map(lambda { |x| [x, 1] })

# Final output by using a reducer
daywise = pairrdd.reduce_by_key(lambda { |x, y| x + y }).collect_as_hash
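The date-extraction step inside the map lambda can be exercised locally in plain Ruby against a line shaped like the YAML tweet dump shown earlier, with no Spark needed:

```ruby
# A line shaped like the :created_at: field of the tweet dump
line = ":created_at: Tue Mar 08 11:00:57 +0000 2016"

# Same logic as the map lambda: grab ":created_at: " plus the next
# 30 characters (the full timestamp), then split on whitespace
date = line.match(/:created_at: .{30}/).to_s.split
day  = date[1] if date[1]
# day => "Tue"
```

Note that `date[1]` is the weekday abbreviation, so the `reduce_by_key` step ends up counting tweets per weekday rather than per calendar date.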