26. • Distributed Storage: HDFS
A distributed file system in which clusters built from commodity hardware store huge volumes of data in a distributed fashion.
• Distributed Processing: the MapReduce paradigm
It scales easily to many nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
32. --- !ruby/object:Twitter::Tweet
attrs:
:created_at: Tue Mar 08 11:00:57 +0000 2016
:id: 707159160945811457
:id_str: '707159160945811457'
:text: 'Once in a life time to meet Matz at the
awesome #kochi https://t.co/6oCIagsHCg
#ruby #india https://t.co/YRlpABApkP'
38. • It is a distributed file system
• Streaming data access: write once, read many times
• Able to run on commodity hardware
• Fault tolerant
• Replication: 3 replicas by default, configurable
• Block based: 64–256 MB blocks, configurable
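As a rough sketch of what block-based storage plus replication means for capacity (plain Ruby, not part of HDFS; 128 MB blocks and a replication factor of 3 are common defaults, both configurable):

```ruby
# Sketch: how HDFS-style block splitting and replication multiply storage.
BLOCK_SIZE_MB = 128
REPLICATION   = 3

file_size_mb = 1024                                  # a 1 GB file
blocks       = (file_size_mb.to_f / BLOCK_SIZE_MB).ceil
raw_storage  = blocks * BLOCK_SIZE_MB * REPLICATION  # upper bound; the last block may be partial

puts "#{blocks} blocks, up to #{raw_storage} MB of raw cluster storage"
```

So a 1 GB file becomes 8 blocks, and with 3 replicas each it occupies up to 3 GB of raw cluster storage.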
40. Name Node: stores metadata
Metadata (file → block IDs):
/data/pristine/catalina.log → 1, 2, 4
/data/pristine/myfile → 3, 5
[Diagram: the blocks themselves are spread, with replicas, across Data Node 1, Data Node 2, and Data Node 3.]
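The NameNode's bookkeeping can be imagined as two lookups, file → block IDs and block ID → Data Nodes. This is a toy model in plain Ruby, not the real on-disk format, and the node placements below are hypothetical; only the file-to-block mapping comes from the slide:

```ruby
# Toy model of NameNode metadata: which blocks make up each file,
# and which Data Nodes hold a replica of each block (placements invented).
file_to_blocks = {
  "/data/pristine/catalina.log" => [1, 2, 4],
  "/data/pristine/myfile"       => [3, 5]
}

block_to_nodes = {
  1 => ["dn1", "dn3"], 2 => ["dn1", "dn2"], 3 => ["dn2", "dn3"],
  4 => ["dn1", "dn3"], 5 => ["dn1", "dn2"]
}

# To read a file, a client asks the NameNode for its block list,
# then streams each block from one of the Data Nodes holding it.
nodes_for_file = file_to_blocks["/data/pristine/myfile"]
                   .flat_map { |b| block_to_nodes[b] }.uniq
```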
49. • YARN: A framework for job
scheduling and cluster resource
management.
• MapReduce: Distributed
processing paradigm
50. Map function
Input: (input_key, value)
Output: a bunch of (intermediate_key, value) pairs
The system applies the map function in parallel to all inputs.

Reduce function
Input: (intermediate_key, value)
Output: a bunch of (values)
The system groups all pairs with the same intermediate key and applies the reduce function.

[Diagram: FILE CHUNKS → map → SHUFFLE STAGE → reduce → OUTPUT RESULT]
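The map → shuffle → reduce flow above can be sketched in plain Ruby with the classic word count. No cluster is involved; this just mirrors the three stages in sequence:

```ruby
chunks = ["ruby spark ruby", "spark hadoop"]  # stand-ins for file chunks

# MAP: each input chunk emits (intermediate_key, value) pairs
mapped = chunks.flat_map { |chunk| chunk.split.map { |word| [word, 1] } }

# SHUFFLE: group all pairs that share the same intermediate key
shuffled = mapped.group_by { |word, _| word }

# REDUCE: fold each group's values into a single result
counts = shuffled.map { |word, pairs| [word, pairs.sum { |_, n| n }] }.to_h
# counts => {"ruby"=>2, "spark"=>2, "hadoop"=>1}
```

On a real cluster the map calls run in parallel across chunks, and the shuffle moves pairs over the network so that all pairs with one key land on one reducer.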
75. # Context reference
sc = Spark.sc
rdd = sc.text_file("hdfs://user/rubyconf/tweets.txt")

# Collect the created day from each tweet's date
days = rdd.map(lambda { |t|
  date = t.match(/:created_at: .{30}/).to_s.split
  date[1] if date[1]
})

# Create key-value pairs
pairrdd = days.map(lambda { |x| [x, 1] })

# Final output by using a reducer
daywise = pairrdd.reduce_by_key(lambda { |x, y| x + y }).collect_as_hash
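The date-extraction step inside the map lambda can be exercised locally in plain Ruby against a line shaped like the YAML tweet dump shown earlier, with no Spark needed:

```ruby
# A line shaped like the :created_at: field of the tweet dump
line = ":created_at: Tue Mar 08 11:00:57 +0000 2016"

# Same logic as the map lambda: grab ":created_at: " plus the next
# 30 characters (the full timestamp), then split on whitespace
date = line.match(/:created_at: .{30}/).to_s.split
day  = date[1] if date[1]
# day => "Tue"
```

Note that `date[1]` is the weekday abbreviation, so the `reduce_by_key` step ends up counting tweets per weekday rather than per calendar date.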