UT DALLAS Erik Jonsson School of Engineering & Computer Science
FEARLESS engineering
Cloud Tools Overview
Hadoop
Outline
• Hadoop - Basics
• HDFS
– Goals
– Architecture
– Other functions
• MapReduce
– Basics
– Word Count Example
– Handy tools
– Finding shortest path example
• Related Apache sub-projects (Pig, HBase, Hive)
Hadoop - Why?
• Need to process huge datasets on large
clusters of computers
• Very expensive to build reliability into each
application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need a common infrastructure
– Efficient, reliable, easy to use
– Open Source, Apache License
Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• … and many more
Commodity Hardware
• Typically a two-level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
[Figure: rack switches connected to an aggregation switch]
Hadoop Distributed File System
(HDFS)
Original Slides by
Dhruba Borthakur
Apache Hadoop Project Management Committee
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detect failures and recover from them
• Optimized for Batch Processing
– Data locations exposed so that computations can
move to where data resides
– Provides very high aggregate bandwidth
Distributed File System
• Single Namespace for entire cluster
• Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 64MB block size
– Each block replicated on multiple DataNodes
• Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions etc
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to
the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
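The placement strategy above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not HDFS code: the `Node` class, the `chooseTargets` and `pick` methods, and the seeded `Random` are all hypothetical names introduced for the example, and only the first three replicas are placed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; class and method names are hypothetical, not HDFS APIs. */
final class BlockPlacementSketch {

    /** A DataNode identified by its name and the rack it sits in. */
    static final class Node {
        final String name;
        final String rack;

        Node(String name, String rack) {
            this.name = name;
            this.rack = rack;
        }
    }

    /** Picks up to three targets: the writer's node, a node on another rack, a second node on that rack. */
    static List<Node> chooseTargets(Node writer, List<Node> cluster, Random rng) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                             // replica 1: local node
        Node second = pick(cluster, rng, targets, writer.rack, false);   // replica 2: a remote rack
        if (second != null) {
            targets.add(second);
            Node third = pick(cluster, rng, targets, second.rack, true); // replica 3: same remote rack
            if (third != null) {
                targets.add(third);
            }
        }
        return targets;
    }

    /** Random unused node whose rack equals (sameRack) or differs from (!sameRack) the given rack. */
    private static Node pick(List<Node> cluster, Random rng, List<Node> used,
                             String rack, boolean sameRack) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (!used.contains(n) && n.rack.equals(rack) == sameRack) {
                candidates.add(n);
            }
        }
        return candidates.isEmpty() ? null : candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Node local = new Node("dn1", "rackA");
        List<Node> cluster = Arrays.asList(local, new Node("dn2", "rackA"),
                new Node("dn3", "rackB"), new Node("dn4", "rackB"));
        for (Node n : chooseTargets(local, cluster, new Random(42))) {
            System.out.println(n.name + " on " + n.rack);
        }
    }
}
```

Putting two of the three replicas on one remote rack means a block survives the loss of a whole rack while the write pipeline crosses the rack uplink only once.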
Heartbeats
• DataNodes send heartbeats to the NameNode
– Once every 3 seconds
• NameNode uses heartbeats to detect
DataNode failure
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
Data Correctness
• Use Checksums to validate data
– Use CRC32
• File Creation
– Client computes checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from
DataNode
– If Validation fails, Client tries other replicas
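The checksum scheme above can be illustrated with the JDK's own `java.util.zip.CRC32`. This is a sketch under stated assumptions rather than the HDFS implementation: the class `ChunkChecksums` and its methods are hypothetical names, and the "block" is just an in-memory byte array.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

/** Illustrative sketch only: one CRC32 per 512-byte chunk, as described on the slide. */
final class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // chunk size quoted on the slide

    /** Returns one CRC32 value per 512-byte chunk (the last chunk may be shorter). */
    static long[] compute(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum differs from the stored one, or -1. */
    static int firstCorruptChunk(byte[] data, long[] stored) {
        long[] actual = compute(data);
        if (actual.length != stored.length) {
            throw new IllegalArgumentException("checksum count does not match data length");
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != stored[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];            // 3 chunks: 512 + 512 + 276 bytes
        Arrays.fill(block, (byte) 'x');
        long[] stored = compute(block);           // what the writer would hand to the DataNode
        block[600] ^= 1;                          // simulate a flipped bit in the second chunk
        System.out.println("chunks: " + stored.length);
        System.out.println("first corrupt chunk: " + firstCorruptChunk(block, stored));
    }
}
```

A reader that sees a non-negative index here would, as the slide says, discard this copy and try another replica.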
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Data Pipelining
• Client retrieves a list of DataNodes on which
to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the
next node in the Pipeline
• When all replicas are written, the Client
moves on to write the next block in file
Rebalancer
• Goal: % disk full on DataNodes should be
similar
– Usually run when new DataNodes are added
– Cluster is online when Rebalancer is active
– Rebalancer is throttled to avoid network
congestion
– Command line tool
Secondary NameNode
• Copies FsImage and Transaction Log from
NameNode to a temporary directory
• Merges FsImage and Transaction Log into a
new FsImage in the temporary directory
• Uploads new FsImage to the NameNode
– Transaction Log on NameNode is purged
User Interface
• Commands for HDFS User:
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
• Web Interface
– http://host:port/dfshealth.jsp
MapReduce
Original Slides by
Owen O’Malley (Yahoo!)
&
Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet
MapReduce - What?
• MapReduce is a programming model for
efficient distributed computing
• It works like a Unix pipeline
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
MapReduce - Dataflow
MapReduce - Features
• Fine grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when
possible
Word Count Example
• Mapper
– Input: key: byte offset of the line (unused), value: one line of input text
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
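Before looking at the Hadoop classes, the dataflow above can be imitated in a single process. The sketch below uses no Hadoop types: `LocalWordCount`, `map`, `reduce` and `run` are hypothetical names, and a sorted in-memory map stands in for the shuffle and sort phase.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process imitation of the word count dataflow; no Hadoop classes are used. */
final class LocalWordCount {

    /** Map phase: emit (word, 1) for every token of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Reduce phase: sum all counts collected for one word. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Runs map, then groups values by key (shuffle and sort), then reduce. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the quick brown fox", "the lazy dog", "the fox")));
    }
}
```

On a cluster the same three phases run on different machines, with the framework moving the grouped values between them.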
Word Count Dataflow
Word Count Mapper
// Nested in WordCount; needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase implements
    Mapper<LongWritable,Text,Text,IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Emits (word, 1) for every token in the line; the key (byte offset) is ignored.
  public void map(LongWritable key, Text value,
      OutputCollector<Text,IntWritable> output, Reporter reporter) throws
      IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer
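// Nested in WordCount; needs java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
public static class Reduce extends MapReduceBase implements
    Reducer<Text,IntWritable,Text,IntWritable> {
  // Sums all counts collected for one word and emits (word, sum).
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text,IntWritable> output, Reporter reporter) throws
      IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}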
Word Count Example
• Jobs are controlled by configuring JobConfs
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is
executed
– conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
– conf.set("my.string", "foo");
– conf.setInt("my.integer", 12);
• JobConf is available to all tasks
Putting it all together
• Create a launching program for your application
• The launching program configures:
– The Mapper and Reducer to use
– The output key and value types (input types are
inferred from the InputFormat)
– The locations for your input and output
• The launching program then submits the job and
typically waits for it to complete
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read
by specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be
written by specifying an OutputFormat to be used
• These default to TextInputFormat and
TextOutputFormat, which process line-based text data
• Another common choice is SequenceFileInputFormat
and SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
How many Maps and Reduces
• Maps
– Usually as many as the number of HDFS blocks being
processed; this is the default
– Otherwise, the number of maps can be specified as a hint
– The number of maps can also be controlled by specifying the
minimum split size
– The actual sizes of the map inputs are computed by:
• max(min(block_size, data/#maps), min_split_size)
• Reduces
– Unless the amount of data being processed is small, use about
• 0.95*num_nodes*mapred.tasktracker.tasks.maximum
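The two formulas above are easy to check with a few lines of arithmetic. The sketch below is illustrative only: `TaskCountHints` and its method names are hypothetical, sizes are in bytes, and the reduce rule is evaluated in integer arithmetic (95/100) so the result is deterministic.

```java
/** Illustrative arithmetic only; the class and method names are hypothetical. */
final class TaskCountHints {

    /** Split size per the slide: max(min(block_size, data/#maps), min_split_size). */
    static long splitSize(long blockSize, long totalBytes, int requestedMaps, long minSplitSize) {
        long goalSize = totalBytes / Math.max(1, requestedMaps);
        return Math.max(Math.min(blockSize, goalSize), minSplitSize);
    }

    /** Number of map tasks implied by a split size (ceiling division). */
    static long mapCount(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize;
    }

    /** Reduce count rule of thumb from the slide: 0.95 * nodes * task slots per node, rounded down. */
    static int suggestedReduces(int numNodes, int slotsPerNode) {
        return (95 * numNodes * slotsPerNode) / 100;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 10_240 * mb;                               // 10 GB of input
        long split = splitSize(64 * mb, total, 40, 1);          // hint of 40 maps, 64 MB blocks
        System.out.println("split size (MB): " + split / mb);
        System.out.println("map tasks: " + mapCount(total, split));
        System.out.println("suggested reduces: " + suggestedReduces(10, 2));
    }
}
```

With 64 MB blocks the hint of 40 maps is overridden: the block size caps the split at 64 MB, so the job gets one map per block. Raising the minimum split size above the block size is the way to get fewer, larger maps.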
Some handy tools
• Partitioners
• Combiners
• Compression
• Counters
• Speculation
• Zero Reduces
• Distributed File Cache
• Tool
Partitioners
• Partitioners are application code that defines how keys
are assigned to reduces
• Default partitioning spreads keys evenly, but randomly
– Uses key.hashCode() % num_reduces
• Custom partitioning is often required, for example, to
produce a total order in the output
– Should implement Partitioner interface
– Set by calling conf.setPartitionerClass(MyPart.class)
– To get a total order, sample the map output keys and pick
values to divide the keys into roughly equal buckets and use
that in your partitioner
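Both partitioning styles above can be sketched without any Hadoop classes. In the sketch below, `HashPartitionSketch`, `partition` and `rangePartition` are hypothetical names, and the split points are assumed to come from sampling the map output keys as the slide suggests. The hash version masks the sign bit before taking the modulus, because `hashCode()` can be negative in Java and a plain `%` would then give a negative partition number.

```java
/** Illustrative sketch only; this is not the Hadoop Partitioner interface. */
final class HashPartitionSketch {

    /** Default-style partitioning: non-negative hash modulo the number of reduces. */
    static int partition(Object key, int numReduces) {
        // Clearing the sign bit keeps the result in [0, numReduces).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    /**
     * Range partitioning for a total order. splitPoints must be sorted; keys below
     * splitPoints[0] go to reduce 0, keys below splitPoints[1] go to reduce 1, and so on.
     */
    static int rangePartition(String key, String[] splitPoints) {
        int i = 0;
        while (i < splitPoints.length && key.compareTo(splitPoints[i]) >= 0) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"};   // hypothetical sampled keys, giving 3 buckets
        for (String word : new String[] {"apple", "kiwi", "zebra"}) {
            System.out.println(word + ": hash -> " + partition(word, 3)
                    + ", range -> " + rangePartition(word, splitPoints));
        }
    }
}
```

With range partitioning, concatenating the sorted outputs of reduce 0, 1, 2 in order yields a globally sorted result, which hash partitioning cannot provide.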
Combiners
• When maps produce many repeated keys
– It is often useful to do a local aggregation following the map
– Done by specifying a Combiner
– Goal is to decrease size of the transient data
– Combiners have the same interface as Reduces, and often are the
same class
– Combiners must not have side effects, because they run an indeterminate
number of times
– In WordCount, conf.setCombinerClass(Reduce.class);
Compression
• Compressing the outputs and intermediate data will often yield
huge performance gains
– Can be specified via a configuration file or set programmatically
– Set mapred.output.compress to true to compress job output
– Set mapred.compress.map.output to true to compress map outputs
• Compression Types (mapred(.map)?.output.compression.type)
– “block” - Group of keys and values are compressed together
– “record” - Each value is compressed individually
– Block compression is almost always best
• Compression Codecs
(mapred(.map)?.output.compression.codec)
– Default (zlib) - slower, but more compression
– LZO - faster, but less compression
Counters
• Often Map/Reduce applications have countable events
• For example, the framework counts records going into and out
of the Mapper and Reducer
• To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
• Define nice names in a MyClass_Counter.properties
file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
Speculative execution
• The framework can run multiple instances of slow
tasks
– Output from instance that finishes first is used
– Controlled by the configuration variable
mapred.speculative.execution
– Can dramatically shorten the long tails of jobs
Zero Reduces
• Frequently, we only need to run a filter on the input
data
– No sorting or shuffling required by the job
– Set the number of reduces to 0
– Output from maps will go directly to OutputFormat and disk
Distributed File Cache
• Tasks sometimes need read-only copies of data on the local
computer
– Downloading 1GB of data for each Mapper is expensive
• Define list of files you need to download in JobConf
• Files are downloaded once per computer
• Add to launching program:
DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"),
conf);
• Add to task:
Path[] files = DistributedCache.getLocalCacheFiles(conf);
Tool
• Handle “standard” Hadoop command line options
– -conf file - load a configuration file named file
– -D prop=value - define a single configuration property prop
• Class looks like:
public class MyApp extends Configured implements Tool {
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Configuration(),
new MyApp(), args));
}
public int run(String[] args) throws Exception {
…. getConf() ….
}
}
Finding the Shortest Path
• A common graph search
application is finding the
shortest path from a start
node to one or more
target nodes
• Commonly done on a
single machine with
Dijkstra’s Algorithm
• Can we use BFS to find
the shortest path via
MapReduce?
Finding the Shortest Path: Intuition
• We can define the solution to this problem
inductively
– DistanceTo(startNode) = 0
– For all nodes n directly reachable from startNode,
DistanceTo(n) = 1
– For all nodes n reachable from some other set of nodes S,
DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
From Intuition to Algorithm
• A map task receives a node n as a key, and
(D, points-to) as its value
– D is the distance to the node from the start
– points-to is a list of nodes reachable from n
– ∀ p ∈ points-to, emit (p, D+1)
• Reduce task gathers possible distances to a
given p and selects the minimum one
What This Gives Us
• This MapReduce task can advance the known
frontier by one hop
• To perform the whole BFS, a non-MapReduce
component then feeds the output of this step
back into the MapReduce task for another
iteration
– Problem: Where’d the points-to list go?
– Solution: Mapper emits (n, points-to) as well
Blow-up and Termination
• This algorithm starts from one node
• Subsequent iterations include many more
nodes of the graph as the frontier advances
• Does this ever terminate?
– Yes! Eventually, routes between nodes will stop
being discovered and no better distances will be
found. When distance is the same, we stop
– Mapper should emit (n,D) to ensure that “current
distance” is carried into the reducer
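The iteration described over the last few slides can be imitated in one process to see how the frontier advances and when the loop stops. This is a sketch under stated assumptions: `BfsByIteration`, `step` and `shortestPaths` are hypothetical names, the points-to lists are kept in a map that is available on every pass (standing in for the mapper re-emitting them), and all edges have weight 1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of iterated MapReduce BFS; names are hypothetical. */
final class BfsByIteration {

    /**
     * One map + reduce pass. "Map": each node n with known distance D emits (p, D+1)
     * for every p in its points-to list, plus (n, D) to carry its current distance.
     * "Reduce": keep the minimum distance seen for each node.
     */
    static Map<String, Integer> step(Map<String, List<String>> pointsTo, Map<String, Integer> dist) {
        Map<String, Integer> next = new TreeMap<>(dist);        // the (n, D) records
        for (Map.Entry<String, Integer> e : dist.entrySet()) {
            List<String> targets =
                    pointsTo.getOrDefault(e.getKey(), Collections.<String>emptyList());
            for (String p : targets) {
                next.merge(p, e.getValue() + 1, Math::min);     // minimum over all offers
            }
        }
        return next;
    }

    /** Driver: repeat the pass until no distance changes. Unreachable nodes never appear. */
    static Map<String, Integer> shortestPaths(Map<String, List<String>> pointsTo, String start) {
        Map<String, Integer> dist = new TreeMap<>();
        dist.put(start, 0);
        while (true) {
            Map<String, Integer> next = step(pointsTo, dist);
            if (next.equals(dist)) {
                return dist;
            }
            dist = next;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new TreeMap<>();
        g.put("A", Arrays.asList("B", "C"));
        g.put("B", Arrays.asList("D"));
        g.put("C", Arrays.asList("D"));
        g.put("D", Arrays.asList("E"));
        System.out.println(shortestPaths(g, "A"));   // prints {A=0, B=1, C=1, D=2, E=3}
    }
}
```

The `while` loop plays the role of the non-MapReduce driver: it feeds each pass's output into the next and stops on the first pass that changes nothing.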
Hadoop Subprojects
Hadoop Related Subprojects
• Pig
– High-level language for data analysis
• HBase
– Table storage for semi-structured data
• ZooKeeper
– Coordinating distributed applications
• Hive
– SQL-like Query language and Metastore
• Mahout
– Machine learning
Pig
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
– Easy to plug in Java functions
An Example Problem
• Suppose you have
user data in a file,
website data in
another, and you
need to find the top
5 most visited pages
by users aged 18-25
[Figure: dataflow plan. Load Users → Filter by age; Load Pages; both feed Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
In MapReduce
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Ease of Translation
Load Users       ->  Users = load …
Filter by age    ->  Fltrd = filter …
Load Pages       ->  Pages = load …
Join on name     ->  Joined = join …
Group on url     ->  Grouped = group …
Count clicks     ->  Summed = … count()…
Order by clicks  ->  Sorted = order …
Take top 5       ->  Top5 = limit …
Ease of Translation
[Figure: the same dataflow steps and Pig Latin statements as on the previous slide, grouped into three MapReduce jobs: Job 1, Job 2, Job 3]
HBase
Original Slides by
Tom White
Lexeme Ltd.
HBase - What?
• Modeled on Google’s Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
HBase - Data Model
Row          Timestamp   Column family animal:         Column family repairs:
                         animal:type    animal:size    repairs:cost
enclosure1   t2          zebra                         1000 EUR
             t1          lion           big
enclosure2   …           …              …              …
HBase - Data Storage
Column family animal:
(enclosure1, t2, animal:type) zebra
(enclosure1, t1, animal:size) big
(enclosure1, t1, animal:type) lion
Column family repairs:
(enclosure1, t1, repairs:cost) 1000 EUR
HBase - Code
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);
update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBase - Querying
• Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
RowResult = table.getRow("enclosure1");
• Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
Hive
Original Slides by
Matei Zaharia
UC Berkeley RAD Lab
Hive
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
– Maintains list of table schemas
– SQL-like query language (HiveQL)
– Can call Hadoop Streaming scripts from HiveQL
– Supports table partitioning, clustering, complex
data types, some optimizations
Creating a Hive Table
• Partitioning breaks table into separate files for
each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
A Simple Query
• Find all page views coming from xyz.com
during March 2008:
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2008-03-01'
AND page_views.dt <= '2008-03-31'
AND page_views.referrer_url like '%xyz.com';
• Hive only reads the partitions for those dates
instead of scanning the entire table
Aggregation and Joins
• Count users who visited each page by gender:
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.dt = '2008-03-03'
GROUP BY pv.page_url, u.gender;
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid,
page_views.dt)
USING 'map_script.py'
AS dt, uid
FROM page_views
CLUSTER BY dt;
Storm
Original Slides by
Nathan Marz
Twitter
Storm
• Developed by BackType, which was acquired
by Twitter
• Lots of tools for data (i.e. batch) processing
– Hadoop, Pig, HBase, Hive, …
• None of them is a realtime system, yet realtime
processing is becoming a real requirement for businesses
• Storm provides realtime computation
– Scalable
– Guarantees no data loss
– Extremely robust and fault-tolerant
– Programming language agnostic
Before Storm
Before Storm – Adding a worker
Deploy
Reconfigure/Redeploy
Problems
• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message
passing
• “Just works” !!
Storm Cluster
• Nimbus: master node (similar to
Hadoop JobTracker)
• ZooKeeper: used for cluster coordination
• Supervisors: run worker processes
Concepts
• Streams
• Spouts
• Bolts
• Topologies
Streams
Unbounded sequence of tuples
Spouts
Source of streams
Bolts
Processes input streams and produces new streams.
Can implement functions such as filters, aggregations, joins, etc.
Topology
Network of spouts and bolts
Topology
Spouts and bolts execute as
many tasks across the cluster
Stream Grouping
When a tuple is emitted, which task does it go to?
Stream Grouping
• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a
subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
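The four groupings above differ only in how a target task is chosen for each tuple, which can be sketched in plain Java. Nothing here is the Storm API: `StreamGroupingSketch` and its methods are hypothetical names, a tuple is modeled as a plain `List`, and the fields grouping uses a simple hash modulo the task count as a stand-in for the hashing the slide mentions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; these are not Storm classes. */
final class StreamGroupingSketch {

    /** Shuffle grouping: pick a random task. */
    static int shuffle(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    /** Fields grouping: hash the selected fields so equal values always reach the same task. */
    static int fields(List<?> tuple, int[] fieldIndexes, int numTasks) {
        Object[] selected = new Object[fieldIndexes.length];
        for (int i = 0; i < fieldIndexes.length; i++) {
            selected[i] = tuple.get(fieldIndexes[i]);
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(Arrays.hashCode(selected), numTasks);
    }

    /** All grouping: every task receives a copy of the tuple. */
    static int[] all(int numTasks) {
        int[] ids = new int[numTasks];
        for (int i = 0; i < numTasks; i++) {
            ids[i] = i;
        }
        return ids;
    }

    /** Global grouping: the task with the lowest id. */
    static int global(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    public static void main(String[] args) {
        List<Object> t1 = Arrays.<Object>asList("alice", "click");
        List<Object> t2 = Arrays.<Object>asList("alice", "view");
        int[] userField = {0};   // group on the first field only
        System.out.println("same task for same user: "
                + (fields(t1, userField, 4) == fields(t2, userField, 4)));
        System.out.println("all: " + Arrays.toString(all(3)));
        System.out.println("global: " + global(new int[] {4, 2, 7}));
    }
}
```

Fields grouping is what makes per-key state possible in a bolt: a running count per user works only if every tuple for that user lands on the same task.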

办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 

Lecture17 (1).ppt

  • 1. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Cloud Tools Overview
  • 2. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Hadoop
  • 3. FEARLESS engineering Outline • Hadoop - Basics • HDFS – Goals – Architecture – Other functions • MapReduce – Basics – Word Count Example – Handy tools – Finding shortest path example • Related Apache sub-projects (Pig, HBase,Hive)
  • 4. FEARLESS engineering Hadoop - Why ? • Need to process huge datasets on large clusters of computers • Very expensive to build reliability into each application • Nodes fail every day – Failure is expected, rather than exceptional – The number of nodes in a cluster is not constant • Need a common infrastructure – Efficient, reliable, easy to use – Open Source, Apache Licence
  • 5. FEARLESS engineering Who uses Hadoop? • Amazon/A9 • Facebook • Google • New York Times • Veoh • Yahoo! • …. many more
  • 6. FEARLESS engineering Commodity Hardware • Typically in 2 level architecture – Nodes are commodity PCs – 30-40 nodes/rack – Uplink from rack is 3-4 gigabit – Rack-internal is 1 gigabit Aggregation switch Rack switch
  • 7. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Hadoop Distributed File System (HDFS) Original Slides by Dhruba Borthakur Apache Hadoop Project Management Committee
  • 8. FEARLESS engineering Goals of HDFS • Very Large Distributed File System – 10K nodes, 100 million files, 10PB • Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recover from them • Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth
  • 9. FEARLESS engineering Distributed File System • Single Namespace for entire cluster • Data Coherency – Write-once-read-many access model – Client can only append to existing files • Files are broken up into blocks – Typically 64MB block size – Each block replicated on multiple DataNodes • Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
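A small worked example may make the block arithmetic concrete. This is only a sketch: the class and method names are hypothetical, the 64 MB block size comes from the slide, and the replication factor of 3 is an assumption for illustration.

```java
public class BlockSplitSketch {
    // 64 MB, the typical block size named on the slide.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold a file (ceiling division).
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long storedBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println("blocks for 1 GB: " + blockCount(oneGb));
        System.out.println("blocks for 1 GB + 1 byte: " + blockCount(oneGb + 1));
        // Replication factor 3 is an assumed value, not taken from the slide.
        System.out.println("bytes stored at replication 3: " + storedBytes(oneGb, 3));
    }
}
```

A file that is one byte over a block boundary still needs a whole extra block entry in the namespace, which is one reason HDFS favors a modest number of large files.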
  • 11. FEARLESS engineering Functions of a NameNode • Manages File System Namespace – Maps a file name to a set of blocks – Maps a block to the DataNodes where it resides • Cluster Configuration Management • Replication Engine for Blocks
  • 12. FEARLESS engineering NameNode Metadata • Metadata in Memory – The entire metadata is in main memory – No demand paging of metadata • Types of metadata – List of files – List of Blocks for each file – List of DataNodes for each block – File attributes, e.g. creation time, replication factor • A Transaction Log – Records file creations, file deletions etc
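The two mappings and the transaction log described above can be pictured as plain in-memory collections. The following is a minimal sketch, not HDFS code: `NamespaceSketch` and all of its methods are hypothetical names, and block reports are simplified to one call per block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NamespaceSketch {
    // File name -> ordered list of block ids.
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block id -> DataNodes currently holding a replica.
    private final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();
    // Append-only record of namespace changes.
    private final List<String> transactionLog = new ArrayList<>();

    void createFile(String path, List<Long> blockIds) {
        fileToBlocks.put(path, new ArrayList<>(blockIds));
        transactionLog.add("CREATE " + path);
    }

    // Called when a DataNode reports that it stores a block.
    void reportBlock(long blockId, String dataNode) {
        blockToDataNodes.computeIfAbsent(blockId, k -> new TreeSet<>()).add(dataNode);
    }

    // For each block of the file, the set of DataNodes a client could read it from.
    List<Set<String>> locate(String path) {
        List<Set<String>> result = new ArrayList<>();
        for (long id : fileToBlocks.getOrDefault(path, Collections.emptyList())) {
            result.add(blockToDataNodes.getOrDefault(id, Collections.emptySet()));
        }
        return result;
    }

    List<String> log() {
        return transactionLog;
    }

    public static void main(String[] args) {
        NamespaceSketch ns = new NamespaceSketch();
        ns.createFile("/foodir/myfile.txt", List.of(1L, 2L));
        ns.reportBlock(1L, "dn1");
        ns.reportBlock(1L, "dn2");
        ns.reportBlock(2L, "dn3");
        System.out.println(ns.locate("/foodir/myfile.txt"));
    }
}
```

Because every lookup is a map access in main memory, the client gets block locations without the NameNode touching disk, at the cost of the namespace size being bounded by the NameNode's RAM.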
  • 13. FEARLESS engineering DataNode • A Block Server – Stores data in the local file system (e.g. ext3) – Stores metadata of a block (e.g. CRC) – Serves data and metadata to Clients • Block Report – Periodically sends a report of all existing blocks to the NameNode • Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 14. FEARLESS engineering Block Placement • Current Strategy – One replica on local node – Second replica on a remote rack – Third replica on same remote rack – Additional replicas are randomly placed • Clients read from nearest replicas • Would like to make this policy pluggable
  • 15. FEARLESS engineering Heartbeats • DataNodes send a heartbeat to the NameNode – Once every 3 seconds • NameNode uses heartbeats to detect DataNode failure
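Heartbeat-based failure detection amounts to remembering when each node was last heard from. The sketch below illustrates the idea only: the class name is hypothetical, and the timeout of ten missed heartbeats is an assumed value chosen for the example, not the threshold HDFS actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatMonitorSketch {
    // Interval from the slide: one heartbeat every 3 seconds.
    static final long HEARTBEAT_INTERVAL_MS = 3_000;
    // Assumed for illustration: declare a node dead after 10 missed heartbeats.
    static final long TIMEOUT_MS = 10 * HEARTBEAT_INTERVAL_MS;

    private final Map<String, Long> lastSeen = new HashMap<>();

    // Record that a DataNode checked in at the given time.
    void heartbeat(String dataNode, long nowMs) {
        lastSeen.put(dataNode, nowMs);
    }

    // DataNodes whose last heartbeat is older than the timeout, sorted by name.
    List<String> deadNodes(long nowMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                dead.add(e.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch();
        monitor.heartbeat("dn1", 0);
        monitor.heartbeat("dn2", 0);
        monitor.heartbeat("dn1", 30_000); // dn2 has gone silent
        System.out.println(monitor.deadNodes(31_000));
    }
}
```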
  • 16. FEARLESS engineering Replication Engine • NameNode detects DataNode failures – Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
  • 17. FEARLESS engineering Data Correctness • Use Checksums to validate data – Use CRC32 • File Creation – Client computes checksum per 512 bytes – DataNode stores the checksum • File access – Client retrieves the data and checksum from DataNode – If Validation fails, Client tries other replicas
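The per-chunk checksum scheme can be sketched with the JDK's own `java.util.zip.CRC32`. This is an illustration under assumptions: the class and method names are hypothetical, and the real HDFS on-disk checksum layout is not reproduced here, only the idea of one CRC32 per 512 bytes.

```java
import java.util.zip.CRC32;

public class ChunkChecksumSketch {
    // Chunk size from the slide: one checksum per 512 bytes.
    static final int BYTES_PER_CHECKSUM = 512;

    // What the writing client would compute: one CRC32 value per chunk.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * BYTES_PER_CHECKSUM;
            int length = Math.min(BYTES_PER_CHECKSUM, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // What the reading client would do: recompute and compare.
    // Returns the index of the first mismatching chunk, or -1 if the data validates.
    static int firstCorruptChunk(byte[] data, long[] expected) {
        long[] actual = checksums(data);
        for (int i = 0; i < expected.length; i++) {
            if (i >= actual.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];
        long[] stored = checksums(block);
        block[600] ^= 0x01; // simulate a flipped bit on disk
        System.out.println("first corrupt chunk: " + firstCorruptChunk(block, stored));
    }
}
```

On a mismatch the client would discard this replica's data and retry against another DataNode, as the slide describes.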
  • 18. FEARLESS engineering NameNode Failure • A single point of failure • Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) • Need to develop a real HA solution
  • 19. FEARLESS engineering Data Pipelining • Client retrieves a list of DataNodes on which to place replicas of a block • Client writes block to the first DataNode • The first DataNode forwards the data to the next node in the Pipeline • When all replicas are written, the Client moves on to write the next block in file
  • 20. FEARLESS engineering Rebalancer • Goal: % disk full on DataNodes should be similar – Usually run when new DataNodes are added – Cluster is online when Rebalancer is active – Rebalancer is throttled to avoid network congestion – Command line tool
  • 21. FEARLESS engineering Secondary NameNode • Copies FsImage and Transaction Log from Namenode to a temporary directory • Merges FSImage and Transaction Log into a new FSImage in temporary directory • Uploads new FSImage to the NameNode – Transaction Log on NameNode is purged
  • 22. FEARLESS engineering User Interface • Commands for HDFS User: – hadoop dfs -mkdir /foodir – hadoop dfs -cat /foodir/myfile.txt – hadoop dfs -rm /foodir/myfile.txt • Commands for HDFS Administrator – hadoop dfsadmin -report – hadoop dfsadmin -decommision datanodename • Web Interface – http://host:port/dfshealth.jsp
  • 23. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering MapReduce Original Slides by Owen O’Malley (Yahoo!) & Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet
  • 24. FEARLESS engineering MapReduce - What? • MapReduce is a programming model for efficient distributed computing • It works like a Unix pipeline – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output • Efficiency from – Streaming through data, reducing seeks – Pipelining • A good fit for a lot of applications – Log processing – Web index building
  • 26. FEARLESS engineering MapReduce - Features • Fine grained Map and Reduce tasks – Improved load balancing – Faster recovery from failed tasks • Automatic re-execution on failure – In a large cluster, some nodes are always slow or flaky – Framework re-executes failed tasks • Locality optimizations – With large data, bandwidth to data is a problem – Map-Reduce + HDFS is a very effective solution – Map-Reduce queries HDFS for locations of input data – Map tasks are scheduled close to the inputs when possible
  • 27. FEARLESS engineering Word Count Example • Mapper – Input: value: lines of text of input – Output: key: word, value: 1 • Reducer – Input: key: word, value: set of counts – Output: key: word, value: sum • Launching program – Defines this job – Submits job to cluster
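Before looking at the Hadoop classes, it may help to see the three phases run in a single JVM. The sketch below imitates map, shuffle & sort, and reduce with ordinary collections; it does not use the Hadoop API, and `WordCountSketch` and its methods are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Map phase: emit one (word, 1) pair per token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Shuffle & sort: group all values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // The "launching program": wires the phases together.
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the cat", "the dog")));
    }
}
```

In a real job the framework performs the shuffle across machines; only the map and reduce bodies are application code.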
  • 29. FEARLESS engineering Word Count Mapper public static class Map extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> { private static final IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } }
  • 30. FEARLESS engineering Word Count Reducer public static class Reduce extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 31. FEARLESS engineering Word Count Example • Jobs are controlled by configuring JobConfs • JobConfs are maps from attribute names to string values • The framework defines attributes to control how the job is executed – conf.set("mapred.job.name", "MyApp"); • Applications can add arbitrary values to the JobConf – conf.set("my.string", "foo"); – conf.setInt("my.integer", 12); • JobConf is available to all tasks
  • 32. FEARLESS engineering Putting it all together • Create a launching program for your application • The launching program configures: – The Mapper and Reducer to use – The output key and value types (input types are inferred from the InputFormat) – The locations for your input and output • The launching program then submits the job and typically waits for it to complete
  • 33. FEARLESS engineering Putting it all together JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);
  • 34. FEARLESS engineering Input and Output Formats • A Map/Reduce may specify how its input is to be read by specifying an InputFormat to be used • A Map/Reduce may specify how its output is to be written by specifying an OutputFormat to be used • These default to TextInputFormat and TextOutputFormat, which process line-based text data • Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data • These are file-based, but they are not required to be
  • 35. FEARLESS engineering How many Maps and Reduces • Maps – Usually as many as the number of HDFS blocks being processed; this is the default – Otherwise the number of maps can be specified as a hint – The number of maps can also be controlled by specifying the minimum split size – The actual sizes of the map inputs are computed by: max(min(block_size, data/#maps), min_split_size) • Reduces – Unless the amount of data being processed is small, use 0.95*num_nodes*mapred.tasktracker.tasks.maximum
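The split-size rule and the reduce-count rule of thumb are easy to check with a few numbers. The sketch below simply evaluates the two formulas from the slide; the class name, method names, and the sample cluster sizes are assumptions for illustration, and the 0.95 factor is applied with integer arithmetic (rounding down).

```java
public class SplitSizeSketch {
    // max(min(block_size, data / #maps), min_split_size), as given on the slide.
    static long splitSize(long blockSize, long totalBytes, int requestedMaps, long minSplitSize) {
        long goalSize = totalBytes / Math.max(1, requestedMaps);
        return Math.max(Math.min(blockSize, goalSize), minSplitSize);
    }

    // 0.95 * num_nodes * (max tasks per node), rounded down.
    static long suggestedReduces(int numNodes, int maxTasksPerNode) {
        return (95L * numNodes * maxTasksPerNode) / 100;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long tenGb = 10L * 1024 * mb;
        // Few maps requested: the block size caps the split.
        System.out.println(splitSize(64 * mb, tenGb, 20, 1) / mb + " MB");
        // A large minimum split size overrides the block size.
        System.out.println(splitSize(64 * mb, tenGb, 20, 128 * mb) / mb + " MB");
        // Hypothetical cluster: 10 nodes with 2 reduce slots each.
        System.out.println(suggestedReduces(10, 2) + " reduces");
    }
}
```

Keeping the reduce count slightly below the total number of slots leaves headroom so that all reduces can run in a single wave even if a few slots are unavailable.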
  • 36. FEARLESS engineering Some handy tools • Partitioners • Combiners • Compression • Counters • Speculation • Zero Reduces • Distributed File Cache • Tool
  • 37. FEARLESS engineering Partitioners • Partitioners are application code that define how keys are assigned to reduces • Default partitioning spreads keys evenly, but randomly – Uses key.hashCode() % num_reduces • Custom partitioning is often required, for example, to produce a total order in the output – Should implement Partitioner interface – Set by calling conf.setPartitionerClass(MyPart.class) – To get a total order, sample the map output keys and pick values to divide the keys into roughly equal buckets and use that in your partitioner
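Both partitioning styles fit in a few lines. This is a sketch rather than the Hadoop `Partitioner` interface: the class and method names are hypothetical, and the split points in the usage example are made up. Note that a plain `key.hashCode() % num_reduces` can go negative for negative hash codes, so the hash is masked to a non-negative value first.

```java
import java.util.Arrays;

public class PartitionSketch {
    // Default-style partitioning: hash the key, mask off the sign bit, take the remainder.
    static int hashPartition(Object key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Total-order partitioning: split points (sampled from the map output keys and
    // sorted) divide the key space into ranges. Keys below the first split point go
    // to reduce 0, keys at or above the last go to the final reduce.
    static int rangePartition(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key);
        // Exact match: the key starts the next range. Otherwise use the insertion point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        System.out.println(hashPartition("hadoop", 4));
        String[] splits = {"g", "p"}; // hypothetical sampled split points, 3 reduces
        for (String key : new String[] {"apple", "g", "kiwi", "zebra"}) {
            System.out.println(key + " -> reduce " + rangePartition(key, splits));
        }
    }
}
```

With range partitioning, concatenating the reduce outputs in order of reduce number yields a globally sorted result, because every key sent to reduce i sorts before every key sent to reduce i+1.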
  • 38. FEARLESS engineering Combiners • When maps produce many repeated keys – It is often useful to do a local aggregation following the map – Done by specifying a Combiner – Goal is to decrease size of the transient data – Combiners have the same interface as Reduces, and often are the same class – Combiners must not have side effects, because they run an indeterminate number of times – In WordCount, conf.setCombinerClass(Reduce.class);
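The effect of a combiner can be shown without Hadoop: aggregate each map's output locally, then reduce the partial results. The sketch below uses hypothetical names and in-memory lists standing in for two map tasks; it illustrates why summing works as both combiner and reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Combiner: collapse one map task's repeated keys into (key, partial count).
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer: sum the partial counts arriving from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> map1 = Arrays.asList("the", "the", "the", "cat");
        List<String> map2 = Arrays.asList("the", "dog");
        Map<String, Integer> p1 = combine(map1);
        Map<String, Integer> p2 = combine(map2);
        // 6 map output records shrink to 4 records sent over the network.
        System.out.println("records shuffled: " + (p1.size() + p2.size()));
        System.out.println(reduce(Arrays.asList(p1, p2)));
    }
}
```

Summation is associative and commutative, so the totals are the same whether the combiner runs zero, one, or several times on a map's output.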
  • 39. FEARLESS engineering Compression • Compressing the outputs and intermediate data will often yield huge performance gains – Can be specified via a configuration file or set programmatically – Set mapred.output.compress to true to compress job output – Set mapred.compress.map.output to true to compress map outputs • Compression Types (mapred(.map)?.output.compression.type) – “block” - Group of keys and values are compressed together – “record” - Each value is compressed individually – Block compression is almost always best • Compression Codecs (mapred(.map)?.output.compression.codec) – Default (zlib) - slower, but more compression – LZO - faster, but less compression
  • 40. FEARLESS engineering Counters • Often Map/Reduce applications have countable events • For example, framework counts records in to and out of Mapper and Reducer • To define user counters: static enum Counter {EVENT1, EVENT2}; reporter.incrCounter(Counter.EVENT1, 1); • Define nice names in a MyClass_Counter.properties file CounterGroupName=MyCounters EVENT1.name=Event 1 EVENT2.name=Event 2
  • 41. FEARLESS engineering Speculative execution • The framework can run multiple instances of slow tasks – Output from instance that finishes first is used – Controlled by the configuration variable mapred.speculative.execution – Can dramatically bring in long tails on jobs
  • 42. FEARLESS engineering Zero Reduces • Frequently, we only need to run a filter on the input data – No sorting or shuffling required by the job – Set the number of reduces to 0 – Output from maps will go directly to OutputFormat and disk
  • 43. FEARLESS engineering Distributed File Cache • Sometimes need read-only copies of data on the local computer – Downloading 1GB of data for each Mapper is expensive • Define list of files you need to download in JobConf • Files are downloaded once per computer • Add to launching program: DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf); • Add to task: Path[] files = DistributedCache.getLocalCacheFiles(conf);
  • 44. FEARLESS engineering Tool • Handle “standard” Hadoop command line options – -conf file - load a configuration file named file – -D prop=value - define a single configuration property prop • Class looks like: public class MyApp extends Configured implements Tool { public static void main(String[] args) throws Exception { System.exit(ToolRunner.run(new Configuration(), new MyApp(), args)); } public int run(String[] args) throws Exception { …. getConf() …. } }
  • 45. FEARLESS engineering Finding the Shortest Path • A common graph search application is finding the shortest path from a start node to one or more target nodes • Commonly done on a single machine with Dijkstra’s Algorithm • Can we use BFS to find the shortest path via MapReduce?
  • 46. FEARLESS engineering Finding the Shortest Path: Intuition • We can define the solution to this problem inductively – DistanceTo(startNode) = 0 – For all nodes n directly reachable from startNode, DistanceTo(n) = 1 – For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
  • 47. FEARLESS engineering From Intuition to Algorithm • A map task receives a node n as a key, and (D, points-to) as its value – D is the distance to the node from the start – points-to is a list of nodes reachable from n – ∀ p ∈ points-to, emit (p, D+1) • Reduce task gathers possible distances to a given p and selects the minimum one
  • 48. FEARLESS engineering What This Gives Us • This MapReduce task can advance the known frontier by one hop • To perform the whole BFS, a non-MapReduce component then feeds the output of this step back into the MapReduce task for another iteration – Problem: Where’d the points-to list go? – Solution: Mapper emits (n, points-to) as well
  • 49. FEARLESS engineering Blow-up and Termination • This algorithm starts from one node • Subsequent iterations include many more nodes of the graph as the frontier advances • Does this ever terminate? – Yes! Eventually, routes between nodes will stop being discovered and no better distances will be found. When distance is the same, we stop – Mapper should emit (n,D) to ensure that “current distance” is carried into the reducer
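The iteration described on the last few slides can be simulated in one JVM. This is a sketch under assumptions: all names are hypothetical, every node is assumed to appear as a key in the adjacency map, and the points-to lists are kept in that map by the driver rather than being re-emitted by the mapper as the slides suggest for a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BfsMapReduceSketch {
    // Stands for "not reached yet".
    static final int INF = Integer.MAX_VALUE;

    // One MapReduce round: advances the known frontier by one hop.
    static Map<String, Integer> iterate(Map<String, List<String>> graph, Map<String, Integer> dist) {
        // "Map": for node n with distance D, emit (n, D) and (p, D+1) for each p in points-to.
        Map<String, List<Integer>> emitted = new TreeMap<>();
        for (String n : graph.keySet()) {
            int d = dist.getOrDefault(n, INF);
            emitted.computeIfAbsent(n, k -> new ArrayList<>()).add(d);
            if (d == INF) {
                continue; // unreached nodes contribute no candidate distances
            }
            for (String p : graph.get(n)) {
                emitted.computeIfAbsent(p, k -> new ArrayList<>()).add(d + 1);
            }
        }
        // "Reduce": keep the minimum candidate distance for each node.
        Map<String, Integer> next = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : emitted.entrySet()) {
            next.put(e.getKey(), Collections.min(e.getValue()));
        }
        return next;
    }

    // Non-MapReduce driver: feed each round's output back in until nothing changes.
    static Map<String, Integer> shortestPaths(Map<String, List<String>> graph, String start) {
        Map<String, Integer> dist = new TreeMap<>();
        dist.put(start, 0);
        while (true) {
            Map<String, Integer> next = iterate(graph, dist);
            if (next.equals(dist)) {
                return dist;
            }
            dist = next;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new TreeMap<>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("D"));
        graph.put("C", Arrays.asList("D"));
        graph.put("D", Collections.<String>emptyList());
        System.out.println(shortestPaths(graph, "A"));
    }
}
```

Emitting (n, D) for every node is what carries the current distance into the reducer, so a node's distance never gets worse from one round to the next and the fixed-point test is a valid stopping condition.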
  • 50. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Hadoop Subprojects
  • 51. FEARLESS engineering Hadoop Related Subprojects • Pig – High-level language for data analysis • HBase – Table storage for semi-structured data • Zookeeper – Coordinating distributed applications • Hive – SQL-like Query language and Metastore • Mahout – Machine learning
  • 52. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Pig Original Slides by Matei Zaharia UC Berkeley RAD Lab
  • 53. FEARLESS engineering Pig • Started at Yahoo! Research • Now runs about 30% of Yahoo!’s jobs • Features – Expresses sequences of MapReduce jobs – Data model: nested “bags” of items – Provides relational (SQL) operators (JOIN, GROUP BY, etc.) – Easy to plug in Java functions
  • 54. FEARLESS engineering An Example Problem • Suppose you have user data in a file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25 Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5
  • 56. FEARLESS engineering In Pig Latin Users = load 'users' as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, COUNT(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into 'top5sites';
  • 57. FEARLESS engineering Ease of Translation Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5 Users = load … Fltrd = filter … Pages = load … Joined = join … Grouped = group … Summed = … count()… Sorted = order … Top5 = limit …
  • 58. FEARLESS engineering Ease of Translation Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5 Users = load … Fltrd = filter … Pages = load … Joined = join … Grouped = group … Summed = … count()… Sorted = order … Top5 = limit … Job 1 Job 2 Job 3
  • 59. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering HBase Original Slides by Tom White Lexeme Ltd.
  • 60. FEARLESS engineering HBase - What? • Modeled on Google’s Bigtable • Row/column store • Billions of rows/millions of columns • Column-oriented - nulls are free • Untyped - stores byte[]
  • 61. FEARLESS engineering HBase - Data Model Row Timestamp Column family: animal: Column family repairs: animal:type animal:size repairs:cost enclosure1 t2 zebra 1000 EUR t1 lion big enclosure2 … … … …
  • 62. FEARLESS engineering HBase - Data Storage Column family animal: (enclosure1, t2, animal:type) zebra (enclosure1, t1, animal:size) big (enclosure1, t1, animal:type) lion Column family repairs: (enclosure1, t1, repairs:cost) 1000 EUR
  • 63. FEARLESS engineering HBase - Code HTable table = … Text row = new Text("enclosure1"); Text col1 = new Text("animal:type"); Text col2 = new Text("animal:size"); BatchUpdate update = new BatchUpdate(row); update.put(col1, "lion".getBytes("UTF-8")); update.put(col2, "big".getBytes("UTF-8")); table.commit(update); update = new BatchUpdate(row); update.put(col1, "zebra".getBytes("UTF-8")); table.commit(update);
  • 64. FEARLESS engineering HBase - Querying • Retrieve a cell Cell = table.getRow(“enclosure1”).getColumn(“animal:type”).getValue(); • Retrieve a row RowResult = table.getRow( “enclosure1” ); • Scan through a range of rows Scanner s = table.getScanner( new String[] { “animal:type” } );
  • 65. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Hive Original Slides by Matei Zaharia UC Berkeley RAD Lab
  • 66. FEARLESS engineering Hive • Developed at Facebook • Used for majority of Facebook jobs • “Relational database” built on Hadoop – Maintains list of table schemas – SQL-like query language (HiveQL) – Can call Hadoop Streaming scripts from HiveQL – Supports table partitioning, clustering, complex data types, some optimizations
  • 67. FEARLESS engineering Creating a Hive Table • Partitioning breaks table into separate files for each (dt, country) pair Ex: /hive/page_view/dt=2008-06-08,country=USA /hive/page_view/dt=2008-06-08,country=CA CREATE TABLE page_views(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'User IP address') COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) STORED AS SEQUENCEFILE;
  • 68. FEARLESS engineering A Simple Query • Find all page views coming from xyz.com on March 31st: SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND page_views.referrer_url like '%xyz.com'; • Hive only reads partition 2008-03-01,* instead of scanning entire table
  • 69. FEARLESS engineering Aggregation and Joins • Count users who visited each page by gender: SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id) FROM page_views pv JOIN user u ON (pv.userid = u.id) WHERE pv.date = '2008-03-03' GROUP BY pv.page_url, u.gender;
  • 70. FEARLESS engineering Using a Hadoop Streaming Mapper Script SELECT TRANSFORM(page_views.userid, page_views.date) USING 'map_script.py' AS dt, uid CLUSTER BY dt FROM page_views;
  • 71. UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Storm Original Slides by Nathan Marz Twitter
  • 72. FEARLESS engineering Storm • Developed by BackType which was acquired by Twitter • Lots of tools for data (i.e. batch) processing – Hadoop, Pig, HBase, Hive, … • None of them are realtime systems which is becoming a real requirement for businesses • Storm provides realtime computation – Scalable – Guarantees no data loss – Extremely robust and fault-tolerant – Programming language agnostic
  • 74. FEARLESS engineering Before Storm – Adding a worker Deploy Reconfigure/Redeploy
  • 75. FEARLESS engineering Problems • Scaling is painful • Poor fault-tolerance • Coding is tedious
  • 76. FEARLESS engineering What we want • Guaranteed data processing • Horizontal scalability • Fault-tolerance • No intermediate message brokers! • Higher level abstraction than message passing • “Just works” !!
  • 77. FEARLESS engineering Storm Cluster Master node (similar to Hadoop JobTracker) Used for cluster coordination Run worker processes
  • 78. FEARLESS engineering Concepts • Streams • Spouts • Bolts • Topologies
  • 79. FEARLESS engineering Streams Tuple Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  • 81. FEARLESS engineering Bolts Processes input streams and produces new streams: Can implement functions such as filters, aggregation, join, etc
  • 83. FEARLESS engineering Topology Spouts and bolts execute as many tasks across the cluster
  • 84. FEARLESS engineering Stream Grouping When a tuple is emitted which task does it go to?
  • 85. FEARLESS engineering Stream Grouping • Shuffle grouping: pick a random task • Fields grouping: consistent hashing on a subset of tuple fields • All grouping: send to all tasks • Global grouping: pick task with lowest id
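The four groupings boil down to four ways of choosing target task ids for a tuple. The sketch below is not Storm's API: all names are hypothetical, and the fields grouping uses a simple hash-modulo of the selected field values in place of the consistent hashing the slide mentions, which is enough to show that equal field values always reach the same task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class StreamGroupingSketch {
    // Shuffle grouping: any task, chosen at random.
    static int shuffle(int numTasks, Random random) {
        return random.nextInt(numTasks);
    }

    // Fields grouping (simplified to hash modulo): tuples with equal values
    // in the selected fields always map to the same task.
    static int fields(List<?> selectedFieldValues, int numTasks) {
        return Math.floorMod(selectedFieldValues.hashCode(), numTasks);
    }

    // All grouping: every task receives a copy of the tuple.
    static List<Integer> all(int numTasks) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            ids.add(i);
        }
        return ids;
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    static int global(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("shuffle -> task " + shuffle(numTasks, new Random()));
        System.out.println("fields  -> task " + fields(Arrays.asList("user42"), numTasks));
        System.out.println("all     -> tasks " + all(numTasks));
        System.out.println("global  -> task " + global(Arrays.asList(5, 2, 9)));
    }
}
```

Fields grouping is what makes per-key state possible in a bolt, for example a running count per word, because every tuple for a given key lands on the same task.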