Data Intensive Computing Frameworks
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
1394/2/25
Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95
Big Data
Big Data refers to datasets and data flows large enough that they have outpaced our capability to store, process, analyze, and understand them.
The Four Dimensions of Big Data
Volume: data size
Velocity: data generation rate
Variety: data heterogeneity
The fourth V is disputed; it is variously given as Veracity, Variability, or Value.
Where Does Big Data Come From?
Big Data Market Driving Factors
The number of web pages indexed by Google, around one million in 1998, exceeded one trillion in 2008, and its growth has been accelerated by the rise of social networks.∗
∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]
Big Data Market Driving Factors
The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016.∗
∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]
Big Data Market Driving Factors
More than 65 billion devices were connected to the Internet by 2010, and this number is expected to reach 230 billion by 2020.∗
∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]
Big Data Market Driving Factors
Many companies are moving towards using Cloud services to access Big Data analytical tools.
Big Data Market Driving Factors
Open source communities
How To Store and Process Big Data?
But First, The History
4000 B.C.
Manual recording
From tablets to papyrus, to parchment, and then to paper
1450
Gutenberg’s printing press
1800’s - 1940’s
Punched cards (no fault tolerance)
Binary data
1890: US census
1911: IBM founded (as CTR)
1940’s - 1950’s
Magnetic tapes
1950’s - 1960’s
Large-scale mainframe computers
Batch transaction processing
File-oriented record processing model (e.g., COBOL)
1960’s - 1970’s
Hierarchical DBMS (one-to-many)
Network DBMS (many-to-many)
VM OS by IBM → multiple VMs on a single physical node.
1970’s - 1980’s
Relational DBMS (tables) and SQL
ACID
Client-server computing
Parallel processing
1990’s - 2000’s
Virtual Private Network (VPN) connections
The Internet...
2000’s - Now
Cloud computing
NoSQL: BASE instead of ACID
Big Data
How To Store and Process Big Data?
Scale Up vs. Scale Out (1/2)
Scale up or scale vertically: adding resources to a single node in a system.
Scale out or scale horizontally: adding more nodes to a system.
Scale Up vs. Scale Out (2/2)
Scale up: more expensive than scaling out.
Scale out: more challenging for fault tolerance and software development.
Taxonomy of Parallel Architectures
DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. ACM Communications, 35(6), 85-98, 1992.
Two Main Types of Tools
Data store
Data processing
Data Store
How to store and access files? File System
What is a Filesystem?
Controls how data is stored on and retrieved from disk.
Distributed Filesystems
When data outgrows the storage capacity of a single machine.
Partition data across a number of separate machines.
Distributed filesystems: manage the storage across a network of machines.
HDFS
Hadoop Distributed FileSystem
Appears as a single disk
Runs on top of a native filesystem, e.g., ext3
Files and Blocks (1/2)
Files are split into blocks.
Blocks are the unit of storage.
• Transparent to the user.
• Typically 64MB or 128MB.
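Splitting a file into fixed-size blocks is just ceiling division; a minimal sketch (the 300 MB example file and the helper name are illustrative, not HDFS code):

```java
// Illustrative sketch (not HDFS code): how a file maps onto fixed-size blocks.
public class BlockCount {
    // Ceiling division: a file of n bytes occupies ceil(n / blockSize) blocks.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with a 128 MB block size spans 3 blocks; the last block
        // holds only the remaining 44 MB, since blocks are not padded.
        System.out.println(numBlocks(300 * mb, 128 * mb)); // 3
    }
}
```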
Files and Blocks (2/2)
Each block is replicated on multiple machines (default replication factor: 3).
HDFS Write
1. Create a new file in the Namenode’s namespace; compute the block topology.
2, 3, 4. Stream the data to the first, second, and third datanodes.
5, 6, 7. Success/failure acknowledgments.
HDFS Read
1. Retrieve block locations.
2, 3. Read blocks to re-assemble the file.
What About Databases?
Database and Database Management System
Database: an organized collection of data.
Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.
Relational Database Management Systems (RDBMSs)
RDBMSs: the dominant technology for storing structured data in web and business applications.
SQL is good
• Rich language and toolset
• Easy to use and integrate
• Many vendors
They promise: ACID
ACID Properties
Atomicity: each transaction is all or nothing.
Consistency: every transaction brings the database from one valid state to another.
Isolation: concurrent transactions do not interfere with each other.
Durability: once committed, a transaction’s effects survive failures.
RDBMS Challenges
Web-based applications caused spikes in load.
• Internet-scale data sizes
• High read-write rates
• Frequent schema changes
Scaling RDBMSs is Expensive and Inefficient
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
NoSQL
Avoidance of unneeded complexity
High throughput
Horizontal scalability and running on commodity hardware
NoSQL Cost and Performance
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
RDBMS vs. NoSQL
[http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
NoSQL Data Models
[http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]
NoSQL Data Models: Key-Value
A collection of key/value pairs.
Ordered key-value stores: support processing over key ranges.
Dynamo, Scalaris, Voldemort, Riak, ...
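The ordered variant can be sketched with a sorted in-memory map (the user:NNN keys and scores below are made up; real stores such as Riak or Voldemort shard the pairs across many nodes):

```java
import java.util.TreeMap;

// Sketch of the ordered key-value model: a sorted map supports cheap range
// scans over keys, which a plain hash-based key-value store cannot offer.
public class OrderedKV {
    public static void main(String[] args) {
        TreeMap<String, Integer> store = new TreeMap<>();
        store.put("user:001", 10);
        store.put("user:002", 25);
        store.put("user:105", 7);
        // Range query over the key interval [user:001, user:100).
        System.out.println(store.subMap("user:001", "user:100"));
        // {user:001=10, user:002=25}
    }
}
```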
NoSQL Data Models: Column-Oriented
Similar to a key/value store, but the value can have multiple attributes (columns).
Column: a set of data values of a particular type.
BigTable, HBase, Cassandra, ...
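The model can be pictured as a map from row keys to column maps; a hedged sketch with made-up rows, not the HBase API:

```java
import java.util.Map;

// Conceptual sketch of the column-oriented model (illustrative data, not the
// HBase API): each row key maps to a set of named columns, and different rows
// may carry different columns.
public class ColumnStoreSketch {
    public static void main(String[] args) {
        Map<String, Map<String, String>> table = Map.of(
            "row1", Map.of("name", "Bob", "city", "Oslo"),
            "row2", Map.of("name", "Alice", "email", "alice@example.org"));
        // row1 has no email column; row2 has no city column.
        System.out.println(table.get("row2").get("email")); // alice@example.org
    }
}
```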
NoSQL Data Models: Document-Based
Similar to a column-oriented store, but the values are semi-structured documents, e.g., XML, YAML, JSON, or BSON.
CouchDB, MongoDB, ...
{
  FirstName: "Bob",
  Address: "5 Oak St.",
  Hobby: "sailing"
}
{
  FirstName: "Jonathan",
  Address: "15 Wanamassa Point Road",
  Children: [
    {Name: "Michael", Age: 10},
    {Name: "Jennifer", Age: 8}
  ]
}
NoSQL Data Models: Graph-Based
Uses graph structures with nodes, edges, and properties to represent and store data.
Neo4J, InfoGrid, ...
Data Processing
Challenges
How to distribute computation?
How can we make it easy to write distributed programs?
How to handle machine failures?
Idea
Issue:
• Copying data over a network takes time.
Idea:
• Bring computation close to the data.
• Store files multiple times for reliability.
MapReduce
A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
Simplicity
Don’t worry about parallelization, fault tolerance, data distribution, and load balancing; MapReduce takes care of these.
It hides system-level details from programmers.
Simplicity!
Warm-up Task (1/2)
We have a huge text document.
Count the number of times each distinct word appears in the file.
Warm-up Task (2/2)
The file is too large for memory, but all (word, count) pairs fit in memory.
words(doc.txt) | sort | uniq -c
• where words takes a file and outputs the words in it, one per line
This captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
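The same pipeline can be sketched in plain Java, assuming the whole document fits in memory (the input string below stands in for doc.txt; the helper name is made up):

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory equivalent of: words(doc.txt) | sort | uniq -c
public class WordCountLocal {
    static Map<String, Integer> count(String doc) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like | sort
        for (String word : doc.split("\\s+"))
            counts.merge(word, 1, Integer::sum);       // like | uniq -c
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Hello World Bye World Hello Hadoop Goodbye Hadoop"));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```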
MapReduce Overview
words(doc.txt) | sort | uniq -c
Sequentially read a lot of data.
Map: extract something you care about.
Group by key: sort and shuffle.
Reduce: aggregate, summarize, filter or transform.
Write the result.
Example: Word Count
Consider doing a word count of the following file using MapReduce:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
Example: Word Count - map
The map function reads words one at a time and outputs (word, 1) for each parsed input word.
The map function output is:
(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)
(Hello, 1)
(Hadoop, 1)
(Goodbye, 1)
(Hadoop, 1)
Example: Word Count - shuffle
The shuffle phase between the map and reduce phases creates a list of values associated with each key.
The reduce function input is:
(Bye, (1))
(Goodbye, (1))
(Hadoop, (1, 1))
(Hello, (1, 1))
(World, (1, 1))
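The grouping step can be sketched in plain Java (the helper name group is made up; a real shuffle also partitions the pairs across reducers over the network):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the shuffle: group the map output by key, collecting the values
// emitted for each key into a list (pairs copied from the word-count example).
public class ShuffleSketch {
    static Map<String, List<Integer>> group(List<? extends Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // sorted by key
        for (Map.Entry<String, Integer> kv : mapOutput)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            new SimpleEntry<>("Hello", 1), new SimpleEntry<>("World", 1),
            new SimpleEntry<>("Bye", 1), new SimpleEntry<>("World", 1),
            new SimpleEntry<>("Hello", 1), new SimpleEntry<>("Hadoop", 1),
            new SimpleEntry<>("Goodbye", 1), new SimpleEntry<>("Hadoop", 1));
        System.out.println(group(mapOutput));
        // {Bye=[1], Goodbye=[1], Hadoop=[1, 1], Hello=[1, 1], World=[1, 1]}
    }
}
```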
Example: Word Count - reduce
The reduce function sums the numbers in the list for each key and outputs (word, count) pairs.
The output of the reduce function is the output of the MapReduce job:
(Bye, 1)
(Goodbye, 1)
(Hadoop, 2)
(Hello, 2)
(World, 2)
Example: Word Count - map
public static class MyMap extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one); // emit (word, 1) for every token
    }
  }
}
Example: Word Count - reduce
public static class MyReduce extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
  public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) // sum the 1s emitted for this word
      sum += value.get();
    context.write(key, new IntWritable(sum));
  }
}
Example: Word Count - driver
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MyMap.class);
job.setReducerClass(MyReduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
MapReduce Execution
J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, ACM Communications 51(1), 2008.
MapReduce Weaknesses
The MapReduce programming model was not designed for complex operations, e.g., data mining.
Very expensive: intermediate results always go to disk and HDFS.
Solution?
Spark
Extends MapReduce with more operators.
Support for advanced data flow graphs.
In-memory and out-of-core processing.
Spark vs. Hadoop
Resilient Distributed Datasets (RDD)
Immutable collections of objects spread across a cluster.
An RDD is divided into a number of partitions.
Partitions of an RDD can be stored on different nodes of a cluster.
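The partitioning idea can be sketched without Spark (this is a conceptual model only, not Spark's API; the chunking rule is an assumption for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch (not the Spark API): model an RDD as an immutable
// collection split into partitions that a scheduler could place on nodes.
public class MiniRdd {
    // Split data into numPartitions chunks of (near-)equal size.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < data.size(); i += chunk)
            parts.add(List.copyOf(data.subList(i, Math.min(i + chunk, data.size()))));
        return parts; // each inner list is immutable, like an RDD partition
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1, 2, 3, 4, 5), 2)); // [[1, 2, 3], [4, 5]]
    }
}
```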
What About Streaming Data?
Motivation
Many applications must process large streams of live data and provide results in real-time.
Processing information as it flows, without storing it persistently.
Traditional DBMSs:
• Store and index data before processing it.
• Process data only when explicitly asked by the users.
DBMS vs. DSMS (1/3)
DBMS: persistent data where updates are relatively infrequent.
DSMS: transient data that is continuously updated.
DBMS vs. DSMS (2/3)
DBMS: runs queries just once to return a complete answer.
DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
DBMS vs. DSMS (3/3)
Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins.
DSMS
Source: produces the incoming information flows
Sink: consumes the results of processing
IFP engine: processes incoming flows
Processing rules: how to process the incoming flows
Rule manager: adds/removes processing rules
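Putting these pieces together, a standing query can be sketched in a few lines (illustrative only, not a real IFP engine; the sensor readings and names are made up):

```java
import java.util.List;

// Minimal sketch of a standing query: the processing rule runs once per
// arriving event, keeping the answer updated without ever storing the
// stream persistently.
public class StandingQuery {
    static int runningMax = Integer.MIN_VALUE; // the continuously updated answer

    // IFP engine: apply the processing rule to each incoming tuple.
    static void onEvent(int reading) {
        if (reading > runningMax)
            runningMax = reading;
    }

    public static void main(String[] args) {
        for (int reading : List.of(3, 9, 4, 12, 7)) // source: the incoming flow
            onEvent(reading);
        System.out.println(runningMax); // sink reads the current answer: 12
    }
}
```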
What About Graph Data?
Large Graph
Large-Scale Graph Processing
Large graphs need large-scale processing.
A large graph either cannot fit into the memory of a single computer, or fits only at great cost.
Data-Parallel Model for Large-Scale Graph Processing
The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems.
Why?
Graph Algorithms Characteristics
Unstructured problems: difficult to partition the data
Data-driven computations: difficult to partition computation
Poor data locality
High data access to computation ratio
Proposed Solution
Graph-Parallel Processing
Computation typically depends on the neighbors.
Graph-Parallel Processing
Restricts the types of computation.
New techniques to partition and distribute graphs.
Exploits graph structure.
Executes graph algorithms orders of magnitude faster than more general data-parallel systems.
Data-Parallel vs. Graph-Parallel Computation
Vertex-Centric Programing
Think like a vertex.
Each vertex computes its value individually, in parallel.
Each vertex sees its local context and updates its value accordingly.
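As a concrete sketch of the model (an assumed example, not an actual Pregel/Giraph program), max-value propagation has every vertex repeatedly adopt the largest value among its neighbors:

```java
import java.util.List;
import java.util.Map;

// "Think like a vertex" sketch: in each superstep every vertex looks only at
// its neighbors' values and takes their maximum; iteration stops when no
// vertex changes. In a connected graph all vertices converge to the max.
public class VertexCentricMax {
    static int[] propagateMax(Map<Integer, List<Integer>> neighbors, int[] init) {
        int[] value = init.clone();
        boolean changed = true;
        while (changed) {                    // one loop iteration = one superstep
            changed = false;
            int[] next = value.clone();
            for (int v : neighbors.keySet()) // conceptually, vertices run in parallel
                for (int u : neighbors.get(v))
                    if (value[u] > next[v]) { next[v] = value[u]; changed = true; }
            value = next;
        }
        return value;
    }

    public static void main(String[] args) {
        // Example graph: triangle 0-1-2 plus a tail vertex 3 attached to 2.
        Map<Integer, List<Integer>> neighbors = Map.of(
            0, List.of(1, 2), 1, List.of(0, 2), 2, List.of(0, 1, 3), 3, List.of(2));
        int[] result = propagateMax(neighbors, new int[]{5, 1, 9, 2});
        System.out.println(result[3]); // every vertex converges to the global max: 9
    }
}
```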
Summary
Scale-out vs. Scale-up
How to store data?
• Distributed file systems: HDFS
• NoSQL databases: HBase, Cassandra, ...
How to process data?
• Batch data: MapReduce, Spark
• Streaming data: Spark Streaming, Flink, Storm, S4
• Graph data: Giraph, GraphLab, GraphX, Flink
Questions?

ABCD's of WAN OptimizationABCD's of WAN Optimization
ABCD's of WAN Optimization
 
What is expected from Chief Cloud Officers?
What is expected from Chief Cloud Officers?What is expected from Chief Cloud Officers?
What is expected from Chief Cloud Officers?
 
system analysis and design Chap011
 system analysis and design  Chap011 system analysis and design  Chap011
system analysis and design Chap011
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
Dr Training V1 07 17 09 Rev Four 4
 Dr Training V1 07 17 09 Rev Four 4 Dr Training V1 07 17 09 Rev Four 4
Dr Training V1 07 17 09 Rev Four 4
 
Business Track Session 1: The Power of udp
Business Track Session 1: The Power of udpBusiness Track Session 1: The Power of udp
Business Track Session 1: The Power of udp
 
Presentation riverbed steelhead appliance main 2010
Presentation   riverbed steelhead appliance main 2010Presentation   riverbed steelhead appliance main 2010
Presentation riverbed steelhead appliance main 2010
 
Big Data for Security - DNS Analytics
Big Data for Security - DNS AnalyticsBig Data for Security - DNS Analytics
Big Data for Security - DNS Analytics
 
Mainframe Computers
Mainframe ComputersMainframe Computers
Mainframe Computers
 
Privacy Preserving Elastic Stream Processing with Clouds using Homomorphic En...
Privacy Preserving Elastic Stream Processing with Clouds using Homomorphic En...Privacy Preserving Elastic Stream Processing with Clouds using Homomorphic En...
Privacy Preserving Elastic Stream Processing with Clouds using Homomorphic En...
 
Bigtable
BigtableBigtable
Bigtable
 
RFP-Final3
RFP-Final3RFP-Final3
RFP-Final3
 
New high-density storage server - IBM System x3650 M4 HD
New high-density storage server - IBM System x3650 M4 HDNew high-density storage server - IBM System x3650 M4 HD
New high-density storage server - IBM System x3650 M4 HD
 
Cloud computing aenc - final
Cloud computing   aenc - finalCloud computing   aenc - final
Cloud computing aenc - final
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
CSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmondCSIT 534 Presentation Cherri_edmond
CSIT 534 Presentation Cherri_edmond
 

More from Amir Payberah

The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformAmir Payberah
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2Amir Payberah
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Amir Payberah
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Amir Payberah
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXAmir Payberah
 

More from Amir Payberah (10)

The Spark Big Data Analytics Platform
The Spark Big Data Analytics PlatformThe Spark Big Data Analytics Platform
The Spark Big Data Analytics Platform
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
File System Implementation - Part2
File System Implementation - Part2File System Implementation - Part2
File System Implementation - Part2
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 
Threads
ThreadsThreads
Threads
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphX
 

Data Intensive Computing Frameworks

  • 1. Data Intensive Computing Frameworks Amir H. Payberah amir@sics.se Amirkabir University of Technology 1394/2/25 Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95
  • 2. Big Data (figure: small data vs. big data)
  • 3. Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand them.
  • 4. The Four Dimensions of Big Data: Volume (data size), Velocity (data generation rate), Variety (data heterogeneity). The 4th V vacillates between Veracity, Variability, and Value.
  • 5. Where Does Big Data Come From?
  • 6. Big Data Market Driving Factors: The number of web pages indexed by Google grew from around one million in 1998 to more than one trillion in 2008, and the rise of social networks has accelerated this expansion. ∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]
  • 7. Big Data Market Driving Factors: Mobile data traffic is expected to grow to 10.8 exabytes per month by 2016. ∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]
  • 8. Big Data Market Driving Factors: More than 65 billion devices were connected to the Internet by 2010, and this number is projected to reach 230 billion by 2020. ∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]
  • 9. Big Data Market Driving Factors: Many companies are moving towards Cloud services to access Big Data analytical tools.
  • 10. Big Data Market Driving Factors: Open source communities.
  • 11. How To Store and Process Big Data?
  • 12. But First, The History
  • 13. 4000 B.C.: Manual recording, from tablets to papyrus, to parchment, and then to paper.
  • 14. 1450: Gutenberg’s printing press.
  • 15. 1800s-1940s: Punched cards (no fault-tolerance); binary data. 1890: the US census. 1911: IBM appeared.
  • 16. 1940s-1950s: Magnetic tapes.
  • 17. 1950s-1960s: Large-scale mainframe computers; batch transaction processing; a file-oriented record processing model (e.g., COBOL).
  • 18. 1960s-1970s: Hierarchical DBMSs (one-to-many); network DBMSs (many-to-many); IBM’s VM OS → multiple VMs on a single physical node.
  • 19. 1970s-1980s: Relational DBMSs (tables) and SQL; ACID; client-server computing; parallel processing.
  • 20. 1990s-2000s: Virtual Private Network connections (VPNs); the Internet...
  • 21. 2000s-now: Cloud computing; NoSQL (BASE instead of ACID); Big Data.
  • 22. How To Store and Process Big Data?
  • 23. Scale Up vs. Scale Out (1/2): Scale up, or scale vertically: adding resources to a single node in a system. Scale out, or scale horizontally: adding more nodes to a system.
  • 24. Scale Up vs. Scale Out (2/2): Scaling up is more expensive than scaling out; scaling out is more challenging for fault tolerance and software development.
  • 25. Taxonomy of Parallel Architectures. DeWitt, D. and Gray, J., “Parallel database systems: the future of high performance database systems”, ACM Communications, 35(6), 85-98, 1992.
  • 27. Two Main Types of Tools: data store and data processing.
  • 28. Data Store
  • 29. Data Store: How to store and access files? File System.
  • 30-31. What is a Filesystem? It controls how data is stored on and retrieved from disk.
  • 32-35. Distributed Filesystems: When data outgrows the storage capacity of a single machine, partition it across a number of separate machines. Distributed filesystems manage the storage across a network of machines.
  • 36. HDFS (1/2): Hadoop Distributed FileSystem. It appears as a single disk and runs on top of a native filesystem, e.g., ext3.
  • 37. HDFS (2/2) (figure)
  • 38. Files and Blocks (1/2): Files are split into blocks, the single unit of storage. Blocks are transparent to the user and are 64MB or 128MB.
  • 39. Files and Blocks (2/2): The same block is replicated on multiple machines; the default replication factor is 3.
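The block and replication numbers above can be sanity-checked with back-of-the-envelope arithmetic. A small sketch in plain Java (not HDFS code; the 128MB block size and replication factor 3 are taken from the slides):

```java
// Toy arithmetic, not HDFS internals: how many blocks a file occupies,
// and how much raw cluster storage its replicas consume.
public class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks = ceil(fileSize / blockSize); the last block may be partial.
    static long numBlocks(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Raw bytes stored across the cluster with the given replication factor.
    static long rawStored(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long file = 300 * MB, block = 128 * MB;
        System.out.println(numBlocks(file, block));   // 3 blocks: 128 + 128 + 44 MB
        System.out.println(rawStored(file, 3) / MB);  // 900 MB with replication factor 3
    }
}
```

So a 300MB file occupies three 128MB blocks (the last only partially filled) and, with the default replication of 3, consumes 900MB of raw storage across the cluster.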
  • 40. HDFS Write: 1. Create a new file in the Namenode’s namespace and calculate the block topology. 2, 3, 4. Stream the data to the first, second, and third node. 5, 6, 7. Success/failure acknowledgments.
  • 41. HDFS Read: 1. Retrieve the block locations. 2, 3. Read the blocks to re-assemble the file.
  • 42. What About Databases?
  • 43-44. Database and Database Management System: A database is an organized collection of data. A Database Management System (DBMS) is software that interacts with users, other applications, and the database itself to capture and analyze data.
  • 45-46. Relational Database Management Systems (RDBMSs): the dominant technology for storing structured data in web and business applications. SQL is good: a rich language and toolset, easy to use and integrate, many vendors. RDBMSs promise: ACID.
  • 47. ACID Properties: Atomicity, Consistency, Isolation, Durability.
  • 48. RDBMS Challenges: Web-based applications caused spikes: Internet-scale data sizes, high read-write rates, and frequent schema changes.
  • 49. Scaling RDBMSs is Expensive and Inefficient. [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  • 51. NoSQL: Avoidance of unneeded complexity; high throughput; horizontal scalability on commodity hardware.
  • 52. NoSQL Cost and Performance. [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  • 55. NoSQL Data Models: Key-Value: A collection of key/value pairs. Ordered key-value stores allow processing over key ranges. Dynamo, Scalaris, Voldemort, Riak, ...
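The "ordered key-value" idea can be illustrated with a toy in-memory stand-in (plain Java's TreeMap, not the API of Dynamo, Voldemort, or Riak): keeping keys sorted is exactly what makes processing over key ranges a cheap scan.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration of an ordered key-value store: sorted keys make
// range queries ("all keys with prefix user:") a contiguous scan.
public class OrderedKVStore {
    private final NavigableMap<String, String> store = new TreeMap<>();

    void put(String key, String value) { store.put(key, value); }

    // Range scan over keys in [from, to) -- the "ordered" part of ordered key-value.
    SortedMap<String, String> scan(String from, String to) {
        return store.subMap(from, to);
    }

    public static void main(String[] args) {
        OrderedKVStore kv = new OrderedKVStore();
        kv.put("user:001", "Alice");
        kv.put("user:002", "Bob");
        kv.put("user:003", "Carol");
        kv.put("item:042", "Widget");
        // Only the user:* keys fall in the requested range.
        System.out.println(kv.scan("user:", "user:zzz").keySet());  // [user:001, user:002, user:003]
    }
}
```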
  • 56. NoSQL Data Models: Column-Oriented: Similar to a key/value store, but the value can have multiple attributes (columns). A column is a set of data values of a particular type. BigTable, HBase, Cassandra, ...
  • 57. NoSQL Data Models: Document-Based: Similar to a column-oriented store, but the values can be complex documents, e.g., XML, YAML, JSON, and BSON. CouchDB, MongoDB, ... Example documents: { FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" } and { FirstName: "Jonathan", Address: "15 Wanamassa Point Road", Children: [ {Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8} ] }
  • 58. NoSQL Data Models: Graph-Based: Uses graph structures with nodes, edges, and properties to represent and store data. Neo4J, InfoGrid, ...
  • 59. Data Processing
  • 60. Challenges: How to distribute computation? How can we make it easy to write distributed programs? Machine failures.
  • 61-62. Idea: Issue: copying data over a network takes time. Idea: bring the computation close to the data, and store files multiple times for reliability.
  • 63. MapReduce: A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
  • 64. Simplicity: Don’t worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these). It hides system-level details from programmers. Simplicity!
  • 65. Warm-up Task (1/2): We have a huge text document. Count the number of times each distinct word appears in the file.
  • 66-68. Warm-up Task (2/2): The file is too large for memory, but all (word, count) pairs fit in memory. words(doc.txt) | sort | uniq -c, where words takes a file and outputs the words in it, one per line. This captures the essence of MapReduce, and the great thing is that it is naturally parallelizable.
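The words | sort | uniq -c pipeline can be mirrored in a few lines of plain Java (a sketch of the same group-by-word-and-count logic on the slide's sample text, not Hadoop code):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory equivalent of `words(doc.txt) | sort | uniq -c`:
// split the text into words, group equal words, and count each group.
public class WordCount {
    static Map<String, Long> count(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     // TreeMap keeps the keys sorted, like `sort` in the pipeline.
                     .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("Hello World Bye World Hello Hadoop Goodbye Hadoop"));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

The split-then-group structure is the point: the split ("map") and per-group counting ("reduce") phases are each independent, which is why the computation parallelizes naturally.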
  • 69-74. MapReduce Overview: words(doc.txt) | sort | uniq -c. Sequentially read a lot of data. Map: extract something you care about. Group by key: sort and shuffle. Reduce: aggregate, summarize, filter, or transform. Write the result.
  • 75. Example: Word Count: Consider doing a word count of the following file using MapReduce: Hello World Bye World Hello Hadoop Goodbye Hadoop
  • 76. Example: Word Count - map: The map function reads in words one at a time and outputs (word, 1) for each parsed input word. The map function output is: (Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
  • 77. Example: Word Count - shuffle: The shuffle phase between the map and reduce phases creates a list of values associated with each key. The reduce function input is: (Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))
  • 78. Example: Word Count - reduce: The reduce function sums the numbers in the list for each key and outputs (word, count) pairs. The output of the reduce function, and of the MapReduce job, is: (Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
  • 79. Example: Word Count - map:

    public static class MyMap extends Mapper<...> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }
  • 80. Example: Word Count - reduce:

    public static class MyReduce extends Reducer<...> {
      public void reduce(Text key, Iterator<...> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        while (values.hasNext())
          sum += values.next().get();
        context.write(key, new IntWritable(sum));
      }
    }
  • 81. Example: Word Count - driver:

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount");
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(MyMap.class);
      job.setReducerClass(MyReduce.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.waitForCompletion(true);
    }
  • 82. MapReduce Execution (figure). J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, ACM Communications 51(1), 2008.
  • 83-84. MapReduce Weaknesses: The MapReduce programming model was not designed for complex operations, e.g., data mining. It is also very expensive, i.e., it always goes to disk and HDFS.
  • 85. Solution?
  • 86. Spark: Extends MapReduce with more operators; supports advanced data flow graphs; offers in-memory and out-of-core processing.
  • 87-90. Spark vs. Hadoop (figure slides)
  • 91. Resilient Distributed Datasets (RDD): Immutable collections of objects spread across a cluster. An RDD is divided into a number of partitions, and the partitions of an RDD can be stored on different nodes of the cluster.
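The RDD idea can be sketched in plain Java. This is deliberately NOT Spark's API: ToyRDD, parallelize, and the round-robin partition layout are illustrative assumptions. What it shows is the essence of the slide: an immutable set of partitions, where a transformation builds a new dataset and each partition can be processed independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy illustration of the RDD idea: an immutable list of partitions;
// map() builds a NEW dataset instead of mutating the old one.
public class ToyRDD<T> {
    final List<List<T>> partitions;  // each inner list could live on a different node

    ToyRDD(List<List<T>> partitions) { this.partitions = partitions; }

    // Split a local collection into numPartitions round-robin partitions.
    static <T> ToyRDD<T> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < data.size(); i++) parts.get(i % numPartitions).add(data.get(i));
        return new ToyRDD<>(parts);
    }

    // Each partition is transformed independently (here: in parallel threads,
    // standing in for different cluster nodes); the original RDD is untouched.
    <R> ToyRDD<R> map(Function<T, R> f) {
        List<List<R>> out = partitions.parallelStream()
            .map(p -> p.stream().map(f).collect(Collectors.toList()))
            .collect(Collectors.toList());
        return new ToyRDD<>(out);
    }

    // Gather all partitions back into one local list.
    List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> p : partitions) all.addAll(p);
        return all;
    }

    public static void main(String[] args) {
        ToyRDD<Integer> rdd = parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
        System.out.println(rdd.map(x -> x * x).collect());
    }
}
```

Immutability is the design point: because a transformation never changes its input, a lost partition can always be recomputed from its parent, which is how Spark gets fault tolerance without replicating intermediate data.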
  • 92. What About Streaming Data?
  • 93-95. Motivation: Many applications must process large streams of live data and provide results in real-time, processing information as it flows, without storing it persistently. Traditional DBMSs, in contrast, store and index data before processing it, and process data only when explicitly asked by the users.
  • 96. DBMS vs. DSMS (1/3): A DBMS handles persistent data where updates are relatively infrequent; a DSMS handles transient data that is continuously updated.
  • 97. DBMS vs. DSMS (2/3): A DBMS runs queries just once to return a complete answer; a DSMS executes standing queries, which run continuously and provide updated answers as new data arrives.
  • 98. DBMS vs. DSMS (3/3): Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins.
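A standing query can be thought of as state that is updated on every arrival rather than a query re-run on demand. A minimal sketch (a hypothetical sliding-window sum in plain Java, not tied to any particular DSMS engine) that returns a fresh answer with each incoming event:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a standing query: a sum over a sliding window of the last
// N events, updated incrementally on every arrival.
public class SlidingWindowSum {
    private final int windowSize;
    private final Deque<Integer> window = new ArrayDeque<>();
    private long sum = 0;

    SlidingWindowSum(int windowSize) { this.windowSize = windowSize; }

    // Called once per incoming event; returns the updated answer immediately,
    // without storing the stream persistently or re-scanning it.
    long onEvent(int value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) sum -= window.removeFirst();  // expire the oldest event
        return sum;
    }

    public static void main(String[] args) {
        SlidingWindowSum query = new SlidingWindowSum(3);
        for (int v : new int[]{1, 2, 3, 4, 5}) {
            System.out.println(query.onEvent(v));  // 1, 3, 6, 9, 12
        }
    }
}
```

Contrast with the DBMS model: there, the five events would first be stored and indexed, and the sum computed only when a user asked for it.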
  • 99. DSMS: Source: produces the incoming information flows. Sink: consumes the results of processing. IFP engine: processes the incoming flows. Processing rules: how to process the incoming flows. Rule manager: adds/removes processing rules.
  • 101. What About Graph Data?
  • 103. Large Graph (figure)
  • 104. Large-Scale Graph Processing: Large graphs need large-scale processing. A large graph either cannot fit into the memory of a single computer or fits only at huge cost.
  • 105. Data-Parallel Model for Large-Scale Graph Processing: The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems. Why?
  • 106. Graph Algorithms Characteristics: Unstructured problems: difficult to partition the data. Data-driven computations: difficult to partition the computation. Poor data locality. High data-access-to-computation ratio.
  • 107. Proposed Solution: Graph-Parallel Processing: Computation typically depends on the neighbors.
  • 108. Graph-Parallel Processing: Restricts the types of computation; uses new techniques to partition and distribute graphs; exploits the graph structure; executes graph algorithms orders of magnitude faster than more general data-parallel systems.
  • 109. Data-Parallel vs. Graph-Parallel Computation (figure)
  • 110. Vertex-Centric Programming: Think as a vertex. Each vertex computes its value individually, in parallel. Each vertex can see its local context and updates its value accordingly.
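The "think as a vertex" model can be sketched with label propagation for connected components (a simplified, single-machine sketch, not the Pregel or GraphX API): each vertex looks only at its neighbors' labels and adopts the smallest one, and the iteration stops when no label changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy vertex-centric iteration: connected components via min-label propagation.
public class VertexCentricCC {
    static int[] components(int n, int[][] edges) {
        // Build an undirected adjacency list.
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }

        int[] label = new int[n];
        for (int i = 0; i < n; i++) label[i] = i;  // each vertex starts with its own id

        boolean changed = true;
        while (changed) {                           // one "superstep" per loop
            changed = false;
            for (int v = 0; v < n; v++) {
                for (int u : adj.get(v)) {          // each vertex sees only its local context
                    if (label[u] < label[v]) { label[v] = label[u]; changed = true; }
                }
            }
        }
        return label;  // vertices sharing a label are in the same component
    }

    public static void main(String[] args) {
        // Two components: {0, 1, 2} and {3, 4}.
        int[] cc = components(5, new int[][]{{0, 1}, {1, 2}, {3, 4}});
        System.out.println(Arrays.toString(cc));  // [0, 0, 0, 3, 3]
    }
}
```

In a real graph-parallel system each vertex's update would run in parallel across partitions, with neighbor labels exchanged as messages between supersteps; the per-vertex logic, however, is exactly the inner loop above.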
  • 112-115. Summary: Scale-out vs. scale-up. How to store data? Distributed file systems (HDFS) and NoSQL databases (HBase, Cassandra, ...). How to process data? Batch data: MapReduce, Spark. Streaming data: Spark Streaming, Flink, Storm, S4. Graph data: Giraph, GraphLab, GraphX, Flink.
  • 116. Questions?