Slides from my presentation at the 9th Amirkabir Linux & Open-Source Software Festival, about Big Data computing platforms, the rise of the so-called "Fast Data" phenomenon, and the architectures and state-of-the-art platforms for dealing with it.
"Big Data" is big business, but what does it really mean? How will big data impact industries and consumers? This slide deck goes through some of the high level details of the market and how it is revolutionizing the world.
Introduction to Big Data Technologies & ApplicationsNguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
To share the experience of INEGI in the use of Twitter as a big data source.
Initial objective of INEGI’s Big Data Project: To generate experimental indicators using Big Data techniques with social media data, to complement statistical information obtained from traditional methods and sources.
Initial Goal: To obtain indicators of subjective wellbeing from social media data sources.
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
Knowledge graphs (KGs) have recently emerged as a powerful way to represent knowledge in multiple communities, including data mining, natural language processing and machine learning. Large-scale KGs like Wikidata and DBpedia are openly available, while in industry, the Google Knowledge Graph is a good example of proprietary knowledge that continues to fuel impressive advances in Google's semantic search capabilities. Yet, both crowdsourced and automatically constructed KGs suffer from noise, both during KG construction and during search and inference. In this talk, I will discuss how to build and use such knowledge graphs effectively, despite the noise and sparsity of labeled data, to solve real-world social problems such as providing insights in disaster situations, and helping law enforcement fight human trafficking. I will conclude by providing insight on the lessons learned, and the applicability of research techniques to industrial problems. The talk will be designed to appeal both to business and technical leaders.
"Big Data" is big business, but what does it really mean? How will big data impact industries and consumers? This slide deck goes through some of the high level details of the market and how it is revolutionizing the world.
Introduction to Big Data Technologies & ApplicationsNguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
To share the experience of INEGI in the use of Twitter as a big data source.
Initial objective of INEGI’s Big Data Project: To generate experimental indicators using Big Data techniques with social media data, to complement statistical information obtained from traditional methods and sources.
Initial Goal: To obtain indicators of subjective wellbeing from social media data sources.
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
Knowledge graphs (KGs) have recently emerged as a powerful way to represent knowledge in multiple communities, including data mining, natural language processing and machine learning. Large-scale KGs like Wikidata and DBpedia are openly available, while in industry, the Google Knowledge Graph is a good example of proprietary knowledge that continues to fuel impressive advances in Google's semantic search capabilities. Yet, both crowdsourced and automatically constructed KGs suffer from noise, both during KG construction and during search and inference. In this talk, I will discuss how to build and use such knowledge graphs effectively, despite the noise and sparsity of labeled data, to solve real-world social problems such as providing insights in disaster situations, and helping law enforcement fight human trafficking. I will conclude by providing insight on the lessons learned, and the applicability of research techniques to industrial problems. The talk will be designed to appeal both to business and technical leaders.
1. From Big Data to Fast Data
Sina Sheikholeslami
s.sheikholeslami@digikala.com
9th Amirkabir Linux Festival
May 4, 2017
2. Overview
• The War on Big Data Definition
• The Early Days
• State-of-the-art Big Data Processing Platforms
• The Rise of Fast Data: Applications & Platforms
• How to Get Involved
4. What is Big Data?
• “Big Data… everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”
- Dan Ariely
5. What is Big Data? (Cont’d)
• Big Data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
- Oxford English Dictionary (since 2013)
6. What is Big Data? (Cont’d)
• Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
- Gartner IT Glossary
7. What is Big Data? (Cont’d)
• Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis.
- U.S. National Institute of Standards & Technology
8. What is Big Data? (Cont’d)
- UC Berkeley Data Science Survey, September 2014
10. The Google File System
In SOSP’03, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung published the paper on GFS. Google developed GFS to provide efficient, reliable access to data using large clusters of commodity hardware.
11. Bringing the Computation Near Data: MapReduce
Jeffrey Dean & Sanjay Ghemawat published the MapReduce paper in OSDI’04. It has been cited more than 20,000 times since then.
12. Some say we can divide the human race in two:
Those who have never heard of the “Word Count” example, and those who… well, let’s just say, don’t like it.
13. The Word Count Example
https://wikis.nyu.edu/display/NYUHPC/
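The canonical example: the map phase emits a (word, 1) pair for every word, the framework shuffles the pairs so all values for a key end up together, and the reduce phase sums each group. A minimal pure-Python sketch of the three phases (this is the programming model only, not Hadoop API code):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data fast data", "fast data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 1, 'data': 3, 'fast': 2}
```

In a real cluster each phase runs in parallel across many machines; the point of the model is that only the per-key grouping requires coordination.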
14. The Yellow Elephant
Based on the GFS & MapReduce papers, the team at Yahoo! developed an open-source platform for distributed storage and processing of big datasets. Known as “Apache Nutch” in its early days, the project saw the first release of Apache Hadoop in January 2006.
15. The Hadoop Ecosystem
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
17. And Bigger…
“There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
- Eric Schmidt (then CEO of Google), at the Techonomy Conference in Lake Tahoe, California, August 2010
19. A Classic Batch Processing Architecture
Dean Wampler, “Fast Data Architectures For Streaming Applications”
20. The Big Data Stack
Courtesy of Amir H. Payberah, “Data Intensive Computing Platforms”
21. The Big Data Stack
Resource Management Layer
22. The Big Data Stack
Storage Layer
23. The Big Data Stack
Data Processing Layer
24. Apache Spark
• In-Memory Distributed Processing Platform
• Similar semantics for batch & stream processing
• Initially started by Matei Zaharia at UC Berkeley’s AMPLab in 2009
• Became a top-level Apache project in February 2014
• 11935 forks, 1068 contributors
• Written primarily in Scala, more than 1M lines of code
25. Spark vs. Hadoop MapReduce
Courtesy of Amir H. Payberah, “Data Intensive Computing Platforms”
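The core of the comparison: classic MapReduce materializes intermediate results to disk between stages, while Spark records a chain of transformations and only executes it, in memory, when an action is called. The toy `LazyDataset` class below is a hypothetical illustration of that lazy-evaluation idea in plain Python; it is not Spark's API:

```python
class LazyDataset:
    """Toy sketch of Spark's model: transformations are recorded
    lazily and only executed when an action is called, in a single
    in-memory pass with no intermediate writes between stages."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # the recorded transformation plan

    def map(self, fn):
        # Transformation: returns a new plan, computes nothing yet.
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just extends the plan.
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: run the whole plan now, as one pipelined pass.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

squares = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16, 25]
```

In real Spark the plan is a DAG of RDDs that can additionally be cached and partitioned across a cluster, which is where the speedup over disk-based MapReduce stages comes from.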
28. Apache Flink
• “Open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications”
• Data is processed an event-at-a-time rather than as a series of batches
• Originally named “Stratosphere”, started in 2010 with funding from the DFG
• Became a top-level Apache project in December 2014
• 1598 forks, 309 contributors
• Written primarily in Java, more than 1M lines of code
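The event-at-a-time point can be made concrete with a running word count: instead of waiting for a complete batch, the state is updated and a result emitted for every incoming event. A plain-Python sketch of that semantics (a hypothetical helper, not Flink's DataStream API):

```python
from collections import Counter

def streaming_word_count(events):
    """Process one event at a time, emitting the updated running
    count after each event instead of after a complete batch."""
    counts = Counter()  # the operator's running state
    for word in events:
        counts[word] += 1
        yield (word, counts[word])  # result is available immediately

stream = ["data", "fast", "data", "data"]
print(list(streaming_word_count(stream)))
# [('data', 1), ('fast', 1), ('data', 2), ('data', 3)]
```

A batch job over the same input would produce only the final totals; the streaming version makes every intermediate count available as soon as the event arrives, which is exactly the low-latency property Fast Data architectures are after.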
36. Fast Data: A Definition
“Fast data is the application of big data analytics to smaller data sets in near-real or real-time in order to solve a problem or create business value.”
- TechTarget
37. Looking Back at a Classic Batch Processing Architecture
Dean Wampler, “Fast Data Architectures For Streaming Applications”
38. “Fast Data” Processing Architecture
Dean Wampler, “Fast Data Architectures For Streaming Applications”
41. And to Wrap it Up…
• Big Data History & Platforms
• Big Data vs. Fast Data
• Fast Data Architectures & Platforms
• Getting Involved
42. Attribution
• Thanks to Alekksall, Ddraw, Ibrandify, Yurlick, and Makyzz of freepik.com, for the free pics!
• Thanks to the awesome people at The Apache Foundation. For everything. Including the graphics.