Nicola Ferraro
• Welcome intro
• What is Big Data?
• Some components:
• MapReduce
• Pig
• Hive
• Oozie
• Flume
• Sqoop
• Mahout
• Impala
• More components:
• YARN & MapReduceV2
• NoSQL & HBase
• Solr
• Spark
• Cloudera Manager
• Hue
“There were 5 exabytes of information created between the
dawn of civilization through 2003, but that much information
is now created every 2 days, and the pace is increasing.”
Eric Schmidt, Google, August 2010
“Big Data is a shorthand label that typically means applying
the tools of artificial intelligence, like machine learning, to
vast new troves of data beyond that captured in standard
databases. The new data sources include Web-browsing data
trails, social network communications, sensor data and
surveillance data.”
"A process that has the potential to transform everything”
NY Times, August 2012
“The ‘Big’ there is purely marketing… This is about you
buying big expensive servers and whatnot.”
“Good data is better than big data”
“Big data is bullshit”, 2013
Quote from Harper Reed,
tech guru for Obama's re-election
campaign in 2012
Towards disillusionment…
We have a long history of successes in predictions…
“I predict the internet will […]
catastrophically collapse in 1996”
Robert Metcalfe, Inventor of Ethernet, 1995
And the world will end in 2012…
“Spam will be solved in 2 years”
Bill Gates, 2004
“There is no chance the iPhone will get any
significant market share”
Steve Ballmer, 2007
The communication paths in the “normal world”:
What about this technology ?
It is great !!!
The communication paths in the “big world”:
What about this technology ?
steep learning curve
Sarah, today I’m gonna run some wonderful
Spark applications on my new Big Data cluster
on Amazon EC2!!
Oh, I thought that Amazon was just selling shoes!
One of the first
electronic digital
computers (1946).
It was 180 m² big
A Big Data cluster (today).
We need new languages and
abstractions to develop on top of it.
(more powerful than assembly !)
Need to process:
• 1TB
• 1PB
• 1EB
of data, and extract useful
How ?
Need to handle:
of data and react
in nearly real time.
How ?
Think of sensor data
Need to process:
• Digital Images
• Video Recording
• Free Text
data and extract useful
Traditional systems are not
suitable for data characterized by
• Volume: TB, PB, EB, ZB …
• Velocity: GB/s, TB/s …
• Variety: unstructured or semi-structured
Big Data Systems have been
created for these purposes.
Oracle databases can host tables with more than 200TB of
data (some of them are in Italy).
Suppose you want to run a query like:
select type, count(*)
from events
group by type
How much will you wait ?
Even if you reach 10GB/s of read
speed from disks (with multiple
you will wait more than 5 hours !
(if the instance won’t crash before…)
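The figure above is plain arithmetic: table size divided by read speed. A minimal sketch of that estimate follows, assuming decimal units (1 TB = 1000 GB) and a constant read speed; the class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate of a full table scan.
// Assumptions: decimal units (1 TB = 1000 GB) and a constant read speed.
final class ScanTime {
    static double scanHours(double tableTb, double readGbPerSec) {
        double seconds = tableTb * 1000.0 / readGbPerSec;
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // 200 TB at 10 GB/s is 20,000 seconds of pure reading
        System.out.printf("%.1f hours%n", scanHours(200, 10));
    }
}
```

The estimate only covers reading the data; grouping and counting come on top of it.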
Big Data systems are a collection of software applications
installed on different machines.
Each application can be used as if it was installed on a single
machine.
[Diagram: machines numbered 1, 2, 3, 4, 5, … up to more than 10,000]
Do not try to install all the software by yourself! You’ll
go crazy!
Get a “Platform including Hadoop” in a virtual machine from:
(Many) applications in the VM run in:
“pseudo-distributed mode” For:
Commodity On-premises
Big Data Appliance Cloud Services
In 2003, Google published a paper about a new distributed file
system called Google File System (GFS).
Their largest cluster:
• Was composed of more than 1000 nodes
• Could store more than 100 TB (it was 2003 !)
• Could be accessed by hundreds of concurrent clients
Its main purpose was that of serving the Google search engine.
But… how?
In 2004, Google published another paper about a new batch
processing framework, called MapReduce.
• Was a parallel processing framework
• Was integrated perfectly with GFS
MapReduce was used for updating indexes in the Google
search engine.
In 2005, Doug Cutting (Yahoo!) created the basis of the Big
Data movement: Hadoop.
Originally, Hadoop was composed of:
1. HDFS: the Hadoop Distributed File System, “inspired by”
Google File System (GFS)
2. MapReduce: a parallel processing framework “inspired
by” Google MapReduce
“inspired by” = “a copy of”
… but with an open source license (starting from 2009).
Hadoop was the
name of his son’s toy elephant
A Master/Slave architecture:
• Master: takes care of directories and file block locations
• Slaves: store data blocks (128MB). Replication factor 3.
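To get a feeling for the numbers above, here is a small sketch of how a file maps to blocks and replicas. It only uses the 128MB block size and the replication factor 3 from the slide; the class and its methods are hypothetical helpers, not part of any HDFS API.

```java
// Rough sizing of one file stored in an HDFS-like file system.
// Assumptions: fixed 128 MB blocks and replication factor 3 (the values on the slide).
final class HdfsBlocks {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;
    static final int REPLICATION = 3;

    // Blocks the master has to track for one file (the last block may be partial).
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw bytes stored on the slaves once every block is replicated.
    static long rawBytes(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb) + " blocks, "
            + blockCount(oneGb) * REPLICATION + " block replicas");
    }
}
```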
An HDFS cluster appears logically as a normal POSIX file
system (although it is not fully POSIX compliant):
• Clients are distribution unaware (e.g. shell, Hue)
• Allows creation of:
• Files and directories
• ACL (Users and groups)
• Read/Write/AccessChild permissions
1. Map: data is taken from HDFS and transformed
2. Shuffling: data is split and reorganized among nodes
3. Reduce: data is summarized and written back to HDFS
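The three steps above can be imitated on a single machine with plain Java collections. This is only a sketch of the idea (word count as the job, in-memory lists instead of HDFS, no parallelism); all names are invented for the example and nothing here is Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Shuffle: group all values by key (a cluster would move them between nodes here).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: summarize the values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not", "to be")));
    }
}
```

In a real cluster each phase runs on many machines at once and the shuffle is the step that moves data over the network.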
[Diagram: map tasks on Slave 1, Slave 2 and Slave 3; labels: delegation, shuffling (temp files), data locality, HDFS blocks on different machines]
Words in
The “mapper”:
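import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());  // current token becomes the output key
      context.write(word, one);   // emit (word, 1)
    }
  }
}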
The “reducer”:
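import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();         // add up the ones emitted for this word
    }
    result.set(sum);
    context.write(key, result); // emit (word, total)
  }
}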
The “main” class:
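public class WordCount { // needs imports from org.apache.hadoop.conf, .fs, .io, .mapreduce and .mapreduce.lib
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);            // jar to ship to the cluster
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);             // types of the (word, count) output
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}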
Configuration is usually pre-filled
from an additional xml file
• Can run on more than 10 thousand machines
• Linear scalability (theoretical/commercial feature):
• 100 nodes: 1PB in 2 hours → 200 nodes: 1PB in 1 hour
• 100 nodes: 1PB in 2 hours → 200 nodes: 2PB in 2 hours
• Programming model:
• You can do more than word count (samples follow)
• Complex data pipelines require more than 1 MapReduce job
• Difficult to write programs as MapReduce Jobs (a brand
new way of writing algorithms)
• Difficult to maintain code (and to reverse engineer)
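The linear scalability figures in the list above follow from one formula: time = data / (nodes × throughput per node). A tiny sketch, taking the "100 nodes: 1PB in 2 hours" line as the baseline; the class name is hypothetical and the result is the ideal case, which real clusters only approximate.

```java
// Ideal (linear) scaling estimate.
// Baseline assumption taken from the slide: 100 nodes process 1 PB in 2 hours.
final class LinearScaling {
    static final double PB_PER_NODE_HOUR = 1.0 / (100 * 2.0);

    static double hours(double petabytes, int nodes) {
        return petabytes / (nodes * PB_PER_NODE_HOUR);
    }

    public static void main(String[] args) {
        System.out.println(hours(1, 200) + " hours for 1 PB on 200 nodes");
        System.out.println(hours(2, 200) + " hours for 2 PB on 200 nodes");
    }
}
```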
MapReduce jobs are difficult to write.
Complex pipelines require multiple jobs.
In 2006, people at Yahoo research started working on a new
language to simplify creation of MapReduce jobs.
They created Pig.
Pig Latin is a procedural language.
It is still used by data scientists at Yahoo!
and worldwide.
The word count in Pig:
lines = LOAD '/tmp/input-file' AS (line:chararray);
-- split each line into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- remove whitespace-only words
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups
    GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/output-file';
From this kind of sources,
Pig creates one or more
MapReduce jobs
• Procedural language, easier than MapReduce style
• Can build directed acyclic graphs (DAG) of MapReduce jobs
• One step has same scalability as MapReduce counterpart
• Compatibility with many languages for writing user
defined functions (UDF): Java, Python, Ruby, …
• Thanks to UDFs, you can also process unstructured data
• It’s another language
• Stack Overflow cannot help you in case of bugs!
In 2009, the Facebook Data Infrastructure Team created Hive:
“An open-source data warehousing solution on top of hadoop”
Hive brings SQL to Hadoop (queries translated to MapReduce):
• You can define the structure of Hadoop files (tables, with
columns and data types) and save them in Hive Metastore
• You can query tables with HiveQL, a dialect of SQL.
• Joins only with equality conditions
• Subqueries with limitations
• Limitations depend on the version… Check the docs
An example of Hive query (you have seen it before):
select type, count(*)
from events
group by type
Another query:
select o.product, …
from order o
join user u
on o.user_id = …
Can be executed on many PB of
data (having an appropriate
number of machines)
How can you translate it into
MapReduce ?
How many MR steps ?
Order and User are folders with text files in
HDFS. Hive considers them tables.
What if we want distinct results ?
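One possible way to think about the translation is a "reduce-side join" done in a single MapReduce step: the map side emits every order and every user under the join key, the shuffle brings equal keys together, and the reduce side pairs them up. The sketch below imitates this in memory. The record layouts (an order with a user id and a product, a user with an id and a name) are assumptions, since the slide does not show the full query, and this is not how Hive is actually implemented. Distinct results would typically need one more grouping step on the joined rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoin {
    // Hypothetical record layouts: the slide does not show the real columns.
    static final class Order {
        final int userId;
        final String product;
        Order(int userId, String product) { this.userId = userId; this.product = product; }
    }

    static final class User {
        final int id;
        final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }

    // Everything that ends up under one join key after the shuffle.
    static final class Bucket {
        final List<String> products = new ArrayList<>();
        final List<String> names = new ArrayList<>();
    }

    static List<String> join(List<Order> orders, List<User> users) {
        // "Map" and "shuffle": file every record under its join key, by source.
        Map<Integer, Bucket> byKey = new TreeMap<>();
        for (Order o : orders) {
            byKey.computeIfAbsent(o.userId, k -> new Bucket()).products.add(o.product);
        }
        for (User u : users) {
            byKey.computeIfAbsent(u.id, k -> new Bucket()).names.add(u.name);
        }
        // "Reduce": for each key, pair every order with every matching user.
        List<String> rows = new ArrayList<>();
        for (Bucket b : byKey.values()) {
            for (String product : b.products) {
                for (String name : b.names) {
                    rows.add(product + "," + name);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1, "book"), new Order(2, "pen"), new Order(1, "lamp"));
        List<User> users = List.of(new User(1, "ann"), new User(2, "bob"));
        join(orders, users).forEach(System.out::println);
    }
}
```

Orders without a matching user produce no row, which is the behaviour of an inner join with an equality condition.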
Hive and Pig produce multiple MapReduce jobs and run them
in sequence.
What if we want to define a custom workflow of MR/Hive/Pig
jobs ?
• Configure Jobs
• Define Workflow
• Schedule execution
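A workflow of this kind is essentially a graph of jobs with dependencies. As a rough illustration of the "define workflow" step, the sketch below computes an execution order in which every job runs after the jobs it depends on. The job names are invented, and this is plain Java, not an Oozie workflow definition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WorkflowOrder {
    // dependsOn maps each job to the jobs that must finish before it.
    // Every job, including those without dependencies, must appear as a key.
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();
        dependsOn.forEach((job, deps) -> pending.put(job, deps.size()));

        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((job, n) -> { if (n == 0) ready.add(job); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.poll();
            order.add(done);
            // Release every job that was waiting for the one that just finished.
            dependsOn.forEach((job, deps) -> {
                if (deps.contains(done) && pending.merge(job, -1, Integer::sum) == 0) {
                    ready.add(job);
                }
            });
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalArgumentException("cycle or unknown dependency in workflow");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> workflow = new LinkedHashMap<>();
        workflow.put("import-data", List.of());
        workflow.put("pig-cleanup", List.of("import-data"));
        workflow.put("hive-report", List.of("pig-cleanup"));
        System.out.println(executionOrder(workflow));
    }
}
```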
Files on HDFS are not always uploaded “by hand” from the
command line.
Flume can bring files to HDFS.
Web App
(Apache front-end)
Big Data Infrastructure
Flume transports data through channels. Any channel can be
connected with Sources and Sinks.
When you have multiple web servers. You can also send
output to multiple
A simple agent (simple-flume.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
A netcat source listens
for incoming telnet data
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
A logger sink just outputs
data (useful for debugging)
Run the example with the command:
flume-ng agent --conf conf --conf-file simple-flume.conf --name a1
Then open another terminal and use telnet to send some text to
the listening agent (localhost, port 44444).
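The source, channel and sink wiring of the agent can be mimicked with a bounded queue: the source puts events in, the sink takes them out in batches. The sketch below is only a mental model. It reuses the capacity (1000) and transactionCapacity (100) values from the configuration, while the class and method names are invented and the real Flume transaction semantics are richer than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class MiniAgent {
    private final BlockingQueue<String> channel; // plays the role of the memory channel
    private final int transactionCapacity;       // max events handed to the sink at once

    MiniAgent(int capacity, int transactionCapacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
        this.transactionCapacity = transactionCapacity;
    }

    // Source side: returns false when the channel is full and the event is rejected.
    boolean accept(String event) {
        return channel.offer(event);
    }

    // Sink side: take at most one batch of events out of the channel.
    List<String> drainBatch() {
        List<String> batch = new ArrayList<>();
        channel.drainTo(batch, transactionCapacity);
        return batch;
    }

    public static void main(String[] args) {
        MiniAgent agent = new MiniAgent(1000, 100);
        for (int i = 0; i < 250; i++) {
            agent.accept("event-" + i);
        }
        System.out.println("first batch: " + agent.drainBatch().size() + " events");
    }
}
```

The bounded channel is what decouples a fast source from a slow sink: events wait in the buffer until the sink is ready, and the source is pushed back when the buffer is full.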
Another tool useful to ingest data in HDFS from relational
databases is Sqoop2.
With Sqoop2, you can configure Jobs made of:
• A JDBC Source (table or query)
• An HDFS sink (folder with type and compression options).
Other types of sink are also supported, e.g. HBase.
Jobs can be configured and run with: sqoop2-shell (or hue).
synch is not
Mahout is about “Machine learning on top of Hadoop”.
Main packages:
• Collaborative Filtering
• Classification
• Clustering
Many algorithms run on MapReduce.
Other algorithms run on other engines. Some of them can
run only locally (not parallelizable).
Sample usage: mahout kmeans -i … -o … -k …
Add Interactivity to SQL queries: Cloudera Impala
Next topic: MapReduce V2
• Paradigm: MapReduce is a new “template” algorithm.
Difficult to translate existing algorithms into MapReduce.
• Expressiveness: a single MapReduce job is often not
sufficient for enterprise data processing. You need to write
multiple jobs.
• Interactivity: MapReduce is terribly slow for small amounts of
data (time for initialization). Hive queries cannot be
• Maintainability: writing a single MapReduce job can be
cumbersome. Writing a pipeline of MapReduce jobs produces
100% unmaintainable code.
Nobody writes MapReduce jobs directly anymore
With “real” Big Data (Volume), another issue is coming:
When you have multiple PB of data, scalability is not linear
Google replaced MapReduce with “Cloud Dataflow”
years ago.
Disk and network are the slowest components: imagine a pipeline of 20 MR jobs…
MapReduce v1 had too many components.
• Resource Management has been moved to YARN
• MR API rewritten (changed package from
org.apache.hadoop.mapred to org.apache.hadoop.mapreduce)
Yet Another Resource Negotiator: started in 2008 at Yahoo!
Introduced in major data platforms in 2012.
Negotiate (allocate containers with):
YARN is considered the Hadoop Data Operating System.
Hortonworks Data Platform
MapReduce problems have been solved with MR2.
What about HDFS ?
• Can store large volumes of files
• Supports any format, from text files to custom records
• Supports “transparent” compression of data
• Parallel retrieval and storage of batches
• Does not provide:
• Fast random read/write (HDFS is append only)
• Data updates (rewrite the entire block: 128MB)
Google solved the problem starting from 2004.
In 2006 they published a paper about “Bigtable”.
The Hadoop community made their own version of Bigtable
in 2010.
It has been called HBase. It provides:
• Fast read/write access to single records
• Organization of data in tables, column families and columns
• Also: performance, replication, availability, consistency…
[Diagram: tables contain column families, and column families contain columns]
put “value” to table1, cf1:col1 (row_key)
get * from table1 (row_key)
delete from table1, cf1:col1 (row_key)
scan …
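The four operations above can be pictured as operations on a sorted map of rows, where each row is itself a map from "family:column" to a value. The toy class below is only a mental model of that data layout, with names chosen for the example: it is not the HBase driver API and it ignores versions, regions and persistence. Like an HBase scan, its scan includes the start key and excludes the stop key.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class ToyTable {
    // row key -> ("family:column" -> value); rows are kept sorted by key.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // All cells of one row, or an empty map if the row does not exist.
    Map<String, String> get(String rowKey) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? Map.of() : row;
    }

    void delete(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        if (row != null) {
            row.remove(column);
            if (row.isEmpty()) {
                rows.remove(rowKey);
            }
        }
    }

    // Rows with startKey <= key < stopKey, in key order.
    NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        ToyTable table1 = new ToyTable();
        table1.put("row1", "cf1:col1", "value");
        System.out.println(table1.get("row1"));
        table1.delete("row1", "cf1:col1");
        System.out.println(table1.get("row1"));
    }
}
```

Because rows are sorted by key, range scans are cheap, which is one reason why the choice of the row key matters so much.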
We will have a whole presentation on HBase…
[Diagram: the Hadoop Master hosts the HMaster and the NameNode, next to Hadoop Slave 1 and Hadoop Slave 2]
Different ways to access HBase:
• HBase Driver API:
• MapReduce
• Hadoop InputFormat and OutputFormat to read/write data
in batches
• Hive/Impala
• Do SQL on HBase: limitations in “predicate pushdown”
• Apache Phoenix:
• A project to “translate” SQL queries into Driver API Calls
Some Data Platforms include different NoSQL databases.
Similar to HBase
Not Only SQL → NOw SQL
NoSQL databases have some “features” in common:
• You need to model the database having the queries in mind
• You need to add redundancy (to do different queries)
• Lack of a “good” indexing system (secondary indexes
absent or limited)
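What "adding redundancy" means in practice: when the store can only look up by key, a second query needs a second copy of the data keyed differently, and the application has to keep both copies in sync. A minimal sketch with two in-memory maps standing in for two tables; the user and city example is invented for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class UsersByCity {
    private final Map<String, String> cityByUser = new TreeMap<>();       // primary: user -> city
    private final Map<String, Set<String>> usersByCity = new TreeMap<>(); // redundant: city -> users

    void save(String user, String city) {
        String oldCity = cityByUser.put(user, city);
        if (oldCity != null) {
            usersByCity.get(oldCity).remove(user); // keep the redundant copy in sync
        }
        usersByCity.computeIfAbsent(city, k -> new TreeSet<>()).add(user);
    }

    // Query 1: lookup by user (served by the primary copy).
    String cityOf(String user) {
        return cityByUser.get(user);
    }

    // Query 2: lookup by city (served by the redundant copy).
    Set<String> usersIn(String city) {
        return usersByCity.getOrDefault(city, Set.of());
    }

    public static void main(String[] args) {
        UsersByCity db = new UsersByCity();
        db.save("ann", "Rome");
        db.save("bob", "Rome");
        System.out.println(db.usersIn("Rome"));
    }
}
```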
A Solution:
• Solr
Full Text Search
“Apache Spark is a fast and general engine for large-scale
data processing”
Spark vs MapReduce:
• Faster
• Clearer
• Shorter
• Easier
• More powerful
A MapReduce complex algorithm:
The developer writes multiple applications, each one with 1 map and 1 reduce step.
A scheduler (Oozie) is programmed to execute all applications in a configurable order.
A Spark complex algorithm:
The developer writes 1 application using a simple API.
The Spark Framework executes the application.
Data is processed in memory as much as possible.
Many applications originally developed on MapReduce are
gradually migrating to Spark (migration in progress).
Pig on Spark (Spork): just using “-x spark” in shell
Hive on Spark: “set hive.execution.engine=spark”
Mahout, since 25/04/2014: no new MapReduce-based algorithms
RDDs can be used like normal Scala collections; there are only
small differences in the API.
val book = sc.textFile("/books/dante/inferno.txt")
val words = book.flatMap(f => f.split(" "))
val chart = words
  .map(w => (w, 1))
  .reduceByKey((n1, n2) => n1 + n2)
  .top(4)(Ordering.by(t => t._2))
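For readers more at home with Java than Scala, the same chain of steps can be written against ordinary in-memory collections with the Java streams API. This is only an analogy to show how collection-like the RDD code is: it runs on one machine, uses no Spark classes, and the class and method names are invented. Ties between equally frequent words are broken alphabetically here, which is a choice of this sketch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TopWords {
    static List<Map.Entry<String, Integer>> top(List<String> lines, int n) {
        Map<String, Integer> counts = lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))          // like flatMap on the RDD
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum)); // like map + reduceByKey
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
            .limit(n)                                                  // like top(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(top(List.of("a b a", "b a c"), 2));
    }
}
```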
Spark has a “Streaming” component.
Storm streaming
model is different
Spark libraries are developed on top of the Spark core framework for
large scale data processing:
• Spark SQL: execute SQL queries on heterogeneous distributed
• Spark Streaming: execute micro-batches on streaming data
• Spark MLlib: ready-to-use machine learning algorithms
• Spark GraphX: algorithms and abstractions for working with graphs
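The "micro-batch" idea mentioned for Spark Streaming in the list above can be sketched without Spark: incoming events are cut into small consecutive time windows, and each window is then processed as a tiny batch. The helper below groups event timestamps by window. It assumes non-negative timestamps in milliseconds, and its names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatcher {
    // Groups event timestamps (milliseconds, non-negative) into consecutive
    // windows of batchMillis. The key of each batch is the start of its window.
    static Map<Long, List<Long>> batches(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> result = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / batchMillis) * batchMillis;
            result.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(ts);
        }
        return result;
    }

    public static void main(String[] args) {
        // One-second batches over four events
        System.out.println(batches(List.of(100L, 900L, 1000L, 2500L), 1000));
    }
}
```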
Spark Core
Spark SQL
Spark MLlib
Manage a Big Data cluster: Cloudera Manager (and others…)
Simplify the management of data: Hue
Version 2
+ DataFrames
+ MLlib
+ GraphX
A common “data platform on top of Hadoop”: Hortonworks
Cloudera Data Hub.
A brief history of "big data"

More Related Content

What's hot

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
Spark architecture
Spark architectureSpark architecture
Spark architectureGauravBiswas9
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks

What's hot (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Spark architecture
Spark architectureSpark architecture
Spark architecture
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Similar to A brief history of "big data"

Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics HadoopMishika Bharadwaj
Big data ppt
Big data pptBig data ppt
Big data pptShweta Sahu
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoopAshishRathore72
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerDenny Lee
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopJosh Devins
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha

Similar to A brief history of "big data" (20)

Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
Big data ppt
Big data pptBig data ppt
Big data ppt
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
Python in big data world
Python in big data worldPython in big data world
Python in big data world
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report

More from Nicola Ferraro

Camel Day Italia 2021 - Camel K
Camel Day Italia 2021 - Camel KCamel Day Italia 2021 - Camel K
Camel Day Italia 2021 - Camel KNicola Ferraro
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...Nicola Ferraro
ApacheCon NA - Apache Camel K: a cloud-native integration platform
ApacheCon NA - Apache Camel K: a cloud-native integration platformApacheCon NA - Apache Camel K: a cloud-native integration platform
ApacheCon NA - Apache Camel K: a cloud-native integration platformNicola Ferraro
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
Integrating Applications: the Reactive Way
Integrating Applications: the Reactive WayIntegrating Applications: the Reactive Way
Integrating Applications: the Reactive WayNicola Ferraro
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachNicola Ferraro
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro

More from Nicola Ferraro (7)

Camel Day Italia 2021 - Camel K
Camel Day Italia 2021 - Camel KCamel Day Italia 2021 - Camel K
Camel Day Italia 2021 - Camel K
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
ApacheCon NA - Apache Camel K: connect your Knative serverless applications w...
ApacheCon NA - Apache Camel K: a cloud-native integration platform
ApacheCon NA - Apache Camel K: a cloud-native integration platformApacheCon NA - Apache Camel K: a cloud-native integration platform
ApacheCon NA - Apache Camel K: a cloud-native integration platform
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
Integrating Applications: the Reactive Way
Integrating Applications: the Reactive WayIntegrating Applications: the Reactive Way
Integrating Applications: the Reactive Way
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes

A brief history of "big data"

  • 1. A BRIEF HISTORY OF BIG DATA Nicola Ferraro
  • 2. SUMMARY • Welcome intro • What is Big Data? • Some components: • HDFS • MapReduce • Pig • Hive • Oozie • Flume • Sqoop • Mahout • Impala • More components: • YARN & MapReduceV2 • NoSQL & Hbase • Solr • Spark • Cloudera Manager • Hue
  • 3. MOTIVATION “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.” Eric Schmidt, Google, August 2010
  • 4. WHAT PEOPLE THOUGHT ABOUT BIG DATA “Big Data is a shorthand label that typically means applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases. The new data sources include Web-browsing data trails, social network communications, sensor data and surveillance data.” "A process that has the potential to transform everything” NY Times, August 2012
  • 5. WHAT THEY THOUGHT LATER “The ‘Big’ there is purely marketing… This is about you buying big expensive servers and whatnot.” “Good data is better than big data” “Big data is bullshit”, 2013 Quote from Harper Reed, Tech Guru for Obama re-election campaign in 2012
  • 6. BUT, THIS WAS EXPECTED Towards disillusionment…
  • 7. HOW MUCH IT WILL CHANGE THE WORLD ? We have a long history of successes in predictions… “I predict the internet will […] catastrophically collapse in 1996” Robert Metcalfe, Inventor of Ethernet, 1995 And the world will end in 2012…
  • 8. THERE IS NO WAY TO PREDICT THE FUTURE “Spam will be solved in 2 years” Bill Gates, 2004 “There is no chance the iPhone will get any significant market share” Steve Ballmer, 2007
  • 9. WHAT ABOUT THE TECHNOLOGY ? The communication paths in the “normal world”: What about this technology ? It is great !!!
  • 10. CAN YOU DO THE SAME IN THE “BIG WORLD” ? The communication paths in the “big world”: What about this technology ? steep learning curve
  • 11. A REAL WORLD EXAMPLE Sarah, today I’m gonna run some wonderful Spark applications on my new Big Data cluster on Amazon EC2!! Oh, I thought that Amazon was just selling shoes!
  • 12. WELCOME BACK TO 1946 ENIAC One of the first electronic digital computers (1946). It occupied 180 m². A Big Data cluster (today). We need new languages and abstractions to develop on top of it. (more powerful than assembly !)
  • 13. VOLUME Need to process: • 1TB • 1PB • 1EB of data, and extract useful information. How ?
  • 14. VELOCITY Need to handle: 1MB/s 1GB/s 1TB/s of data and react in nearly real time. How ? Think of sensor data
  • 15. VARIETY Need to process: • Digital Images • Video Recording • Free Text data and extract useful information. How?
  • 16. VVV: THAT IS BIG DATA Traditional systems are not suitable for data characterized by high: • Volume: TB, PB, EB, ZB … • Velocity: GB/s, TB/s … • Variety: unstructured or semi-structured Big Data systems have been created for these purposes.
  • 17. WHAT’S WRONG WITH TRADITIONAL SYSTEMS? Oracle databases can host tables with more than 200TB of data (some of them are in Italy). Suppose you want to run a query like: select type, count(*) from events group by type How long will you wait ? Even if you reach 10GB/s of read speed from disks (with multiple SSDs)… you will wait more than 5 hours ! (if the instance doesn’t crash first…)
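The "more than 5 hours" figure on slide 17 can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 TB = 1000 GB), a full scan of the 200 TB table, and a sustained 10 GB/s read rate with no other bottleneck.

```latex
\[
t = \frac{200\ \text{TB}}{10\ \text{GB/s}}
  = \frac{200\,000\ \text{GB}}{10\ \text{GB/s}}
  = 20\,000\ \text{s}
  \approx 5.6\ \text{hours}
\]
```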
  • 18. BIG DATA SYSTEMS ARE DISTRIBUTED Big Data systems are a collection of software applications installed on different machines. Each application can be used as if it were installed on a single machine. [Diagram: machines numbered 1, 2, 3, 4, 5, … up to more than 10,000]
  • 19. GETTING STARTED Do not try to install all the software by yourself! You’ll go crazy! Get a “Platform including Hadoop” in a virtual machine from: (Many) applications in the VM run in: “pseudo-distributed mode” For:
  • 21. HISTORY: GOOGLE FILE SYSTEM In 2003, Google published a paper about a new distributed file system called Google File System (GFS). Their largest cluster: • Was composed of more than 1000 nodes • Could store more than 100 TB (it was 2003 !) • Could be accessed by hundreds of concurrent clients Its main purpose was to serve the Google search engine. But… how?
  • 22. HISTORY: MAPREDUCE In 2004, Google published another paper about a new batch processing framework, called MapReduce. MapReduce: • Was a parallel processing framework • Was integrated perfectly with GFS MapReduce was used for updating indexes in the Google search engine.
  • 23. HISTORY: HADOOP In 2005, Doug Cutting (Yahoo!) created the basis of the Big Data movement: Hadoop. Originally, Hadoop was composed of: 1. HDFS: the Hadoop Distributed File System, “inspired by” GFS 2. MapReduce: a parallel processing framework “inspired by” Google MapReduce “inspired by” = “a copy of” … but with an open source license (starting from 2009). Hadoop was the name of his son’s toy elephant.
  • 24. HDFS: INTERNALS A Master/Slave architecture: • Master: takes care of directories and file block locations • Slaves: store data blocks (128MB). Replication factor 3.
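To make the numbers on slide 24 concrete, here is a small arithmetic sketch of how a file maps to 128MB blocks with replication factor 3. `HdfsBlockMath` and its methods are hypothetical helpers written for this illustration; they are not part of the Hadoop API, and the 1000 MB file size is an arbitrary example.

```java
// Hypothetical helper, not part of the Hadoop API: plain arithmetic on block counts.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes == 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long fileSize = 1000 * MB;                      // example file size (assumption)
        long blocks = blockCount(fileSize, 128 * MB);   // 128MB blocks, as on the slide
        System.out.println(blocks + " blocks, " + (blocks * 3) + " block replicas");
        System.out.println(rawBytes(fileSize, 3) / MB + " MB of raw storage");
    }
}
```

With these inputs the file needs 8 blocks (the last one only partly filled), which become 24 block replicas spread over the slaves, while the master only keeps track of where each of them lives.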
  • 25. HDFS: LOGICAL VIEW An HDFS cluster appears logically as a normal POSIX file system (not fully compliant with POSIX): • Clients are distribution unaware (e.g. Shell, Hue) • Allows creation of: • Files and directories • ACL (Users and groups) • Read/Write/AccessChild permissions
  • 26. MAPREDUCE Algorithm: 1. Map: data is taken from HDFS and transformed 2. Shuffling: data is split and reorganized among nodes 3. Reduce: data is summarized and written back to HDFS [Diagram: the Master runs the HDFS NameNode and the MapReduce JobTracker (delegation and aggregation); Slaves 1, 2 and 3 each run an HDFS DataNode and a MapReduce TaskTracker. Map tasks run where the data is (data locality); shuffling goes through temp files.]
  • 27. MAPREDUCE: WORD COUNT [Diagram: counting words in emails. HDFS blocks on different machines (splits) feed 3 mappers, whose output is shuffled to 4 reducers.]
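The three phases from slide 26 can be tried out on a laptop before looking at the real Hadoop classes. The following is a minimal in-memory sketch, not Hadoop code: `LocalWordCount` and its `map`, `shuffle` and `reduce` methods are names invented for this illustration, and everything runs in a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only: no Hadoop classes are used here.
class LocalWordCount {

    // Map phase: one input line becomes a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: summarize the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or not", "to be")));
    }
}
```

In a real cluster each call to `map` would run on the machine holding the corresponding HDFS block, and the grouping done by `shuffle` would involve temp files and network transfer between nodes.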
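  • 28. MAPREDUCE: SOFTWARE The “mapper”:
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token of the input line
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }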
  • 29. MAPREDUCE: SOFTWARE The “reducer”:
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sums all the counts received for one word
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
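  • 30. MAPREDUCE: SOFTWARE The “main” class:
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = Job.getInstance(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          // args[0]: input path on HDFS, args[1]: output path on HDFS
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }
    Configuration is usually pre-filled from an additional XML file (hadoop-site.xml)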
  • 31. MAPREDUCE Considerations: • Can run on more than 10 thousand machines • Linear scalability (theoretical/commercial feature): • 100 nodes: 1PB in 2 hours → 200 nodes: 1PB in 1 hour • 100 nodes: 1PB in 2 hours → 200 nodes: 2PB in 2 hours • Programming model: • You can do more than word count (samples follow) • Complex data pipelines require more than 1 MapReduce step • Difficult to write programs as MapReduce jobs (a brand new way of writing algorithms) • Difficult to maintain code (and to reverse engineer)
  • 32. FIRST IMPROVEMENT: PIG MapReduce jobs are difficult to write. Complex pipelines require multiple jobs. In 2006, people at Yahoo Research started working on a new language to simplify the creation of MapReduce jobs. They created Pig. Its language, “Pig Latin”, is procedural. It is still used by data scientists at Yahoo! and worldwide.
  • 33. PIG LATIN The word count in Pig:
      lines = LOAD '/tmp/input-file' AS (line:chararray);
      words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      -- remove words that are only whitespace
      filtered_words = FILTER words BY word MATCHES '\\w+';
      word_groups = GROUP filtered_words BY word;
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/output-file';
    From this kind of source, Pig creates one or more MapReduce jobs
  • 34. PIG Considerations: • Procedural language, easier than MapReduce style • Can build directed acyclic graphs (DAG) of MapReduce steps • One step has the same scalability as its MapReduce counterpart • Compatibility with many languages for writing user defined functions (UDF): Java, Python, Ruby, … • Thanks to UDFs, you can also process unstructured data • It’s another language • Stack Overflow cannot help you in case of bugs !
  • 35. SECOND IMPROVEMENT: HIVE In 2009, the Facebook Data Infrastructure Team created Hive: “An open-source data warehousing solution on top of hadoop” Hive brings SQL to Hadoop (queries translated to MapReduce): • You can define the structure of Hadoop files (tables, with columns and data types) and save them in Hive Metastore • You can query tables with HiveQL, a dialect of SQL. Limitations: • Joins only with equality conditions • Subqueries with limitations • Limitations depend on the version… Check the docs
  • 36. HIVE: SAMPLE QUERIES An example of a Hive query (you have seen it before): select type, count(*) from events group by type Another query: select o.product, from order o join user u on o.user_id = Can be executed on many PB of data (given an appropriate number of machines) How can you translate it into MapReduce ? How many MR steps ? Order and User are folders with text files in HDFS. Hive considers them as tables
  • 37. HIVE: EQUI-JOIN User Order What if we want distinct results ?
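One common way to express an equi-join in the MapReduce model is a "reduce-side join": the map step tags each row with its source table and emits it under the join key, and the reduce step pairs up the rows that share a key. The sketch below runs in memory and is only an illustration of that idea. It is not necessarily the plan Hive generates, and the column layouts (`{id, name}` for users, `{user_id, product}` for orders) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of a reduce-side equi-join; no Hive or Hadoop classes involved.
class LocalEquiJoin {

    // A row payload tagged with the table it came from ("U" for user, "O" for order).
    static final class Tagged {
        final String table;
        final String payload;

        Tagged(String table, String payload) {
            this.table = table;
            this.payload = payload;
        }
    }

    // users: rows of {id, name}; orders: rows of {user_id, product} (hypothetical columns).
    static List<String> join(List<String[]> users, List<String[]> orders) {
        // Map + shuffle: every row is emitted under its join key, tagged with its table.
        Map<String, List<Tagged>> byKey = new TreeMap<>();
        for (String[] user : users) {
            byKey.computeIfAbsent(user[0], k -> new ArrayList<>()).add(new Tagged("U", user[1]));
        }
        for (String[] order : orders) {
            byKey.computeIfAbsent(order[0], k -> new ArrayList<>()).add(new Tagged("O", order[1]));
        }
        // Reduce: for each key, pair every order with every user sharing that key.
        List<String> result = new ArrayList<>();
        for (List<Tagged> group : byKey.values()) {
            for (Tagged left : group) {
                if (!left.table.equals("O")) {
                    continue;
                }
                for (Tagged right : group) {
                    if (right.table.equals("U")) {
                        result.add(left.payload + "," + right.payload);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(new String[]{"1", "anna"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
            new String[]{"1", "book"}, new String[]{"1", "pen"}, new String[]{"3", "lamp"});
        for (String row : join(users, orders)) {
            System.out.println(row);
        }
    }
}
```

Keys that appear in only one table produce no output, which matches inner-join semantics. Under this scheme the join itself fits in a single map/shuffle/reduce pass; removing duplicates from the result would typically need a further grouping step on the output rows.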
  • 38. MULTIPLE MR JOBS Hive and Pig produce multiple MapReduce jobs and run them in sequence. What if we want to define a custom workflow of MR/Hive/Pig jobs ? Oozie: • Configure Jobs • Define Workflow • Schedule execution
  • 39. FLUME Files on HDFS are not always uploaded “by hand” from the command line. Flume can bring files to HDFS. [Diagram: Web App (Apache front-end) → Big Data Infrastructure]
  • 40. FLUME: AGENTS Flume transports data through channels. Any channel can be connected with Sources and Sinks.
  • 41. CHANNEL COMPOSITION Channels can be composed when you have multiple web servers. You can also send output to multiple locations (multiplexing)
  • 42. CONFIGURING FLUME A simple agent (simple-flume.conf):
      # Name the components on this agent
      a1.sources = r1
      a1.sinks = k1
      a1.channels = c1
      # Describe/configure the source
      a1.sources.r1.type = netcat
      a1.sources.r1.bind = localhost
      a1.sources.r1.port = 44444
    A netcat source listens for incoming telnet data
  • 43. CONFIGURING FLUME (CONT.)
      # Describe the sink
      a1.sinks.k1.type = logger
      # Use a channel which buffers events in memory
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 1000
      a1.channels.c1.transactionCapacity = 100
      # Bind the source and sink to the channel
      a1.sources.r1.channels = c1
      a1.sinks.k1.channel = c1
    A logger sink just outputs data (useful for debugging)
  • 44. TESTING FLUME Run the example with the command: flume-ng agent --conf conf --conf-file simple-flume.conf --name a1 -Dflume.root.logger=INFO,console Then, open another terminal and send some text via telnet to the listening agent.
  • 45. INGESTING FROM JDBC Another useful tool, for ingesting data into HDFS from relational databases, is Sqoop2. With Sqoop2, you can configure Jobs made of: • A JDBC Source (table or query) • An HDFS Sink (folder with type and compression options). Other types of sink are also supported, e.g. HBase. Jobs can be configured and run with: sqoop2-shell (or Hue). Real-time sync is not supported
  • 46. MACHINE LEARNING: MAHOUT Mahout is about “Machine learning on top of Hadoop”. Main packages: • Collaborative Filtering • Classification • Clustering Many algorithms run on MapReduce. Other algorithms run on other engines. Some of them can run only locally (not parallelizable). Sample usage: mahout kmeans -i … -o … -k …
  • 47. IMPALA Add Interactivity to SQL queries: Cloudera Impala
  • 48. WHAT’S NEXT ? Next topic: MapReduce V2
  • 49. SOME PROBLEMS WITH MAPREDUCE • Paradigm: MapReduce is a new “template” algorithm. It is difficult to translate existing algorithms into MapReduce. • Expressiveness: a single MapReduce job is often not sufficient for enterprise data processing. You need to write multiple jobs. • Interactivity: MapReduce is terribly slow for small amounts of data (time for initialization). Hive queries cannot be interactive. • Maintainability: writing a single MapReduce job can be cumbersome. Writing a pipeline of MapReduce jobs produces 100% unmaintainable code. Nobody writes MapReduce jobs directly anymore
  • 50. AN “UNEXPECTED” ISSUE WITH MR With “real” Big Data (Volume), another issue arises: Performance: When you have multiple PB of data, scalability is not linear anymore. Google replaced MapReduce with “Cloud Dataflow” many years ago. Disk and network are the slowest components: imagine a pipeline of 20 MR jobs… [Diagram: HDFS → map → spill → shuffle → reduce → HDFS, with partitioning, disk writes and reads, and a network transfer between the stages]
  • 51. MAPREDUCE 2 MapReduce v1 had too many components. Changes: • Resource Management has been moved to YARN • MR API rewritten (changed package from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce)
  • 52. YARN Yet Another Resource Negotiator: started in 2008 at Yahoo! Introduced in major data platforms in 2012. Negotiate (allocate containers with): • RAM • DISK • CPU • NETWORK
  • 53. YARN: ADVANTAGES YARN is considered the Hadoop Data Operating System. Hortonworks Data Platform
  • 54. HDFS LIMITATIONS MapReduce problems have been solved with MR2. What about HDFS ? • Can store large volumes of files • Supports any format, from text files to custom records • Supports “transparent” compression of data • Parallel retrieval and storage of batches • Does not provide: • Fast random read/write (HDFS is append only) • Data updates (rewrite the entire block: 128MB)
  • 55. HBASE: THE HADOOP DATABASE Google started working on the problem in 2004. In 2006 they published a paper about “Bigtable”. The Hadoop community made their own version of Bigtable in 2010. It was called HBase. It provides: • Fast read/write access to single records • Organization of data in tables, column families and columns • Also: performance, replication, availability, consistency…
  • 56. HBASE: DATA MODEL [Diagram: HBase contains Tables; each Table contains Column Families; each Column Family contains Columns; each Column holds Cells.] Operations: put “value” to table1, cf1:col1 (row_key) get * from table1 (row_key) delete from table1, cf1:col1 (row_key) scan …
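The table / column family / column / cell hierarchy and the put, get, delete and scan operations from slide 56 can be mimicked with nested sorted maps. This is a toy model written for illustration, not the HBase client API: `ToyTable` and its method signatures are invented here, and it ignores cell versions, regions and persistence.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the row key -> "family:column" -> value layout; not the HBase client API.
class ToyTable {
    // Rows are kept sorted by row key, which is what makes range scans cheap.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // put "value" to table, family:column (rowKey)
    void put(String rowKey, String family, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + column, value);
    }

    // get * from table (rowKey): a copy of all cells of one row, empty if the row is missing.
    Map<String, String> get(String rowKey) {
        NavigableMap<String, String> cells = rows.get(rowKey);
        if (cells == null) {
            return new TreeMap<>();
        }
        return new TreeMap<>(cells);
    }

    // delete from table, family:column (rowKey); drops the row once it has no cells left.
    void delete(String rowKey, String family, String column) {
        NavigableMap<String, String> cells = rows.get(rowKey);
        if (cells == null) {
            return;
        }
        cells.remove(family + ":" + column);
        if (cells.isEmpty()) {
            rows.remove(rowKey);
        }
    }

    // scan: all rows with startRow <= key < stopRow, in row-key order.
    NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        ToyTable table1 = new ToyTable();
        table1.put("user#001", "cf1", "col1", "value");
        table1.put("user#002", "cf1", "col1", "other");
        System.out.println(table1.get("user#001"));
        System.out.println(table1.scan("user#001", "user#003").keySet());
    }
}
```

Because every access goes through the row key, the choice of row key decides which lookups and scans are efficient.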
  • 57. HBASE: ARCHITECTURE We will have a whole presentation on HBase… [Diagram: the Hadoop Master runs the HMaster and the NameNode; Hadoop Slaves 1 and 2 each run a Region Server and a DataNode, on top of HDFS.]
  • 58. ACCESS HBASE Different ways to access HBase: • HBase Driver API: • CRUDL • MapReduce • Hadoop InputFormat and OutputFormat to read/write data in batches • Hive/Impala • Do SQL on HBase: limitations in “predicate pushdown” • Apache Phoenix: • A project to “translate” SQL queries into Driver API Calls
  • 59. NOSQL Some Data Platforms include different NoSQL databases: similar to HBase, Graphs, Documents, Key/Value. Not Only SQL → NOw SQL
  • 60. SOLR NoSQL databases have some “features” in common: • You need to model the database having the queries in mind • You need to add redundancy (to do different queries) • Lack of a “good” indexing system (secondary indexes absent or limited) A Solution: • Solr Full Text Search
  • 61. APACHE SPARK: THE GAME CHANGER “Apache Spark is a fast and general engine for large-scale data processing” Spark vs MapReduce: • Faster • Clearer • Shorter • Easier • More powerful
  • 62. KEY DIFFERENCE A complex MapReduce algorithm: the developer writes multiple applications, each one with 1 map and 1 reduce step, reading from and writing to HDFS. A scheduler (Oozie) is programmed to execute all applications in a configurable order. A complex Spark algorithm: the developer writes 1 application using a simple API. The Spark framework executes the application. Data is processed in memory as much as possible.
  • 63. MIGRATION Many applications originally developed on MapReduce are gradually migrating to Spark (migration in progress). Pig on Spark (Spork): just using “-x spark” in shell Hive on Spark: “set hive.execution.engine=spark” Since 25/04/2014: No more MapReduce based algorithms
  • 64. USAGE RDDs can be used as normal Scala collections; there are only small differences in the API.
      val book = sc.textFile("/books/dante/inferno.txt")
      val words = book.flatMap(f => f.split(" "))
      val chart = words
        .map(w => (w, 1))
        .reduceByKey((n1, n2) => n1 + n2)
        .top(4)( => t._2))
    SCALA !
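For readers more at home in Java, the shape of the pipeline on slide 64 (split into words, count per word, keep the top entries by count) can be reproduced on a local collection with `java.util.stream`. This is only a local stand-in to show the steps, not Spark code: nothing here is distributed, `TopWords` is a name invented for the example, and the empty-word filter and the alphabetical tie-break are additions that the slide does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Local stand-in for the Spark pipeline: plain java.util.stream, no Spark dependency.
class TopWords {

    // Returns the n most frequent words with their counts, most frequent first.
    static List<Map.Entry<String, Long>> top(List<String> lines, int n) {
        Map<String, Long> counts = lines.stream()
            // like flatMap(f => f.split(" ")): one stream of words from all lines
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // addition for this sketch: skip empty strings produced by repeated spaces
            .filter(word -> !word.isEmpty())
            // plays the role of map(w => (w, 1)) followed by reduceByKey(_ + _)
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        return counts.entrySet().stream()
            // highest count first; ties broken alphabetically to keep the result stable
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            // like top(n) ordered by the count
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b a c");
        for (Map.Entry<String, Long> entry : top(lines, 2)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The point of the comparison is that the whole job stays one short chain of collection-style calls, whereas the MapReduce version needs a mapper class, a reducer class and a driver.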
  • 65. STREAMING Spark has a “Streaming” component. Storm streaming model is different SAME SCALA API
  • 66. SPARK COMPONENTS Spark libraries are developed on top of the Spark core framework for large scale data processing: • Spark SQL: execute SQL queries on heterogeneous distributed datasets • Spark Streaming: execute micro-batches on streaming data • Spark MLlib: ready-to-use machine learning algorithms • Spark GraphX: algorithms and abstractions for working with graphs
  • 67. CLUSTER MANAGEMENT Manage a Big Data cluster: Cloudera Manager (and others…)
  • 68. DATA MANAGEMENT Simplify the management of data: Hue
  • 69. THE PUZZLE Manager Version 2 + Dataframes + MLLib + GraphX
  • 70. REORGANIZE THE IDEAS A common “data platform on top of Hadoop”: Hortonworks

Editor's Notes

  1. We need new ways to handle this huge information
  2. People started discovering that no one had the potential to do the expected analysis
  3. The Hype Cycle, Gartner
  4. How much will you spend for a super fast SSD ?
  5. Storage and computation can be parallelized
  6. NameNode HA: Only one writer at a time. NFS shared partition to share updates. The Secondary reads the updates and refreshes its status. Journal Nodes with Quorum Journal Manager (QJM). At least 3 nodes.
  7. Also yarn-site.xml and mapred-site.xml in MR2.
  8. Metadata queries are for determining data types.
  9. With a different paradigm, some read/writes can be avoided. Resorting data (shuffle) at each step is not necessary. Also, for “small” data (TB), many writes can be replaced by in-memory storage.
  10. News on HDFS2: - Enabled automated failover with a hot standby and full stack resiliency for the NameNode master service - Added enterprise standard NFS read/write access to HDFS - Enabled point in time recovery with Snapshots in HDFS - Wire Encryption for HDFS Data Transfer Protocol
  11. Region servers can also work with remote datanodes.
  12. Bash vs Spark ?