Big Data with Apache Hadoop

Data Science Company
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
8/10/2014

Who am I
BEN VERMEERSCH
Big Data Consultant
Cloudera Certified Developer for Apache Hadoop
ben.vermeersch@infofarm.be @benvermeersch

About InfoFarm
Data Science and Big Data
Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.

About InfoFarm
• 2 Data Scientists
• 4 Big Data Consultants
• 1 Infrastructure Specialist

Java, PHP, E-Commerce, Mobile, Web Development

Agenda
• 09:30 – What is Big Data?
• 09:45 – Hadoop – HDFS & MapReduce
• 10:00 – HDFS & MapReduce in Practice
• 10:30 – The Hadoop Ecosystem
• 11:30 – Examples
• 12:00 – Wrap up and Lunch

What is Big Data?

What is Big Data not?
• a technology
• a solution (certainly not a silver bullet) to every IT problem
• a replacement for an RDBMS
• a cloud storage system
• …

Big Data definition attempt
“a description of a problem domain with
specific challenges and solutions which has
become relevant with increasing volume,
velocity and variety in business data and
the increasing requirements towards
processing of this data”

The 3 V’s
• Volume
• Velocity
• Variety

Working the (Hadoop) Big Data way
• Bringing the data processing to the data (vs. moving data to a centralized database)
• Using unstructured or semi-structured data
• Store first, process later
• Simple techniques applied at massive
scale
• Your hardware will fail!

Hadoop (limited) overview
• HDFS: Distributed File System
• Amazon S3, Local FS
• MapReduce, YARN: Distributed Data Processing
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• Oozie: Workflow
• …

HDFS

HDFS Rack Topology

MapReduce
• A method for distributing tasks across
multiple nodes
• Data is processed where it is stored (where
possible)
• Two phases:
– Map
– Reduce
• Both phases have key-value pairs as input and
output; the key and value types may be chosen by the
programmer
• The output from the mappers is used by the
reducers
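
The two phases can be sketched in plain Python, without a cluster. This is only an illustration of the programming model, assuming a word-count job; the names `map_fn`, `reduce_fn` and `run_job` are hypothetical and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce phase: sum all counts collected for one word."""
    yield word, sum(counts)


def run_job(lines):
    # Map: every input record produces zero or more key-value pairs.
    intermediate = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            intermediate[key].append(value)  # stand-in for the shuffle
    # Reduce: each key is handled once, with all of its values.
    output = {}
    for key in sorted(intermediate):
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            output[out_key] = out_value
    return output


word_counts = run_job(["big data with hadoop", "hadoop stores big files"])
```

In a real job the map calls run on the nodes that hold the data and the grouping is done by the framework; here a dictionary plays that role.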

Map & Reduce
Mapper input → Mapper output → Reducer input → Reducer output

Map function
Input.txt: Block 1, Block 2, Block 3
Node 1: Block 1, Block 2
Node 2: Block 2, Block 3
Node 3: Block 1, Block 3
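
Because every block is stored on more than one node, a map task can usually run on a node that already holds its block. The sketch below assigns each block to the least-loaded node that stores a replica, using the placement from the slide. It is a deliberate simplification: `assign_map_tasks` is a hypothetical helper, and the real scheduler takes more factors into account.

```python
# Block placement as drawn on the slide: each block has replicas on two nodes.
replicas = {
    "Block 1": ["Node 1", "Node 3"],
    "Block 2": ["Node 1", "Node 2"],
    "Block 3": ["Node 2", "Node 3"],
}


def assign_map_tasks(replicas):
    """Give each block's map task to the least-loaded node that stores it."""
    load = {}
    assignment = {}
    for block in sorted(replicas):
        # Prefer the node with the fewest tasks so far; break ties by name.
        node = min(replicas[block], key=lambda n: (load.get(n, 0), n))
        assignment[block] = node
        load[node] = load.get(node, 0) + 1
    return assignment


assignment = assign_map_tasks(replicas)
```

With this placement every map task reads its block from local disk and no node gets more than one task.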

Shuffle and sort
• Hadoop automatically sorts and merges the output
from all map tasks
• This intermediate process is known as the shuffle
and sort
• The result is supplied to the reduce tasks
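
A minimal sketch of what the shuffle and sort delivers, assuming the map outputs fit in memory: merge the pairs from all map tasks, sort them by key, and group the values per key. `shuffle_and_sort` is a hypothetical name; Hadoop does this across the network and on disk.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge the outputs of all map tasks, sort by key, group values per key."""
    merged = [pair for task_output in map_outputs for pair in task_output]
    merged.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Output of two hypothetical map tasks.
map_outputs = [
    [("hadoop", 1), ("big", 1)],
    [("data", 1), ("hadoop", 1)],
]
grouped = shuffle_and_sort(map_outputs)
```

Each `(key, values)` entry in `grouped` is what a single call to the reduce function would receive.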

Reduce function
• Reducer input comes from the shuffle and sort process. The reduce function:
– receives one record at a time
– receives all records for a given key
– emits zero or more output records
• Example: a reduce function sums the total per person and emits the
employee name (key) and total (value) as output
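
The example above could look as follows. The employee names and amounts are made up for illustration, and `reduce_total` is a hypothetical function name.

```python
def reduce_total(employee, amounts):
    """Receives every value for one key and emits (employee, total)."""
    yield employee, sum(amounts)


# Hypothetical reducer input, already grouped by the shuffle and sort.
reducer_input = [("An", [120, 80]), ("Bert", [50]), ("Chris", [10, 20, 30])]

totals = [
    record
    for key, values in reducer_input
    for record in reduce_total(key, values)
]
```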

MapReduce under the hood
Diagram: Client, ResourceManager, Node 1 (AppMaster), Node 2, Node 3, HDFS

HDFS & MapReduce
DEMO

Joining
Input tables:
User  Name
1     John
2     Maria
3     Jane

User  Comment
1     Cool
2     Nonono
2     Hi there
3     Hadoop is awesome

Mapper output (one mapper per table; values are tagged A or B):
Key  Value
1    AJohn
2    AMaria
3    AJane

Key  Value
1    BCool
2    BNonono
2    BHi there
3    BHadoop is awesome

Shuffle/Sort (reducer input):
Key  Values
1    AJohn; BCool
2    AMaria; BNonono; BHi there
3    AJane; BHadoop is awesome

Reducer output:
Userid  Name   Comment
1       John   Cool
2       Maria  Nonono
2       Maria  Hi there
3       Jane   Hadoop is awesome
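
The join can be reproduced in a few lines of Python. This sketch uses the data from the slides and the same A/B tagging; the function names are hypothetical and the grouping dictionary stands in for the shuffle and sort.

```python
from collections import defaultdict

users = [(1, "John"), (2, "Maria"), (3, "Jane")]
comments = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]


def map_users(user_id, name):
    yield user_id, "A" + name  # tag A marks a user record


def map_comments(user_id, comment):
    yield user_id, "B" + comment  # tag B marks a comment record


def reduce_join(user_id, values):
    """Pair each A record with every B record that shares the key."""
    names = [v[1:] for v in values if v.startswith("A")]
    texts = [v[1:] for v in values if v.startswith("B")]
    for name in names:
        for text in texts:
            yield user_id, name, text


# Stand-in for the shuffle and sort: collect all values per key.
grouped = defaultdict(list)
for user_id, name in users:
    for key, value in map_users(user_id, name):
        grouped[key].append(value)
for user_id, comment in comments:
    for key, value in map_comments(user_id, comment):
        grouped[key].append(value)

joined = [row for key in sorted(grouped) for row in reduce_join(key, grouped[key])]
```

The tag is what lets the reducer tell the two sources apart once their records arrive mixed under the same key.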

MapReduce Design Patterns
• More info:
• Frameworks on top of MapReduce like
Hive or Pig make this easier

The Hadoop Ecosystem

Apache Pig
• Processing framework for (large) datasets
• Pig Latin
• Runs on Hadoop (or locally) with MapReduce
• Extensible with UDFs

Apache Pig
DEMO

Apache Hive
• SQL-like querying on Hadoop datasets
• Translates to MapReduce under the hood
• Originally developed at Facebook
• Now an Apache top-level project

Hive <-> Traditional RDBMS
Hive:
• Schema on read
• Fast initial load
• Flexible schema
• No update or delete (only insert into)
• HiveQL (subset of SQL)

Traditional RDBMS:
• Schema on write
• Slow initial load
• Fixed schema
• Updates, deletes, inserts all possible
• SQL compliant
Apache Hive
DEMO

HBase
• Column-oriented Data Store
• Distributed
• Type of NoSQL-DB
• Based on Google BigTable

HBase
• Lots and lots of data
• Large number of clients
• Single selects
• Range scan by key
• Variable schema
• Not a traditional RDBMS:
– Transactions
– Group by
– Join
– Where
– Like
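
Single selects and range scans are cheap because rows are kept sorted by row key. The toy class below imitates that idea in memory; it is not the HBase client API, and `SortedKeyStore`, its methods and the row-key layout `user#date` are all assumptions for illustration.

```python
from bisect import bisect_left


class SortedKeyStore:
    """Toy store that keeps rows sorted by row key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, row_key, columns):
        if row_key not in self._rows:
            self._keys.insert(bisect_left(self._keys, row_key), row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)  # variable schema: any columns

    def get(self, row_key):
        """Single select by exact row key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedKeyStore()
store.put("user1#2014-10-01", {"comment": "Cool"})
store.put("user2#2014-10-02", {"comment": "Nonono", "lang": "en"})
store.put("user2#2014-10-03", {"comment": "Hi there"})
store.put("user3#2014-10-04", {"comment": "Hadoop is awesome"})

user2_rows = list(store.scan("user2#", "user3#"))
```

Anything that cannot be phrased as a lookup or a scan over the row key (joins, group by, arbitrary where clauses) has no such shortcut, which is why row-key design matters so much.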

HBase
DEMO

Sqoop
• Import data from a structured data source (typically an RDBMS) into Hadoop
• Export data from Hadoop into structured data sources
• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders
• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'

Mahout
• Scalable Machine Learning
– Recommendation
– Classification
– Clustering

Recommendation

Classification
Example classes: Mammal, Reptile, Bird

Clustering

More information:
• Free seminar: Machine Learning in
practice
• Fri 7th of November 2014 12:00 – 16:00
• Kontich
• http://www.buzzberry.be/events/

Integrating Hadoop in your IT landscape

Tools – BigData – IT options
• Hadoop is not a trivial piece of software to manage!
• On-premise
– Commodity Hardware
– Advantage: full control & performance
– Disadvantage: required skills, migrations, backup, ...
• Cloud – Amazon AWS
– EMR (Elastic MapReduce)
– Storage in S3
– Very competitive offering financially
– Manageability and flexibility
• Cloud - IBM SoftLayer
• Hardware options (performance)

Beyond MapReduce

There is more…

Oak3 Courses
• Data Science
• Hadoop
• HBase
• http://www.oak3.be/

Questions?
Data Science Company
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
