Data Science Company 
Big Data with Apache Hadoop 
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be 
8/10/2014
Who am I 
BEN VERMEERSCH 
Big Data Consultant 
Cloudera Certified Developer for Apache Hadoop
ben.vermeersch@infofarm.be @benvermeersch
About InfoFarm 
Data Science, Big Data
Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
About InfoFarm 
• 2 Data Scientists
• 4 Big Data Consultants
• 1 Infrastructure Specialist
Java 
PHP 
E-Commerce 
Mobile 
Web 
Development
Agenda 
• 09:30 – What is Big Data? 
• 09:45 – Hadoop – HDFS & MapReduce 
• 10:00 – HDFS & MapReduce in Practice 
• 10:30 – The Hadoop Ecosystem 
• 11:30 – Examples 
• 12:00 – Wrap up and Lunch 
What is Big Data?
What is Big Data not? 
• a technology 
• a solution (certainly not a silver bullet) to any IT problem
• a replacement for an RDBMS
• a cloud storage system 
• … 
Big Data definition attempt 
“a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”
The 3 V’s
Working the (Hadoop) Big Data way 
• Bringing data processing to the data (vs. a centralized DB)
• Using unstructured or semi-structured data
• Store first, process later
• Simple techniques applied at massive scale
• Your hardware will fail! 
Hadoop (limited) overview 
• Oozie: Workflow
• HDFS: Distributed File System
• Amazon S3, Local FS
• MapReduce, YARN: Distributed Data Processing
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• …
HDFS
HDFS Rack Topology
MapReduce 
• A method for distributing tasks across multiple nodes
• Data is processed where it is stored (where possible)
• Two phases:
– Map
– Reduce
• Both phases have key-value pairs as input and output; the key and value types may be chosen by the programmer
• The output from the mappers is used by the reducers
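The two phases can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the word-count task and the names `map_fn`, `reduce_fn` and `run_job` are assumptions chosen for illustration, and a real job would run the phases as distributed tasks.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce phase: emit one (word, total) pair for the given key."""
    yield word, sum(counts)


def run_job(lines):
    """Run map, group the mapper output by key, then run reduce."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)

    output = {}
    for key in sorted(grouped):
        for out_key, out_value in reduce_fn(key, grouped[key]):
            output[out_key] = out_value
    return output


if __name__ == "__main__":
    print(run_job(["big data with hadoop", "hadoop stores big data"]))
```

Both functions take and emit key-value pairs, and the only data the reducers see is what the mappers emitted.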
Map & Reduce 
Mapper input Mapper output Reducer input Reducer output
Map function
Input.txt: Block 1, Block 2, Block 3
Node 1: Block 1, Block 2
Node 2: Block 2, Block 3
Node 3: Block 1, Block 3
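Because every block is stored on more than one node, each map task can usually run on a node that already holds its block. The sketch below uses the block layout from this slide; the greedy "least-loaded replica holder" policy and the name `assign_map_tasks` are assumptions for illustration and do not describe the actual Hadoop scheduler.

```python
# Replica locations taken from the slide: block -> nodes that store a copy.
BLOCK_LOCATIONS = {
    "Block 1": ["Node 1", "Node 3"],
    "Block 2": ["Node 1", "Node 2"],
    "Block 3": ["Node 2", "Node 3"],
}


def assign_map_tasks(block_locations):
    """Assign each block's map task to the least-loaded node holding a replica.

    Hypothetical policy, used only to show the data-locality idea.
    """
    load = {}
    assignment = {}
    for block, nodes in block_locations.items():
        chosen = min(nodes, key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


if __name__ == "__main__":
    for block, node in assign_map_tasks(BLOCK_LOCATIONS).items():
        print(block, "->", node)
```

With this layout every map task reads its block locally, so no block has to travel over the network before the map phase starts.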
Shuffle and sort 
• Hadoop automatically sorts and merges output from all map tasks
• This intermediate process is known as the shuffle and sort
• The result is supplied to reduce tasks
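A rough single-machine picture of the shuffle and sort, assuming each map task's output is a list of (key, value) pairs. The function name `shuffle_and_sort` and the sample data are hypothetical; in a real cluster this step moves data between nodes and the order of values within a key is not guaranteed.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge the (key, value) pairs of all map tasks and group them by key."""
    merged = [pair for task_output in map_outputs for pair in task_output]
    merged.sort(key=itemgetter(0))  # sort on the key only
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


if __name__ == "__main__":
    task_1 = [("john", 10), ("maria", 5)]
    task_2 = [("maria", 7), ("john", 1)]
    for key, values in shuffle_and_sort([task_1, task_2]):
        print(key, values)
```

Each (key, values) record produced here is what a single call of the reduce function receives.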
Reduce function 
• Reducer input comes from the shuffle and sort process
• The reducer:
– receives one record at a time
– receives all records for a given key
– emits zero or more output records
• Example: A reduce function sums the total per person and emits the employee name (key) and total (value) as output
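The example reduce function could look roughly like this in Python. The names `sum_reducer` and `run_reducers` and the sample amounts are assumptions for illustration; the grouped input stands in for the result of the shuffle and sort.

```python
def sum_reducer(employee, amounts):
    """Called once per key with all values for that key; emits (name, total)."""
    yield employee, sum(amounts)


def run_reducers(grouped_records):
    """Feed each (key, values) record to the reducer and collect its output."""
    results = []
    for employee, amounts in grouped_records:
        results.extend(sum_reducer(employee, amounts))
    return results


if __name__ == "__main__":
    grouped = [("Jane", [120, 30]), ("John", [50])]
    for name, total in run_reducers(grouped):
        print(name, total)
```

Writing the reducer as a generator keeps the "zero or more output records" contract: it may yield nothing, one pair, or several pairs per key.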
MapReduce under the hood 
Client ResourceManager 
Node 1 AppMaster 
Node 2 
Node 3 
HDFS
HDFS & MapReduce 
DEMO 
Joining

Input:
User  Name
1     John
2     Maria
3     Jane

User  Comment
1     Cool
2     Nonono
2     Hi there
3     Hadoop is awesome

Mapper output (users):
Key  Value
1    AJohn
2    AMaria
3    AJane

Mapper output (comments):
Key  Value
1    BCool
2    BNonono
2    BHi there
3    BHadoop is awesome

Shuffle/Sort:
Key  Values
1    AJohn; BCool
2    AMaria; BNonono; BHi there
3    AJane; BHadoop is awesome

Reducer output:
Userid  Name   Comment
1       John   Cool
2       Maria  Nonono
2       Maria  Hi there
3       Jane   Hadoop is awesome
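The join on these slides can be traced with a small Python sketch. It reuses the slide data and the A/B tagging of values; the function names (`map_users`, `map_comments`, `shuffle`, `reduce_join`, `join`) are hypothetical, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict

USERS = [(1, "John"), (2, "Maria"), (3, "Jane")]
COMMENTS = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]


def map_users(records):
    """Tag every user name with 'A' so the reducer can tell the sources apart."""
    for user_id, name in records:
        yield user_id, "A" + name


def map_comments(records):
    """Tag every comment with 'B'."""
    for user_id, comment in records:
        yield user_id, "B" + comment


def shuffle(*mapper_outputs):
    """Group the tagged values of all mappers by key, sorted on key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_join(user_id, tagged_values):
    """Combine every name (tag A) with every comment (tag B) for one key."""
    names = [value[1:] for value in tagged_values if value.startswith("A")]
    comments = [value[1:] for value in tagged_values if value.startswith("B")]
    for name in names:
        for comment in comments:
            yield user_id, name, comment


def join(users, comments):
    rows = []
    for user_id, values in shuffle(map_users(users), map_comments(comments)):
        rows.extend(reduce_join(user_id, values))
    return rows


if __name__ == "__main__":
    for row in join(USERS, COMMENTS):
        print(row)
```

The tag is what makes the join work: after the shuffle, names and comments for one user arrive mixed in a single list, and the prefix is the only way to tell them apart.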
MapReduce Design Patterns 
• More info: 
• Frameworks on top of MapReduce like 
Hive or Pig make this easier 
The Hadoop Ecosystem 
• Oozie: Workflow
• HDFS: Distributed File System
• Amazon S3, Local FS
• MapReduce, YARN: Distributed Data Processing
• HBase: NoSQL
• Hive: Data Mart
• Pig: Scripting
• Sqoop: SQL Import/Export
• Mahout: Machine Learning
• …
Apache Pig 
• Processing framework for (large) datasets
• Pig Latin
• Runs on Hadoop (or local) with MapReduce
• Extensible with UDFs
Apache Pig 
DEMO 
Apache Hive 
• SQL-like querying on Hadoop datasets
• Translates to MapReduce under the hood
• Originally developed at Facebook
• Now an Apache top-level project
Hive <-> Traditional RDBMS 
Hive:
• Schema on read
• Fast initial load
• Flexible schema
• No update or delete (only insert into)
• HiveQL (subset of SQL)
Traditional RDBMS:
• Schema on write
• Slow initial load
• Fixed schema
• Updates, deletes, inserts all possible
• SQL compliant
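Schema on read can be pictured as follows: the raw lines are stored untouched (hence the fast initial load), and a schema is applied only when a query reads them. This is a plain Python sketch, not Hive code; `RAW_LINES`, `SCHEMA` and `read_with_schema` are hypothetical names, and turning unparseable fields into `None` is an assumption that loosely mirrors a NULL value.

```python
# Raw data is loaded as-is; nothing is validated at load time.
RAW_LINES = ["1,John,34", "2,Maria,not-a-number", "3,Jane"]

# The schema lives next to the query, not next to the data.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(line, schema):
    """Apply the schema while reading; bad or missing fields become None."""
    fields = line.split(",")
    row = {}
    for index, (column, cast) in enumerate(schema):
        try:
            row[column] = cast(fields[index])
        except (IndexError, ValueError):
            row[column] = None
    return row


if __name__ == "__main__":
    for line in RAW_LINES:
        print(read_with_schema(line, SCHEMA))
```

A schema-on-write system would reject the second and third lines at load time instead; here the same raw lines can later be read again with a different schema.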
Apache Hive 
DEMO 
HBase 
• Column-oriented Data Store 
• Distributed 
• Type of NoSQL-DB 
• Based on Google BigTable 
HBase 
• Lots and lots of data
• Large number of clients
• Single selects
• Range scan by key
• Variable schema
• Not a traditional RDBMS
– Transactions
– Group by
– Join
– Where
– Like
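The access patterns listed here (single selects, range scans by key, variable schema) follow from keeping rows sorted by row key. The toy in-memory store below only illustrates that idea; `SortedKeyStore` and its methods are hypothetical and are not the HBase client API.

```python
import bisect


class SortedKeyStore:
    """Toy key-ordered store: rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, columns):
        """Insert or update a row; each row may carry different columns."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        """Single select by row key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Range scan: rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    store = SortedKeyStore()
    store.put("user3", {"name": "Jane"})
    store.put("user1", {"name": "John", "city": "Kontich"})
    store.put("user2", {"name": "Maria"})
    print(store.get("user2"))
    print(store.scan("user1", "user3"))
```

Lookups and scans only ever use the row key, which is why the operations under "Not a traditional RDBMS" do not fit this model.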
HBase 
DEMO 
Sqoop 
• Import data from a structured data source (typically an RDBMS) into Hadoop
• Export data from Hadoop into structured data sources
• sqoop import --connect jdbc:mysql://localhost/salesdb --table orders
• sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'
Mahout 
• Scalable Machine Learning
– Recommendation
– Classification
– Clustering
Recommendation
Classification 
Mammal Reptile Bird
Clustering
More information: 
• Free seminar: Machine Learning in practice
• Fri 7th of November 2014 12:00 – 16:00 
• Kontich 
• http://www.buzzberry.be/events/ 
Integrating Hadoop in your IT landscape 
Tools – BigData – IT options 
• Hadoop is not a trivial piece of software to manage! 
• On-premise 
– Commodity Hardware 
– Advantage: full control & performance 
– Disadvantage: required skills, migrations, backup, ... 
• Cloud – Amazon AWS 
– EMR (Elastic MapReduce)
– Storage in S3 
– Very competitive offering financially 
– Manageability and flexibility 
• Cloud - IBM SoftLayer 
• Hardware options (performance)
Beyond MapReduce
There is more…
Oak3 Courses 
• Data Science 
• Hadoop 
• HBase
• http://www.oak3.be/
Questions? 

Enums On Steroids - let's look at sealed classes !
Marcin Chrost
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
Severalnines
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
aisafed42
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
kalichargn70th171
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 

Recently uploaded (20)

ACE - Team 24 Wrapup event at ahmedabad.
ACE - Team 24 Wrapup event at ahmedabad.ACE - Team 24 Wrapup event at ahmedabad.
ACE - Team 24 Wrapup event at ahmedabad.
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
A Comprehensive Guide on Implementing Real-World Mobile Testing Strategies fo...
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 

Big Data with Apache Hadoop

  • 1. Data Science Company Big Data with Apache Hadoop Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be 8/10/2014
  • 2. Who am I: Ben Vermeersch, Big Data Consultant, Cloudera Certified Developer for Apache Hadoop ● ben.vermeersch@infofarm.be ● @benvermeersch
  • 3. About InfoFarm: Data Science and Big Data. Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.
  • 4. About InfoFarm: 2 Data Scientists, 4 Big Data Consultants, 1 Infrastructure Specialist
  • 5. Java, PHP, E-Commerce, Mobile, Web Development
  • 7. Agenda • 09:30 – What is Big Data? • 09:45 – Hadoop – HDFS & MapReduce • 10:00 – HDFS & MapReduce in Practice • 10:30 – The Hadoop Ecosystem • 11:30 – Examples • 12:00 – Wrap up and Lunch
  • 8. What is Big Data?
  • 9. What is Big Data not?
  • 10. What is Big Data not? • a technology • a solution (certainly not a silver bullet) to any IT problem • a replacement for an RDBMS • a cloud storage system • …
  • 11. Big Data definition attempt: “a description of a problem domain with specific challenges and solutions which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data”
  • 12. The 3 V’s
  • 14. Working the (Hadoop) Big Data way • Bringing data processing to the data (vs. a centralized DB) • Using unstructured or semi-structured data • Store first, process later • Simple techniques applied at massive scale • Your hardware will fail!
  • 15. Hadoop (limited) overview: Oozie (workflow), HDFS (distributed file system), MapReduce, Amazon S3, Local FS, YARN (distributed data processing), HBase (NoSQL), Hive (data mart), Pig (scripting), Sqoop (SQL import/export), Mahout (machine learning), …
  • 16. HDFS
  • 17. HDFS Rack Topology
  • 18. MapReduce • A method for distributing tasks across multiple nodes • Data is processed where it is stored (where possible) • Two phases: – Map – Reduce • Both phases have key-value pairs as input and output, which may be chosen by the programmer • The output from the mappers is used by the reducers
  • 19. Map & Reduce: mapper input, mapper output, reducer input, reducer output
  • 20. Map function: Input.txt is split into Block 1, Block 2 and Block 3. Node 1 holds Block 1 and Block 2; Node 2 holds Block 2 and Block 3; Node 3 holds Block 1 and Block 3.
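The block layout on slide 20 is what makes "data is processed where it is stored" possible: every block has a replica on two nodes, so each map task can run on a node that already holds its input. The sketch below is only an illustration of that idea. The greedy "least loaded replica holder" policy and the function name `assign_map_tasks` are assumptions made for this example, not how the Hadoop scheduler is actually implemented.

```python
# Block placement as drawn on slide 20: block -> nodes holding a replica.
BLOCK_LOCATIONS = {
    "Block 1": ["Node 1", "Node 3"],
    "Block 2": ["Node 1", "Node 2"],
    "Block 3": ["Node 2", "Node 3"],
}


def assign_map_tasks(block_locations):
    """Assign one map task per block to a node that already stores the block.

    Hypothetical greedy policy: among the nodes holding a replica, pick the
    one with the fewest tasks so far (ties go to the first replica listed).
    """
    load = {}        # node -> number of map tasks assigned so far
    assignment = {}  # block -> node chosen to run its map task
    for block, nodes in block_locations.items():
        chosen = min(nodes, key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


if __name__ == "__main__":
    for block, node in assign_map_tasks(BLOCK_LOCATIONS).items():
        print(block, "->", node)
```

With this layout every map task reads its block from local disk, so no input data has to cross the network.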
  • 21. Shuffle and sort • Hadoop automatically sorts and merges output from all map tasks • This intermediate process is known as the shuffle and sort • The result is supplied to reduce tasks
  • 22. Reduce function • Reducer input comes from the shuffle and sort process • The reducer receives one record at a time • It receives all records for a given key • It emits zero or more output records • Example: a reduce function sums the total per person and emits the employee name (key) and total (value) as output
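The three stages on slides 18 to 22 can be imitated in a few lines of plain Python. This is a single-process sketch, not Hadoop code: the input records and the function names (`map_record`, `shuffle_and_sort`, `reduce_totals`, `run_job`) are made up for illustration. It follows the examples given in the talk: the mapper parses a record and emits name and price, the reducer sums the total per person.

```python
from itertools import groupby
from operator import itemgetter

# Made-up input: one "employee name,price" record per line.
RECORDS = ["john,10", "maria,5", "john,7", "jane,3", "maria,20"]


def map_record(record):
    """Map: parse one input record and emit a (name, price) pair."""
    name, price = record.split(",")
    yield name, int(price)


def shuffle_and_sort(pairs):
    """Shuffle and sort: order mapper output by key, then group values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_totals(name, prices):
    """Reduce: receive all values for one key and emit (name, total)."""
    yield name, sum(prices)


def run_job(records):
    """Run map, shuffle/sort and reduce in sequence on a list of records."""
    mapped = [pair for record in records for pair in map_record(record)]
    return [
        output
        for key, values in shuffle_and_sort(mapped)
        for output in reduce_totals(key, values)
    ]


if __name__ == "__main__":
    print(run_job(RECORDS))
```

On a cluster the map calls run in parallel on different nodes and the grouping happens over the network, but the contract between the stages is the same.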
  • 23. MapReduce under the hood: Client, ResourceManager, Node 1, AppMaster, Node 2, Node 3, HDFS
  • 24. HDFS & MapReduce DEMO
  • 25. Joining: two inputs, each with its own mapper. Users (User, Name): (1, John), (2, Maria), (3, Jane). Comments (User, Comment): (1, Cool), (2, Nonono), (2, Hi there), (3, Hadoop is awesome). Both mappers emit the user ID as key and prefix the value with a tag for its source. Names get A: (1, AJohn), (2, AMaria), (3, AJane). Comments get B: (1, BCool), (2, BNonono), (2, BHi there), (3, BHadoop is awesome).
  • 26. Joining: shuffle/sort groups the tagged mapper output by key before it reaches the reducer. Key 1: AJohn; BCool. Key 2: AMaria; BNonono; BHi there. Key 3: AJane; BHadoop is awesome.
  • 27. Joining: the reducer turns the grouped values for each key into joined rows (Userid, Name, Comment): (1, John, Cool), (2, Maria, Nonono), (2, Maria, Hi there), (3, Jane, Hadoop is awesome).
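The join on slides 25 to 27 is a reduce-side join: the mappers tag each value with its source, and the reducer pairs every name with every comment that shares a key. A minimal in-memory sketch using the data from the slides follows. The function names are invented for this example, and the grouping step stands in for Hadoop's shuffle and sort.

```python
from collections import defaultdict

# Input tables taken from the slides.
USERS = [(1, "John"), (2, "Maria"), (3, "Jane")]
COMMENTS = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]


def map_users(rows):
    """Mapper for the user table: tag each name with 'A'."""
    for user_id, name in rows:
        yield user_id, "A" + name


def map_comments(rows):
    """Mapper for the comment table: tag each comment with 'B'."""
    for user_id, comment in rows:
        yield user_id, "B" + comment


def shuffle(pairs):
    """Group all tagged values by key, returned in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_join(user_id, values):
    """Reducer: split values by tag and emit one row per (name, comment) pair."""
    names = [value[1:] for value in values if value.startswith("A")]
    comments = [value[1:] for value in values if value.startswith("B")]
    for name in names:
        for comment in comments:
            yield user_id, name, comment


def join(users, comments):
    """Run both mappers, group by key, and apply the join reducer."""
    pairs = list(map_users(users)) + list(map_comments(comments))
    return [row for key, values in shuffle(pairs) for row in reduce_join(key, values)]


if __name__ == "__main__":
    for row in join(USERS, COMMENTS):
        print(row)
```

Because the reducer only emits rows when both a name and a comment exist for a key, this behaves like an inner join; a user without comments would simply not appear in the output.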
  • 28. MapReduce Design Patterns • More info: • Frameworks on top of MapReduce like Hive or Pig make this easier
  • 29. The Hadoop Ecosystem: Oozie (workflow), HDFS (distributed file system), MapReduce, Amazon S3, Local FS, YARN (distributed data processing), HBase (NoSQL), Hive (data mart), Pig (scripting), Sqoop (SQL import/export), Mahout (machine learning), …
  • 30. Apache Pig • Processing framework for (large) datasets • Pig Latin • Runs on Hadoop (or local) with MapReduce • Extensible with UDFs
  • 31. Apache Pig DEMO
  • 32. Apache Hive • SQL-like querying on Hadoop datasets • Translates to MapReduce under the hood • Originally developed at Facebook • Now Apache Top Level project
  • 33. Hive <-> Traditional RDBMS. Hive: • Schema on read • Fast initial load • Flexible schema • No update or delete (only insert into) • HiveQL (subset of SQL). Traditional RDBMS: • Schema on write • Slow initial load • Fixed schema • Updates, deletes, inserts all possible • SQL compliant
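The "schema on read" versus "schema on write" contrast on slide 33 explains why the initial load is fast in one case and slow in the other. The sketch below is a toy model in plain Python, not Hive or any database API. The data, the schema and the function names are assumptions for illustration, as is the choice to turn unparseable fields into `None` at read time.

```python
# Made-up raw data: the second row has a value that does not fit the schema.
RAW_LINES = ["1,John,42", "2,Maria,not-a-number", "3,Jane,17"]
SCHEMA = [("id", int), ("name", str), ("score", int)]


def parse(line, schema):
    """Cast one raw line to the schema; raises ValueError on a bad value."""
    values = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


def load_schema_on_write(lines, schema):
    """Schema on write: validate every row during the load (slow, strict)."""
    return [parse(line, schema) for line in lines]


def load_schema_on_read(lines):
    """Schema on read: loading just stores the raw lines (fast, no checks)."""
    return list(lines)


def query_schema_on_read(stored, schema):
    """Apply the schema only when reading; unparseable fields become None."""
    rows = []
    for line in stored:
        row = {}
        for (name, cast), value in zip(schema, line.split(",")):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


if __name__ == "__main__":
    stored = load_schema_on_read(RAW_LINES)      # always succeeds
    print(query_schema_on_read(stored, SCHEMA))  # bad field shows up as None
    try:
        load_schema_on_write(RAW_LINES, SCHEMA)
    except ValueError as error:
        print("load rejected:", error)
```

The flexible schema follows from the same design: since the stored lines are never rewritten, a different schema can be applied to them later without reloading anything.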
  • 34. Apache Hive DEMO
  • 35. HBase • Column-oriented Data Store • Distributed • Type of NoSQL-DB • Based on Google BigTable
  • 36. HBase • Lots and lots of data • Large number of clients • Single selects • Range scan by key • Variable schema • Not a traditional RDBMS – Transactions – Group by – Join – Where – Like
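The access patterns listed on slide 36 (single selects, range scans by key, variable schema) follow from keeping rows sorted by row key. The toy class below imitates that idea in memory. It is not the HBase client API: `SortedKeyStore` and its methods are hypothetical names, and the half-open scan range (start inclusive, stop exclusive) is modelled on how HBase scans are usually described.

```python
from bisect import bisect_left


class SortedKeyStore:
    """Toy key-sorted store: rows are dicts, so each row can have its own columns."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, key, columns):
        """Insert or update a row; new keys are placed at their sorted position."""
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        """Single select by row key; returns None when the key is absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Range scan over keys in [start, stop), in key order."""
        index = bisect_left(self._keys, start)
        while index < len(self._keys) and self._keys[index] < stop:
            key = self._keys[index]
            yield key, self._rows[key]
            index += 1


if __name__ == "__main__":
    store = SortedKeyStore()
    store.put("user#002", {"name": "Maria"})
    store.put("user#001", {"name": "John", "city": "Kontich"})
    store.put("user#003", {"name": "Jane"})
    print(store.get("user#001"))
    print(list(store.scan("user#001", "user#003")))
```

Anything that is not a key lookup or a key range, such as a filter on `city`, would require reading every row, which is why row key design matters so much in this kind of store.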
  • 37. HBase DEMO
  • 38. Sqoop • Import data from structured data source (typically RDBMS) into Hadoop • Export data into structured data sources from Hadoop • sqoop import --connect jdbc:mysql://localhost/salesdb --table orders • sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'
  • 39. Mahout • Scalable Machine Learning: Recommendation, Classification, Clustering
  • 40. Recommendation
  • 41. Classification: Mammal, Reptile, Bird
  • 42. Clustering
  • 43. More information: • Free seminar: Machine Learning in practice • Fri 7th of November 2014 12:00 – 16:00 • Kontich • http://www.buzzberry.be/events/
  • 44. Integrating Hadoop in your IT landscape
  • 45. Tools – BigData – IT options • Hadoop is not a trivial piece of software to manage! • On-premise – Commodity Hardware – Advantage: full control & performance – Disadvantage: required skills, migrations, backup, ... • Cloud – Amazon AWS – EMR (Elastic MapReduce) – Storage in S3 – Very competitive offering financially – Manageability and flexibility • Cloud – IBM SoftLayer • Hardware options (performance)
  • 46. Beyond MapReduce
  • 47. There is more…
  • 48. Oak3 Courses • Data Science • Hadoop • HBase • http://www.oak3.be/
  • 49. Questions?

Editor's Notes

  1. Volume: we store more and more data, from more and more sources. As data storage costs fall, more and more data is stored ‘just in case’. And that’s a good thing! Velocity: streaming data, sensor data, data coming from Facebook or Twitter. Data is sometimes no longer relevant after a few minutes. Variety: different sources, such as social networks, healthcare data, wearables, GPS, …
  2. Hadoop was started in 2005 by Doug Cutting and Mike Cafarella (Cutting joined Yahoo! in 2006, where much of its development took place) and is named after Cutting’s son’s toy elephant. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization.
  3. Hadoop splits jobs into individual map tasks. The number of map tasks is determined by the amount of input data. Each map task receives a portion of the overall job input to process (taking data locality into account). Mappers process one input record at a time. For each input record, they emit zero or more records as output. Example: a map task simply parses the input record and emits the name and price fields for each as output.
  4. Schubben (Dutch) = scales
  5. HDFS = PItA