Jan Pieter Posthuma – Inter Access
ETL with Hadoop and MapReduce
2
Introduction
 Jan Pieter Posthuma
 Technical Lead Microsoft BI and Big Data consultant
 Inter Access, a local consultancy firm in the Netherlands
 Architect role on multiple projects
 Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl
3
Expectations
What to cover
 Simple ETL, so simple sources
 A different way to achieve the result
What not to cover
 Big Data
 Best practices
 Deep Hadoop internals
4
Agenda
 Hadoop
 HDFS
 Map/Reduce
– Demo
 Hive and Pig
– Demo
 Polybase
5
Hadoop
 Hadoop is a collection of software to create a data-intensive
distributed cluster running on commodity hardware.
 Widely accepted by database vendors as a solution for unstructured data
 Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight
 Available on premises and as an Azure service
 Hortonworks Data Platform (HDP) is 100% open source!
6
Hadoop
[Architecture diagram: big data sources (crawlers, bots, devices, sensors – raw, unstructured) feed HDInsight on Windows Azure / Windows Server and SQL Server StreamInsight (alerts, notifications, data & compute intensive applications); source systems (ERP, CRM, LOB apps) and historical data beyond the active window are integrated and enriched with enterprise ETL (SSIS, DQS, MDS) and summarized & loaded (FastLoad) into SQL Server Parallel Data Warehouse and SQL Server FTDW data marts; SQL Server Reporting Services and Analysis Services deliver business insights, interactive reports and performance scorecards; the Azure Marketplace is an additional source. A callout shows exporting DimCustomer to HDFS as an external table:]
CREATE EXTERNAL TABLE Customer
WITH (LOCATION = 'hdfs://10.13.12.14:5000/user/Hadoop/Customer',
      FORMAT_OPTIONS (FIELDS_TERMINATOR = ','))
AS SELECT * FROM DimCustomer
7
Hadoop
 HDFS – distributed, fault tolerant file system
 MapReduce – framework for writing/executing distributed, fault tolerant algorithms
 Hive & Pig – SQL-like declarative languages
 Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
 + Others…
[Stack diagram: HDFS at the base, with Map/Reduce, Hive & Pig and Sqoop/Polybase layered on top, and Avro (serialization), HBase and ZooKeeper alongside; ETL tools, BI reporting and RDBMS systems connect from the outside.]
8
HDFS
[Diagram: a large file of 6440 MB is split into blocks of 64 MB (the example block size) – blocks 1 through 100 of 64 MB each plus a final block 101 of 40 MB – and the blocks are color-coded to show how they are spread across the cluster.]
HDFS
Files are composed of a set of blocks
• Typically 64MB in size
• Each block is stored as a separate
file in the local file system (e.g.
NTFS)
9
HDFS
[Diagram: a NameNode (with a BackupNode keeping namespace backups) coordinates a set of DataNodes via heartbeats, balancing and replication; the DataNodes write blocks to local disk.]
HDFS was designed with the
expectation that failures (both
hardware and software) would
occur frequently
10
Map/Reduce
 Programming framework (library and runtime) for analyzing
data sets stored in HDFS
 MR framework provides all the “glue” and coordinates the
execution of the Map and Reduce jobs on the cluster.
– Fault tolerant
– Scalable
Map function:
var map = function(key, value, context) {}
Reduce function:
var reduce = function(key, values, context) {}
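To make the shape of these two functions concrete, here is a minimal word-count sketch written in the same JavaScript style the demos on the next slides use. A hedged example, not part of the original deck: it assumes the HDInsight JavaScript conventions shown there (context.write() to emit a pair, values.hasNext()/values.next() to walk the grouped values).

// Hypothetical word-count job, in the JavaScript MapReduce style used by the demos below.
var map = function (key, value, context) {
    // value is one line of input text; emit <word, 1> for every word on the line
    var words = value.split(' ');
    for (var i = 0; i < words.length; i++) {
        if (words[i] != '') {
            context.write(words[i], 1);
        }
    }
};
var reduce = function (key, values, context) {
    // values iterates over all counts emitted for this word; sum them up
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);   // emit <word, total count>
};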
11
Map/Reduce
[Diagram: mappers running on each DataNode emit <key, value> pairs; the framework sorts and groups them by key; each reducer then receives one <key, list(value, value, …)> group and writes its part of the output.]
12
Demo
 Weather info: Need daily max and min temperature per station
var map = function (key, value, context) {
    if (value[0] != '#') {                           // skip the header/comment lines
        var allValues = value.split(',');
        if (allValues[7].trim() != '') {             // field 7 = temperature T
            context.write(allValues[0] + '-' + allValues[1],
                allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
        }
    }
};
Output <key, value>:
<“210-19510101”, “210,19510101,-4”>
<“210-19510101”, “210,19510101,1”>
# STN,YYYYMMDD,HH, DD,FH, FF,FX, T,T10,TD,SQ, Q,DR,RH, P,VV, N, U,WW,IX, M, R, S, O, Y
#
210,19510101, 1,200, , 93, ,-4, , , , , , ,9947, , 8, , 5, , , , , ,
210,19510101, 2,190, ,108, , 1, , , , , , ,9937, , 8, , 5, , 0, 0, 0, 0, 0
13
Demo (cont.)
var reduce = function (key, values, context) {
    var mMax = -9999;
    var mMin = 9999;
    var mKey = key.split('-');
    while (values.hasNext()) {
        var mValues = values.next().split(',');
        var t = parseFloat(mValues[2]);              // compare as a number, not as a string
        mMax = t > mMax ? t : mMax;
        mMin = t < mMin ? t : mMin;
    }
    context.write(key.trim(),
        mKey[0].toString() + '\t' +                  // station \t date \t max \t min
        mKey[1].toString() + '\t' +
        mMax.toString() + '\t' +
        mMin.toString());
};
Map Output <key, value>:
<“210-19510101”, “210,19510101,-4”>
<“210-19510101”, “210,19510101,1”>
Reduce Input <key, values:=list(value1, …, valuen)>:
<“210-19510101”, {“210,19510101,-4”, “210,19510101,1”}>
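To check what the map and reduce functions above produce without a cluster, a small stand-alone harness can mimic the framework in plain JavaScript. This is hypothetical test code, not part of the demo: it assumes the map and reduce functions above are in scope (e.g. pasted into one Node.js script) and that the framework's only job here is to group the emitted pairs by key.

// Hypothetical local harness: run map over sample lines, group by key, run reduce per key.
var lines = [
    '# STN,YYYYMMDD,HH, DD,FH, FF,FX, T',            // header line, skipped by the map
    '210,19510101, 1,200, , 93, ,-4',                // field 7 = temperature -4
    '210,19510101, 2,190, ,108, , 1'                 // field 7 = temperature 1
];
var emitted = {};                                    // key -> list of emitted values
var mapContext = { write: function (k, v) { (emitted[k] = emitted[k] || []).push(v); } };
lines.forEach(function (line) { map(null, line, mapContext); });

Object.keys(emitted).forEach(function (k) {          // the "sort and group by key" phase
    var i = 0, vals = emitted[k];
    var values = {                                   // iterator-like wrapper expected by reduce
        hasNext: function () { return i < vals.length; },
        next: function () { return vals[i++]; }
    };
    reduce(k, values, { write: function (rk, rv) { console.log(rk, rv); } });
});
// Expected console output: 210-19510101 210<tab>19510101<tab>1<tab>-4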
Demo
15
Hive and Pig
Query:
Find the sourceIP address that generated the most adRevenue along
with its average pageRank
Rankings
(
pageURL STRING,
pageRank INT,
avgDuration INT
);
UserVisits
(
sourceIP STRING,
destURL STRING,
visitDate DATE,
adRevenue FLOAT,
.. // fields omitted
);
package edu.brown.cs.mapreduce.benchmarks;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.fs.*;
import edu.brown.cs.mapreduce.BenchmarkBase;
public class Benchmark3 extends Configured implements Tool {
public static String getTypeString(int type) {
if (type == 1) {
return ("UserVisits");
} else if (type == 2) {
return ("Rankings");
}
return ("INVALID");
}
/* (non-Javadoc)
* @see org.apache.hadoop.util.Tool#run(java.lang.String[])
*/
public int run(String[] args) throws Exception {
BenchmarkBase base = new BenchmarkBase(this.getConf(), this.getClass(), args);
Date startTime = new Date();
System.out.println("Job started: " + startTime);
// Phase #1
// -------------------------------------------
JobConf p1_job = base.getJobConf();
p1_job.setJobName(p1_job.getJobName() + ".Phase1");
Path p1_output = new Path(base.getOutputPath().toString() + "/phase1");
FileOutputFormat.setOutputPath(p1_job, p1_output);
//
// Make sure we have our properties
//
String required[] = { BenchmarkBase.PROPERTY_START_DATE,
BenchmarkBase.PROPERTY_STOP_DATE };
for (String req : required) {
if (!base.getOptions().containsKey(req)) {
System.err.println("ERROR: The property '" + req + "' is not set");
System.exit(1);
}
} // FOR
p1_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
if (base.getSequenceFile()) p1_job.setOutputFormat(SequenceFileOutputFormat.class);
p1_job.setOutputKeyClass(Text.class);
p1_job.setOutputValueClass(Text.class);
p1_job.setMapperClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableMap.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextMap.class);
p1_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextReduce.class);
p1_job.setCompressMapOutput(base.getCompress());
// Phase #2
// -------------------------------------------
JobConf p2_job = base.getJobConf();
p2_job.setJobName(p2_job.getJobName() + ".Phase2");
p2_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
if (base.getSequenceFile()) p2_job.setOutputFormat(SequenceFileOutputFormat.class);
p2_job.setOutputKeyClass(Text.class);
p2_job.setOutputValueClass(Text.class);
p2_job.setMapperClass(IdentityMapper.class);
p2_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TextReduce.class);
p2_job.setCompressMapOutput(base.getCompress());
// Phase #3
// -------------------------------------------
JobConf p3_job = base.getJobConf();
p3_job.setJobName(p3_job.getJobName() + ".Phase3");
p3_job.setNumReduceTasks(1);
p3_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
p3_job.setOutputKeyClass(Text.class);
p3_job.setOutputValueClass(Text.class);
//p3_job.setMapperClass(Phase3Map.class);
p3_job.setMapperClass(IdentityMapper.class);
p3_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TextReduce.class);
//
// Execute #1
//
base.runJob(p1_job);
//
// Execute #2
//
Path p2_output = new Path(base.getOutputPath().toString() + "/phase2");
FileOutputFormat.setOutputPath(p2_job, p2_output);
FileInputFormat.setInputPaths(p2_job, p1_output);
base.runJob(p2_job);
//
// Execute #3
//
Path p3_output = new Path(base.getOutputPath().toString() + "/phase3");
FileOutputFormat.setOutputPath(p3_job, p3_output);
FileInputFormat.setInputPaths(p3_job, p2_output);
base.runJob(p3_job);
// There does need to be a combine if (base.getCombine()) base.runCombine();
return 0;
}
}
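For comparison, the same benchmark query fits in a few lines of HiveQL. A hedged sketch, not taken from the deck: it assumes the UserVisits and Rankings schemas above already exist as Hive tables.

-- Hypothetical HiveQL equivalent of the multi-phase MapReduce job above:
-- the sourceIP with the highest total adRevenue, plus the average pageRank of the pages it visited.
SELECT uv.sourceIP,
       SUM(uv.adRevenue) AS totalRevenue,
       AVG(r.pageRank)   AS avgPageRank
FROM UserVisits uv
JOIN Rankings r ON (uv.destURL = r.pageURL)
GROUP BY uv.sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;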
16
Hive and Pig
 Principle is the same: easy data retrieval
 Both use MapReduce
 Different founders: Facebook (Hive) and Yahoo (Pig)
 Different languages: SQL-like (Hive) and more procedural (Pig)
 Both can store data in tables, which are stored as HDFS file(s)
 Extra language options to use the benefits of Hadoop
– Partition by statement
– Map/Reduce statement
'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL'
17
Hive
Query 1: SELECT count_big(*) FROM lineitem
Query 2: SELECT max(l_quantity) FROM lineitem
         WHERE l_orderkey > 1000 and l_orderkey < 100000
         GROUP BY l_linestatus

Runtime (secs.)   Query 1   Query 2
Hive              1318      1397
PDW               252       279
18
Demo
 Use the same data file as previous demo
 But now we directly 'query' the file
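A hedged illustration of what that direct query could look like in HiveQL. The table name, column list and HDFS location below are assumptions made for this sketch; they are not the demo's actual script.

-- Hypothetical HiveQL: expose the comma-separated weather file as an external table
-- and compute the daily max/min temperature per station, like the MapReduce demo did.
CREATE EXTERNAL TABLE weather (stn STRING, yyyymmdd STRING, hh INT, dd INT,
                               fh INT, ff INT, fx INT, t INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/Hadoop/weather';

SELECT stn, yyyymmdd, MAX(t) AS maxTemp, MIN(t) AS minTemp
FROM weather
WHERE t IS NOT NULL          -- header lines and empty temperature fields parse as NULL
GROUP BY stn, yyyymmdd;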
Demo
20
Polybase
 PDW v2 introduces external tables to represent HDFS data
 PDW queries can now span HDFS and PDW data
 Hadoop cluster is not part of the appliance
[Diagram: unstructured data from social apps, sensor & RFID, mobile apps and web apps lands in HDFS; structured data lives in relational databases (RDBMS); the enhanced PDW query engine spans both through T-SQL, with Sqoop/Polybase as the bridge.]
Polybase
[Diagram: a PDW cluster of SQL Server nodes ('This is PDW!') sits next to a Hadoop cluster of DataNodes (DN); Polybase lets PDW queries reach into the Hadoop cluster.]
22
PDW Hadoop
1. Retrieve data from HDFS with a PDW query
– Seamlessly join structured and semi-structured data
2. Import data from HDFS to PDW
– Parallelized CREATE TABLE AS SELECT (CTAS)
– External tables as the source
– PDW table, either replicated or distributed, as destination
3. Export data from PDW to HDFS
– Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
– External table as the destination; creates a set of HDFS files
SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
23
Recap
 Hadoop is the next big thing for DWH/BI
 Not a replacement, but a new dimension
 Many ways to integrate its data
 What's next?
– Polybase combined with (custom) Map/Reduce?
– HDInsight appliance?
– Polybase for SQL Server vNext?
24
References
 Microsoft BigData (HDInsight):
http://www.microsoft.com/bigdata
 Microsoft HDInsight Azure (3 months free trial):
http://www.windowsazure.com
 Hortonworks Data Platform sandbox (VMware):
http://hortonworks.com/download/
Q&A
Coming up…
Speaker            Title                                                      Room
Alberto Ferrari    DAX Query Engine Internals                                 Theatre
Wesley Backelant   An introduction to the wonderful world of OData            Exhibition B
Bob Duffy          Windows Azure For SQL folk                                 Suite 3
Dejan Sarka        Excel 2013 Analytics                                       Suite 1
Mladen Prajdić     From SQL Traces to Extended Events. The next big switch.   Suite 2
Sandip Pani        New Analytic Functions in SQL server 2012                  Suite 4
#SQLBITS