SlideShare a Scribd company logo
Henk van der Valk
Technical Sales Professional
Jan Pieter Posthuma
Microsoft BI Consultant
7/11/2013

Hadoop
Access to online
training content

JOIN THE PASS
COMMUNITY
Become a PASS member for free
and join the world‟s biggest SQL
Server Community.

Join Local
Chapters

Personalize your PASS website experience

Access to events at
discounted rates

Join Virtual
Chapters

2
Agenda
•
•
•
•
•
•
•
•
•

Introduction
Hadoop
HDFS
Data access to HDFS
Map/Reduce
Hive
Data access from HDFS
SQL PDW PolyBase
Wrap up

3
Introduction Henk
•
•
•
•
•

10 years of Unisys-EMEA Performance Center
2002- Largest SQL DWH in the world (SQL2000)
Project Real – (SQL 2005)
ETL WR - loading 1TB within 30 mins (SQL 2008)
Contributed to various SQL whitepapers

•
•
•

Schuberg Philis-100% uptime for mission critical applications
Since april 1st, 2011 – Microsoft SQL PDW - Western Europe
SQLPass speaker & volunteer since 2005

4
Introduction

Alerts, Notifications
SQL Server
StreamInsight

Big Data Sources
(Raw, Unstructured)
Data & Compute Intensive
Application

Business
Insights
SQL Server FTDW Data
Marts

Sensors

Load

SQL Server Reporting
Services

Fast

Devices

Summarize &
Load

HDInsight on
Windows Azure

Bots

HDInsight on
Windows Server

SQL Server Parallel Data
Warehouse

Historical Data
(Beyond Active Window)

Interactive
Reports

Integrate/Enrich

SQL Server Analysis
Server

Crawlers

Performance
Scorecards
Azure Market Place

Enterprise ETL with
SSIS, DQS, MDS

ERP

CRM

LOB

APPS

Source Systems

5
Introduction Jan Pieter Posthuma
Jan Pieter Posthuma
• Technical Lead Microsoft BI and
Big Data consultant
• Inter Access, local consultancy firm in the
Netherlands
• Architect role at multiple projects
• Analysis Service, Reporting Service,
PerformancePoint Service, Big Data,
HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl

6
Hadoop
• Hadoop is a collection of software to create a data-intensive
distributed cluster running on commodity hardware.
• Original idea by Google (2003).
• Widely accepted by Database vendors as a solution for unstructured
data
• Microsoft partners with HortonWorks and delivers their Hadoop
Data Platform as Microsoft HDInsight
• Available as an Azure service and on premise
• HortonWorks Data Platform (HDP) 100% Open Source!

7

7
Hadoop

Map/
Reduce
HBase
HDFS

Poly
base

Avro (Serialization)

Zookeeper

• HDFS – distributed, fault tolerant file system
• MapReduce – framework for writing/executing distributed, fault
tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – package
for moving data between HDFS
BI
ETL
RDBMS
Reporting Tools
and relational DB systems
• + Others…
Hive & Pig
• Hadoop 2.0
Sqoop /

8
HDFS
Large File

1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001
…

6440MB
Let‟s color-code them
Block
1

Block
2

Block
3

Block
4

Block
5

Block
6

64MB

64MB

64MB

64MB

64MB

64MB

e.g., Block Size = 64MB

…

Block
100

Block
101

64MB

40MB

Files are composed of set of blocks
• Typically 64MB in size
• Each block is stored as a separate file
in the local file system (e.g. NTFS)
9

9
HDFS
NameNode

HDFS was designed with the
expectation that failures (both
hardware and software) would
occur frequently

BackupNode

namespace backups

Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on primary
NameNode

(heartbeat, balancing, replication, etc.)

DataNode

DataNode

DataNode

DataNode

nodes write to local disk

DataNode
Data access to HDFS
FTP – Upload your data files
Streaming – Via AVRO (RPC) or Flume
Hadoop command – hadoop fs -copyFromLocal
Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB
storage instead of local VM storage. Data can be uploaded without a
provisioned Hadoop cluster
• PolyBase – Feature of PDW 2012. Direct read/write data access to
the datanodes.
•
•
•
•

11
Data access
Hadoop command Demo

12
13
Map/Reduce
• MR: all functions in a batch oriented architecture
•
•

Map: Apply the logic to the data, eg page hits count.
Reduce: Reduces (aggregate) the results of the Mappers to one.

• YARN: split the JobTracker in to Resource Manager and Node
Manager. And MR in Hadoop 2.0 uses YARN as its JobTacker

14
Map/Reduce
Total page hits

15
Hive
•
•
•
•
•
•
•
•
•

Build for easy data retrieval
Uses Map/Reduce
Created by Facebook
HiveQL: SQL like language
Stores data in tables, which are stored as HDFS file(s)
Only initial INSERT supported, no UPDATE or DELETE
External tables possible on existing (CSV) file(s)
Extra language options to use benefits of Hadoop
Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12).
Improve Hive 100x

16
Hive
Star schema join – (Based on TPC-DS Query 27)
SELECT
col5, avg(col6)
FROM
store_sales_fact ssf
41 GB
join item_dim on (ssf.col1 = item_dim.col1)
58 MB
join date_dim on (ssf.col2 = date_dim.col2)
11 MB
join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)
80 MB
join store_dim on (ssf.col4 = store_dim.col4)
106 KB
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB)
17
Hive
File Type

# MR jobs

Input Size

# Mappers

Time

Text / Hive 0.10

5

43.1 GB

179

21:00 min

Text / Hive 0.11

1

38.0 GB

151

4:06 min

RC / Hive 0.11

1

8.21 GB

76

2:16 min

ORC / Hive 0.11

1

2.83 GB

38

1:44 min

RC / Hive 0.11 /
Partitioned /
Bucketed

1

1.73 GB

19

1:44 min

ORC / Hive 0.11 /
Partitioned /
Bucketed

1

687 MB

27

01:19 min

Data: ~64x less data
Time; ~16x times faster
18
Data access from Hadoop
Excel
FTP
Hadoop command – hadoop fs -copyToLocal
ODBC[1] – Via Hive (HiveQL) data can be extracted.
Power Query – Is capable of extracting data directly from HDFS or
Azure BLOB storage
• PolyBase – Feature of PDW 2012. Direct read/write data access to
the datanodes.
•
•
•
•
•

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com
19
Data access
Excel 2013 Demo

20
21
PDW – Polybase

…
SQL Server
SQL Server

SQL Server

SQL Server

Sqoop

This is PDW!

DN

DN

DN

DN

DN

DN

DN

DN

DN

DN

DN

DN

Hadoop Cluster

22

22
PDW – External Tables
• An external table is PDW‟s representation of data residing in HDFS
• The “table” (metadata) lives in the context of a SQL Server database

• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees
CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION =‘<URI>’,[FORMAT_OPTIONS = (<VALUES>)])}
[;]

Required to indicate
location of Hadoop cluster

Optional format options
associated with parsing of data
from HDFS (e.g. field delimiters
& reject-related thresholds)

23
PDW – Hadoop use cases & examples

[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data
SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID
AND c.URL=‘www.bing.com’;

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination
CREATE TABLE ClickStreamInPDW WITH DISTRIBUTION = HASH(URL)
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files
CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION =‘hdfs://MyHadoop:5000/joe’, FORMAT_OPTIONS (...)
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
SQL Server 2012
PDW
Polybase demo

25
Wrap up
Hadoop „just another data source‟ @ your fingertips!
Batch processing large datasets before loading into your DWH
Offloading DWH data, but still accessible for analysis/reporting

Integrate Hadoop via SQOOP, ODBC (Hive) or PolyBase
Near future: deeply integration between Hadoop and SQL PDW
Try Hadoop / HDInsight yourself:
Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
Web PI: http://www.microsoft.com/web/downloads/platform.aspx

26
Q&A

27
References
Microsoft Big Data
http://www.microsoft.com/bigdata
Windows Azure HDInsight Service (3 months free trail)
http://www.windowsazure.com/en-us/services/hdinsight/
SQL Server Parallel Data Warehouse (PDW) Landing Page

http://www.microsoft.com/PDW
http://www.upgradetopdw.com
Introduction to Polybase
http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx
28

28
Thanks!

29

More Related Content

What's hot

Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
Anshul Bhatnagar
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Uday Vakalapudi
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
DerrekYoungDotCom
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
rebeccatho
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
Kannappan Sirchabesan
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
Kelly Technologies
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ran Ziv
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
joelcrabb
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hari Shankar Sreekumar
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
Jazan University
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
Rommel Garcia
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
Konstantin V. Shvachko
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
WANdisco Plc
 
Hadoop - Introduction to Hadoop
Hadoop - Introduction to HadoopHadoop - Introduction to Hadoop
Hadoop - Introduction to Hadoop
Vibrant Technologies & Computers
 

What's hot (20)

Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Hadoop - Introduction to Hadoop
Hadoop - Introduction to HadoopHadoop - Introduction to Hadoop
Hadoop - Introduction to Hadoop
 

Viewers also liked

World of Watson 2016 - Information Insecurity
World of Watson 2016 - Information InsecurityWorld of Watson 2016 - Information Insecurity
World of Watson 2016 - Information Insecurity
Keith Redman
 
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
Vincent Kwon
 
SQLBits XI - ETL with Hadoop
SQLBits XI - ETL with HadoopSQLBits XI - ETL with Hadoop
SQLBits XI - ETL with Hadoop
Jan Pieter Posthuma
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
Maulik Thaker
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with Talend
Edureka!
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 

Viewers also liked (6)

World of Watson 2016 - Information Insecurity
World of Watson 2016 - Information InsecurityWorld of Watson 2016 - Information Insecurity
World of Watson 2016 - Information Insecurity
 
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
InfoSphere: Leading from the Front - Accelerating Data Integration through Me...
 
SQLBits XI - ETL with Hadoop
SQLBits XI - ETL with HadoopSQLBits XI - ETL with Hadoop
SQLBits XI - ETL with Hadoop
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Simplifying Big Data ETL with Talend
Simplifying Big Data ETL with TalendSimplifying Big Data ETL with Talend
Simplifying Big Data ETL with Talend
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 

Similar to SQLRally Amsterdam 2013 - Hadoop

Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
Hortonworks
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
MaharajothiP
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
Michael Rainey
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
JasmineMichael1
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
Data Con LA
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
Sreenu Musham
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
maharajothip1
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
Roxycodone Online
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseHenk van der Valk
 
hive hadoop sql
hive hadoop sqlhive hadoop sql
hive hadoop sql
ssuserf8f9b2
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
vishal choudhary
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
מיכאל
מיכאלמיכאל
מיכאל
sqlserver.co.il
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
Stéphane Fréchette
 

Similar to SQLRally Amsterdam 2013 - Hadoop (20)

Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 
hive hadoop sql
hive hadoop sqlhive hadoop sql
hive hadoop sql
 
Hadoop Distributed file system.pdf
Hadoop Distributed file system.pdfHadoop Distributed file system.pdf
Hadoop Distributed file system.pdf
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 

More from Jan Pieter Posthuma

Power BI for Developers
Power BI for DevelopersPower BI for Developers
Power BI for Developers
Jan Pieter Posthuma
 
Extending Power BI with your own custom visual
Extending Power BI with your own custom visualExtending Power BI with your own custom visual
Extending Power BI with your own custom visual
Jan Pieter Posthuma
 
Extending Power BI with your own custom visual
Extending Power BI with your own custom visualExtending Power BI with your own custom visual
Extending Power BI with your own custom visual
Jan Pieter Posthuma
 
Azure Global Bootcamp - CIS Handson
Azure Global Bootcamp - CIS HandsonAzure Global Bootcamp - CIS Handson
Azure Global Bootcamp - CIS Handson
Jan Pieter Posthuma
 
Extending Power BI With Your Own Custom Visual
Extending Power BI With Your Own Custom VisualExtending Power BI With Your Own Custom Visual
Extending Power BI With Your Own Custom Visual
Jan Pieter Posthuma
 
PBIG - Power BI en R visuals
PBIG - Power BI en R visualsPBIG - Power BI en R visuals
PBIG - Power BI en R visuals
Jan Pieter Posthuma
 
SQLSaturday 551 - Extending Power BI
SQLSaturday 551 - Extending Power BISQLSaturday 551 - Extending Power BI
SQLSaturday 551 - Extending Power BI
Jan Pieter Posthuma
 
SQLServer Days - Power BI Custom Visuals
SQLServer Days - Power BI Custom VisualsSQLServer Days - Power BI Custom Visuals
SQLServer Days - Power BI Custom Visuals
Jan Pieter Posthuma
 
TechDays - Power BI Custom Visuals
TechDays - Power BI Custom VisualsTechDays - Power BI Custom Visuals
TechDays - Power BI Custom Visuals
Jan Pieter Posthuma
 
SQLSaturday 541 - Extending Power BI
SQLSaturday 541 - Extending Power BISQLSaturday 541 - Extending Power BI
SQLSaturday 541 - Extending Power BI
Jan Pieter Posthuma
 
Power BI API
Power BI APIPower BI API
Power BI API
Jan Pieter Posthuma
 

More from Jan Pieter Posthuma (11)

Power BI for Developers
Power BI for DevelopersPower BI for Developers
Power BI for Developers
 
Extending Power BI with your own custom visual
Extending Power BI with your own custom visualExtending Power BI with your own custom visual
Extending Power BI with your own custom visual
 
Extending Power BI with your own custom visual
Extending Power BI with your own custom visualExtending Power BI with your own custom visual
Extending Power BI with your own custom visual
 
Azure Global Bootcamp - CIS Handson
Azure Global Bootcamp - CIS HandsonAzure Global Bootcamp - CIS Handson
Azure Global Bootcamp - CIS Handson
 
Extending Power BI With Your Own Custom Visual
Extending Power BI With Your Own Custom VisualExtending Power BI With Your Own Custom Visual
Extending Power BI With Your Own Custom Visual
 
PBIG - Power BI en R visuals
PBIG - Power BI en R visualsPBIG - Power BI en R visuals
PBIG - Power BI en R visuals
 
SQLSaturday 551 - Extending Power BI
SQLSaturday 551 - Extending Power BISQLSaturday 551 - Extending Power BI
SQLSaturday 551 - Extending Power BI
 
SQLServer Days - Power BI Custom Visuals
SQLServer Days - Power BI Custom VisualsSQLServer Days - Power BI Custom Visuals
SQLServer Days - Power BI Custom Visuals
 
TechDays - Power BI Custom Visuals
TechDays - Power BI Custom VisualsTechDays - Power BI Custom Visuals
TechDays - Power BI Custom Visuals
 
SQLSaturday 541 - Extending Power BI
SQLSaturday 541 - Extending Power BISQLSaturday 541 - Extending Power BI
SQLSaturday 541 - Extending Power BI
 
Power BI API
Power BI APIPower BI API
Power BI API
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 

SQLRally Amsterdam 2013 - Hadoop

  • 1. Henk van der Valk Technical Sales Professional Jan Pieter Posthuma Microsoft BI Consultant 7/11/2013 Hadoop
  • 2. Access to online training content JOIN THE PASS COMMUNITY Become a PASS member for free and join the world‟s biggest SQL Server Community. Join Local Chapters Personalize your PASS website experience Access to events at discounted rates Join Virtual Chapters 2
  • 3. Agenda • • • • • • • • • Introduction Hadoop HDFS Data access to HDFS Map/Reduce Hive Data access from HDFS SQL PDW PolyBase Wrap up 3
  • 4. Introduction Henk • • • • • 10 years of Unisys-EMEA Performance Center 2002- Largest SQL DWH in the world (SQL2000) Project Real – (SQL 2005) ETL WR - loading 1TB within 30 mins (SQL 2008) Contributed to various SQL whitepapers • • • Schuberg Philis-100% uptime for mission critical applications Since april 1st, 2011 – Microsoft SQL PDW - Western Europe SQLPass speaker & volunteer since 2005 4
  • 5. Introduction Alerts, Notifications SQL Server StreamInsight Big Data Sources (Raw, Unstructured) Data & Compute Intensive Application Business Insights SQL Server FTDW Data Marts Sensors Load SQL Server Reporting Services Fast Devices Summarize & Load HDInsight on Windows Azure Bots HDInsight on Windows Server SQL Server Parallel Data Warehouse Historical Data (Beyond Active Window) Interactive Reports Integrate/Enrich SQL Server Analysis Server Crawlers Performance Scorecards Azure Market Place Enterprise ETL with SSIS, DQS, MDS ERP CRM LOB APPS Source Systems 5
  • 6. Introduction Jan Pieter Posthuma Jan Pieter Posthuma • Technical Lead Microsoft BI and Big Data consultant • Inter Access, local consultancy firm in the Netherlands • Architect role at multiple projects • Analysis Service, Reporting Service, PerformancePoint Service, Big Data, HDInsight, Cloud BI http://twitter.com/jppp http://linkedin.com/jpposthuma jan.pieter.posthuma@interaccess.nl 6
  • 7. Hadoop • Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware. • Original idea by Google (2003). • Widely accepted by Database vendors as a solution for unstructured data • Microsoft partners with HortonWorks and delivers their Hadoop Data Platform as Microsoft HDInsight • Available as an Azure service and on premise • HortonWorks Data Platform (HDP) 100% Open Source! 7 7
  • 8. Hadoop Map/ Reduce HBase HDFS Poly base Avro (Serialization) Zookeeper • HDFS – distributed, fault tolerant file system • MapReduce – framework for writing/executing distributed, fault tolerant algorithms • Hive & Pig – SQL-like declarative languages • Sqoop/PolyBase – package for moving data between HDFS BI ETL RDBMS Reporting Tools and relational DB systems • + Others… Hive & Pig • Hadoop 2.0 Sqoop / 8
  • 9. HDFS Large File 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 1100101010011100101010011100101010011100101010011100110010101001110010101001110010101001110010101001110010101001 … 6440MB Let‟s color-code them Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 64MB 64MB 64MB 64MB 64MB 64MB e.g., Block Size = 64MB … Block 100 Block 101 64MB 40MB Files are composed of set of blocks • Typically 64MB in size • Each block is stored as a separate file in the local file system (e.g. NTFS) 9 9
  • 10. HDFS NameNode HDFS was designed with the expectation that failures (both hardware and software) would occur frequently BackupNode namespace backups Hadoop 2.0 is more decentralized • Interaction between DataNodes • Less dependent on primary NameNode (heartbeat, balancing, replication, etc.) DataNode DataNode DataNode DataNode nodes write to local disk DataNode
  • 11. Data access to HDFS FTP – Upload your data files Streaming – Via AVRO (RPC) or Flume Hadoop command – hadoop fs -copyFromLocal Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster • PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes. • • • • 11
  • 13. 13
  • 14. Map/Reduce • MR: all functions in a batch oriented architecture • • Map: Apply the logic to the data, eg page hits count. Reduce: Reduces (aggregate) the results of the Mappers to one. • YARN: split the JobTracker in to Resource Manager and Node Manager. And MR in Hadoop 2.0 uses YARN as its JobTacker 14
  • 16. Hive • • • • • • • • • Build for easy data retrieval Uses Map/Reduce Created by Facebook HiveQL: SQL like language Stores data in tables, which are stored as HDFS file(s) Only initial INSERT supported, no UPDATE or DELETE External tables possible on existing (CSV) file(s) Extra language options to use benefits of Hadoop Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Improve Hive 100x 16
  • 17. Hive Star schema join – (Based on TPC-DS Query 27) SELECT col5, avg(col6) FROM store_sales_fact ssf 41 GB join item_dim on (ssf.col1 = item_dim.col1) 58 MB join date_dim on (ssf.col2 = date_dim.col2) 11 MB join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3) 80 MB join store_dim on (ssf.col4 = store_dim.col4) 106 KB GROUP BY col5 ORDER BY col5 LIMIT 100; Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB) 17
  • 18. Hive File Type # MR jobs Input Size # Mappers Time Text / Hive 0.10 5 43.1 GB 179 21:00 min Text / Hive 0.11 1 38.0 GB 151 4:06 min RC / Hive 0.11 1 8.21 GB 76 2:16 min ORC / Hive 0.11 1 2.83 GB 38 1:44 min RC / Hive 0.11 / Partitioned / Bucketed 1 1.73 GB 19 1:44 min ORC / Hive 0.11 / Partitioned / Bucketed 1 687 MB 27 01:19 min Data: ~64x less data Time; ~16x times faster 18
  • 19. Data access from Hadoop Excel FTP Hadoop command – hadoop fs -copyToLocal ODBC[1] – Via Hive (HiveQL) data can be extracted. Power Query – Is capable of extracting data directly from HDFS or Azure BLOB storage • PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes. • • • • • [1] http://www.microsoft.com/en-us/download/details.aspx?id=40886 [2] Power BI Excel add-in – http://www.powerbi.com 19
  • 21. 21
  • 22. PDW – Polybase … SQL Server SQL Server SQL Server SQL Server Sqoop This is PDW! DN DN DN DN DN DN DN DN DN DN DN DN Hadoop Cluster 22 22
  • 23. PDW – External Tables • An external table is PDW‟s representation of data residing in HDFS • The “table” (metadata) lives in the context of a SQL Server database • The actual table data resides in HDFS • No support for DML operations • No concurrency control or isolation level guarantees CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ]) {WITH (LOCATION =‘<URI>’,[FORMAT_OPTIONS = (<VALUES>)])} [;] Required to indicate location of Hadoop cluster Optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds) 23
  • 24. PDW – Hadoop use cases & examples [1] Retrieve data from HDFS with a PDW query • Seamlessly join structured and semi-structured data SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID AND c.URL=‘www.bing.com’; [2] Import data from HDFS to PDW • Parallelized CREATE TABLE AS SELECT (CTAS) • External tables as the source • PDW table, either replicated or distributed, as destination CREATE TABLE ClickStreamInPDW WITH DISTRIBUTION = HASH(URL) AS SELECT URL, EventDate, UserID FROM ClickStream; [3] Export data from PDW to HDFS • Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS) • External table as the destination; creates a set of HDFS files CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID) WITH (LOCATION =‘hdfs://MyHadoop:5000/joe’, FORMAT_OPTIONS (...) AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
  • 26. Wrap up Hadoop „just another data source‟ @ your fingertips! Batch processing large datasets before loading into your DWH Offloading DWH data, but still accessible for analysis/reporting Integrate Hadoop via SQOOP, ODBC (Hive) or PolyBase Near future: deeply integration between Hadoop and SQL PDW Try Hadoop / HDInsight yourself: Azure: http://www.windowsazure.com/en-us/pricing/free-trial/ Web PI: http://www.microsoft.com/web/downloads/platform.aspx 26
  • 28. References Microsoft Big Data http://www.microsoft.com/bigdata Windows Azure HDInsight Service (3 months free trail) http://www.windowsazure.com/en-us/services/hdinsight/ SQL Server Parallel Data Warehouse (PDW) Landing Page http://www.microsoft.com/PDW http://www.upgradetopdw.com Introduction to Polybase http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx 28 28

Editor's Notes

  1. DEMO:Upload a local file with hadoop -copyFromLocal
  2. - Hadoopcommand- CoudXplorer
  3. DEMO: Total hit count W3C logs
  4. - TotalHits MR job
  5. ‘Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest are is HiveQL’
  6. Hive &lt;0.11:stores data is plain text filesno join optimization.typical DHW query (star schema join) results in 6 MR jobsHive 0.11:introduces (O)RC files, loosely based on column store indexesjoin optimizationTypical DWH query result in 1 MR jobHive 0.12:- Uses Yarn and Tez, optimized for DWH queries and less overhead then MR
  7. DEMO: Retrieving data via ODBC and Power Query in Excel
  8. Via ExcelData Explorer (Azure BLOB storage)