Christian Coté - ETL Architect and Microsoft MVP

ETL (extract, transform and load) architect/developer
ETL development using various ETL tools: DTS / SSIS,
Hummingbird Genio, Informatica, DataStage
DW Experience in various domains: Pharmaceutical, finance,
insurance and manufacturing
Specialized in data warehousing and BI
Microsoft Most Valuable Professional (MVP) – SQL Server
Montreal SQL PASS chapter co-leader
WhoAmI

• Why Big Data?
• Big Data Lambda Architecture
• Getting started with Windows Azure HDInsight Service
• Introduction to Hive
Agenda

Data complexity: variety and velocity
Petabytes
What is Big Data?

Distributed, scalable system on commodity hardware composed of:
• HDFS—distributed file system
• MapReduce—programming model
• Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper
[Diagram: Hadoop ecosystem components: HBase (column DB), Hive, Mahout, Oozie, Sqoop, HBase/Cassandra/Couch/MongoDB, Avro, Zookeeper, Pig, Flume, Cascading, R, Ambari, HCatalog]
Hadoop = MapReduce + HDFS
What is Hadoop?

[Diagram: machine learning, graph processing, distributed compute, extract/load/transform, predictive analysis]

• Move HDFS data into the warehouse (ETL) before analysis with SQL
• Hadoop ecosystem: learn new skills to build, integrate, manage, maintain and support
Limitations: Analysis with Big Data today
Steep learning curve, slow and inefficient


Data sources Non-Relational Data

• Large amount of logged or archived data: a small number of large files
• Loosely structured data with no fixed schema
• Data is written once and may only be appended
• Data sets are read frequently and often in full
• Examples
  • monitoring supply chains in retail
  • suspicious trading patterns in finance
  • air and water quality from arrays of environmental sensors

[Diagram: Traditional Data Warehouse, loaded by ETL from business-critical data]
[Diagram: Tomorrow's Data Warehouse, loaded by ETL, with sensor data, log data, automated data, social networks and RFID data brought in through HDInsight]

Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Better on Windows and Azure
• Active Directory
• System Center
• .NET Programmability
• Azure Data Factory
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS

• Batch layer
• Stores master dataset
• Compute arbitrary views
• Speed layer
• Fast, incremental algorithms
• Batch layer eventually overrides speed layer
• Serving layer
• Random access to batch views
• Updated by batch layer

• Stores master dataset (in append mode)
• Unrestrained computation
• Horizontally scalable
• High latency

• Stream processing of data
• Stores a limited window of data
• Dynamic computation

• Queries the batch and real-time views
• Merges the results
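
The three layers described above can be sketched in a few lines of Python. This is only an illustrative toy, not a production design: it assumes the data is a stream of page-view events and the query is a view count per page, and the function names and sample data are hypothetical.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute the view from the whole append-only master dataset."""
    return Counter(master_dataset)

def update_realtime_view(realtime_view, event):
    """Speed layer: incremental update for one event the batch layer has not seen yet."""
    realtime_view[event] += 1
    return realtime_view

def query(batch, realtime, key):
    """Serving layer: query both views and merge the results."""
    return batch[key] + realtime[key]

# Hypothetical sample data: events already absorbed by the batch layer...
master_dataset = ["home", "pricing", "home", "docs"]
# ...and recent events that arrived after the last batch run.
recent_events = ["home", "docs"]

batch = batch_view(master_dataset)
realtime = Counter()
for event in recent_events:
    update_realtime_view(realtime, event)

home_views = query(batch, realtime, "home")
print(home_views)  # 3

# The next batch run absorbs the recent events, so the batch view overrides
# the speed layer and the real-time window can be discarded.
master_dataset.extend(recent_events)
batch = batch_view(master_dataset)
realtime = Counter()
assert query(batch, realtime, "home") == home_views
```

The answer to the query is the same before and after the batch run; only the layer that supplies it changes.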

Extremely large volume of unstructured web logs
Ad hoc analysis of logs to prototype patterns
Hadoop data cluster feeds large 24TB cube
Business users analyze cube data
E.g. STRUCTURED & UNSTRUCTURED DATA

[Diagram components: Apache Hadoop data, third-party database, SQL Server Connector (Hadoop Hive ODBC), staging database, SQL Server Analysis Services (SSAS) cube, Microsoft Excel and PowerPivot, other BI tools and custom applications]

Windows Azure HDInsight
Azure Blob storage
HDInsight Console

Windows Azure HDInsight
Azure Blob storage
MapReduce
PowerShell Console

• Programming framework (library and runtime) for analyzing datasets stored in HDFS
• Composed of user-supplied Map and Reduce functions:
  • Map() - subdivide and conquer
  • Reduce() - combine and reduce cardinality
[Diagram: the job is split into parallel Do work() steps]
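
As a rough illustration of the two user-supplied functions, here is a minimal single-process sketch in Python. It is not the Hadoop API: `map_fn`, `reduce_fn`, `run_job` and the sensor readings are hypothetical names and data, and `run_job` merely stands in for the framework.

```python
from collections import defaultdict

def map_fn(record):
    """Map: turn one input record into zero or more (key, value) pairs."""
    sensor_id, reading = record.split(",")
    return [(sensor_id, float(reading))]

def reduce_fn(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, max(values)

def run_job(records, map_fn, reduce_fn):
    """Stand-in for the framework: apply map, group by key, apply reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

# Hypothetical input: "sensor id,reading" lines.
records = ["s1,20.5", "s2,18.0", "s1,23.1", "s2,17.2"]
result = run_job(records, map_fn, reduce_fn)
print(result)  # {'s1': 23.1, 's2': 18.0}
```

Four input records are reduced to one result per key, which is the "reduce cardinality" part of the Reduce step.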

• Rapidly process vast amounts of data in parallel, on a large cluster of compute nodes
• Framework schedules and monitors tasks, and re-executes failed tasks
• Typically, both input and output are stored in the file system
[Diagram: Map Phase (a Mapper on each of DataNode 1, DataNode 2 and DataNode 3), Shuffle/Sort (data is shuffled across the network and sorted), Reduce Phase (Reducers on the DataNodes)]
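
The point that the framework monitors tasks and re-executes failed ones can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not how Hadoop schedules work: `run_with_retries`, `flaky_map_task`, the retry limit and the simulated failure are all hypothetical.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run each task; re-execute a task that raises, up to max_attempts times."""
    results, attempts = {}, {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break
            except RuntimeError:
                continue  # failure detected: run the same task again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts

failed_once = set()

def flaky_map_task():
    """Simulated map task whose node fails on the first attempt only."""
    if "node2" not in failed_once:
        failed_once.add("node2")
        raise RuntimeError("simulated node failure")
    return [("Service", 44)]

tasks = {
    "node1": lambda: [("Service", 23)],
    "node2": flaky_map_task,
}
results, attempts = run_with_retries(tasks)
print(attempts)  # {'node1': 1, 'node2': 2}
```

Because a map task only reads its input split and emits key/value pairs, running it a second time is safe, which is what makes this kind of re-execution possible.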

• Client app creates a task
• Task is scheduled in Task Manager
• Task is dispatched at scheduled time
[Diagram: INPUT, Pre-Execution, Member 1 ... Member N, Reducer 1 ... Reducer m, Data Summary, OUTPUT]
Keyword   Content  RegionId
Complain  OMITTED  10
Service   OMITTED  10
Warranty  OMITTED  10
Service   OMITTED  20
Warranty  OMITTED  20
Lawsuit   OMITTED  20
Complain  OMITTED  30
Tax       OMITTED  30
Support   OMITTED  30

• Task is distributed to all member nodes
• Each member node now becomes a Mapper
[Diagram: INPUT, Pre-Execution, Mapper 1 ... Mapper N (the former Member 1 ... Member N), Reducer 1, Reducer 4, Reducer 5, Data Summary, OUTPUT; the nine input rows (Keyword, Content, RegionId) are shown again]

• Mapper function executes over all rows in its partition
• Mappers push results to the Reducers
• Reducers start processing the output from Mappers
[Diagram: INPUT, Pre-Execution, Mapper 1, Mapper 2, Mapper 3 ... Mapper N, Reducer 1, Reducer 4, Reducer 5, Data Summary, OUTPUT; each Mapper shows the counts for one RegionId]
Keyword   Occurrence  RegionId
Complain  19          10
Service   23          10
Warranty  22          10
Service   44          20
Warranty  25          20
Lawsuit   7           20
Complain  38          30
Tax       23          30
Support   69          30

• Reducers carry out their operation in parallel
• Output from each Reducer is summed into one temporary table
• Output results are published into output file
[Diagram: INPUT, Pre-Execution, Mapper 1 ... Mapper N, Reducer 1, Reducer 4, Reducer 5, Data Summary, OUTPUT; the Reducers show the per-keyword totals]
Keyword   Occurrence
Support   69
Service   67
Warranty  47
Complain  57
Lawsuit   7
Tax       23
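
The arithmetic of this walk-through can be checked with a short Python sketch. The per-region counts are the mapper output from the slides; the grouping and summing code is only an illustration of the shuffle and reduce steps, not the code behind the demo.

```python
from collections import defaultdict

# Partial counts emitted by the Mappers, taken from the slides:
# (keyword, occurrence, region_id)
mapper_output = [
    ("Complain", 19, 10), ("Service", 23, 10), ("Warranty", 22, 10),
    ("Service", 44, 20), ("Warranty", 25, 20), ("Lawsuit", 7, 20),
    ("Complain", 38, 30), ("Tax", 23, 30), ("Support", 69, 30),
]

# Shuffle: route every row for the same keyword to the same place.
shuffled = defaultdict(list)
for keyword, occurrence, _region_id in mapper_output:
    shuffled[keyword].append(occurrence)

# Reduce: each keyword's partial counts are summed independently,
# which is why the Reducers can work in parallel.
totals = {keyword: sum(counts) for keyword, counts in shuffled.items()}

for keyword, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(keyword, total)
```

The totals match the final table: for example, Service is 23 + 44 = 67 and Complain is 19 + 38 = 57.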

Demo: The “Hello World” of MapReduce
• Supplied sample on HDInsight
• Written in Java
• Source code at http://wiki.apache.org/hadoop/WordCount
• Demo
Each mapper takes a line as input and breaks it into words. It then emits a
key/value pair of the word and 1. Each reducer sums the counts for each word
and emits a single key/value with the word and sum.
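
The supplied sample is written in Java; the same logic can be sketched in Python for readers who want to try it locally. This is not the HDInsight sample itself: the `word_count` driver is a hypothetical stand-in for the framework, and the input lines are made up.

```python
from collections import defaultdict

def mapper(line):
    """Break a line into words and emit a (word, 1) pair for each."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Sum the counts for one word and emit a single (word, sum) pair."""
    return word, sum(counts)

def word_count(lines):
    """Stand-in for the framework: group mapper output by word, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(word, counts) for word, counts in grouped.items())

lines = ["hello world", "hello hadoop"]
counts = word_count(lines)
print(counts)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```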

• Built on top of Hadoop to provide data management, querying, and analysis
• Access and query data through simple SQL-like statements, called Hive queries
• In short, Hive compiles, Hadoop executes
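
To give a feel for what such a SQL-like statement looks like, the sketch below runs a GROUP BY query over the keyword counts from the MapReduce walk-through earlier in the deck. It uses Python's built-in `sqlite3` purely as a local stand-in, so it does not touch Hive or Hadoop; the table name `keyword_counts` and its column names are assumptions, and a Hive query for the same aggregation would be expected to look similar.

```python
import sqlite3

# sqlite3 is only a local stand-in here; on a cluster, Hive would compile a
# similar statement into MapReduce jobs that Hadoop executes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keyword_counts (keyword TEXT, occurrence INTEGER, region_id INTEGER)"
)
conn.executemany(
    "INSERT INTO keyword_counts VALUES (?, ?, ?)",
    [
        ("Complain", 19, 10), ("Service", 23, 10), ("Warranty", 22, 10),
        ("Service", 44, 20), ("Warranty", 25, 20), ("Lawsuit", 7, 20),
        ("Complain", 38, 30), ("Tax", 23, 30), ("Support", 69, 30),
    ],
)

# One declarative statement replaces the hand-written map and reduce functions.
query = """
    SELECT keyword, SUM(occurrence) AS total
    FROM keyword_counts
    GROUP BY keyword
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for keyword, total in rows:
    print(keyword, total)
conn.close()
```

The GROUP BY plays the role of the shuffle (rows with the same keyword are brought together) and SUM plays the role of the reducer.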

• HiveQL includes data definition language, data import/export and data manipulation language statements
• See https://cwiki.apache.org/confluence/display/Hive/LanguageManual

http://blogs.msdn.com/b/windowsazure/archive/2013/03/19/getting-started-with-hdinsight.aspx
http://blogs.msdn.com/b/windowsazure/archive/2013/03/21/azure-hdinsight-and-azure-storage.aspx
