2. WhoAmI
• ETL (extract, transform, and load) architect/developer
• ETL development using various ETL tools: DTS/SSIS, Hummingbird Genio, Informatica, DataStage
• DW experience in various domains: pharmaceuticals, finance, insurance, and manufacturing
• Specialized in data warehousing and BI
• Microsoft Most Valuable Professional (MVP) – SQL Server
• Montreal SQL PASS chapter co-leader
3. Agenda
• Why Big Data?
• Big Data Lambda Architecture
• Getting started with Windows Azure HDInsight Service
• Introduction to Hive
8. Limitations: Analysis with Big Data today
• Move data from HDFS into the warehouse before analysis (ETL)
• The Hadoop ecosystem means learning new skills beyond SQL
• Build, integrate, manage, maintain, and support it all yourself
• Steep learning curve, slow and inefficient
10. • Large amount of logged or archived data – small # of large files
• Loosely structured data – no fixed schema
• Data is written once and may only be appended
• Data sets are read frequently and often in full
• Examples
  • Monitoring supply chains in retail
  • Suspicious trading patterns in finance
  • Air and water quality from arrays of environmental sensors
13. Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Better on Windows and Azure
• Active Directory
• System Center
• .NET Programmability
• Azure Data Factory
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS
15. • Batch layer
  • Stores master dataset
  • Computes arbitrary views
• Speed layer
  • Fast, incremental algorithms
  • Batch layer eventually overrides speed layer
• Serving layer
  • Random access to batch views
  • Updated by batch layer
16. Batch layer
• Stores master dataset (in append mode)
• Unrestrained computation
• Horizontally scalable
• High latency
17. Speed layer
• Stream processing of data
• Stores a limited window of data
• Dynamic computation
18. Serving layer
• Queries the batch and real-time views
• Merges the results
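To make the three layers concrete, here is a minimal sketch in Java of how a serving-layer query could merge a batch view with the speed layer's real-time view. The class and field names are hypothetical, not from the slides; this illustrates the merge idea, not any particular implementation.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a lambda-architecture serving layer (hypothetical names).
public class ServingLayer {
    // Batch view: recomputed from the whole master dataset; high latency.
    private final Map<String, Long> batchView = new HashMap<>();
    // Real-time view: incremental results for data newer than the last batch run.
    private final Map<String, Long> realtimeView = new HashMap<>();

    // A query merges both views. Once the batch layer catches up, its view
    // absorbs the recent data and overrides what the speed layer computed.
    public long query(String key) {
        return batchView.getOrDefault(key, 0L)
             + realtimeView.getOrDefault(key, 0L);
    }
}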
20. E.g.: STRUCTURED & UNSTRUCTURED DATA
• Extremely large volume of unstructured web logs
• Ad hoc analysis of logs to prototype patterns
• Hadoop data cluster feeds large 24 TB cube
• Business users analyze cube data
21. Apache Hadoop → SQL Server Analysis Services (SSAS) → Microsoft Excel and PowerPivot, other BI tools, and custom applications
[Diagram: Hadoop data and a third-party database feed a staging database through the SQL Server Connector (Hadoop Hive ODBC); the staging database loads a SQL Server Analysis Services (SSAS) cube, which is consumed by Excel, PowerPivot, and custom applications]
25. • Programming framework (library and runtime) for analyzing datasets stored in HDFS
• Composed of user-supplied Map and Reduce functions:
  • Map() – subdivide and conquer
  • Reduce() – combine and reduce cardinality
[Diagram: work fanned out across nodes – Do work() on each]
26. • Rapidly process vast amounts of data in parallel, on a large cluster of compute nodes
• Framework schedules and monitors tasks, and re-executes failed tasks
• Typically, both input and output are stored in the file system
[Diagram: Map Phase → Shuffle/Sort → Reduce Phase; Mappers on DataNodes 1–3 emit data that is shuffled across the network and sorted before reaching the Reducers]
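As a sketch of how work is handed to the framework, here is a job driver using the standard Hadoop MapReduce API. The class name WordCountDriver is hypothetical; the TokenizerMapper and IntSumReducer classes it wires up are shown with the WordCount demo later in the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // map phase
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);   // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
        // Once submitted, the framework schedules, monitors, and
        // re-executes failed tasks without further involvement.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}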
27. Pre-Execution
• Client app creates a task
• Task is scheduled in Task Manager
• Task is dispatched at scheduled time

Input data:

Keyword  | Content | RegionId
Complain | OMITTED | 10
Service  | OMITTED | 10
Warranty | OMITTED | 10
Service  | OMITTED | 20
Warranty | OMITTED | 20
Lawsuit  | OMITTED | 20
Complain | OMITTED | 30
Tax      | OMITTED | 30
Support  | OMITTED | 30

[Diagram: INPUT flows through member nodes 1–N and Reducers 1–m to OUTPUT and a data summary]
28. • Task is distributed to all member nodes
• Each member node now becomes a Mapper

The input from the previous slide is split by region, one partition per Mapper:

Mapper 1 partition:
Keyword  | Content | RegionId
Complain | OMITTED | 10
Service  | OMITTED | 10
Warranty | OMITTED | 10

Mapper 2 partition:
Keyword  | Content | RegionId
Service  | OMITTED | 20
Warranty | OMITTED | 20
Lawsuit  | OMITTED | 20

Mapper 3 partition:
Keyword  | Content | RegionId
Complain | OMITTED | 30
Tax      | OMITTED | 30
Support  | OMITTED | 30
29. • Mapper function executes over all rows in its partition
• Mappers push results to the Reducers
• Reducers start processing the output from Mappers

Per-region counts emitted by the Mappers:

Mapper 1:
Keyword  | Occurrence | RegionId
Complain | 19 | 10
Service  | 23 | 10
Warranty | 22 | 10

Mapper 2:
Keyword  | Occurrence | RegionId
Service  | 44 | 20
Warranty | 25 | 20
Lawsuit  | 7  | 20

Mapper 3:
Keyword  | Occurrence | RegionId
Complain | 38 | 30
Tax      | 23 | 30
Support  | 69 | 30
30. • Reducers carry out their operation in parallel
• Output from each Reducer is summed into one temporary table
• Output results are published into the output file

Final output, merged across regions:

Keyword  | Occurrence
Support  | 69
Service  | 67
Warranty | 47
Complain | 57
Lawsuit  | 7
Tax      | 23
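To connect this walkthrough to Hive (introduced below): the reduce step of the example amounts to a single group-by over the Mappers' per-region counts. A sketch in HiveQL, with hypothetical table and column names:

-- Hypothetical table holding the per-region intermediate output
-- (keyword, occurrence, regionid), as in the tables on slide 29.
SELECT keyword,
       SUM(occurrence) AS occurrence
FROM region_counts
GROUP BY keyword;
-- e.g. Service: 23 (region 10) + 44 (region 20) = 67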
31. Demo: The “Hello World” of MapReduce
• Supplied sample on HDInsight
• Written in Java
• Source code at http://wiki.apache.org/hadoop/WordCount
• Demo

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
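For reference, here is an abridged sketch of the Map and Reduce functions following the Apache WordCount sample linked above; details (class nesting, the driver) may differ from the published version at that URL.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: break each input line into words and emit (word, 1).
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: sum the counts for each word and emit (word, total).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}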
34. • Built on top of Hadoop to provide data management, querying, and analysis
• Access and query data through simple SQL-like statements, called Hive queries
• In short, Hive compiles, Hadoop executes
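As a minimal example of such a statement (the table name weblogs is hypothetical), the following Hive query is compiled into MapReduce jobs that Hadoop then executes:

-- Count log entries per region; Hive compiles, Hadoop executes.
SELECT regionid, COUNT(*) AS entries
FROM weblogs
GROUP BY regionid;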
36. • HiveQL includes data definition language, data import/export, and data manipulation language statements
• See https://cwiki.apache.org/confluence/display/Hive/LanguageManual
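A short sketch touching each statement family; the table names, schema, and HDFS path are hypothetical, and the LanguageManual above has the full syntax:

-- Data definition language: declare a table over delimited files.
CREATE TABLE weblogs (
  keyword  STRING,
  content  STRING,
  regionid INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Data import: move a file from HDFS into the table's storage.
LOAD DATA INPATH '/demo/logs/keywords.tsv' INTO TABLE weblogs;

-- Data manipulation language: populate a summary table from a query.
CREATE TABLE keyword_totals (keyword STRING, occurrence BIGINT);
INSERT OVERWRITE TABLE keyword_totals
SELECT keyword, COUNT(*) FROM weblogs GROUP BY keyword;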