Christian Coté
ETL (extract, transform and load) architect/developer
ETL development using various ETL tools: DTS / SSIS,
Hummingbird Genio, Informatica, DataStage
DW Experience in various domains: Pharmaceutical, finance,
insurance and manufacturing
Specialized in data warehousing and BI
Microsoft Most Valuable Professional (MVP) – SQL Server
Montreal SQL PASS chapter co-leader
WhoAmI
• Why Big Data?
• Big Data Lambda Architecture
• Getting started with Windows Azure HDInsight
Service
• Introduction to Hive
Agenda
Data complexity: variety and velocity
Petabytes
What is Big Data?
Distributed, scalable system on commodity hardware composed of:
• HDFS—distributed file system
• MapReduce—programming model
• Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper
Diagram labels: HBase (column DB), Hive, Mahout, Oozie, Sqoop, HBase/Cassandra/Couch/MongoDB, Avro, Zookeeper, Pig, Flume, Cascading, R, Ambari, HCatalog
Hadoop = MapReduce + HDFS
What is Hadoop?
Diagram labels: Machine Learning, Graph Processing, Distributed Compute, Extract Load Transform, Predictive Analysis
Move HDFS into the warehouse before analysis
ETL
Hadoop ecosystem
Learn new skills
SQL
Build
Integrate
Manage
Maintain
Support
Limitations: Analysis with Big Data today
Steep learning curve, slow and inefficient

Data sources Non-Relational Data
• Large amount of logged or archived data –
small # of large files
• Loosely structured data – no fixed schema
• Data is written once and may only be
appended
• Data sets are read frequently and often in
full
• Examples
• monitoring supply chains in retail
• suspicious trading patterns in finance
• air and water quality from arrays of environmental sensors
Traditional
Data Warehouse
ETL
Business Critical
Tomorrow's
Data Warehouse
ETL
Sensor Data
Log Data
Automated
Data
Social
Networks
RFID Data
HDInsight
Sensor Data
Log Data
Automated
Data
Social
Networks
RFID Data
Microsoft Business Intelligence (BI)
• Hive ODBC Connectivity
• BI Tools for Big Data
Better on Windows and Azure
• Active Directory
• System Center
• .Net Programmability
• Azure Data Factory
Microsoft Data Connectivity
• SQL Server / SQL Parallel Data Warehouse
• Azure Storage / Azure Data Market
Collaborate with and Contribute to OSS
• Collaborate with Hortonworks
• Provide improvements and Windows support back to OSS
Big Data Lambda Architecture
• Batch layer
• Stores master dataset
• Compute arbitrary views
• Speed layer
• Fast, incremental algorithms
• Batch layer eventually overrides
speed layer
• Serving layer
• Random access to batch views
• Updated by batch layer
• Stores master dataset
(in append mode)
• Unrestrained
computation
• Horizontally scalable
• High latency
• Stream processing of
data
• Stores a limited window
of data
• Dynamic computation
• Queries the batch and
real-time views
• Merges the results
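The way the serving layer merges batch and real-time views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page names, the counts, and the function names `query` and `batch_recompute` are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical page-view counts; none of these numbers come from the slides.
batch_view = Counter({"/home": 1000, "/pricing": 250})  # precomputed by the batch layer (high latency)
realtime_view = Counter({"/home": 12, "/signup": 3})    # limited window: events since the last batch run


def query(page: str) -> int:
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]


def batch_recompute(master_dataset: list[str]) -> None:
    """Recompute the batch view from the full, append-only master dataset.

    Once the batch layer has caught up, its result overrides the speed layer,
    so the real-time view is cleared.
    """
    batch_view.clear()
    batch_view.update(master_dataset)
    realtime_view.clear()
```

Calling `query("/home")` before a batch run adds both views together; after `batch_recompute(...)` the answer comes from the batch view alone until new events reach the speed layer.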
Extremely large volume of unstructured web logs
Ad hoc analysis of logs to prototype patterns
Hadoop data cluster feeds large 24TB cube
Business users analyze cube data
E.g. STRUCTURED & UNSTRUCTURED DATA
Apache Hadoop SQL Server Analysis Services (SSAS)
Microsoft Excel and PowerPivot
Other BI Tools and Custom
Applications
Hadoop Data
Third Party
Database
SQL Server
Analysis Services
(SSAS Cube)
+
Custom
Applications
SQL Server Connector (Hadoop Hive ODBC)
Staging Database
Windows Azure HDInsight
Azure Blob storage
HDInsight Console
Windows Azure
HDInsight
Azure Blob storage
MapReduce
PowerShell Console
• Programming framework
(library and runtime) for
analyzing datasets stored in
HDFS
• Composed of user-supplied
Map and Reduce functions:
• Map() - subdivide and
conquer
• Reduce() - combine and
reduce cardinality
………
Do work() Do work() Do work()
• Rapidly process vast
amounts of data in parallel,
on a large cluster of
compute nodes
• Framework schedules and
monitors tasks, and
re-executes failed tasks
• Typically, both input and
output are stored in file
system
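The programming model described above, with user-supplied Map and Reduce functions, can be sketched as a single-process Python stand-in. It is not the Hadoop API: `run_mapreduce`, the sensor readings, and the two example functions are assumptions made for illustration, and the sketch leaves out scheduling, parallelism, and the re-execution of failed tasks that the real framework provides.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def run_mapreduce(
    records: Iterable,
    map_fn: Callable[[object], Iterator[tuple]],
    reduce_fn: Callable[[object, list], object],
) -> dict:
    """Run a map phase, group values by key, then run a reduce phase."""
    groups = defaultdict(list)
    # Map: every input record may emit zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce: collapse the list of values for each key to a single result.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Hypothetical sensor readings: (sensor id, temperature).
readings = [("s1", 20.5), ("s2", 18.0), ("s1", 23.1), ("s2", 19.4)]


def map_reading(reading):
    """Emit the sensor id as key and the temperature as value."""
    sensor_id, temperature = reading
    yield sensor_id, temperature


def reduce_max(sensor_id, temperatures):
    """Reduce cardinality: keep only the highest temperature per sensor."""
    return max(temperatures)


hottest = run_mapreduce(readings, map_reading, reduce_max)
```

The caller writes only `map_reading` and `reduce_max`; grouping values by key is left to the runner, which is the division of labour the bullets describe.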
DataNode 1
Mapper
Data is shuffled
across the network
and sorted
Map Phase Shuffle/Sort Reduce Phase
DataNode 2
Mapper
DataNode 3
Mapper
DataNode 1
Reducer
DataNode 2
DataNode 3
Reducer
INPUT
OUTPUT
Pre-Execution
Member 1
Reducer 1
Member 2 Member 3 Member N
Reducer 2 Reducer 3 Reducer m
Data Summary
Reducer 4 Reducer 5
• Client app
creates a task
• Task is
scheduled in
Task Manager
• Task is
dispatched at
scheduled
time
Keyword Content RegionId
Complain OMITTED 10
Service OMITTED 10
Warranty OMITTED 10
Service OMITTED 20
Warranty OMITTED 20
Lawsuit OMITTED 20
Complain OMITTED 30
Tax OMITTED 30
Support OMITTED 30
INPUT
OUTPUT
Pre-Execution
Reducer 1
Mapper 1 Mapper 2 Mapper 3 Mapper N
Member 1 Member 2 Member 3 Member N
Reducer 2 Reducer 3 Reducer m
Data Summary
Keyword Content RegionId
Complain OMITTED 10
Service OMITTED 10
Warranty OMITTED 10
Keyword Content RegionId
Service OMITTED 20
Warranty OMITTED 20
Lawsuit OMITTED 20
Keyword Content RegionId
Complain OMITTED 30
Tax OMITTED 30
Support OMITTED 30
Reducer 4 Reducer 5
Keyword Content RegionId
Complain OMITTED 10
Service OMITTED 10
Warranty OMITTED 10
Service OMITTED 20
Warranty OMITTED 20
Lawsuit OMITTED 20
Complain OMITTED 30
Tax OMITTED 30
Support OMITTED 30
• Task is
distributed to
all member
nodes
• Each member
node now
becomes a
Mapper
Reducer 4 Reducer 5
INPUT
OUTPUT
Pre-Execution
Mapper 1
Reducer 1
Mapper N
Reducer 2 Reducer 3 Reducer m
Data Summary
Complain 19 10
Service 23 10
Warranty 22 10
Mapper 3
Complain 38 30
Support 69 30
Tax 23 30
Mapper 2
Lawsuit 7 20
Service 44 20
Warranty 25 20
Keyword Occurrence RegionId
Complain 19 10
Service 23 10
Warranty 22 10
Keyword Occurrence RegionId
Service 44 20
Warranty 25 20
Lawsuit 7 20
Keyword Occurrence RegionId
Complain 38 30
Tax 23 30
Support 69 30
• Mapper
function
executes over
all rows in its
partition
• Mappers push
results to the
Reducers
• Reducers start
processing the
output from
Mappers
INPUT
OUTPUT
Pre-Execution
Mapper 1
Reducer 1
Mapper 2 Mapper 3 Mapper N
Reducer 2 Reducer 3 Reducer m
Data Summary
Reducer 4 Reducer 5
Support 69, Warranty 47, Lawsuit 7, Service 67, Complain 57, Tax 23
Keyword Occurrence
Support 69
Service 67
Warranty 47
Complain 57
Lawsuit 7
Tax 23
• Reducers
carry out their
operation in
parallel
• Output from
each Reducer is
summed into
one temporary
table
• Output results
are published
into output file
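The keyword walkthrough above can be replayed in Python using the per-mapper counts shown on the slides. The number of reducers and the `reducer_for` partitioning rule are assumptions made for this sketch; the slides only show reducers 1 to m without saying how keywords are assigned to them.

```python
from collections import defaultdict

# Per-mapper output copied from the slides: keyword -> occurrences, one partition per RegionId.
mapper_output = {
    10: {"Complain": 19, "Service": 23, "Warranty": 22},
    20: {"Service": 44, "Warranty": 25, "Lawsuit": 7},
    30: {"Complain": 38, "Tax": 23, "Support": 69},
}

NUM_REDUCERS = 2  # assumption; the slides show reducers 1..m


def reducer_for(keyword: str) -> int:
    """Pick the reducer that owns a keyword (hypothetical deterministic partitioner)."""
    return sum(keyword.encode()) % NUM_REDUCERS


# Shuffle: mappers push each (keyword, count) pair to the reducer that owns the keyword.
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for region_id, counts in mapper_output.items():
    for keyword, occurrences in counts.items():
        reducer_inputs[reducer_for(keyword)][keyword].append(occurrences)

# Reduce: each reducer sums the counts for the keywords it owns.
reducer_outputs = [
    {keyword: sum(values) for keyword, values in inputs.items()}
    for inputs in reducer_inputs
]

# Combine the reducer outputs into one result table.
summary = {}
for output in reducer_outputs:
    summary.update(output)
```

Because every keyword is routed to exactly one reducer, the reducer outputs never overlap and can be merged without further arithmetic. The totals match the Data Summary table: Service 23 + 44 = 67, Warranty 22 + 25 = 47, Complain 19 + 38 = 57.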
Demo: The “Hello World” of MapReduce
• Supplied sample on HDInsight
• Written in Java
• Source code at
http://wiki.apache.org/hadoop/WordCount
• Demo
Each mapper takes a line as input and breaks it into words. It then emits a
key/value pair of the word and 1. Each reducer sums the counts for each word
and emits a single key/value with the word and sum.
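The mapper and reducer just described can be sketched in Python. This is not the Java sample shipped with HDInsight: the function names and the two input lines are hypothetical, and splitting on whitespace is an assumption made for this sketch.

```python
from collections import defaultdict


def mapper(line: str):
    """Break a line into words and emit a (word, 1) pair for each one."""
    for word in line.split():
        yield word, 1


def reducer(word: str, counts: list[int]):
    """Sum the counts for one word and emit a single (word, total) pair."""
    return word, sum(counts)


def word_count(lines):
    """Group the mapper output by word, then apply the reducer to each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(word, counts) for word, counts in grouped.items())


# Hypothetical input lines.
counts = word_count(["Hello World", "Hello Hadoop"])
```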
• Built on top of Hadoop to
provide data management,
querying, and analysis
• Access and query data
through simple SQL-like
statements, called Hive
queries
• In short, Hive compiles,
Hadoop executes
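To give a feel for what a SQL-like statement replaces, the sketch below runs a GROUP BY aggregation over the keyword counts from the MapReduce walkthrough. It uses Python's built-in SQLite as a local stand-in, which is an assumption made so the example runs without a cluster. It is not Hive, and the table name `keyword_counts` is hypothetical. HiveQL accepts SELECT statements of this general shape, although creating and loading tables works differently there.

```python
import sqlite3

# SQLite is only a local stand-in here; this is not Hive and no cluster is involved.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keyword_counts (keyword TEXT, occurrence INTEGER, region_id INTEGER)"
)
conn.executemany(
    "INSERT INTO keyword_counts VALUES (?, ?, ?)",
    [
        ("Complain", 19, 10), ("Service", 23, 10), ("Warranty", 22, 10),
        ("Service", 44, 20), ("Warranty", 25, 20), ("Lawsuit", 7, 20),
        ("Complain", 38, 30), ("Tax", 23, 30), ("Support", 69, 30),
    ],
)

# One declarative statement takes the place of hand-written Map and Reduce functions.
query = """
    SELECT keyword, SUM(occurrence) AS total
    FROM keyword_counts
    GROUP BY keyword
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
```

In Hive, a statement like this would be compiled into jobs that Hadoop runs across the cluster; here SQLite simply evaluates it in memory.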
Demo: Hive query on head
node
• HiveQL includes data
definition language, data
import/export and data
manipulation language
statements
• See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
http://blogs.msdn.com/b/windowsazure/archive/2013/03/19/getting-started-with-hdinsight.aspx
http://blogs.msdn.com/b/windowsazure/archive/2013/03/21/azure-hdinsight-and-azure-storage.aspx
Questions?
Christian Coté - ETL Architect and Microsoft MVP

Christian Coté - ETL Architect and Microsoft MVP

  • 2. ETL (extract, transform and load) architect/developer ETL development using various ETL tools: DTS / SSIS, Hummungbird Genio, Informatica, Datastage DW Experience in various domains: Pharmaceutical, finance, insurance and manufacturing Specialized in Datawarehousing and BI Microsoft Most Valuable Professional (MVP) – SQL Server Montreal SQL Pass chapter co-leader WhoAmI
  • 3. • Why Big Data? • Big Data Lambda Architecture • Getting started with Windows Azure HDInsight Service • Introduction to Hive Agenda
  • 4.
  • 5. Data complexity: variety and velocity Petabytes What is Big Data?
  • 6. Microsoft Confidential Distributed, scalable system on commodity hardware composed of: • HDFS—distributed file system • MapReduce—programming model • Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, Zookeeper HBase (column DB) Hive Mahout Oozie Sqoop HBase/Cassandra/Couch/ MongoDB Avro Zookeeper Pig FlumeCascadingR Ambari HCatalog Hadoop = MapReduce + HDFS What is Hadoop?
  • 8. Limitations: Analysis with Big Data today • Steep learning curve, slow and inefficient • ETL: move HDFS into the warehouse before analysis • Learn new skills: SQL, Hadoop ecosystem • Build, integrate, manage, maintain, support
  • 10. • Large amount of logged or archived data – small number of large files • Loosely structured data – no fixed schema • Data is written once and may only be appended • Data sets are read frequently and often in full • Examples: monitoring supply chains in retail; suspicious trading patterns in finance; air and water quality from arrays of environmental sensors
  • 12. [Diagram: Tomorrow's Data Warehouse. Labels: Business Critical, ETL, HDInsight. Data sources: Sensor Data, Log Data, Automated Data, Social Networks, RFID Data]
  • 13. Microsoft Business Intelligence (BI): • Hive ODBC Connectivity • BI Tools for Big Data. Better on Windows and Azure: • Active Directory • System Center • .NET Programmability • Azure Data Factory. Microsoft Data Connectivity: • SQL Server / SQL Parallel Data Warehouse • Azure Storage / Azure Data Market. Collaborate with and Contribute to OSS: • Collaborate with Hortonworks • Provide improvements and Windows support back to OSS
  • 15. • Batch layer: stores master dataset; computes arbitrary views • Speed layer: fast, incremental algorithms; batch layer eventually overrides speed layer • Serving layer: random access to batch views; updated by batch layer
  • 16. • Stores master dataset (in append mode) • Unrestrained computation • Horizontally scalable • High latency
  • 17. • Stream processing of data • Stores a limited window of data • Dynamic computation
  • 18. • Queries the batch and real-time views • Merges the results
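
The three layers described on slides 15 to 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page-view events, the function names (`compute_batch_view`, `update_realtime_view`, `query`) and the choice to merge by adding counts are all hypothetical and are not taken from the deck.

```python
from collections import Counter


def compute_batch_view(master_dataset):
    """Batch layer: recompute the view from the whole master dataset."""
    return Counter(event["page"] for event in master_dataset)


def update_realtime_view(realtime_view, event):
    """Speed layer: fold one new event into the view incrementally."""
    realtime_view[event["page"]] += 1
    return realtime_view


def query(batch_view, realtime_view, page):
    """Query both views and merge the results (assumed here: add the counts)."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


# Hypothetical events already absorbed by the last batch run.
# The master dataset is append-only.
master_dataset = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
batch_view = compute_batch_view(master_dataset)

# Hypothetical events that arrived after the batch run;
# only the speed layer has seen them so far.
realtime_view = Counter()
for event in [{"page": "home"}, {"page": "contact"}]:
    update_realtime_view(realtime_view, event)

home_views = query(batch_view, realtime_view, "home")
contact_views = query(batch_view, realtime_view, "contact")
```

Once the next batch run has absorbed the recent events, the real-time view can be discarded, which is one way to read "batch layer eventually overrides speed layer".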
  • 19.
  • 20. Structured & unstructured data, e.g.: • Extremely large volume of unstructured web logs • Ad hoc analysis of logs to prototype patterns • Hadoop data cluster feeds large 24 TB cube • Business users analyze cube data
  • 21. [Architecture diagram. Labels: Apache Hadoop; Hadoop Data; Third Party Database; Staging Database; SQL Server Connector (Hadoop Hive ODBC); SQL Server Analysis Services (SSAS cube); Microsoft Excel and PowerPivot; Other BI Tools and Custom Applications]
  • 22.
  • 23. Windows Azure HDInsight • Azure Blob storage • HDInsight Console
  • 24. Windows Azure HDInsight • Azure Blob storage • MapReduce • PowerShell • Console
  • 25. • Programming framework (library and runtime) for analyzing datasets stored in HDFS • Composed of user-supplied Map and Reduce functions: Map() subdivides and conquers; Reduce() combines and reduces cardinality [Diagram: several "Do work()" tasks running in parallel]
  • 26. • Rapidly process vast amounts of data in parallel, on a large cluster of compute nodes • Framework schedules and monitors tasks, and re-executes failed tasks • Typically, both input and output are stored in the file system [Diagram: Map phase (one Mapper on each of DataNode 1, 2 and 3), then Shuffle/Sort (data is shuffled across the network and sorted), then Reduce phase (Reducers on the DataNodes)]
  • 27. • Client app creates a task • Task is scheduled in Task Manager • Task is dispatched at scheduled time [Diagram labels: INPUT, Pre-Execution, Member 1 to Member N, Reducer 1 to Reducer m, Data Summary, OUTPUT] Input rows (Keyword, Content, RegionId): (Complain, OMITTED, 10), (Service, OMITTED, 10), (Warranty, OMITTED, 10), (Service, OMITTED, 20), (Warranty, OMITTED, 20), (Lawsuit, OMITTED, 20), (Complain, OMITTED, 30), (Tax, OMITTED, 30), (Support, OMITTED, 30)
  • 28. • Task is distributed to all member nodes • Each member node now becomes a Mapper [Diagram labels: INPUT, Pre-Execution, Member 1 to Member N, Mapper 1 to Mapper N, Reducer 1 to Reducer m, Data Summary, OUTPUT] Input rows (Keyword, Content, RegionId), split into three partitions: partition 1: (Complain, OMITTED, 10), (Service, OMITTED, 10), (Warranty, OMITTED, 10); partition 2: (Service, OMITTED, 20), (Warranty, OMITTED, 20), (Lawsuit, OMITTED, 20); partition 3: (Complain, OMITTED, 30), (Tax, OMITTED, 30), (Support, OMITTED, 30)
  • 29. • Mapper function executes over all rows in its partition • Mappers push results to the Reducers • Reducers start processing the output from Mappers [Diagram labels: INPUT, Pre-Execution, Mapper 1 to Mapper N, Reducer 1 to Reducer m, Data Summary, OUTPUT] Mapper output (Keyword, Occurrence, RegionId): Mapper 1: (Complain, 19, 10), (Service, 23, 10), (Warranty, 22, 10); Mapper 2: (Service, 44, 20), (Warranty, 25, 20), (Lawsuit, 7, 20); Mapper 3: (Complain, 38, 30), (Tax, 23, 30), (Support, 69, 30)
  • 30. • Reducers carry out their operation in parallel • Output from each Reducer is summed into one temporary table • Output results are published to the output file [Diagram labels: INPUT, Pre-Execution, Mapper 1 to Mapper N, Reducer 1 to Reducer m, Data Summary, OUTPUT] Reducer output (Keyword, Occurrence): (Support, 69), (Service, 67), (Warranty, 47), (Complain, 57), (Lawsuit, 7), (Tax, 23)
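
The walkthrough on slides 27 to 30 can be replayed in plain Python, which may help to see where the final numbers come from. The mapper output rows below are copied from slide 29; the `shuffle` and `reduce_counts` helpers are hypothetical names for this sketch, which runs in a single process and is not Hadoop code.

```python
from collections import defaultdict

# (keyword, occurrences, region_id) rows emitted by the three mappers on slide 29.
mapper_outputs = [
    [("Complain", 19, 10), ("Service", 23, 10), ("Warranty", 22, 10)],
    [("Service", 44, 20), ("Warranty", 25, 20), ("Lawsuit", 7, 20)],
    [("Complain", 38, 30), ("Tax", 23, 30), ("Support", 69, 30)],
]


def shuffle(outputs):
    """Group the occurrence counts by keyword, as the shuffle/sort step would."""
    grouped = defaultdict(list)
    for partition in outputs:
        for keyword, occurrences, _region_id in partition:
            grouped[keyword].append(occurrences)
    return grouped


def reduce_counts(keyword, counts):
    """Reducer: collapse all counts for one keyword into a single total."""
    return keyword, sum(counts)


# One reducer call per keyword; the results form the final summary table.
summary = dict(
    reduce_counts(keyword, counts)
    for keyword, counts in shuffle(mapper_outputs).items()
)
```

The resulting `summary` matches the table on slide 30, for example Complain = 19 + 38 = 57 and Service = 23 + 44 = 67.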
  • 31. Demo: The “Hello World” of MapReduce • Supplied sample on HDInsight • Written in Java • Source code at http://wiki.apache.org/hadoop/WordCount • Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. • Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
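
The demo sample itself is written in Java. As a rough single-process sketch of the same mapper and reducer logic, here it is in Python. The function names and the two input lines are made up for illustration, and splitting on whitespace is an assumption about how words are separated.

```python
from itertools import groupby


def mapper(line):
    """Break a line into words and emit a (word, 1) pair for each one."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Sum the counts for one word and emit a single (word, sum) pair."""
    return word, sum(counts)


def word_count(lines):
    """Run the map, sort (standing in for shuffle/sort) and reduce steps in order."""
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    )


# Hypothetical input lines.
counts = word_count(["hello world", "hello hadoop"])
```

Sorting the pairs before grouping plays the role of the shuffle/sort phase: it guarantees that all pairs for a given word reach the same reducer call.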
  • 32.
  • 33.
  • 34. • Built on top of Hadoop to provide data management, querying, and analysis • Access and query data through simple SQL-like statements, called Hive queries • In short, Hive compiles, Hadoop executes
  • 35. Demo: Hive query on head node
  • 36. • HiveQL includes data definition language, data import/export and data manipulation language statements • See https://cwiki.apache.org/confluence/display/Hive/LanguageManual