Introduction to Azure HDInsight

Stéphane Fréchette
Data & Business Intelligence Solutions Architect | Consultant | Big Data | NoSQL | Data Science | Data Platform MVP
Introduction to HDInsight
Stéphane Fréchette
Saturday February 7, 2015
Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data
| NoSQL | Data Science. Drums, good food and fine wine.
Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that
matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A
“Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
data within a tolerable elapsed time…”
- Wikipedia
What is Big Data?
Many Options
Variability
What is Big Data?
[Figure: data sources, volume, and storage cost by era]
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
• Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
• Velocity - Variety
• Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
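To put the storage cost per gigabyte figures from the slide above in perspective, here is a small arithmetic sketch. The four prices are the ones on the slide; the function names and the choice of 1 TB (taken as 1,000 GB) as the example size are assumptions made for illustration.

```python
# Storage cost per gigabyte in USD, keyed by year, as listed on the slide.
COST_PER_GB = {1980: 190_000.00, 1990: 9_000.00, 2000: 15.00, 2010: 0.07}


def cost_of(gigabytes: float, year: int) -> float:
    """Cost in USD of storing `gigabytes` at the given year's price."""
    return gigabytes * COST_PER_GB[year]


def decline_factor(start_year: int, end_year: int) -> float:
    """How many times cheaper one gigabyte became between two years."""
    return COST_PER_GB[start_year] / COST_PER_GB[end_year]


if __name__ == "__main__":
    for year in sorted(COST_PER_GB):
        # 1 TB is treated as 1,000 GB here (an illustrative assumption).
        print(f"{year}: 1 TB costs ${cost_of(1_000, year):,.2f}")
    print(f"1980 to 2010: about {decline_factor(1980, 2010):,.0f} times cheaper")
```

At these prices a terabyte drops from roughly 190 million dollars in 1980 to about 70 dollars in 2010, which is one reason keeping everything has become a realistic option.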
Common Scenarios
What is Big Data?
Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage
Hadoop
[Table: Traditional RDBMS vs. Hadoop, compared on Data Size, Access, Updates, Structure, Integrity, Scaling, and DBA Ratio]
HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable and reliable data storage that is designed to span large
clusters of commodity servers.
HDFS ≠ Database
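HDFS splits data across multiple machines and replicates it, with a name node tracking where each piece lives. The toy model below sketches that idea only. The round-robin placement, the tiny block size, the node names, and all function names are assumptions for illustration; this is not the placement policy HDFS actually uses.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Toy name-node table: block index -> nodes holding a copy (round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(table: Dict[int, List[str]], failed_node: str) -> Dict[int, List[str]]:
    """Copies of each block that remain after one node fails."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in table.items()}


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    table = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
    print(table)
    print(surviving_replicas(table, "node1"))
```

Because every block lives on several nodes, losing one node leaves at least one copy of each block available, which is the sense in which the storage is reliable on commodity servers.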
MapReduce
• MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
Processing function:
- Mapping
- Reducing
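The two processing functions can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names are hypothetical, and the grouping step stands in for the shuffle that a real framework performs between the map and reduce phases across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key (done by the framework in real clusters)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

Each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, which is what lets a cluster run many of them in parallel.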
How does it work?
[Diagram: four servers and a runtime]
How does it work?
Hadoop Ecosystem
• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / Sqoop / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Pipeline / workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Azure Storage Vault (ASV)
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop
solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
Storage
Two choices
• Azure Storage (Blob)
• File System
Demo
[Spinning up an HDInsight Cluster ;-)]
Now what?
Working with your HDInsight cluster - running jobs, importing and exporting data,
viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides a SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
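To give a feel for the kind of summarization query a SQL-like language expresses, the sketch below runs a GROUP BY over a small in-memory table. It uses Python's built-in sqlite3 module purely as a stand-in and does not connect to Hive; the logs table, its columns, and the function name are hypothetical. On a cluster, a statement of this shape would be submitted to Hive and executed over files in distributed storage instead.

```python
import sqlite3
from typing import Iterable, List, Tuple


def summarize_by_level(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count log rows per level with a SQL GROUP BY (SQLite stand-in, not Hive)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table: one row per log message.
        conn.execute("CREATE TABLE logs (level TEXT, message TEXT)")
        conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
        query = (
            "SELECT level, COUNT(*) AS n "
            "FROM logs GROUP BY level ORDER BY level"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("ERROR", "disk full"), ("INFO", "job started"), ("ERROR", "timeout")]
    print(summarize_by_level(sample))
```

The point of the comparison is that analysts who already know SQL can ask questions of data in Hadoop without writing MapReduce code by hand.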
What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores
http://sqoop.apache.org
Demo
[Query, Analyze, Transfer + Visual Studio Tools for HDInsight]
Data Flow
[Diagram labels: Hadoop, Data Analytics]
Demo
[Self-Service BI with Hive and Excel…]
Capabilities
• Machine Learning
• Graph Processing
• Distributed Compute
• Extract, Load, Transform
• Predictive Analysis
Data | Knowledge | Action
Summary
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F
What Questions Do You Have?
Thank You
For attending this session

Editor's Notes

  1. Key attributes: open source; highly scalable; runs on commodity hardware; redundant and reliable (no data loss); batch-processing centric, using the “Map-Reduce” processing paradigm.
  2. HDFS can replicate the data to multiple nodes, and it uses a name node daemon to track where the data is and how it is (or isn't) replicated. HDFS allows data to be split across multiple systems, which solves one problem in a large-scale data environment. But moving the data into various places creates another problem. How do you move the computing function to where the data is? Along comes MapReduce…
  3. The HDInsight service can actually access two types of storage: HDFS (as in standard Hadoop) and the Azure Storage system. When you store your data using HDFS, it's contained within the nodes of the cluster and it must be accessed through the HDFS API. When the cluster is decommissioned, the data is lost as well. The option of using Azure Storage provides several advantages: you can load the data using standard tools, you retain the data when you decommission the cluster, the cost is lower, and other processes in Azure or even from other cloud providers can access the data.
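
When a cluster reads from Azure Storage rather than cluster-local HDFS, jobs refer to data by a storage URI. The helper below is a minimal sketch that assumes the wasb / wasbs URI form described in the HDInsight documentation (container, then storage account, then path); the function name and the example account, container, and path are hypothetical.

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build an Azure Blob storage URI of the form used by HDInsight jobs.

    Assumed pattern: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    # Strip a leading slash so the result has exactly one separator before the path.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(wasb_uri("data", "myaccount", "/logs/2015/"))
```

Because the URI names the storage account rather than a cluster node, the same data stays addressable after the cluster is decommissioned and a new one is created.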