Introduction to HDInsight
Stéphane Fréchette
Saturday February 7, 2015
Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data
| NoSQL | Data Science. Drums, good food and fine wine.
Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that
matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A
“Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
data within a tolerable elapsed time…”
- Wikipedia
What is Big Data?
Many Options
Variability
What is Big Data?
[Figure: data sources plotted by Volume against Velocity - Variety]
• Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
• Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
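To put the storage cost per GB figures from this slide in perspective, the short sketch below computes how many times cheaper a gigabyte became from one decade to the next. The dollar values are the ones on the slide; the function name `decline_factors` is just an illustrative choice.

```python
# Storage cost per GB in US dollars, as listed on the slide.
COST_PER_GB = {1980: 190_000.0, 1990: 9_000.0, 2000: 15.0, 2010: 0.07}


def decline_factors(costs):
    """Return {(earlier_year, later_year): how many times cheaper storage became}."""
    years = sorted(costs)
    return {(a, b): costs[a] / costs[b] for a, b in zip(years, years[1:])}


if __name__ == "__main__":
    for (start, end), factor in decline_factors(COST_PER_GB).items():
        print(f"{start} -> {end}: about {factor:,.0f}x cheaper")
    overall = COST_PER_GB[1980] / COST_PER_GB[2010]
    print(f"1980 -> 2010: about {overall:,.0f}x cheaper")
```

Over the full thirty years the drop is on the order of millions, which is a large part of why keeping raw, high-volume data became practical.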
Common Scenarios
What is Big Data?
Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage
Hadoop
[Table: Traditional RDBMS vs. Hadoop, compared on the attributes below]
• Data Size
• Access
• Updates
• Structure
• Integrity
• Scaling
• DBA Ratio
HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, reliable data storage and is designed to span large
clusters of commodity servers.
HDFS ≠ Database
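HDFS stores a file by cutting it into blocks, copying each block to several nodes, and keeping a name node table of where the copies live. The sketch below is a toy model of that idea, not the HDFS implementation: the round-robin placement, the node names, and the tiny block size are all illustrative assumptions.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Toy name-node table: block index -> nodes holding a copy of that block.

    Round-robin placement is an assumption made for illustration only;
    real HDFS uses a rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    table = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    for block, holders in table.items():
        print(f"block {block} ({blocks[block]} bytes) -> {holders}")
```

Because every block lives on more than one node, losing a single server does not lose data, and processing can be scheduled on whichever node already holds a copy.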
MapReduce
• MapReduce is a software framework for easily writing applications that
process vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
Processing function:
- Mapping
- Reducing
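The two processing functions can be sketched in a few lines of plain Python. This is a single-process illustration of the mapping and reducing steps using word count as the example; it is not Hadoop's API, and the function names are hypothetical. In a real cluster the framework runs many mappers and reducers in parallel and performs the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducing: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is data"]))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what lets the work be spread across many machines.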
How it works?
[Diagram: Server, Server, Server, Server, Runtime]
How it works?
Hadoop Ecosystem
[Diagram of ecosystem components]
• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC/Sqoop/REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Pipeline/workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Azure Storage Vault (ASV)
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop
solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
Storage
Two choices
• Azure Storage (Blob)
• File System (HDFS)
Demo
[Spinning up an HDInsight Cluster ;-)]
Now what?
Working with your HDInsight cluster - running jobs, importing/exporting data,
viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
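To give a feel for the kind of summarization query HiveQL is used for, the sketch below runs a GROUP BY aggregation over a small web log table. It uses Python's built-in sqlite3 module purely as a local stand-in, not Hive itself; the table name `weblogs` and its columns are hypothetical. A query of this shape is also valid HiveQL, where Hive would run it as distributed jobs over files in the cluster.

```python
import sqlite3

# A summarization query of the shape HiveQL also accepts.
SUMMARY_QUERY = """
    SELECT page, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM weblogs
    GROUP BY page
    ORDER BY hits DESC, page
"""


def top_pages(rows):
    """Load (page, bytes) rows into an in-memory table and summarize them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE weblogs (page TEXT, bytes INTEGER)")
        conn.executemany("INSERT INTO weblogs VALUES (?, ?)", rows)
        return conn.execute(SUMMARY_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("/home", 100), ("/about", 50), ("/home", 200)]
    for page, hits, total_bytes in top_pages(sample):
        print(page, hits, total_bytes)
```

The point of the SQL-like surface is that analysts and BI tools can ask this kind of question without writing MapReduce code.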
What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
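A Pig Latin script is written as a data flow: load a data set, filter it, group it, then generate a result per group. The sketch below mirrors that LOAD, FILTER, GROUP, FOREACH ... GENERATE sequence in plain Python so the shape of the flow is visible. It is an analogy only, not Pig; the record fields `host` and `level` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def error_counts(records):
    """Count ERROR records per host, step by step like a Pig data flow."""
    # FILTER: keep only the records of interest.
    errors = [r for r in records if r["level"] == "ERROR"]
    # GROUP: itertools.groupby needs its input sorted by the grouping key.
    errors.sort(key=itemgetter("host"))
    grouped = groupby(errors, key=itemgetter("host"))
    # FOREACH ... GENERATE: produce one (host, count) result per group.
    return {host: sum(1 for _ in group) for host, group in grouped}


if __name__ == "__main__":
    logs = [
        {"host": "web1", "level": "ERROR"},
        {"host": "web2", "level": "INFO"},
        {"host": "web1", "level": "ERROR"},
        {"host": "web2", "level": "ERROR"},
    ]
    print(error_counts(logs))
```

In Pig each of these steps is one statement, and Pig compiles the whole flow into MapReduce jobs.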
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores
http://sqoop.apache.org
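As a rough idea of what a bulk transfer looks like, the sketch below assembles the argument list for a `sqoop import` invocation. The options `--connect`, `--username`, `--table`, `--target-dir`, and `--num-mappers` are standard Sqoop import options. The connection string, user, table, and directory are hypothetical placeholders. The helper only builds the command; it does not run Sqoop.

```python
def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=4):
    """Build the argument list for importing one relational table into Hadoop."""
    if mappers < 1:
        raise ValueError("mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,         # database login
        "--table", table,               # table to copy
        "--target-dir", target_dir,     # destination directory in the cluster's storage
        "--num-mappers", str(mappers),  # number of parallel map tasks doing the copy
    ]


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    args = sqoop_import_args(
        jdbc_url="jdbc:sqlserver://example-server:1433;databaseName=Sales",
        username="demo_user",
        table="Orders",
        target_dir="/data/orders",
    )
    print(" ".join(args))
```

The reverse direction, from Hadoop back to a relational table, uses `sqoop export` in the same style.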
Demo
[Query, Analyze, Transfer + Visual Studio Tools for HDInsight]
Hadoop Data Analytics
Data Flow
Demo
[Self-Service BI with Hive and Excel…]
Capabilities
[Diagram: Data, Knowledge, Action]
• Distributed Compute
• Extract Load Transform
• Machine Learning
• Graph Processing
• Predictive Analysis
Summary
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F
What Questions Do You Have?
Thank You
For attending this session


Editor's Notes

  • #9 Key attributes: open source; highly scalable; runs on commodity hardware; redundant and reliable (no data loss); batch-processing centric, using the “Map-Reduce” processing paradigm.
  • #11 HDFS can replicate the data to multiple nodes, and it uses a name node daemon to track where the data is and how it is (or isn't) replicated. HDFS allows data to be split across multiple systems, which solves one problem in a large-scale data environment. But moving the data into various places creates another problem. How do you move the computing function to where the data is? Along comes MapReduce…
  • #17 The HDInsight service can access two types of storage: HDFS (as in standard Hadoop) and Azure Storage. Data stored in HDFS lives on the nodes of the cluster and must be accessed through the HDFS API; when the cluster is decommissioned, that data is lost as well. Azure Storage offers several advantages: you can load the data using standard tools, the data survives decommissioning the cluster, it costs less, and other processes in Azure, or even from other cloud providers, can access the data.