Introduction to Azure HDInsight

Introduction to HDInsight
Stéphane Fréchette
Saturday, February 7, 2015

Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data |
NoSQL | Data Science. Drums, good food and fine wine.
Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that
matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com

Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A

“Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
data within a tolerable elapsed time…”
- Wikipedia

What is Big Data?
Volume | Velocity | Variety | Variability | Many Options
Data volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• Web 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Storage cost per GB: 1980 $190,000 | 1990 $9,000 | 2000 $15 | 2010 $0.07
What is Big Data?
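To make the storage figures concrete, here is a small arithmetic sketch that converts the slide's cost per gigabyte into the cost of one terabyte for each year. It assumes decimal units (1 TB = 1,000 GB); the function name is only illustrative.

```python
# Cost per gigabyte by year, as listed on the slide (US dollars).
COST_PER_GB = {1980: 190_000.0, 1990: 9_000.0, 2000: 15.0, 2010: 0.07}

GB_PER_TB = 1_000  # assumption: decimal units, 1 TB = 10^12 bytes = 1,000 GB


def cost_of_terabyte(year: int) -> float:
    """Return the cost in dollars of storing one terabyte in the given year."""
    return COST_PER_GB[year] * GB_PER_TB


for year in sorted(COST_PER_GB):
    print(f"{year}: ${cost_of_terabyte(year):,.2f} per TB")
```

Under these figures, a terabyte goes from roughly 190 million dollars in 1980 to about 70 dollars in 2010.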

Common Scenarios
What is Big Data?

Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage

[Table: Traditional RDBMS vs. Hadoop, compared on Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio]
Hadoop

HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, reliable data storage and is designed to span large
clusters of commodity servers.
HDFS ≠ Database
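One way to picture how a file system spans many commodity servers is to split a file into fixed-size blocks and keep several copies of each block on different nodes. The sketch below is a simplified illustration, not HDFS code: the 128 MB block size and replication factor of 3 are assumed defaults, the node names are made up, and the round-robin placement ignores the rack awareness a real cluster would use.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: a common Hadoop 2.x default block size
REPLICATION = 3      # assumption: a common default replication factor


def plan_blocks(file_size_mb, nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a list where entry i names the nodes holding copies of block i."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    block_count = math.ceil(file_size_mb / block_size_mb)
    placement = []
    for block in range(block_count):
        # Simplified round-robin placement; each replica lands on a distinct node.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append(replicas)
    return placement


# Hypothetical 1,000 MB file on a four-node cluster.
plan = plan_blocks(file_size_mb=1000, nodes=["node1", "node2", "node3", "node4"])
for index, replicas in enumerate(plan):
    print(f"block {index}: {', '.join(replicas)}")
```

Because every block lives on several nodes, losing one server does not lose data, which is the sense in which the storage is reliable.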

MapReduce
• MapReduce is a software framework for easily writing applications that
process vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
Processing functions:
- Mapping
- Reducing
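The two processing functions can be sketched in plain Python with the classic word-count example. This runs in a single process and only imitates the programming model; the function names are illustrative, and a real job would run the map and reduce steps on many nodes with the framework handling the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducing: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "hadoop is for big data"])
print(counts)
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what lets the work be spread across servers.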

[Diagram: Runtime, Server ×4]
How does it work?

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Query (Hive)
• Scripting (Pig)
• NoSQL Database (HBase)
• Metadata (HCatalog)
• Data Integration (ODBC / Sqoop / REST)
• Relational (SQL Server)
• Machine Learning (Mahout)
• Graph (Pegasus)
• Stats processing (RHadoop)
• Event Pipeline (Flume)
• Pipeline / workflow (Oozie)
• Active Directory (Security)
• Monitoring & Deployment (System Center)
• C#, F#, .NET
• PowerShell
• Azure Storage Vault (ASV)
• Business Intelligence (Excel, Power View, SSAS)
• World's Data (Azure Data Marketplace)
• Event Driven Processing
Legend: Red = Core Hadoop | Blue = Data processing | Purple = Microsoft integration points and value adds | Orange = Data Movement | Green = Packages
Hadoop Ecosystem

HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop
solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service

Storage
Azure Storage (Blob) | File System
Two choices
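When a cluster uses Azure Blob storage, data is typically addressed with a `wasb://` URI that names the container and the storage account. The helper below is a minimal sketch of that addressing scheme; the function name, container, account and path are hypothetical, and the URI layout is stated here as an assumption rather than taken from the slides.

```python
def wasb_uri(container, account, path, secure=False):
    """Build a Blob storage URI of the form wasb://container@account.../path."""
    scheme = "wasbs" if secure else "wasb"  # wasbs is the TLS variant
    host = f"{account}.blob.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


# Hypothetical container and storage account names.
print(wasb_uri("data", "myaccount", "/logs/2015/"))
```

Keeping data in Blob storage means it outlives the cluster, so a cluster can be deleted and recreated without reloading the data.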

Demo
[Spinning up an HDInsight Cluster ;-)]

Now what?
Working with your HDInsight cluster - running jobs, importing and exporting
data, viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others

What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
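To give a feel for the summarization queries HiveQL is used for, the sketch below runs a simple GROUP BY aggregate. It uses Python's built-in SQLite as a stand-in engine purely so the example is runnable; it does not connect to Hive. The `logs` table and its rows are made up, and the query text is plain SQL that should also be accepted as HiveQL.

```python
import sqlite3

# A summarization query of the kind HiveQL is designed for.
QUERY = """
SELECT level, COUNT(*) AS occurrences
FROM logs
GROUP BY level
ORDER BY occurrences DESC
"""


def summarize(rows):
    """Load rows into an in-memory table and run the aggregate query."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Hive connection
    conn.execute("CREATE TABLE logs (level TEXT, message TEXT)")
    conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
sample = [
    ("ERROR", "disk full"),
    ("INFO", "job started"),
    ("ERROR", "timeout"),
    ("WARN", "slow node"),
    ("ERROR", "retry failed"),
]
summary = summarize(sample)
print(summary)
```

On a cluster, Hive would turn a query like this into distributed jobs over files in cluster storage instead of scanning a local table.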

What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly
http://pig.apache.org
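A Pig Latin script is written as a sequence of named data-flow steps, and the sketch below mirrors that style in plain Python: load, filter, group, then count per group. The helper names loosely echo Pig's LOAD, FILTER, GROUP and FOREACH operators but are otherwise hypothetical, and the sample click data is invented.

```python
from itertools import groupby
from operator import itemgetter


def load(lines, delimiter="\t"):
    """LOAD step: parse each delimited line into a tuple of fields."""
    return [tuple(line.split(delimiter)) for line in lines]


def filter_by(relation, predicate):
    """FILTER step: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def group_by(relation, index):
    """GROUP step: collect rows that share the same value in one field."""
    ordered = sorted(relation, key=itemgetter(index))
    return {key: list(rows) for key, rows in groupby(ordered, key=itemgetter(index))}


def count_per_group(grouped):
    """FOREACH ... COUNT step: produce one count per group."""
    return {key: len(rows) for key, rows in grouped.items()}


# Hypothetical click data: user, then URL.
raw = ["alice\t/home", "bob\t/home", "alice\t/cart", "bob\t/admin", "alice\t/home"]
visits = load(raw)
public = filter_by(visits, lambda row: row[1] != "/admin")
by_user = group_by(public, 0)
hits = count_per_group(by_user)
print(hits)
```

In Pig, each of these steps is one short statement, and the platform decides how to turn the whole chain into MapReduce jobs.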

What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores
http://sqoop.apache.org
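Because Sqoop is driven from the command line, a transfer is described entirely by its arguments. The sketch below only assembles and prints an import command; it does not run Sqoop. The server, database, user, table and target directory are hypothetical, and credentials are deliberately left out.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=4):
    """Assemble the argument list for a Sqoop import of one table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "--table", table,               # relational table to copy into Hadoop
        "--target-dir", target_dir,     # destination directory in cluster storage
        "--num-mappers", str(mappers),  # number of parallel transfer tasks
    ]


# Hypothetical SQL Server source and destination path.
args = sqoop_import_args(
    jdbc_url="jdbc:sqlserver://example-server:1433;databaseName=sales",
    username="etl_user",
    table="orders",
    target_dir="/data/orders",
)
print(shlex.join(args))
```

Building the command as a list keeps each value a separate argument, so characters such as the semicolon in the JDBC URL are quoted safely when the command is printed or passed to a process launcher.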

Demo
[Query, Analyze, Transfer + Visual Studio Tools for HDInsight]

Hadoop Data Analytics
Data Flow

Demo
[Self-Service BI with Hive and Excel…]

• Machine Learning
• Graph Processing
• Distributed Compute
• Extract, Load, Transform
• Predictive Analysis
Capabilities

Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Microsoft Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Documentation & Tutorials http://bit.ly/LWRYol
• Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte
• Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH
• Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd
• Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1
• Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F

Thank You
For attending this session
