Introduction to Microsoft’s Hadoop solution (HDInsight)

James Serra
Big Data Evangelist
Microsoft
JamesSerra3@gmail.com

About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM
architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data
Platform Solutions
 Blog at JamesSerra.com
 Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012”

Why Deploy To the Cloud?
Microsoft’s Solution
How Do I Get Started?

What if you could handle big data?

Hadoop is a platform with portfolio of projects

A Hadoop distribution is a package of projects

Business applications of Hadoop

New analytic applications from new data

Main differences vs RDBMS/NoSQL
Pros
• Not a type of database, but rather a open-source software ecosystem that allows for massively
parallel computing
• No inherent structure (no conversion to relational or JSON needed)
• Good for batch processing, large files, volume writes, parallel scans, sequential access
• Great for large, distributed data processing tasks where time isn’t a constraint (i.e. end-of-day
reports, scanning months of historical data)
• Tradeoff: In order to make deep connections between many data points, the technology sacrifices
speed
• Some NoSQL databases such as HBase are built on top of HDFS

Main differences vs RDBMS/NoSQL
Cons
• File system, not a database
• Not good for millions of users, random access, fast individual record lookups or updates (OLTP)
• Not so great for real-time analytics
• Lacks: indexing, metadata layer, query optimizer, memory management
• Same cons at non-relational: no ACID support, data integrity, limited indexing, weak SQL, etc
• Security limitations

What Is Hadoop?

Challenges with implementing Hadoop

Why Cloud + Big Data?
Speed Scale Economics
Always Up,
Always On
Open and flexibleTime to value
Data of all Volume,
Variety, Velocity
Massive Compute
and Storage
Deployment
expertise

Scenarios For Deploying Hadoop As Hybrid

What Is Hadoop?

Microsoft contributions to Hadoop

HDInsight Built for Windows or Linux

HDInsight Supports HBase
Data Node Data Node Data Node Data Node
Task Tracker Task Tracker Task Tracker Task Tracker
Name Node
Job Tracker
HMaster
Coordination
Region Server Region Server Region Server Region Server

HDInsight Supports Storm
Stream
processin
g
Search and query
Data analytics (Excel)
Web/thick client
dashboards
Devices to take action
RabbitMQ /
ActiveMQ

Spark for Azure HDInsight
In Memory Processing on Multiple Workloads
Azure
HDInsight
In Memory
Spark
• Single execution model for multiple
tasks
• Processing up to 100x faster
performance
• Developer friendly (Java, Python, Scala)
• BI tool of choice (Power BI, Tabelau,
Qlik, SAP)
• Notebook experience (Jupyter/iPython,
Zeppelin)

HDInsight Allows You To Add Hadoop Projects

R Server for HDInsight
• Familiarity of R (most popular language for data
scientists)
• Scalability of Hadoop and Spark
• Up to 7x faster using Spark engine
• Train and run ML models on datasets of any size
• Cloud managed solution (easy setup, elastic,
SLA)

Hyper scale Infrastructure is the enabler
32 Regions Worldwide, 24 Generally Available…
 100+ datacenters
 Top 3 networks in the world
 2.5x AWS, 7x Google DC Regions
 G Series – Largest VM in World, 32 cores, 448GB Ram, SSD…
Operational
Announced/Not Operational
Central US
Iowa
West US
California
East US
Virginia
US Gov
Virginia
North Central US
Illinois
US Gov
Iowa
South Central US
Texas
Brazil South
Sao Paulo State
West Europe
Netherlands
China North *
Beijing
China South *
Shanghai
Japan East
Tokyo, Saitama
Japan West
Osaka
India South
Chennai
East Asia
Hong Kong
SE Asia
Singapore
Australia South East
Victoria
Australia East
New South Wales
India Central
Pune
Canada East
Quebec City
Canada Central
Toronto
India West
Mumbai
Germany North East **
Magdeburg
Germany Central **
Frankfurt
North Europe
Ireland
East US 2
Virginia
United Kingdom
RegionsUnited Kingdom
Regions
US DoD East
TBD
US DoD West
TBD
* Operated by 21Vianet ** Data Stewardship by Deutsche Telekom
Seoul
Korea (2)

Why Microsoft Azure?
Azure Storage

Mission Critical, Enterprise Ready

Low Cost
$£€¥
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

Bringing Hadoop to a billion people

Making advanced analytics accessible to Hadoop
Cloud

HDInsight vs HDP on Azure VM
HDInsight HDP on Azure VM
PaaS (setup, scale, manage, patch, etc) IaaS
Managed by Microsoft Managed by customer
Storage separate (Blob or ADLS) Storage in VM (local disk), but can also
have storage in Azure blob or ADLS
Delete VM keeps data Delete VM deletes data (unless external)
Up to 30-days behind latest HDP version Latest HDP Version
Limited Hadoop projects Unlimited Hadoop projects
Microsoft supports VM and Hadoop Microsoft: VM, HDP: Hadoop
No on-prem version On-prem version

Distributed, parallel analytics
framework
U-SQL (based on C# and SQL)
Dial for scale
Hides infrastructure complexity
Visual Studio integration
Instant scale on demand
Reduced learning curve
Azure Data Lake Analytics
Azure Services for big data analytics
YARN
HDFS
HDInsightAnalytics
Service
Store
Partners
U-SQL
Clickstream
Sensors
Video
Social
Web
Devices
Relational
Applications
56

What Is Hadoop?

Get Started
http://azure.microsoft.com/en-us/documentation/services/hdinsight/
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-learn-map/
http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-big-data
http://channel9.msdn.com/Shows/Data-Exposed
http://azure.microsoft.com/en-us/pricing/free-trial/

Azure getting started
• Free Azure account, $200 in credit, https://azure.microsoft.com/en-us/free/
• Startups: BizSpark, $750/month free Azure, BizSpark Plus - $120k/year free Azure,
https://www.microsoft.com/bizspark/
• MSDN subscription, Data Platform MVP, $150/month free Azure, https://azure.microsoft.com/en-
us/pricing/member-offers/msdn-benefits/
• Microsoft Educator Grant Program, faculty - $250/month free Azure for a year, students -
$100/month free Azure for 6 months, https://azure.microsoft.com/en-us/pricing/member-
offers/msdn-benefits/
• Microsoft Azure for Research Grant, http://research.microsoft.com/en-us/projects/azure/default.aspx
• DreamSpark for students, https://www.dreamspark.com/Student/Default.aspx
• DreamSpark for academic institutions: https://www.dreamspark.com/Institution/Subscription.aspx
• Various Microsoft funds so you can learn the technologies or build a client solution

Pricing for HDInsight
CAPABILITIES STANDARD PREMIUM PREVIEW
Big Data Workloads
Standard Hadoop and Open Source Projects
(Core Hadoop & YARN, Hive & HCatalog, Tez,
Pig, Sqoop, Oozie, Zookeeper, Phoenix)
Columnar NoSQL (HBase)
Stream processing (Storm)
Interactive processing, real-time stream
processing & ML (Spark)
Big Data statistics predictive modeling, and
machine learning with R Server
Enterprise Readiness
Administration – manage, monitor &
troubleshoot clusters
Hadoop version upgrades and patching –
Automatic patching and upgrades
Encryption of data at rest
Price Standard price per Node HDInsight Standard Price + $0.02/Core-hour for each core
used in the cluster during preview (75% discount)

Resources
 What is HDInsight? http://bit.ly/1WpS0at
 Hadoop and Microsoft http://bit.ly/20Cg2hA
 Introduction to Hadoop http://bit.ly/1WpTstq

Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted via the
“Presentations” link on the top menu)

Introduction to Microsoft’s Hadoop solution (HDInsight)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Microsoft’s Hadoop solution (HDInsight)

Similar to Introduction to Microsoft’s Hadoop solution (HDInsight) (20)

More from James Serra

More from James Serra (14)

Recently uploaded

Recently uploaded (20)

Introduction to Microsoft’s Hadoop solution (HDInsight)

Editor's Notes