HDInsight Essentials
ISBN: 1849695369 / ISBN 13: 9781849695367

Rajesh Nadipalli
05/01/2014
  
Goals of this Book

• Focus on Microsoft's new Hadoop distribution
• Serve as a quick reference
• Provide an overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiators
• Provide practical, real-world examples
  
Book Table of Contents

• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deploying HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from your cluster
• Chapter 8: Project Planning & Architectural Considerations
  
CHAPTER 1 HIGHLIGHTS:
HDINSIGHT IN A HEARTBEAT

Big Data Problem Characteristics
  	
  
Hadoop Overview

Self-healing distributed storage
Fault-tolerant distributed computing + abstraction for parallel processing

CORE HADOOP COMPONENTS
• HDFS: Distributed storage that is replicated, self-healing and scalable
• MapReduce: Parallel processing that works on local data for efficiency
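
The "replicated, self-healing" property of HDFS can be illustrated with a toy model. The sketch below is not HDFS code: the node names, the block IDs, the replication factor of 3 and the random placement (with no rack awareness) are all assumptions made for illustration. It only shows the idea that when a node is lost, surviving replicas are copied to other live nodes until each block is fully replicated again.

```python
import random

REPLICATION_FACTOR = 3  # assumed value for this sketch


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR, seed=0):
    """Assign each block to `replication` distinct nodes (toy placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}


def heal(placement, live_nodes, replication=REPLICATION_FACTOR, seed=0):
    """Drop replicas on dead nodes, then re-replicate onto other live nodes."""
    rng = random.Random(seed)
    live = set(live_nodes)
    healed = {}
    for block, holders in placement.items():
        survivors = holders & live
        if not survivors:
            raise RuntimeError(f"block {block} lost: no surviving replica")
        # Live nodes that do not yet hold this block are candidates for a new copy.
        candidates = sorted(live - survivors)
        missing = max(0, min(replication - len(survivors), len(candidates)))
        healed[block] = survivors | set(rng.sample(candidates, missing))
    return healed


# Hypothetical five-node cluster in which node "dn2" fails.
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)
after_failure = heal(placement, [n for n in nodes if n != "dn2"])
for block, holders in sorted(after_failure.items()):
    print(block, sorted(holders))
```

After the simulated failure every block is back at three replicas and none of them lives on the failed node.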
  	
  
	
  
Hadoop Nodes Layout

Master Node
• NameNode (distributed file system layer)
• JobTracker (MapReduce layer)

Secondary NameNode

Slave Nodes
• TaskTracker, TaskTracker, TaskTracker (MapReduce layer)
• DataNode, DataNode, DataNode (distributed file system layer)
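
Because each slave node runs both a TaskTracker and a DataNode, a map task can often be scheduled on a node that already stores its input block. The sketch below is a simplified illustration of that data-locality idea, not the JobTracker's actual scheduler: the function name, the slot counts and the node names are hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of free map slots (copied, not mutated)
    Returns block id -> (node, is_data_local).
    """
    slots = dict(free_slots)
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: run elsewhere and read over the network.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free map slots in the cluster")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignments[block] = (node, is_local)
    return assignments


# Hypothetical three-slave cluster; slave1 holds a replica of every block.
block_locations = {
    "blk_1": ["slave1", "slave2"],
    "blk_2": ["slave1", "slave3"],
    "blk_3": ["slave1"],
}
free_slots = {"slave1": 2, "slave2": 1, "slave3": 1}
plan = assign_tasks(block_locations, free_slots)
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

In this run the first two blocks are processed locally on slave1; the third has to run on another node because slave1 has no free slot left.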
  
Hadoop Eco System

Data Sources
• RDBMS databases
• Audio, images
• Log files
• Sensors, RFID
• Social media, feeds

Hadoop Data Store
• HDFS
• HBase (NoSQL DB)

Data Processing
• MapReduce

Data Access
• Hive
• Pig
• Mahout (machine learning)

• Flume, Sqoop
• Excel
• Business data feeds
• ZooKeeper (distributed process management)
• HCatalog (metadata on Pig, Hive, MapReduce)
• Oozie (workflow, scheduler)
• Infrastructure, operations (monitoring, configuration)
  
End to End Solution on Hadoop

• Collect & Import to HDFS
• Process (MapReduce)
• Analyze (BI Tools)
• Report & Publish
  
Popular Hadoop Distributions

• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)
  
HDInsight Differentiator

• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using the cloud offering: the Azure HDInsight service enables customers to scale quickly and has a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console
  
WordCount in HDInsight
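
The slide itself is a screenshot, so the following is only a sketch of what a WordCount job computes, written as plain Python rather than as a Hadoop job. The function names and the sample lines are made up for illustration; on a cluster the map and reduce steps would run as distributed tasks and the shuffle would be done by the framework.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


# Hypothetical input lines standing in for a file stored in HDFS.
lines = ["big data on Hadoop", "Hadoop on Azure", "big clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
print(word_counts)
```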
  
CHAPTER 2 HIGHLIGHTS:
HDINSIGHT INSTALL ON PREMISE

HDInsight Distribution

Apache Hadoop
• Open source software
• Community development

Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop

Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Power View, SQL Server, Active Directory
  
Physical Install Options

• Single node for development/test
• Multi node for production

Node roles shown: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker), DN / TT (DataNode / TaskTracker)
  	
  	
  
Multi Node Install Steps

• Pre-requisites
• Networking setup
• Remote scripting
• Firewall setup
• Software install (each node)
• Hadoop configuration
• Verification
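
One way to approach the firewall and verification steps is to check that the Hadoop service ports are reachable from another machine. The sketch below is an assumption-laden example, not the procedure from the book: the port numbers are the stock Hadoop 1.x web UI defaults and may differ in a given HDInsight install, and the helper names are hypothetical.

```python
import socket

# Stock Hadoop 1.x web UI ports (assumed; check your own configuration).
DEFAULT_PORTS = {
    "NameNode web UI": 50070,
    "JobTracker web UI": 50030,
    "DataNode web UI": 50075,
    "TaskTracker web UI": 50060,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_node(host, ports):
    """Return {service name: reachable?} for one node."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling `check_node("headnode.example.local", DEFAULT_PORTS)` (a hypothetical host name) would return a dictionary showing which services answered; a `False` entry points to a firewall rule, a stopped service or a non-default port.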
  
CHAPTER 3 HIGHLIGHTS:
HDINSIGHT AZURE SERVICE

Azure Cloud Service

• Create storage
• Create HDInsight cluster
  
CHAPTER 4 HIGHLIGHTS:
ADMINISTER YOUR CLUSTER

HDInsight Cluster Management

HDInsight Dashboard

NameNode Status

JobTracker Status
  
CHAPTER 5 HIGHLIGHTS:
INGEST DATA INTO YOUR CLUSTER

Loading Data into your Cluster

You have the following options:

• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your cluster
• Loading data from RDBMS via Sqoop
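
Three of the options above come down to command lines or storage URIs. The sketch below only builds the argument lists and does not run anything. `hadoop fs -put` and the Sqoop flags used (`--connect`, `--table`, `--target-dir`, `--username`, `-P`) are standard. The `asv://` URI layout is an assumption based on early HDInsight releases, and all paths, server names and account names are hypothetical.

```python
def hadoop_put(local_path, hdfs_path):
    """Argument list for copying a local file into HDFS with `hadoop fs -put`."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def asv_uri(container, account, path):
    """ASV-style URI for a blob (layout assumed from early HDInsight releases)."""
    return f"asv://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def sqoop_import(jdbc_url, table, target_dir, username):
    """Argument list for importing one RDBMS table into HDFS with Sqoop."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--username", username,
        "-P",  # prompt for the password instead of passing it on the command line
    ]


# Hypothetical values, for illustration only.
print(" ".join(hadoop_put("C:/data/sales.csv", "/user/demo/sales.csv")))
print(asv_uri("mycontainer", "mystorage", "/example/data/sales.csv"))
print(" ".join(sqoop_import(
    "jdbc:sqlserver://dbhost.example.local;database=Sales",
    "Orders", "/user/demo/orders", "demo_user",
)))
```

Returning lists rather than joined strings keeps each argument separate, which is the form `subprocess.run` expects if the commands are later executed from Python.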
  
Loading via Azure Storage Explorer
  
CHAPTER 6 HIGHLIGHTS:
TRANSFORM YOUR DATA

Transforming Data

You have the following options:

• MapReduce
• Hive
• Pig
• Others
  
Processing Data in Cluster

Map for Jan2012
Map for Feb2012
Map for Apr2013
…
One Reducer
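
The diagram shows one map task per monthly input and a single reducer. The slide does not say what is being computed, so the sketch below assumes a simple total per month and a grand total. The record layout and the amounts are invented for illustration.

```python
def map_month(month, records):
    """Map task for one month's file: emit a single (month, partial total) pair."""
    return month, sum(amount for _, amount in records)


def reduce_all(partials):
    """The single reducer: combine every map task's partial total."""
    per_month = dict(partials)
    return per_month, sum(per_month.values())


# Hypothetical input: one list of (record id, amount) pairs per monthly file.
monthly_files = {
    "Jan2012": [("a", 10), ("b", 5)],
    "Feb2012": [("c", 7)],
    "Apr2013": [("d", 3), ("e", 4)],
}
partials = [map_month(month, records) for month, records in monthly_files.items()]
per_month, grand_total = reduce_all(partials)
print(per_month, grand_total)
```

Each map call depends only on its own month, which is what lets the map tasks run in parallel, while the single reducer is the one point where all partial results meet.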

Hive

• Command Line
• Web GUI
• JDBC/ODBC
• Thrift Server
• Driver (Parser, Planner, Executor)
• Metastore
• MapReduce
• HDFS
  
Hive or Pig?

Raw Data in HDFS
• Distributed storage
• Reliable

Data Processing via Pig
• Pipelines
• Iterative processing
• Research

Data Warehouse via Hive
• BI tools
• Analysis

Diagram labels: Data Warehouse, HDFS
  
CHAPTER 7 HIGHLIGHTS:
ANALYZE & REPORT

Analyze using Excel
  
CHAPTER 8:
PROJECT PLANNING & ARCHITECTURAL CONSIDERATIONS

• Executive & Stakeholder Buy-in
• Discovery & Analysis
• Design
• Implementation
• User Acceptance
• Production Operations
• Feedback, New Requirements
  

HDInsight Essentials: Hadoop on Microsoft Platform
