Hadoop in the Cloud - The what, why and how from the experts

Published in: Technology

  1. Hadoop in the cloud – The what, why and how from the experts. Nishant Thacker, Technical Product Manager – Big Data, Microsoft, @nishantthacker
  2. Hadoop in the Cloud
  3. Hadoop in the Cloud
  4. Traditional Hadoop Clusters
  5. Challenges with implementing Hadoop
  6. Hadoop Clusters in the Cloud
  7. Why Hadoop in the cloud?
  8. Hadoop Clusters
     • Distributed Storage: files split across storage; files replicated; nearest node responds; abstracted administration
     • Distributed Compute: distributed processing; resource utilization; cost-efficient method calls
     • Automated Failover: unmonitored failover to replicated data; built for resiliency; metadata stored for later retrieval
     • Hyper-Scale: add resources as desired; built to include commodity configs; direct correlation of performance and resources
     • Extensible: APIs to extend functionality; add new capabilities; allow for inclusion in custom environments
  9. Cloud (the same five characteristics and bullets as slide 8)
  10. Hadoop in the Cloud (the same five characteristics and bullets as slide 8)
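
The distributed-storage bullets on slides 8 to 10 (files split across storage, files replicated, nearest node responds) can be illustrated with a small sketch. This is not HDFS's actual placement policy: the round-robin placement, the function names, and the `distance` map are all assumptions made for illustration.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only; real systems also
    consider racks, free space, and load.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def nearest_replica(replicas, distance):
    """Pick the replica node with the smallest distance to the reader."""
    return min(replicas, key=lambda node: distance[node])


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
# Distances from a hypothetical client to each node (smaller is closer).
distance = {"n1": 5, "n2": 3, "n3": 1, "n4": 2}
reader_choice = nearest_replica(placement[1], distance)
```

If one node holding a block fails, a reader can fall back to any other node in that block's replica list, which is the idea behind the "unmonitored failover to replicated data" bullet.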
  11. Hadoop in the Cloud
  12. Hadoop in the Cloud - Options
  13. Scenarios for deploying Hadoop as hybrid
  14. Traditional Hadoop Clusters – On Prem (diagram labels: Client; Job (jar) file; Master Node; Task Tracker; Hadoop Cluster; Worker Node; Tasks; HDFS)
  15. Hadoop Clusters in the Cloud
  16. Azure HDInsight: Hadoop and Spark as a Service on Azure
     • Fully managed Hadoop and Spark for the cloud
     • 100% open source Hortonworks Data Platform
     • Clusters up and running in minutes
     • Managed, monitored and supported by Microsoft with the industry’s best enterprise SLA
     • Use familiar BI tools for analysis, or open source notebooks for interactive data science
     • 63% lower total cost of ownership than deploying your own Hadoop on-premises*
     * IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  17. HDInsight Cluster Architecture (diagram labels: HTTPS traffic; Secure gateway with AuthN and HTTP Proxy; ODBC/JDBC; WebHCatalog; Oozie; Ambari; Azure VNet; Highly available Head nodes; Worker nodes; ADLS)
  18. Decoupling Compute from Storage (over the network): Latency? Consistency? Bandwidth?
  19. Decoupling Compute from Storage (over the network): HDD-like latency; 50 Tb+ aggregate bandwidth [1]; strong consistency. [1] Azure Flat Network Architecture
  20. Decoupling - Benefits
  21. Azure Data Lake Store: a hyper scale repository for big data analytics workloads
     • Hadoop File System (HDFS) for the cloud
     • No limits to scale
     • Store any data in its native format
     • Enterprise grade access control and encryption
     • Optimized for analytic workload performance
  22. HDInsight cluster provisioning states and cluster customization options
     • Provisioning states: Ready for deployment; Accepted; Cluster storage provisioned; Azure VM configuration; Configuring HDInsight; Cluster customization (custom script running); Cluster operational; Running; Timed Out; Error
     • Decision point: Customize cluster? (No / Yes)
     • Customization via Azure portal: Hive/Oozie Metastore; Storage accounts & VNETs; ScriptAction
     • Customization via scripting / SDK: Config values; JAR file placement in cluster
     • Ad hoc customization: RDP to cluster, update config files (non-durable)
  23. Cluster integration options: each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL
     • /thrift: ODBC & JDBC
     • /Templeton: job submission, metadata management
     • /ambari: cluster health, monitoring
     • /oozie: job orchestration, scheduling
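
As a rough sketch of what "basic authN over SSL" against the endpoints on slide 23 could look like from a client, the snippet below builds (but does not send) an HTTPS request with a Basic `Authorization` header. The path prefixes come from the slide; the host name, credentials, and the `build_request` helper are hypothetical, and real clusters may expect longer, service-specific paths.

```python
import base64
import urllib.request

# Path prefixes as listed on the slide.
ENDPOINTS = {
    "thrift": "/thrift",        # ODBC & JDBC
    "templeton": "/Templeton",  # job submission, metadata management
    "ambari": "/ambari",        # cluster health, monitoring
    "oozie": "/oozie",          # job orchestration, scheduling
}


def build_request(base_url, service, user, password):
    """Return an unsent urllib Request with HTTP Basic credentials attached."""
    # Basic auth only encodes the credentials, so refuse to attach them
    # to anything that is not HTTPS.
    if not base_url.lower().startswith("https://"):
        raise ValueError("basic authentication should only be sent over HTTPS")
    url = base_url.rstrip("/") + ENDPOINTS[service]
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder host and credentials, assumed for illustration.
req = build_request("https://example-cluster.invalid", "ambari", "admin", "secret")
```

Passing `req` to `urllib.request.urlopen` would actually send it; that step is left out here because the host is a placeholder.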
  24. Hadoop in the Cloud
  25. Cloud Deployments for Big Data
  26. Introducing Cortana Intelligence Suite (diagram, from data to action)
     • Data Sources: Apps; Sensors and devices; Data
     • Information Management: Event Hubs; Data Catalog; Data Factory
     • Big Data Stores: SQL Data Warehouse; Data Lake Store
     • Machine Learning and Analytics: HDInsight (Hadoop and Spark); Stream Analytics; Data Lake Analytics; Machine Learning
     • Intelligence: Cortana; Bot Framework; Cognitive Services
     • Dashboards & Visualizations: Power BI
     • Action: People; Automated Systems; Apps; Web; Mobile; Bots
  27. Where Big Data is a cornerstone (the same Cortana Intelligence Suite diagram as slide 26)
  28. Diagram labels
     • Axes: Data Generation; Streaming; Storage; Processing; Consumption; Operational; Analytical/Exploratory; Data Warehouse
     • Sources: Big Data Sources (Raw Unstructured); Log files; Transactional systems; Azure Website
     • Components: Azure Event Hubs; Kafka for Azure HDInsight (future); Storm for Azure HDInsight; Azure Stream Analytics; Spark Streaming for Azure HDInsight; Azure Data Lake Store; HBase on Azure HDInsight; Azure SQL DW; SQL Server APS; Azure Data Lake Analytics; U-SQL; HIVE; HiveQL; Pig; Sqoop; Mahout; Spark SQL; Spark MLlib; Azure Machine Learning; R Server; SQL Server R Services; ETL; SQL Server Integration Services; SSAS; SSRS; Excel BI; Power BI; SharePoint BI
     • Data Orchestration/Workflow: Azure Data Factory; Oozie for Azure HDInsight
  29. Summary
  30. Further information
     • For more information on HDInsight visit: http://azure.com/hdinsight
     • For more information on Data Lake visit: http://azure.com/datalake
  31. nishant.thacker@microsoft.com
  32. © 2016 Microsoft Corporation. All rights reserved.
