Introduction to Azure HDInsight

Description

Apache Hadoop is a platform that has emerged to help extract insight from large volumes of data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.

Presented @ MSDEVMTL on Saturday, February 7, 2015

Transcript

  1. Introduction to HDInsight. Stéphane Fréchette. Saturday, February 7, 2015
  2. Who am I? My name is Stéphane Fréchette. SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau. I have a passion for architecting, designing and building solutions that matter. Twitter: @sfrechette Blog: stephanefrechette.com Email: stephanefrechette@ukubu.com
  3. Topics • What is Big Data? • Apache Hadoop • Hadoop Ecosystem • Microsoft Azure HDInsight • Demos • Summary • Resources • Q&A
  4. “Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…” - Wikipedia
  5. What is Big Data? Many Options; Variability
  6. What is Big Data? [Diagram] Volume axis: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18). Velocity - Variety axis, by data source: ERP / CRM (Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking); Web 2.0 (Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations); Internet of things (Audio / Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates). Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07
  7. What is Big Data? Common Scenarios
  8. Hadoop • Apache Hadoop is for big data • Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • Designed to scale up from single servers to thousands of machines, each offering local computation and storage
  9. Hadoop [Comparison table] Traditional RDBMS vs. Hadoop, compared by: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio
  10. HDFS • Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers. HDFS ≠ Database
  11. MapReduce • MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Processing functions: Mapping, Reducing
  12. How it works?
  13. How it works? [Diagram] Runtime; Server, Server, Server, Server
  14. Hadoop Ecosystem [Diagram] Distributed Storage (HDFS); Query (Hive); Distributed Processing (MapReduce); Scripting (Pig); NoSQL Database (HBase); Metadata (HCatalog); Data Integration (ODBC/Sqoop/REST); Relational (SQL Server); Machine Learning (Mahout); Graph (Pegasus); Stats processing (RHadoop); Event Pipeline (Flume); Active Directory (Security); Monitoring & Deployment (System Center); C#, F#, .NET; PowerShell; Pipeline/workflow (Oozie); Azure Storage Vault (ASV); Business Intelligence (Excel, Power View, SSAS); World's Data (Azure Data Marketplace); Event Driven Processing. Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
  15. HDInsight • HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform • Based on the Hortonworks Data Platform (HDP) • Scalable, on-demand service
  16. Storage. Two choices: Azure Storage (Blob); File System
  17. Demo [Spinning up a HDInsight Cluster ;-)]
  18. Now what? Working with your HDInsight cluster - running jobs, import/export data, viewing and consuming data… • .NET • Java • Pig • Hive • Sqoop • Excel • Others
  19. What is Hive? • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis • Provides an SQL-like language called HiveQL to query data • Integration between Hadoop and BI and visualization tools http://hive.apache.org
  20. What is Pig? • Write complex MapReduce jobs using a simple script language (Pig Latin) • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Pig translates and compiles complex MapReduce jobs on the fly http://pig.apache.org
  21. What is Sqoop? • Command-line interface application to transfer bulk data between Hadoop and relational datastores http://sqoop.apache.org
  22. Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]
  23. Data Flow [Diagram] Hadoop; Data Analytics
  24. Demo [Self-Service BI with Hive and Excel…]
  25. Capabilities: Machine Learning; Graph Processing; Distributed Compute; Extract, Load, Transform; Predictive Analysis
  26. Summary: Data, Knowledge, Action
  27. Resources • Apache Projects (list with links) http://bit.ly/MfpLtE • Microsoft Azure HDInsight http://bit.ly/1dnlAX1 • HDInsight Documentation & Tutorials http://bit.ly/LWRYol • Hortonworks Sandbox 2.2 & Tutorials http://bit.ly/1gkkCte • Cloudera VMs CDH 5.3.x http://bit.ly/1ENWgHH • Microsoft JDBC Driver 4.1 | 4.0 for SQL Server http://bit.ly/1kEgJ7O • Microsoft Hive ODBC Driver http://bit.ly/NFkhcH • Getting Started with Big Data (MVA) http://bit.ly/1wU90Xd • Big Data and Business Analytics Immersion v3.1 (MVA) http://bit.ly/1unvvX1 • Introducing Microsoft Azure HDInsight (free e-book) http://bit.ly/1JOPe5F
  28. What Questions Do You Have?
  29. Thank you for attending this session
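
The two processing functions named on the MapReduce slide (mapping and reducing) can be illustrated with a minimal, single-process sketch. This is not Hadoop code and not something shown in the session: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration. A real cluster runs the same phases in parallel across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "hadoop processes big data"]
    print(word_count(sample))
```

Because each mapper only sees its own line and each reducer only sees one key, both phases could be spread over separate machines without changing the result, which is the property the framework relies on.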

Editor's Notes

  • Key attributes:
    Open source
    Highly scalable
    Runs on commodity hardware
    Redundant and reliable (no data loss)
    Batch processing centric – using “Map-Reduce” processing paradigm
  • HDFS can replicate the data to multiple nodes, and it uses a name node daemon to track where the data is and how it is (or isn't) replicated.

    HDFS allows data to be split across multiple systems, which solves one problem in a large-scale data environment. But moving the data into various places creates another problem. How do you move the computing function to where the data is?

    Along comes MapReduce…
  • The HDInsight service can access two types of storage: HDFS (as in standard Hadoop) and the Azure Storage system. When you store your data using HDFS, it is contained within the nodes of the cluster and must be accessed through the HDFS API. When the cluster is decommissioned, the data is lost as well. Using Azure Storage instead provides several advantages: you can load the data using standard tools, retain the data when you decommission the cluster, pay less for storage, and let other processes in Azure, or even from other cloud providers, access the data.
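
As a small illustration of the storage note above: data kept in Azure Storage is addressed by a URI rather than by a path inside the cluster, which is why it outlives the cluster. The sketch below builds such a URI. The `wasb://` and `wasbs://` schemes are an assumption here, since the slides only mention Azure Storage Vault (ASV); they are the schemes used by Hadoop's Azure Blob connector. The helper name `wasb_uri` and the account and container names are hypothetical.

```python
def wasb_uri(account, container, path="", secure=True):
    """Build a blob URI of the form scheme://container@account.blob.core.windows.net/path.

    The scheme names are an assumption (not taken from the slides);
    "wasbs" is the TLS variant of "wasb".
    """
    scheme = "wasbs" if secure else "wasb"
    host = f"{account}.blob.core.windows.net"
    return f"{scheme}://{container}@{host}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical storage account and container names.
    print(wasb_uri("examplestore", "clusterdata", "/logs/2015/02/07.log"))
```

A job would be pointed at a URI like this instead of a cluster-local HDFS path, so the same data stays reachable after the cluster is deleted and recreated.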