
Introduction to Azure HDInsight



Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.

Presented @ MSDEVMTL on Saturday, February 7, 2015

Published in: Technology


  1. Introduction to HDInsight. Stéphane Fréchette. Saturday, February 7, 2015
  2. Who am I? My name is Stéphane Fréchette. SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. Founder @TEDxGatineau. I have a passion for architecting, designing and building solutions that matter. Twitter: @sfrechette Blog: Email:
  3. Topics • What is Big Data? • Apache Hadoop • Hadoop Ecosystem • Microsoft Azure HDInsight • Demos • Summary • Resources • Q&A
  4. “Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time…” - Wikipedia
  5. What is Big Data? [Diagram labels: Many Options, Variability]
  6. What is Big Data? Volume, Velocity, Variety. Data sources by layer: ERP / CRM (Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking); Web 2.0 (Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations); Internet of Things (Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates). Volume scale: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18). Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07.
  7. What is Big Data? Common Scenarios
  8. Hadoop • Apache Hadoop is for big data • Open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models • Designed to scale up from single servers to thousands of machines, each offering local computation and storage
  9. Hadoop: Traditional RDBMS vs. Hadoop [Comparison table with rows: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio]
  10. HDFS • Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers. HDFS ≠ Database
  11. MapReduce • MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Processing functions: - Mapping - Reducing
  12. How it works?
  13. How it works? [Diagram: four servers and a runtime]
  14. Hadoop Ecosystem • Distributed Storage (HDFS) • Query (Hive) • Distributed Processing (MapReduce) • Scripting (Pig) • NoSQL Database (HBase) • Metadata (HCatalog) • Data Integration (ODBC/SQOOP/REST) • Relational (SQL Server) • Machine Learning (Mahout) • Graph (Pegasus) • Stats Processing (RHadoop) • Event Pipeline (Flume) • Active Directory (Security) • Monitoring & Deployment (System Center) • C#, F#, .NET • PowerShell • Pipeline/Workflow (Oozie) • Azure Storage Vault (ASV) • Business Intelligence (Excel, Power View, SSAS) • World's Data (Azure Data Marketplace) • Event-Driven Processing. Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
  15. HDInsight • HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform • Based on the Hortonworks Data Platform (HDP) • Scalable, on-demand service
  16. Storage: two choices • Azure Storage (Blob) • File System
  17. Demo [Spinning up an HDInsight Cluster ;-)]
  18. Now what? Working with your HDInsight cluster: running jobs, importing/exporting data, viewing and consuming data… • .NET • Java • Pig • Hive • Sqoop • Excel • Others
  19. What is Hive? • A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis • Provides an SQL-like language called HiveQL to query data • Integration between Hadoop and BI and visualization tools
  20. What is Pig? • Write complex MapReduce jobs using a simple scripting language (Pig Latin) • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs • Pig translates and compiles complex MapReduce jobs on the fly
  21. What is Sqoop? • Command-line interface application to transfer bulk data between Hadoop and relational datastores
  22. Demo [Query, Analyze, Transfer + Visual Studio Tools for HDInsight]
  23. Hadoop Data Analytics Data Flow
  24. Demo [Self-Service BI with Hive and Excel…]
  25. Capabilities • Machine Learning • Graph Processing • Distributed Compute • Extract Load Transform • Predictive Analysis
  26. Summary • Data • Knowledge • Action
  27. Resources • Apache Projects (list with links) • Microsoft Azure HDInsight • HDInsight Documentation & Tutorials • Hortonworks Sandbox 2.2 & Tutorials • Cloudera VMs CDH 5.3.x • Microsoft JDBC Driver 4.1 | 4.0 for SQL Server • Microsoft Hive ODBC Driver • Getting Started with Big Data (MVA) • Big Data and Business Analytics Immersion v3.1 (MVA) • Introducing Microsoft Azure HDInsight (free e-book)
  28. What Questions Do You Have?
  29. Thank you for attending this session
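
The mapping and reducing functions named on the MapReduce slide can be illustrated with a minimal, single-process word count in plain Python. This is only a sketch of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, as well as the sample lines, are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS or Azure Blob storage.
    sample = ["big data needs big clusters", "Hadoop processes big data"]
    print(word_count(sample))
```

Each phase depends only on its own input, which is why the same logic can be split across many machines: map calls can run on different lines independently, and reduce calls can run on different keys independently.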