Hadoop on Azure, Blue elephants

  1. One elephant went out to play, Azure way • Orlando Code Camp, 2013 • Ovidiu Dimulescu • @odimulescu • speakerdeck.com/odimulescu
  2. Agenda • Overview • Installation • Azure story • .Net Integration • MapReduce • Q&A
  3. About @odimulescu • Working on the Web since 1997 • Organizer for JaxMUG.com • Co-Organizer for Jax Big Data meetup
  4. What is Hadoop? Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware
  5. What is Hadoop solving, and how? Processing diverse, large datasets in practical time at low cost • Consolidates data in a distributed file system • Moves computation to data rather than data to computation • Simplifies the programming model
  6. Why does it matter? • Volume - Datasets outgrow local HDDs, let alone RAM • Velocity - Data grows at a tremendous pace • Variety - Data is heterogeneous • Value - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.) and has a ceiling (physical, technical, etc.)
  7. Why does it matter? Data types • Complex data (80%): images, video, logs, documents, call records, sensor data, mail archives • Structured data (20%): user profiles, CRM, HR records • * Chart source: IDC White Paper
  8. Use cases • ETL • Pattern Recognition • Recommendation Engines • Prediction Models • Log Processing • Data “sandbox”
  9. Who uses it?
  10. Who supports it?
  11. When not to use? • Not a database replacement • Not a data warehouse; it complements one • Not for interactive reporting • Not a general-purpose storage mechanism • Not for problems that cannot be parallelized in a shared-nothing fashion
  12. Architecture – Core Components • HDFS: Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster • MapReduce: Simpler programming model for processing and generating large data sets
  13. Architecture - HDFS • Read path: the client asks the Namenode (NN) for a file; the NN returns the Datanodes (DNs) that have it; the client asks a DN for the data • Namenode - Master: filesystem metadata; file R/W control; block replication • Datanode - Slaves (Datanode 1 to N): block R/W for clients; replicates blocks as directed by the master; notifies the master about block IDs
  14. Architecture - MapReduce • The client starts a job through the jobs API on the JobTracker (JT) • JobTracker - Master: accepts MR jobs submitted by clients; assigns MR tasks to TaskTrackers; monitors tasks and TaskTracker status and re-executes tasks upon failure; speculative execution • TaskTracker - Slaves (TaskTracker 1 to N): runs MR tasks received from the JobTracker; manages storage and transmission of intermediate output
  15. Architecture - Core Hadoop • [Diagram: jobs API with JobTracker and TaskTracker 1 to N; HDFS with NameNode and DataNode 1 to N] • * Mini OS: Filesystem & Scheduler
  16. Hadoop - Ecosystem • Management: ZooKeeper, Chukwa, Ambari, HUE • Data Access: Pig, Hive, Sqoop, Impala, Stinger • Data Processing: MapReduce, Giraph, Hama, Mahout • Storage: HDFS, HBase
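The read path on this slide can be sketched as a toy in-memory model. This is only an illustration of the division of labor (metadata on the Namenode, block data on the Datanodes), not the real HDFS client protocol; the class names, method names, block IDs, and file path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy Datanode: stores block contents keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)

    def read_block(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Toy Namenode: holds only metadata, never file contents."""
    files: dict = field(default_factory=dict)      # path -> ordered block ids
    locations: dict = field(default_factory=dict)  # block id -> Datanode names

    def get_block_locations(self, path):
        return [(b, self.locations[b]) for b in self.files[path]]


def read_file(path, namenode, datanodes):
    """Ask the Namenode where the blocks live, then fetch them from Datanodes."""
    chunks = []
    for block_id, holders in namenode.get_block_locations(path):
        # Any replica will do; this sketch simply takes the first one listed.
        chunks.append(datanodes[holders[0]].read_block(block_id))
    return b"".join(chunks)


# Hypothetical two-node cluster holding one two-block file.
dn1 = DataNode("dn1", {"blk_1": b"hello ", "blk_2": b"world"})
dn2 = DataNode("dn2", {"blk_2": b"world"})
nn = NameNode(
    files={"/logs/a.txt": ["blk_1", "blk_2"]},
    locations={"blk_1": ["dn1"], "blk_2": ["dn2", "dn1"]},
)
cluster = {"dn1": dn1, "dn2": dn2}
print(read_file("/logs/a.txt", nn, cluster))
```

Note that the file bytes never pass through the Namenode in this model; it only answers the "where" question.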
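The "re-executes tasks upon failure" responsibility can be illustrated with a small scheduling loop. This is a simplified sketch, assuming round-robin assignment and a failure signaled by `RuntimeError`; the real JobTracker uses heartbeats and data locality, and the names `run_job` and `flaky_run` are hypothetical.

```python
def run_job(tasks, trackers, run_task, max_attempts=3):
    """Assign each task to a tracker; on failure, re-run it on the next tracker."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Round-robin choice, shifted by one tracker on every retry.
            tracker = trackers[(i + attempt) % len(trackers)]
            try:
                results[task] = run_task(tracker, task)
                break
            except RuntimeError:
                continue  # this tracker failed; try another one
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results


def flaky_run(tracker, task):
    """Hypothetical task runner in which tracker tt2 is always down."""
    if tracker == "tt2":
        raise RuntimeError("tt2 is down")
    return f"{task} done on {tracker}"


results = run_job(["map-0", "map-1", "map-2"], ["tt1", "tt2", "tt3"], flaky_run)
print(results)
```

Here `map-1` is first assigned to the failing tracker and then completes on `tt3`, which is the behavior the slide attributes to the master.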
  18. Installation - Platform Notes • Production: Linux (official) • Development: Linux, OSX, Windows via Cygwin, * other Unixes
  19. Installation • 1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html • 2. Download a demo VM: Cloudera, Hortonworks, MapR, etc. • 3. Download MS HDInsight Server • 4. Cloud: Amazon EMR, Azure HDInsight Service
  20. Hadoop - Azure Story • Name: Windows Azure HDInsight Service • Where: Hadoop on Azure dot com • Status: Public Preview • * On-premises: Microsoft HDInsight Server
  21. Hadoop - Azure Story
  22. Hadoop - Azure Story
  23. Hadoop - Azure Story
  24. Hadoop - Azure Story
  25. Hadoop - Azure Story
  26. HDFS - .Net access • Microsoft Distribution of Hadoop • C library for HDFS file access • Hadoop .Net HDFS File Access • Managed C++ Solution
  27. HDFS - .Net access
  28. Hadoop .Net SDK • hadoopsdk.codeplex.com • MapReduce • LINQ to Hive • WebHDFS Client
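A WebHDFS client ultimately issues REST calls of the form `/webhdfs/v1/<path>?op=<OPERATION>` against the cluster. The sketch below only builds such URLs and makes no network request. It is written in Python rather than with the .Net SDK, and the helper name `webhdfs_url`, the host name, and the default port 50070 are assumptions for illustration; a given cluster, including HDInsight, may expose a different endpoint.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS REST URL for an HDFS path and an operation name."""
    query = urlencode({"op": op, **(params or {})})
    # quote() leaves "/" intact, so the HDFS path keeps its structure.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and paths.
open_url = webhdfs_url("namenode.example.com", "/user/demo/input.txt", "OPEN")
list_url = webhdfs_url(
    "namenode.example.com", "/user/demo", "LISTSTATUS", params={"user.name": "demo"}
)
print(open_url)
print(list_url)
```

The same URL shape is what any HTTP-capable language can call, which is why WebHDFS is a common integration point for non-Java clients.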
  29. Hadoop Integration • ODBC Driver: Excel, PowerPivot, other BI tools • Connector for Hadoop: import / export via SQOOP
  30. slideshare.net/esaliya/mapreduce-in-simple-terms by Saliya Ekanayake
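For readers without the linked deck at hand, the MapReduce model can be sketched as three plain functions run in a single process. This is a teaching sketch only: nothing here is distributed, and the function names and sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    return sum(counts)


lines = ["one elephant went out to play", "one blue elephant"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```

In Hadoop the programmer supplies only the equivalents of `word_mapper` and `count_reducer`; the framework handles the grouping step and spreads the work across the cluster.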
  31. MapReduce - Clients • Java - Native: hadoop jar jar_path main_class input_path output_path • C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program • Any - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out • Also: Pig Latin, Hive HQL, C via JNI
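When a streaming job is launched from a script, it can help to assemble the command as an argument list rather than a single string. The sketch below rebuilds the streaming command from this slide; the helper name `streaming_command` is hypothetical, and the jar path and placeholder program names are taken from the slide as-is, so they would need adjusting for a real installation.

```python
import shlex


def streaming_command(mapper, reducer, input_path, output_path,
                      jar="hadoop-streaming.jar"):
    """Return the argv list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", jar,
        "-mapper", mapper,
        "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


argv = streaming_command("map_prog", "reduce_prog", "path_in", "path_out")
# shlex.join (Python 3.8+) renders the list as a shell-safe command line.
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv, check=True)` on a machine where the `hadoop` command is available.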
  32. C# - Streaming - Mapper
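The C# code shown on this slide is not part of the transcript. As a rough illustration of the streaming mapper contract (read input lines, write one tab-separated key/value pair per output line), a word-count mapper might look like the following Python sketch. It is not the author's code, and the function name `run_mapper` is hypothetical.

```python
import io


def run_mapper(lines, out):
    """Emit "<word>\t1" for every word in every input line."""
    for line in lines:
        for word in line.split():
            out.write(f"{word.lower()}\t1\n")


# Local check with in-memory text instead of real stdin/stdout.
buffer = io.StringIO()
run_mapper(["One elephant\n", "one\n"], buffer)
print(buffer.getvalue(), end="")
```

To use it as an actual streaming program, the script would end with a call such as `run_mapper(sys.stdin, sys.stdout)` after importing `sys`.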
  33. C# - Streaming - Reducer
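The C# code shown on this slide is not part of the transcript. A streaming reducer receives the mapper output sorted by key, one tab-separated pair per line, so it can total each run of identical keys. The Python sketch below illustrates that contract for word counts; it is not the author's code, and the function name `run_reducer` is hypothetical.

```python
import io
from itertools import groupby


def run_reducer(lines, out):
    """Sum the counts for each run of identical keys in sorted input."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent keys, which is why sorted input matters.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")


# Local check with in-memory text instead of real stdin/stdout.
buffer = io.StringIO()
run_reducer(["elephant\t1\n", "elephant\t1\n", "one\t1\n"], buffer)
print(buffer.getvalue(), end="")
```

To use it as an actual streaming program, the script would end with a call such as `run_reducer(sys.stdin, sys.stdout)` after importing `sys`.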
  34. C# - .Net SDK Mapper & Reducer
  35. C# - .Net SDK Driver Class
  36. C# - .Net SDK Driver Class • MRRunner -dll WordFrequency.dll -- input output • MRRunner -dll WordFrequency.dll -class WordFrequency -- input output
  37. C# - .Net SDK Debugging
  38. References • Hadoop at Yahoo!, by Y! Developer Network • MapReduce in Simple Terms, by Saliya Ekanayake • Hadoop on Azure, Getting Started • Hadoop .Net SDK • .Net HDFS File Access • SQL Server Connector for Hadoop
  39. Questions? • Ovidiu Dimulescu • @odimulescu • speakerdeck.com/odimulescu