Hadoop on Azure, Blue elephants

Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.

We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.

Published in: Technology

  1. One elephant went out to play, Azure way
     Orlando Code Camp, 2013
     Ovidiu Dimulescu, @odimulescu, speakerdeck.com/odimulescu
  2. Agenda
     • Overview
     • Installation
     • Azure story
     • .Net Integration
     • MapReduce
     • Q&A
  3. About @odimulescu
     • Working on the Web since 1997
     • Organizer for JaxMUG.com
     • Co-Organizer for Jax Big Data meetup
  4. What is Hadoop?
     Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware.
  5. What does it solve, and how?
     Processing diverse large datasets in practical time at low cost
     • Consolidates data in a distributed file system
     • Moves computation to data rather than data to computation
     • Simplifies the programming model
     [Diagram: a grid of CPUs]
  6. Why does it matter?
     • Volume - Datasets outgrow local HDDs, let alone RAM
     • Velocity - Data grows at a tremendous pace
     • Variety - Data is heterogeneous
     • Value
       - Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
       - Scaling up has a ceiling (physical, technical, etc.)
  7. Why does it matter? Data types
     Complex Data: Images, Video, Logs, Documents, Call records, Sensor data, Mail archives
     Structured Data: User Profiles, CRM, HR Records
     [Chart: Complex vs. Structured, slices labeled 20% and 80%]
     * Chart Source: IDC White Paper
  8. Use cases
     • ETL
     • Pattern Recognition
     • Recommendation Engines
     • Prediction Models
     • Log Processing
     • Data “sandbox”
  9. Who uses it?
  10. Who supports it?
  11. When not to use?
      • Not a database replacement
      • Not a data warehouse; it complements one
      • Not for interactive reporting
      • Not a general purpose storage mechanism
      • Not for problems that are not parallelizable in a shared-nothing fashion *
  12. Architecture – Core Components
      HDFS: Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
      MapReduce: Simpler programming model for processing and generating large data sets.
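
The MapReduce programming model named on slide 12 can be illustrated without a cluster. The sketch below is a single-process Python toy, not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_job`) are invented for illustration, and word count is assumed as the example task.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Run map, shuffle, and reduce in memory (hypothetical helper)."""
    # Shuffle: group every mapped value under its key, which is what the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key independently, visiting keys in sorted order.
    return dict(reducer(key, groups[key]) for key in sorted(groups))


result = run_job(
    ["one elephant went out to play", "one elephant more"],
    map_words,
    reduce_counts,
)
print(result)
```

Because each key is reduced independently of the others, the reduce calls could run on different machines; that independence is what lets a real cluster parallelize the job.
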
  13. Architecture - HDFS
      [Diagram: Namenode (NN) above Datanode 1, Datanode 2, ... Datanode N]
      Read path:
      • Client asks NN for a file
      • NN returns the DNs that have it
      • Client asks a DN for the data
      Namenode - Master
      • Filesystem metadata
      • Files R/W control
      • Blocks replication
      Datanode - Slaves
      • Blocks R/W per clients
      • Replicates blocks per master
      • Notifies master about block-ids
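
The read path on slide 13 (ask the namenode where the blocks are, then fetch them from datanodes) can be modeled in a few lines. This is a toy in-memory model under stated assumptions, not the HDFS client API: every class and function name is hypothetical, the 4-byte block size is chosen only to keep the example small, and `write_file` exists just to populate the model with round-robin replica placement.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks form a file and where replicas live."""

    def __init__(self):
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block_id -> list of DataNode replicas

    def get_block_locations(self, path):
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    """Setup helper: split data into blocks and place replicas round-robin."""
    block_ids = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        block_id = f"{path}#{index}"
        replicas = [datanodes[(index + r) % len(datanodes)]
                    for r in range(replication)]
        for dn in replicas:
            dn.blocks[block_id] = data[offset:offset + block_size]
        namenode.locations[block_id] = replicas
        block_ids.append(block_id)
    namenode.files[path] = block_ids


def read_file(namenode, path):
    """Client read: metadata from the namenode, bytes from the datanodes."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        data += replicas[0].read_block(block_id)  # first replica, no failover
    return data


nn = NameNode()
dns = [DataNode(f"dn{i}") for i in range(3)]
write_file(nn, dns, "/logs/a.txt", b"hello hadoop")
content = read_file(nn, "/logs/a.txt")
print(content)
```

The point of the split is that file bytes never pass through the namenode, so it stays a small metadata service while the datanodes carry the read bandwidth.
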
  14. Architecture - MapReduce
      [Diagram: JobTracker (JT) above TaskTracker 1, TaskTracker 2, ... TaskTracker N; the client starts a job through the API]
      JobTracker - Master
      • Accepts MR jobs submitted by clients
      • Assigns MR tasks to TaskTrackers
      • Monitors tasks and TaskTracker status, re-executes tasks upon failure
      • Speculative execution
      TaskTracker - Slaves
      • Runs MR tasks received from JobTracker
      • Manages storage and transmission of intermediate output
  15. Architecture - Core Hadoop
      [Diagram: JobTracker with TaskTracker 1, TaskTracker 2, ... TaskTracker N (jobs API); NameNode with DataNode 1, DataNode 2, ... DataNode N (HDFS)]
      * Mini OS: Filesystem & Scheduler
  16. Hadoop - Ecosystem
      Management: ZooKeeper, Chukwa, Ambari, HUE
      Data Access: Pig, Hive, Sqoop, Impala, Stinger
      Data Processing: MapReduce, Giraph, Hama, Mahout
      Storage: HDFS, HBase
  17. Hadoop - Ecosystem
      Management: ZooKeeper, Chukwa, Ambari, HUE
      Data Access: Pig, Hive, Sqoop, Impala, Stinger
      Data Processing: MapReduce, Giraph, Hama, Mahout
      Storage: HDFS, HBase
  18. Installation - Platform Notes
      Production
      • Linux – Official
      Development
      • Linux
      • OSX
      • Windows via Cygwin *
      • Other Unixes
  19. Installation
      1. Download & configure single-node cluster: hadoop.apache.org/common/releases.html
      2. Download a demo VM: Cloudera, Hortonworks, MapR, etc.
      3. Download MS HDInsight Server
      4. Cloud: Amazon EMR, Azure HDInsight Service
  20. Hadoop - Azure Story
      Name: Windows Azure HDInsight Service
      Where: Hadoop on Azure dot com
      Status: Public Preview*
      On-premises: Microsoft HDInsight Server
  21. Hadoop - Azure Story
  22. Hadoop - Azure Story
  23. Hadoop - Azure Story
  24. Hadoop - Azure Story
  25. Hadoop - Azure Story
  26. HDFS - .Net access
      Microsoft Distribution of Hadoop: C library for HDFS file access
      Hadoop .Net HDFS File Access: Managed C++ Solution
  27. HDFS - .Net access
  28. Hadoop .Net SDK
      hadoopsdk.codeplex.com
      • MapReduce
      • LINQ to Hive
      • WebHDFS Client
  29. Hadoop Integration
      ODBC Driver
      • Excel PowerPivot
      • Other BI tools
      Connector for Hadoop
      • Import / Export via SQOOP
  30. slideshare.net/esaliya/mapreduce-in-simple-terms by Saliya Ekanayake
  31. MapReduce - Clients
      Java - Native
        hadoop jar jar_path main_class input_path output_path
      C++ - Pipes framework
        hadoop pipes -input path_in -output path_out -program exec_program
      Any – Streaming
        hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
      Pig Latin, Hive HQL, C via JNI
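
The streaming client on slide 31 accepts any executable that reads records on stdin and writes tab-separated key/value lines to stdout. The following is a minimal word-count sketch in Python under that contract. The script name `wordcount.py` and the `map` / `reduce` argument convention are assumptions made for this example, not part of Hadoop; the reducer relies on its input arriving sorted by key, which the framework provides between the two phases.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: "map" or "reduce" as the first argument.
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

A local dry run that mimics the framework's sort step might look like `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. On a cluster, the same script would be passed as `-mapper "python wordcount.py map"` and `-reducer "python wordcount.py reduce"`, and shipped to the nodes with the streaming `-file` option.
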
  32. C# - Streaming - Mapper
  33. C# - Streaming - Reducer
  34. C# - .Net SDK Mapper & Reducer
  35. C# - .Net SDK Driver Class
  36. C# - .Net SDK Driver Class
        MRRunner -dll WordFrequency.dll -- input output
        MRRunner -dll WordFrequency.dll -class WordFrequency -- input output
  37. C# - .Net SDK Debugging
  38. References
      • Hadoop at Yahoo!, by Y! Developer Network
      • MapReduce in Simple Terms, by Saliya Ekanayake
      • Hadoop on Azure, Getting Started
      • Hadoop .Net SDK
      • .Net HDFS File Access
      • SQL Server Connector for Hadoop
  39. Questions?
      Ovidiu Dimulescu, @odimulescu, speakerdeck.com/odimulescu
