Hadoop on Azure, Blue elephants

Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.

We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as running HDInsight on premises or on Azure.

    Presentation Transcript

    • One elephant went out to play, Azure way. Orlando Code Camp, 2013. Ovidiu Dimulescu, @odimulescu, speakerdeck.com/odimulescu
    • Agenda
        • Overview
        • Installation
        • Azure story
        • .Net Integration
        • MapReduce
        • Q&A
    • About @odimulescu
        • Working on the Web since 1997
        • Organizer for JaxMUG.com
        • Co-Organizer for Jax Big Data meetup
    • What is Hadoop? Apache Hadoop is an open source framework for running data-intensive applications on large clusters of commodity hardware.
    • What is it solving, and how? Processing diverse, large datasets in practical time at low cost
        • Consolidates data in a distributed file system
        • Moves computation to data rather than data to computation
        • Simplifies the programming model
    • Why does it matter?
        • Volume - Datasets outgrow local HDDs, let alone RAM
        • Velocity - Data grows at a tremendous pace
        • Variety - Data is heterogeneous
        • Value
            • Scaling up is expensive (licensing, CPUs, disks, fabric, etc.)
            • Scaling up has a ceiling (physical, technical, etc.)
    • Why does it matter? Data types [chart: complex data (images, video, logs, documents, call records, sensor data, mail archives) versus structured data (user profiles, CRM, HR records), with shares of 80% and 20%. Chart source: IDC White Paper]
    • Use cases
        • ETL
        • Pattern Recognition
        • Recommendation Engines
        • Prediction Models
        • Log Processing
        • Data “sandbox”
    • Who uses it?
    • Who supports it?
    • When not to use?
        • Not a database replacement
        • Not a data warehouse; it complements one
        • Not for interactive reporting
        • Not a general-purpose storage mechanism
        • Not for problems that cannot be parallelized in a shared-nothing fashion
    • Architecture - Core Components
        • HDFS: Distributed filesystem designed for low cost storage and high bandwidth access across the cluster.
        • MapReduce: Simpler programming model for processing and generating large data sets.
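
The MapReduce programming model named on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code: the `map_reduce` helper, the log line format, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (illustration only)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values are grouped by key before reducing.
            groups[key].append(value)
    # Reduce: each key and its list of values collapse to one result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def status_mapper(line: str) -> Iterator[Tuple[str, int]]:
    # Assumed log format for this sketch: "<path> <http status>".
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


# Hypothetical sample data.
LOG_LINES = ["/index.html 200", "/missing 404", "/about.html 200"]
counts = map_reduce(LOG_LINES, status_mapper, count_reducer)
print(counts)
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the programmer still writes only the two small functions.
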
    • Architecture - HDFS
        • Read path: the client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that hold it; the client asks a DN for the data
        • NameNode (master)
            • Filesystem metadata
            • File R/W control
            • Block replication
        • DataNodes 1..N (slaves)
            • Read and write blocks for clients
            • Replicate blocks as directed by the master
            • Notify the master about block IDs
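
The read path on this slide (ask the NameNode where the blocks are, then fetch the data from a DataNode) can be mimicked with a toy model. The classes below are hypothetical stand-ins written for this sketch, not Hadoop APIs, and the file, block IDs, and node names are made up.

```python
from typing import Dict, List, Tuple


class DataNode:
    """Toy DataNode: stores block contents keyed by block ID."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}


class NameNode:
    """Toy NameNode: holds only metadata, never file contents."""

    def __init__(self) -> None:
        # path -> ordered list of (block ID, names of DataNodes with a replica)
        self.files: Dict[str, List[Tuple[str, List[str]]]] = {}

    def get_block_locations(self, path: str) -> List[Tuple[str, List[str]]]:
        return self.files[path]


def read_file(namenode: NameNode, live: Dict[str, DataNode], path: str) -> bytes:
    """Client side: get locations from the NameNode, read bytes from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        for name in holders:
            node = live.get(name)
            if node is not None and block_id in node.blocks:
                data += node.blocks[block_id]
                break  # first live replica is enough
        else:
            raise IOError(f"no live replica for {block_id}")
    return data


# Hypothetical two-block file, each block replicated on two DataNodes.
dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
dn1.blocks["blk_1"] = dn2.blocks["blk_1"] = b"blue "
dn2.blocks["blk_2"] = dn3.blocks["blk_2"] = b"elephants"

nn = NameNode()
nn.files["/demo/a.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]

# dn1 is left out of the live set to mimic a failed node.
live_nodes = {"dn2": dn2, "dn3": dn3}
content = read_file(nn, live_nodes, "/demo/a.txt")
print(content)
```

Because the file bytes never pass through the NameNode, the master stays small while read bandwidth scales with the number of DataNodes, and replication lets the read succeed even with `dn1` missing.
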
    • Architecture - MapReduce
        • The client starts a job through the Jobs API on the JobTracker (JT), which distributes tasks to TaskTrackers 1..N
        • JobTracker (master)
            • Accepts MR jobs submitted by clients
            • Assigns MR tasks to TaskTrackers
            • Monitors tasks and TaskTracker status, re-executes tasks upon failure
            • Speculative execution
        • TaskTrackers (slaves)
            • Run MR tasks received from the JobTracker
            • Manage storage and transmission of intermediate output
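
The "assign tasks, re-execute upon failure" behaviour listed for the JobTracker can be illustrated with a small simulation. This is a deliberately simplified sketch: the `TaskTracker` class and `run_job` function are hypothetical, the round-robin assignment is an assumption made for brevity (the real JobTracker schedules from TaskTracker heartbeats), and speculative execution is not modelled.

```python
from typing import Dict, List, Tuple


class TaskTracker:
    """Toy TaskTracker that either runs a task or fails."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def run(self, task: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed while running {task}")
        return task.upper()  # stand-in for real map or reduce work


def run_job(
    tasks: List[str], trackers: List[TaskTracker], max_attempts: int = 3
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Toy JobTracker: assign each task, retry on another tracker on failure."""
    results: Dict[str, str] = {}
    attempts: List[Tuple[str, str]] = []
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            # Round-robin choice; each retry moves on to the next tracker.
            tracker = trackers[(index + attempt) % len(trackers)]
            attempts.append((task, tracker.name))
            try:
                results[task] = tracker.run(task)
                break
            except RuntimeError:
                continue  # re-execute the task elsewhere
        else:
            raise RuntimeError(f"{task} failed after {max_attempts} attempts")
    return results, attempts


# Hypothetical cluster in which tt2 is broken.
cluster = [TaskTracker("tt1"), TaskTracker("tt2", healthy=False), TaskTracker("tt3")]
results, attempts = run_job(["split-a", "split-b", "split-c"], cluster)
print(results)
print(attempts)
```

The attempts log shows `split-b` being tried on `tt2` and then re-run on `tt3`, while the job as a whole still completes.
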
    • Architecture - Core Hadoop [diagram: the Jobs API layer, with the JobTracker and TaskTrackers 1..N, sits on top of the HDFS layer, with the NameNode and DataNodes 1..N]
        * Mini OS: Filesystem & Scheduler
    • Hadoop - Ecosystem
        • Management: ZooKeeper, Chukwa, Ambari, HUE
        • Data Access: Pig, Hive, Sqoop, Impala, Stinger
        • Data Processing: MapReduce, Giraph, Hama, Mahout
        • Storage: HDFS, HBase
    • Installation - Platform Notes
        • Production: Linux (official)
        • Development: Linux, OS X, Windows via Cygwin, other Unixes
    • Installation
        1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html
        2. Download a demo VM: Cloudera, Hortonworks, MapR, etc.
        3. Download MS HDInsight Server
        4. Cloud: Amazon EMR, Azure HDInsight Service
    • Hadoop - Azure Story
        • Name: Windows Azure HDInsight Service
        • Where: Hadoop on Azure dot com
        • Status: Public Preview
        • On-premises: Microsoft HDInsight Server
    • HDFS - .Net access
        • Microsoft Distribution of Hadoop: C library for HDFS file access
        • Hadoop .Net HDFS File Access: managed C++ solution
    • HDFS - .Net access
    • Hadoop .Net SDK (hadoopsdk.codeplex.com)
        • MapReduce
        • LINQ to Hive
        • WebHDFS Client
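
A WebHDFS client ultimately issues plain HTTP requests against the WebHDFS REST API, which makes HDFS reachable from any platform. The sketch below only builds request URLs and is not the .Net SDK's API; the `webhdfs_url` helper, the host name, and the user are hypothetical, and port 50070 is assumed as the NameNode HTTP port, which can differ per cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host, path, and user; 50070 is an assumed port.
open_url = webhdfs_url(
    "namenode.example.com", 50070, "/user/demo/input.txt", "OPEN",
    **{"user.name": "demo"},
)
list_url = webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS")
print(open_url)
print(list_url)
```

An HTTP GET on the `OPEN` URL returns the file contents, after a redirect to a DataNode, and `LISTSTATUS` returns a JSON directory listing, so any HTTP-capable language can read HDFS without Java libraries.
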
    • Hadoop Integration
        • ODBC Driver: Excel, PowerPivot, other BI tools
        • Connector for Hadoop: import / export via Sqoop
    • slideshare.net/esaliya/mapreduce-in-simple-terms by Saliya Ekanayake
    • MapReduce - Clients
        • Java - Native: hadoop jar jar_path main_class input_path output_path
        • C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program
        • Any - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
        • Pig Latin, Hive HQL, C via JNI
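
With Streaming, `map_prog` and `reduce_prog` can be any executables that read lines from stdin and write tab-separated key/value lines to stdout; Hadoop sorts the mapper output by key before it reaches the reducer. Below is a minimal word-count sketch of that contract in Python. It is not the C# code from the talk; the script layout and the `map`/`reduce` command-line switch are assumptions made for the example.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit "word<TAB>1" for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Assumed convention for this sketch: pass "map" or "reduce" as the only argument.
if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    for out_line in stage(sys.stdin):
        print(out_line)

# Local dry run: sorted() stands in for Hadoop's shuffle and sort.
mapped = sorted(map_lines(["Blue elephants", "blue Azure"]))
reduced = list(reduce_lines(mapped))
```

Saved under a hypothetical name such as `wordcount.py`, the same file could serve as both `map_prog` (`python wordcount.py map`) and `reduce_prog` (`python wordcount.py reduce`) in the streaming command above, provided Python is available on the cluster nodes.
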
    • C# - Streaming - Mapper
    • C# - Streaming - Reducer
    • C# - .Net SDK Mapper & Reducer
    • C# - .Net SDK Driver Class
    • C# - .Net SDK Driver Class
        • MRRunner -dll WordFrequency.dll -- input output
        • MRRunner -dll WordFrequency.dll -class WordFrequency -- input output
    • C# - .Net SDK Debugging
    • References
        • Hadoop at Yahoo!, by Y! Developer Network
        • MapReduce in Simple Terms, by Saliya Ekanayake
        • Hadoop on Azure, Getting Started
        • Hadoop .Net SDK
        • .Net HDFS File Access
        • SQL Server Connector for Hadoop
    • Questions? Ovidiu Dimulescu, @odimulescu, speakerdeck.com/odimulescu