2012 apache hadoop_map_reduce_windows_azure

Transcript

  • 1. Apache Hadoop, MapReduce & Windows Azure Guðmundur Jón Halldórsson Five Degrees July 2012
  • 2. Web crawler! "No, this isn't about that"
  • 3. What is Hadoop? A system for processing mind-bogglingly large amounts of data
  • 4. Hadoop: Map-Reduce = Computation, HDFS = Storage
  • 5. HDFS: Hadoop Distributed File System. Yes, it is a file system written in Java, and you can do normal file system operations like [ls, mkdir, ...]. Works best with large files. HDFS splits files into blocks of 128 MB (can be configured).
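To make the block arithmetic on this slide concrete, here is a minimal sketch in JavaScript. It only models how a file size divides into fixed-size blocks; `splitIntoBlocks` is a hypothetical helper written for this illustration, not an HDFS API, and the 128 MB figure is simply the one quoted on the slide.

```javascript
// Block size quoted on the slide; HDFS allows this to be configured.
const BLOCK_SIZE_BYTES = 128 * 1024 * 1024;

// Hypothetical helper (not part of HDFS): list the blocks a file of the
// given size would be cut into. The last block may be shorter.
function splitIntoBlocks(fileSizeBytes, blockSizeBytes = BLOCK_SIZE_BYTES) {
  const blocks = [];
  for (let offset = 0; offset < fileSizeBytes; offset += blockSizeBytes) {
    blocks.push({
      offset: offset,
      length: Math.min(blockSizeBytes, fileSizeBytes - offset),
    });
  }
  return blocks;
}

const oneGiB = 1024 * 1024 * 1024;
console.log(splitIntoBlocks(oneGiB).length);            // 8 full blocks
console.log(splitIntoBlocks(300 * 1024 * 1024).length); // 3 blocks: 128 + 128 + 44 MB
```

A tiny file still occupies one block entry, which is one way to see why the slide says HDFS works best with large files: many small files mean many block records for little data.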
  • 6. HDFS: HDFS will keep 3 copies of each block. The NameNode tracks blocks and DataNodes. [Diagram: a NameNode (NN) surrounded by DataNodes (DN1 ... DN9); the NameNode's table lists the three DataNodes holding each block, e.g. DN1, DN4, DN7; DN3, DN5, DN8; DN3, DN4, DN5]
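The bookkeeping described on this slide can be sketched as a toy model. Everything here is an assumption made for illustration: `ToyNameNode` is a hypothetical class, and the round-robin placement is a stand-in, not the placement policy HDFS actually uses. The only facts taken from the slide are that each block gets 3 copies and that the NameNode records which DataNodes hold them.

```javascript
const REPLICATION = 3; // copies per block, as stated on the slide

// Toy NameNode: remembers which DataNodes hold each block.
// Round-robin placement is an illustrative assumption only.
// Assumes at least REPLICATION DataNodes so the copies are distinct.
class ToyNameNode {
  constructor(dataNodes) {
    this.dataNodes = dataNodes;
    this.blockMap = new Map(); // blockId -> array of DataNode names
    this.next = 0;
  }

  // Pick REPLICATION consecutive DataNodes and record them for the block.
  addBlock(blockId) {
    const replicas = [];
    for (let i = 0; i < REPLICATION; i++) {
      replicas.push(this.dataNodes[(this.next + i) % this.dataNodes.length]);
    }
    this.next = (this.next + 1) % this.dataNodes.length;
    this.blockMap.set(blockId, replicas);
    return replicas;
  }

  // DataNodes that still hold the block if the given node fails.
  survivingReplicas(blockId, failedNode) {
    return this.blockMap.get(blockId).filter((dn) => dn !== failedNode);
  }
}

const nn = new ToyNameNode(["DN1", "DN2", "DN3", "DN4", "DN5"]);
console.log(nn.addBlock("blk_0")); // [ 'DN1', 'DN2', 'DN3' ]
console.log(nn.addBlock("blk_1")); // [ 'DN2', 'DN3', 'DN4' ]
console.log(nn.survivingReplicas("blk_0", "DN2")); // [ 'DN1', 'DN3' ]
```

The point of the sketch is the lookup table: losing one DataNode leaves two copies of every block it held, and the NameNode is the component that knows where those copies are.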
  • 7. Map-Reduce
    • Write a mapper that takes a key and value and emits zero or more new keys and values
    • Write a reducer that takes all the values of one key and emits zero or more new keys and values
  • 8. Map-Reduce JS example

        // Mapper: split each input line into words and emit (word, 1).
        var map = function (key, value, context) {
            var words = value.split(/[^a-zA-Z]/);
            for (var i = 0; i < words.length; i++) {
                // Skip the empty strings produced by consecutive separators.
                if (words[i] !== "") {
                    context.write(words[i].toLowerCase(), 1);
                }
            }
        };

        // Reducer: sum the counts emitted for one word.
        var reduce = function (key, values, context) {
            var sum = 0;
            while (values.hasNext()) {
                sum += parseInt(values.next(), 10);
            }
            context.write(key, sum);
        };
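To try the word-count mapper and reducer above without a cluster, one option is a small in-memory driver. The sketch below repeats the two functions so that it is self-contained (with the empty-string check written using plain quotes). `runLocally`, the fake `context` objects, and the `hasNext`/`next` iterator are hypothetical stand-ins for what the Hadoop JavaScript runtime would supply; they are not Hadoop APIs.

```javascript
// Mapper and reducer as on the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local driver: map every line, group values by key
// (the "shuffle"), then reduce each key in sorted order.
function runLocally(lines) {
  var grouped = new Map(); // key -> array of emitted values
  var mapContext = {
    write: function (k, v) {
      if (!grouped.has(k)) grouped.set(k, []);
      grouped.get(k).push(v);
    },
  };
  lines.forEach(function (line, i) { map(i, line, mapContext); });

  var result = new Map();
  var reduceContext = { write: function (k, v) { result.set(k, v); } };
  Array.from(grouped.keys()).sort().forEach(function (k) {
    var vals = grouped.get(k);
    var pos = 0;
    // Minimal iterator with the hasNext/next shape the reducer expects.
    var iterator = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; },
    };
    reduce(k, iterator, reduceContext);
  });
  return Object.fromEntries(result);
}

console.log(runLocally(["Hello Hadoop", "hello, world"]));
// { hadoop: 1, hello: 2, world: 1 }
```

The grouping step in the middle is the part a real cluster performs between the map and reduce phases; here it is just a `Map` from each word to the list of 1s emitted for it.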
  • 9. MapReduce
  • 10. Data Systems and Their Timeframes
  • 11. Does Hadoop solve all my data problems, or is there something else out there?
  • 12. Related projects
    • Pig: high-level MapReduce language
    • Hive: SQL-like high-level MapReduce language
    • HBase: realtime processing (based on Google BigTable)
    • Accumulo: BigTable-style store developed at the NSA (similar to HBase)
    • Avro: data serialization
    • ZooKeeper: low-level coordination
    • HCatalog: storage management and interoperability between all systems
    • Oozie: job scheduler
    • Flume: log and data aggregation
    • Whirr: automated cloud clusters on EC2, Rackspace, etc.
    • Sqoop: relational data importer
    • MRUnit: unit testing of MapReduce jobs
    • Mahout: machine learning libraries
    • Bigtop: interoperability
    • Crunch: MapReduce pipelines in Java and Scala
    • Giraph: graph processing on huge distributed graphs