Hadoop ecosystem
Transcript

  • 1. Moscow, November 16th, 2011. The Hadoop Ecosystem. Kai Voigt, Cloudera Inc.
  • 2. Cloudera
                              Hadoop                                          Linux
    Licence                   Apache                                          GPL and others
    Distribution Vendor       Cloudera                                        Red Hat
    Free Distribution         Cloudera's Distribution Including Hadoop (CDH)  Fedora Core
    Commercial Distribution   Cloudera Enterprise                             Red Hat Enterprise Linux (RHEL)
    ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. Hadoop Core
    • HDFS
    • MapReduce
  • 4. HDFS
    • Hadoop Distributed File System
    • Redundancy
    • Fault Tolerant
    • Scalable
    • Self Healing
    • Write Once, Read Many Times
    • Java API
    • Command Line Tool
  • 5. MapReduce
    • Two Phases of Functional Programming
    • Redundancy
    • Fault Tolerant
    • Scalable
    • Self Healing
    • Java API
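The two phases named on the MapReduce slide can be illustrated without a cluster. The following is a minimal single-process Python sketch, not the Hadoop Java API: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names chosen for this illustration, and word counting is an assumed example job.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word in an input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to spread those calls across many machines and to rerun any that fail.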
  • 6. Hadoop Core
    Diagram: HDFS, MapReduce; interfaces labelled Java, Java, Java, Java
  • 7. HDFS-FUSE
    Diagram: /mnt/hdfs/ -> HDFS-FUSE -> HDFS
  • 8. HDFS-FUSE Examples
    $ mount
    ...
    fuse on /mnt/hdfs type fuse (rw,nosuid,nodev,user_id=0,group_id=0,default_permissions,allow_other)
    $ cp /boot/vmlinuz-* /mnt/hdfs/user/cloudera/
    $ hadoop fs -ls vmlinuz-*
    -rw-r--r-- 3 cloudera supergroup 2107004 2011-11-08 16:14 /user/cloudera/vmlinuz-2.6.18-274.7.1.el5
  • 9. Sqoop
    Diagram: RDBMS <-> Sqoop <-> HDFS
  • 10. Sqoop
    • Import & Export
    • ODBC, JDBC Data Sources
    • CSV Files in HDFS
  • 11. Sqoop Examples
    $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
    ...
    $ hadoop fs -cat City/part-m-00000
    1,Kabul,AFG,Kabol,1780000
    2,Qandahar,AFG,Qandahar,237500
    3,Herat,AFG,Herat,186800
    4,Mazar-e-Sharif,AFG,Balkh,127800
    5,Amsterdam,NLD,Noord-Holland,731200
    ...
  • 12. Hive
    Diagram: SQL -> Hive -> MapReduce
  • 13. Hive
    • Data Warehouse System for Hadoop
    • Data Aggregation
    • Ad-Hoc Queries
    • SQL-like Language (HiveQL)
    • Developed at Facebook
  • 14. Hive Examples
    CREATE TABLE newmovie (id INT, name STRING, year INT, numratings INT, avgrating FLOAT);
    INSERT OVERWRITE TABLE newmovie
    SELECT id, name, year, COUNT(1), AVG(rating)
    FROM movie JOIN movierating
    ON movie.id = movierating.movieid
    GROUP BY id, name, year;
  • 15. Pig
    Diagram: Pig Script -> Pig -> MapReduce
  • 16. Pig
    • Data Warehouse System for Hadoop
    • Data Aggregation
    • Ad-Hoc Queries
    • High-Level Scripting Language (Pig Latin)
    • Developed at Yahoo
  • 17. Pig Examples
    movierating = LOAD 'movierating' AS (userid, movieid, rating:INT);
    groupmr = GROUP movierating BY movieid;
    ratings = FOREACH groupmr GENERATE group AS movieid, COUNT(movierating.rating) AS numratings, AVG(movierating.rating) AS avgrating;
    movie = LOAD 'movie' AS (id, name, year);
    mr = JOIN movie BY id, ratings BY movieid;
    result = FOREACH mr GENERATE id, name, year, numratings, avgrating;
    STORE result INTO 'ratedmovie';
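The HiveQL and Pig Latin examples appear to compute the same result: the number of ratings and the average rating per movie, joined with the movie's name and year. A minimal plain-Python sketch of that logic follows, assuming small in-memory lists in place of the `movie` and `movierating` tables. The sample rows and the function name `rated_movies` are invented for illustration; the slides do not show the actual data.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the two input tables.
movie = [(1, "Alien", 1979), (2, "Brazil", 1985)]    # (id, name, year)
movierating = [(10, 1, 5), (11, 1, 4), (10, 2, 3)]   # (userid, movieid, rating)

def rated_movies(movie, movierating):
    """Return (id, name, year, numratings, avgrating) per rated movie."""
    # Equivalent of: GROUP movierating BY movieid
    by_movie = defaultdict(list)
    for _userid, movieid, rating in movierating:
        by_movie[movieid].append(rating)
    # Equivalent of: JOIN movie BY id, ratings BY movieid
    result = []
    for movie_id, name, year in movie:
        ratings = by_movie.get(movie_id)
        if ratings:  # inner join: movies without ratings are dropped
            average = sum(ratings) / len(ratings)
            result.append((movie_id, name, year, len(ratings), average))
    return result

ratedmovie = rated_movies(movie, movierating)
```

Hive and Pig translate the same grouping and join into MapReduce jobs, so the logic scales beyond what fits in one process.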
  • 18. The Story So Far
    Diagram: components RDBMS, Hive, Pig, Sqoop, MapReduce, HDFS, FUSE FS; interfaces labelled SQL, SQL, Script, Posix, Java, Java
  • 19. HBase
    • Low Latency
    • Random Reads And Writes
    • Distributed Key/Value Store
    • Simple API
      • PUT
      • GET
      • DELETE
      • SCAN
  • 20. HBase Data Model
    Key = (RowID, Column name, Timestamp)
    RowID             Column name  Timestamp  Value
    com.apple.www     Size         yesterday  1234
    com.apple.www     Content      yesterday  <html>...
    com.cloudera.www  Size         yesterday  2345
    com.cloudera.www  Content      yesterday  <html>...
    com.cloudera.www  Size         today      3456
    com.cloudera.www  Content      today      <html>...
    com.facebook.www  Size         yesterday  4567
    com.facebook.www  Content      yesterday  <html>...
    com.yahoo.www     Size         today      5678
    com.yahoo.www     Content      today      <html>...
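The data model on this slide can be read as a sorted map from (row ID, column name, timestamp) to a value. A minimal Python sketch of that idea follows. It is not the HBase API: the class `VersionedStore` and its methods are hypothetical, and integer timestamps stand in for the slide's "yesterday" and "today".

```python
import bisect

class VersionedStore:
    """Toy sorted key/value store keyed by (row, column, timestamp)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, timestamp)
        self._values = {}  # key tuple -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value for (row, column), or None if absent."""
        # Index just past the last possible key for this (row, column).
        i = bisect.bisect_right(self._keys, (row, column, float("inf")))
        if i and self._keys[i - 1][:2] == (row, column):
            return self._values[self._keys[i - 1]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (key, value) for rows in [start_row, stop_row), in key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            yield self._keys[i], self._values[self._keys[i]]
            i += 1

store = VersionedStore()
store.put("com.cloudera.www", "Size", 1, 2345)  # "yesterday"
store.put("com.cloudera.www", "Size", 2, 3456)  # "today"
store.put("com.apple.www", "Size", 1, 1234)
latest = store.get("com.cloudera.www", "Size")
```

Keeping keys sorted is what makes both the point lookup and the range scan cheap; the reversed domain names in the slide's row IDs place pages of one site next to each other in that order.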
  • 21. HBase Flow
    Diagram: GET/PUT/DELETE, MEMORY, HDFS, Logfile
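The flow diagram names three places a write touches: a log file, memory, and HDFS. A simplified Python sketch of such a write path follows, assuming the usual log-then-memory-then-flush design. The class `ToyRegion`, the flush threshold, and the in-memory lists standing in for the log file and HDFS files are all hypothetical, DELETE is omitted, and real HBase is considerably more involved.

```python
class ToyRegion:
    """Toy write path: append to a log, buffer in memory, flush to 'HDFS'."""

    def __init__(self, flush_threshold=2):
        self.log = []         # stands in for the log file
        self.memory = {}      # recent writes, served with low latency
        self.hdfs_files = []  # each flush adds one immutable file (a dict)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.append((key, value))  # 1. record the write for recovery
        self.memory[key] = value       # 2. apply it in memory
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memory buffer out as a new immutable file."""
        self.hdfs_files.append(dict(self.memory))
        self.memory.clear()
        self.log.clear()  # flushed writes no longer need the log

    def get(self, key):
        """Check memory first, then files from newest to oldest."""
        if key in self.memory:
            return self.memory[key]
        for stored in reversed(self.hdfs_files):
            if key in stored:
                return stored[key]
        return None

region = ToyRegion(flush_threshold=2)
region.put("row1", "val1")
region.put("row2", "val3")  # reaches the threshold and triggers a flush
region.put("row1", "val2")  # newer value stays in memory for now
```

Files are only ever written whole, which fits the write-once, read-many model of HDFS, while the memory buffer is what provides random writes on top of it.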
  • 22. HBase Examples
    hbase> create 'mytable', 'mycf'
    hbase> list
    hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
    hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
    hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
    hbase> scan 'mytable'
    hbase> disable 'mytable'
    hbase> drop 'mytable'
  • 23. Flume
    • Many Servers with many Log Files
      • Webserver
      • Mailserver
      • Syslog
    • Store all Logs in One Place
      • Manageable
      • Extensible
      • Reliable
  • 24. Flume Architecture
    Diagram: Log -> Flume Node, Log -> Flume Node, ... -> HDFS
  • 25. Flume Sources and Sinks
    • Local Files
    • HDFS
    • Stdin, Stdout
    • Twitter
    • IRC
    • IMAP
  • 26. Whirr
    • Automatic Cluster Setup in the Cloud
      • Amazon
      • Rackspace
  • 27. Whirr Example
    $ cat hadoop.properties
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,7 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    $ bin/whirr launch-cluster --config hadoop.properties
    $ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
    $ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
    $ bin/whirr destroy-cluster --config hadoop.properties
  • 28. Oozie Concept
    • crond for Hadoop
    • Job Flow Control
      • Branching
      • Serial
      • Loops
    • Triggered
      • Time
      • Data
    Diagram: Job 1, Job 3, Job 2, Job 4, Job 5
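Serial and branching job flow amounts to running jobs in dependency order. A minimal Python sketch using the standard library's `graphlib` (Python 3.9 or later) follows. It is not Oozie's workflow format, and the dependencies between the five jobs are assumed for illustration, since the edges of the slide's diagram are not recoverable from the transcript. `run_workflow` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs it waits for.
workflow = {
    "Job 1": set(),
    "Job 2": {"Job 1"},
    "Job 3": {"Job 1"},           # Job 2 and Job 3 branch off after Job 1
    "Job 4": {"Job 2", "Job 3"},  # join: waits for both branches
    "Job 5": {"Job 4"},
}

def run_workflow(workflow, run_job):
    """Run every job after all of its prerequisites; return the order used."""
    order = []
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        for job in sorted(sorter.get_ready()):  # sorted only for stable output
            run_job(job)
            order.append(job)
            sorter.done(job)
    return order

order = run_workflow(workflow, run_job=lambda job: None)
```

Jobs returned together by `get_ready()` have no dependency on each other, so a real scheduler could run them in parallel; this sketch runs them one after another for simplicity.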
  • 29. Oozie Features
    • Component Independent
      • MapReduce
      • Hive
      • Pig
      • Sqoop
      • Streaming
  • 30. Mahout
    • Machine Learning Library for Hadoop
      • Regression
      • Classification
      • Recommendations
      • Pattern Mining
  • 31. Mahout Use Cases
    • Yahoo: Spam Detection
    • Foursquare: Recommendations
    • SpeedDate.com: Recommendations
    • Adobe: User Targeting
    • Amazon: Personalization Platform
  • 32. CDH4u2
    • Cloudera's Distribution Including Hadoop
    • http://www.cloudera.com/download/
    • Linux Packages
      • Red Hat
      • Debian
      • Tar Archive
    • Virtual Machines
    • Cloud Installation with Whirr
  • 33. CDH Components
    • Hadoop
    • Hive
    • Pig
    • HBase
    • Zookeeper
    • Flume
    • Sqoop
    • Whirr
    • Hue
    • Oozie
    • FUSE-DFS
    • Mahout
  • 34. Thank you!
    • Kai Voigt
    • [email_address]
    • LinkedIn
    • http://www.cloudera.com/