Hadoop ecosystem
Hadoop ecosystem Presentation Transcript

  • The Hadoop Ecosystem
    • Kai Voigt, Cloudera Inc.
    • Moscow, November 16th, 2011
  • Cloudera
    • Licence: Hadoop is Apache; Linux is GPL and others
    • Distribution Vendor: Cloudera for Hadoop; Red Hat for Linux
    • Free Distribution: Cloudera's Distribution Including Hadoop (CDH); Fedora Core
    • Commercial Distribution: Cloudera Enterprise; Red Hat Enterprise Linux (RHEL)
    ©2011 Cloudera, Inc. All Rights Reserved.
  • Hadoop Core
    • HDFS
    • MapReduce
  • HDFS
    • Hadoop Distributed File System
    • Redundancy
    • Fault Tolerant
    • Scalable
    • Self Healing
    • Write Once, Read Many Times
    • Java API
    • Command Line Tool
  • MapReduce
    • Two Phases of Functional Programming
    • Redundancy
    • Fault Tolerant
    • Scalable
    • Self Healing
    • Java API
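The two phases named on the MapReduce slide can be illustrated with a minimal, single-process sketch. This is plain Python, not the Hadoop Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and word counting is chosen only as a familiar example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, the step that sits between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases could in principle run on many machines at once, which is the property the Scalable bullet relies on.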
  • Hadoop Core
    • Diagram: HDFS and MapReduce, each accessed through Java
  • HDFS-FUSE
    • Diagram: /mnt/hdfs/ → HDFS-FUSE → HDFS
  • HDFS-FUSE Examples
      $ mount
      ...
      fuse on /mnt/hdfs type fuse (rw,nosuid,nodev,user_id=0,group_id=0,default_permissions,allow_other)
      $ cp /boot/vmlinuz-* /mnt/hdfs/user/cloudera/
      $ hadoop fs -ls vmlinuz-*
      -rw-r--r-- 3 cloudera supergroup 2107004 2011-11-08 16:14 /user/cloudera/vmlinuz-2.6.18-274.7.1.el5
  • Sqoop
    • Diagram: RDBMS ↔ Sqoop ↔ HDFS
  • Sqoop
    • Import & Export
    • ODBC, JDBC Data Sources
    • CSV Files in HDFS
  • Sqoop Examples
      $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
      ...
      $ hadoop fs -cat City/part-m-00000
      1,Kabul,AFG,Kabol,1780000
      2,Qandahar,AFG,Qandahar,237500
      3,Herat,AFG,Herat,186800
      4,Mazar-e-Sharif,AFG,Balkh,127800
      5,Amsterdam,NLD,Noord-Holland,731200
      ...
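Since Sqoop writes the imported table as CSV files in HDFS, downstream code only has to split records on commas. The following sketch parses the five records from the example output using Python's standard csv module. The column names in FIELDS are assumptions, because the slide shows values only, and read_cities is a hypothetical helper.

```python
import csv
import io

# The first records of City/part-m-00000 from the example output.
SAMPLE = """1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
"""

# Assumed column names; the slide does not list them.
FIELDS = ["id", "name", "country_code", "district", "population"]

def read_cities(text):
    """Parse header-less CSV text into dictionaries with typed numeric fields."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    rows = []
    for row in reader:
        row["id"] = int(row["id"])
        row["population"] = int(row["population"])
        rows.append(row)
    return rows

cities = read_cities(SAMPLE)
largest = max(cities, key=lambda c: c["population"])
print(largest["name"], largest["population"])
```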
  • Hive
    • Diagram: SQL → Hive → MapReduce
  • Hive
    • Data Warehouse System for Hadoop
    • Data Aggregation
    • Ad-Hoc Queries
    • SQL-like Language (HiveQL)
    • Developed at Facebook
  • Hive Examples
      CREATE TABLE newmovie (id INT, name STRING, year INT, numratings INT, avgrating FLOAT);
      INSERT OVERWRITE TABLE newmovie
      SELECT id, name, year, COUNT(1), AVG(rating)
      FROM movie JOIN movierating
      ON movie.id = movierating.movieid
      GROUP BY id, name, year;
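To make clear what the query computes, here is a rough in-memory equivalent in plain Python: an inner join of movie and movierating on the movie id, followed by a count and an average per movie. The sample rows are hypothetical, since the slides do not show the table contents, and rate_movies is an illustrative name rather than anything Hive provides.

```python
from collections import defaultdict

# Hypothetical sample rows; the slides do not show the contents of either table.
movie = [
    {"id": 1, "name": "Movie A", "year": 1999},
    {"id": 2, "name": "Movie B", "year": 2005},
]
movierating = [
    {"userid": 10, "movieid": 1, "rating": 4},
    {"userid": 11, "movieid": 1, "rating": 5},
    {"userid": 10, "movieid": 2, "rating": 3},
]

def rate_movies(movies, ratings):
    """Inner join on movie.id = movierating.movieid, then group and aggregate."""
    by_id = {m["id"]: m for m in movies}
    grouped = defaultdict(list)
    for r in ratings:
        m = by_id.get(r["movieid"])
        if m is not None:  # inner join: skip ratings without a matching movie
            grouped[(m["id"], m["name"], m["year"])].append(r["rating"])
    return [
        {"id": i, "name": n, "year": y,
         "numratings": len(vals), "avgrating": sum(vals) / len(vals)}
        for (i, n, y), vals in sorted(grouped.items())
    ]

newmovie = rate_movies(movie, movierating)
for row in newmovie:
    print(row)
```

Hive runs the same logic as MapReduce jobs over files in HDFS, so the query scales to tables that would never fit in one process.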
  • Pig
    • Diagram: Script → Pig → MapReduce
  • Pig
    • Data Warehouse System for Hadoop
    • Data Aggregation
    • Ad-Hoc Queries
    • High-Level Scripting Language (Pig Latin)
    • Developed at Yahoo
  • Pig Examples
      movierating = LOAD 'movierating' AS (userid, movieid, rating:INT);
      groupmr = GROUP movierating BY movieid;
      ratings = FOREACH groupmr GENERATE group AS movieid,
                COUNT(movierating.rating) AS numratings,
                AVG(movierating.rating) AS avgrating;
      movie = LOAD 'movie' AS (id, name, year);
      mr = JOIN movie BY id, ratings BY movieid;
      result = FOREACH mr GENERATE id, name, year, numratings, avgrating;
      STORE result INTO 'ratedmovie';
  • The Story So Far
    • RDBMS: SQL
    • Hive: SQL
    • Pig: Script
    • Sqoop: between RDBMS and HDFS
    • FUSE FS: Posix
    • MapReduce: Java
    • HDFS: Java
  • HBase
    • Low Latency
    • Random Reads And Writes
    • Distributed Key/Value Store
    • Simple API
      • PUT
      • GET
      • DELETE
      • SCAN
  • HBase Data Model
      Key = RowID + Column name + Timestamp

      RowID              Column name   Timestamp   Value
      com.apple.www      Size          yesterday   1234
      com.apple.www      Content       yesterday   <html>...
      com.cloudera.www   Size          yesterday   2345
      com.cloudera.www   Content       yesterday   <html>...
      com.cloudera.www   Size          today       3456
      com.cloudera.www   Content       today       <html>...
      com.facebook.www   Size          yesterday   4567
      com.facebook.www   Content       yesterday   <html>...
      com.yahoo.www      Size          today       5678
      com.yahoo.www      Content       today       <html>...
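The data model above, where each value is addressed by row id, column name and timestamp, can be sketched as a small in-memory table. This is a toy illustration in plain Python, not the HBase client API: TinyTable and its methods are hypothetical, integer timestamps stand in for "yesterday" and "today", and details such as HBase's own version ordering are not modelled.

```python
class TinyTable:
    """In-memory sketch: cells keyed by (row id, column name, timestamp)."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the value with the highest timestamp, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def delete(self, row, column):
        """Remove every version of one cell."""
        for key in [k for k in self.cells if k[0] == row and k[1] == column]:
            del self.cells[key]

    def scan(self, start_row="", stop_row=None):
        """Yield cells in sorted key order for start_row <= row < stop_row."""
        for key in sorted(self.cells):
            if key[0] >= start_row and (stop_row is None or key[0] < stop_row):
                yield key, self.cells[key]


YESTERDAY, TODAY = 1, 2  # assumed stand-ins for the slide's timestamps
table = TinyTable()
table.put("com.cloudera.www", "Size", YESTERDAY, 2345)
table.put("com.cloudera.www", "Size", TODAY, 3456)
table.put("com.apple.www", "Size", YESTERDAY, 1234)
table.put("com.yahoo.www", "Size", TODAY, 5678)

print(table.get("com.cloudera.www", "Size"))
print([key[0] for key, _ in table.scan("com.b", "com.d")])
```

Because keys are kept in sorted order, the reversed host names in the row ids place pages of the same domain next to each other, which is what makes a range scan useful.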
  • HBase Flow
    • Diagram labels: GET/PUT/DELETE, Memory, HDFS, Logfile
  • HBase Examples
      hbase> create 'mytable', 'mycf'
      hbase> list
      hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
      hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
      hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
      hbase> scan 'mytable'
      hbase> disable 'mytable'
      hbase> drop 'mytable'
  • Flume
    • Many Servers with Many Log Files
      • Webserver
      • Mailserver
      • Syslog
    • Store all Logs in One Place
      • Manageable
      • Extensible
      • Reliable
  • Flume Architecture
    • Diagram: Log → Flume Node, Log → Flume Node, ... → HDFS
  • Flume Sources and Sinks
    • Local Files
    • HDFS
    • Stdin, Stdout
    • Twitter
    • IRC
    • IMAP
  • Whirr
    • Automatic Cluster Setup in the Cloud
      • Amazon
      • Rackspace
  • Whirr Example
      $ cat hadoop.properties
      whirr.cluster-name=myhadoopcluster
      whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,7 hadoop-datanode+hadoop-tasktracker
      whirr.provider=aws-ec2
      whirr.identity=${env:AWS_ACCESS_KEY_ID}
      whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
      whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
      whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
      $ bin/whirr launch-cluster --config hadoop.properties
      $ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
      $ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
      $ bin/whirr destroy-cluster --config hadoop.properties
  • Oozie Concept
    • crond for Hadoop
    • Job Flow Control
      • Branching
      • Serial
      • Loops
    • Triggered
      • Time
      • Data
    • Diagram: a job flow connecting Job 1 through Job 5
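The serial and branching parts of job flow control amount to running jobs in dependency order. The sketch below uses Python's standard graphlib module to compute such an order; it is not Oozie's workflow format, and it does not cover loops or time and data triggers. The dependencies between Job 1 through Job 5 are hypothetical, because the transcript lists the jobs without their connections.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each job maps to the set of jobs it waits for.
flow = {
    "Job 1": set(),
    "Job 2": {"Job 1"},
    "Job 3": {"Job 1"},
    "Job 4": {"Job 2", "Job 3"},
    "Job 5": {"Job 4"},
}

def run_order(dependencies):
    """Return batches of jobs; jobs within one batch could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the flow contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so successors unblock
    return batches

print(run_order(flow))
```

With these assumed dependencies, Job 2 and Job 3 form a branch after Job 1 and join again before Job 4.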
  • Oozie Features
    • Component Independent
      • MapReduce
      • Hive
      • Pig
      • Sqoop
      • Streaming
  • Mahout
    • Machine Learning Library for Hadoop
      • Regression
      • Classification
      • Recommendations
      • Pattern Mining
  • Mahout Use Cases
    • Yahoo: Spam Detection
    • Foursquare: Recommendations
    • SpeedDate.com: Recommendations
    • Adobe: User Targeting
    • Amazon: Personalization Platform
  • CDH3u2
    • Cloudera's Distribution Including Hadoop
    • http://www.cloudera.com/download/
    • Linux Packages
      • Red Hat
      • Debian
      • Tar Archive
    • Virtual Machines
    • Cloud Installation with Whirr
  • CDH Components
    • Hadoop, Hive, Pig, HBase, ZooKeeper, Flume, Sqoop, Whirr, Hue, Oozie, FUSE-DFS, Mahout
  • Thank you!
    • Kai Voigt
    • [email_address]
    • LinkedIn
    • http://www.cloudera.com/