EclipseCon Keynote: Apache Hadoop - An Introduction


Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.





Presentation Transcript

  • Apache Hadoop: An Introduction. Todd Lipcon, todd@cloudera.com, @tlipcon, @cloudera. March 24, 2011
  • Introductions: Software Engineer at Cloudera. Apache Hadoop, HBase, Thrift committer. Previously: systems programming, operations, large-scale data analysis. I love data and data systems.
  • Outline: Why should you care? (Intro); What is Hadoop?; How does it work?; The Hadoop Ecosystem; Use Cases; Experiences as a developer
  • Data is the difference. What's data?
  • Photo by C.C. Chapman (CC BY-NC-ND) http://www.flickr.com/photos/cc_chapman/3342268874/
  • "Every two days we create as much information as we did from the dawn of civilization up until 2003." Eric Schmidt
  • "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." Hal Varian (Google's chief economist)
  • Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semi-structured textual data (e.g., e-mail), … Are you throwing it away because it doesn't "fit"?
  • So, what's Hadoop? (The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  • Apache Hadoop is an open-source system to reliably store and process GOBS of data across many commodity computers.
  • Two Core Components. Store: HDFS, self-healing, high-bandwidth clustered storage. Process: MapReduce, fault-tolerant distributed processing.
  • What makes Hadoop special?
  • Falsehood #1: Machines can be reliable… Image: MadMan the Mighty CC BY-NC-SA
  • Hadoop separates distributed-system fault-tolerance code from application logic. (Diagram labels: Unicorns; Systems Programmers; Statisticians)
  • Falsehood #2: Machines deserve identities... Image: Laughing Squid CC BY-NC-SA
  • Hadoop lets you interact with a cluster, not a bunch of machines. Image: Yahoo! Hadoop cluster [OSCON '07]
  • Falsehood #3: Your analysis fits on one machine… Image: Matthew J. Stinson CC-BY-NC
  • Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL queries on >100TB of clickstream data. Hadoop works for both applications!
  • Hadoop sounds like magic. (Coincidentally, today is Houdini's birthday, though he was not a Hadoop committer.) How is it possible?
  • A Typical Look... 5-4000 commodity servers (8-core, 24GB RAM, 4-12 TB, gig-E); 2-level network architecture, 20-40 nodes per rack
  • Cluster nodes. Master nodes (1 each): NameNode (metadata server and database), JobTracker (scheduler). Slave nodes (1-4000 each): DataNodes (block storage), TaskTrackers (task execution).
  • HDFS API:
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(new Path("/foo/bar"));
    OutputStream os = fs.create(new Path("/baz"));
    fs.delete(…), fs.listStatus(…)
  • HDFS Data Storage. A 158MB file /logs/weblog.txt is split into blocks: 64MB blk_29232, 64MB blk_19231, and 30MB blk_329432, stored across DataNodes DN 1-DN 4; the NameNode records where each block lives.
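The block layout on the storage slide follows from simple arithmetic: a file is cut into fixed-size blocks, with a smaller final block for the remainder. A minimal plain-Java sketch of that splitting rule (not the Hadoop API; the 64MB block size and 158MB file are the slide's numbers):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    // Split a file of the given size into HDFS-style fixed-size blocks;
    // the last block holds whatever remains.
    static List<Long> blockSizes(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long off = 0; off < fileSize; off += blockSize) {
            blocks.add(Math.min(blockSize, fileSize - off));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The slide's example: a 158MB file with 64MB blocks -> 64MB + 64MB + 30MB.
        System.out.println(blockSizes(158 * mb, 64 * mb));
    }
}
```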
  • HDFS Write Path
  • HDFS has split the file into 64MB blocks and stored it on the DataNodes. Now, we want to process that data.
  • The MapReduce Programming Model
  • You specify map() and reduce() functions. The framework does the rest.
  • map(). map: K₁,V₁ → list(K₂,V₂)
    Input Key: byte offset 193284
    Input Value: " - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326"
    Output Key: userimage
    Output Value: 2326 bytes
    The map function runs on the same node where the data was stored!
  • Input Format. Wait! HDFS is not a Key-Value store! InputFormat interprets bytes as a Key and Value. Key: log offset 193284; Value: " - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326"
  • The Shuffle. Each map output is assigned to a "reducer" based on its key; map output is grouped and sorted by key.
  • reduce(). K₂, iter(V₂) → list(K₃,V₃)
    Key: userimage
    Values: 2326 bytes (from map task 0001); 1000 bytes (from map task 0008); 3020 bytes (from map task 0120)
    Reducer function output — Key: userimage, Value: 6346 bytes
    TextOutputFormat: userimage \t 6346
  • Putting it together...
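Putting the preceding slides together, the map/shuffle/reduce flow can be sketched in plain Java as a toy in-memory simulation (not the Hadoop API; the log-line format and the "userimage" key are modeled on the slide's example, and the parsing indices are assumptions about that format):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // map(): parse one access-log line into (first path segment, response bytes).
    static Map.Entry<String, Long> map(String line) {
        String[] fields = line.split(" ");
        String path = fields[5];                      // e.g. /userimage/123
        String key = path.split("/")[1];              // e.g. userimage
        long bytes = Long.parseLong(fields[fields.length - 1]);
        return new SimpleEntry<>(key, bytes);
    }

    // shuffle + reduce(): group map outputs by key, then sum each group.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> groups = new TreeMap<>();   // the "shuffle": group and sort by key
        for (String line : lines) {
            Map.Entry<String, Long> kv = map(line);
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Long> out = new TreeMap<>();            // the "reduce": one sum per key
        for (Map.Entry<String, List<Long>> e : groups.entrySet()) {
            out.put(e.getKey(), e.getValue().stream().mapToLong(Long::longValue).sum());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "- frank [10/Oct/2000:13:55:36 -0700] \"GET /userimage/123 HTTP/1.0\" 200 2326",
            "- frank [10/Oct/2000:13:57:01 -0700] \"GET /userimage/456 HTTP/1.0\" 200 1000");
        System.out.println(run(logs));   // {userimage=3326}
    }
}
```

In real Hadoop the framework, not your code, performs the grouping and moves map() to the node holding the data; only the two functions are yours to write.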
  • Hadoop is not just MapReduce (NoNoSQL!). The Hive project adds SQL support to Hadoop: HiveQL (SQL dialect) compiles to a query plan, and the query plan executes as MapReduce jobs.
  • Hive Example:
    CREATE TABLE movie_rating_data (
      userid INT, movieid INT, rating INT, unixtime STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
    LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;
    CREATE TABLE average_ratings AS
    SELECT movieid, AVG(rating) FROM movie_rating_data
    GROUP BY movieid;
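What that query plan computes can be sketched in plain Java: group ratings by movieid and average each group (a hypothetical in-memory stand-in for the MapReduce jobs Hive generates; the Rating record mirrors the table's columns, and the sample rows are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class AverageRatings {
    // One (userid, movieid, rating) tuple from movie_rating_data.
    record Rating(int userid, int movieid, int rating) {}

    // SELECT movieid, AVG(rating) ... GROUP BY movieid, done in memory.
    static Map<Integer, Double> averageByMovie(List<Rating> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            Rating::movieid,
            TreeMap::new,
            Collectors.averagingInt(Rating::rating)));
    }

    public static void main(String[] args) {
        List<Rating> rows = List.of(
            new Rating(1, 10, 4), new Rating(2, 10, 2), new Rating(1, 20, 5));
        System.out.println(averageByMovie(rows));   // {10=3.0, 20=5.0}
    }
}
```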
  • The Hadoop Ecosystem (diagram; includes a column DB)
  • Hadoop in the Wild (yes, it's used in production). Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09). Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster. Twitter: >1TB per day, ~120 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  • Use Cases
  • Product Recommendations. Naïve approach: users who bought toothpaste bought toothbrushes. Hadoop approach: What products did a user browse, hover over, rate, add to cart (but not buy), etc. in the last 2 months? What are the attributes of the user? What are our margins, promotions, inventory, etc.?
  • Production Recommendations. A lot of data! Activity: ~20GB/day × ~60 days = 1.2TB; User Data: 2GB; Purchase Data: ~5GB. Pre-aggregating loses fidelity for individual users.
  • Hadoop and Java (the good). Integration, integration, integration! Tooling: IDEs, JCarder, AspectJ, Maven/Ivy. Developer accessibility.
  • Hadoop and Java (the bad). Java is great for applications; Hadoop is systems programming. JNI is our hammer: compression, security, FS access. C++ wrapper for setuid task execution.
  • Hadoop and Java (the ugly). JVM bugs! Garbage Collection pauses on 50GB heaps. WORA is a giant lie for systems – worst of both worlds?
  • Ok, fine, what next? Get Hadoop! CDH - Cloudera's Distribution including Apache Hadoop: http://cloudera.com/ and http://hadoop.apache.org/. Try it out! (Locally, VM, or EC2.) Watch free training videos on http://cloudera.com/
  • Thanks! todd@cloudera.com, @tlipcon (feedback? yes!) (hiring? yes!)