The Hadoop Ecosystem
  • Sources: “Top 5 Reasons Not to Use Hadoop for Analytics”, “The Dark Side of Hadoop”, “Hadoop Don’ts: What not to do to harvest Hadoop’s full potential”
  • Get started with Hadoop
  • http://pig.apache.org/docs/r0.9.2/index.html, Apache Hadoop, Cascading
  • http://pig.apache.org/docs/r0.9.2/index.html
  • Flume Users Guide, Thrift Paper
  • Missing components: Cascading

The Hadoop Ecosystem Presentation Transcript

  • 1. The Hadoop Ecosystem (J Singh, DataThinks.org, March 12, 2012)
  • 2. The Hadoop Ecosystem
    – Introduction: What Hadoop is, and what it’s not; Origins and History; Hello Hadoop
    – The Hadoop Bestiary
    – The Hadoop Providers
    – Hosted Hadoop Frameworks
    © J Singh, 2011
  • 3. What Hadoop is, and what it’s not
    – A framework for Map Reduce
    – A top-level Apache project
    – Hadoop is:
      • A framework, not a “solution” (think Linux or J2EE)
      • Scalable
      • Great for pipelining massive amounts of data to achieve the end result
      • Sometimes the only option
    – Hadoop is not:
      • A painless replacement for SQL
      • Uniformly fast or efficient
      • Great for ad hoc analysis
  • 4. You are ready for Hadoop when…
    – You no longer get enthused by the prospect of more data:
      • The rate of data accumulation is increasing
      • The idea of moving data from hither to yon is positively scary
      • A hit man threatens to delete your data in the middle of the night, and you want to pay him to do it
    – Seriously, you are ready for Hadoop when analysis is the bottleneck:
      • Could be because of data size
      • Could be because of the complexity of the data
      • Could be because of the level of analysis required
      • Could be because the analysis requirements are fluid
  • 5. MapReduce Conceptual Underpinnings
    – Based on the Functional Programming model
      • From Lisp: (map square '(1 2 3 4)) → (1 4 9 16); (reduce plus '(1 4 9 16)) → 30
      • From APL: +/N, where N ← 1 2 3 4
    – Easy to distribute (based on each element of the vector)
    – New for Map/Reduce: nice failure/retry semantics, since hundreds to thousands of low-end servers are running at the same time
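    The Lisp primitives above translate directly to Python; a plain-Python illustration of the functional model, not a Hadoop API:

    ```python
    from functools import reduce

    # (map square '(1 2 3 4)) -> (1 4 9 16)
    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

    # (reduce plus '(1 4 9 16)) -> 30
    total = reduce(lambda a, b: a + b, squares)

    print(squares)  # [1, 4, 9, 16]
    print(total)    # 30
    ```

    The map step is trivially parallel (each element is independent), which is exactly the property MapReduce exploits to distribute work.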
  • 6. MapReduce Flow: Word Count Example
    – Lines: “foo bar”, “quux foo”, “foo labs”, “quux”
    – Map output: foo 1, bar 1, quux 1, foo 1, foo 1, labs 1, quux 1
    – Result: foo 3, quux 2, bar 1, labs 1
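    The same flow can be simulated in a few lines of Python; a toy simulation of the map, shuffle, and reduce phases, not Hadoop code:

    ```python
    from collections import defaultdict

    lines = ["foo bar", "quux foo", "foo labs", "quux"]

    # Map phase: emit (word, 1) for every word in every line
    map_out = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for word, count in map_out:
        groups[word].append(count)

    # Reduce phase: sum the counts for each word
    result = {word: sum(counts) for word, counts in groups.items()}
    print(result)  # {'foo': 3, 'bar': 1, 'quux': 2, 'labs': 1}
    ```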
  • 7. Hello Hadoop
    – Word Count: an example with unstructured data
      • Load 5 books from Gutenberg.org into /tmp/gutenberg
      • Load them into HDFS
      • Run Hadoop; results are put into HDFS
      • Copy results into the file system
      • What could be simpler?
    – DIY instructions for Amazon EC2 are available on the DataThinks.org blog
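    With Hadoop Streaming, a word-count job like this can be written as a small mapper and reducer that exchange tab-separated lines on stdin/stdout. A sketch of the pair, shown as plain functions plus a local dry run so it can be exercised without a cluster:

    ```python
    from itertools import groupby

    def mapper(lines):
        """Map phase: emit one tab-separated (word, 1) pair per word,
        the line format Hadoop Streaming expects on stdout."""
        for line in lines:
            for word in line.split():
                yield f"{word}\t1"

    def reducer(sorted_pairs):
        """Reduce phase: sum counts per word. Input must arrive sorted
        by key, which Hadoop's shuffle/sort phase guarantees."""
        keyed = (pair.split("\t") for pair in sorted_pairs)
        for word, group in groupby(keyed, key=lambda kv: kv[0]):
            yield f"{word}\t{sum(int(v) for _, v in group)}"

    # Local dry run, equivalent to: cat books.txt | mapper | sort | reducer
    pairs = sorted(mapper(["foo bar", "quux foo", "foo labs", "quux"]))
    for line in reducer(pairs):
        print(line)  # bar 1, foo 3, labs 1, quux 2 (tab-separated)
    ```

    On a real cluster the two functions would live in separate scripts passed to the streaming jar; the `sorted()` call stands in for Hadoop's shuffle.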
  • 8. The Hadoop Ecosystem
    – Introduction
    – The Hadoop Bestiary
      • Core: Hadoop Map Reduce and the Hadoop Distributed File System
      • Data Access: HBase, Pig, Hive
      • Algorithms: Mahout
      • Data Import: Flume, Sqoop and Nutch
    – The Hadoop Providers
    – Hosted Hadoop Frameworks
  • 9. The Core: Hadoop and HDFS
    – Hadoop (Map Reduce)
      • One master, n slaves
      • The master schedules mappers & reducers, connects pipeline stages, and handles failure semantics
    – Hadoop Distributed File System
      • Robust data storage across machines, insulating against failure
      • Keeps n copies of each file (the number of copies is configurable)
      • Distributes copies across racks and locations
  • 10. Hadoop Bestiary (p1a): HBase, Pig
    – Database primitives: HBase
      • A wide-column data structure built on HDFS
    – Processing: Pig
      • A high(-ish) level data-flow language and execution framework for parallel computation
      • Accesses HDFS and HBase
      • Batch as well as interactive
      • Integrates UDFs written in Java, Python, JavaScript
      • Compiles to map & reduce functions, though not 100% efficiently
  • 11. In Pig (Latin)
      Users = load 'users' as (name, age);
      Filtered = filter Users by age >= 18 and age <= 25;
      Pages = load 'pages' as (user, url);
      Joined = join Filtered by name, Pages by user;
      Grouped = group Joined by url;
      Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
      Sorted = order Summed by clicks desc;
      Top5 = limit Sorted 5;
      store Top5 into 'top5sites';
    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
  • 12. Pig Translation into Map Reduce
    – Job 1: Load Users, Load Pages, Filter by age, Join on name (Users = load …; Fltrd = filter …; Pages = load …; Joined = join …)
    – Job 2: Group on url, Count clicks (Grouped = group …; Summed = … count() …)
    – Job 3: Order by clicks, Take top 5 (Sorted = order …; Top5 = limit …)
    Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
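    The three-job plan above can be traced with an in-memory Python sketch; the user and page tuples here are invented for illustration:

    ```python
    from collections import Counter

    users = [("alice", 20), ("bob", 30), ("carol", 19)]
    pages = [("alice", "a.com"), ("carol", "b.com"),
             ("alice", "a.com"), ("bob", "c.com")]

    # Job 1: filter Users by age, then join with Pages on name
    filtered = {name for name, age in users if 18 <= age <= 25}
    joined = [(user, url) for user, url in pages if user in filtered]

    # Job 2: group on url and count clicks
    summed = Counter(url for _, url in joined)

    # Job 3: order by clicks descending and take the top 5
    top5 = summed.most_common(5)
    print(top5)  # [('a.com', 2), ('b.com', 1)]
    ```

    Pig generates the same shape of plan, except that each step runs as a distributed map or reduce phase over HDFS data rather than over in-memory lists.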
  • 13. Hadoop Bestiary (p1b): HBase, Hive
    – Database primitives: HBase
      • A wide-column data structure built on HDFS
    – Processing: Hive
      • Data warehouse infrastructure
      • QL, a subset of SQL that supports primitives supportable by Map Reduce
      • Support for custom mappers and reducers for more sophisticated analysis
      • Compiles to map & reduce functions, though not 100% efficiently
    – Hive example:
      CREATE TABLE page_view(viewTime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
      :: ::
      STORED AS SEQUENCEFILE;
  • 14. Hadoop Bestiary (p2): Mahout
    – Algorithms: Mahout
      • Scalable machine learning and data mining
      • Runs on top of Hadoop
      • Written in Java
      • In active development; algorithms being added
    – Examples
      • Clustering algorithms: Canopy Clustering, K-Means Clustering, …
      • Recommenders / Collaborative Filtering algorithms
      • Other: Regression algorithms, Neural Networks, Hidden Markov Models
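    For a sense of what Mahout scales out, here is K-Means (Lloyd's algorithm) on one-dimensional data; a toy sketch of the algorithm itself, not Mahout's API:

    ```python
    def kmeans_1d(points, centers, iterations=10):
        """Lloyd's algorithm on scalars: assign each point to its
        nearest center, then move each center to its cluster's mean."""
        for _ in range(iterations):
            clusters = {c: [] for c in centers}
            for p in points:
                nearest = min(centers, key=lambda c: abs(c - p))
                clusters[nearest].append(p)
            # Drop empty clusters; recompute each center as the mean
            centers = [sum(ps) / len(ps) for c, ps in clusters.items() if ps]
        return sorted(centers)

    print(kmeans_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0]))
    # [1.5, 9.5]
    ```

    Mahout's contribution is running the assign-and-average loop as MapReduce jobs, so the point set can be far larger than one machine's memory.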
  • 15. Hadoop Bestiary (p3): Data Import
    – Sqoop: structured data
      • Import from an RDBMS to HDFS (export too)
    – Flume: streams
      • Import streams such as text files and system logs
    – Nutch
      • Import from the Web
      • Note: Nutch + Hadoop = Lucene
  • 16. Hadoop Bestiary (p4): Complete Picture (diagram slide)
  • 17. The Hadoop Ecosystem
    – Introduction
    – The Hadoop Bestiary
    – The Hadoop Providers
      • Apache
      • Cloudera
      • Options when your data lives in a database
    – Hosted Hadoop Frameworks
  • 18. Apache Distribution
    – The definitive repository: the hub for code, documentation, and tutorials
    – Many contributors, for example:
      • Pig was a Yahoo! contribution
      • Hive came from Facebook
      • Sqoop came from Cloudera
    – Bare-metal install option: download to your machine(s) from Apache, install and operate, and modify to fit your business better
  • 19. Cloudera
    – Cloudera : Hadoop :: Red Hat : Linux
    – Cloudera’s Distribution Including Apache Hadoop (CDH)
      • A packaged set of Hadoop modules that work together
      • Now at CDH3
      • Largest contributor of code to Apache Hadoop
    – $76M in venture funding so far
  • 20. When the data lives in a Database…
    – Objective: keep Analytics and Data as close as possible
    – Options for an RDBMS:
      • Sqoop data to/from HDFS: you need to move the data, but you can utilize all parts of Hadoop
      • In-database analytics: available for Teradata, Greenplum, etc., if you have the need (and the $$$)
    – Options for NoSQL databases:
      • Sqoop-like connectors: you need to move the data
      • Built-in Map Reduce, available for most NoSQL databases: knows about and is tuned to the storage mechanism, but typically offers only map and reduce (no Pig, Hive, …)
  • 21. The Hadoop Ecosystem
    – Introduction
    – The Hadoop Bestiary
    – The Hadoop Providers
    – Hadoop Platforms as a Service
      • Amazon Elastic MapReduce
      • Hadoop in Windows Azure
      • Google App Engine
      • Other: Infochimps, IBM SmartCloud
  • 22. Amazon Elastic Map Reduce (EMR)
    – Hosted Map Reduce
      • CLI on your laptop: control over the size of the cluster; automatic spin-up/spin-down of instances
      • Map & Reduce programs on S3: Pig, Hive, or custom code in Java, Ruby, Python, Perl, PHP, R, C++, or Cascading
      • Data in/out on S3 or on DynamoDB
    – Keep in mind: Hadoop on EC2 is also an option
  • 23. Hadoop in Windows Azure
    – Basic level
      • Hive add-in for Excel
      • Hive ODBC driver
    – Hadoop-based distribution for Windows Server and Azure
      • Strategic partnership with Hortonworks
      • Windows-based CLI on your laptop
    – Broadest level
      • JavaScript framework for Hadoop
      • Hadoop connectors for SQL Server and Parallel Data Warehouse
  • 24. Google App Engine MapReduce
    – Map Reduce as a service
      • Distinct from Google’s internal Map Reduce
      • Part of Google App Engine
    – Works with the Google Datastore, a wide-column store
    – A “purely programmatic” environment: write Map and Reduce functions in Python or Java
  • 25. Map Reduce Use at Google (chart slide)
  • 26. Takeaways
    – There are many flavors of Hadoop; the important part is Functional Programming and Map Reduce.
    – Don’t let the proliferation of choices stump you. Experiment with it!
  • 27. Thank you
    – J Singh, President, Early Stage IT: technology services and strategy for startups
    – DataThinks.org is a new service of Early Stage IT: “Big Data” analytics solutions