Apache Hadoop ecosystem - March 2013

Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem

  • 1. hadoopsphere.com Components that constitute the open source Apache Hadoop ecosystem - Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products. Contributed by: Sachin Ghai | @sachinghai
  • 2. ‘Atmospheric’ Layers
      • Application Domains: Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
      • Discovery & Visualization: Lucene, Blur, Giraph
      • Analytics & Intelligence: Mahout, Drill
      • Data Interactions: Pig, Hive, HCatalog, Tez, Gora
      • Hardware (& Appliances): Commodity H/w
    Distribution: Apache
    ‘Core’ Layers
      • Secure: Knox
      • Manage: Oozie, ZooKeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, Bigtop, Whirr
      • Run: MapReduce, YARN, Hama
      • Persist: HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
      • Transfer: Flume, Sqoop, Chukwa, Kafka
  • 3. CORE LAYERS which constitute the Apache Hadoop ecosystem
  • 4. PERSIST: File System & Data Store
      • HDFS - Distributed file system that provides high-throughput access to application data. Comprises NameNode, Secondary NameNode and DataNodes
      • HBase - Distributed, scalable, big data store
      • Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
      • Accumulo - Sorted, distributed key/value data storage and retrieval system
  • 5. PERSIST: Serialization
      • Avro - Data serialization system
      • Trevni - Column file format, specified so that compatible, independent implementations can read and/or write files in this format
      • Thrift - Framework for scalable cross-language services development
  • 6. RUN: Job Execution
      • MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
      • YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
      • Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
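The MapReduce programming model named on this slide can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and a real cluster would distribute each phase across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done by the framework between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce", "YARN runs Hadoop"]))
```

Because each call to `map_phase` sees one line and each call to `reduce_phase` sees one key, both phases can be run in parallel without coordination, which is the property the framework exploits.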
  • 7. MANAGE: Work
      • Oozie - Workflow/coordination system to manage Hadoop jobs
      • ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • 8. MANAGE: Dev
      • Crunch - Framework for writing, testing, and running MapReduce pipelines
      • MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
      • HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform
  • 9. MANAGE: Ops
      • Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
      • Vaidya - Performance diagnostic tool for MapReduce jobs
      • Bigtop - Project for developing packaging and tests, and for ensuring interoperability among Apache Hadoop related projects
      • Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2
  • 10. SECURE
      • Knox - System that provides a single point of secure access for Apache Hadoop clusters
  • 11. TRANSFER
      • Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
      • Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
      • Chukwa - Open source data collection system for monitoring large distributed systems
      • Kafka - Distributed publish-subscribe messaging system
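The publish-subscribe pattern mentioned for Kafka can be sketched in a few lines. This is an in-memory illustration of the pattern only, under the assumption of a single process; the `Broker` class and its methods are hypothetical and are not Kafka's client API, which additionally provides distribution, persistence and partitioning.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """In-memory stand-in for a message broker; not distributed or durable."""

    def __init__(self) -> None:
        # Maps each topic name to the handlers subscribed to it.
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler to be called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver a message to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("logs", lambda message: print("received:", message))
    broker.publish("logs", "node-1 started")
```

The publisher only names a topic and never references its consumers, so producers and consumers can be added or removed independently.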
  • 12. ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem
  • 13. HARDWARE
      • Commodity Hardware - Low-cost, easily available hardware working in parallel
    Note: no appliances are known to run on a pure Apache Hadoop distribution; SSD and other cheap hardware options are not mentioned separately here
  • 14. DATA INTERACTIONS
      • Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
      • Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop compatible file systems
  • 15. DATA INTERACTIONS
      • HCatalog - Table and storage management service for data created using Apache Hadoop
      • Tez - Generic application framework which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN
      • Gora - Framework for in-memory data model and persistence with MapReduce support
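A task DAG of the kind Tez is described as processing can be pictured as a set of tasks with dependencies that must run in a valid order. The sketch below is not Tez's API; it uses Python's standard `graphlib` module (Python 3.9 or later), and the function name `execution_order` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order for tasks, given each task's prerequisites.

    `dependencies` maps a task name to the set of tasks that must finish first.
    Raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical pipeline: two scans feed a join, which feeds an aggregation.
    tasks = {
        "join": {"scan_a", "scan_b"},
        "aggregate": {"join"},
    }
    print(execution_order(tasks))
```

Tasks with no path between them, such as the two scans here, have no required relative order and could run concurrently.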
  • 16. ANALYTICS & INTELLIGENCE
      • Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
      • Drill - Distributed system for interactive analysis of large-scale datasets. Comprises a user interface (CLI, REST), a pluggable query language and pluggable data sources
  • 17. DISCOVERY & VISUALIZATION
      • Lucene - Open-source search software, including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
      • Blur - Search engine capable of querying massive amounts of structured data at high speed in a cloud computing environment
  • 18. DISCOVERY & VISUALIZATION
      • Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted, unweighted and multigraphs
    Note: no pure visualization projects are currently part of ASF
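The bulk synchronous parallel model that Giraph is said to follow proceeds in supersteps: each vertex reads the messages sent to it in the previous superstep, updates its state, and sends new messages, with a barrier between supersteps. The sketch below simulates this in one process for a simple task, spreading the maximum vertex value through a graph. It is not Giraph's API; the function name `bsp_max_value` is hypothetical, and every vertex named in `edges` is assumed to have an entry in `values`.

```python
from typing import Dict, List


def bsp_max_value(edges: Dict[str, List[str]], values: Dict[str, int]) -> Dict[str, int]:
    """Propagate the maximum vertex value along edges, one superstep at a time."""
    current = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox: Dict[str, List[int]] = {vertex: [] for vertex in current}
    for vertex, neighbours in edges.items():
        for neighbour in neighbours:
            inbox[neighbour].append(current[vertex])

    # Later supersteps run until no vertex has incoming messages.
    while any(inbox.values()):
        # Barrier: messages sent in the previous superstep are delivered now.
        outbox: Dict[str, List[int]] = {vertex: [] for vertex in current}
        for vertex, messages in inbox.items():
            if messages and max(messages) > current[vertex]:
                # The vertex learned a larger value, so it updates and notifies neighbours.
                current[vertex] = max(messages)
                for neighbour in edges.get(vertex, []):
                    outbox[neighbour].append(current[vertex])
        inbox = outbox
    return current


if __name__ == "__main__":
    # Undirected path a - b - c, written as directed edges in both directions.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max_value(graph, {"a": 3, "b": 6, "c": 2}))
```

A vertex that receives no larger value stays silent, so the computation stops on its own once the maximum has reached every vertex it can reach.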
  • 19. APPLICATION DOMAINS
      • Distribution - Includes applications in travel, transport, FMCG and supply chain, e.g. Expedia
      • Financial - Includes applications in finance, banking and insurance, e.g. Visa
      • Government - Includes applications in government and the public sector, e.g. Aadhaar (India ID card)
      • Heavy Industry - Includes applications in heavy industrial business including electronics, auto and aircraft, e.g. Hitachi
  • 20. APPLICATION DOMAINS
      • Internet - Includes new age internet applications including social media and content distribution, e.g. Facebook
      • Oil & Energy - Includes applications in the upstream/downstream oil and gas business, along with those in the energy sector, e.g. Chevron
      • Research - Includes applications in new research, e.g. network analysis & security
      • Telecom - Includes applications in the telecom business, e.g. Korea Telecom
  • 21. Reference:
      • www.apache.org
      • http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop
    Image courtesy:
      • Slide 1: Getty Images #84480368 Dorling Kindersley (free thumbnail copy)
      • Other images: Original source could not be established
  • 22. About the document:
      • Voluntarily contributed by: Sachin Ghai (@sachinghai)
      • Publisher: hadoopsphere.com
      • Version: 1.0
      • Date: 11 March 2013
      • Copyright: 2013, All Rights Reserved
      • Note: The document does not use official lingo in part
      • Contact: Use ‘Contact’ menu option on www.hadoopsphere.com
      • Disclaimer: The project names mentioned in this document are either registered trademarks or trademarks of the Apache Software Foundation in the United States. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided in this document.