Apache Hadoop ecosystem - March 2013

Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem

Published in: Technology

Transcript of "Apache Hadoop ecosystem - March 2013"

  1. hadoopsphere.com: Components that constitute the open source Apache Hadoop ecosystem
     Summary and categorization of components available as Apache (ASF) projects/sub-projects and serving the Hadoop ecosystem. The document does not include other open source or commercial projects/products.
     Contributed by: Sachin Ghai | @sachinghai
  2. 'Atmospheric' Layers
     • Application Domains: Distribution, Financial, Government, Heavy Industry, Internet, Oil & Energy, Research, Telecom
     • Discovery & Visualization: Lucene, Blur, Giraph
     • Analytics & Intelligence: Mahout, Drill
     • Data Interactions: Pig, Hive, HCatalog, Tez, Gora
     • Hardware (& Appliances): Commodity H/w
     Distribution: Apache
     'Core' Layers
     • Secure: Knox
     • Manage: Oozie, ZooKeeper, Crunch, MRUnit, HDT, Ambari, Vaidya, Bigtop, Whirr
     • Run: MapReduce, YARN, Hama
     • Persist: HDFS, HBase, Cassandra, Accumulo, Avro, Trevni, Thrift
     • Transfer: Flume, Sqoop, Chukwa, Kafka
  3. CORE LAYERS which constitute the Apache Hadoop ecosystem
  4. PERSIST: File System & Data Store
     • HDFS - Distributed file system that provides high-throughput access. Comprises NameNode, Secondary NameNode and DataNodes
     • HBase - Distributed, scalable, big data store
     • Cassandra - Highly scalable, eventually consistent, distributed, structured key-value store
     • Accumulo - Sorted, distributed key/value data storage and retrieval system
  5. PERSIST: Serialization
     • Avro - Data serialization system
     • Trevni - A column file format to permit compatible, independent implementations that read and/or write files in this format
     • Thrift - Framework for scalable cross-language services development
  6. RUN: Job Execution
     • MapReduce - Framework for performing distributed data processing. Comprises JobTracker, TaskTracker and JobHistoryServer
     • YARN - Framework that facilitates writing arbitrary distributed processing frameworks and applications
     • Hama - Pure BSP (Bulk Synchronous Parallel) computing framework for massive scientific computations such as matrix, graph and network algorithms
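The MapReduce entry above names the framework without showing the programming model. A minimal single-process sketch of the map, shuffle and reduce phases follows, using word count as the job. This is an illustration in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and a real cluster would spread each phase across many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run the three phases in sequence on in-memory data (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["big data", "Big cluster"]
    print(run_job(lines, word_count_mapper, sum_reducer))
```

Only the mapper and the reducer are job-specific; the grouping step in between is generic, and in a real deployment that is the part the framework handles across machines.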
  7. MANAGE: Work
     • Oozie - Workflow/coordination system to manage Hadoop jobs
     • ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  8. MANAGE: Dev
     • Crunch - Framework for writing, testing, and running MapReduce pipelines
     • MRUnit - Java library that helps developers unit test Apache Hadoop MapReduce jobs
     • HDT - Hadoop Development Tools (HDT) comprise Eclipse-based tools for developing applications on the Hadoop platform
  9. MANAGE: Ops
     • Ambari - Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
     • Vaidya - Performance diagnostic tool for MapReduce jobs
     • Bigtop - Project for developing packaging and tests, and for ensuring interoperability among Apache Hadoop related projects
     • Whirr - Set of libraries for running cloud services, such as Hadoop clusters on EC2
  10. SECURE
     • Knox - System that provides a single point of secure access for Apache Hadoop clusters
  11. TRANSFER
     • Flume - Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
     • Sqoop - Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
     • Chukwa - Open source data collection system for monitoring large distributed systems
     • Kafka - Distributed publish-subscribe messaging system
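The Kafka entry above mentions publish-subscribe messaging. As a rough illustration of that pattern, here is an in-memory sketch in which publishers append to a per-topic log and each consumer tracks its own read position. The class `TinyBroker` and its methods are hypothetical names invented for this example. It is not Kafka's client API and it ignores distribution, partitioning and persistence entirely.

```python
from collections import defaultdict


class TinyBroker:
    """Single-process stand-in for a publish-subscribe broker (illustrative only)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> ordered list of messages
        self._offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its position."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, topic, consumer, max_messages=10):
        """Return unread messages for this consumer and advance its position."""
        start = self._offsets[(topic, consumer)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(topic, consumer)] = start + len(batch)
        return batch


if __name__ == "__main__":
    broker = TinyBroker()
    broker.publish("logs", "node1 started")
    broker.publish("logs", "node2 started")
    print(broker.poll("logs", consumer="indexer"))
    print(broker.poll("logs", consumer="archiver"))
```

Because each consumer keeps its own position, several subscribers can read the same topic independently, which is the property that distinguishes publish-subscribe from a queue shared among workers.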
  12. ATMOSPHERIC LAYERS which build up the capabilities beyond the core of the Apache Hadoop ecosystem
  13. HARDWARE
     • Commodity Hardware - Low-cost, easily available hardware working in parallel
     Note: no appliances known to run on a pure Apache Hadoop distribution; SSD and other cheap hardware options not mentioned separately here
  14. DATA INTERACTIONS
     • Pig - Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
     • Hive - Data warehouse system that facilitates easy data summarization, ad-hoc queries and analysis of large datasets stored in Hadoop-compatible file systems
  15. DATA INTERACTIONS
     • HCatalog - Table and storage management service for data created using Apache Hadoop
     • Tez - Generic application framework that processes complex DAGs of data-processing tasks and runs natively on Apache Hadoop YARN
     • Gora - Framework for in-memory data model and persistence with MapReduce support
  16. ANALYTICS & INTELLIGENCE
     • Mahout - Scalable machine learning and data mining algorithm library. Supports recommendation mining, clustering, classification and frequent itemset mining
     • Drill - Distributed system for interactive analysis of large-scale datasets. Comprises a user interface (CLI, REST), a pluggable query language and pluggable data sources
  17. DISCOVERY & VISUALIZATION
     • Lucene - Open-source search software, including the Java-based indexing and search component Lucene Core and the high-performance search server component Solr
     • Blur - Search engine capable of querying massive amounts of structured data at high speed in a cloud computing environment
  18. DISCOVERY & VISUALIZATION
     • Giraph - Graph-processing framework leveraging existing Hadoop infrastructure. Follows the bulk synchronous parallel model to run large-scale algorithms. Supports directed, undirected, weighted, unweighted and multigraphs
     Note: no pure visualization projects are currently part of ASF
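The Giraph entry above (like the Hama entry earlier) refers to the bulk synchronous parallel model. In this model the computation proceeds in supersteps: every vertex reads the messages sent to it in the previous superstep, updates its state, sends new messages, and then all vertices wait at a barrier. The sketch below propagates the largest value through a graph in that style. It is a single-process illustration with a hypothetical function name, `bsp_max_value`, not the Giraph or Hama API, and it assumes every neighbour also appears as a key in `graph`.

```python
def bsp_max_value(graph, values):
    """Spread the maximum vertex value through the graph, superstep by superstep.

    graph  maps each vertex to a list of neighbours it can send messages to.
    values maps each vertex to its starting integer value.
    """
    values = dict(values)  # work on a copy so the caller's dict is untouched

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            # A vertex only reacts (and sends again) when it learns a larger value.
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
        # Barrier: messages sent now become visible only in the next superstep.
        inbox = outbox

    return values


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(bsp_max_value(chain, {"a": 3, "b": 1, "c": 2}))
```

Swapping `outbox` in only at the end of each pass is what models the barrier: no vertex sees a message in the same superstep in which it was sent.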
  19. APPLICATION DOMAINS
     • Distribution - Includes applications in travel, transport, FMCG and supply chain, e.g. Expedia
     • Financial - Includes applications in finance, banking and insurance, e.g. Visa
     • Government - Includes applications in government and public sector, e.g. Aadhaar (India ID card)
     • Heavy Industry - Includes applications in heavy industrial business including electronics, auto and aircraft, e.g. Hitachi
  20. APPLICATION DOMAINS
     • Internet - Includes new age internet applications including social media and content distribution, e.g. Facebook
     • Oil & Energy - Includes applications in upstream/downstream oil and gas business along with those in the energy sector, e.g. Chevron
     • Research - Includes applications in new research, e.g. network analysis & security
     • Telecom - Includes applications in telecom business, e.g. Korea Telecom
  21. Reference:
     • www.apache.org
     • http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop
     Image courtesy:
     • Slide 1: Getty Images #84480368 Dorling Kindersley (free thumbnail copy)
     • Other images: Original source could not be established
  22. About the document:
     • Voluntarily contributed by: Sachin Ghai (@sachinghai)
     • Publisher: hadoopsphere.com
     • Version: 1.0
     • Date: 11 March 2013
     • Copyright: 2013, All Rights Reserved
     • Note: The document does not use official lingo in part
     • Contact: Use 'Contact' menu option on www.hadoopsphere.com
     • Disclaimer: The project names mentioned in this document are either registered trademarks or trademarks of the Apache Software Foundation in the United States. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided in this document.