A tour of the zoo – Hadoop ecosystem


Syntel Big Data Architect Prafulla Wani gives an overview of the Hadoop ecosystem.

Published in: Technology, Business

  1. A Tour of the Zoo – the Hadoop Ecosystem
     Prafulla Wani, Technical Architect - Big Data, Syntel
  2. Agenda
     • Welcome to the Zoo!
     • Evolution Timeline
     • Traditional BI/DW Architecture
     • Where Hadoop Fits In
     Confidential ©2012 Syntel, Inc.
  3. Welcome to the Zoo!
     ZooKeeper, Pig, Hadoop, Jaql, Hama, Shark, Giraph
     I am sure you won’t find a Shark in any other zoo.
  4. What is Hadoop?
     • Hadoop is an open-source project overseen by the Apache Software Foundation
     • Hadoop is an ecosystem, not a single product
     • Originally based on papers published by Google in 2003 and 2004
     • Some of the projects in the ecosystem were inspired by whitepapers published by Google
     Google calls it → Hadoop equivalent:
     • GFS → HDFS
     • MapReduce → Hadoop MapReduce
     • Sawzall → Hive, Pig
     • BigTable → HBase
     • Chubby → ZooKeeper
     • Pregel → Giraph
  5. Evolution Timeline
     • Started by Doug Cutting at Yahoo! in early 2006, and named after his kid’s toy elephant
     • Hadoop committers work at several different organizations, including Facebook, Yahoo!, LinkedIn, Twitter, Cloudera and Hortonworks
     [Timeline figure spanning 2006 to 2011; surviving project labels: Jaql, Giraph]
  6. Traditional Data Strategy - BI/DW Architecture
     [Architecture diagram; labels: ERP, CRM, Database, Files, ETL Process, Data Warehouse, Data Marts, Analytics, Ad Hoc Reporting, OLAP, Analysis/BI]
     Typical products by layer:
     • ETL tools. Commercial: Informatica, Oracle Data Integrator, IBM DataStage, Microsoft SSIS. Open source: Talend
     • DW / marts. Commercial: Teradata, Oracle, DB2, Netezza, SQL Server. Open source: MySQL
     • BI. Commercial: MicroStrategy, OBIEE, Cognos, Microsoft SSRS. Open source: Pentaho, Jaspersoft
     • Analytics. Commercial: SAS, TIBCO Spotfire, SPSS. Open source: R, RapidMiner
  7. How Hadoop Fits In
     [Architecture diagram; labels: ERP, CRM, Database, Files, ETL Process, Data Warehouse, Data Marts, Analytics, Ad Hoc Reporting, OLAP, Analysis/BI]
     Hadoop can complement the existing DW environment as well as replace some of the components in a traditional data architecture.
  8. Data Storage
     Hadoop Distributed File System (HDFS)
     • It’s a file system, not a DBMS
     • Allows storage of both structured and unstructured data
     • Provides distributed, redundant storage for massive amounts of data on cheap, unreliable computers
     • The Hadoop 2.0 release (still beta) added some important features: HDFS Federation and High Availability
     HBase
     • Distributed, versioned, column-oriented store on top of HDFS
     • Provides an option of “low-latency” (OLTP) reads/writes along with support for the batch-processing model of MapReduce
     • Goal: to store tables with billions of rows and millions of columns
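To make "versioned, column-oriented" on slide 8 more concrete, here is a minimal in-memory sketch of the idea: each cell is addressed by a row key and a `family:qualifier` column, and keeps a bounded number of timestamped versions. The class `ToyVersionedTable`, its methods, and the sample keys are hypothetical names invented for illustration; this is not the HBase client API and it ignores distribution, persistence and regions entirely.

```python
from collections import defaultdict


class ToyVersionedTable:
    """Illustrative stand-in for a versioned column store (NOT the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell and drop versions beyond the limit."""
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]

    def get(self, row, column, versions=1):
        """Return up to `versions` newest (timestamp, value) pairs for a cell."""
        cells = self._rows.get(row, {}).get(column, [])
        return cells[:versions]


# Hypothetical sample data: three writes to the same cell, two versions kept.
table = ToyVersionedTable(max_versions=2)
table.put("user42", "info:city", "Pune", timestamp=1)
table.put("user42", "info:city", "Mumbai", timestamp=2)
table.put("user42", "info:city", "Nagpur", timestamp=3)

latest = table.get("user42", "info:city")
history = table.get("user42", "info:city", versions=5)
```

Rows only hold the columns that were actually written, which is why a sparse table with a very large number of possible columns stays cheap to store in this kind of layout.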
  9. Data Processing (ETL / Analytics)
     Extract / Load
     • Source / target is an RDBMS: Sqoop
     • Log collection and aggregation: Flume, Scribe, Chukwa
     • Stream processing: S4, Storm (Storm also supports transformation)
     Transformation
     • MapReduce programming in Java or any other language, or high-level query languages like Pig and Hive
     • Workflow design and implementation using tools like Oozie and Azkaban
     • Iterative algorithms or in-memory cluster processing using Spark, Shark etc.
     Analytics
     • Mahout: scalable machine learning library with most of the algorithms implemented on top of Apache Hadoop using the map/reduce paradigm
     • RHadoop: provides R packages to access data in HDFS and HBase and also to write MapReduce jobs in R
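The map/reduce paradigm mentioned on slide 9 can be sketched in a few lines of plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce) using a word count; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are assumptions made for this sketch and are not part of the Hadoop API, which runs these phases in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input: two short "documents".
counts = run_job(["pig hive pig", "hive hbase"])
```

Pig and Hive let users express the same kind of grouping and aggregation as a script or query instead of hand-written map and reduce functions.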
  10. Common Industry Use Cases
     Use case | Solution | Comments
     • Cold data storage | HDFS | More cost-effective option compared to most appliances in the market
     • Huge transactional volume data | HBase | StumbleUpon created OpenTSDB to capture their infrastructure metrics
     • Batch processing | MapReduce / Hive / Pig |
     • Log aggregation | Flume, Scribe, Chukwa | Web-log collection on HDFS in near real-time
     • Real-time message / stream processing | Storm, S4 | Used by Twitter for real-time tweet processing
     • Iterative algorithms / in-memory processing | Spark / Shark | Predictive analytics, log mining
     • Machine learning / analytics | Mahout, RHadoop |
     • Graph data storage / processing | Giraph | Championed at Yahoo!
  11. Proposed Big Data Roadmap
     Kickoff - Assessment Study
     • Understand the business processes
     • Understand organizational goals and current investments
     • Understand the challenges and pain points of the current setup
     Proof of Concept
     • A Proof of Concept can be performed to demonstrate the applicability of Hadoop to enhance the DW
     Big Data integration – Initial steps
     • Move cold/warm data to Hive/HBase to reduce expenses on storage infrastructure
     • Bring in new data sources like web logs, which was not possible with traditional storage solutions
     Big Data integration – Next steps
     • Identify the opportunities in the ETL and analytics space
     • Move hot data to Hadoop
     • Perform real-time data integration using Storm/Spark
     Big Data integration – Next steps
     • Throw data open to business users for analysis and they will appreciate the power of the new infrastructure
     Big Data integration – Next steps
     • Implement advanced solutions
     Hadoop Technology Stack
     • HDFS, HBase
     • Hive, Pig, MapReduce
     • Mahout, RHadoop
  12. Thank You
     For more information, visit www.syntelinc.com/bi/