Big data hadoop ecosystem and nosql



  1. Overview of Big Data, Hadoop Ecosystem and NoSQL Databases
     Khanderao Kand, CTO, GloMantra Inc.
     Entrepreneur and Technologist
     Twitter @khanderao
  2. Big Data
     • The dominant trend for 2013 will, once again, be Big Data
     • Gartner reports it as a must-have technology for “competitive advantage by 2015”
     • IDC forecasts that the market for Big Data will grow from $3.2 billion in 2010 to $16.9 billion in 2015 (report: Worldwide Big Data Technology and Services 2012-2015)
     • By 2016, revenue from the big data sector will approach $24 billion, reaching $48.3 billion by 2018
  3. The image was taken from the Atacama desert in western South America by Yuri Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012. Copyright Yuri Beletsky
  4. Alignment…
     • Explosion of data from site logs, search engines, social media…
     • Google published papers on MapReduce and the Google File System, which inspired Doug Cutting, then working on Apache Lucene and Nutch; Hadoop was born
     • Yahoo took it further with 1000 nodes in 2008
     • Possible to process very, very large data on commodity hardware
     • Apache open source
  5. Big Data Stack Patents
     [Diagram: two rows of tools, each laid out from Speed to Scale]
     • Speed to Scale: Matlab, SAS, SPSS, R, SciPy, Mahout
     • Speed to Scale: kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop
  6. Big Data Architecture
     [Diagram. Labels: Analytics Products; Apps; BI; BI Tools; Dev; Visualization; Unstructured Data; Structured Data; Lucene; Nutch; SOLR; Hadoop; MapReduce; NoSQL; RDBMS; ETL; Data Integration; Workflow Scheduler; System Admin & Monitoring; Data logs; Streams]
  7. HDFS
     • Large data sets
     • Write once, read many
     • Fault tolerant
     • Distributed file system
     • NameNode and DataNodes
     • Fixed-size data blocks
     • Checksums
     • Files are sequences of blocks
     • Blocks replicated over a balanced cluster
     • Heartbeat reports from nodes
     [Diagram labels: Client 1, Client 2, NameNode, Read, Write, Rack 1 … Rack N, Replication]
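The block, checksum and replication bullets above can be pictured with a small sketch. This is not HDFS code: the 8-byte block size, the replication factor of 3, the node names and the round-robin placement are all assumptions made for illustration, and real placement is rack-aware.

```python
import zlib

BLOCK_SIZE = 8    # bytes; deliberately tiny, real HDFS blocks are far larger
REPLICATION = 3   # assumed replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_index, nodes, replication=REPLICATION):
    """Round-robin placement, a stand-in for a real rack-aware policy."""
    return [nodes[(block_index + r) % len(nodes)] for r in range(replication)]

def build_block_map(data, nodes):
    """Metadata a NameNode-like service might keep for one file."""
    return [
        {
            "index": index,
            "length": len(block),
            "checksum": zlib.crc32(block),  # lets a reader detect corruption
            "replicas": place_replicas(index, nodes),
        }
        for index, block in enumerate(split_into_blocks(data))
    ]

# Hypothetical node names spread over two racks.
nodes = ["rack1-node1", "rack1-node2", "rack2-node1", "rack2-node2"]
block_map = build_block_map(b"write once, read many", nodes)
for entry in block_map:
    print(entry)
```

A reader would fetch each block from any listed replica and recompute the checksum before trusting the bytes.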
  8. MapReduce
     • Two-step approach, Map and Reduce, to solving a problem
     • Move the code to the data
     • Map step processes data on the nodes
     • Reduce step aggregates results from all Map nodes with a reduce algorithm
     • JobTracker distributes and tracks tasks
     • TaskTrackers on processing nodes communicate task status to the JobTracker
     • Inspired by functional programming
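The two steps can be sketched in plain Python with the classic word count. This is a single-process illustration of the programming model only; it does not use the Hadoop API, and the function names and the sample input splits are made up for the example.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: aggregate every value seen for one key."""
    return key, sum(values)

def run_job(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    mapped = [pair for split in splits for line in split for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())

# Each inner list stands for the data held on one node.
splits = [["big data on hadoop"], ["hadoop stores big files"]]
counts = run_job(splits)
print(counts)
```

In a real cluster each split would be mapped on the node that stores it, which is the point of moving the code to the data.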
  9. Hadoop Ecosystem
     [Layered diagram, top to bottom]
     • BI, Analytics, Apps, RDBMS
     • Workflow / Orchestration: Chukwa, Oozie, Flume
     • Data Access: Avro, Pig, Hive, Sqoop, HBase, ZooKeeper
     • Processing: MapReduce, HCatalog
     • Storage: HDFS
     • Alongside the layers: Security, Recovery, Infra; Network: Nagios, Ganglia
  10. Apache Hive
      • SQL-like HiveQL
      • Warehousing apps
      • Compiles to MapReduce tasks
      • Facebook, Netflix, etc.
  11. Apache Pig Latin
      • Higher-level scripting above MapReduce
      • Procedural (unlike SQL) but easy like SQL
      • Constructs like FOREACH, GROUP
      • Supports user-defined functions
      • From Yahoo
      • Good for integrating and writing Hadoop jobs
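The procedural, step-by-step style can be illustrated with a Python analogue. This is not Pig Latin syntax: each statement only mirrors the role of a FOREACH or GROUP step, and `domain_of` is a hypothetical user-defined function written for this example.

```python
from itertools import groupby
from operator import itemgetter

def domain_of(url):
    """Hypothetical user-defined function: extract the host from a URL."""
    return url.split("/")[2] if "://" in url else url.split("/")[0]

# Stand-in for loaded input: (user, url) records.
records = [
    ("alice", "http://example.com/a"),
    ("bob", "http://example.org/b"),
    ("carol", "http://example.com/c"),
]

# Step 1, like FOREACH ... GENERATE: project each record through the UDF.
projected = [(user, domain_of(url)) for user, url in records]

# Step 2, like GROUP ... BY: collect users per domain.
ordered = sorted(projected, key=itemgetter(1))
grouped = {
    domain: [user for user, _ in rows]
    for domain, rows in groupby(ordered, key=itemgetter(1))
}

# Step 3, like FOREACH over the groups with a count.
visit_counts = {domain: len(users) for domain, users in grouped.items()}
print(visit_counts)
```

Each intermediate result is named and can be inspected, which is the contrast with a single declarative SQL statement.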
  12. Sqoop
      • Bulk data load
      • Data import and export
      • RDBMS and NoSQL
      • HDFS, HBase
      • Data is sliced
      • Slices are transferred via map-only jobs
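The slicing idea can be sketched as dividing a key range among parallel tasks. The sketch below assumes a dense integer key and is not Sqoop's actual implementation; `split_ranges` and its parameters are names invented for the illustration.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices,
    one per map task; earlier slices absorb any remainder."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than rows: leave this task without a slice
        ranges.append((start, start + size - 1))
        start += size
    return ranges

slices = split_ranges(1, 10, 4)
for low, high in slices:
    # Each map-only task would copy just the rows in its own slice.
    print(f"task copies rows with id between {low} and {high}")
```

Because the slices do not overlap and no aggregation is needed, no reduce step is required.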
  13. Chukwa & Flume
      • Hadoop subproject
      • Large-scale log processing
      • On MapReduce
      • Collection and analysis
      • Batch oriented
      • Components: Agents; Collectors; MR jobs for parsing & archiving; HICC (Hadoop Infrastructure Care Center) web app
  14. Big "Fast" Data
      • Real-time ad hoc queries
      • Once again inspired by Google: Percolator and Dremel
      • Cloudera Impala: SQL-like queries on HDFS; lower latency; bypasses MapReduce
      • Apache Drill
  15. NoSQL Databases
      • Document databases: MongoDB, CouchDB
      • Column databases: Cassandra, HBase
      • KV pair:
      • Graph DB: Neo4j
  16. MongoDB
      • Document oriented
      • Flexible, no fixed schema
      • Distributed: sharding based on different policies
      • Fault tolerant via replication
      • Easy to install and use
      • JSON / BSON storage format
      • JavaScript-based queries
      • Java, Python, other languages
      • Open source, supported by 10gen
      • Fast reads
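Two of the bullets above, schema flexibility and sharding by different policies, can be sketched without a database. This uses no MongoDB driver; the shard names, the range boundaries and both routing functions are assumptions for illustration only.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]   # hypothetical shard names
RANGE_BOUNDS = [1000, 2000]               # shard0: < 1000, shard1: 1000-1999, shard2: >= 2000

def range_shard(key, bounds=RANGE_BOUNDS, shards=SHARDS):
    """Range policy: nearby keys land on the same shard."""
    return shards[bisect.bisect_right(bounds, key)]

def hash_shard(key, shards=SHARDS):
    """Hash policy: keys are spread evenly, at the cost of range locality."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Documents in one collection need not share the same fields.
docs = [
    {"_id": 42, "name": "Ada"},
    {"_id": 1500, "name": "Lin", "tags": ["nosql", "hadoop"]},
    {"_id": 2000, "title": "untitled", "views": 7},
]

placement = {doc["_id"]: range_shard(doc["_id"]) for doc in docs}
print(placement)
print({doc["_id"]: hash_shard(doc["_id"]) for doc in docs})
```

The choice between the two policies is a trade-off between efficient range scans and an even spread of writes.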
  17. CouchDB
      • Document oriented
      • JSON format
      • HTTP/REST interface
      • MapReduce, JavaScript
      • Replication support
      • Multi-version concurrency control (MVCC)
      • Written in Erlang
      • Fast write and read
      • Good availability
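Multi-version concurrency control can be sketched as an in-memory store in which every update must name the revision it was based on. This is a simplification and not CouchDB's API: revisions here are plain integers, and `DocumentStore` and `Conflict` are names invented for the example.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale revision."""

class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return {"_id": doc_id, "_rev": rev, **body}

    def put(self, doc_id, body, rev=None):
        """Store a new version only if `rev` matches the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev

store = DocumentStore()
first_rev = store.put("talk", {"title": "Big Data"})
second_rev = store.put("talk", {"title": "Big Data and NoSQL"}, rev=first_rev)

# A second writer still holding the first revision is rejected, not overwritten.
try:
    store.put("talk", {"title": "stale edit"}, rev=first_rev)
    conflict_seen = False
except Conflict:
    conflict_seen = True
print(store.get("talk"), conflict_seen)
```

Writers never lock the document; the loser of a race re-reads the latest revision and retries.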
  18. Apache Cassandra
      • Based on Amazon Dynamo
      • Column oriented
      • Theoretically infinite columns
      • Columns as tuples (name, value, timestamp)
      • Organized as column families
      • Not Hadoop based (unlike HBase)
      • Equal nodes, easier to configure and manage
      • Parallel writes
      • Netflix, etc.
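The (name, value, timestamp) column tuple is what lets equal nodes accept writes in parallel and reconcile afterwards. Below is a minimal sketch of timestamp-based reconciliation; the `Column` type, the `reconcile` helper and the sample data are illustrative and not taken from Cassandra's code.

```python
from typing import NamedTuple

class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the writer; the newest write wins

def reconcile(*replica_rows):
    """Merge the same row as seen by several replicas, keeping for each
    column name the version with the highest timestamp."""
    merged = {}
    for row in replica_rows:
        for col in row:
            kept = merged.get(col.name)
            if kept is None or col.timestamp > kept.timestamp:
                merged[col.name] = col
    return merged

# Two replicas received different writes for the same row.
replica_a = [Column("email", "ada@old.example", 100), Column("city", "London", 120)]
replica_b = [Column("email", "ada@new.example", 150)]

row = reconcile(replica_a, replica_b)
print({name: col.value for name, col in row.items()})
```

Rows may carry different sets of column names, which is why the column count is not fixed in advance.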
  19. Apache HBase
      • Modeled on Google Bigtable
      • Column oriented
      • A column family is stored together, as opposed to all columns of a row
      • Table schema predefines the column families
      • However, columns can be added at runtime
      • Fault tolerant
      • Runs on HDFS
      • Works with MapReduce
      • Interfaces via REST, Avro, Thrift
      • Facebook's messaging platform
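The storage layout described above can be sketched as one store per column family, with families fixed at table creation and columns within a family added freely. This is an in-memory illustration, not the HBase client API; the `Table` class and the `family:qualifier` sample columns are assumptions for the example.

```python
class Table:
    def __init__(self, families):
        # One separate store per column family, fixed when the table is created.
        self._stores = {family: {} for family in families}

    def put(self, row, column, value):
        """Write one cell; `column` is written as 'family:qualifier'."""
        family, qualifier = column.split(":", 1)
        if family not in self._stores:
            raise KeyError(f"unknown column family: {family}")
        # New qualifiers inside an existing family need no schema change.
        self._stores[family][(row, qualifier)] = value

    def scan_family(self, family):
        """Read one family only, sorted by row and qualifier."""
        return sorted(self._stores[family].items())

table = Table(families=["info", "stats"])
table.put("u1", "info:name", "Ada")
table.put("u1", "info:email", "ada@example.com")   # column added at runtime
table.put("u1", "stats:logins", 3)

try:
    table.put("u1", "prefs:theme", "dark")          # family was never declared
    rejected = False
except KeyError:
    rejected = True

print(table.scan_family("info"), rejected)
```

Scanning `info` never touches the `stats` store, which is the benefit of keeping each family's data together.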