Overview of Big Data
Hadoop Ecosystem and NoSQL Databases
Khanderao Kand, CTO, GloMantra Inc.
Entrepreneur and Technologist
Twitter: @khanderao

Big Data
• The dominant trend for 2013 will, once again, be Big Data.
• Gartner reports it as a must-have technology for "competitive advantage by 2015".
• IDC forecasts that the market for Big Data will grow from $3.2 billion in 2010 to $16.9 billion in 2015, in its report Worldwide Big Data Technology and Services 2012-2015.
• By 2016, revenue from the big data sector will approach $24 billion, reaching $48.3 billion by 2018.

The image was taken from the Atacama desert in western South America by Yuri Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012. Copyright Yuri Beletsky.

Alignment…
• Explosion of data from site logs, search engines, social media…
• Google published papers on MapReduce and the Google File System, which inspired Doug Cutting, then working on Apache Lucene and Nutch; Hadoop was born.
• Yahoo took it further with 1000 nodes in 2008.
• It became possible to process very large data sets on commodity hardware.
• Apache open source.

Big Data Stack
[Figure: two charts, each with axes labeled Speed and Scale. Label recovered: Patents. First chart: Matlab, SAS, SPSS, R, SciPy, Mahout. Second chart: kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop.]

Big Data Architecture
[Figure: architecture diagram. Labels recovered: Analytics, Products, Apps, BI, BI Tools, Dev, Visualization; Unstructured Data, Lucene, Nutch, SOLR; Hadoop, MapReduce, NoSQL, RDBMS, Hadoop/NoSQL-based system; Structured Data, ETL, Workflow, Scheduler, Admin & Monitoring, Data Integration; Data logs, Streams.]

HDFS
• Large data sets
• Write once, read many
• Fault tolerant
• Distributed file system
• NameNode and DataNodes
• Fixed-size data blocks
• Checksums
• Files: a sequence of blocks
• Replicated over a balanced cluster
• Heartbeat reports from nodes
[Figure: Client 1 and Client 2 read and write through the NameNode; blocks are replicated across Rack 1 to Rack N.]

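The block, checksum and replication ideas on the HDFS slide can be sketched in a few lines of plain Python. This is only an illustration, not the HDFS API: the names `split_into_blocks`, `place_replicas` and `read_block`, the 8-byte block size, the replication factor of 3 and the node names are all assumptions made for the example, and the round-robin placement ignores the rack awareness a real cluster uses.

```python
import zlib

BLOCK_SIZE = 8    # bytes; deliberately tiny, real HDFS blocks are far larger
REPLICATION = 3   # assumed replication factor for this sketch


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, rotating the start
    node so that blocks spread evenly over the cluster."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def read_block(block, expected_checksum):
    """Return the block only if its CRC32 matches the stored checksum."""
    if zlib.crc32(block) != expected_checksum:
        raise IOError("checksum mismatch: replica is corrupt")
    return block


data = b"write once, read many times"
blocks = split_into_blocks(data)                     # file = sequence of blocks
checksums = [zlib.crc32(b) for b in blocks]          # stored alongside each block
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
restored = b"".join(read_block(b, c) for b, c in zip(blocks, checksums))
```

If a replica fails its checksum, a reader would fall back to another node listed in `placement` for that block, which is what makes replication useful for fault tolerance.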
Map Reduce
• Two-step approach to solving a problem: Map and Reduce
• Move the code to the data
• The Map step processes data on the nodes where it is stored
• The Reduce step aggregates results from all Map nodes with a reduce algorithm
• The JobTracker distributes and tracks tasks
• TaskTrackers on processing nodes communicate task status to the JobTracker
• Inspired by functional programming

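The two steps above can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop code: the function names are hypothetical, each inner list stands in for the data held on one node, and the `shuffle` function models the grouping by key that the framework performs between the Map and Reduce steps.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(splits):
    # Each split stands in for the data held on one node.
    mapped = [pair for split in splits for line in split for pair in map_step(line)]
    return dict(reduce_step(k, v) for k, v in shuffle(mapped).items())


counts = word_count([["big data big"], ["data on hadoop"]])
```

Because `map_step` looks at one line at a time and `reduce_step` at one key at a time, both can run in parallel on many nodes without sharing state.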
Hadoop Ecosystem
[Figure: layered diagram. Layer labels recovered: Workflow, Orchestration, Data Access, Security/Recovery/Infra, Network, Processing (MapReduce), Storage (HDFS). Components shown: BI, Analytics, Apps, RDBMS, Chukwa, Oozie, Flume, Avro, Pig, Hive, Sqoop, HBase, ZooKeeper, Nagios, Ganglia, HCatalog.]

Apache Hive
• SQL-like HiveQL
• Warehousing apps
• Compiles to MapReduce tasks
• Facebook, Netflix, etc.

Apache Pig Latin
• Higher-level scripting above MapReduce
• Procedural (unlike SQL) but as easy as SQL
• Constructs like FOREACH, GROUP
• Supports user-defined functions
• From Yahoo
• Good for integrating and writing Hadoop jobs

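To illustrate the procedural, step-by-step style of the GROUP and FOREACH constructs, here is a small Python imitation of a three-line Pig Latin script. The script in the comment, the relation and field names, and the helper functions `group_by` and `foreach` are all made up for the example; real Pig would compile such steps into MapReduce jobs rather than run them in memory.

```python
from collections import defaultdict

# Hypothetical Pig Latin script being imitated (names are invented):
#   logs    = LOAD 'logs' AS (user:chararray, bytes:int);
#   grouped = GROUP logs BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(logs.bytes);

logs = [("ann", 10), ("bob", 5), ("ann", 7)]   # stands in for LOAD


def group_by(relation, key_index):
    """GROUP: collect the tuples of a relation under their key."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)


def foreach(groups, generate):
    """FOREACH ... GENERATE: apply an expression to every group."""
    return [generate(key, rows) for key, rows in groups.items()]


grouped = group_by(logs, 0)
totals = foreach(grouped, lambda user, rows: (user, sum(r[1] for r in rows)))
```

Each statement names an intermediate relation, which is the main contrast with a single declarative SQL query.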
Sqoop
• Bulk data load
• Data import and export
• Between RDBMS and NoSQL stores
• Targets HDFS and HBase
• Data is sliced
• Slices are transferred via map-only jobs

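The slicing idea can be sketched as follows: split the key range of a table into contiguous slices and let each slice be copied by an independent task with no reduce step. This is an assumption-laden illustration, not Sqoop's implementation or API; `split_ranges`, `import_slice`, the numeric `id` key and the in-memory `table` are hypothetical.

```python
def split_ranges(min_id, max_id, num_slices):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices
    whose sizes differ by at most one row."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_slices)
    ranges = []
    start = min_id
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more slices requested than rows available
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def import_slice(table, lo, hi):
    """One map-only task: copy the rows of a single slice, with no reduce step."""
    return [row for row in table if lo <= row["id"] <= hi]


table = [{"id": i, "name": f"row{i}"} for i in range(1, 11)]
slices = split_ranges(1, 10, 4)
imported = [row for lo, hi in slices for row in import_slice(table, lo, hi)]
```

Since the slices do not overlap and together cover the whole key range, the tasks can run in parallel and their outputs only need to be written side by side.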
Chukwa & Flume
• Hadoop subproject
• Large-scale log processing
• Built on MapReduce
• Collection and analysis
• Batch oriented
• Components:
  Agents
  Collectors
  MR jobs for parsing and archiving
  HICC: Hadoop Infrastructure Care Center web app

Big "Fast" Data
• Real-time ad hoc queries
• Once again inspired by Google: Percolator and Dremel
• Cloudera Impala:
  SQL-like queries on HDFS
  Lower latency
  Bypasses MapReduce
• Apache Drill

Apache Cassandra
• Based on Amazon's Dynamo design
• Column oriented
• Theoretically infinite columns
• Columns stored as tuples (name, value, timestamp)
• Organized as column families
• Not Hadoop based (unlike HBase)
• Equal (peer) nodes, easier to configure and manage
• Parallel writes
• Netflix, etc.

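The (name, value, timestamp) column tuple can be sketched in Python as below. The `Column` and `Row` names are hypothetical and this is not the Cassandra client API. The rule that the write with the newest timestamp wins is an assumption of this sketch, included to show why a timestamp is useful when equal nodes accept writes in parallel.

```python
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])


class Row:
    """One row of a column family: a sparse, open-ended map of columns."""

    def __init__(self):
        self.columns = {}

    def write(self, name, value, timestamp):
        """Keep the incoming column only if it is newer than the stored one."""
        current = self.columns.get(name)
        if current is None or timestamp > current.timestamp:
            self.columns[name] = Column(name, value, timestamp)

    def read(self, name):
        column = self.columns.get(name)
        return None if column is None else column.value


row = Row()
row.write("email", "old@example.com", timestamp=100)
row.write("email", "new@example.com", timestamp=200)
row.write("email", "stale@example.com", timestamp=150)  # arrives late, ignored
row.write("city", "Pune", timestamp=120)                # new column, no schema change
```

Any number of differently named columns can be written to the same row, which is the sense in which the column count is theoretically unbounded.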
Apache HBase
• Modeled after Google Bigtable
• Column oriented
• Each column family is stored together, as opposed to storing all columns of a row together
• Table schema predefines the column families
• However, columns can be added at runtime
• Fault tolerant
• Runs on HDFS
• Works with MapReduce jobs
• Interfaces via REST, Avro, Thrift
• Used by Facebook's messaging platform
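A toy model of the column-family layout, written in Python, may make the schema rules above concrete. It is a sketch under stated assumptions rather than the HBase client API: the `Table` class, its `put` and `get` methods, and the `meta` and `body` family names are invented for illustration.

```python
class Table:
    """Toy table: column families are fixed at creation, columns are not."""

    def __init__(self, families):
        # One separate store per column family, so a family's data sits together.
        self.stores = {family: {} for family in families}

    def put(self, row_key, family, qualifier, value):
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family].setdefault(row_key, {})[qualifier] = value

    def get(self, row_key, family):
        """Read one family of a row without touching the other families."""
        return dict(self.stores[family].get(row_key, {}))


messages = Table(families=["meta", "body"])
messages.put("msg1", "meta", "sender", "ann")
messages.put("msg1", "meta", "read", "yes")    # new column added at runtime
messages.put("msg1", "body", "text", "hello")
```

Reading `messages.get("msg1", "meta")` touches only the `meta` store, which is the practical benefit of keeping a column family together instead of storing whole rows.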