Big Data: Hadoop Ecosystem and NoSQL

Transcript

  • 1. Overview of Big Data: Hadoop Ecosystem and NoSQL Databases
      • Khanderao Kand, CTO, GloMantra Inc.
      • Entrepreneur and technologist
      • Twitter: @khanderao
  • 2. Big Data
      • The dominant trend for 2013 will, once again, be Big Data.
      • Gartner reports it as a must-have technology for “competitive advantage by 2015”.
      • IDC forecasts, in its report Worldwide Big Data Technology and Services 2012-2015, that the market for Big Data will grow from $3.2 billion in 2010 to $16.9 billion in 2015.
      • By 2016, revenue from the big data sector will approach $24 billion, reaching $48.3 billion by 2018.
  • 3. The image was taken from the Atacama desert in western South America by Yuri Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012. Copyright Yuri Beletsky.
  • 4. Alignment…
      • Explosion of data from site logs, search engines, social media…
      • Google published papers on MapReduce and the Google File System; these inspired Doug Cutting, then working on Apache Lucene and Nutch, and Hadoop was born.
      • Yahoo took it further with 1000 nodes in 2008.
      • It became possible to process very large data sets on commodity hardware.
      • Apache open source.
  • 5. Big Data Stack
    [Diagram: tools positioned along Speed and Scale axes. Recoverable labels: Patents; Matlab, SAS, SPSS, R, SciPy, Mahout; kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop.]
  • 6. Big Data Architecture
    [Architecture diagram. Recoverable labels: Analytics, Products, Apps, BI, BI Tools, Dev, Visualization; Unstructured Data, Structured Data; Lucene, Nutch, SOLR; Hadoop, Map Reduce; No-SQL; RDBMS; ETL, Data Integration; Workflow, Scheduler; Admin & Monitoring; Data logs, Streams.]
  • 7. HDFS
      • Large data sets
      • Write once, read many
      • Fault tolerant
      • Distributed file system
      • NameNode and DataNodes
      • Fixed-size data blocks
      • Checksum
      • Files are a sequence of blocks
      • Blocks replicated over a balanced cluster
      • Heartbeat reports from nodes
    [Diagram labels: Client 1, Client 2, NameNode, Read, Write, Rack 1, Rack N, Replication.]
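The block, checksum, and replication ideas on this slide can be illustrated with a small in-memory sketch. This is not the HDFS API: `BLOCK_SIZE`, `REPLICATION`, `Block`, `split_into_blocks`, and `read_file` are hypothetical names, the block size is a toy value, and the round-robin placement is an assumption made for illustration (real HDFS placement is rack-aware and decided by the NameNode).

```python
import zlib
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 8      # bytes; toy value chosen for the example
REPLICATION = 3     # copies kept of every block (assumed for the sketch)


@dataclass
class Block:
    index: int          # position of the block within the file
    data: bytes
    checksum: int       # CRC32 of the data, verified on read
    nodes: List[str]    # data nodes holding a replica of this block


def split_into_blocks(payload: bytes, nodes: List[str]) -> List[Block]:
    """Cut a file into fixed-size blocks and assign replicas round-robin."""
    if len(nodes) < REPLICATION:
        raise ValueError("need at least REPLICATION data nodes")
    blocks = []
    for index, offset in enumerate(range(0, len(payload), BLOCK_SIZE)):
        chunk = payload[offset:offset + BLOCK_SIZE]
        replicas = [nodes[(index + r) % len(nodes)] for r in range(REPLICATION)]
        blocks.append(Block(index, chunk, zlib.crc32(chunk), replicas))
    return blocks


def read_file(blocks: List[Block]) -> bytes:
    """Reassemble the file in block order, rejecting corrupted blocks."""
    out = bytearray()
    for block in sorted(blocks, key=lambda b: b.index):
        if zlib.crc32(block.data) != block.checksum:
            raise ValueError(f"checksum mismatch in block {block.index}")
        out += block.data
    return bytes(out)
```

Because every block lists several nodes, losing one node still leaves other replicas to read from, which is the fault-tolerance property the slide refers to.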
  • 8. Map Reduce
      • Two-step approach to solving a problem: Map and Reduce
      • Move the code to the data
      • The Map step processes data on the nodes
      • The Reduce step aggregates results from all Map nodes with a reduce algorithm
      • The JobTracker distributes and tracks tasks
      • TaskTrackers on the processing nodes communicate task status to the JobTracker
      • Inspired by functional programming
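The two steps can be sketched as a single-process word count. This is a minimal illustration of the programming model, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names, and the input splits stand in for data blocks that a real cluster would process on separate nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key before they reach the reducers."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def run_job(splits: List[str]) -> Dict[str, int]:
    """Run map over every split, then shuffle, then reduce per key."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "big fast data", "data"]))
```

Each call to `map_phase` depends only on its own split, which is what lets a framework run the map step where the data already lives.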
  • 9. Hadoop Ecosystem
    [Layered diagram. Recoverable labels: BI, Analytics, Apps, RDBMS; Workflow Orchestration: Chukwa, Oozie, Flume; Data Access: Avro, Pig, Hive, Sqoop, HBase; Security, Recovery, Infra: ZooKeeper; Network: Nagios, Ganglia; Processing: Map Reduce, HCatalog; Storage: HDFS.]
  • 10. Apache Hive
      • SQL-like HiveQL
      • Warehousing apps
      • Compiles to MapReduce tasks
      • Facebook, Netflix, etc.
  • 11. Apache Pig Latin
      • Higher-level scripting above MapReduce
      • Procedural (unlike SQL) but easy like SQL
      • Constructs like FOREACH, GROUP
      • Supports user-defined functions
      • From Yahoo
      • Good for integrating and writing Hadoop jobs
  • 12. Sqoop
      • Bulk data load
      • Data import and export
      • RDBMS and NoSQL
      • HDFS, HBase
      • Data is sliced
      • Slices are transferred via map-only jobs
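The slicing idea can be sketched as follows: a key range is divided into contiguous slices, and each slice is copied by an independent map task with no reduce step. This is an illustration only, not Sqoop's implementation; `slice_key_range` and `import_rows` are hypothetical names, and the source table is assumed to be a dictionary keyed by an integer primary key.

```python
from typing import Dict, List, Tuple


def slice_key_range(low: int, high: int, num_slices: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [low, high] into near-equal slices."""
    total = high - low + 1
    num_slices = min(num_slices, total)      # never create empty slices
    base, extra = divmod(total, num_slices)
    slices, start = [], low
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # spread the remainder
        slices.append((start, start + size - 1))
        start += size
    return slices


def import_rows(table: Dict[int, str], num_slices: int) -> List[List[str]]:
    """Run one map-only 'task' per slice; outputs are kept separate, not reduced."""
    bounds = slice_key_range(min(table), max(table), num_slices)
    return [
        [table[key] for key in range(lo, hi + 1) if key in table]
        for lo, hi in bounds
    ]


if __name__ == "__main__":
    rows = {i: f"row-{i}" for i in range(1, 11)}
    print(slice_key_range(1, 10, 3))
    print(import_rows(rows, 3))
```

Since no slice depends on another, the tasks need no aggregation, which is why a map-only job is sufficient for the transfer.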
  • 13. Chukwa & Flume
      • Hadoop subproject
      • Large-scale log processing
      • On Map R
      • Collection and analysis
      • Batch oriented
      • Components:
          • Agents
          • Collectors
          • MR jobs for parsing & archiving
          • HICC: Hadoop Infra Care Center web app
  • 14. Big "Fast" Data
      • Real-time ad hoc queries
      • Once again inspired by Google: Percolator and Dremel
      • Cloudera Impala
          • SQL-like queries on HDFS
          • Lower latency
          • Bypasses MapReduce
      • Apache Drill
  • 15. NoSQL Databases
      • Document databases: MongoDB, CouchDB
      • Column databases: Cassandra, HBase
      • KV pair:
      • Graph DB: Neo4j
  • 16. MongoDB
      • Document oriented
      • Flexible: no fixed schema
      • Distributed: sharding based on different policies
      • Fault tolerant via replication
      • Easy to install and use
      • JSON documents, stored in BSON format
      • JavaScript-based queries
      • Drivers for Java, Python, and other languages
      • Open source, supported by 10gen
      • Fast reads
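Two common sharding policies, hashed and ranged, can be sketched as routing functions that map a shard key to a shard number. This is not the MongoDB driver or its internal algorithm: `shard_by_hash` and `shard_by_range` are hypothetical names, and MD5 is used here only as a stable hash for the illustration.

```python
import hashlib
from bisect import bisect_right
from typing import List


def shard_by_hash(shard_key: str, num_shards: int) -> int:
    """Hashed policy: spread keys evenly, at the cost of losing key order."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_by_range(shard_key: str, boundaries: List[str]) -> int:
    """Ranged policy: keys below boundaries[0] go to shard 0, and so on.

    `boundaries` must be sorted; N boundaries define N + 1 shards.
    """
    return bisect_right(boundaries, shard_key)


if __name__ == "__main__":
    # Documents need not share a schema; only the shard key is required.
    docs = [{"user": "alice", "age": 30}, {"user": "zed", "tags": ["a", "b"]}]
    for doc in docs:
        print(doc["user"],
              shard_by_hash(doc["user"], 4),
              shard_by_range(doc["user"], ["h", "p"]))
```

A ranged policy keeps neighbouring keys on the same shard, which helps range queries, while a hashed policy trades that locality for a more even spread of writes.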
  • 17. CouchDB
      • Document oriented
      • JSON format
      • HTTP/REST interface
      • MapReduce views written in JavaScript
      • Replication support
      • Multi-version concurrency control (MVCC)
      • Written in Erlang
      • Fast writes and reads
      • Good availability
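Multi-version concurrency control can be sketched as an optimistic check on a revision number: an update succeeds only if the writer names the revision it last read. This is a simplified in-memory model, not CouchDB's HTTP API; `DocumentStore` and `ConflictError` are hypothetical names, and plain integers stand in for CouchDB's revision identifiers.

```python
from typing import Dict, Optional, Tuple


class ConflictError(Exception):
    """Raised when a writer's revision is not the current one."""


class DocumentStore:
    def __init__(self) -> None:
        # doc_id -> (current revision, document body)
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def get(self, doc_id: str) -> Tuple[int, dict]:
        """Return the current revision and a copy of the body."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id: str, body: dict, rev: Optional[int] = None) -> int:
        """Create (rev=None) or update a document; reject stale revisions."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = 1 if current_rev is None else current_rev + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

Readers are never blocked by writers in this scheme; a writer holding a stale revision is simply told to re-read and retry.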
  • 18. Apache Cassandra
      • Based on Amazon Dynamo
      • Column oriented
      • Theoretically infinite columns
      • Columns as tuples: (name, value, timestamp)
      • Organized as column families (unlike HBase)
      • Not Hadoop based
      • Equal nodes, easier to configure and manage
      • Parallel writes
      • Netflix, etc.
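The column tuple from this slide can be modelled directly. The sketch below is an in-memory illustration, not a Cassandra client: `Column` and `ColumnFamily` are hypothetical names, and resolving concurrent writes by keeping the column with the newest timestamp is stated here as an assumption of the sketch.

```python
from typing import Dict, NamedTuple, Optional


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int   # supplied by the writer; used to order writes


class ColumnFamily:
    def __init__(self) -> None:
        # row key -> column name -> Column; rows may hold different columns
        self.rows: Dict[str, Dict[str, Column]] = {}

    def write(self, row_key: str, column: Column) -> None:
        """Keep the column only if it is at least as new as the stored one."""
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def read(self, row_key: str, name: str) -> Optional[str]:
        """Return the stored value, or None if the column is absent."""
        column = self.rows.get(row_key, {}).get(name)
        return None if column is None else column.value
```

Because the timestamp travels with each column, writes can be applied in any order on any node and still converge on the same value, which fits the parallel-write point above.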
  • 19. Apache HBase
      • Modeled on Google Bigtable
      • Column oriented
      • A column family is stored together, as opposed to all columns of a row
      • Table schema with columns is predefined
      • However, columns can be added at runtime
      • Fault tolerant
      • Runs on HDFS
      • MapReduce based
      • Interfaces via REST, Avro, Thrift
      • Facebook's messaging platform
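Storing a column family together, rather than storing all columns of a row together, can be sketched with one dictionary per family. This is an in-memory illustration, not the HBase client API: `Table`, `put`, `get_row`, and `scan_family` are hypothetical names, and the sketch assumes that families are fixed when the table is created while column qualifiers inside a family can be added at any time.

```python
from typing import Dict, Iterable


class Table:
    def __init__(self, families: Iterable[str]) -> None:
        # family -> row key -> column qualifier -> value
        self._families: Dict[str, Dict[str, Dict[str, str]]] = {
            name: {} for name in families
        }

    def put(self, row: str, family: str, qualifier: str, value: str) -> None:
        """Write one cell; new qualifiers are allowed, new families are not."""
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        self._families[family].setdefault(row, {})[qualifier] = value

    def get_row(self, row: str) -> Dict[str, str]:
        """Assemble a row by visiting every family's separate store."""
        return {
            f"{family}:{qualifier}": value
            for family, rows in self._families.items()
            for qualifier, value in rows.get(row, {}).items()
        }

    def scan_family(self, family: str) -> Dict[str, Dict[str, str]]:
        """Read one family for all rows without touching the others."""
        return self._families[family]
```

A query that needs only one family, such as `scan_family("info")`, never reads the data of the other families, which is the benefit of grouping storage by column family.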