Overview of Big Data
Hadoop Ecosystem and NoSQL Databases
Khanderao Kand
CTO, GloMantra Inc.
Entrepreneur and Technologist
Twitter: @khanderao
Big Data

The dominant trend for 2013 will, once again, be Big Data

Gartner reports it as a must-have technology for “competitive advantage by 2015”

IDC, in its report Worldwide Big Data Technology and Services 2012-2015, forecasts that the market for Big Data will grow from $3.2 billion in 2010 to $16.9 billion in 2015.

Revenue from the big data sector is projected to approach $24 billion by 2016 and reach $48.3 billion by 2018.
The image was taken from the Atacama desert in western South America by Yuri
Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012.
Copyright Yuri Beletsky
Alignment…

Explosion of data from site logs, search engines, social media…

Google published papers on MapReduce and the Google File System; these inspired Doug Cutting, then working on Apache Lucene and Nutch, and Hadoop was born

Yahoo took it further, with 1000 nodes in 2008

Possible to process very large data sets on commodity hardware

Apache open source
Big Data Stack

[Figure: two charts, each plotting tools along Speed and Scale axes]
Chart 1 (labelled "Patents"): Matlab, SAS, SPSS, R, SciPy, Mahout
Chart 2: kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop
Big Data Architecture

[Diagram: layered architecture]
Top: Analytics Products, Apps, BI, BI Tools - Dev, Visualization
Middle: Lucene / Nutch, Hadoop Map Reduce, Hadoop-based No-SQL, RDBMS, No-SQL, SOLR
Bottom: ETL / Data Integration, Workflow & Scheduler, System Admin / Monitoring
Data sources: Unstructured Data, Structured Data, RDBMS, Datalogs, Streams
HDFS
Large Data Set
Write Once – Read Many
Fault Tolerant
Distributed File System

Name Node – Data Node
Fixed Size Data Blocks
Checksum
Files – Sequence of blocks

Replicated over Balanced Cluster
Heartbeat Report from Nodes

[Diagram: Client 1 and Client 2, NameNode, Read and Write paths, Replication across Rack 1 … Rack N]
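
The block, checksum and replication ideas on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HDFS is implemented: the block size is tiny on purpose, the round-robin placement is a stand-in for HDFS's real placement policy, and the names `split_into_blocks`, `place_replicas`, `build_block_map` and `verify` are hypothetical.

```python
import zlib

BLOCK_SIZE = 8    # bytes; tiny for illustration, real HDFS blocks are far larger
REPLICATION = 3   # assumed number of copies kept of each block


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """A file is stored as a sequence of fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index: int, nodes, replication: int = REPLICATION):
    """Pick distinct nodes for one block (round-robin; an assumption, not HDFS's policy)."""
    replication = min(replication, len(nodes))
    return [nodes[(block_index + r) % len(nodes)] for r in range(replication)]


def build_block_map(data: bytes, nodes):
    """What a name-node-like catalogue might record: block, checksum, replica locations."""
    block_map = []
    for index, block in enumerate(split_into_blocks(data)):
        block_map.append({
            "index": index,
            "length": len(block),
            "checksum": zlib.crc32(block),
            "nodes": place_replicas(index, nodes),
        })
    return block_map


def verify(block: bytes, checksum: int) -> bool:
    """A reader recomputes the checksum to detect a corrupted block."""
    return zlib.crc32(block) == checksum


if __name__ == "__main__":
    for entry in build_block_map(b"hello hdfs, hello blocks", ["dn1", "dn2", "dn3", "dn4"]):
        print(entry)
```

Because every block lives on several nodes, losing one node leaves other copies available, which is the fault-tolerance point the slide makes.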
Map Reduce

•   Two-step approach, Map and Reduce, to solving a problem
•   Move the code to the data
•   Map step processes data on the nodes
•   Reduce step aggregates the results from all Map nodes with a reduce algorithm
•   JobTracker distributes and tracks tasks
•   TaskTracker on each processing node communicates task status to the JobTracker
•   Inspired by functional programming
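
The two steps above can be sketched with a word count in plain Python. This is a single-process illustration of the idea, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the "splits" stand in for data that would live on different nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line: str):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as happens between the Map and Reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate every value seen for one key."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(splits):
    """Run map over every split, then reduce the grouped results."""
    mapped = [pair
              for split in splits
              for line in split
              for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    splits = [["big data big"], ["data on hadoop"]]
    print(word_count(splits))
```

In a real cluster each split would be mapped on the node that already holds it, which is what "move the code to the data" refers to.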
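Hadoop Ecosystem

[Diagram: layered stack, top to bottom]
Top: BI Analytics, Apps, RDBMS
Workflow / Orchestration: Chukwa, Oozie, Flume
Data Access: Avro, Pig, Hive, Sqoop, HBase
Processing: Map Reduce
HCatalog (shown between Processing and Storage)
Storage: HDFS
Alongside the stack: ZooKeeper; Network; Security, Recovery, Infra; Nagios, Ganglia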
Apache Hive

SQL-like HiveQL

Warehousing Apps

Compiles to MapReduce Tasks

Facebook, Netflix, etc.
Apache Pig Latin
Higher-level scripting above Map Reduce

Procedural (unlike SQL) but easy like SQL

Constructs like FOREACH, GROUP

Supports User Defined Functions

From Yahoo

Good for integrating and writing Hadoop jobs
Sqoop
Bulk data load

Data import and export

RDBMS and NoSQL

HDFS, HBase

Data is sliced

Slices are transferred via map-only jobs
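
The slicing idea can be sketched as follows: a key range is divided into contiguous slices, and each slice becomes an independent query that one map task could run. This is a minimal sketch, not Sqoop's own splitting logic; `split_ranges`, `slice_query`, and the table and column names are hypothetical.

```python
def split_ranges(min_id: int, max_id: int, num_slices: int):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_slices)
    ranges = []
    start = min_id
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            break  # fewer keys than slices: stop once every key is covered
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def slice_query(table: str, key: str, lo: int, hi: int) -> str:
    """Build the query that a single map task would run for its slice."""
    return f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} <= {hi}"


if __name__ == "__main__":
    for lo, hi in split_ranges(1, 10, 4):
        print(slice_query("orders", "id", lo, hi))
```

No reduce step is needed because the slices do not overlap: each task simply writes its own rows to the target store.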
Chukwa & Flume

Hadoop Subproject

Large scale log processing

On Map Reduce

Collection and analysis

Batch Oriented

Components:
  Agents
  Collectors
  MR Jobs for Parsing & Archiving
  HICC : Hadoop Infra Care Center Web App
Big “Fast” Data
Real-time ad hoc query:

Once again inspired by Google: Percolator and Dremel

Cloudera: Impala
  SQL-like query on HDFS
  Lower latency
  Bypasses Map Reduce

Apache Drill
NoSQL Databases
Document Databases: MongoDB, CouchDB

Column Databases: Cassandra, HBase

KV Pair:

Graph DB: Neo4j
MongoDB
Document Oriented

Flexible - No Fixed Schema

Distributed – Sharding based on different policies

Fault Tolerant via Replication

Easy to install and use

JSON – BSON storage format

JavaScript-based query

Java, Python, and other languages

Open source, supported by 10gen

Fast Read
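
Two of the points above, documents without a fixed schema and sharding by a policy, can be shown with a toy in-memory store. This is a sketch only and does not use MongoDB or its drivers: `ToyCollection` and `shard_for` are hypothetical names, a hash of `_id` is assumed as the sharding policy, and JSON text stands in for BSON.

```python
import hashlib
import json


def shard_for(key: str, num_shards: int) -> int:
    """Hash-based policy (an assumption): the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ToyCollection:
    """In-memory stand-in for a sharded document collection."""

    def __init__(self, num_shards: int = 3):
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, doc: dict) -> None:
        key = str(doc["_id"])
        shard = self.shards[shard_for(key, len(self.shards))]
        shard[key] = json.dumps(doc)  # stored as serialized text, no schema enforced

    def find(self, **criteria):
        """Return every document whose fields equal the given criteria."""
        results = []
        for shard in self.shards:
            for raw in shard.values():
                doc = json.loads(raw)
                if all(doc.get(field) == value for field, value in criteria.items()):
                    results.append(doc)
        return results


if __name__ == "__main__":
    people = ToyCollection()
    people.insert({"_id": 1, "name": "Asha", "city": "Pune"})
    people.insert({"_id": 2, "name": "Ben", "languages": ["java", "python"]})
    print(people.find(city="Pune"))
```

The two inserted documents have different fields, yet both are accepted and both can be queried, which is what "no fixed schema" means in practice.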
CouchDB
Document Oriented
JSON format
HTTP/REST interface
MapReduce, JavaScript
Replication support
Multi-version concurrency control (MVCC)
Written in Erlang
Fast Write – Read
Good Availability
Apache Cassandra
Based on Amazon Dynamo

Column oriented

Theoretically infinite columns

Columns as tuples (name, value, timestamp)

Organized as column families

Not Hadoop based (unlike HBase)

Equal nodes, easier to configure and manage

Parallel writes

Netflix, etc.
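
The column tuple (name, value, timestamp) is what lets equal nodes accept writes in parallel and reconcile them afterwards: the newest timestamp wins. The sketch below illustrates that idea only; it is not Cassandra's code or API, the names `Column`, `write` and `merge` are hypothetical, and the handling of equal timestamps is a simplification.

```python
from typing import Dict, NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int


Row = Dict[str, Column]  # one row: column name -> newest column seen so far


def write(row: Row, column: Column) -> None:
    """Keep a column only if it is newer than what the row already holds."""
    current = row.get(column.name)
    if current is None or column.timestamp > current.timestamp:
        row[column.name] = column  # ties keep the existing value (simplification)


def merge(replica_a: Row, replica_b: Row) -> Row:
    """Reconcile two replicas of the same row, column by column."""
    merged = dict(replica_a)
    for column in replica_b.values():
        write(merged, column)
    return merged


if __name__ == "__main__":
    a: Row = {}
    b: Row = {}
    write(a, Column("email", "old@example.com", 1))
    write(b, Column("email", "new@example.com", 2))
    write(b, Column("city", "Pune", 1))
    print(merge(a, b))
```

Rows do not need the same set of columns: `b` carries a `city` column that `a` never saw, and the merge simply keeps it.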
Apache HBase
Modeled after Google Bigtable

Column Oriented

Column families are stored together, as opposed to storing all columns of a row together

Table schema is predefined with its column families

However, columns can be added at runtime

Fault Tolerant

Runs on HDFS

Integrates with MapReduce

Interface via REST, Avro, Thrift

Facebook's messaging platform
