Hadoop
Ecosystem
ACM Bay Area Data Mining Camp 2011
Patrick Nicolas
September 19, 2011
http://patricknicolas.blogspot.com
http://www.slideshare.net/pnicolas
https://github.com/prnicolas

Copyright 2011 Patrick Nicolas - All rights reserved

Overview
Besides providing developers and analysts with an open source
implementation of the map-reduce functional model, the Hadoop
ecosystem incorporates analytical algorithms, task/workflow
managers and NoSQL stores.
Client code, Scripts
NoSQL: Key-value stores, Document stores, Multi-column stores, Graph databases
Analytics: Mahout
Configuration: Zookeeper
Workflow: Hive, Pig, Cascading
Map/Reduce framework
HDFS
Java Virtual Machine

Key Components
The Hadoop ecosystem can be described as a data-centric
taxonomy of tools to analyze, aggregate, store and report data.
Hadoop
  Admin.: Zookeeper
  File System: GFS, HDFS
  MapReduce
  NoSQL
    K-V Stores: Redis, Memcache, Kyoto Cabinet
    Doc Stores: MongoDB, CouchDB
    Multi-column stores: HBase, Hypertable, BigData, Cassandra, BerkeleyDB
    Graph DB: Neo4j, GraphDB, InfiniteGraph
  Workflow
    Script: Pig
    API: Cascading
    SQL: Hive
  Analytics: Mahout, Chukwa

NoSQL: Overview

Non-relational data stores allow large amounts of data to be
collected very efficiently. Unlike RDBMS schemas, NoSQL
schemas are optimized for sequential writes and are therefore
poorly suited to ad hoc querying and reporting.

Key -> Value (column families, nested structures)

NoSQL stores share the same basic key-value schema but
provide different methods to describe values.
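
As an illustration of that shared shape, here is a minimal Python sketch. It uses plain dictionaries, not the API of any particular store, and the keys and field names are made up. Every record is reached through a key, while the value ranges from an opaque string to a nested document or column family.

```python
# Illustrative only: one plain dict standing in for four kinds of stores.
# All keys and field names below are hypothetical.

store = {
    # key-value store: the value is an opaque string
    "user:1:name": "alice",
    # data-structure value: a set
    "user:1:tags": {"admin", "beta"},
    # document store: nested object with a dynamic schema
    "user:1:doc": {"name": "alice", "address": {"city": "Oakland"}},
    # multi-column store: column family -> column -> (value, timestamp)
    "user:1:row": {"profile": {"name": ("alice", 1316390400)}},
}


def describe(value):
    """Return a coarse label for how a value is structured."""
    if isinstance(value, str):
        return "opaque string"
    if isinstance(value, (set, list)):
        return "collection"
    if isinstance(value, dict):
        nested = any(isinstance(v, dict) for v in value.values())
        return "nested structure" if nested else "flat document"
    return "other"


# Same lookup mechanism (a key) for every entry; only the value shape differs.
shapes = {key: describe(value) for key, value in store.items()}
```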

NoSQL: Key-Value & Document Stores
Key-Value files (HDFS)
<key, value>
Distributed, replicated blocks of sequential key-value string pairs

Key-Value stores (Redis, Memcache)
<key*, value>
Language-independent, distributed, sorted key-value pairs (in Redis,
values can be lists, sets or hashes) with in-memory caching and
support for atomic operations.

Document stores (MongoDB, CouchDB)
{ "k1": val1, "k2": val2 }
Fault-tolerant, document-centric stores using a dynamic schema of
sorted JavaScript objects, with support for a limited SQL-like syntax.
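
A rough Python sketch of the dynamic-schema idea follows. The in-memory list and the `find` helper are hypothetical stand-ins, not the MongoDB or CouchDB API: documents in one collection need not share the same fields, and a query is simply a set of field/value criteria.

```python
# Hypothetical in-memory "collection"; real document stores expose their own query APIs.
documents = [
    {"_id": 1, "name": "alice", "city": "Oakland"},
    {"_id": 2, "name": "bob"},  # no "city" field: the schema is dynamic
    {"_id": 3, "name": "carol", "city": "Oakland", "age": 41},
]


def find(collection, criteria):
    """Return documents whose fields equal every (field, value) pair in criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


oakland_ids = [doc["_id"] for doc in find(documents, {"city": "Oakland"})]
```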

NoSQL: Tuples & Graphs

Sorted, ordered tuples (Cassandra, HBase, ...)
{ name:x, value: { key1: {name:key1, value:v1, tstamp:x}, key2:x } }

Fault-tolerant, distributed, sorted and ordered columns grouped into
families; a 'super-column' is a map of an unbounded number of columns
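
The nested layout above can be sketched in Python as follows. This is a simplified model under stated assumptions: the `ColumnFamilyRow` class is hypothetical, and resolving conflicting writes by keeping the newest timestamp mirrors Cassandra-style behaviour rather than every multi-column store.

```python
from collections import defaultdict


class ColumnFamilyRow:
    """Hypothetical row: family name -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.families = defaultdict(dict)

    def put(self, family, column, value, tstamp):
        """Keep the write only if it is at least as recent as the stored one."""
        current = self.families[family].get(column)
        if current is None or tstamp >= current[1]:
            self.families[family][column] = (value, tstamp)

    def columns(self, family):
        """Return (name, value) pairs of one family, sorted by column name."""
        return [(name, cell[0]) for name, cell in sorted(self.families[family].items())]


row = ColumnFamilyRow()
row.put("profile", "name", "alice", tstamp=1)
row.put("profile", "email", "a@example.org", tstamp=2)
row.put("profile", "name", "alicia", tstamp=3)  # newer timestamp replaces the old value
row.put("profile", "email", "stale@example.org", tstamp=1)  # older write is ignored
cols = row.columns("profile")
```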

Graph databases (Neo4j, GraphDB, InfiniteGraph, ...)
Efficient transactions, traversal & storage of entities (vertices),
attributes & relationships (edges)
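
A small Python sketch of the vertex/attribute/edge model, with made-up data and no graph database API involved: vertices carry attributes, edges carry a relationship type, and a breadth-first traversal follows outgoing edges.

```python
from collections import deque

# Hypothetical property graph: vertices carry attributes, edges carry a relationship type.
vertices = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "WORKS_AT", "acme"),
]


def neighbors(vertex):
    """Yield (relationship, target) for every edge leaving the vertex."""
    for source, relationship, target in edges:
        if source == vertex:
            yield relationship, target


def reachable(start):
    """Breadth-first traversal returning vertices in the order they are reached."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for _, target in neighbors(vertex):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


path = reachable("alice")
```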

Data Flow Managers
Map & Reduce tasks can be abstracted behind task or workflow
managers using high-level languages such as scripts, SQL or a
UNIX-pipe-like API. These data flow tools hide the functional
complexity of Map-Reduce from domain experts.
Scripting: Pig
SQL: Hive
API (pipes & flows): Cascading

API: Map (x5) -> Combine (x2) -> Reduce (x4)
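
To show what these tools hide, here is a single-process Python sketch of the map, combine and reduce stages for a count per key. It is an illustration only and does not use the Hadoop API; the function names and the two input splits are invented.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, 1) for each record; the key is the record's f1 field."""
    yield record["f1"], 1


def combine(pairs):
    """Pre-aggregate the pairs produced by one mapper before they are shuffled."""
    partial = defaultdict(int)
    for key, count in pairs:
        partial[key] += count
    return list(partial.items())


def reduce_phase(pairs):
    """Sum the partial counts of each key across all mappers."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Two hypothetical input splits, one per mapper.
splits = [
    [{"f1": 1}, {"f1": 2}, {"f1": 1}],
    [{"f1": 2}, {"f1": 2}],
]

combined = [
    combine(chain.from_iterable(map_phase(record) for record in split))
    for split in splits
]
counts = reduce_phase(chain.from_iterable(combined))
```

The combine step is optional for correctness here: it only reduces the number of pairs handed to the reducers.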

Data Flow Code Samples
Pig Latin
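-- Load tab-delimited records (Pig's string type is chararray)
A = LOAD 'mydata' USING PigStorage() AS (f1:int, name:chararray);
B = GROUP A BY f1;
-- Count the records in each group (the bag named A)
C = FOREACH B GENERATE COUNT(A);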

Hive
-- Load the raw file into source table y, then write per-group counts into z
LOAD DATA LOCAL INPATH 'xxx' OVERWRITE INTO TABLE y;
INSERT OVERWRITE TABLE z SELECT count(*) FROM y GROUP BY f1;

Cascading
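// Source tap: reads text lines from inpath
Scheme srcScheme = new TextLine(new Fields("line"));
Tap src = new Hfs(srcScheme, inpath);
// Sink tap: outpath is an assumed output location, not defined on the original slide
Tap sink = new Hfs(new TextLine(), outpath);
Pipe counter = new Pipe("count");
// Group by f1 (assumes each tuple exposes a field named "f1"), then count each group
counter = new GroupBy(counter, new Fields("f1"));
counter = new Every(counter, new Count());
FlowConnector connector = new FlowConnector(props);
Flow flow = connector.connect("count", src, sink, counter);
flow.complete();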

