SlideShare a Scribd company logo

Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

At StampedeCon 2012 in St. Louis, Scott Fines of NISC presents: Recent years have seen a sudden and rapid introduction of new technologies for distributing applications to essentially arbitrary levels. The growth in variety and depth of these different systems has grown to match, and it can be a challenge just to keep up. In this talk, I’ll discuss some of the more common systems such as Hadoop, HBase, and Cassandra, and some of the different scenarios and pitfalls of using them. I’ll cover when MapReduce is powerful and helpful, and when it’s better to use a different approach. Putting it all together, I’ll mention ZooKeeper, Flume, and some of the surrounding small projects that can help make a useable system.

Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

1 of 40
Welcome to The Jungle
Building Distributed Systems for Large data sets
!     SQL solves all our problems!
      !   Or does it?
The Problem with SQL


!     At some point, data is too large to fit on
      a single machine.
      !   Then what do you do?
Your cluster:

             SQL




          Application
The first sign of trouble

!     Can do small queries pretty good
!     Large analytical queries?
      !   forget it!

         !     Takes too long
         !     Uses too many resources
Hadoop for Bulk Processing

Recommended

Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick GuideAsim Jalis
 
Optimising eCommerce with Machine Learning & Game Theory — Cassandra, Elasti...
 Optimising eCommerce with Machine Learning & Game Theory — Cassandra, Elasti... Optimising eCommerce with Machine Learning & Game Theory — Cassandra, Elasti...
Optimising eCommerce with Machine Learning & Game Theory — Cassandra, Elasti...Loqate, a GBG Solution
 
Predicting behaviour with Machine Learning
Predicting behaviour with Machine Learning Predicting behaviour with Machine Learning
Predicting behaviour with Machine Learning Loqate, a GBG Solution
 
Evan Ellis "Tumblr. Massively Sharded MySQL"
Evan Ellis "Tumblr. Massively Sharded MySQL"Evan Ellis "Tumblr. Massively Sharded MySQL"
Evan Ellis "Tumblr. Massively Sharded MySQL"Alexey Mahotkin
 
Intro to cassandra + hadoop
Intro to cassandra + hadoopIntro to cassandra + hadoop
Intro to cassandra + hadoopJeremy Hanna
 
How We Scaled Freshdesk To Take 65M Requests/week
How We Scaled Freshdesk To Take 65M Requests/weekHow We Scaled Freshdesk To Take 65M Requests/week
How We Scaled Freshdesk To Take 65M Requests/weekFreshdesk Inc.
 
C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?DataStax
 

More Related Content

What's hot

HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cvrevuri
 
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraHadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraRyu Kobayashi
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Jean-Pierre König
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystemAmit Bhardwaj
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs sparkamarkayam
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystemJakub Stransky
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick IntroductionJames Salter
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudQubole
 
Big data references
Big data referencesBig data references
Big data referenceszarigatongy
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
BIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOPBIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOPImviplav
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 

What's hot (20)

HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cv
 
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraHadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
 
HDFS
HDFSHDFS
HDFS
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
 
Bw tech hadoop
Bw tech hadoopBw tech hadoop
Bw tech hadoop
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Big data references
Big data referencesBig data references
Big data references
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
BIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOPBIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOP
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
 

Similar to Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsKaniska Mandal
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasThoughtworks
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archguest18a0f1
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archroyans
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archmclee
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoopGeoff Hendrey
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 

Similar to Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012 (20)

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxBrandon Minnick, MBA
 
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?GleecusTechlabs1
 
zigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfzigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfDomotica daVinci
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build PolandGDSC PJATK
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveDomotica daVinci
 
Power of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfPower of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfkatalinjordans1
 
Semiconductor Review Magazine Feature.pdf
Semiconductor Review Magazine Feature.pdfSemiconductor Review Magazine Feature.pdf
Semiconductor Review Magazine Feature.pdfkeyaramicrochipusa
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringMassimo Talia
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Memory Fabric Forum
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Daniel Toomey
 
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfQuinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfDomotica daVinci
 
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre..."Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...shaiyuvasv
 
Journey of Television in World & in India
Journey of Television in World & in IndiaJourney of Television in World & in India
Journey of Television in World & in IndiaAdarshAgarwal66
 
H3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptxH3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptxMemory Fabric Forum
 
M.Aathiraju Self Intro.docx-AD21001_____
M.Aathiraju Self Intro.docx-AD21001_____M.Aathiraju Self Intro.docx-AD21001_____
M.Aathiraju Self Intro.docx-AD21001_____Aathiraju
 
Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Mateusz Kwasniewski
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersThousandEyes
 
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfTete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfDomotica daVinci
 
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGAUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGLiveplex
 
My self introduction to know others abut me
My self  introduction to know others abut meMy self  introduction to know others abut me
My self introduction to know others abut meManoj Prabakar B
 

Recently uploaded (20)

Introduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptxIntroduction to Serverless with AWS Lambda in C#.pptx
Introduction to Serverless with AWS Lambda in C#.pptx
 
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
Unlocking the Cloud's True Potential: Why Multitenancy Is The Key?
 
zigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdfzigbee motion sensor user manual NAS-PD07B2.pdf
zigbee motion sensor user manual NAS-PD07B2.pdf
 
Bit N Build Poland
Bit N Build PolandBit N Build Poland
Bit N Build Poland
 
Manual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-WaveManual Eurotronic Thermostatic Valve Comry Z-Wave
Manual Eurotronic Thermostatic Valve Comry Z-Wave
 
Power of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfPower of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdf
 
Semiconductor Review Magazine Feature.pdf
Semiconductor Review Magazine Feature.pdfSemiconductor Review Magazine Feature.pdf
Semiconductor Review Magazine Feature.pdf
 
Dynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineeringDynamical systems simulation in Python for science and engineering
Dynamical systems simulation in Python for science and engineering
 
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
Q1 Memory Fabric Forum: Advantages of Optical CXL​ for Disaggregated Compute ...
 
Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024Microsoft Azure News - Feb 2024
Microsoft Azure News - Feb 2024
 
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdfQuinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
Quinto Z-Wave Heltun_HE-RS01_User_Manual_B9AH.pdf
 
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre..."Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
"Journey of Aspiration: Unveiling the Path to Becoming a Technocrat and Entre...
 
Journey of Television in World & in India
Journey of Television in World & in IndiaJourney of Television in World & in India
Journey of Television in World & in India
 
H3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptxH3 Platform CXL Solution_Memory Fabric Forum.pptx
H3 Platform CXL Solution_Memory Fabric Forum.pptx
 
M.Aathiraju Self Intro.docx-AD21001_____
M.Aathiraju Self Intro.docx-AD21001_____M.Aathiraju Self Intro.docx-AD21001_____
M.Aathiraju Self Intro.docx-AD21001_____
 
Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.Put a flag on it. A busy developer's guide to feature toggles.
Put a flag on it. A busy developer's guide to feature toggles.
 
Enhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for PartnersEnhancing SaaS Performance: A Hands-on Workshop for Partners
Enhancing SaaS Performance: A Hands-on Workshop for Partners
 
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdfTete thermostatique Zigbee MOES BRT-100 V2.pdf
Tete thermostatique Zigbee MOES BRT-100 V2.pdf
 
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMINGAUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
AUGMENTED REALITY (AR) IN DAILY LIFE: EXPANDING BEYOND GAMING
 
My self introduction to know others abut me
My self  introduction to know others abut meMy self  introduction to know others abut me
My self introduction to know others abut me
 

Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

  • 1. Welcome to The Jungle Building Distributed Systems for Large data sets
  • 2. !   SQL solves all our problems! !   Or does it?
  • 3. The Problem with SQL !   At some point, data is too large to fit on a single machine. !   Then what do you do?
  • 4. Your cluster: SQL Application
  • 5. The first sign of trouble !   Can do small queries pretty good !   Large analytical queries? !   forget it! !   Takes too long !   Uses too many resources
  • 6. Hadoop for Bulk Processing
  • 7. !   Hadoop = HDFS + MapReduce !   HDFS = Distributed, Fault Tolerant File System !   MapReduce = Highly distributed processing engine
  • 8. !   MapReduce works if: !   Your algorithm needs to touch every piece of data in the set !   You can write your algorithm in a MapReduce structure !   Your data set is gigantic
  • 9. !   MapReduce is not so good if: !   Your data set is very small !   Your algorithm doesn t need to touch everything !   You only want to query specific pieces of data
  • 10. !   No Indexing !   Job startup cost !   No indices !   Always touches all the data
  • 11. !   MapReduce code is usually a pain to write !   requires a Java developer !   lots of boilerplate for common tasks
  • 13. Apache Pig !   Data Flow Language !   feels like using sed/awk !   good at transformations of data
  • 14. Apache Hive !   SQL-like interface !   good for large queries !   maintains table information from files
  • 15. Pig vs. Hive !   Both can do the same thing !   Hive is easier to learn !   Pig is easier to maintain !   Pretty much a matter of taste
  • 16. The second sign !   Your Bulk processing and ad-hoc analysis is working great in Hadoop !   But now your small queries are sucking
  • 17. Scale SQL? !   A Few options: !   Buy Oracle Rac...$$$$ !   Static Sharding...hard to maintain !   Don t do it?
  • 19. Column-Oriented Storage !   SQL = !   Fixed Columns, infinite rows !   Column-Oriented: !   Rows are groups of Key-Value pairs
  • 20. HBase/Cassandra !   Both Column-oriented stores !   Both highly available !   Both rely on memory for performance
  • 21. Apache Cassandra !   Highly Available and Partition Tolerant !   Attempts to hold as much data as possible in memory !   Manages files on local disk
  • 22. Eventual Consistency !   Cassandra has Eventual Consistency !   It is possible to read out-of-date data! !   Also possible to guarantee consistency, at a cost
  • 23. Why Eventual Consistency? !   Data is only written once !   Either it s there or not !   You don t care if you get out-of-date data !   Shopping Carts
  • 24. Cassandra Strengths !   Fast !   Writes faster than Reads! !   Easy to maintain !   Self-contained
  • 25. Cassandra Weaknesses !   Consistency Model is complex !   Scanning over rows is excruciating
  • 26. Apache HBase !   Uses HDFS as storage mechanism !   Holds large proportion of data in RAM !   need RAM >= 1% of your data size!
  • 27. HBase Strengths !   Strong consistency guarantee !   Good at scanning over rows !   Strong community !   part of the Hadoop ecosystem
  • 28. HBase weaknesses !   Slower than Cassandra !   HDFS is higher latency than direct disk !   Complex to maintain !   requires running !   HDFS !   ZooKeeper
  • 29. HBase vs. Cassandra !   Pick Cassandra if: !   Doings lots of writes !   need easy maintenance !   don t care about consistency so much !   Pick HBase if !   Scanning over rows a lot !   comfortable with maintaining Hadoop/ZooKeeper !   Need simple consistency guarantees
  • 30. Your cluster: HBase/ Hadoop Cassandra SQL Application
  • 31. This is complicated! !   How do we configure it? !   What if we have to run an algorithm on only a single node at a time? !   What if we need to coordinate actions?
  • 32. Apache ZooKeeper !   Distributed Coordination System !   Designed for creating distributed concurrency controls !   also good for storing configuration !   NOT good for storing anything else!
  • 33. !   Now you have: !   Bulk Processing with Hadoop !   Large data queries with HBase/ Cassandra !   Coordination with ZooKeeper !   Your old SQL database!
  • 34. !   Chances are, still need SQL for some stuff !   If the data sizes are manageable, SQL is tried-and-true
  • 35. The People Problem !   Big Data systems are complicated !   Lots of moving parts !   Lots of places where things can go wrong !   Need good people!
  • 36. !   Try and Hire an expert directly... !   Not that many out there
  • 37. !   Train 2 or 3 experts instead !   Worth every penny
  • 38. Who should I hire? !   Probably won t find direct experts !   Look instead for people who: !   are good with algorithms !   are fast learners !   not risk-averse
  • 40. Thank You !   email: ! scottfines@gmail.com !   github: !   scottfines !   linkedin: !   scottfines