Welcome to The Jungle
Building Distributed Systems for Large Data Sets
!     SQL solves all our problems!
      !   Or does it?
The Problem with SQL


!     At some point, data is too large to fit on
      a single machine.
      !   Then what do you do?
Your cluster:

[Diagram: the Application talks to a single SQL database]
The first sign of trouble

!     Handles small queries pretty well
!     Large analytical queries?
      !   forget it!

         !     Takes too long
         !     Uses too many resources
Hadoop for Bulk Processing
!     Hadoop = HDFS + MapReduce
      !   HDFS = Distributed, Fault Tolerant

          File System
      !   MapReduce = Highly distributed

          processing engine
!     MapReduce works if:
      !     Your algorithm needs to touch every piece of data in the set
      !     You can write your algorithm in a MapReduce structure
      !     Your data set is gigantic
!     MapReduce is not so good if:
      !     Your data set is very small
      !     Your algorithm doesn't need to touch everything
      !     You only want to query specific pieces of data
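
To make "a MapReduce structure" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word count. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real job would spread these steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Every record goes through the map phase; there is no index to skip any.
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["welcome to the jungle", "the jungle is large"]))
```

Note that `word_count` reads every input line even if you only care about one word, which is the reason the approach suits whole-data-set algorithms and not targeted queries.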
!     No indexing
      !   Always touches all the data
!     Job startup cost
!     MapReduce code is usually a pain to
      write
      !   requires a Java developer


      !   lots of boilerplate for common tasks
Pig and Hive!
Apache Pig


!      Data Flow Language 
      !   feels like using sed/awk


      !   good at transformations of data
Apache Hive


!     SQL-like interface
      !   good for large queries


      !   maintains table information from

          files
Pig vs. Hive

!     Both can do the same thing
      !   Hive is easier to learn


      !   Pig is easier to maintain


!     Pretty much a matter of taste
The second sign


!     Your bulk processing and ad-hoc
      analysis are working great in Hadoop
!     But now your small queries are sucking
Scale SQL?

!     A Few options:
      !   Buy Oracle RAC...$$$$


      !   Static Sharding...hard to maintain


      !   Don't do it?
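
One way to see why static sharding is hard to maintain: if each key's shard is a hash of the key modulo a fixed shard count, then changing the count reassigns most keys. The sketch below is an assumption about how such a scheme might look (the `shard_for` and `moved_keys` helpers and the `user-N` keys are hypothetical), not a description of any particular product.

```python
import hashlib


def shard_for(key, shard_count):
    """Static sharding: a stable hash of the key, modulo the number of shards."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_keys(keys, old_count, new_count):
    """Count keys whose shard changes when the shard count changes."""
    return sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )


keys = [f"user-{i}" for i in range(10_000)]
moved = moved_keys(keys, old_count=4, new_count=5)
print(f"{moved} of {len(keys)} keys must migrate when growing from 4 to 5 shards")
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, so roughly four in five keys would be expected to move. Every growth step therefore turns into a large data migration.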
HBase and Cassandra
Column-Oriented Storage


!     SQL = 
      !   Fixed Columns, infinite rows


!     Column-Oriented:
      !   Rows are groups of Key-Value pairs
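
A rough way to picture "rows are groups of key-value pairs" is a map from row key to that row's own map of column names to values. The `WideRowTable` class below is a hypothetical toy model for illustration only; it is not the HBase or Cassandra client API.

```python
from collections import defaultdict


class WideRowTable:
    """Toy model: each row key maps to its own set of column -> value pairs."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Rows need not share a schema: any row can hold any column.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


table = WideRowTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.com")
table.put("user:2", "name", "Grace")
table.put("user:2", "last_login", "2012-08-01")

print(table.columns("user:1"))
print(table.columns("user:2"))
```

Unlike a SQL table with fixed columns, the two rows here carry different column sets, and a missing column simply is not stored.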
HBase/Cassandra


!     Both Column-oriented stores
!     Both highly available
!     Both rely on memory for performance
Apache Cassandra


!     Highly Available and Partition Tolerant
!     Attempts to hold as much data as
      possible in memory
!     Manages files on local disk
Eventual Consistency

!     Cassandra has Eventual Consistency
      !   It is possible to read out-of-date

          data!
      !   Also possible to guarantee

          consistency, at a cost
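
The trade-off above can be sketched with the generic quorum idea used by Dynamo-style stores: with N replicas, a write acknowledged by W of them and a read that consults R of them must overlap whenever R + W > N. The simulation below is an assumption-laden toy (the `Replica` class, the `stale_read_possible` helper, and last-write-wins timestamps are all illustrative simplifications), not Cassandra's implementation.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Replica:
    """One copy of the data: key -> (timestamp, value)."""
    store: dict = field(default_factory=dict)

    def write(self, key, timestamp, value):
        current = self.store.get(key)
        # Last write wins: keep only the newest timestamp (a simplification).
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        return self.store.get(key)


def quorum_read(replicas, key):
    """Return the newest value among the replicas that were consulted."""
    answers = [a for a in (r.read(key) for r in replicas) if a is not None]
    return max(answers, key=lambda a: a[0])[1] if answers else None


def stale_read_possible(n, w, r):
    """True if some choice of r replicas can miss a write that reached only w."""
    replicas = [Replica() for _ in range(n)]
    for rep in replicas:
        rep.write("cart", 1, "old")
    for rep in replicas[:w]:  # the newer write has reached only w replicas so far
        rep.write("cart", 2, "new")
    return any(
        quorum_read(subset, "cart") != "new"
        for subset in combinations(replicas, r)
    )


print(stale_read_possible(n=3, w=1, r=1))  # cheap reads and writes
print(stale_read_possible(n=3, w=2, r=2))  # quorum reads and writes
```

The "cost" is visible in the parameters: the consistent configuration has to wait for more replicas on every read and write.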
Why Eventual Consistency?


!     Data is only written once
      !   Either it's there or not


!     You don't care if you get out-of-date
      data
      !   Shopping Carts
Cassandra Strengths

!     Fast
      !   Writes faster than Reads!


!     Easy to maintain
      !   Self-contained
Cassandra Weaknesses


!     Consistency Model is complex
!     Scanning over rows is excruciating
Apache HBase


!     Uses HDFS as storage mechanism
!     Holds large proportion of data in RAM
      !   need RAM >= 1% of your data size!
HBase Strengths

!     Strong consistency guarantee
!     Good at scanning over rows
!     Strong community
      !   part of the Hadoop ecosystem
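
Scanning is cheap in HBase largely because rows are kept sorted by row key, so a range of keys sits together. The `SortedRowStore` below is a hypothetical in-memory sketch of that idea, not the HBase client API; the date-prefixed keys are an assumed key design for illustration.

```python
import bisect


class SortedRowStore:
    """Toy store that keeps row keys in sorted order to support range scans."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> value

    def put(self, row_key, value):
        if row_key not in self._values:
            bisect.insort(self._keys, row_key)
        self._values[row_key] = value

    def scan(self, start_key, stop_key):
        """Yield (key, value) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedRowStore()
store.put("2012-08-02#click", "b")
store.put("2012-07-31#click", "c")
store.put("2012-08-01#view", "a")

for key, value in store.scan("2012-08-01", "2012-08-03"):
    print(key, value)
```

A scan only touches the contiguous slice between the two keys. A store that places rows by a hash of the key would scatter the same range across the cluster, which is one reason row scans can be painful there.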
HBase weaknesses
!     Slower than Cassandra
      !   HDFS is higher latency than direct

          disk
!     Complex to maintain
      !   requires running


           !   HDFS


           !   ZooKeeper
HBase vs. Cassandra

!     Pick Cassandra if:
      !     Doing lots of writes
      !     Need easy maintenance
      !     Don't care about consistency so much

!     Pick HBase if:
      !     Scanning over rows a lot
      !     Comfortable with maintaining Hadoop/ZooKeeper
      !     Need simple consistency guarantees
Your cluster:
[Diagram: the Application talks to Hadoop, HBase/Cassandra, and SQL]
This is complicated!


!     How do we configure it?
!     What if we have to run an algorithm on
      only a single node at a time?
!     What if we need to coordinate actions?
Apache ZooKeeper


!     Distributed Coordination System
      !     Designed for creating distributed concurrency controls
      !     also good for storing configuration
      !     NOT good for storing anything else!
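
As an illustration of a distributed concurrency control, the usual ZooKeeper lock recipe has each client create an ephemeral sequential node, and the client with the lowest sequence number holds the lock. The `ToyCoordinator` below imitates that ordering rule in memory; it is a hypothetical stand-in, not the ZooKeeper API, and it ignores networking, sessions, and watches.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._sequence = itertools.count()
        self._waiting = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and return its sequence number."""
        seq = next(self._sequence)
        self._waiting[seq] = client
        return seq

    def leave(self, seq):
        """Remove a client, as when it finishes or its session dies."""
        self._waiting.pop(seq, None)

    def holder(self):
        """The client with the lowest sequence number holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]


coordinator = ToyCoordinator()
a = coordinator.join("node-a")
b = coordinator.join("node-b")
c = coordinator.join("node-c")

print(coordinator.holder())  # node-a joined first, so it runs the job
coordinator.leave(a)         # node-a finishes (or crashes)
print(coordinator.holder())  # the next client in line takes over
```

This is the shape of the answer to "run an algorithm on only a single node at a time": every node asks the coordinator who holds the lock, and only that node proceeds.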
!     Now you have:
      !   Bulk Processing with Hadoop


      !   Large data queries with HBase/

          Cassandra
      !   Coordination with ZooKeeper


      !   Your old SQL database!
!     Chances are, you still need SQL for some
      stuff
!     If the data sizes are manageable, SQL is
      tried-and-true
The People Problem

!     Big Data systems are complicated
      !   Lots of moving parts


      !   Lots of places where things can go

          wrong
      !   Need good people!
!     Try to hire an expert directly...
      !   Not that many out there
!     Train 2 or 3 experts instead
      !   Worth every penny
Who should I hire?

!     You probably won't find direct experts
!     Look instead for people who:
      !   are good with algorithms


      !   are fast learners


      !   are not risk-averse
Questions?
Thank You

!     email: 
       !   scottfines@gmail.com
!     github:
       !   scottfines


!     linkedin:
       !   scottfines

More Related Content

What's hot

HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cvrevuri
 
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraHadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraRyu Kobayashi
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Jean-Pierre König
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
Amit Bhardwaj
 
HDFS
HDFSHDFS
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
amarkayam
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
Jakub Stransky
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
James Salter
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Big data references
Big data referencesBig data references
Big data references
zarigatongy
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
Qubole
 
BIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOPBIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOP
Imviplav
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
Mohanasundaram Ponnusamy
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
Humoyun Ahmedov
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
sheetal sharma
 

What's hot (20)

HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cv
 
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + CassandraHadoopソースコードリーディング第3回 Hadopo MR + Cassandra
Hadoopソースコードリーディング第3回 Hadopo MR + Cassandra
 
Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013Semantic web meetup 14.november 2013
Semantic web meetup 14.november 2013
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
 
HDFS
HDFSHDFS
HDFS
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
 
Hadoop distributions - ecosystem
Hadoop distributions - ecosystemHadoop distributions - ecosystem
Hadoop distributions - ecosystem
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Accumulo: A Quick Introduction
Accumulo: A Quick IntroductionAccumulo: A Quick Introduction
Accumulo: A Quick Introduction
 
Bw tech hadoop
Bw tech hadoopBw tech hadoop
Bw tech hadoop
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Big data references
Big data referencesBig data references
Big data references
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
BIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOPBIG DATA ANALYTICS WITH HADOOP
BIG DATA ANALYTICS WITH HADOOP
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
 

Similar to Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
Ahmed Salman
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsKaniska Mandal
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
SandeepTaksande
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
Thoughtworks
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archguest18a0f1
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
mclee
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
royans
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
Adaryl "Bob" Wakefield, MBA
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
André Faria Gomes
 

Similar to Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012 (20)

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Core concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data AnalyticsCore concepts and Key technologies - Big Data Analytics
Core concepts and Key technologies - Big Data Analytics
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
The ABC of Big Data
The ABC of Big DataThe ABC of Big Data
The ABC of Big Data
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

  • 1. Welcome to The Jungle: Building Distributed Systems for Large data sets
  • 2. SQL solves all our problems!
      • Or does it?
  • 3. The Problem with SQL
      • At some point, data is too large to fit on a single machine.
          • Then what do you do?
  • 4. Your cluster (diagram): SQL, Application
  • 5. The first sign of trouble
      • Can do small queries pretty good
      • Large analytical queries?
          • forget it!
              • Takes too long
              • Uses too many resources
  • 6. Hadoop for Bulk Processing
  • 7. Hadoop = HDFS + MapReduce
      • HDFS = Distributed, Fault Tolerant File System
      • MapReduce = Highly distributed processing engine
  • 8. MapReduce works if:
      • Your algorithm needs to touch every piece of data in the set
      • You can write your algorithm in a MapReduce structure
      • Your data set is gigantic
  • 9. MapReduce is not so good if:
      • Your data set is very small
      • Your algorithm doesn't need to touch everything
      • You only want to query specific pieces of data
  • 10.
      • No Indexing
      • Job startup cost
      • No indices
          • Always touches all the data
  • 11. MapReduce code is usually a pain to write
      • requires a Java developer
      • lots of boilerplate for common tasks
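
The "MapReduce structure" mentioned on slides 8 to 11 can be illustrated with a minimal single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in memory in one process. It only shows the shape an algorithm has to take: a map step that emits key-value pairs for every record, a grouping step, and a reduce step per key.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    # Hypothetical driver: note that it reads every record, with no index
    # to skip any of them, which matches the "always touches all the data"
    # point on the slides.
    return reduce_phase(shuffle(map_phase(records)))
```

For example, `word_count(["a b a", "b c"])` counts the words across both records. Because the map step visits every record, the approach pays off only when the job really does need the whole data set.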
  • 13. Apache Pig
      • Data Flow Language
          • feels like using sed/awk
          • good at transformations of data
  • 14. Apache Hive
      • SQL-like interface
          • good for large queries
          • maintains table information from files
  • 15. Pig vs. Hive
      • Both can do the same thing
          • Hive is easier to learn
          • Pig is easier to maintain
      • Pretty much a matter of taste
  • 16. The second sign
      • Your bulk processing and ad-hoc analysis are working great in Hadoop
      • But now your small queries are sucking
  • 17. Scale SQL?
      • A few options:
          • Buy Oracle RAC...$$$$
          • Static sharding...hard to maintain
          • Don't do it?
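
One way to see why static sharding is "hard to maintain" is to look at what happens when the shard count changes. The sketch below assumes one common form of static sharding, hash-modulo placement; the slides do not say which scheme the speaker had in mind, and the helper names `shard_for` and `fraction_moved` are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Assumed scheme: place a key on shard hash(key) mod num_shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    print(fraction_moved(keys, 4, 5))
```

Under this scheme, growing from 4 to 5 shards relocates about four in five keys in expectation, so adding a single machine means migrating most of the data.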
  • 19. Column-Oriented Storage
      • SQL =
          • Fixed Columns, infinite rows
      • Column-Oriented:
          • Rows are groups of Key-Value pairs
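
The data model on slide 19, where each row is a group of key-value pairs rather than a fixed set of columns, can be sketched as a dictionary of dictionaries. This is an in-memory illustration only; the class name `WideRowTable` and its methods are hypothetical and are not the HBase or Cassandra client API.

```python
from collections import defaultdict


class WideRowTable:
    """Toy table: each row key maps to its own set of column -> value pairs."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Any row may introduce a column that no other row has.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Column names present in this particular row, in sorted order."""
        return sorted(self._rows.get(row_key, {}))
```

A row for one user might hold `name` and `email` while another holds only `last_login`; no schema change is needed to add a column, which is the contrast with SQL's fixed columns that the slide draws.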
  • 20. HBase/Cassandra
      • Both Column-oriented stores
      • Both highly available
      • Both rely on memory for performance
  • 21. Apache Cassandra
      • Highly Available and Partition Tolerant
      • Attempts to hold as much data as possible in memory
      • Manages files on local disk
  • 22. Eventual Consistency
      • Cassandra has Eventual Consistency
          • It is possible to read out-of-date data!
      • Also possible to guarantee consistency, at a cost
  • 23. Why Eventual Consistency?
      • Data is only written once
          • Either it's there or not
      • You don't care if you get out-of-date data
          • Shopping Carts
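
The trade-off on slides 22 and 23 (stale reads are possible, and consistency can be guaranteed "at a cost") can be sketched with the usual quorum rule for replicated stores: with N replicas, a read of R replicas is guaranteed to see the latest write to W replicas when R + W > N. The toy model below is an assumption-laden illustration, not Cassandra's implementation; `ToyReplicaSet` and `quorum_overlaps` are hypothetical names, and the replica choice is deliberately the worst case (writes go to the first W replicas, reads to the last R).

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read set of size r must intersect every write set of size w."""
    return r + w > n


class ToyReplicaSet:
    """Toy model of n replicas, each holding a (timestamp, value) pair."""

    def __init__(self, n: int):
        self.replicas = [(0, None)] * n
        self.clock = 0

    def write(self, value, w: int):
        # Worst case for readers: only the first w replicas receive the write.
        self.clock += 1
        for i in range(w):
            self.replicas[i] = (self.clock, value)

    def read(self, r: int):
        # Worst case again: contact the last r replicas (r must be >= 1)
        # and return the value with the newest timestamp among them.
        contacted = self.replicas[-r:]
        return max(contacted, key=lambda pair: pair[0])[1]
```

With 3 replicas, writing to 1 and reading from 1 can miss the new value entirely, while writing to 2 and reading from 2 always overlaps on at least one replica. The cost is that every operation has to wait for more replicas to respond.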
  • 24. Cassandra Strengths
      • Fast
          • Writes faster than Reads!
      • Easy to maintain
          • Self-contained
  • 25. Cassandra Weaknesses
      • Consistency Model is complex
      • Scanning over rows is excruciating
  • 26. Apache HBase
      • Uses HDFS as storage mechanism
      • Holds large proportion of data in RAM
          • need RAM >= 1% of your data size!
  • 27. HBase Strengths
      • Strong consistency guarantee
      • Good at scanning over rows
      • Strong community
          • part of the Hadoop ecosystem
  • 28. HBase weaknesses
      • Slower than Cassandra
          • HDFS is higher latency than direct disk
      • Complex to maintain
          • requires running
              • HDFS
              • ZooKeeper
  • 29. HBase vs. Cassandra
      • Pick Cassandra if:
          • Doing lots of writes
          • need easy maintenance
          • don't care about consistency so much
      • Pick HBase if:
          • Scanning over rows a lot
          • comfortable with maintaining Hadoop/ZooKeeper
          • Need simple consistency guarantees
  • 30. Your cluster (diagram): Hadoop, HBase/Cassandra, SQL, Application
  • 31. This is complicated!
      • How do we configure it?
      • What if we have to run an algorithm on only a single node at a time?
      • What if we need to coordinate actions?
  • 32. Apache ZooKeeper
      • Distributed Coordination System
          • Designed for creating distributed concurrency controls
          • also good for storing configuration
          • NOT good for storing anything else!
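
The question on slide 31 about running an algorithm on only a single node at a time is typically answered with a lock built on a coordination service. The sketch below imitates the idea behind the well-known ZooKeeper lock recipe, in which each client registers under an increasing sequence number and the lowest number holds the lock. It is a single-process toy: `ToyCoordinator` and its methods are hypothetical, and it does not talk to ZooKeeper or model session failure.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service handing out sequence numbers."""

    def __init__(self):
        self._next_seq = 0
        self._waiting = {}  # sequence number -> client id

    def join(self, client_id):
        """Register a client in the queue and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._waiting[seq] = client_id
        return seq

    def leave(self, seq):
        """Remove a client, e.g. when it finishes its work or disconnects."""
        self._waiting.pop(seq, None)

    def holder(self):
        """The client with the lowest sequence number holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]
```

Only the current holder runs the job; when it leaves, the next client in sequence order takes over, so at most one node is active at any time.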
  • 33. Now you have:
      • Bulk Processing with Hadoop
      • Large data queries with HBase/Cassandra
      • Coordination with ZooKeeper
      • Your old SQL database!
  • 34. Chances are, still need SQL for some stuff
      • If the data sizes are manageable, SQL is tried-and-true
  • 35. The People Problem
      • Big Data systems are complicated
          • Lots of moving parts
          • Lots of places where things can go wrong
      • Need good people!
  • 36. Try and Hire an expert directly...
      • Not that many out there
  • 37. Train 2 or 3 experts instead
      • Worth every penny
  • 38. Who should I hire?
      • Probably won't find direct experts
      • Look instead for people who:
          • are good with algorithms
          • are fast learners
          • are not risk-averse
  • 40. Thank You
      • email: scottfines@gmail.com
      • github: scottfines
      • linkedin: scottfines