Runaway complexity in Big Data
       And a plan to stop it

                           Nathan Marz
                           @nathanmarz
Agenda
• Common sources of complexity in data systems
• Design for a fundamentally better data system
What is a data system?

   A system that manages the storage and
   querying of data
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist, every hardware
   failure
What is a data system?

   A system that manages the storage and
   querying of data with a lifetime measured in
   years encompassing every version of the
   application to ever exist, every hardware
   failure, and every human mistake ever
   made
Common sources of complexity

         Lack of human fault-tolerance



         Conflation of data and queries



             Schemas done wrong
Lack of human fault-tolerance
Human fault-tolerance
• Bugs will be deployed to production over the lifetime of a data system
• Operational mistakes will be made
• Humans are part of the overall system, just like your hard disks, CPUs,
  memory, and software
• Must design for human error like you’d design for any other fault
Human fault-tolerance
Examples of human error
• Deploy a bug that increments counters by two instead of by one
• Accidentally delete data from database
• Accidental DOS on important internal service
The worst consequence is
data loss or data corruption
As long as an error doesn’t
 lose or corrupt good data,
you can fix what went wrong
Mutability
• The U and D in CRUD
• A mutable system updates the current state of the world
• Mutable systems inherently lack human fault-tolerance
• Easy to corrupt or lose data
Immutability
•   An immutable system captures a historical record of events
•   Each event happens at a particular time and is always true
Capturing change with mutable data
model
  Before                             After

  Person   Location                  Person   Location
  Sally    Philadelphia              Sally    New York
  Bob      Chicago                   Bob      Chicago


                   Sally moves to New York
Capturing change with immutable
data model
 Before                                     After

 Person   Location       Time               Person   Location       Time
 Sally    Philadelphia   1318358351         Sally    Philadelphia   1318358351
 Bob      Chicago        1327928370         Bob      Chicago        1327928370
                                            Sally    New York       1338469380


                         Sally moves to New York
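The two approaches above can be sketched in a few lines; a minimal illustration (all names hypothetical):

```python
# Mutable approach: an update destroys the old value.
locations = {"Sally": "Philadelphia"}
locations["Sally"] = "New York"  # Philadelphia is gone forever

# Immutable approach: each event is a timestamped fact that stays true.
events = [
    {"person": "Sally", "location": "Philadelphia", "time": 1318358351},
    {"person": "Bob", "location": "Chicago", "time": 1327928370},
]
# "Sally moves to New York" appends a new fact instead of overwriting.
events.append({"person": "Sally", "location": "New York", "time": 1338469380})

def current_location(events, person):
    # Current state is derived: the person's most recent location event.
    person_events = [e for e in events if e["person"] == person]
    return max(person_events, key=lambda e: e["time"])["location"]
```

Note that the full history stays queryable: Sally's Philadelphia record is still there.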
Immutability greatly restricts the
 range of errors that can cause
  data loss or data corruption
Vastly more human fault-tolerant
Immutability
Other benefits
•   Fundamentally simpler
•   CR instead of CRUD
•   Only write operation is appending new units of data
•   Easy to implement on top of a distributed filesystem
    •   File = list of data records
    •   Append = Add a new file into a directory
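A minimal sketch of that idea, using a local directory to stand in for the distributed filesystem (class and field names are hypothetical):

```python
import json
import os
import tempfile
import uuid

class AppendOnlyStore:
    """A dataset is a directory; a file is a list of records;
    the only write operation is adding a new file."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def append(self, records):
        # Append = add a new file into the directory; nothing is mutated.
        path = os.path.join(self.directory, uuid.uuid4().hex + ".json")
        with open(path, "w") as f:
            json.dump(records, f)

    def all_records(self):
        # All data = the union of records across every file.
        out = []
        for name in os.listdir(self.directory):
            with open(os.path.join(self.directory, name)) as f:
                out.extend(json.load(f))
        return out

store = AppendOnlyStore(os.path.join(tempfile.mkdtemp(), "pageviews"))
store.append([{"url": "/a", "time": 3600}])
store.append([{"url": "/b", "time": 3900}, {"url": "/c", "time": 7300}])
```

Because no file is ever rewritten, there is no code path that can destroy previously written data.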

                                                          Basing a system on mutability is
                                                          like pouring gasoline on your
                                                          house (but don’t worry, I checked
                                                          all the wires carefully to make sure
                                                          there won’t be any sparks). When
                                                          someone makes a mistake, who
                                                          knows what will burn.
Immutability




       Please watch Rich Hickey’s talks to learn more
       about the enormous benefits of immutability
Conflation of data and queries
Conflation of data and queries
Normalization vs. denormalization


   People                            Locations

   ID   Name     Location ID        Location ID   City        State   Population
   1    Sally    3                  1             New York    NY      8.2M
   2    George   1                  2             San Diego   CA      1.3M
   3    Bob      3                  3             Chicago     IL      2.7M


                            Normalized schema
Join is too expensive, so
      denormalize...
ID           Name        Location ID        City         State
1             Sally             3       Chicago             IL
2            George             1      New York             NY
3             Bob               3       Chicago             IL



     Location ID        City        State          Population
         1            New York         NY            8.2M
         2            San Diego        CA            1.3M
         3            Chicago          IL            2.7M


                   Denormalized schema
Obviously, you prefer all data to be
         fully normalized
But you are forced to denormalize
         for performance
Because the way data is modeled,
stored, and queried is complected
We will come back to how to build
 data systems in which these are
          disassociated
Schemas done wrong
Schemas have a bad rap
Schemas
•   Hard to change
•   Get in the way
•   Add development overhead
•   Require annoying configuration
I know! Use a
schemaless database!
This is an overreaction
Confuses the poor implementation
 of schemas with the value that
        schemas provide
What is a schema exactly?
function(data unit)
That says whether this data is valid
             or not
This is useful
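As a minimal illustration (the field names are hypothetical), a schema really can be just a function:

```python
def pageview_schema(data_unit):
    # Schema = function(data unit) -> says whether this data is valid or not.
    return (
        isinstance(data_unit, dict)
        and isinstance(data_unit.get("url"), str)
        and isinstance(data_unit.get("timestamp"), int)
    )
```

Run this at write time and invalid data never reaches the store.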
Value of schemas
•   Structural integrity
•   Guarantees on what can and can’t be stored
•   Prevents corruption
Otherwise you’ll detect corruption
      issues at read-time
Potentially long after the corruption
              happened
With little insight into the
circumstances of the corruption
Much better to get an exception
 where the mistake is made,
before it corrupts the database
Saves enormous amounts of time
Why are schemas considered
painful?
• Changing the schema is hard (e.g., adding a column to a table)
• Schema is overly restrictive (e.g., cannot do nested objects)
• Requires translation layers (e.g., ORM)
• Requires more typing (development overhead)
None of these are fundamentally
 linked with function(data unit)
These are problems in the
implementation of schemas, not in
      schemas themselves
Ideal schema tool
• Data  is represented as maps
• Schema tool is a library that helps construct the schema function:
  • Concisely specify required fields and types
  • Insert custom validation logic for fields (e.g. ages are between 0 and 200)
• Built-in support for evolving the schema over time
• Fast and space-efficient serialization/deserialization
• Cross-language                      This is easy to use and gets out of your way.

                                      I use Apache Thrift, but it lacks the custom validation logic.

                                      I think it could be done better with a Clojure-like data-as-maps approach.

                                      Given the parameters of a data system (long-lived, ever changing, with
                                      mistakes being made), the amount of work it takes to make a schema
                                      (not that much) is absolutely worth it.
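A tiny sketch of such a schema tool (not Thrift; purely a hypothetical illustration of the required-fields-plus-custom-validation idea on map-shaped data):

```python
def make_schema(required_fields, validators=()):
    """Build a schema function from declarative pieces:
    required fields with their types, plus custom validation logic."""
    def schema(data):
        if not isinstance(data, dict):
            return False
        for field, ftype in required_fields.items():
            if field not in data or not isinstance(data[field], ftype):
                return False
        # Custom rules run only after the structural checks pass.
        return all(check(data) for check in validators)
    return schema

person_schema = make_schema(
    {"name": str, "age": int},
    validators=[lambda d: 0 <= d["age"] <= 200],  # e.g. ages between 0 and 200
)
```

A real tool would also handle schema evolution and fast serialization; this only shows the validation half.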
Let’s get provocative
The relational database will
  be a footnote in history
Not because of SQL, restrictive
schemas, or scalability issues
But because of fundamental
flaws in the RDBMS approach
      to managing data
Mutability
Conflating the storage of data
   with how it is queried
Back in the day, these flaws were features, because space was at a
premium. The landscape has changed, and space is no longer the
constraint it once was. So mutability and the conflation of data and
queries are now major, glaring flaws,
because there are better ways to design data systems
“NewSQL” is misguided
Let’s use our ability to cheaply
store massive amounts of data
To do data right
And not inherit the complexities
          of the past
I know! Use a
NoSQL database!




                  if SQL’s wrong, and
                  NoSQL isn’t SQL, then
                  NoSQL must be right
NoSQL databases are generally
not a step in the right direction
Some aspects are, but not the
ones that get all the attention
Still based on mutability and
      not general purpose
Let’s see how you design
a data system that
doesn’t suffer from
these complexities




                           Let’s start from scratch
What does a data system do?
Retrieve data that you
 previously stored?
    Put        Get
Not really...
Counterexamples

          Store location information on people

      How many people live in a particular location?

                 Where does Sally live?

         What are the most populous locations?
Counterexamples

             Store pageview information

       How many pageviews on September 2nd?

         How many unique visitors over time?
Counterexamples

        Store transaction history for bank account

         How much money does George have?

     How much money do people spend on housing?
What does a data system do?


    Query = Function(All data)
Sometimes you retrieve what
       you stored
Oftentimes you do transformations,
        aggregations, etc.
Queries as pure functions that take
all data as input is the most general
              formulation
Example query


 Total number of pageviews to a
 URL over a range of time
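A minimal illustration of that formulation, using hypothetical pageview events:

```python
def total_pageviews(all_data, url, start, end):
    # Query = pure function of all data: scan every pageview event.
    return sum(
        1 for e in all_data
        if e["url"] == url and start <= e["time"] < end
    )

pageviews = [
    {"url": "/about", "time": 3600},
    {"url": "/about", "time": 3900},
    {"url": "/index", "time": 3950},
]
```

Conceptually simple, but as the next slide notes, running it literally over petabytes is far too slow.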
Example query




          Implementation
Too slow: “all data” is petabyte-scale
On-the-fly computation



          All       Query
         data
Precomputation


     All     Precomputed
                           Query
    data         view
Example query
 Pageview

 Pageview

 Pageview                                  2930
                                   Query
 Pageview

 Pageview
  All data
                Precomputed view
Precomputation


     All     Precomputed
                           Query
    data         view
Precomputation


     All              Precomputed
                                               Query
    data   Function
                          view
                                    Function
Data system


     All              Precomputed
                                               Query
    data   Function
                          view
                                    Function




               Two problems to solve
Data system


     All              Precomputed
                                               Query
    data   Function
                          view
                                    Function




               How to compute views
Data system


     All              Precomputed
                                               Query
    data   Function
                          view
                                    Function




      How to compute queries from views
Computing views


          All              Precomputed

         data   Function
                               view
Function that takes in
  all data as input
Batch processing
MapReduce
MapReduce is a framework for
computing arbitrary functions on
        arbitrary data
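The shape of that computation can be sketched in a few lines (a toy in-memory skeleton, not Hadoop; all names hypothetical):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal MapReduce skeleton: map each record to (key, value)
    pairs, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Batch view: pageview counts per (url, hour) bucket.
pageviews = [
    {"url": "/a", "time": 3600},
    {"url": "/a", "time": 3900},
    {"url": "/b", "time": 7300},
]
view = map_reduce(
    pageviews,
    mapper=lambda e: [((e["url"], e["time"] // 3600), 1)],
    reducer=lambda key, values: sum(values),
)
```

The real framework distributes the map and reduce phases across a cluster, but the mapper/reducer functions are the arbitrary part you write.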
Expressing those functions


                       Cascalog


                 Scalding
MapReduce precomputation

             MapReduce workflow
                                    Batch view #1
 All data

             MapReduce workflow
                                    Batch view #2
Batch view database

Need a database that...
• Is batch-writable from MapReduce
• Has fast random reads
• Examples: ElephantDB, Voldemort
Batch view database


  No random writes required!
Properties


              All               Batch

             data    Function
                                view




                                  ElephantDB is only a few
                                  thousand lines of code

                    Simple
Properties


              All               Batch

             data    Function
                                view




                    Scalable
Properties


              All              Batch

             data   Function
                               view




                Highly available
Properties


               All              Batch

              data   Function
                                view




  Can be heavily optimized (b/c no random writes)
Properties


              All                Batch

             data     Function
                                 view




                    Normalized
Properties


               All               Batch

              data    Function
                                 view




                 “Denormalized”

Not exactly denormalization, because you’re doing more than just
retrieving data that you stored (you can do aggregations).

You’re able to optimize data storage separately from data modeling,
without the complexity typical of denormalization in relational
databases.

This is because the batch view is a pure function of all data. That
makes it hard to get out of sync, and if there’s ever a problem (like
a bug in your code that computes the wrong batch view) you can
recompute.

It’s also easy to debug problems, since you have the input that
produced the batch view. This is not true in a mutable system based
on incremental updates.
So we’re done, right?
Not quite...
                                           Just a few hours
• A batch workflow is too slow              of data!
• Views are out of date


             Absorbed into batch views   Not absorbed


                                                   Now

                                Time
Properties


              All              Batch

             data   Function
                               view




             Eventually consistent
Properties


              All               Batch

             data    Function
                                view




      (without the associated complexities)
Properties


               All                Batch

              data     Function
                                  view




   (such as divergent values, vector clocks, etc.)
What’s left?


   Precompute views for last few
   hours of data
Realtime views
Realtime view #1



New data stream   Stream processor

                                        Realtime view #2



                                     NoSQL databases
Application queries

         Batch view


                        Merge
        Realtime view
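A minimal sketch of the merge (view contents are hypothetical):

```python
def query_pageviews(batch_view, realtime_view, url):
    # Application query: merge the batch view (all but the last few
    # hours) with the realtime view (only the recent, unabsorbed data).
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

batch_view = {"/a": 2930}     # computed from all historical data
realtime_view = {"/a": 12}    # only the hours not yet absorbed by batch
```

For a count, the merge is just addition; other queries need a merge function appropriate to the view's data structure.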
Precomputation


     All     Precomputed
                           Query
    data         view
Precomputation

               All             Precomputed
                                batch view
              data
                                               Query
                               Precomputed
                               realtime view
 New data stream


                   “Lambda Architecture”
Precomputation

               All   Precomputed
                      batch view
              data
                                           Query
                     Precomputed
                     realtime view
 New data stream


                      Most complex part of system
Precomputation

               All     Precomputed
                        batch view
              data
                                                Query
                       Precomputed
                       realtime view
 New data stream

                    This is where things like vector clocks have to be
                    dealt with if using an eventually consistent NoSQL
                    database. Random-write databases are much more complex.
Precomputation

               All     Precomputed
                        batch view
              data
                                               Query
                       Precomputed
                       realtime view
 New data stream


                     But only represents a few hours of data
Precomputation

                     All                Precomputed
                                         batch view
                    data
                                                                Query
                                        Precomputed
                                        realtime view
 New data stream

           Can continuously discard
           realtime views, keeping
           them small                 If anything goes wrong, auto-corrects
CAP
Realtime layer decides whether to guarantee C or A
• If it chooses consistency, queries are consistent
• If it chooses availability, queries are eventually consistent



                                       All the complexity of *dealing* with the CAP theorem
                                       (like read repair) is isolated in the realtime layer. If
                                       anything goes wrong, it’s *auto-corrected*

                                       CAP is now a choice, as it should be, rather than a
                                       complexity burden. Making a mistake w.r.t. eventual
                                       consistency *won’t corrupt* your data
Eventual accuracy


  Sometimes hard to compute
  exact answer in realtime
Eventual accuracy



     Example: unique count
Eventual accuracy


 Can compute exact answer in
 batch layer and approximate
 answer in realtime layer
                     Though for functions
                     which can be computed
                     exactly in the realtime
                     layer (e.g. counting), you
                     can achieve full accuracy
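One way to sketch this split for unique counts (the realtime side uses linear counting here purely as an illustration; a real system might use HyperLogLog instead):

```python
import hashlib
import math

def exact_uniques(visitors):
    # Batch layer: exact unique count over all data.
    return len(set(visitors))

def approx_uniques(visitors, m=1024):
    """Realtime layer: linear-counting sketch in fixed memory.
    Hash each visitor into one of m buckets and estimate the
    cardinality from the fraction of buckets left empty."""
    bits = [0] * m
    for v in visitors:
        h = int(hashlib.sha1(v.encode()).hexdigest(), 16) % m
        bits[h] = 1
    zeros = bits.count(0)
    if zeros == 0:
        return m  # sketch saturated; estimate is at least m
    return round(-m * math.log(zeros / m))
```

The sketch uses constant memory per view regardless of how many visitors stream through, which is what makes it feasible to maintain in real time.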
Eventual accuracy


   Best of both worlds of
   performance and accuracy
Tools
                                    ElephantDB, Voldemort

               All     Hadoop Precomputed
                                 batch view
              data
                                                Storm   Query
                                Precomputed
                                realtime view
 New data stream       Storm
                                 Cassandra, Riak, HBase
    Kafka
                   “Lambda Architecture”
Lambda Architecture
• Can   discard batch views and realtime views and recreate everything
 from scratch

• Mistakes   corrected via recomputation

• Data   storage layer optimized independently from query resolution layer

                                              What mistakes can be made?

                                                - Wrote bad data? Remove the data and recompute the views.

                                                - Bug in the functions that compute a view? Recompute the view.

                                                - Bug in a query function? Just deploy the fix.
Future
• Abstraction   over batch and realtime
• More data structure implementations for batch and realtime views
Learn more




        http://manning.com/marz
Questions?

More Related Content

What's hot

Rancangan pengajaran harian matematik minggu 35
Rancangan pengajaran harian matematik minggu 35Rancangan pengajaran harian matematik minggu 35
Rancangan pengajaran harian matematik minggu 35
afiqah Yamani
 
rancangan pengajaran harian poligon tingkatan 1
rancangan pengajaran harian poligon tingkatan 1rancangan pengajaran harian poligon tingkatan 1
rancangan pengajaran harian poligon tingkatan 1
Azima Rahim
 
Rph micro teaching ekonomi asas form5 -terkini
Rph micro teaching  ekonomi asas form5 -terkiniRph micro teaching  ekonomi asas form5 -terkini
Rph micro teaching ekonomi asas form5 -terkini
cik noorlyda
 
konsep penyelesaian masalah gorge polya
konsep penyelesaian masalah gorge polyakonsep penyelesaian masalah gorge polya
konsep penyelesaian masalah gorge polya
Nur Farhanie
 
Unit 5 konstruktivisme
Unit 5 konstruktivismeUnit 5 konstruktivisme
Unit 5 konstruktivisme
Aminah Rahmat
 
Rancangan pengajaran harian matematik tahun 5
Rancangan pengajaran harian matematik tahun 5Rancangan pengajaran harian matematik tahun 5
Rancangan pengajaran harian matematik tahun 5
Ismi Huda
 

What's hot (20)

Peranan guru membudayakan pembelajaran sepanjang hayat
Peranan guru membudayakan pembelajaran sepanjang hayatPeranan guru membudayakan pembelajaran sepanjang hayat
Peranan guru membudayakan pembelajaran sepanjang hayat
 
Where destructors meet threads
Where destructors meet threadsWhere destructors meet threads
Where destructors meet threads
 
Rancangan pengajaran harian matematik minggu 35
Rancangan pengajaran harian matematik minggu 35Rancangan pengajaran harian matematik minggu 35
Rancangan pengajaran harian matematik minggu 35
 
Pembelajaran Berasaskan Projek (PBL) Poligon Matematik Tingkatan 2
Pembelajaran Berasaskan Projek (PBL) Poligon Matematik Tingkatan 2 Pembelajaran Berasaskan Projek (PBL) Poligon Matematik Tingkatan 2
Pembelajaran Berasaskan Projek (PBL) Poligon Matematik Tingkatan 2
 
rancangan pengajaran harian poligon tingkatan 1
rancangan pengajaran harian poligon tingkatan 1rancangan pengajaran harian poligon tingkatan 1
rancangan pengajaran harian poligon tingkatan 1
 
AtCoder Beginner Contest 030 解説
AtCoder Beginner Contest 030 解説AtCoder Beginner Contest 030 解説
AtCoder Beginner Contest 030 解説
 
Nota RINGKAS MENGIKUT TOPIK MATEMATIK SEKOLAH RENDAH
Nota RINGKAS MENGIKUT TOPIK MATEMATIK SEKOLAH RENDAHNota RINGKAS MENGIKUT TOPIK MATEMATIK SEKOLAH RENDAH
Nota RINGKAS MENGIKUT TOPIK MATEMATIK SEKOLAH RENDAH
 
Rph micro teaching ekonomi asas form5 -terkini
Rph micro teaching  ekonomi asas form5 -terkiniRph micro teaching  ekonomi asas form5 -terkini
Rph micro teaching ekonomi asas form5 -terkini
 
CARTA FILA
CARTA FILACARTA FILA
CARTA FILA
 
听说教学
听说教学听说教学
听说教学
 
konsep penyelesaian masalah gorge polya
konsep penyelesaian masalah gorge polyakonsep penyelesaian masalah gorge polya
konsep penyelesaian masalah gorge polya
 
Contoh rph
Contoh rphContoh rph
Contoh rph
 
Konsep kelajuan (Tahun 6) KBSR/ PPSMI
Konsep kelajuan (Tahun 6) KBSR/ PPSMIKonsep kelajuan (Tahun 6) KBSR/ PPSMI
Konsep kelajuan (Tahun 6) KBSR/ PPSMI
 
Indeedなう 予選B 解説
Indeedなう 予選B 解説Indeedなう 予選B 解説
Indeedなう 予選B 解説
 
Unit 5 konstruktivisme
Unit 5 konstruktivismeUnit 5 konstruktivisme
Unit 5 konstruktivisme
 
Rustで楽しむ競技プログラミング
Rustで楽しむ競技プログラミングRustで楽しむ競技プログラミング
Rustで楽しむ競技プログラミング
 
Penggunaan Kajian Kes Sebagai Strategi Pengajaran
Penggunaan Kajian Kes Sebagai Strategi PengajaranPenggunaan Kajian Kes Sebagai Strategi Pengajaran
Penggunaan Kajian Kes Sebagai Strategi Pengajaran
 
AtCoder Beginner Contest 016 解説
AtCoder Beginner Contest 016 解説AtCoder Beginner Contest 016 解説
AtCoder Beginner Contest 016 解説
 
Rancangan pengajaran harian matematik tahun 5
Rancangan pengajaran harian matematik tahun 5Rancangan pengajaran harian matematik tahun 5
Rancangan pengajaran harian matematik tahun 5
 
Jambatan
JambatanJambatan
Jambatan
 

Similar to Runaway complexity in Big Data... and a plan to stop it

Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)
Ben Stopford
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-Concepts
Bhaskar Gunda
 

Similar to Runaway complexity in Big Data... and a plan to stop it (20)

Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
NoSQL and Couchbase
NoSQL and CouchbaseNoSQL and Couchbase
NoSQL and Couchbase
 
To SQL or NoSQL, that is the question
To SQL or NoSQL, that is the questionTo SQL or NoSQL, that is the question
To SQL or NoSQL, that is the question
 
Big iron 2 (published)
Big iron 2 (published)Big iron 2 (published)
Big iron 2 (published)
 
Domain oriented development
Domain oriented developmentDomain oriented development
Domain oriented development
 
Data Anonymization For Better Software Testing
Data Anonymization For Better Software TestingData Anonymization For Better Software Testing
Data Anonymization For Better Software Testing
 
Industrial Data Science
Industrial Data ScienceIndustrial Data Science
Industrial Data Science
 
CQRS
CQRSCQRS
CQRS
 
Datastream management system1
Datastream management system1Datastream management system1
Datastream management system1
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-Concepts
 
A peek into the future
A peek into the futureA peek into the future
A peek into the future
 
Protecting privacy with fuzzy-feeling test data
Protecting privacy with fuzzy-feeling test dataProtecting privacy with fuzzy-feeling test data
Protecting privacy with fuzzy-feeling test data
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
 
unba.se - ACM CSCW 2017 - IWCES15
unba.se - ACM CSCW 2017 - IWCES15unba.se - ACM CSCW 2017 - IWCES15
unba.se - ACM CSCW 2017 - IWCES15
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12
 
Database Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastDatabase Revolution - Exploratory Webcast
Database Revolution - Exploratory Webcast
 
Stig: Social Graphs & Discovery at Scale
Stig: Social Graphs & Discovery at ScaleStig: Social Graphs & Discovery at Scale
Stig: Social Graphs & Discovery at Scale
 

More from nathanmarz

Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
nathanmarz
 

More from nathanmarz (17)

Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineering
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrong
 
Storm
StormStorm
Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
ElephantDB
ElephantDBElephantDB
ElephantDB
 
Become Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypeBecome Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackType
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systems
 
Clojure at BackType
Clojure at BackTypeClojure at BackType
Clojure at BackType
 
Cascalog workshop
Cascalog workshopCascalog workshop
Cascalog workshop
 
Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
 
Cascalog at Hadoop Day
Cascalog at Hadoop DayCascalog at Hadoop Day
Cascalog at Hadoop Day
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Cascalog
CascalogCascalog
Cascalog
 
Cascading
CascadingCascading
Cascading
 

Recently uploaded

Recently uploaded (20)

Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Syngulon - Selection technology May 2024.pdf

Runaway complexity in Big Data... and a plan to stop it

  • 1. Runaway complexity in Big Data, and a plan to stop it. Nathan Marz (@nathanmarz)
  • 2. Agenda • Common sources of complexity in data systems • Design for a fundamentally better data system
  • 3. What is a data system? A system that manages the storage and querying of data
  • 4. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years
  • 5. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist
  • 6. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure
  • 7. What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure, and every human mistake ever made
  • 8. Common sources of complexity • Lack of human fault-tolerance • Conflation of data and queries • Schemas done wrong
  • 9. Lack of human fault-tolerance
  • 10. Human fault-tolerance • Bugs will be deployed to production over the lifetime of a data system • Operational mistakes will be made • Humans are part of the overall system, just like your hard disks, CPUs, memory, and software • Must design for human error like you’d design for any other fault
  • 11. Human fault-tolerance Examples of human error • Deploy a bug that increments counters by two instead of by one • Accidentally delete data from database • Accidental DOS on important internal service
  • 12. The worst consequence is data loss or data corruption
  • 13. As long as an error doesn’t lose or corrupt good data, you can fix what went wrong
  • 14. Mutability • The U and D in CRUD • A mutable system updates the current state of the world • Mutable systems inherently lack human fault-tolerance • Easy to corrupt or lose data
  • 15. Immutability • An immutable system captures a historical record of events • Each event happens at a particular time and is always true
  • 16. Capturing change with a mutable data model. Sally moves to New York, so her row is updated in place:
        Before: (Sally, Philadelphia), (Bob, Chicago)
        After:  (Sally, New York), (Bob, Chicago)
  • 17. Capturing change with an immutable data model. Sally moves to New York, so a new record is appended:
        Before: (Sally, Philadelphia, 1318358351), (Bob, Chicago, 1327928370)
        After:  (Sally, Philadelphia, 1318358351), (Bob, Chicago, 1327928370), (Sally, New York, 1338469380)
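The immutable model above can be sketched in a few lines of Python (the function and record names here are illustrative, not from the deck): facts are only ever appended, and the "current state" is derived by a query function over all facts.

```python
# Immutable data model sketch: facts are appended, never updated or deleted.
facts = []  # each fact: (person, location, timestamp)

def record(person, location, timestamp):
    facts.append((person, location, timestamp))

def current_location(person):
    # Current state is a pure function of all facts: the newest fact
    # for a person wins, and older facts remain as history.
    history = [f for f in facts if f[0] == person]
    return max(history, key=lambda f: f[2])[1] if history else None

record("Sally", "Philadelphia", 1318358351)
record("Bob", "Chicago", 1327928370)
record("Sally", "New York", 1338469380)  # Sally moves; nothing is overwritten
```

A buggy write can add a wrong fact, but it cannot destroy the earlier ones, which is the human fault-tolerance argument in miniature.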
  • 18. Immutability greatly restricts the range of errors that can cause data loss or data corruption
  • 19. Vastly more human fault-tolerant
  • 20. Immutability: other benefits • Fundamentally simpler • CR instead of CRUD • Only write operation is appending new units of data • Easy to implement on top of a distributed filesystem • File = list of data records • Append = add a new file into a directory. Speaker notes: Basing a system on mutability is like pouring gasoline on your house (but don’t worry, I checked all the wires carefully to make sure there won’t be any sparks). When someone makes a mistake, who knows what will burn.
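The "file = list of records, append = new file in a directory" idea can be sketched like this (paths and the JSON-lines format are illustrative, not any particular system's layout):

```python
import json
import os
import tempfile
import uuid

def append_records(data_dir, records):
    # Append = write a brand-new file into the directory; existing files
    # are never modified, so a write cannot corrupt earlier data.
    path = os.path.join(data_dir, uuid.uuid4().hex + ".json")
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def read_all(data_dir):
    # The dataset is simply the union of records across all files.
    out = []
    for name in os.listdir(data_dir):
        with open(os.path.join(data_dir, name)) as f:
            out.extend(json.loads(line) for line in f)
    return out
```

On a distributed filesystem like HDFS the same shape applies: a directory of immutable files is the master dataset.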
  • 21. Immutability Please watch Rich Hickey’s talks to learn more about the enormous benefits of immutability
  • 22. Conflation of data and queries
  • 23. Conflation of data and queries: normalization vs. denormalization. Normalized schema:
        People (ID, Name, Location ID): (1, Sally, 3), (2, George, 1), (3, Bob, 3)
        Locations (Location ID, City, State, Population): (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M)
  • 24. Join is too expensive, so denormalize...
  • 25. Denormalized schema:
        People (ID, Name, Location ID, City, State): (1, Sally, 3, Chicago, IL), (2, George, 1, New York, NY), (3, Bob, 3, Chicago, IL)
        Locations (Location ID, City, State, Population): (1, New York, NY, 8.2M), (2, San Diego, CA, 1.3M), (3, Chicago, IL, 2.7M)
  • 26. Obviously, you prefer all data to be fully normalized
  • 27. But you are forced to denormalize for performance
  • 28. Because the way data is modeled, stored, and queried is complected
  • 29. We will come back to how to build data systems in which these are disassociated
  • 31. Schemas have a bad rap
  • 32. Schemas • Hard to change • Get in the way • Add development overhead • Requires annoying configuration
  • 33. I know! Use a schemaless database!
  • 34. This is an overreaction
  • 35. Confuses the poor implementation of schemas with the value that schemas provide
  • 36. What is a schema exactly?
  • 38. That says whether this data is valid or not
  • 40. Value of schemas • Structural integrity • Guarantees on what can and can’t be stored • Prevents corruption
  • 41. Otherwise you’ll detect corruption issues at read-time
  • 42. Potentially long after the corruption happened
  • 43. With little insight into the circumstances of the corruption
  • 44. Much better to get an exception where the mistake is made, before it corrupts the database
  • 46. Why are schemas considered painful? • Changing the schema is hard (e.g., adding a column to a table) • Schema is overly restrictive (e.g., cannot do nested objects) • Requires translation layers (e.g. ORM) • Requires more typing (development overhead)
  • 47. None of these pains are fundamentally linked with the idea of a schema as a function(data unit)
  • 48. These are problems in the implementation of schemas, not in schemas themselves
  • 49. Ideal schema tool • Data is represented as maps • Schema tool is a library that helps construct the schema function: • Concisely specify required fields and types • Insert custom validation logic for fields (e.g. ages are between 0 and 200) • Built-in support for evolving the schema over time • Fast and space-efficient serialization/deserialization • Cross-language. Speaker notes: This is easy to use and gets out of your way. I use Apache Thrift, but it lacks the custom validation logic; I think it could be done better with a Clojure-like data-as-maps approach. Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it.
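A minimal sketch of such a tool, with data as plain maps and the schema as a function over a data unit. Everything here (the `make_schema` helper, the field names) is hypothetical, not an existing library:

```python
def make_schema(required, validators=None):
    """Build a schema function: data unit (a map) in, valid/invalid out."""
    validators = validators or {}
    def schema(unit):
        # Structural integrity: required fields must exist with the right type.
        for field, ftype in required.items():
            if field not in unit or not isinstance(unit[field], ftype):
                return False
        # Custom validation logic per field, e.g. age ranges.
        return all(check(unit[f]) for f, check in validators.items() if f in unit)
    return schema

person_schema = make_schema(
    required={"name": str, "age": int},
    validators={"age": lambda a: 0 <= a <= 200},
)
```

Rejecting a bad unit at write time is exactly the "exception where the mistake is made" behavior the earlier slides argue for.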
  • 51. The relational database will be a footnote in history
  • 52. Not because of SQL, restrictive schemas, or scalability issues
  • 53. But because of fundamental flaws in the RDBMS approach to managing data
  • 55. Conflating the storage of data with how it is queried. Speaker notes: Back in the day, these flaws were features, because space was at a premium. The landscape has changed, and this is no longer the constraint it once was. So these properties of mutability and of conflating data and queries are now major, glaring flaws, because there are better ways to design data systems.
  • 57. Let’s use our ability to cheaply store massive amounts of data
  • 58. To do data right
  • 59. And not inherit the complexities of the past
  • 60. I know! Use a NoSQL database! Speaker notes: If SQL is wrong, and NoSQL isn’t SQL, then NoSQL must be right.
  • 61. NoSQL databases are generally not a step in the right direction
  • 62. Some aspects are, but not the ones that get all the attention
  • 63. Still based on mutability and not general purpose
  • 64. Let’s see how you design a data system that doesn’t suffer from these complexities Let’s start from scratch
  • 65. What does a data system do?
  • 66. Retrieve data that you previously stored? (Put / Get)
  • 68. Counterexamples: store location information on people. Queries: How many people live in a particular location? Where does Sally live? What are the most populous locations?
  • 69. Counterexamples: store pageview information. Queries: How many pageviews on September 2nd? How many unique visitors over time?
  • 70. Counterexamples: store transaction history for a bank account. Queries: How much money does George have? How much money do people spend on housing?
  • 71. What does a data system do? Query = Function(All data)
  • 72. Sometimes you retrieve what you stored
  • 73. Oftentimes you do transformations, aggregations, etc.
  • 74. Queries as pure functions that take all data as input is the most general formulation
  • 75. Example query Total number of pageviews to a URL over a range of time
  • 76. Example query Implementation
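The implementation can be sketched as a literal pure function over all the data; the record shape here is hypothetical, and the point is the formulation, not efficiency:

```python
def total_pageviews(all_data, url, start, end):
    # Query = Function(all data): scan every pageview record and count
    # the ones matching the URL within the time range [start, end).
    return sum(1 for r in all_data
               if r["url"] == url and start <= r["time"] < end)

all_data = [
    {"url": "/a", "time": 100},
    {"url": "/a", "time": 150},
    {"url": "/b", "time": 120},
]
```

As the next slide says, running this scan on demand is far too slow at petabyte scale, which is what motivates precomputation.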
  • 77. Too slow: “all data” is petabyte-scale
  • 78. On-the-fly computation (diagram): query runs as a function directly over all data
  • 79. Precomputation (diagram): all data → precomputed view → query
  • 80. Example query (diagram): all the pageview records are condensed into a precomputed view holding the count (e.g. 2930), which the query reads
  • 81. Precomputation (diagram): all data → precomputed view → query
  • 82. Precomputation (diagram): all data → function → precomputed view → function → query
  • 83. Data system (diagram as above): two problems to solve
  • 84. Data system: how to compute views
  • 85. Data system: how to compute queries from views
  • 86. Computing views (diagram): all data → function → precomputed view
  • 87. Function that takes in all data as input
  • 90. MapReduce is a framework for computing arbitrary functions on arbitrary data
  • 91. Expressing those functions Cascalog Scalding
  • 92. MapReduce precomputation (diagram): all data feeds one MapReduce workflow producing batch view #1, and another MapReduce workflow producing batch view #2
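The deck runs this on Hadoop with Cascalog or Scalding; as a rough illustration of the map/reduce shape only, here is the pageviews-per-URL-per-hour rollup in plain Python:

```python
from collections import defaultdict

def map_phase(records):
    # Emit ((url, hour), 1) for every pageview record.
    for r in records:
        yield (r["url"], r["time"] // 3600), 1

def reduce_phase(pairs):
    # Sum the counts per (url, hour) key to form the batch view.
    view = defaultdict(int)
    for key, n in pairs:
        view[key] += n
    return dict(view)

def compute_batch_view(all_data):
    # The batch view is a pure function of all data: rerunning it
    # from scratch always reproduces (or corrects) the view.
    return reduce_phase(map_phase(all_data))
```

Because the view is recomputed from the immutable master dataset, a bug in this function is fixed by fixing the code and rerunning, not by repairing state.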
  • 93. Batch view database Need a database that... • Is batch-writable from MapReduce • Has fast random reads • Examples: ElephantDB, Voldemort
  • 94. Batch view database No random writes required!
  • 95. Properties (all data → function → batch view): Simple. ElephantDB is only a few thousand lines of code
  • 96. Properties: Scalable
  • 97. Properties: Highly available
  • 98. Properties: Can be heavily optimized (because there are no random writes)
  • 99. Properties: Normalized
  • 100. Properties: “Denormalized”. Speaker notes: Not exactly denormalization, because you’re doing more than just retrieving data that you stored (you can do aggregations). You’re able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases. This is because the batch view is a pure function of all data: it is hard to get out of sync, and if there is ever a problem (like a bug in your code that computes the wrong batch view) you can recompute. It is also easy to debug problems, since you have the input that produced the batch view; this is not true in a mutable system based on incremental updates.
  • 101. So we’re done, right?
  • 102. Not quite... • A batch workflow is too slow • Views are out of date. Diagram (timeline): older data is absorbed into batch views; just a few hours of data is not yet absorbed as of now
  • 103. Properties (all data → function → batch view): Eventually consistent
  • 104. Properties: (without the associated complexities)
  • 105. Properties: (such as divergent values, vector clocks, etc.)
  • 106. What’s left? Precompute views for last few hours of data
  • 108. (diagram): new data stream → stream processor → realtime view #1 and realtime view #2 (NoSQL databases)
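A toy stream processor for the realtime side (the deck uses Storm; this is only a sketch of incrementally updating a realtime view from a stream, with hypothetical method names):

```python
from collections import defaultdict

class RealtimeView:
    # Incrementally updated view covering only the last few hours of data.
    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, record):
        # Called once per record arriving on the new-data stream.
        self.counts[(record["url"], record["time"] // 3600)] += 1

    def get(self, url, hour):
        return self.counts[(url, hour)]

    def discard_before(self, hour):
        # Realtime state can be continuously discarded once the batch
        # layer has absorbed the corresponding data, keeping it small.
        self.counts = defaultdict(int, {k: v for k, v in self.counts.items()
                                        if k[1] >= hour})
```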
  • 109. (diagram): application queries merge the batch view and the realtime view
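Resolving a query is then a merge over both views; a sketch, assuming each view maps (url, hour) to a count:

```python
def query_pageviews(batch_view, realtime_view, url, hours):
    # Merge at query time: the batch view covers all but the last few
    # hours; the realtime view covers what batch has not yet absorbed.
    total = 0
    for h in hours:
        total += batch_view.get((url, h), 0)
        total += realtime_view.get((url, h), 0)
    return total
```

For this to be correct, the two views must partition the data (each record counted in exactly one of them), which the discard-after-absorption scheme above provides.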
  • 110. Precomputation (diagram): all data → precomputed view → query
  • 111. Precomputation (diagram): all data → precomputed batch view; new data stream → precomputed realtime view; query merges both. The “Lambda Architecture”
  • 112. The realtime layer is the most complex part of the system
  • 113. Random-write databases are much more complex; this is where things like vector clocks have to be dealt with if using an eventually consistent NoSQL database
  • 114. But the realtime layer only represents a few hours of data
  • 115. Can continuously discard realtime views, keeping them small. If anything goes wrong, the system auto-corrects
  • 116. CAP: the realtime layer decides whether to guarantee C or A • If it chooses consistency, queries are consistent • If it chooses availability, queries are eventually consistent. Speaker notes: All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer. If anything goes wrong, it is *auto-corrected*. CAP is now a choice, as it should be, rather than a complexity burden. Making a mistake w.r.t. eventual consistency *won’t corrupt* your data.
  • 117. Eventual accuracy Sometimes hard to compute exact answer in realtime
  • 118. Eventual accuracy Example: unique count
  • 119. Eventual accuracy Can compute exact answer in batch layer and approximate answer in realtime layer Though for functions which can be computed exactly in the realtime layer (e.g. counting), you can achieve full accuracy
  • 120. Eventual accuracy Best of both worlds of performance and accuracy
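For the unique-count example, one way to get an approximate realtime answer in bounded memory is a linear-counting sketch; this is an illustrative choice (HyperLogLog is the more common one), and the batch layer would still compute the exact answer from all data:

```python
import hashlib
import math

class LinearCounter:
    # Approximate distinct counter: hash each item into a bitmap of m
    # bits; the estimate is -m * ln(fraction of bits still zero).
    def __init__(self, m=8192):
        self.m = m
        self.bits = bytearray(m)

    def add(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) % self.m
        self.bits[h] = 1

    def estimate(self):
        zeros = self.m - sum(self.bits)
        if zeros == 0:
            return float(self.m)  # bitmap saturated; estimate is a floor
        return -self.m * math.log(zeros / self.m)
```

Duplicates hash to the same bit, so re-adding an item never inflates the estimate; memory is fixed at m bits regardless of stream size.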
  • 121. Tools for the “Lambda Architecture” (diagram): Hadoop computes the precomputed batch views from all data; ElephantDB and Voldemort serve the batch views; Kafka carries the new data stream; Storm computes the precomputed realtime views; Cassandra, Riak, and HBase store the realtime views; queries merge the two
  • 122. Lambda Architecture • Can discard batch views and realtime views and recreate everything from scratch • Mistakes corrected via recomputation • Data storage layer optimized independently from query resolution layer. Speaker notes: What mistakes can be made? Write bad data? Remove the data and recompute the views. Bug in the functions that compute a view? Recompute the view. Bug in a query function? Just deploy the fix.
  • 123. Future • Abstraction over batch and realtime • More data structure implementations for batch and realtime views
  • 124. Learn more http://manning.com/marz