Runaway complexity in Big Data... and a plan to stop it
  • Nice one. In my opinion the architecture will keep on evolving. We have to be really careful about what real-time querying and processing would mean: is it microseconds, milliseconds, or so? If an answer within seconds is acceptable, then batch processing would do and Lambda would be overkill.
  • Hi, I would not use the concepts of a mutable (slide 16) and an immutable data model (slide 17), because in both cases the state of the system has changed as soon as Sally has moved to New York.

    What is even worse: if you think of a system that interacts with the outside world, as any system does to a certain degree (e.g. a self-driving car from Google), any error state in the system may create events that change the future of the world forever, e.g. a car accident that kills people.
  • @LadislavUrban Excellent paper! Really impressive applications.
  • I like your Big Data presentation.
    I would like to share with you document about application of Big Data and Data Science in retail banking. http://www.slideshare.net/LadislavUrban/syoncloud-big-data-for-retail-banking-syoncloud

Presentation Transcript

  • Runaway complexity in Big Data And a plan to stop it Nathan Marz @nathanmarz
  • Agenda • Common sources of complexity in data systems • Design for a fundamentally better data system
  • What is a data system? A system that manages the storage and querying of data
  • What is a data system? A system that manages the storage and querying of data with a lifetime measured in years
  • What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist
  • What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure
  • What is a data system? A system that manages the storage and querying of data with a lifetime measured in years encompassing every version of the application to ever exist, every hardware failure, and every human mistake ever made
  • Common sources of complexity: lack of human fault-tolerance, conflation of data and queries, schemas done wrong
  • Lack of human fault-tolerance
  • Human fault-tolerance • Bugs will be deployed to production over the lifetime of a data system • Operational mistakes will be made • Humans are part of the overall system, just like your hard disks, CPUs, memory, and software • Must design for human error like you’d design for any other fault
  • Human fault-tolerance: examples of human error • Deploy a bug that increments counters by two instead of by one • Accidentally delete data from a database • Accidental DoS on an important internal service
  • The worst consequence is data loss or data corruption
  • As long as an error doesn’t lose or corrupt good data, you can fix what went wrong
  • Mutability • The U and D in CRUD • A mutable system updates the current state of the world • Mutable systems inherently lack human fault-tolerance • Easy to corrupt or lose data
  • Immutability • An immutable system captures a historical record of events • Each event happens at a particular time and is always true
  • Capturing change with a mutable data model (“Sally moves to New York” overwrites her row):

        Before:                    After:
        Person  Location           Person  Location
        Sally   Philadelphia       Sally   New York
        Bob     Chicago            Bob     Chicago

  • Capturing change with an immutable data model (“Sally moves to New York” appends a new timestamped record):

        Before:                                After:
        Person  Location      Time             Person  Location      Time
        Sally   Philadelphia  1318358351       Sally   Philadelphia  1318358351
        Bob     Chicago       1327928370       Bob     Chicago       1327928370
                                               Sally   New York      1338469380
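The two models above can be sketched in a few lines of code. This is a minimal, illustrative Python sketch (the function and variable names are made up for this example, not from any library): the mutable model overwrites Sally's row and destroys history, while the immutable model only appends a new timestamped fact and derives current state from the record.

```python
# Mutable model: an update destroys the previous fact.
mutable = {"Sally": "Philadelphia", "Bob": "Chicago"}
mutable["Sally"] = "New York"  # old location is gone forever

# Immutable model: each fact is a (person, location, timestamp) event.
immutable = [
    ("Sally", "Philadelphia", 1318358351),
    ("Bob", "Chicago", 1327928370),
]
immutable.append(("Sally", "New York", 1338469380))  # history preserved

def current_location(events, person):
    """Current state is derived from the record: the person's latest event."""
    facts = [e for e in events if e[0] == person]
    return max(facts, key=lambda e: e[2])[1]
```

Here `current_location(immutable, "Sally")` yields "New York", but unlike the mutable version, the Philadelphia fact is still in the record if anything ever needs to be recomputed.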
  • Immutability greatly restricts the range of errors that can cause data loss or data corruption
  • Vastly more human fault-tolerant
  • Immutability: other benefits • Fundamentally simpler • CR instead of CRUD • Only write operation is appending new units of data • Easy to implement on top of a distributed filesystem • File = list of data records • Append = add a new file into a directory. (Speaker note: basing a system on mutability is like pouring gasoline on your house — but don’t worry, I checked all the wires carefully to make sure there won’t be any sparks. When someone makes a mistake, who knows what will burn.)
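The "file = list of data records, append = add a new file into a directory" idea can be sketched with the local filesystem standing in for a distributed one. This is a hypothetical toy, not a real library; the class and method names are invented for illustration:

```python
import json
import os
import tempfile
import uuid

class AppendOnlyStore:
    """Append-only record store: every append writes a new immutable file
    into a directory; reading all data = concatenating all files."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def append(self, records):
        # The only write operation: add a new file. Never modify or delete.
        path = os.path.join(self.root, uuid.uuid4().hex + ".json")
        with open(path, "w") as f:
            json.dump(records, f)

    def all_records(self):
        out = []
        for name in sorted(os.listdir(self.root)):
            with open(os.path.join(self.root, name)) as f:
                out.extend(json.load(f))
        return out

# Usage sketch:
store = AppendOnlyStore(tempfile.mkdtemp())
store.append([{"person": "Sally", "location": "Philadelphia"}])
store.append([{"person": "Sally", "location": "New York"}])
```

Because there is no update or delete path, the range of mistakes that can corrupt good data is tiny: a buggy deploy can at worst write bad new files, which can be removed and recomputed.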
  • Immutability Please watch Rich Hickey’s talks to learn more about the enormous benefits of immutability
  • Conflation of data and queries
  • Conflation of data and queries: normalization vs. denormalization. Normalized schema:

        ID  Name    Location ID        Location ID  City       State  Population
        1   Sally   3                  1            New York   NY     8.2M
        2   George  1                  2            San Diego  CA     1.3M
        3   Bob     3                  3            Chicago    IL     2.7M
  • Join is too expensive, so denormalize...
  • Denormalized schema:

        ID  Name    Location ID  City      State
        1   Sally   3            Chicago   IL
        2   George  1            New York  NY
        3   Bob     3            Chicago   IL

        Location ID  City       State  Population
        1            New York   NY     8.2M
        2            San Diego  CA     1.3M
        3            Chicago    IL     2.7M
  • Obviously, you prefer all data to be fully normalized
  • But you are forced to denormalize for performance
  • Because the way data is modeled, stored, and queried is complected
  • We will come back to how to build data systems in which these are disassociated
  • Schemas done wrong
  • Schemas have a bad rap
  • Schemas • Hard to change • Get in the way • Add development overhead • Require annoying configuration
  • I know! Use a schemaless database!
  • This is an overreaction
  • Confuses the poor implementation of schemas with the value that schemas provide
  • What is a schema exactly?
  • function(data unit)
  • That says whether this data is valid or not
  • This is useful
  • Value of schemas • Structural integrity • Guarantees on what can and can’t be stored • Prevents corruption
  • Otherwise you’ll detect corruption issues at read-time
  • Potentially long after the corruption happened
  • With little insight into the circumstances of the corruption
  • Much better to get an exception where the mistake is made, before it corrupts the database
  • Saves enormous amounts of time
  • Why are schemas considered painful? • Changing the schema is hard (e.g., adding a column to a table) • Schema is overly restrictive (e.g., cannot do nested objects) • Requires translation layers (e.g. ORM) • Requires more typing (development overhead)
  • None of these are fundamentally linked with function(data unit)
  • These are problems in the implementation of schemas, not in schemas themselves
  • Ideal schema tool • Data is represented as maps • Schema tool is a library that helps construct the schema function: • Concisely specify required fields and types • Insert custom validation logic for fields (e.g. ages are between 0 and 200) • Built-in support for evolving the schema over time • Fast and space-efficient serialization/deserialization • Cross-language. (Speaker note: this is easy to use and gets out of your way. I use Apache Thrift, but it lacks the custom validation logic; I think it could be done better with a Clojure-like data-as-maps approach. Given the parameters of a data system (long-lived, ever changing, with mistakes being made), the amount of work it takes to make a schema (not that much) is absolutely worth it.)
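The "schema = function(data unit)" idea from the preceding slides can be made concrete with a toy sketch. This is not Thrift and not any real schema library; it is an invented, minimal Python illustration of required fields, type checks, and the custom validation rule the slide mentions (ages between 0 and 200):

```python
def make_schema(required, validators=()):
    """Build a schema function over map-shaped data (plain dicts).

    `required` maps field name -> expected type; `validators` are extra
    predicates for custom rules. The result is exactly function(data unit):
    it returns True when the data unit is valid, False otherwise."""
    def schema(data):
        for field, ftype in required.items():
            if field not in data or not isinstance(data[field], ftype):
                return False
        return all(check(data) for check in validators)
    return schema

# Illustrative schema for person records:
person_schema = make_schema(
    {"name": str, "age": int},
    validators=[lambda d: 0 <= d["age"] <= 200],  # the slide's custom rule
)
```

The point is that invalid data raises a flag at write time (`person_schema(record)` is False), instead of surfacing as silent corruption long after the fact.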
  • Let’s get provocative
  • The relational database will be a footnote in history
  • Not because of SQL, restrictive schemas, or scalability issues
  • But because of fundamental flaws in the RDBMS approach to managing data
  • Mutability
  • Conflating the storage of data with how it is queried. (Speaker note: back in the day, these flaws were features, because space was at a premium. The landscape has changed, and this is no longer the constraint it once was. So these properties of mutability and of conflating data and queries are now major, glaring flaws, because there are better ways to design data systems.)
  • “NewSQL” is misguided
  • Let’s use our ability to cheaplystore massive amounts of data
  • To do data right
  • And not inherit the complexities of the past
  • I know! Use a NoSQL database! (If SQL’s wrong, and NoSQL isn’t SQL, then NoSQL must be right.)
  • NoSQL databases are generally not a step in the right direction
  • Some aspects are, but not theones that get all the attention
  • Still based on mutability and not general purpose
  • Let’s see how you design a data system that doesn’t suffer from these complexities. Let’s start from scratch
  • What does a data system do?
  • Retrieve data that you previously stored? (Put / Get)
  • Not really...
  • Counterexamples: store location information on people. How many people live in a particular location? Where does Sally live? What are the most populous locations?
  • Counterexamples: store pageview information. How many pageviews on September 2nd? How many unique visitors over time?
  • Counterexamples: store transaction history for a bank account. How much money does George have? How much money do people spend on housing?
  • What does a data system do? Query = Function(All data)
  • Sometimes you retrieve what you stored
  • Oftentimes you do transformations, aggregations, etc.
  • Queries as pure functions that takeall data as input is the most general formulation
  • Example query Total number of pageviews to a URL over a range of time
  • Example query Implementation
  • Too slow: “all data” is petabyte-scale
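Taken literally, Query = Function(All data) means scanning every record on every query, which is what makes the on-the-fly approach too slow. A sketch of the pageview example in Python (the record fields here are made up for illustration):

```python
def total_pageviews(all_data, url, start, end):
    """Query as a pure function of ALL the data: scan every pageview
    record and count matches. Fine for a toy list in memory; hopeless
    when 'all data' is petabyte-scale."""
    return sum(
        1 for pv in all_data
        if pv["url"] == url and start <= pv["time"] < end
    )

# Toy dataset of pageview events:
pageviews = [
    {"url": "/a", "time": 100},
    {"url": "/a", "time": 250},
    {"url": "/b", "time": 300},
]
```

The formulation is maximally general (any aggregation is expressible as a function over all records); the rest of the talk is about recovering performance via precomputed views without giving that generality up.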
  • On-the-fly computation: All data → Query
  • Precomputation: All data → Precomputed view → Query
  • Example query: all data (Pageview, Pageview, Pageview, Pageview, Pageview) → Precomputed view → Query → 2930
  • Precomputation: All data → Precomputed view → Query
  • Precomputation: All data → Function → Precomputed view → Function → Query
  • Data system: All data → Function → Precomputed view → Function → Query. Two problems to solve
  • Data system: All data → Function → Precomputed view → Function → Query. Problem 1: how to compute views
  • Data system: All data → Function → Precomputed view → Function → Query. Problem 2: how to compute queries from views
  • Computing views: All data → Function → Precomputed view
  • Function that takes in all data as input
  • Batch processing
  • MapReduce
  • MapReduce is a framework for computing arbitrary functions on arbitrary data
  • Expressing those functions Cascalog Scalding
  • MapReduce precomputation: All data → MapReduce workflow → Batch view #1; All data → MapReduce workflow → Batch view #2
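The batch workflow can be sketched as a word-count-style MapReduce over pageview records, producing a batch view keyed by (url, hour bucket). Plain Python stands in here for Hadoop (or Cascalog/Scalding on top of it); the function names and record fields are illustrative only:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit ((url, hour_bucket), 1) for every pageview record.
    for pv in records:
        yield (pv["url"], pv["time"] // 3600), 1

def reduce_phase(pairs):
    # Reducer: sum the counts per key -> the precomputed batch view.
    view = defaultdict(int)
    for key, count in pairs:
        view[key] += count
    return dict(view)

# Recompute the whole batch view from all data:
batch_view = reduce_phase(map_phase([
    {"url": "/a", "time": 10},
    {"url": "/a", "time": 20},
    {"url": "/a", "time": 7200},
]))
```

Because the view is a pure function of all the data, it can always be thrown away and recomputed, which is exactly the human fault-tolerance property argued for earlier.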
  • Batch view database: need a database that... • Is batch-writable from MapReduce • Has fast random reads • Examples: ElephantDB, Voldemort
  • Batch view database No random writes required!
  • Properties (All data → Function → Batch view): Simple. ElephantDB is only a few thousand lines of code
  • Properties (All data → Function → Batch view): Scalable
  • Properties (All data → Function → Batch view): Highly available
  • Properties (All data → Function → Batch view): Can be heavily optimized (because no random writes)
  • Properties (All data → Function → Batch view): Normalized
  • Properties (All data → Function → Batch view): “Denormalized”. (Speaker note: not exactly denormalization, because you’re doing more than just retrieving data that you stored (you can do aggregations). You’re able to optimize data storage separately from data modeling, without the complexity typical of denormalization in relational databases. This is because the batch view is a pure function of all data, so it’s hard for it to get out of sync, and if there’s ever a problem (like a bug in your code that computes the wrong batch view) you can recompute. It’s also easy to debug problems, since you have the input that produced the batch view; this is not true in a mutable system based on incremental updates.)
  • So we’re done, right?
  • Not quite... • A batch workflow is too slow • Views are out of date. (Timeline: everything up to a few hours ago is absorbed into batch views; just the last few hours of data are not yet absorbed.)
  • Properties (All data → Function → Batch view): Eventually consistent
  • Properties (All data → Function → Batch view): Eventually consistent (without the associated complexities)
  • Properties (All data → Function → Batch view): Eventually consistent (such as divergent values, vector clocks, etc.)
  • What’s left? Precompute views for last few hours of data
  • Realtime views
  • New data stream → Stream processor → Realtime view #1, Realtime view #2 (NoSQL databases)
  • Application queries: Batch view + Realtime view → Merge
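Resolving a query then merges the batch view (everything but the last few hours) with the realtime view (the recent tail). A minimal sketch, assuming both views map (url, hour bucket) to pageview counts and that the realtime view only holds buckets the batch layer hasn't absorbed yet (the names are illustrative):

```python
def query_pageviews(batch_view, realtime_view, url, hours):
    """Merge step: sum batch and realtime counts for the requested
    url over the requested hour buckets."""
    return sum(
        batch_view.get((url, h), 0) + realtime_view.get((url, h), 0)
        for h in hours
    )

# Batch view covers the absorbed history; realtime view covers the tail.
batch_view = {("/a", 0): 100}
realtime_view = {("/a", 1): 7}  # recent data not yet absorbed by batch
```

Once the next batch run absorbs hour 1, that realtime entry can simply be discarded; the merged answer stays the same, which is why errors in the realtime layer are transient.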
  • Precomputation: All data → Precomputed view → Query
  • Precomputation: All data → Precomputed batch view; New data stream → Precomputed realtime view; Query merges both. “Lambda Architecture”
  • The realtime layer (new data stream → precomputed realtime view) is the most complex part of the system
  • Random-write databases are much more complex; this is where things like vector clocks have to be dealt with if using an eventually consistent NoSQL database
  • But the realtime layer only represents a few hours of data
  • Can continuously discard realtime views, keeping them small. If anything goes wrong, the system auto-corrects
  • CAP: the realtime layer decides whether to guarantee C or A • If it chooses consistency, queries are consistent • If it chooses availability, queries are eventually consistent. (Speaker note: all the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer. If anything goes wrong, it’s *auto-corrected*. CAP is now a choice, as it should be, rather than a complexity burden. Making a mistake w.r.t. eventual consistency *won’t corrupt* your data.)
  • Eventual accuracy Sometimes hard to compute exact answer in realtime
  • Eventual accuracy Example: unique count
  • Eventual accuracy Can compute exact answer in batch layer and approximate answer in realtime layer Though for functions which can be computed exactly in the realtime layer (e.g. counting), you can achieve full accuracy
  • Eventual accuracy Best of both worlds of performance and accuracy
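For unique counts, the batch layer can afford the exact answer while the realtime layer uses a bounded-memory approximation. This sketch uses k-minimum-values estimation rather than the HyperLogLog-style structures a production system might use; it is an invented illustration of the idea, not the talk's implementation:

```python
import hashlib

def exact_uniques(visitors):
    """Batch layer: exact distinct count (expensive but precise)."""
    return len(set(visitors))

def approx_uniques(visitors, k=64):
    """Realtime layer: k-minimum-values cardinality estimate.
    Hash each value into [0, 1); the k-th smallest distinct hash tells
    you roughly how densely the unit interval is populated."""
    hashes = set()
    for v in visitors:
        h = int(hashlib.md5(v.encode()).hexdigest(), 16) / 2**128
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)            # fewer than k distinct: exact
    return int((k - 1) / smallest[-1])  # standard KMV estimator
```

When the batch view catches up, its exact count replaces the realtime estimate, which is the "eventual accuracy" being described: approximate now, exact once recomputed.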
  • Tools (“Lambda Architecture”): All data (Hadoop) → Precomputed batch view (ElephantDB, Voldemort); New data stream (Kafka) → Storm → Precomputed realtime view (Cassandra, Riak, HBase); Query merges both
  • Lambda Architecture • Can discard batch views and realtime views and recreate everything from scratch • Mistakes corrected via recomputation • Data storage layer optimized independently from query resolution layer. (Speaker note: what mistakes can be made? Wrote bad data? Remove the data and recompute the views. Bug in the functions that compute a view? Recompute the view. Bug in a query function? Just deploy the fix.)
  • Future • Abstraction over batch and realtime • More data structure implementations for batch and realtime views
  • Learn more http://manning.com/marz
  • Questions?