Big iron 2 (published)


Published in: Technology

  1. 1. The Return of Big Iron? Ben Stopford, Distinguished Engineer, RBS Markets
  2. 2. Much diversity
  3. 3. What does this mean? • A change in what customers (we) value • The mainstream is not serving customers (us) sufficiently
  4. 4. The Database field has problems
  5. 5. We Lose: Joe Hellerstein (Berkeley) 2001 “Databases are commoditised and cornered to slow-moving, evolving, structure intensive, applications that require schema evolution.” … “The internet companies are lost and we will remain in the doldrums of the enterprise space.” … “As databases are black boxes which require a lot of coaxing to get maximum performance”
  6. 6. His question was how to win them back?
  7. 7. These new technologies also caused frustration
  8. 8. Backlash (2009) • Not novel (dates back to the 80’s) • Physical level, not the logical level (messy?) • Incompatible with tooling • Lack of integrity (referential) & ACID • MR is brute force, ignoring indexing and skew
  9. 9. All points are reasonable
  10. 10. And they proved it too! “A Comparison of Approaches to Large-Scale Data Analysis” – SIGMOD 2009 • Vertica vs. DBMS-X vs. Hadoop • Vertica up to 7x faster than Hadoop over benchmarks • Databases faster than Hadoop
  11. 11. But possibly missed the point?
  12. 12. Databases were traditionally designed to keep data safe
  13. 13. NoSQL grew from a need to scale
  14. 14. It’s more than just scale: they facilitate different practices
  15. 15. A Better Fit They better match the way software is engineered today. – Iterative development – Fast feedback – Frequent releases
  16. 16. Is NoSQL a Disruptive Technology? Christensen’s observation: Market leaders are displaced when markets shift in ways that the incumbent leaders are not prepared for.
  17. 17. Aside: MongoDB • Impressive trajectory • Slightly crappy product (from a traditional database standpoint) • Most closely related to relational DB (of the NoSQLs) • Plays to the agile mindset
  18. 18. Yet the NoSQL market is relatively small • Currently around $600 but projected to grow strongly • Database and systems management market is worth around $34 billion
  19. 19. Key Point: There is more to NoSQL than just scale. It sits better with the way we build software today
  20. 20. We have new building blocks to play with!
  21. 21. My Problem • Sprawling application space, built over many years, grouped into both vertical and horizontal silos • Duplication of effort • Data corruption & preventative measures • Consolidation is costly, time consuming and technically challenging.
  22. 22. Traditional solutions (in chronological order) – Messaging – SOA – Enterprise Data Warehouse – Data virtualisation
  23. 23. Bringing data, applications, people together is hard
  24. 24. A popular choice is an EDW
  25. 25. EDW pattern is workable, but tough – As soon as you take a ‘view’ on what the shape of the data is, it becomes harder to change. • Leave ‘taking a view’ to the last responsible moment – Multifaceted: Shape, diversity of source, diversity of population, temporal change
  26. 26. Harder to do iteratively
  27. 27. Is this the only way?
  28. 28. The Google Approach • MapReduce • Google File System • BigTable • Tenzing • Megastore • F1 • Dremel • Spanner
  29. 29. And just one code base! So no enterprise schema secret society!
  30. 30. The eBay Approach
  31. 31. The Partial-Schematic Approach: often termed Clobs & Cracking
  32. 32. Problems with solidifying a schematic representation • Risk of throwing information away, keeping only what you think you need. – OK if you create data – Bad if you got data from elsewhere • Data tends to be poly-structured in programs and on the wire • Early-binding slows down development
  33. 33. But schemas are good • They guarantee a contract • That contract spans the whole dataset – Similar to static typing in programming languages.
  34. 34. Compromise positions • Query schema can be a subset of data schema. • Use schemaless databases to capture diversity early and evolve it as you build.
  35. 35. Common solutions today use multiple technologies • MapReduce • Data Warehouse • ? • Key-Value Store • In-Memory/OLTP Database
  36. 36. We use a late-bound schema, sitting over a schemaless store [Diagram labels: Structured Standardisation Layer, Raw Data, Late Bound Schema]
  37. 37. Evolutionary Approach • Late-binding makes consolidation incremental – Schematic representation delivered at the ‘last responsible moment’ (schema on demand) – A trade in this model has 4 mandatory nodes. A fully modeled trade has around 800. • The system of record is raw data, not our ‘view’ of it • No schema migration! But this comes at a price.
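A minimal sketch of the late-binding idea, assuming raw documents are kept as JSON strings and a schema is just a mapping applied at read time. The names `RAW_STORE`, `TRADE_SCHEMA`, `bind` and `query`, and the four trade fields, are hypothetical and chosen for illustration; they are not taken from the talk.

```python
import json

# Raw system of record: documents are stored as received and never reshaped.
RAW_STORE = [
    json.dumps({"id": "T1", "trader": "alice", "party": "ACME", "notional": 1000000,
                "legs": [{"ccy": "USD"}, {"ccy": "EUR"}]}),
    json.dumps({"id": "T2", "trader": "bob", "party": "INIT", "notional": 250000,
                "settlement": {"date": "2013-06-01"}}),
]

# Hypothetical late-bound schema: field name -> (path into the raw document, type).
# Only the fields a query needs are bound; everything else stays raw.
TRADE_SCHEMA = {
    "id": (("id",), str),
    "trader": (("trader",), str),
    "party": (("party",), str),
    "notional": (("notional",), float),
}


def bind(raw, schema):
    """Apply a schema to one raw document at read time."""
    doc = json.loads(raw)
    row = {}
    for field, (path, cast) in schema.items():
        value = doc
        for key in path:
            value = value[key]  # KeyError here means a mandatory node is missing
        row[field] = cast(value)
    return row


def query(schema, predicate):
    """Bind every raw document, then filter on the bound view."""
    return [row for row in (bind(raw, schema) for raw in RAW_STORE) if predicate(row)]


big_trades = query(TRADE_SCHEMA, lambda r: r["notional"] > 500_000)
```

Extending the view later means adding an entry to the schema mapping. The stored documents are untouched, which is the sense in which there is no schema migration. The price is that every read pays the binding cost.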
  38. 38. Scaling
  39. 39. Key-based access always scales
  40. 40. But queries (without the sharding key) always broadcast
  41. 41. As query complexity increases, so does the overhead
  42. 42. Coarse-grained shards
  43. 43. Data replicas provide hardware isolation
  44. 44. Scaling • Key-based sharding is only sufficient for very simple workloads • Coarse-grained shards help (but suffer from skew) • Replication provides useful, if expensive, hardware isolation • Workload management is less useful in my experience
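The difference between key-based access and a broadcast query can be sketched in a few lines. This is an in-process toy, assuming hash sharding over four shards. `NUM_SHARDS`, `shard_for`, `put`, `get` and `find` are illustrative names, and each function also returns the number of shards it had to contact.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for the sketch
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Map a key to a shard with a stable hash (built-in hash() varies per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def put(key, record):
    shards[shard_for(key)][key] = record


def get(key):
    """Key-based access: exactly one shard is contacted."""
    return shards[shard_for(key)].get(key), 1


def find(predicate):
    """Query without the sharding key: every shard must be asked."""
    hits = []
    for shard in shards:
        hits.extend(r for r in shard.values() if predicate(r))
    return hits, len(shards)


put("T1", {"id": "T1", "trader": "alice"})
put("T2", {"id": "T2", "trader": "bob"})
put("T3", {"id": "T3", "trader": "alice"})
```

Here `get("T2")` contacts one shard however many shards exist, while `find(lambda r: r["trader"] == "alice")` contacts all of them, so its cost grows with the cluster.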
  45. 45. Weak consistency forces the problem onto the developer. Particularly bad for banks!
  46. 46. Scaling two-phase commit is hard to do efficiently • Requires distributed lock/clock/counter • Requires synchronisation of all readers & writers
  47. 47. Alternatives to traditional 2PC • MVCC over explicit locking • Timestamp based strong consistency – E.g. Granola • Optimistic concurrency control – Leverage short running transactions (avoid cross-network transactions) – Tolerate different temporal viewpoints to reduce synchronization costs.
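As an illustration of optimistic concurrency control, here is a single-process sketch in which each value carries a version number and a write succeeds only if the version it read is still current. `VersionedStore`, `ConflictError` and `update` are hypothetical names. A real store would need the check-and-write step to be atomic on the node that owns the key.

```python
class ConflictError(Exception):
    """Raised when a write was based on a stale read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key is version 0 with no value."""
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Write only if nobody else has written since expected_version was read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def update(store, key, fn, retries=3):
    """Short read-modify-write transaction: no locks held, retry on conflict."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            return store.write(key, version, fn(value))
        except ConflictError:
            continue
    raise ConflictError(f"{key}: gave up after {retries} attempts")
```

No lock is held between the read and the write, so readers never block writers. The cost moves to the retry path, which is why the approach suits short-running transactions.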
  48. 48. Immutable Data • Safety • ‘As was’ view • Sits well with MVCC • Efficiency problems • Gaining popularity (e.g. Datomic)
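A small sketch of an immutable, append-only store with an ‘as was’ read, assuming a simple logical clock as the timestamp. `ImmutableStore`, `append`, `as_of` and `latest` are illustrative names and are not modelled on Datomic’s API.

```python
import bisect


class ImmutableStore:
    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value), append-only, ascending
        self._clock = 0      # logical clock; every append gets a new timestamp

    def append(self, key, value):
        """Record a new version; earlier versions are never modified or removed."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def as_of(self, key, timestamp):
        """Return the value as it was at the given timestamp, or None if not yet written."""
        history = self._versions.get(key, [])
        timestamps = [ts for ts, _ in history]
        i = bisect.bisect_right(timestamps, timestamp)
        return history[i - 1][1] if i else None

    def latest(self, key):
        return self.as_of(key, self._clock)


store = ImmutableStore()
t1 = store.append("T1", {"notional": 100})
t2 = store.append("T1", {"notional": 200})
```

`store.as_of("T1", t1)` still returns the first version after the second has been appended. The history list only ever grows, which is one source of the efficiency problems noted on the slide.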
  49. 49. Use joins to avoid ‘over aggregating’. Joins are ok, so long as they are – Local – Via a unique key [Diagram labels: Trader, Party, Trade]
  50. 50. Memory/Disk Tradeoff • Memory only (possibly overplayed) • Pinned indexes (generally good idea if you can afford the RAM) • Disk resident (best general purpose solution and for very large datasets)
  51. 51. Balance flexibility and complexity [Diagram labels: Operational (real time / MR), Object/SQL, Standardisation, Raw Data, Relational Analytics]
  52. 52. Supple at the front, more rigid at the back [Diagram labels: Raw Access, Operational Access, Analytic Access; Looser to Tighter; Untyped, Object/SQL, Reporting; Broad Data Coverage to Narrow Data Coverage; Narrow Query to Comprehensive Query]
  53. 53. Principles • Record everything • Grow a schema, don’t do it upfront • Avoid using a ‘view’ as your system of record • Differentiate between sourced data (out of your control) and generated data (in your control) • Use automated replication (for isolation) as well as sharding (for scale) • Leverage asynchronicity to reduce transaction overheads
  54. 54. Consolidation means more trust, fewer impedance mismatches and managing tighter couplings
  55. 55. Target architectures are starting to look more like large applications of cloud-enabled services than heterogeneous application conglomerates
  56. 56. Are we going back to the mainframe?
  57. 57. Thanks