This document summarizes a keynote address on big data myths. It argues that big data refers to problems of large data volumes and high rates of change, and that NoSQL is one proposed solution rather than a synonym for big data. It also argues that the CAP theorem is really about tradeoffs between consistency and availability. Finally, it introduces the YESQL project, which aims to build a globally distributed SQL database that does not fail, lose data, or sacrifice consistency while supporting transactions and scaling linearly.
Big myths about big data and new SQL alternatives
1. MySQL Connect Conference Keynote Address
September 30, 2012 v1.2
Big Data is a Big Scam (Most of the Time)
Daniel Austin, PayPal Technical Staff
2. Confidential and Proprietary
Global In-memory MySQL
Today’s Agenda
Big Myths About Big Data
Preview - YESQL: A Counterexample
3. THE FUNDAMENTAL PROBLEM IN DISTRIBUTED DATA SYSTEMS
“How Do We Manage Reliable Distribution of Data Across Geographical Distances?”
4. Big Data Myth #1: Big Data = NoSQL
• “Big Data” Refers to a Common Set of Problems
– Large Volumes
– High Rates of Change
• Of Data
• Of Data Models
• Of Data Presentation and Output
– Often Require “Fast Data” as well as “Big”
• Near-real Time Analytics
• Mapping Complex Structures
Takeaway: Big Data is the problem, NoSQL is one (proposed) solution
5. Do You Need A Big Data System?
Well, Maybe… But Before You Go There…
There are essentially two “Big Data Problems”:
“I have too much data and it’s coming in too fast to handle with any RDBMS.”
“I have a lot of data distributed geographically and need to be able to read and write from anywhere in near real-time.”
Takeaway: if you have one of these Big Data problems, a NoSQL solution might work for you. But there are also other alternatives…
6. The NoSQL Solution
• NoSQL Systems provide a solution that relaxes many of the common constraints of typical RDBMS systems
– Slow - RDBMS has not scaled with CPUs
– Often require complex data management (SOX, SOR)
– Costly to build and maintain, slow to change and adapt
– Intolerant of CAP models (more on this later)
• Non-relational models, usually key-value
• May be batched or streaming
• Not necessarily distributed geographically
7. Big Data Myth #2: The CAP Theorem Doesn’t Say What You Think It Does
• Consistency, Availability, (Network) Partition Tolerance
• The Real Story: These are not Independent Variables
• AP = CP (Um, what? But… A != C)
• Variations:
– PACELC (adds the latency vs. consistency tradeoff that applies when there is no partition)
Takeaway: the real story here is about the tradeoffs made by designers of different systems, and the main tradeoff is between consistency and availability, usually in favor of the latter.
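To make the consistency vs. availability tradeoff concrete, here is a minimal sketch of a toy two-replica register. It is not taken from the talk: the class names, the `mode` flag, and the `demo` helper are all hypothetical, and a real system would need replication, failure detection, and conflict resolution. It only shows what each choice gives up while the replicas cannot talk to each other.

```python
class Replica:
    """One copy of a single stored value (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.value = None


class TwoNodeStore:
    """Toy two-replica register. `mode` is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True while the replicas cannot communicate

    def write(self, replica, value):
        """Return True if the write was accepted, False if it was refused."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # Healthy network: update both copies together.
            replica.value = value
            other.value = value
            return True
        if self.mode == "CP":
            # Favor consistency: refuse the write, so the store is unavailable.
            return False
        # Favor availability: accept locally, so the copies now disagree.
        replica.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value


def demo(mode):
    """Write once, partition the network, write again; report the outcome."""
    store = TwoNodeStore(mode)
    store.write(store.a, 1)
    store.partitioned = True
    accepted = store.write(store.a, 2)
    return accepted, store.consistent()


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        accepted, consistent = demo(mode)
        print(mode, "write accepted:", accepted, "replicas agree:", consistent)
```

In this sketch the CP store refuses the write during the partition and keeps its replicas in agreement, while the AP store accepts the write and lets the replicas diverge until they can be reconciled.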
8. Big Data Hype Cycle: Where Are We Now?
There are currently 120+ NoSQL databases listed at nosql-databases.com!
You Are Here?
As the pace of new technology solutions has slowed, some clear winners have emerged.
9. BIG DATA MYTH #3: BIG DATA AND NOSQL ARE NEW IDEAS
• The first and most successful such system is DNS, created in 1983.
• Began with flat files
• Currently serves the entire Internet (!)
• DNS is an AP system, availability is #1
• Many extensions complicate a simple design
• Suggests a new term for CAP-like ideas: variability
• DNS variability is very high, often 2-3x the mean
10. Today’s Agenda
Global In-memory MySQL
Big Myths About Big Data
Preview - YESQL: A Counterexample
Q&A
11. Mission YESQL
“Develop a globally distributed DB For user-related data.”
• Must Not Fail (99.999%)
• Must Not Lose Data. Period.
• Must Support Transactions
• Must Support (some) SQL
• Must Write/Read a 32-bit integer globally in 1000 ms
• Maximum Data Volume: 100 TB
• Must Scale Linearly with Costs
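As a rough illustration of what the 99.999% requirement implies, the sketch below converts an availability target into an allowed downtime budget. The function name and the 365-day year are assumptions made for this example, not figures from the talk.

```python
def allowed_downtime_minutes(availability, days=365.0):
    """Minutes of downtime permitted over `days` at the given availability.

    `availability` is a fraction, e.g. 0.99999 for "five nines".
    The 365-day default is an assumption for illustration.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.5f} -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of total downtime per year, which is why the requirement effectively rules out any manual failover procedure.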
12. What about “High Performance”?
• Maximum light-speed travel time between two points on Earth’s surface: ~67 ms
• Target: data available worldwide in < 1000 ms
Sound Easy?
Think Again!
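The ~67 ms figure can be checked with a short calculation: the farthest apart two points on the surface can be is half the Earth's circumference, and light in a vacuum covers that distance in roughly 67 ms. The sketch below is an illustration only. The equatorial circumference value and the two-thirds-of-c figure for optical fiber are common approximations assumed here, not numbers from the slides.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in a vacuum, km/s
EARTH_CIRCUMFERENCE_KM = 40_075.0   # approximate equatorial circumference (assumed)


def one_way_ms(distance_km, speed_fraction=1.0):
    """One-way travel time in milliseconds at a fraction of light speed."""
    if speed_fraction <= 0:
        raise ValueError("speed_fraction must be positive")
    return distance_km / (SPEED_OF_LIGHT_KM_S * speed_fraction) * 1000.0


# Farthest surface distance between two points: half the circumference.
antipodal_km = EARTH_CIRCUMFERENCE_KM / 2

vacuum_ms = one_way_ms(antipodal_km)             # physical lower bound
fiber_ms = one_way_ms(antipodal_km, 2.0 / 3.0)   # assumed ~2/3 c in fiber

if __name__ == "__main__":
    print(f"one-way, vacuum: {vacuum_ms:.1f} ms")
    print(f"one-way, fiber (assumed 2/3 c): {fiber_ms:.1f} ms")
    print(f"round trip, fiber: {2 * fiber_ms:.1f} ms")
```

Even before routing, queuing, and disk or replication overhead, a single worldwide round trip under these assumptions consumes a noticeable share of the 1000 ms budget, and a transaction that needs several round trips consumes it quickly.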
14. In The Full Session…
• More Big Data Myths
• YESQL Architecture
• Failover
• Conservation of Timestamps!
• Join me today at 10:30 AM for the details!
15. Summing Up: The Big Picture on Big Data
• Only use Big Data solutions when you have a real Big Data problem.
– Don’t be a Dedicated Follower of Tech Fashion!
• Not all Big Data solutions are created equal
– What tradeoffs are most important to you?
– Consistency, Fault Tolerance, Availability, Performance, Variability
• Is your data model a fit for NoSQL?
– You don’t have to give up the relational model in most cases, so don’t!
• You can achieve high performance and availability without giving up relational models and read consistency! Just say YESQL!
16. Twitter: @daniel_b_austin
Email: daaustin@paypal.com
“In the long run, we are all dead
eventually consistent.”
Maynard Keynes on NoSQL Databases
With apologies and thanks to the real DB experts, Andrew Goodman, Yves
Trudeau, Clement Frazer, Daniel Abadi, Kent Beck, and everyone else who
contributed. It really works!
Editor's Notes
This is really the problem we want to solve. It’s one of the fundamental problems in computer science and doesn’t have a completely satisfactory solution.
This is big myth #1. They are not necessarily even related; one could have either or both. These are good problems to have!
The CAP Theorem is a limited version of the Systemic Qualities model.
Mike’s talk last year, only
Dr. Paul Mockapetris. Consistency in DNS is a complicated idea.
Service Reliability. Must be buzzword compliant, as in RFC 2119. Tradeoffs discussed previously.