PayPal Big Data and MySQL Cluster


PayPal's presentation from the MySQL Connect conference, including their analysis of big data solutions and their selection of MySQL Cluster

Transcript

  • 1. Big Data is a Big Scam (Most of the Time). Daniel Austin, PayPal Technical Staff. MySQL Connect Conference, September 30, 2012. v1.2
  • 2. Today’s Agenda: Big Myths about Big Data; YESQL: A Counterexample; Q&A; Global In-memory MySQL. Confidential and Proprietary
  • 3. THE FUNDAMENTAL PROBLEM IN DISTRIBUTED DATA SYSTEMS: “How Do We Manage Reliable Distribution of Data Across Geographical Distances?”
  • 4. The NoSQL Solution
      • NoSQL systems provide a solution that relaxes many of the common constraints of typical RDBMS systems
          – Slow: RDBMS has not scaled with CPUs
          – Often require complex data management (SOX, SOR)
          – Costly to build and maintain, slow to change and adapt
          – Intolerant of CAP models (more on this later)
      • Non-relational models, usually key-value
      • May be batched or streaming
      • Not necessarily distributed geographically
  • 5. Big Data Myth #1: Big Data = NoSQL
      • ‘Big Data’ Refers to a Common Set of Problems
          – Large Volumes
          – High Rates of Change
              • Of Data
              • Of Data Models
              • Of Data Presentation and Output
          – Often Require ‘Fast Data’ as well as ‘Big’
              • Near-real Time Analytics
              • Mapping Complex Structures
      • Takeaway: Big Data is the problem, NoSQL is one (proposed) solution
  • 6. 3 Kinds of Big Data Systems
      1. Columnar K-V Systems: Hadoop, HBase, Cassandra, PNUTS
      2. Document-Based: MongoDB, Terracotta
      3. Graph-Based: FlockDB, Voldemort
      • Takeaway: These were originally designed as solutions to specific problems because no commercial solution would work.
  • 7. Big Data Hype Cycle: Where Are We Now? There are currently more than 120 NoSQL databases listed at nosql-databases.com! [Hype-cycle chart, marked “You Are Here?”] As the pace of new technology solutions has slowed, some clear winners have emerged.
  • 8. Big Data Myth #2: The CAP Theorem Doesn’t Say What You Think It Does
      • Consistency, Availability, (Network) Partition Tolerance
      • The Real Story: These are not Independent Variables
      • AP = CP (Um, what? But… A != C)
      • Variations:
          – PACELC (adds latency tolerance)
      • Takeaway: the real story here is about the tradeoffs made by designers of different systems, and the main tradeoff is between consistency and availability, usually in favor of the latter.
  • 9. Big Data Myth: You Need A Big Data System. Well, Maybe… But Before You Go There…
      • There are essentially two ‘Big Data Problems’:
          – “I have too much data and it’s coming in too fast to handle with any RDBMS.”
          – “I have a lot of data distributed geographically and need to be able to read and write from anywhere in near real-time.”
      • Takeaway: if you have one of these Big Data problems, a NoSQL solution might work for you. But there are also other alternatives…
  • 10. BIG DATA MYTH #3: BIG DATA AND NOSQL ARE NEW IDEAS
      • The first and most successful such system is DNS, created in 1983.
      • Began with flat files
      • Currently serves the entire Internet (!)
      • DNS is an AP system, availability is #1
      • Many extensions complicate a simple design
      • Suggests a new term for CAP-like ideas: variability
          – DNS variability is very high, often 2-3x the mean
  • 11. Today’s Agenda: Big Myths About Big Data; YESQL: A Counterexample; Q&A; Global In-memory MySQL
  • 12. Mission YESQL: “Develop a globally distributed DB for user-related data.”
      • Must Not Fail (99.999%)
      • Must Not Lose Data. Period.
      • Must Support Transactions
      • Must Support (some) SQL
      • Must Write/Read a 32-bit integer globally in 1000 ms
      • Maximum Data Volume: 100 TB
      • Must Scale Linearly with Costs
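
To make the 99.999% requirement concrete, the short sketch below converts an availability target into a yearly downtime budget. The function name and the choice of a 365-day year are assumptions made for illustration; they are not taken from the slides. Five nines works out to a little over five minutes of downtime per year.

```python
# Illustrative only: converts an availability target into allowed downtime.
# A 365-day (non-leap) year is assumed.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted over the period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        minutes = downtime_budget_minutes(target)
        print(f"{target:.5f} -> {minutes:.2f} minutes of downtime per year")
```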
  • 13. What about “High Performance”?
      • Maximum lightspeed distance on Earth’s Surface: ~67 ms
      • Target: data available worldwide in < 1000 ms
      • Sound Easy? Think Again!
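
The ~67 ms figure matches the time light needs to cover half of Earth's circumference, the longest great-circle distance between two surface points. The sketch below reproduces that arithmetic. The mean radius (6,371 km) and the vacuum speed of light are standard constants; the `velocity_factor` parameter is an added assumption to show how a slower medium such as optical fiber (roughly two thirds of c) stretches the delay, and is not something the slide discusses.

```python
import math

EARTH_MEAN_RADIUS_KM = 6371.0        # mean radius of Earth
SPEED_OF_LIGHT_KM_S = 299_792.458    # speed of light in vacuum


def antipodal_delay_ms(velocity_factor: float = 1.0) -> float:
    """One-way delay between antipodal points on the surface, in milliseconds.

    velocity_factor scales the signal speed relative to c
    (1.0 = vacuum; about 0.67 is a common rough figure for fiber).
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    half_circumference_km = math.pi * EARTH_MEAN_RADIUS_KM
    speed_km_s = SPEED_OF_LIGHT_KM_S * velocity_factor
    return half_circumference_km / speed_km_s * 1000.0


if __name__ == "__main__":
    print(f"vacuum: {antipodal_delay_ms():.1f} ms one way")
    print(f"fiber (assumed 0.67c): {antipodal_delay_ms(0.67):.1f} ms one way")
```

Even under these idealized assumptions a single round trip uses a noticeable share of the 1000 ms budget, before any routing, queuing, or replication work.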
  • 14. WHY MYSQL CLUSTER?
      • Pro
          – True HA by design (fast recovery)
          – Supports (some) X-actions
          – Relational Model
          – In-memory architecture = high performance
          – Disk storage for non-indexed data
          – APIs, APIs, APIs
      • Con
          – Some semantic limitations on fields
          – Size constraints (2 TB?); hardware limits also
          – Higher cost/byte
          – Requires reasonable data partitioning
          – Higher complexity
  • 15. How MySQL Cluster Works in 1 Slide. Graphics courtesy dev.mysql.com
  • 16. CIRCULAR REPLICATION/FAILOVER. Graphics courtesy O’Reilly OnLamp.com
  • 17. AVAILABILITY DEFINED
      • Availability of the entire system ($n$ parallel paths, each of $m$ serial components with availability $r_{ij}$): $A_{sys} = 1 - \prod_{i=1}^{n} \left( 1 - \prod_{j=1}^{m} r_{ij} \right)$
      • Number of Parallel Components Needed to Achieve Availability $A_{min}$: $N_{min} = \left\lceil \ln(1 - A_{min}) / \ln(1 - r) \right\rceil$
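
A minimal sketch of the two availability formulas on this slide, assuming independent failures: components in series multiply their availabilities, and a set of parallel paths fails only if every path fails. The function names and the sample numbers are hypothetical, and the ceiling in the second function is an assumption about how the slide's brackets should be read (a component count has to be a whole number).

```python
import math
from typing import Sequence


def system_availability(paths: Sequence[Sequence[float]]) -> float:
    """Availability of parallel paths, each a chain of serial components.

    paths[i][j] is the availability r_ij of component j on path i.
    """
    all_paths_down = 1.0
    for components in paths:
        path_up = math.prod(components)      # serial: every component must be up
        all_paths_down *= (1.0 - path_up)    # parallel: every path must be down
    return 1.0 - all_paths_down


def min_parallel_components(a_min: float, r: float) -> int:
    """Smallest N such that 1 - (1 - r)**N >= a_min."""
    if not (0.0 < a_min < 1.0 and 0.0 < r < 1.0):
        raise ValueError("a_min and r must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - a_min) / math.log(1.0 - r))


if __name__ == "__main__":
    # Two redundant paths, each with two components at 90% availability.
    print(system_availability([[0.9, 0.9], [0.9, 0.9]]))
    # How many 99%-available components in parallel reach five nines?
    print(min_parallel_components(0.99999, 0.99))
```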
  • 18. AWS Meets MySQL Cluster
      • Why AWS?
          – Cheap and easy infrastructure-in-a-box (Or so I thought! Ha!)
      • Services Used:
          – EC2 (CentOS 5.3, small instances for mgm & query nodes, XL for data nodes)
          – Elastic IPs/ELB
          – EBS Volumes
          – S3
          – CloudWatch
  • 19. ARCHITECTURAL TILES
      • Tiling Rules
          – Never separate NDB & SQL
          – Ndb:2-SQL:1-MGM:1
          – Scale by adding more tiles
          – Failover 1st to nearest AZ
          – Then to nearest DC
          – At least 1 replica/AZ
          – Don’t share nodes
          – Mgmt nodes are redundant
      • Limitations
          – AWS is network-bound @ 250 MBPS (ouch!)
          – Need specific ACL across AZ boundaries
          – AZs not uniform!
          – No GSLB
          – Dynamic IPs
          – ELB sticky sessions not reliable
      • [Diagram: AWS Availability Zones A, B, and C with ELB; legend: Data Node, Mgmt Node, SQL Node; Unused (not present in all locations)]
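
The tiling rules can be pictured as a small data structure plus a failover ordering. The sketch below is a hypothetical illustration, not PayPal's implementation: the `Tile` class, the `failover_order` function, and the inter-region distances are all invented for the example. It reads the 2:1:1 rule as two data nodes, one SQL node, and one management node per tile, and it ranks failover targets by "same data center, other AZ first, then nearest data center".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tile:
    """One architectural tile (hypothetical model of the 2:1:1 rule)."""
    region: str          # data center, e.g. "US-E"
    zone: str            # availability zone within the region, e.g. "A"
    data_nodes: int = 2  # NDB data nodes
    sql_nodes: int = 1   # SQL (query) nodes
    mgmt_nodes: int = 1  # management nodes


def failover_order(home: Tile,
                   tiles: List[Tile],
                   region_distance_km: Dict[Tuple[str, str], float]) -> List[Tile]:
    """Rank candidate tiles: other AZs in the home region, then nearest regions."""
    def rank(tile: Tile) -> Tuple[int, float]:
        if tile.region == home.region:
            return (0, 0.0)
        return (1, region_distance_km[(home.region, tile.region)])

    return sorted((t for t in tiles if t != home), key=rank)


if __name__ == "__main__":
    home = Tile("US-E", "A")
    tiles = [home, Tile("EU", "A"), Tile("US-E", "B"), Tile("US-W", "A")]
    # Distances are made-up placeholders for the example.
    distances = {("US-E", "US-W"): 4000.0, ("US-E", "EU"): 6000.0}
    for tile in failover_order(home, tiles, distances):
        print(tile.region, tile.zone)
```

Scaling by tiling then amounts to appending more `Tile` entries rather than changing the shape of any existing one.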
  • 20. Architecture Stack: Scale by Tiling. [Diagram: repeated A/B tile pairs] 5 AWS Data Centers: US-E, US-W, TK, EU, AS
  • 21. Other Technologies Considered
      • Paxos
          – Elegant-but-complex consensus-based messaging protocol
          – Used in Google Megastore, Bing metadata
      • Java Query Caching
          – Queries as serialized objects
          – Not yet working
      • Multiple Ring Architectures
          – Even more complicated = no way
  • 22. SYSTEM READ/WRITE PERFORMANCE (!)
      • What we tested:
          – 32 & 256 byte char fields
          – Reads, writes, query speed vs. volume
          – Data replication speeds
      • Results:
          – Global replication < 350 ms
          – 256 byte read < 10 ms worldwide
      • [Chart: In-region replication tests, 06/19/2011 to 06/23/2011]
  • 23. Data Models and Query Optimization
      • Network Latency is an obvious issue
      • Data model requires all segments present in each geo-region
      • Parameterized (Linked) Joins
          – Adaptive Query Localization (SIP) technique from Clustra (see Frazer Clement’s blog for details)
  • 24. Conservation of Timestamps, or The Commit Ordering Problem
      • Why does commit ordering matter?
      • Write operators are non-commutative: [W(d,t1), W(d,t2)] != 0 unless t1 = t2
          – Can lead to inconsistency
          – Can lead to timestamp corruption
          – Forcing sequential writes defeats Amdahl’s law
      • Can show up in GSLB scenarios
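
The non-commutativity of writes is easy to show with a toy example. In the sketch below, two replicas receive the same two writes to one datum in opposite orders: applying them in arrival order leaves the replicas disagreeing, while a timestamp-based last-writer-wins rule makes them converge. Last-writer-wins is used here only as a familiar illustration and is not claimed to be the mechanism used in YESQL; the `Write` class and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Write:
    value: int
    timestamp: int  # commit timestamp assigned by the originating site


def apply_in_arrival_order(writes: Iterable[Write]) -> Optional[int]:
    """Each write overwrites the datum; the result depends on arrival order."""
    state = None
    for w in writes:
        state = w.value
    return state


def apply_last_writer_wins(writes: Iterable[Write]) -> Optional[int]:
    """Keep only the write with the newest timestamp, whatever the arrival order.

    Writes with equal timestamps but different values remain ambiguous:
    the first one to arrive is kept.
    """
    state, newest = None, None
    for w in writes:
        if newest is None or w.timestamp > newest:
            state, newest = w.value, w.timestamp
    return state


if __name__ == "__main__":
    w1, w2 = Write(value=10, timestamp=1), Write(value=20, timestamp=2)
    # Replica A sees w1 then w2; replica B sees w2 then w1.
    print(apply_in_arrival_order([w1, w2]), apply_in_arrival_order([w2, w1]))
    print(apply_last_writer_wins([w1, w2]), apply_last_writer_wins([w2, w1]))
```

The tie case noted in the docstring is one way that unreliable or colliding timestamps can reintroduce the inconsistency the rule was meant to remove.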
  • 25. Hard Lessons, Shared
      • Be Careful…
          – With “Eventual Consistency”-related concepts
          – ACID, CAP are not really as well-defined as we’d like considering how often we invoke them
      • MySQL Cluster is a good solution
          – Real HA, real SQL
          – Notable limitations around fields, datatypes
          – Successfully competes with NoSQL systems for most use cases, and is better in many cases
      • NoSQL Systems
          – All have relatively low levels of maturity
          – More suitable for simpler key-value models
          – Victim of Tech Fashion
  • 26. Future Directions
      • Alternate solution using Pacemaker, Heartbeat
          – From Yves Trudeau @ Percona
          – Uses InnoDB, not NDB
      • Implement Memcached plugin
          – To test NoSQL functionality, APIs
      • Add simple connection-based persistence to preserve connections during failover
      • Better data node distribution
      • Better testing & monitoring
  • 27. Summing Up On YESQL v0.85
      • It works! Far better than expected.
      • Very fast, very reliable
      • Reduced complexity since v0.7
      • AWS poses challenges that private data centers may not experience
      • You can achieve high performance and availability without giving up relational models and read consistency!
  • 28. The Big Picture on Big Data
      • Only use Big Data solutions when you have a real Big Data problem.
          – Don’t be a Dedicated Follower of Tech Fashion!
      • Not all Big Data solutions are created equal
          – What tradeoffs are most important to you?
          – Consistency, Fault Tolerance, Availability, Performance, Variability
      • Is your data model a fit for NoSQL?
          – You don’t have to give up the relational model in most cases, so don’t!
      • You can achieve high performance and availability without giving up relational models and read consistency! Just say YESQL!
  • 29. “In the long run, we are all dead eventually consistent.” Maynard Keynes on NoSQL Databases. Twitter: @daniel_b_austin. Email: daaustin@paypal.com. With apologies and thanks to the real DB experts, Andrew Goodman, Yves Trudeau, Frazer Clement, Daniel Abadi, Kent Beck, and everyone else who contributed. It really works!