The document describes Tumblr's process for splitting a large MySQL shard into multiple smaller shards to improve scalability and performance. Some key points:
- Tumblr uses custom automation software to split shards by dividing the data set across new slave servers, moving reads and writes to the new shards, and removing any rows replicated to the wrong shard.
- The process involves creating new slaves with subsets of the data, moving reads first then writes, and stopping replication from the original "parent" shard before taking it offline.
- Techniques like parallel export/import of data in chunks and efficient copying of files to multiple destinations are used to split very large shards containing hundreds of millions of rows within a few hours with no downtime.
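The split-and-clean-up bookkeeping described above can be sketched in a few lines. This is a hypothetical illustration, not Tumblr's actual tooling: the key ranges, column name `blog_id`, and in-memory "rows" are invented, but the core idea is the same, each child shard takes half of the parent's key range and drops any row that replication copied to the wrong child.

```python
# Hypothetical sketch of a shard split: divide a parent shard's key range
# between two child shards, then remove rows replicated to the wrong child.

def child_ranges(lo, hi):
    """Split the parent's [lo, hi) key range into two halves."""
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]

def cleanup(rows, lo, hi):
    """Keep only rows whose shard key falls inside this child's range."""
    return [r for r in rows if lo <= r["blog_id"] < hi]

# Toy parent shard: one row per blog_id in 0, 7, 14, ... 98.
parent_rows = [{"blog_id": i, "post": f"p{i}"} for i in range(0, 100, 7)]
(lo1, hi1), (lo2, hi2) = child_ranges(0, 100)

shard_a = cleanup(parent_rows, lo1, hi1)   # blog_id 0..49
shard_b = cleanup(parent_rows, lo2, hi2)   # blog_id 50..99

# Every parent row lands on exactly one child shard.
assert len(shard_a) + len(shard_b) == len(parent_rows)
```

In the real process the "cleanup" happens with DELETE statements on the new slaves after reads and writes have moved over, but the invariant is the same: the child ranges partition the parent's keyspace.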
Conquering "big data": An introduction to Shard-Query (Justin Swanhart)
Shard-Query is a middleware solution that enables massively parallel query execution for MySQL databases. It works by splitting single SQL queries into multiple smaller queries or tasks that can run concurrently on one or more database servers. This allows it to scale out query processing across more CPUs and servers for improved performance on large datasets and analytics workloads. It supports both partitioning tables for parallelism within a single server as well as sharding tables across multiple servers. The end result is that it can enable MySQL to perform like other parallel database solutions through distributed query processing.
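The split-and-combine idea behind Shard-Query can be shown with a minimal sketch. This is not Shard-Query's implementation (which rewrites real SQL and dispatches work to multiple servers); the in-memory "shards" and the aggregation are illustrative only, but they show why the approach works: distributive aggregates like SUM can be computed per shard and then simply added together.

```python
# Minimal sketch of splitting one aggregate query into per-shard pieces.
# Conceptually: SELECT region, SUM(sales) FROM t GROUP BY region.

shards = [
    [("us", 10), ("eu", 5)],   # shard 1: (region, sales) rows
    [("us", 7), ("ap", 3)],    # shard 2
]

def shard_query(rows):
    """Per-shard partial aggregation (would run concurrently per server)."""
    partial = {}
    for region, sales in rows:
        partial[region] = partial.get(region, 0) + sales
    return partial

def combine(partials):
    """SUM is distributive, so partial sums can simply be added."""
    total = {}
    for p in partials:
        for region, s in p.items():
            total[region] = total.get(region, 0) + s
    return total

print(combine(shard_query(s) for s in shards))
# {'us': 17, 'eu': 5, 'ap': 3}
```

Non-distributive aggregates (AVG, COUNT DISTINCT) need extra rewriting, e.g. AVG becomes SUM and COUNT per shard, divided at combine time, which is part of what middleware like Shard-Query handles for you.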
Shard-Query, an MPP database for the cloud using the LAMP stack (Justin Swanhart)
This combined #SFMySQL and #SFPHP meetup talked about Shard-Query. You can find the video to accompany this set of slides here: https://www.youtube.com/watch?v=vC3mL_5DfEM
Executing Queries on a Sharded Database (Neha Narula)
Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.
These slides cover a talk on using distributed computation for database queries. Moore's Law, Amdahl's Law and distribution techniques are highlighted, and a simple performance comparison is provided.
This presentation can help you apply partitioning when appropriate, and avoid problems when using it. The one-liner is: Simple Works Best. The illustrating demos are on PostgreSQL 12 (maybe 13 by the time of presenting) and show some of the problems and solutions that partitioning can provide. Some of this "experience" is quite old, and the demo runs near-identically on Oracle…
These problems are the same on any database.
Hadoop: Big Data Stacks validation w/ iTest - How to tame the elephant? (Dmitri Shiryaev)
- Problem we are facing
- Big Data Stacks
- Why validation
- What is "Success" and the effort to achieve it
- Solutions
- Ops testing
- Platform certification
- Application testing
- Stack on stack
- Test artifacts are First Class Citizens
- Assembling the validation stack (vstack)
- Tailoring the vstack for target clusters
- D3: Deployment/Dependencies/Determinism
1) Flickr is a photo sharing website built using PHP that allows users to upload and tag photos and share them publicly or privately.
2) The website uses a logical architecture with application logic, page logic, templates, and endpoints separating the different components.
3) While primarily built with PHP, the website also incorporates technologies like Smarty for templating, Perl for image processing, MySQL for the database, and Java for the node service.
Powering GIS Application with PostgreSQL and Postgres Plus (Ashnikbiz)
This document provides an overview of Postgres Plus Advanced Server and its features. It begins with introductions to PostgreSQL and PostGIS. It then discusses Postgres Plus Advanced Server's Oracle compatibility, performance enhancements, security features, high availability options, database administration tools, and migration toolkit. The document also provides information on scaling Postgres Plus Advanced Server through partitioning and infinite cache technologies. It concludes with summaries of the replication capabilities of Postgres Plus Advanced Server.
Design Patterns for Distributed Non-Relational Databases (guestdfd1ec)
The document discusses design patterns for distributed non-relational databases, including consistent hashing for key placement, eventual consistency models, vector clocks for determining history, log-structured merge trees for storage layout, and gossip protocols for cluster management without a single point of failure. It raises questions to ask presenters about scalability, reliability, performance, consistency models, cluster management, data models, and real-life considerations for using such systems.
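Consistent hashing, the first pattern mentioned above, is compact enough to sketch. This is a hedged illustration, not any particular database's implementation: using MD5 for ring placement and 8 virtual nodes per physical node are common choices here, not requirements.

```python
# Sketch of consistent hashing: nodes are placed at hashed positions on a
# ring, and a key is stored on the first node at or clockwise-after its hash.
import bisect
import hashlib

def h(s):
    """Hash a string to an integer position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Virtual nodes smooth the key distribution across physical nodes.
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def node_for(self, key):
        # First ring point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self.keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["db1", "db2", "db3"])
owner = ring.node_for("user:42")
assert owner in {"db1", "db2", "db3"}
```

The property that makes this a design pattern for distributed stores: adding or removing a node only remaps the keys that fall between it and its ring neighbor, instead of reshuffling everything as `hash(key) % N` would.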
What Every Developer Should Know About Database Scalability (jbellis)
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
This document discusses using the HBase database with Ruby on Rails applications. It provides an overview of HBase, including what it is, its core concepts like tables, columns, and column families. It also covers some of HBase's tradeoffs compared to relational databases, such as limitations on real-time queries and joins. The document discusses when HBase may be a good fit, such as for large datasets or highly distributed applications, and libraries for integrating HBase into Rails like hbase-stargate and MassiveRecord. It concludes with a demo of a URL shortener application built on Rails and HBase.
SQL Server 2016 "It Just Runs Faster", SQLBits 2017 edition (Bob Ward)
SQL Server 2016 includes several performance improvements that help it run faster than previous versions:
1. Automatic Soft NUMA partitions workloads across NUMA nodes when there are more than 8 CPUs per node to avoid bottlenecks.
2. Dynamic memory objects are now partitioned by CPU to avoid contention on global memory objects.
3. Redo operations can now be parallelized across multiple tasks to improve performance during database recovery.
SQL Server In-Memory OLTP: What Every SQL Professional Should Know (Bob Ward)
Perhaps you have heard the term "In-Memory" but are not sure what it means. If you are a SQL Server professional, you will want to know. Even if you are new to SQL Server, you will want to learn more about this topic. Come learn the basics of how In-Memory OLTP technology in SQL Server 2016 and Azure SQL Database can boost your OLTP application by 30X. We will compare how In-Memory OLTP works vs. "normal" disk-based tables. We will discuss what is required to migrate your existing data into memory-optimized tables, or how to build a new set of data and applications to take advantage of this technology. This presentation will cover the fundamentals of what, how, and why this technology is something every SQL Server professional should know.
Cloudera Impala: A Modern SQL Engine for Apache Hadoop (Cloudera, Inc.)
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets using existing SQL skills. Impala's architecture includes impalad daemons that process queries in parallel across nodes, a statestore for metadata coordination, and a new execution engine written in C++. It aims to provide faster performance than Hive for interactive queries while leveraging Hadoop's existing ecosystem. The first general availability release is planned for April 2013.
The document provides an overview of different NoSQL database types, including key-value stores, document databases, column-oriented databases, graph databases, and caches. It discusses examples of databases for each type and common use cases. The document also covers querying graph databases, polyglot persistence using multiple database types, and concludes with when each database type is best suited and when not to use a NoSQL database.
SQL Server to Redshift Data Load Using SSIS (Marc Leinbach)
In this article we will learn how to load data from SQL Server into an Amazon Redshift data warehouse using SSIS. The techniques outlined here can also be applied when extracting data from other relational sources (e.g. loading data from MySQL to Redshift, Oracle to Redshift, etc.). First we will discuss the steps needed to load data into Amazon Redshift and the challenges involved, then we will simplify the whole process using the SSIS Task for Amazon Redshift Data Transfer.
Based on the popular blog series, join me in taking a deep dive and a behind the scenes look at how SQL Server 2016 “It Just Runs Faster”, focused on scalability and performance enhancements. This talk will discuss the improvements, not only for awareness, but expose design and internal change details. The beauty behind ‘It Just Runs Faster’ is your ability to just upgrade, in place, and take advantage without lengthy and costly application or infrastructure changes. If you are looking at why SQL Server 2016 makes sense for your business you won’t want to miss this session.
The document discusses building a data platform for analytics in Azure. It outlines common issues with traditional data warehouse architectures and recommends building a data lake approach using Azure Synapse Analytics. The key elements include ingesting raw data from various sources into landing zones, creating a raw layer using file formats like Parquet, building star schemas in dedicated SQL pools or Spark tables, implementing alerting using Log Analytics, and loading data into Power BI. Building the platform with Python pipelines, notebooks, and GitHub integration is emphasized for flexibility, testability and collaboration.
Incorta allows users to create materialized views (MVs) using Spark. It provides functions to read data from Incorta tables and save Spark DataFrames as MVs. The document discusses Spark integration with Incorta, including installing and configuring Spark, and creating the first MV using Spark Python APIs. It demonstrates reading data from Incorta and saving a DataFrame as a new MV.
Oracle 12c Parallel Execution New Features (Randolf Geist)
This document discusses new parallel execution features introduced in Oracle 12c. It begins with an introduction to key aspects of parallel execution, including the producer-consumer model and data distribution skew. The document then covers major new 12c features such as hybrid hash distribution, concurrent UNION ALL, and the 1 slave distribution method. It concludes with a question and answer section.
This document summarizes Terry Bunio's presentation on breaking and fixing broken data. It begins by thanking sponsors and providing information about Terry Bunio and upcoming SQL events. It then discusses the three types of broken data: inconsistent, incoherent, and ineffectual data. For each type, it provides an example and suggestions on how to identify and fix the issues. It demonstrates how to use tools like Oracle Data Modeler, execution plans, SQL Profiler, and OStress to diagnose problems to make data more consistent, coherent and effective.
MySQL 5.7 Fabric: Introduction to High Availability and Sharding (Ulf Wendel)
MySQL 5.7 has sharding built in. The free and open-source MySQL Fabric utility simplifies the management of MySQL clusters of any kind. This includes MySQL Replication setup, monitoring, automatic failover, switchover and so forth for high availability. Additionally, it offers measures to shard a MySQL database over an arbitrary number of servers. Intelligent load balancers (updated drivers) take care of routing queries to the appropriate shards.
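The routing step a Fabric-aware connector performs can be sketched as a two-stage lookup: shard key to server group (via range boundaries), then group to its current primary. The group names, boundaries, and host addresses below are invented for illustration; Fabric maintains this mapping for you and the connector caches it.

```python
# Illustrative sketch of range-based shard routing: lower bounds of key
# ranges map to server groups, and each group has a current primary.

SHARD_MAP = [              # (lower bound of shard-key range, server group)
    (0,       "group1"),
    (100_000, "group2"),
    (200_000, "group3"),
]

PRIMARIES = {              # group -> current primary (changes on failover)
    "group1": "mysql-a:3306",
    "group2": "mysql-b:3306",
    "group3": "mysql-c:3306",
}

def route(shard_key):
    """Pick the last range whose lower bound is <= the shard key."""
    group = None
    for lower, g in SHARD_MAP:
        if shard_key >= lower:
            group = g
    return PRIMARIES[group]

assert route(42) == "mysql-a:3306"
assert route(150_000) == "mysql-b:3306"
```

On failover, only the `PRIMARIES` entry for the affected group changes, which is why updated drivers can re-route transparently without application changes.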
This document summarizes Tumblr's massively sharded MySQL database architecture. Some key points:
- Tumblr has seen huge growth in traffic and data in the past year, necessitating major database scaling. They now have over 175 machines dedicated to MySQL handling over 11TB of relational data across 25 billion rows.
- They use horizontal partitioning (sharding) across multiple MySQL instances to scale writes and handle large data sizes. Sharding by a core column allows querying the appropriate shard.
- Operational challenges include automation for adding/rebalancing shards, handling failures, and migrating/splitting very large datasets across shards without downtime.
- When splitting a large shard, the process involves creating new slaves with subsets of the data, moving reads and then writes to the new shards, and finally removing rows that were replicated to the wrong shard before retiring the parent.
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy! (ScaleBase)
Home-grown sharding is hard - REALLY HARD! ScaleBase scales out MySQL, delivering all the benefits of MySQL sharding with NONE of the sharding headaches. This webinar explains:
- MySQL scale-out without embedding code and re-writing apps
- Successful sharding on Amazon and private clouds
- Single vs. multiple shards per server
- Eliminating data silos
- Creating a redundant, fault-tolerant architecture with no single point of failure
- Re-balancing and splitting shards
This document summarizes MySQL Fabric, an open-source tool from Oracle for sharding and managing high availability of MySQL databases at scale. It discusses how MySQL Fabric uses server groups, a fabric node, and fabric-aware connectors to partition data across shards and handle failures while providing a global view of the database schema and tables. Key features of MySQL Fabric mentioned include automated shard splitting and movement, minimized downtime during failures, and use of XML-RPC to avoid proxy hops between applications and shards.
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityIvan Zoratti
Colin Charles gave a presentation comparing SQL and NoSQL databases. He discussed why organizations adopt NoSQL databases like MongoDB for large, unstructured datasets and rapid development. However, he argued that MySQL can also handle these workloads through features like dynamic columns, memcached integration, and JSON support. MySQL addresses limitations around high availability, scalability, and schema flexibility through tools and plugins that provide sharding, replication, load balancing, and online schema changes. In the end, MySQL with the right tools is capable of fulfilling both transactional and NoSQL-style workloads.
The document discusses MySQL NDB 8.0 and high availability solutions for MySQL. It summarizes MySQL NDB Cluster, MySQL InnoDB Cluster, and MySQL Replication as high availability solutions. It also discusses features and performance of MySQL NDB Cluster 8.0, including linear scalability, predictable low-latency performance, and improved backup throughput.
MySQL Group Replication provides a high availability multi-master replication solution for MySQL. It allows multiple MySQL instances to act as equal masters that can accept writes and remain available even if some instances fail. Transactions are synchronously committed across all members of the replication group to ensure consistency. Group Replication handles failure detection and recovery transparently through its use of group communication systems and built-in conflict detection. It provides a highly available, scalable and fully distributed database solution compared to traditional MySQL replication and clustering options.
Basic Introduction to Cassandra with Architecture and strategies.
with big data challenge. What is NoSQL Database.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
Cassandra is used for real-time bidding in online advertising. It processes billions of bid requests per day with low latency requirements. Segment data, which assigns product or service affinity to user groups, is stored in Cassandra to reduce calculations and allow users to be bid on sooner. Tuning the cache size and understanding the active dataset helps optimize performance.
Life After Sharding: Monitoring and Management of a Complex Data CloudOSCON Byrum
The document discusses managing data in a complex data cloud after sharding a database. It describes moving from a monolithic database to a sharded setup with multiple data nodes. Key challenges in managing the sharded data cloud are addressed, such as monitoring performance across nodes, unified alerting, and synchronized maintenance tasks. The document recommends using a service like Scalebase to provide a single point of management and access for the distributed data.
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?Clustrix
The document discusses scaling MySQL databases and alternatives to sharding. It begins by outlining the typical path organizations take to sharding MySQL as their data and usage grows over time. This involves continually upgrading hardware, adding read replicas, and eventually implementing sharding. The document then covers the challenges of sharding, such as data skew across shards, lack of ACID transactions, application changes required, and complex infrastructure needs. As an alternative, the document introduces ClustrixDB, a database that can scale write and read performance linearly just by adding more servers without sharding. It achieves this through automatic data distribution, query fan-out, and data rebalancing. Performance benchmarks show ClustrixDB vastly outscaling alternatives on Amazon
Haytham ElFadeel presented on next-generation storage systems and key-value stores. He began with an overview of scalable systems and the need for both vertical and horizontal scalability. He discussed the limitations of traditional databases in scaling, including complexity, wasted features, and multi-step query processing. Key-value stores were presented as an alternative, offering simple interfaces and designs optimized for scaling across hundreds of machines. Performance comparisons showed key-value stores significantly outperforming databases. Systems discussed included Amazon Dynamo, Facebook Cassandra, and Redis.
Presentation at the Percona-Live event in San Francisco from Tamar Bercovici and me about the big scaling project of the MySQL database we did at Box in 2012.
Scaling with sync_replication using Galera and EC2Marco Tusa
Challenging architecture design, and proof of concept on a real case of study using Syncrhomous solution.
Customer asks me to investigate and design MySQL architecture to support his application serving shops around the globe.
Scale out and scale in base to sales seasons.
This talk explores the various non relational data stores that folks are using these days. We will disspell the myths and see what these things are actually useful for.
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...raghdooosh
The document discusses big data storage concepts including cluster computing, distributed file systems, and different database types. It covers cluster structures like symmetric and asymmetric, distribution models like sharding and replication, and database types like relational, non-relational and NewSQL. Sharding partitions large datasets across multiple machines while replication stores duplicate copies of data to improve fault tolerance. Distributed file systems allow clients to access files stored across cluster nodes. Relational databases are schema-based while non-relational databases like NoSQL are schema-less and scale horizontally.
This document provides an overview and summary of key concepts related to advanced databases. It discusses relational databases including MySQL, SQL, transactions, and ODBC. It also covers database topics like triggers, indexes, and NoSQL databases. Alternative database systems like graph databases, triplestores, and linked data are introduced. Web services, XML, and data journalism are also briefly summarized. The document provides definitions and examples of these technical database terms and concepts.
This document summarizes and compares several solutions for multi-master replication in MySQL databases: Native MySQL replication, MySQL Cluster (NDB), Galera, and Tungsten. Native MySQL replication supports only limited topologies and has asynchronous replication. MySQL Cluster allows synchronous replication across two data centers but is limited to in-memory tables. Galera provides synchronous, row-based replication across multiple masters with automatic conflict resolution. Tungsten allows asynchronous multi-master replication to different database systems and automatic failover.
This document summarizes and compares several solutions for multi-master replication in MySQL databases: Native MySQL replication, MySQL Cluster (NDB), Galera, and Tungsten. Native MySQL replication supports only limited topologies and has asynchronous replication. MySQL Cluster allows synchronous replication across two data centers but is limited to in-memory tables. Galera provides synchronous, row-based replication across any number of nodes with automatic conflict resolution. Tungsten allows asynchronous, statement-based replication between different database brands with manual conflict handling.
Similar to Massively sharded my sql at tumblr presentation (20)
2. Massively Sharded MySQL
Tumblrʼs Size and Growth
                    1 Year Ago        Today
Impressions         3 Billion/month   15 Billion/month
Total Posts         1.5 Billion       12.5 Billion
Total Blogs         9 Million         33 Million
Developers          3                 17
Sys Admins          1                 5
Total Staff (FT)    13                55
3. Massively Sharded MySQL
• Machines dedicated to MySQL: over 175
• Thatʼs roughly how many production machines we had total a year ago
• Relational data on master databases: over 11 terabytes
• Unique rows: over 25 billion
Our databases and dataset
4. Massively Sharded MySQL
• Asynchronous
• Single-threaded SQL execution on slave
• Masters can have multiple slaves
• A slave can only have one master
• Can be hierarchical, but complicates failure-handling
• Keep two standby slaves per pool: one to promote when a master fails, and
the other to bring up additional slaves quickly
• Scales reads, not writes
MySQL Replication 101
5. Massively Sharded MySQL
• No other way to scale writes beyond the limits of one machine
• During peak insert times, youʼll likely start hitting lag on slaves before your
master shows a concurrency problem
Reason 1: Write scalability
Why Partition?
6. Massively Sharded MySQL
Reason 2: Data size
Why Partition?
• Working set wonʼt fit in RAM
• SSD performance drops as disk fills up
• Risk of completely full disk
• Operational difficulties: slow backups, longer to spin up new slaves
• Fault isolation: all of your data in one place = single point of failure affecting
all users
7. Massively Sharded MySQL
• Divide a table
• Horizontal Partitioning
• Vertical Partitioning
• Divide a dataset / schema
• Functional Partitioning
Types of Partitioning
8. Massively Sharded MySQL
Divide a table by relocating sets of rows
Horizontal Partitioning
• Some support internally by MySQL, allowing you to divide a table into several
files transparently, but with limitations
• Sharding is the implementation of horizontal partitioning outside of MySQL
(at the application level or service level). Each partition is a separate table.
They may be located in different database schemas and/or different instances
of MySQL.
9. Massively Sharded MySQL
Divide a table by relocating sets of columns
Vertical Partitioning
• Not supported internally by MySQL, though you can do it manually by
creating separate tables.
• Not recommended in most cases – if your data is already normalized, then
vertical partitioning introduces unnecessary joins
• If your partitions are on different MySQL instances, then youʼre doing these
“joins” in application code instead of in SQL
10. Massively Sharded MySQL
Divide a dataset by moving one or more tables
Functional Partitioning
• First eliminate all JOINs across tables in different partitions
• Move tables to new partitions (separate MySQL instances) using selective
dumping, followed by replication filters
• Often just a temporary solution. If the table eventually grows too large to fit on
a single machine, youʼll need to shard it anyway.
11. Massively Sharded MySQL
• Sharding is very complex, so itʼs best not to shard until itʼs obvious that you
will actually need to!
• Predict when you will hit write scalability issues — determine this on spare
hardware
• Predict when you will hit data size issues — calculate based on your growth
rate
• Functional partitioning can buy time
When to Shard
12. Massively Sharded MySQL
• Sharding key — a core column present (or derivable) in most tables.
• Sharding scheme — how you will group and home data (ranges vs hash vs
lookup table)
• How many shards to start with, or equivalently, how much data per shard
• Shard colocation — do shards coexist within a DB schema, a MySQL
instance, or a physical machine?
Sharding Decisions
13. Massively Sharded MySQL
Determining which shard a row lives on
Sharding Schemes
• Ranges: Easy to implement and trivial to add new shards, but requires
frequent and uneven rebalancing due to user behavior differences.
• Hash or modulus: Apply function on the sharding key to determine which
shard. Simple to implement, and distributes data evenly. Incredibly difficult to
add new shards or rebalance existing ones.
• Lookup table: Highest flexibility, but impacts performance and adds a single
point of failure. Lookup table may eventually become too large.
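The three schemes can be sketched as follows. The boundaries, shard count, and lookup entries here are invented for illustration, not Tumblrʼs actual values:

```python
# Hypothetical sketch of the three sharding schemes for mapping a sharding
# key (e.g. blog_id) to a shard number. All values are made up.
import bisect
import hashlib

# Range scheme: ordered upper bounds (exclusive), one per shard.
RANGE_BOUNDS = [500, 1000, 2000]  # shard 0: [0,500), shard 1: [500,1000), ...

def shard_by_range(blog_id):
    # Easy to add a new shard (append a bound), but shards fill unevenly.
    return bisect.bisect_right(RANGE_BOUNDS, blog_id)

# Hash/modulus scheme: even spread, but changing N_SHARDS remaps most keys.
N_SHARDS = 3

def shard_by_hash(blog_id):
    digest = hashlib.md5(str(blog_id).encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# Lookup-table scheme: maximum flexibility, but an extra hop and a SPOF.
LOOKUP = {200: 1, 742: 2}

def shard_by_lookup(blog_id):
    return LOOKUP[blog_id]

print(shard_by_range(742))  # 1: 742 falls in [500, 1000)
```

The range scheme is what the rest of this deck assumes, since splitting a shard then only means splitting one contiguous range.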
14. Massively Sharded MySQL
• Sharding key must be available for all frequent look-up operations. For
example, canʼt efficiently look up posts by their own ID anymore, also need
blog ID to know which shard to hit.
• Support for read-only and offline shards. App code needs to gracefully handle
planned maintenance and unexpected failures.
• Support for reading and writing to different MySQL instances for the same
shard range — not for scaling reads, but for the rebalancing process
Application Requirements
15. Massively Sharded MySQL
• ID generation for PK of sharded tables
• Nice-to-have: Centralized service for handling common needs
• Querying multiple shards simultaneously
• Persistent connections
• Centralized failure handling
• Parsing SQL to determine which shard(s) to send a query to
Service Requirements
16. Massively Sharded MySQL
• Automation for adding and rebalancing shards, and sufficient monitoring to
know when each is necessary
• Nice-to-have: Support for multiple MySQL instances per machine — makes
cloning and replication setup simpler, and overhead isnʼt too bad
Operational Requirements
17. Massively Sharded MySQL
Option 1: Transitional migration with legacy DB
How to initially shard a table
• Choose a cutoff ID of the tableʼs PK (not the sharding key) which is slightly
higher than its current max ID. Once that cutoff has been reached, all new
rows get written to shards instead of legacy.
• Whenever a legacy row is updated by app, move it to a shard
• Migration script slowly saves old rows (at the app level) in the background,
moving them to shards, and gradually lowers cutoff ID
• Reads may need to check shards and legacy, but based on ID you can make
an informed choice of which to check first
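The read path during the transitional migration can be sketched like this; the cutoff value, function names, and dict-backed stores are stand-ins, not Tumblrʼs actual code:

```python
# Sketch of transitional-migration reads: post IDs at or above the cutoff
# exist only on shards; below it, a row may already have been migrated, so
# check one store and fall back to the other.
CUTOFF_ID = 1_000_000  # assumed: slightly above max post ID at cutover time

def read_post(post_id, read_from_shard, read_from_legacy):
    if post_id >= CUTOFF_ID:
        return read_from_shard(post_id)   # new rows are only ever on shards
    row = read_from_shard(post_id)        # old row may have been migrated
    if row is not None:
        return row
    return read_from_legacy(post_id)      # otherwise still on the legacy DB

# Simulated stores standing in for real databases:
shard = {1_000_005: "new post"}
legacy = {42: "old post"}
print(read_post(42, shard.get, legacy.get))         # old post
print(read_post(1_000_005, shard.get, legacy.get))  # new post
```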
18. Massively Sharded MySQL
Option 2: All at once
How to initially shard a table
1. Dark mode: app redundantly sends all writes (inserts, updates, deletes) to
legacy database as well as the appropriate shard. All reads still go to legacy
database.
2. Migration: script reads data from legacy DB (sweeping by the sharding
key) and writes it to the appropriate shard.
3. Finalize: move reads to shards, and then stop writing data to legacy.
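The three phases can be sketched as follows, with dicts standing in for the legacy database and the shards; the names and the range scheme are assumptions for illustration:

```python
# Minimal sketch of the "all at once" approach: dark-mode dual writes,
# then a flip to reading from shards once the migration script has run.
legacy, shards = {}, {0: {}, 1: {}}

def shard_for(blog_id):
    return 0 if blog_id <= 500 else 1  # assumed range-based scheme

phase = "dark"  # "dark" -> "finalized"

def write_post(blog_id, post_id, body):
    shards[shard_for(blog_id)][post_id] = body  # always write the shard
    if phase == "dark":
        legacy[post_id] = body                  # redundant legacy write

def read_post(blog_id, post_id):
    if phase == "dark":
        return legacy.get(post_id)              # reads stay on legacy
    return shards[shard_for(blog_id)].get(post_id)

write_post(200, 1, "hello")  # dark mode: lands in both stores
phase = "finalized"          # after the migration script has backfilled
print(read_post(200, 1))     # hello, now served from the shard
```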
19. Massively Sharded MySQL
Tumblrʼs custom automation software can:
Shard Automation
• Crawl replication topology for all shards
• Manipulate server settings or concurrently execute arbitrary UNIX
commands / administrative MySQL queries, on some or all shards
• Copy large files to multiple remote destinations efficiently
• Spin up multiple new slaves simultaneously from a single source
• Import or export arbitrary portions of the dataset
• Split a shard into N new shards
20. Massively Sharded MySQL
• Rebalance an overly-large shard by dividing it into N new shards, of even or
uneven size
• Speed
• No locks
• No application logic
• Divide an 800 GB shard (hundreds of millions of rows) in two in only 5 hours
• Full read and write availability: shard-splitting process has no impact on live
application performance, functionality, or data consistency
Splitting shards: goals
21. Massively Sharded MySQL
• All tables using InnoDB
• All tables have an index that begins with your sharding key, and sharding scheme is range-based.
This plays nice with range queries in MySQL.
• No schema changes in process of split
• Disk is < 2/3 full, or thereʼs a second disk with sufficient space
• Keeping two standby slaves per shard pool (or more if multiple data centers)
• Uniform MySQL config between masters and slaves: log-slave-updates, unique server-id, generic log-
bin and relay-log names, replication user/grants everywhere
• No real slave lag, or already solved in app code
• Redundant rows temporarily on the wrong shard donʼt matter to app
Splitting shards: assumptions
22. Massively Sharded MySQL
Large “parent” shard divided into N “child” shards
Splitting shards: process
1. Create N new slaves in parent shard pool — these will soon become
masters of their own shard pools
2. Reduce the data set on those slaves so that each contains a different subset
of the data
3. Move app reads from the parent to the appropriate children
4. Move app writes from the parent to the appropriate children
5. Stop replicating writes from the parent; take the parent pool offline
6. Remove rows that replicated to the wrong child shard
26. Massively Sharded MySQL
To a single destination
Aside: copying files efficiently
• Our shard-split process relies on creating new slaves quickly, which involves
copying around very large data sets
• For compression we use pigz (parallel gzip), but there are other alternatives
like qpress
• Destination box: nc -l [port] | pigz -d | tar xvf -
• Source box: tar cvf - . | pigz | nc [destination hostname] [port]
27. Massively Sharded MySQL
To multiple destinations
Aside: copying files efficiently
• Add tee and a FIFO to the mix, and you can create a chained copy to multiple
destinations simultaneously
• Each box makes efficient use of CPU, memory, disk, uplink, downlink
• Performance penalty is only around 3% to 10% — much better than copying
serially or in parallel from a single source
• http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
29. Massively Sharded MySQL
• Divide ID range into many chunks, and then execute export queries in
parallel. Likewise for import.
• Export via SELECT * FROM [table] WHERE [range clause] INTO
OUTFILE [file]
• Use mysqldump to export DROP TABLE / CREATE TABLE statements, and
then run those to get clean slate
• Import via LOAD DATA INFILE
Importing/exporting in chunks
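A small sketch of how the per-chunk export statements can be generated for parallel execution; the table name, key column, chunk size, and output directory are assumptions:

```python
# Generate one SELECT ... INTO OUTFILE per ID-range chunk; the resulting
# statements can be run concurrently, then each file loaded back with
# LOAD DATA INFILE on the destination.
CHUNK = 100_000  # assumed chunk size; benchmark for your hardware

def export_statements(table, key, lo, hi, outdir="/tmp/export"):
    stmts = []
    for start in range(lo, hi, CHUNK):
        end = min(start + CHUNK, hi)
        stmts.append(
            f"SELECT * FROM {table} WHERE {key} >= {start} AND {key} < {end} "
            f"INTO OUTFILE '{outdir}/{table}.{start}'"
        )
    return stmts

stmts = export_statements("posts", "blog_id", 0, 250_000)
print(len(stmts))  # 3 chunks: [0,100k), [100k,200k), [200k,250k)
```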
30. Massively Sharded MySQL
• Disable binary logging before import, to speed it up
• Disable any query-killer scripts, or filter out SELECT ... INTO OUTFILE
and LOAD DATA INFILE
• Benchmark to figure out how many concurrent import/export queries can be
run on your hardware
• Be careful with disk I/O scheduler choices if using multiple disks with different
speeds
Importing/exporting gotchas
32. Massively Sharded MySQL
[Diagram] Parent master (blogs 1-1000, all app reads/writes) with two standby
slaves (blogs 1-1000) and two new child shard masters (blogs 1-500 and blogs
501-1000) replicating from it. Two slaves export different halves of the
dataset, drop and recreate their tables, and then import the data.
33. Massively Sharded MySQL
[Diagram] Same topology, now with two standby slaves behind each child shard
master (blogs 1-500 and blogs 501-1000). Every pool needs two standby slaves,
so spin them up for the child shard pools.
35. Massively Sharded MySQL
Moving reads and writes separately
• If configuration updates do not reach all of your web/app servers
simultaneously, moving reads and writes at the same time creates consistency
issues
• Web A gets new config, writes post for blog 200 to 1st child shard
• Web B is still on old config, renders blog 200, reads from parent shard, doesnʼt find the new
post
• Instead only move reads first; let writes keep replicating from parent shard
• After first config update for reads, do a second one moving writes
• Then wait for parent master binlog to stop moving before proceeding
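The two-push rule above can be sketched as a check on config changes; the config structure and names are invented for illustration:

```python
# Reads and writes each name the pool they target; safe rollouts move only
# one of the two per config push, since the parent keeps replicating writes
# to the children between pushes.
config_v1 = {"reads": "parent",   "writes": "parent"}
config_v2 = {"reads": "children", "writes": "parent"}    # push 1: reads move
config_v3 = {"reads": "children", "writes": "children"}  # push 2: writes move

def is_safe(old, new):
    # Moving both at once means a writer on the new config and a reader on
    # the old config can miss each other's rows.
    moved = [old["reads"] != new["reads"], old["writes"] != new["writes"]]
    return moved.count(True) <= 1

print(is_safe(config_v1, config_v2))  # True
print(is_safe(config_v1, config_v3))  # False: both moved in one push
```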
37. Massively Sharded MySQL
[Diagram] Reads now go to the children for queries in their ranges. The parent
master (blogs 1-1000) still takes all app writes, which replicate down to the
child shard masters (blogs 1-500 and blogs 501-1000) and their standby slaves.
40. Massively Sharded MySQL
[Diagram] Reads and writes now both go to the children for queries in their
ranges. The parent master (blogs 1-1000) and its standby slaves are no longer
in use by the app.
43. Massively Sharded MySQL
[Diagram] The children are now complete shards of their own, taking reads and
writes (but each still holds some surplus data that needs to be removed). The
parent master (blogs 1-1000) is set global read-only and its app user dropped.
44. Massively Sharded MySQL
[Diagram] The deprecated parent master and its two standby slaves (blogs
1-1000) are disconnected from the topology and marked to recycle soon. The
child shard masters (blogs 1-500 and blogs 501-1000) and their standby slaves
continue serving all reads and writes for their ranges.
46. Massively Sharded MySQL
• Until writes are moved to the new shard masters, writes to the parent shard
replicate to all its child shards
• Purge this data after shard split process is finished
• Avoid huge single DELETE statements — they cause slave lag, among other
issues (huge long transactions are generally bad)
• Data is sparse, so itʼs efficient to repeatedly do a SELECT (find next sharding
key value to clean) followed by a DELETE.
Splitting shards: cleanup
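The SELECT-then-DELETE loop can be sketched like this, with an in-memory table standing in for SQL along the lines of `SELECT MIN(blog_id) FROM posts WHERE blog_id >= ?` followed by `DELETE FROM posts WHERE blog_id = ?`; names and ranges are assumptions:

```python
# Cleanup loop: repeatedly find the next sharding-key value that doesn't
# belong on this child shard and delete just its rows, avoiding one huge
# DELETE transaction.
rows = {101: "a", 650: "b", 651: "c", 900: "d"}  # child shard owns blogs 1-500

def next_misplaced(after, lo_wrong=501):
    # Stand-in for: SELECT MIN(blog_id) FROM posts
    #               WHERE blog_id > after AND blog_id >= lo_wrong
    keys = [k for k in rows if k >= max(after + 1, lo_wrong)]
    return min(keys) if keys else None

blog_id = 0
while (blog_id := next_misplaced(blog_id)) is not None:
    del rows[blog_id]  # one small DELETE per key value, not one giant one

print(sorted(rows))  # [101]: only rows this shard owns remain
```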
47. Massively Sharded MySQL
• Sizing shards properly: as small as you can operationally handle
• Choosing ranges: try to keep them even, and better if they end in 000ʼs than
999ʼs or, worse yet, random ugly numbers. This matters once shard naming is
based solely on the ranges.
• Cutover: regularly create new shards to handle future data. Take the max
value of your sharding key and add a cushion to it.
• Before: highest blog ID 31.8 million, last shard handles range [30m, ∞)
• After: that shard now handles [30m, 32m) and new last shard handles [32m, ∞)
General sharding gotchas
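The cutover arithmetic above can be sketched as follows; the cushion and rounding values are assumptions chosen to reproduce the slideʼs 30m/32m example:

```python
# Cap the open-ended last shard at a clean boundary a cushion above the
# current max sharding key, and create a new open-ended last shard.
def cutover(ranges, max_key, cushion=200_000, round_to=1_000_000):
    lo, _ = ranges[-1]  # last shard is [lo, infinity)
    cap = ((max_key + cushion) // round_to) * round_to  # ends in 000's
    return ranges[:-1] + [(lo, cap), (cap, None)]  # None = open-ended

# Before: highest blog ID 31.8 million, last shard handles [30m, inf)
shards = [(0, 30_000_000), (30_000_000, None)]
print(cutover(shards, 31_800_000))
# [(0, 30000000), (30000000, 32000000), (32000000, None)]
```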