These slides cover a talk on using distributed computation for database queries. Moore's Law, Amdahl's Law and distribution techniques are highlighted, and a simple performance comparison is provided.
Shard-Query, an MPP database for the cloud using the LAMP stack - Justin Swanhart
This combined #SFMySQL and #SFPHP meetup talked about Shard-Query. You can find the video to accompany this set of slides here: https://www.youtube.com/watch?v=vC3mL_5DfEM
Conquering "big data": An introduction to Shard-Query - Justin Swanhart
This talk introduces Shard-Query, a massively parallel processing (MPP) middleware solution for MySQL.
Shard-Query is a federation engine which provides a virtual "grid computing" layer on top of MySQL. This can be used to access data spread over many machines (sharded) and also data partitioned in MySQL tables using the MySQL partitioning option. This is similar to using partitions for parallelism with Oracle Parallel Query.
This talk focuses on why Shard-Query is needed, how it works (at a high level), and the best schema to use with it. Shard-Query is designed to scan massive amounts of data in parallel.
Executing Queries on a Sharded Database - Neha Narula
Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.
Designing your SaaS Database for Scale with Postgres - Ozgun Erdogan
If you’re building a SaaS application, you probably already have the notion of tenancy built in your data model. Typically, most information relates to tenants / customers / accounts and your database tables capture this natural relation.
With smaller amounts of data, it’s easy to throw more hardware at the problem and scale up your database. As these tables grow however, you need to think about ways to scale your multi-tenant (B2B) database across dozens or hundreds of machines.
In this talk, we're first going to talk about motivations behind scaling your SaaS (multi-tenant) database and several heuristics we found helpful on deciding when to scale. We'll then describe three design patterns that are common in scaling SaaS databases: (1) Create one database per tenant, (2) Create one schema per tenant, and (3) Have all tenants share the same table(s). Next, we'll highlight the tradeoffs involved with each design pattern and focus on one pattern that scales to hundreds of thousands of tenants. We'll also share an example architecture from the industry that describes this pattern in more detail.
Last, we'll talk about key PostgreSQL properties, such as semi-structured data types, that make building multi-tenant applications easy. We'll also mention Citus as a method to scale out your multi-tenant database. We'll conclude with frequently asked questions on multi-tenant databases and a Q&A.
This talk covers the Vault 8 team's journey at Capital One where we investigated a wide variety of stream processing solutions to build a next-generation real-time decisioning platform to power Capital One's infrastructure.
The result of our analysis showed Apache Storm, Apache Flink, and Apache Apex as prime contenders for our use case with Apache Apex ultimately proving to be the solution of choice based on its present readiness for enterprise deployment and its excellent performance.
Empowering developers to deploy their own data stores - Tomas Doran
Empowering developers to deploy their own data stores using Terraform, Puppet and rage. A talk about automating server building and configuration for Elasticsearch clusters, using HashiCorp and Puppet Labs tools. Presented at Config Management Camp 2016 in Ghent.
A previous (OUTDATED) overview of resource management in Impala, relevant through Impala 2.2/CDH 5.4.
See the Cloudera documentation for the newest information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_howto_rm.html#impala_resource_management_example
We run multiple DataStax Enterprise clusters in Azure, each holding 300 TB+ of data, to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges faced and takeaways from running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down, and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network-attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building big data platform using Cassandra, Spark and Azure to generate per-user insights of Office 365 users.
What Every Developer Should Know About Database Scalability - jbellis
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real-world use cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
This presentation can help you to apply partitioning when appropriate, and to avoid problems when using it. The one-liner is: Simple Works Best. The illustrating demos are on PostgreSQL 12 (maybe 13 by the time of presenting) and show some of the problems and solutions that partitioning can provide. Some of this “experience” is quite old and the demo runs near-identical on Oracle…
These problems are the same on any database.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Terracotta (an open source technology) provides a clustered, durable virtual heap. Terracotta's goal is to make Java apps scale with as little effort as possible. If you are using Hibernate, there are several patterns that can be used to leverage Terracotta and reduce the load on your database so your app can scale.
First, you can use the Terracotta clustered Hibernate cache. This is a high-performance clustered cache that allows you to avoid hitting the database from all nodes in your cluster. It's suitable not just for read-only, but also for read-mostly and read-write use cases, which traditionally have not been viewed as good use cases for the Hibernate second-level cache.
Another high performance option is to disconnect your POJOs from their Hibernate session and manage them entirely in Terracotta shared heap instead. This is a great option for conversational data where the conversational data is not of long-term interest but must be persistent and highly-available. This pattern can significantly reduce your database load but does require more changes to your application than using second-level cache.
This talk will examine the basics of what Terracotta provides and examples of how you can scale your Hibernate application with both clustered second level cache and detached clustered state. Also, we'll take a look at Terracotta's Hibernate-specific monitoring tools.
Introduction to TokuDB v7.5 and Read Free Replication - Tim Callaghan
TokuDB v7.5 introduced Read Free Replication, allowing MySQL slaves to run with virtually no read I/O. This presentation discusses how Fractal Tree indexes work, what they enable in TokuDB, and how they allow TokuDB to uniquely offer this replication innovation.
Rainbows, Unicorns, and other Fairy Tales in the Land of Serverless Dreams - Josh Carlisle
When done correctly, serverless offers fantastic potential, but it can also lead to spectacular failure when critical concepts are overlooked. With over a dozen serverless implementations on Azure Functions over the last couple of years, I’ve learned some lessons the hard way. In this talk, I will share a few of the most impactful hard-earned lessons and how I was able to overcome them. I’ll touch on topics ranging from considerations for using traditional relational databases and managing service and data connections, to managing complexity and increasing observability. The talk is done in the context of Azure Functions, but the concepts apply equally to all serverless platforms.
Jay Kreps on Project Voldemort: Scaling Simple Storage at LinkedIn - LinkedIn
Jay Kreps on Project Voldemort: Scaling Simple Storage at LinkedIn. This presentation was given at QCon 2009 and is embedded on LinkedIn's blog - http://blog.linkedin.com/
This is a slide deck from Open Source Bridge that Thomas used for his session. It explains why redundant power and backup power are so critical, and why you should always back up your data.
Deploying MongoDB can be a challenge if you don't understand how resources are used or how to plan for the capacity of your systems. If you need to deploy, or grow, a MongoDB single instance, replica set, or tens of sharded clusters, then you probably share the same challenges in trying to size that deployment. This talk will cover what resources MongoDB uses, and how to plan for their use in your deployment. Topics covered will include understanding how to model and plan capacity needs for a new deployment, for growing an existing one, and for defining the steps along your scalability path. The goal of this presentation is to provide you with the tools needed to be successful in managing your MongoDB capacity planning tasks.
Caches are used in many layers of applications that we develop today, holding data inside or outside of your runtime environment, or even distributed across multiple platforms in data fabrics. However, considerable performance gains can often be realized by configuring the deployment platform/environment and coding your application to take advantage of the properties of CPU caches.
In this talk, we will explore what CPU caches are, how they work and how to measure your JVM-based application data usage to utilize them for maximum efficiency. We will discuss the future of CPU caches in a many-core world, as well as advancements that will soon arrive such as HP's Memristor.
This is the meat of the presentation. It describes in detail how to use anti-architecture to define what gets done, then discusses patterns, type systems, PaaS frameworks, services and components. There is a detailed explanation of Cassandra as a data store and of open source components.
4. Who am I?
My name is Justin “Swany” Swanhart
I’m a trainer at Percona
www.percona.com
5. FOSS I maintain
• Shard-Query
• MPP distributed query middleware for MySQL*
• Work is mostly divided up using sharding and/or table partitioning (divide)
• Distributes work over many machines in parallel (conquer)
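The divide-and-conquer approach above can be sketched in a few lines of Python. This is a toy scatter-gather model, not Shard-Query's actual API: the shard lists and `shard_sum` helper are hypothetical stand-ins for per-shard SQL.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list holds one shard's slice of the table.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def shard_sum(rows):
    """Stand-in for running 'SELECT SUM(x)' on a single shard."""
    return sum(rows)

# Divide: send the same aggregate query to every shard in parallel.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(shard_sum, shards))

# Conquer: combine the per-shard partial results into the final answer.
total = sum(partial_sums)
print(total)  # 45
```

Sums and counts distribute cleanly this way; aggregates like AVG must be rewritten (e.g. as SUM and COUNT) before they can be combined across shards.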
6. FOSS I maintain
• Flexviews
• Caches result sets and can refresh them efficiently based on only the database changes since the last refresh
• Refresh cost is directly proportional to the number of changes to the base data
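The incremental-refresh idea, with cost proportional to the number of changes, can be sketched as follows. This is a toy model, not Flexviews' implementation: the change log and group names are illustrative.

```python
# Cached result set: a materialized count per group.
cached_counts = {"a": 2, "b": 1}

# Only the changes since the last refresh, not the whole base table.
change_log = [("insert", "a"), ("insert", "b"), ("delete", "a")]

# Refresh by applying deltas; work done is O(len(change_log)),
# independent of the size of the base data.
for op, group in change_log:
    cached_counts[group] += 1 if op == "insert" else -1

print(cached_counts)  # {'a': 2, 'b': 2}
```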
9. The new age of multi-core
[Diagram: four CPUs, each containing many cores]
“If your time to you is worth saving, then you better start swimming. Or you'll sink like a stone. For the times they are a-changing.”
- Bob Dylan
10. Moore’s law
• The number of transistors in a CPU doubles every 24 months
• In combination with other component improvements, this allowed doubling of CPU clock speed approximately every 18 months
• Frequency scaling beyond a few GHz has extreme power requirements
• Increased power = increased heat
11. Moore’s law today
• Speed now doubles more slowly: 36 months
• Power efficient multiple-core designs have
become very mature
• Larger number of slower cores which use less
power in aggregate
13. Answer: It depends
A program is only faster on a
multiple core CPU if it can use
more than one core
14. Why?
1. A physical CPU may have many logical CPUs
2. Each logical CPU runs at most one thread at
a time
3. Max running threads:
1. Physical CPU count x cores per CPU x threads per core
• Dual CPU with 3 HT cores = 2 x 3 x 2 = 12 threads
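The formula above can be sanity-checked with a tiny helper (a minimal sketch, not tied to any real topology-detection API):

```python
def max_threads(physical_cpus, cores_per_cpu, threads_per_core):
    # Max running threads = physical CPU count x cores per CPU x threads per core
    return physical_cpus * cores_per_cpu * threads_per_core

# Dual CPU with 3 hyper-threaded cores each:
print(max_threads(2, 3, 2))  # → 12
```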
15. What is a thread?
• Every program uses at least one thread
• A multithreaded program can do more work
• But only if it can split the work into many
threads
• If it doesn’t split up the work, there is no
speedup.
16. What is a task?
• An action in the system which results in work
• A login
• A report
• An individual query
• Tasks are single or multi-threaded
• Tasks may have sub-tasks
17. Response Time
• Time to complete a task in wall clock time
• Response time includes the time to complete
all sub-tasks
• You must include queue time in response time
18. Throughput
• How many tasks complete in a given unit of
time
• Throughput is a measure of overall work done
by the system
19. Throughput vs Response time
Response time = Time / Task
Throughput = Tasks / Time
20. Efficiency
• This is a score for how well you scheduled the
work for your task. Inefficient workloads
underutilize resources
Efficiency = ( Resources Used / Resources Available ) x 100
21. Efficiency Example
Efficiency = ( Resources Used / Resources Available ) x 100
• CPU bound workload given 8 available cores
• When all work is done in a single thread: 12.5% CPU efficiency
• If the work can be split into 8 threads: 100% CPU efficiency
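The efficiency score works out as follows (an illustrative helper, not part of any tool mentioned in this talk):

```python
def cpu_efficiency(cores_used, cores_available):
    # Efficiency = (resources used / resources available) x 100
    return cores_used / cores_available * 100

print(cpu_efficiency(1, 8))  # → 12.5  (single-threaded on 8 cores)
print(cpu_efficiency(8, 8))  # → 100.0 (work split into 8 threads)
```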
22. Lost time is lost money
• You are paying for wall clock time
• Return on investment is directly proportional to efficiency.
• You can’t bank lost time
• If you miss an SLA or response time objective you lost the time
investment, plus you may have to pay a penalty
• It may be impossible to get critical insight
23. Scalability
When resources are added what happens to efficiency?
• Given the same workload, if throughput does
not increase:
• Adding even more resources will not improve
performance
• But you may have the resources for more
concurrent tasks
• This is the traditional database scaling model
24. Scalability
When resources are added what happens to efficiency?
• If throughput increases
• The system scales up to the workload
• When people ask if a system is scalable, this is usually
what they want to know
25. Scalability
When resources are added what happens to efficiency?
• If throughput goes down there is negative
scalability
• Mutexes are probably the culprit
• This is the biggest contention point for databases
with many cores
• This means you can’t just scale up a single
machine forever. You will always hit a limit.
26. Response time is very important!
• Example:
• It takes 3 days for a 1 day report
• It doesn’t matter if you can run 100 reports at once
• Response time for any 1 report is too high if you need to
make decisions today about yesterday
27. Workload
• A workload consists of many tasks
• Typical database workloads are categorized by
the nature of the individual tasks
28. OLTP
• Frequent highly concurrent reads and writes of
small amounts of data per query
• Simple queries, little aggregation
• Many small queries naturally divide the work
over many cores
29. Known OLTP scalability issues
• Many concurrent threads accessing critical
memory areas leads to mutex contention
• Reducing global mutex contention is the main
development focus
• Mutex contention is still the biggest bottleneck
• Prevents scaling up forever (32 cores max)
30. OLAP (analytical queries)
• Low concurrency reads of large amounts of
data
• Complex queries and frequent aggregation
• STAR schema common (data mart)
• Single table (machine generated data) common
• Partitioning very common
31. Known OLAP scalability issues
• IO bottleneck usually gets hit first
• However, even if all data is in memory it still
may be impossible to reach the response time
objective
• Queries may not be able to complete fast enough
to meet business objectives because:
• MySQL only supports nested loops*
• All queries are single threaded
* MariaDB has limited support for hash joins
32. You don’t need a bigger boat
• Buying a bigger server probably won’t help for
individual queries that are CPU bound.
• Queries are still single threaded.
33. You need to change the
workload!
34. You need to change the
workload!
• Turn OLAP into something more like OLTP
• Split one complex query into many smaller queries
• Run those queries in parallel
• Put the results together so it looks like nothing
happened
• This leverages multiple cores and multiple servers
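The split / run-in-parallel / put-back-together idea can be sketched in plain Python (the DATA dictionary and sum_partition function are illustrative stand-ins for partitions and small SQL queries, not Shard-Query code):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table partitioned by day
DATA = {day: list(range(day * 100)) for day in range(1, 11)}

def sum_partition(day):
    # In the real system this would be one small query against one
    # partition or shard, e.g. SELECT SUM(cnt) FROM fact WHERE date_id = <day>
    return sum(DATA[day])

# Run the subtasks in parallel (conquer)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_partition, DATA))

# The small serial "combine" step that makes it look like nothing happened
total = sum(partials)
print(total)  # → 1922250
```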
36. Enter Shard-Query
• Shard-Query transparently splits up complex
SQL into smaller SQL statements (sub tasks)
which run concurrently
• Proxy
• REST / HTTP GUI
• PHP OO interface
37. Why not get a different database?
• Because MySQL is a great database and
sticking with it has advantages
• Operations knows how to work with it (backups!)
• It is FOSS (alternatives are very costly or very limited)
• MySQL’s special dialect means changes to apps to move
to an alternative database
• The alternatives are either a proprietary RDBMS* or
map/reduce. For many reasons these are undesirable.
* Vertica, Greenplum, Vectorwise, AsterData, Teradata, etc…
38. Smaller queries are better queries
• This is closer to the OLTP workload that the
database has been optimized for
• Smaller queries use smaller temporary tables
resulting in more in-memory aggregation
• This reduces the usage of temporary tables on disk
• Parallelism reduces response time
39. Data level parallelism
• Table level partitioning
• One big table is split up into smaller tables
internally
• Reduce IO and contention on shared data
structures in the database.
• MySQL could operate on partitions in parallel
• But it doesn’t
40. Data level parallelism (cont)
• Sharding
• Horizontally partition data over more than one
server.
• Shards can naturally be operated on in parallel.
• This is called shared-nothing architecture, or data
level parallelism.
• This is also called scaling out.
42. Not just for sharding
• The name is misleading as you have choices:
• Partition your data on one big server to scale up
• Shard your data onto multiple servers to scale out
• Do both for extreme scalability
• Or neither, but with fewer benefits
43. Parallelism of SQL dataflow
• Shard-Query can add parallelism and improve
query performance based on SQL language
constructs:
• IN lists and subqueries are parallelized
• BETWEEN on date or integer operands adds
parallelism too
• UNION ALL and UNION queries are parallelized*
• Uncorrelated subqueries are materialized early, are
indexed and run in parallel (inside out execution)
* They have to have features necessary to enable parallelism.
You really need to partition and shard for best results
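The BETWEEN rule above can be illustrated like this (the query template and helper function are assumptions for demonstration, not Shard-Query's actual rewriter, which uses a full SQL parser):

```python
def split_between(template, column, low, high):
    # Expand `column BETWEEN low AND high` into one small query per value;
    # each resulting query can run on a different core, partition, or shard.
    return [template.format(cond=f"{column} = {v}") for v in range(low, high + 1)]

queries = split_between("SELECT SUM(cnt) FROM fact WHERE {cond}",
                        "date_id", 20080210, 20080212)
for q in queries:
    print(q)
# → SELECT SUM(cnt) FROM fact WHERE date_id = 20080210
#   SELECT SUM(cnt) FROM fact WHERE date_id = 20080211
#   SELECT SUM(cnt) FROM fact WHERE date_id = 20080212
```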
44. Shard-Query is not for OLTP
• The parser and execution strategy are “relatively
expensive”
• For OLAP this time is a small fraction of the overall
execution time, so the cost is “cheap”
• Trying to split up already-small OLTP queries is
wasted effort, and waste reduces efficiency
• But OLTP still caps out at 32 cores on a single box
45. Sharding for OLTP
• Use Shard-Key-Mapper
• Use this helper class to figure out which shard to use
• Then send queries directly to that shard (bypass parser)
• This could be made transparent with mysqlnd plugins
• You can then still use SQ when complex query
tasks are needed
• This is a topic for another day
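A toy mapper in the spirit of Shard-Key-Mapper (the function name and modulo scheme here are assumptions, not the helper's real API; the real class also supports directory-based mapping):

```python
def shard_for(key, shard_count=8):
    # Route by shard key without parsing the query, e.g. the Wikipedia
    # data set later in this deck is spread over 8 servers by hour_num.
    return key % shard_count

print(shard_for(23))  # hour 23 → shard 7
```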
46. Why: Amdahl’s Law
• Speedup of parallelizing a task is bounded
by the amount of serialized work the task
must perform
f(N) = 1 / ( (1–P) + P/N )
f(N) = maximum speedup by parallelizing a process
N = Number of threads that may be run in parallel
P = Portion, as a percentage, of the runtime of the program that may be parallelized
(1 – P) = Portion, as a percentage, of the runtime which is serialized
47. Amdahl’s Law Example    f(N) = 1 / ( (1–P) + P/N )
Your task: SUM a set of 10000 items
• Serially summing the items takes 10000* seconds
• Solution: Partition the set into 10 subsets
• Sum the items using one thread per partition
* Times are exaggerated for demonstration purposes.
48. Amdahl’s Law Example    f(N) = 1 / ( (1–P) + P/N )
• Summing the 10 subsets in parallel takes 1000
seconds (10000s/10 = 1000s)
• Add an extra 10 seconds to add up the 10 sub-
results
• This serial step is .1% of the original runtime
Max speedup = 1 / ( (1–0.999) + 0.999/10 ) ≈ 9.9X
* Times are exaggerated for demonstration purposes.
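Amdahl's Law from slide 46, expressed as a small helper to check the SUM example's numbers:

```python
def amdahl_speedup(p, n):
    # f(N) = 1 / ((1 - P) + P/N), with p the parallelizable fraction (0..1)
    # and n the number of threads that may run in parallel.
    return 1.0 / ((1.0 - p) + p / n)

# The SUM example: 99.9% of the work parallelizes over 10 threads
print(round(amdahl_speedup(0.999, 10), 2))  # → 9.91
print(amdahl_speedup(1.0, 10))              # → 10.0 (perfectly parallel)
```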
49. Amdahl’s Law Example 2
Your task: STDDEV a set of 10000 items
• Some aggregate functions like STDDEV can’t be
distributed like SUM
• Serially doing a STDDEV takes 10000 seconds
• Because it can’t be split, we are forced to use one
thread
• The STDDEV takes 10000 seconds (10000/1)
• Max speedup = 1.0 (no speedup)
• Overhead may reduce efficiency
50. You get to move all the pieces at the same time
[Diagram: chess pieces being moved simultaneously by threads T1–T64]
51. Shard-Query 2.0 Features
• Automatic sharding and massively parallel loading
• Add new shards then spread new data over them
automatically
• Long query support
• Supports asynchronous jobs for running long queries
• Complex query support
• GROUP BY, HAVING, LIMIT, ORDER BY, WITH ROLLUP,
derived tables, UNION, etc
54. Code Maturity
• Revision 1, May 22, 2010 0.0.1
• 270 lines of PHP code checked into Google Code
• Limited select support, no aggregation pushdown
• One developer (me) – only suited for limited audience as a POC
• Not object oriented, or a framework, just a simple CLI PHP app
• Revision 447, Jan 28, 2013 – Shard Query beta 2.0.0
• Over 9700 lines of PHP code
• Full coverage of SELECT statement plus custom aggregate function support
• Support for almost all DML and DDL operations (except create/drop database)
• Two additional developers
• Special thanks to Alex Hurd who is a production tester and contributor of the REST interface
• And Andre Rothe, who contributes to the SQL parser
• REST web interface, MySQL proxy Lua script, OO PHP library framework
• Feature complete and stable to a level fit for community use
56. Big Data: Wikipedia Traffic Stats
• Wikipedia traffic stats
• 2 months of data (Feb and Mar 2008)
• 180 GB raw data (single table)
• No PK
• Partitioned by date, one partition per day
• One index, on the wiki language (“en”, etc)
• 360 GB data directory after loading
• 7 years of data is available (15TB)
http://dumps.wikimedia.org/other/pagecounts-raw/
57. Big Data: Wikipedia Traffic Stats
CREATE TABLE `fact` (
  `date_id` int(11) NOT NULL,
  `hour_num` tinyint(4) NOT NULL,
  `page` varchar(1024) NOT NULL,
  `project` char(3) NOT NULL,
  `language` varchar(12) NOT NULL,
  `cnt` bigint(20) NOT NULL,
  `bytes` bigint(20) NOT NULL,
  KEY `language` (`language`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY RANGE (date_id)
(PARTITION p1 VALUES LESS THAN (20080101) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (20080102) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (20080103) ENGINE = InnoDB,
 …,
 PARTITION p91 VALUES LESS THAN (20080331) ENGINE = InnoDB,
 PARTITION p92 VALUES LESS THAN (20080401) ENGINE = InnoDB,
 PARTITION pmax VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB)
• The data set is sharded by hour_num to evenly distribute work between servers (each server has 1/8 of the data)
• SQ can split tasks over partitions too, for multiple tasks per shard
58. EC2 Instance Sizes
Instance Sizes Compared
Instance Size             Feature    Measure    Price per hour
m2.xlarge (The Pawn)      cores      2          0.41 USD
                          ecu        6.5
                          ecu/core   3.25
                          storage    420
                          memory     17.1
hs1.8xlarge (The King)    cores      16         3.1 USD
                          ecu        35
                          ecu/core   2.25
                          storage    2048
                          memory     60.5

Scale Out
Instance Size             Feature    1x Machine   8x Machines   8x Price per hour
m2.xlarge (The Pawn)      cores      2            16            3.28 USD
                          ecu        6.5          52
                          ecu/core   3.25         3.25
                          storage    420          3360
                          memory     17.1         136.8
hs1.8xlarge (The King)    cores      16           128
                          ecu        35           280
                          ecu/core   2.25         2.25
                          storage    2048         16384
                          memory     60.5         484
• For a few cents more we get much more CPU power in aggregate if eight smaller machines are used
• The working set fits into the RAM of 8 servers
60. Working Set
• The queries examine up to 15 days of data
• For best performance the working set must fit in memory
• Each partition contains approximately 8GB of data
• The working set is 120GB of InnoDB pages (60GB raw)
• It doesn’t fit on a single server
http://dumps.wikimedia.org/other/pagecounts-raw/
61. Don’t bet on the king…
Single-Threaded Query Performance, on the unsharded data set:
select count(*)
from stats.fact
where date_id = 20080214;
+----------+
| count(*) |
+----------+
| 84795020 |
+----------+
• 8 shards should be able to do the work in ~18s or ~2 seconds, depending on where the bottleneck is.
Instance Sizes Compared (again)
Instance Size             Performance COLD / HOT   Price per core   Price per hour
m2.xlarge (The Pawn)      2m22s / 33s              .21/core         0.41 USD
hs1.8xlarge (The King)    1m10s / 43s              .19/core         3.1 USD
63. Where to go from here
• Shard-Query is not the answer to every performance problem
• Flexviews can be used too
• This helps when the working set is too large to keep in memory
even with many machines
• Cost of aggregation is amortized over time, and distributed
over many machines if you still shard
• Scales reads, writes, aggregation
• But it is an excellent addition to your toolbox
64. Distributed row store w/ Galera
• Each shard is a Percona XtraDB Cluster or MariaDB Galera cluster
• Supports massive ingestion rates via the MPP loader
• Real-time ad-hoc complex querying
• All the components support HA
• Galera, Gearman, Apache, PHP, MySQL proxy
• Redundancy can be fully geographically distributed
• Use partitioning at the table level too
65. Distributed document store
• Each shard is a MariaDB cluster
• Use dynamic columns and anchor models for dynamic schema
66. Distributed column store
• Each shard is an Infobright database
• Pros
• Hash joins, small data footprint, no indexes
• Cons
• Single threaded loading (one thread per shard, can’t take
advantage of MPP loader)
• Append only (LOAD DATA INFILE)
• No partitions
• SQL harder to parallelize for scale up
68. I’m also speaking at:
1. Detecting and preventing SQL injection with the Percona Toolkit
2. Divide and Conquer in the cloud: One big server or many small ones?
3. Introduction to open source column stores
PLMCE.com
April 2013