This combined #SFMySQL and #SFPHP meetup talked about Shard-Query. You can find the video to accompany this set of slides here: https://www.youtube.com/watch?v=vC3mL_5DfEM
Conquering "big data": An introduction to shard queryJustin Swanhart
This talk introduces Shard-Query, an MPP distributed parallel processing middleware solution for MySQL.
Shard-Query is a federation engine which provides a virtual "grid computing" layer on top of MySQL. It can be used to access data spread over many machines (sharded) and also data partitioned in MySQL tables using the MySQL partitioning option. This is similar to using partitions for parallelism with Oracle Parallel Query.
This talk focuses on why Shard-Query is needed, how it works (at a high level), and the best schema to use with it. Shard-Query is designed to scan massive amounts of data in parallel.
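The scatter-gather idea behind such middleware — push the same aggregation to every shard in parallel, then merge the partial results — can be sketched as follows. This is a toy illustration under assumed data, not Shard-Query's actual implementation; each "shard" here is just an in-memory list standing in for a separate MySQL server.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each would normally be a separate MySQL server.
shards = [
    [("2023-01-01", 10), ("2023-01-02", 5)],
    [("2023-01-01", 7)],
    [("2023-01-02", 3), ("2023-01-03", 8)],
]

def scan_shard(rows):
    """Per-shard work: compute a partial SUM grouped by day."""
    partial = {}
    for day, amount in rows:
        partial[day] = partial.get(day, 0) + amount
    return partial

def parallel_group_sum(shards):
    """Scatter the query to all shards, then merge partial aggregates."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(scan_shard, shards)
    merged = {}
    for partial in partials:
        for day, total in partial.items():
            merged[day] = merged.get(day, 0) + total
    return merged

print(parallel_group_sum(shards))
```

The key property is that SUM (like COUNT, MIN, MAX) decomposes cleanly into per-shard partials, which is what makes this style of middleware possible without per-row coordination.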
Executing Queries on a Sharded Database (Neha Narula)
Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.
These slides cover a talk on using distributed computation for database queries. Moore's Law, Amdahl's Law and distribution techniques are highlighted, and a simple performance comparison is provided.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Designing your SaaS Database for Scale with Postgres (Ozgun Erdogan)
If you’re building a SaaS application, you probably already have the notion of tenancy built into your data model. Typically, most information relates to tenants / customers / accounts, and your database tables capture this natural relationship.
With smaller amounts of data, it’s easy to throw more hardware at the problem and scale up your database. As these tables grow, however, you need to think about ways to scale your multi-tenant (B2B) database across dozens or hundreds of machines.
In this talk, we're first going to talk about motivations behind scaling your SaaS (multi-tenant) database and several heuristics we found helpful on deciding when to scale. We'll then describe three design patterns that are common in scaling SaaS databases: (1) Create one database per tenant, (2) Create one schema per tenant, and (3) Have all tenants share the same table(s). Next, we'll highlight the tradeoffs involved with each design pattern and focus on one pattern that scales to hundreds of thousands of tenants. We'll also share an example architecture from the industry that describes this pattern in more detail.
Last, we'll talk about key PostgreSQL properties, such as semi-structured data types, that make building multi-tenant applications easy. We'll also mention Citus as a method to scale out your multi-tenant database. We'll conclude by answering frequently asked questions on multi-tenant databases and Q&A.
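Of the three design patterns above, pattern (3) — all tenants sharing the same table(s) — is the one that typically scales furthest. A minimal sketch of it, using SQLite as a stand-in for Postgres (table and column names are illustrative, not from the talk):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        tenant_id    INTEGER NOT NULL,  -- every row belongs to one tenant
        order_id     INTEGER NOT NULL,
        amount_cents INTEGER NOT NULL,
        PRIMARY KEY (tenant_id, order_id)  -- tenant_id leads every key
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, 999), (1, 101, 1999), (2, 100, 500)],
)

def tenant_total(conn, tenant_id):
    # Every query is scoped by tenant_id, so a sharding layer can
    # route it to whichever node holds that tenant's rows.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM orders WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchone()
    return row[0]

print(tenant_total(conn, 1))  # 2998
```

Because tenant_id leads every key and every query, rows for one tenant stay together, which is what lets a system like Citus place hundreds of thousands of tenants across many machines.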
Conquering "big data": An introduction to shard queryJustin Swanhart
This talk introduces Shard-Query, an MPP distributed parallel processing middleware solution for MySQL.
Shard-Query is a federation engine which provides a virutal "grid computing" layer on top of MySQL. This can be used to access data spread over many machines (sharded) and also data partitioned in MySQL tables using the MySQL partitioning option. This is similar to using partitions for parallelism with Oracle Parallel Query.
This talk focuses on why Shard-Query is needed, how it works (not detailed) and the best schema to use with it. Shard-Query is designed to scan massive amounts of data in parallel.
Executing Queries on a Sharded DatabaseNeha Narula
Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.
These slides cover a talk on using distributed computation for database queries. Moore's Law, Amdahl's Law and distribution techniques are highlighted, and a simple performance comparison is provided.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Designing your SaaS Database for Scale with PostgresOzgun Erdogan
If you’re building a SaaS application, you probably already have the notion of tenancy built in your data model. Typically, most information relates to tenants / customers / accounts and your database tables capture this natural relation.
With smaller amounts of data, it’s easy to throw more hardware at the problem and scale up your database. As these tables grow however, you need to think about ways to scale your multi-tenant (B2B) database across dozens or hundreds of machines.
In this talk, we're first going to talk about motivations behind scaling your SaaS (multi-tenant) database and several heuristics we found helpful on deciding when to scale. We'll then describe three design patterns that are common in scaling SaaS databases: (1) Create one database per tenant, (2) Create one schema per tenant, and (3) Have all tenants share the same table(s). Next, we'll highlight the tradeoffs involved with each design pattern and focus on one pattern that scales to hundreds of thousands of tenants. We'll also share an example architecture from the industry that describes this pattern in more detail.
Last, we'll talk about key PostgreSQL properties, such as semi-structured data types, that make building multi-tenant applications easy. We'll also mention Citus as a method to scale out your multi-tenant database. We'll conclude by answering frequently asked questions on multi-tenant databases and Q&A.
Cloudera Impala: A Modern SQL Engine for Hadoop (Cloudera, Inc.)
This is a technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-source codebase that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, begins with an overview of Impala from the user's perspective, followed by an overview of Impala's architecture and implementation, and concludes with a comparison of Impala with Dremel, Apache Hive, commercial MapReduce alternatives, and traditional data warehouse infrastructure.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
Citus Architecture: Extending Postgres to Build a Distributed Database (Ozgun Erdogan)
Citus is a distributed database that scales out Postgres. By using the extension APIs, Citus distributes your tables across a cluster of machines and parallelizes SQL queries. This talk describes the Citus architecture by focusing on our learnings in distributed systems. We first describe how Citus leverages PostgreSQL's extension APIs. These APIs are rich enough to store distributed metadata, add new commands to Postgres to help with sharding, parallelize and execute queries in a distributed cluster, and handle automatic failover of machines. Second, we show the architecture of a distributed query planner. We first describe the join order planner and describe how it chooses between broadcast, co-located, and repartition joins to minimize network I/O. We then show how we map SQL queries into distributed relational algebra, and optimize these plans for parallel execution. Third, we note a primary challenge in distributed systems. No single executor works great for all workloads. We show how Citus chooses between three executors, each one optimized for a different workload: NoSQL, operational analytics, and data warehousing. We then conclude with a demo that shows Citus running on a large cluster.
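One small piece of the distributed-planner story — mapping an SQL aggregate into partial aggregates that workers compute and a coordinator merges — can be sketched like this. This is illustrative only, not Citus code: a distributed AVG cannot simply average per-worker averages, so planners typically rewrite it into per-worker (sum, count) pairs.

```python
def worker_partial_avg(rows):
    """Each worker returns (sum, count) instead of a local average."""
    return (sum(rows), len(rows))

def coordinator_avg(partials):
    """Merge partials: a distributed AVG is sum(sums) / sum(counts)."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Three hypothetical shards holding different row counts.
shards = [[10, 20], [30], [40, 50, 60]]
partials = [worker_partial_avg(rows) for rows in shards]
print(coordinator_avg(partials))  # 35.0
```

Note that naively averaging the three local averages (15, 30, 50) would give the wrong answer; decomposing into commutative partials is what makes the parallel plan correct.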
This presentation can help you to apply partitioning when appropriate, and to avoid problems when using it. The one-liner is: Simple Works Best. The illustrating demos are on Postgres 12 (maybe 13 by the time of presenting) and show some of the problems and solutions that partitioning can provide. Some of this “experience” is quite old, and the demo runs near-identically on Oracle…
These problems are the same on any database.
A brave new world in mutable big data relational storage (Strata NYC 2017) (Todd Lipcon)
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, combined with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, from selecting the matching use cases, to determining the number of servers needed, to performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
A basic introduction to Cassandra, covering its architecture and strategies, the big data challenge, and what a NoSQL database is.
The Big Data Challenge
The Cassandra Solution
The CAP Theorem
The Architecture of Cassandra
The Data Partition and Replication
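The "Data Partition and Replication" topic above can be illustrated with a toy consistent-hash token ring, a simplification of how Cassandra places and replicates rows (node names, hash function, and replication factor here are made up for the example):

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy consistent-hash ring: each node owns one token; a key is
    placed on the first node whose token follows the key's hash, and
    replicated to the next (rf - 1) nodes around the ring."""

    def __init__(self, nodes, rf=2):
        self.rf = rf
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key):
        tokens = [t for t, _ in self.ring]
        # First node clockwise from the key's position on the ring.
        start = bisect_right(tokens, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.rf)]

ring = TokenRing(["node-a", "node-b", "node-c"], rf=2)
print(ring.replicas("user:42"))  # two distinct replica nodes
```

Real Cassandra uses virtual nodes and pluggable replication strategies, but the core idea — hash the partition key onto a ring, walk clockwise for replicas — is the same.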
A talk that discusses two topics regarding Elasticsearch, multi-tenancy and scalability, and the technical details of achieving them efficiently.
Data warehouse 26: exploiting parallel technologies (Vaibhav Khanna)
In-Memory parallel execution takes advantage of this large aggregated database buffer cache. By having parallel execution servers access objects using the database buffer cache, they can scan data at least ten times faster than they can on disk.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source, community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
In this second part on Redshift, we present the case study of Movile, a mobile-commerce leader with 50 million users, and examine advanced topics such as compression, built-in SQL macros, and multidimensional indexes for large databases.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computation and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated directly; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
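As a baseline for the optimizations above, a minimal monolithic power-iteration PageRank might look like the sketch below. The dead-end handling here (redistributing a dangling vertex's rank uniformly) is one common convention, not necessarily the report's exact method; the graph is a made-up three-node example.

```python
def pagerank(graph, d=0.85, tol=1e-10, max_iter=100):
    """Plain power-iteration PageRank; graph maps node -> out-links.
    Dead ends (no out-links) spread their rank over all nodes."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            for u in graph[v]:
                new[u] += d * rank[v] / len(graph[v])
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            rank = new
            break
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": []})
print(max(ranks, key=ranks.get))  # "b" collects the most rank
```

Levelwise PageRank keeps these same per-vertex update rules but restricts each pass to one level of the component DAG, which is why the dead-end precondition matters: dangling rank would otherwise leak across levels.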
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Introduction
Presenter
• Justin Swanhart
• Principal Support Engineer at Percona
• Previously a trainer and consultant at Percona too
Developer
• Swanhart-tools
• Shard-Query – MPP sharding middleware for MySQL
• Flexviews – Materialized views (fast refresh) for MySQL
• bcmath UDF – arbitrary precision math for MySQL
3. Intended Audience
• MySQL users with data too large to query efficiently using a single machine
• Big Data
• Analytics / OLAP
• User generated content analysis
• People interested in distributed database processing
5. MPP – Massively Parallel Processing
• An MPP system is a system that can process a SQL statement in parallel on a single machine or even many machines
• A collection of machines is often called a Grid
• MPP is also sometimes called Grid Computing
6. MPP (cont)
• Not many open source databases (none?) support MPP
• Community editions of closed source offerings are limited
• Some closed source databases include Vertica, Greenplum, Redshift
7. The Cloud
• Managed collection of virtual servers
• Easy to add servers on demand
• Ideal for a federated, distributed database grid
• Easy to “scale up” by moving to a VM with more cores
• Easy to “scale out” by adding machines
• Amazon is one of the most popular cloud environments
8. LAMP stack
• Linux
• Amazon Linux
• RHEL
• Ubuntu LTS, etc.
• Apache Web Server
• Most popular web server on the planet
• MySQL
• The world’s most popular open source database
• PHP
• High level language makes development easier
9. Database Middleware
• A piece of software that sits between an end-user application and
the database
• Operates on the queries submitted by the application, then
returns the results to the application
• Usually a proxy of some sort
• MySQL proxy is the open source user configurable proxy for MySQL
• Supports Lua scripts which intercept queries
• Shard-Query can use MySQL Proxy out of the box
10. Message Queue / Job Server
• Accepts jobs or messages and places them in a queue
• A worker reads jobs/messages from the queue and acts on them
• Offers support for asynchronous jobs
• Gearman
• My job server of choice for PHP
• Has two different PHP interfaces (pear and pecl)
• SQ comes bundled with a modified version* of the pear interface
• Excellent integration with MySQL as well (UDF)
* Removes warnings triggered by modern PHP strict mode
11. Sharding
• It is short for Shared Nothing
• Means splitting up your data onto more than one machine
• Tables that are split up are called sharded tables
• Lookup tables are not sharded. In other words, they must be
duplicated on all nodes
• Shard-Query supports directory based or hash based sharding
12. Shard mapper
• Shard-Query supports DIRECTORY and HASH mapping out of the
box
• DIRECTORY based sharding allows you to add or remove shards
from the system, but lookups may go over the network, reducing
performance* compared to HASH mapping
• HASH based sharding uses a hash algorithm to balance rows over
the sharded database. However, since a HASH algorithm is used,
the number of database shards can not change after initial data
loading.
* But only for queries like “select count(*) from table where customer_id = 50”
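The two mapping strategies above can be contrasted in a short sketch. This is illustrative Python, not Shard-Query's actual implementation; the names are invented:

```python
# Minimal sketch contrasting the two shard-mapping strategies.
SHARD_COUNT = 4

def hash_shard(shard_key_value):
    """HASH mapping: deterministic and lookup-free, but SHARD_COUNT
    is fixed once data has been loaded."""
    return hash(shard_key_value) % SHARD_COUNT

# DIRECTORY mapping: an explicit key -> shard table (normally stored in a
# database, so lookups may cross the network).
directory = {}

def directory_shard(shard_key_value):
    # New keys can be assigned to any shard, so shards can be added later.
    if shard_key_value not in directory:
        directory[shard_key_value] = len(directory) % SHARD_COUNT
    return directory[shard_key_value]
```

The trade-off in the slide falls out of the code: `hash_shard` needs no stored state, while `directory_shard` needs a lookup per key but lets the shard set grow.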
13. What is “big data”?
Most machine generated data
• Line order information for a large organization like Wal-Mart™
• Any data so large that you can’t effectively operate on it on one
machine
• For example, an important query that needs to run daily executes in
greater than 24 hours. It is impossible to meet the daily goal unless
you can find a way to make the query execute faster.
• These kinds of problems can happen on relatively small amounts of
data (tens of gigabytes)
14. Analytics (OLAP) versus OLTP
• OLTP is focused on short lived small transactions that read or
write small amounts of data
• OLAP is focused on bulk loading and reading large amounts of
data in a single query.
• Aggregation queries are OLAP queries
• Shard-Query is designed for analytics (OLAP) not OLTP
• must parse all commands sent to it (and make multiple round trips)
• Minimum query time of around 20ms
16. Single thread queries in the database
• MySQL, PostgreSQL, Firebird and all other major open source
databases have single threaded queries
• This means that a single query can only ever utilize the resources
of a single core
• As the data size grows, analytical queries get slower and slower
• Even in memory, speed decreases as the data grows, because the data is
scanned by a single thread
• As the number of rows to be examined increases, performance
decreases
17. Why single threaded
• MySQL is optimized for getting small amounts of data
quickly (OLTP)
• It was created at a time when having more than one CPU was not
common
• Adding parallelism now is a very complex task, particularly since
MySQL supports multiple storage engines
• So adding parallel query is not a high priority (not even on the
roadmap)
• Designed to run LOTS of small queries simultaneously, not one
big query
18. Single Threading – bad for IO
• If the data set is significantly larger than memory, single threaded
queries often cause the buffer pool to “churn”
• For example, small lookup tables can easily be pushed out of the buffer
pool, resulting in frequent IO to look up values
• While SSD may help somewhat, one database thread cannot read
from an SSD at maximum device capacity
• While the disk may be capable of 1000+ MB/sec, a single thread is
generally limited to <100MB/sec (usually 30-40)
• This is because a single thread shares doing IO AND running the query
on one CPU (MySQL does not use read threads for queries)
19. The OLAP Example
• A large company maintains a star schema of their sales history for
analytics purposes
• This company likes to present a sum total of orders for all time on
the dashboard
• In the beginning the query is very fast
• It gets slower, though, as months of data are added and as the business
grows, data increases too
• Eventually the query takes more than 24 hours to run, which means it
can no longer be updated daily
• “Drill down” gets slower as data increases
20. What can be done?
• Caching?
• Materialized views?
• Partitioning?
• Sharding?
21. Making OLAP more like OLTP!
• Shard-Query breaks one big query up into smaller queries that can
access the database in parallel
• Partitioning and sharding are used to keep data size for any single
query to a minimum
• If your table has 16 partitions, you can get up to 16 way parallelism
• If you also have 2 nodes, you get 32 way parallelism, and so on
• You can use multiple database schemas on a single server instead (a
form of sharding) if you don’t partition your data
23. Sharding Reviewed
• A sharded database contains multiple nodes or databases called
shards
• One physical machine might host many shards
• Each shard has identical schema
24. Sharding Reviewed (cont)
• The multiple shards function together as one RDBMS system.
• You can think of the shards as a big UNION ALL of the data, with
only a portion of the data on any one machine
• A mechanism must control on which server to place
particular pieces of data.
• In Shard-Query a particular column controls data placement – this
is called the shard key
25. Sharding – Data distribution
• There are usually one or two large tables that are sharded
• These are usually called FACT tables
• An example might be blogs, blog_posts and blog_comments. All three
share a “shard key” of blog_id
• Most common case is one big table with smaller lookup tables
26. Sharding Reviewed (cont)
• The shard key is very important!
• Since a specific column acts as the “shard key”, all sharded tables must
contain the shard key.
• For example: blog_id might be the shard key.
• The rows for a specific blog_id are then located on the same shard in
any table that has the blog_id column
27. Optimization - Shard Elimination
• When Shard-Query sees an expression on the shard key it looks up*
the shard that contains the appropriate data and only sends queries to
the necessary shards.
• Equality lookup is most efficient, but IN, BETWEEN and other operators are allowed
as well
• Lookups may not use subqueries (i.e., blog_id IN (1,2,3) is okay, but not blog_id IN (select
…))
• This is called “shard elimination”
• Shard elimination is analogous to partition elimination.
• where blog_id = 10, for example
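Shard elimination can be sketched in a few lines. This is a hypothetical illustration, not Shard-Query's code; `shard_for` stands in for the HASH/DIRECTORY mapper:

```python
# Sketch of shard elimination: given the shard-key values from an
# equality, IN or BETWEEN predicate, send the query only to the shards
# that can hold matching rows.
SHARD_COUNT = 4

def shard_for(key):
    """Stand-in for the real HASH/DIRECTORY shard mapper."""
    return key % SHARD_COUNT

def eliminate_shards(shard_key_values):
    """Return the set of shards a query must be sent to.
    A subquery cannot be resolved here, which is why predicates like
    blog_id IN (SELECT ...) do not get shard elimination."""
    return {shard_for(v) for v in shard_key_values}

# WHERE blog_id IN (1, 2, 3) -> only shards {1, 2, 3} are queried
# WHERE blog_id = 10         -> only shard 2 (10 % 4) is queried
```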
28. Can Shard-Query help on 1 machine?
• Yes! - Use MySQL partitioning on a single machine
• Shard-Query can access the partitions of a table in parallel!
• This means that if you have many partitions, then Shard-Query can
utilize many cores to answer the query
Use partitions for
parallelism
29. How does that work?
• Shard-Query executes an EXPLAIN PLAN on the query
• This EXPLAIN PLAN shows the partitions that MySQL will access
when running the query
• Shard-Query uses the 5.6 PARTITION hint to generate one query
per partition
• These queries can execute in parallel
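The per-partition rewrite can be approximated like this. A hedged sketch only: the helper and its parameters are invented, and the query shapes mirror the examples later in the deck:

```python
# Sketch of the rewrite: after EXPLAIN reports which partitions the
# query touches, emit one query per partition using the MySQL 5.6
# PARTITION hint, so the queries can run in parallel.
def per_partition_queries(table, partitions, select_list, where="1=1"):
    return [
        f"SELECT {select_list} FROM {table} PARTITION({p}) AS `{table}` "
        f"WHERE {where}"
        for p in partitions
    ]

queries = per_partition_queries("lineorder", ["p0", "p1", "p2", "p3"],
                                "COUNT(*) AS cnt")
# -> four queries, one per partition, each scanning a quarter of the table
```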
30. Sharding can help too
• How?
• Shard-Query adds parallelism to queries by spreading them over nodes
in parallel
• Spread the data over four nodes and queries can run up to 4x faster
MySQL database shards
Shard-Query
31. Sharding + Partitioning is best
• Why?
• Partition the tables to add parallelism to each node
• Use sharding to have multiple nodes working together
• 4 nodes with 3 partitions each = 12 way parallelism
Shard-Query
MySQL database shards
Partitions
33. Configuration Repository
• Shard-Query stores all configuration information in a MySQL
database called the configuration repository
• This should be a highly available replication pair (or XtraDB
cluster) for HA
• Web interface can change the settings
• Manual settings changes can be done via SQL
• schemata_config table in Shard-Query repository
• Makes using Shard-Query easier, especially when using more than
one node
34. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
Interfaces
Communication
Workers
Storage
Config
Repository
Configuration
Management
35. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
Gearman job server
• Provides the parallel mechanism
for Shard-Query
• Multiple Gearman servers are
supported for HA
• Enables Shard-Query to use a
map/reduce like architecture
• Sends jobs to workers when they
arrive at the queue
• If all workers are busy the job
waits
36. Gearman at a glance
Shard-Query OO
Store-resultset
Loader worker
SQ run SQL worker
37. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
Three kinds of workers
• loader_worker – Listens for
loader jobs and executes them.
Used by parallel loader.
• shard_query_worker – Listens
for SQL jobs, runs the job via
Shard-Query and returns the
results as JSON. Used by web
and proxy interfaces.
• store_resultset_worker – Main
worker used by Shard-Query. It
runs SQL and stores the result
in a table.
38. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
PHP Object Oriented Interface
• Very simple to use
• Constructor parameters are
usually not even needed
• Just one function to run a SQL
query and get results back
• Complete example comes with
Shard-Query as:
bin/run_query
39. PHP OO Example (from bin/run_query):
$shard_query = new ShardQuery();
$stime = microtime(true);
$stmt = $shard_query->query($sql);
$endtime = microtime(true);
if(!empty($shard_query->errors)) {
echo "ERRORS RETURNED BY OPERATION:\n";
print_r($shard_query->errors);
}
if(is_resource($stmt) || is_object($stmt)) {
$count=0;
while($row = $shard_query->DAL->my_fetch_assoc($stmt)) {
print_r($row);
++$count;
}
echo "$count rows returned\n";
$shard_query->DAL->my_free_result($stmt);
} else {
if(!empty($shard_query->info)) print_r($shard_query->info);
echo "no query results\n";
}
echo "Exec time: " . ($endtime - $stime) . "\n";
Simple data access layer
comes with Shard-Query
Errors are returned as a member
of the object
Run the query
40. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
Apache web interface
• GUI
• Easy to set up
• Run queries and get results
• Serves as an example of using
Shard-Query in a web app with
asynchronous queries
• Submits queries via Gearman
• Simple HTTP authentication
41. PHP OO Apache
Web
Interface
MySQL
Proxy
Gearman Message Queue
Worker Worker Worker Worker
MySQL database shards
Shard-Query Architecture
MySQL Proxy Interface
• Lua script for MySQL Proxy
• Supports most SHOW
commands
• Intercepts queries, and sends
them to Shard-Query using the
MySQL Gearman UDF
• Serves as another example of
using Gearman to execute
queries.
• Behaves slightly differently than
MySQL for some commands
42. Query submitted
SQL is parsed
Query rewrite
for parallelism
yields multiple
queries
Gearman Jobs
(map/combine)
Final Aggregation
(reduce)
Return result
Shard-Query Data Flow
Map/reduce like workflow
43. Query submitted
SQL is parsed
Query rewrite
for parallelism
yields multiple
queries
Gearman Jobs
(map/combine)
Final Aggregation
(reduce)
Return result
Shard-Query Data Flow
44. SQL Parser
• Find it at http://github.com/greenlion/php-sql-parser
• Supports
• SELECT/INSERT/UPDATE/DELETE
• REPLACE
• RENAME
• SHOW/SET
• DROP/CREATE INDEX/CREATE TABLE
• EXPLAIN/DESCRIBE
Used by SugarCRM too, as
well as other open source
projects.
45. Query submitted
SQL is parsed
Query rewrite
for parallelism
yields multiple
queries
Gearman Jobs
(map/combine)
Final Aggregation
(reduce)
Return result
Shard-Query Data Flow
46. Query Rewrite for parallelism
• Shard-Query has to manipulate the SQL statement so that it can
be executed over more than one partition or machine
• COUNT() turns into SUM of COUNTs from each query
• AVG turns into SUM and COUNT
• SEMI-JOIN is turned into a materialized join
• STDDEV/VARIANCE are rewritten as well, using the sum-of-squares
method
• Push down LIMIT when possible
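Why these rewrites produce correct answers can be seen with partial aggregates. This is an illustrative Python sketch under invented names, not Shard-Query's implementation:

```python
# Decomposable aggregates are computed per shard/partition and
# recombined: COUNT becomes a SUM of COUNTs, AVG becomes SUM / COUNT,
# and VARIANCE uses the sum-of-squares method.
def combine(partials):
    """partials: per-shard dicts with cnt, sum and sum of squares."""
    n = sum(p["cnt"] for p in partials)      # COUNT = SUM of COUNTs
    s = sum(p["sum"] for p in partials)      # SUM of SUMs
    ss = sum(p["sum_sq"] for p in partials)  # SUM of squares
    avg = s / n                              # AVG = SUM / COUNT
    variance = ss / n - avg * avg            # E[x^2] - E[x]^2
    return {"count": n, "sum": s, "avg": avg, "variance": variance}

# Two shards holding values [1, 2] and [3, 4]:
shard_a = {"cnt": 2, "sum": 3, "sum_sq": 5}
shard_b = {"cnt": 2, "sum": 7, "sum_sq": 25}
result = combine([shard_a, shard_b])
# count=4, sum=10, avg=2.5, variance=1.25 -- same as computing over
# all four values at once
```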
47. Query Rewrite for parallelism (cont)
• Because lookup tables are duplicated on all shards, the query
executes in a shared-nothing way
• All joins, filtering and aggregation are pushed down
• Means very little data must flow between nodes in most cases
• High performance
• Meets or beats Amazon Redshift in testing at 200GB of data
48. Query submitted
SQL is parsed
Query rewrite
for parallelism
yields multiple
queries
Gearman Jobs
(map/combine)
Final Aggregation
(reduce)
Return result
Shard-Query Data Flow
49. Map/Combine
• The store_resultset gearman worker runs SQL and stores the result
in a table
• To keep the number of rows in the table (and the time it takes to
aggregate results in the end) small, an INSERT … ON DUPLICATE
KEY UPDATE (ODKU) statement is used when inserting the rows
• There is a UNIQUE KEY over the GROUP BY attributes to facilitate
the upsert
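The ODKU merge described above behaves like a dict-based upsert. A hedged sketch, not the worker's actual code; in Shard-Query the "dict" is a real table with a UNIQUE KEY over the GROUP BY columns:

```python
# The INSERT ... ON DUPLICATE KEY UPDATE over the GROUP BY key merges
# partial counts as rows arrive, keeping the intermediate table small.
def store_resultset(result_table, rows):
    """rows: (group_key, partial_count) pairs from one shard query."""
    for key, cnt in rows:
        if key in result_table:          # duplicate key ->
            result_table[key] += cnt     # expr = expr + VALUES(expr)
        else:
            result_table[key] = cnt      # plain insert

table = {}
store_resultset(table, [(19930101, 10), (19930102, 5)])  # shard 1
store_resultset(table, [(19930101, 7)])                  # shard 2
# table[19930101] == 17: counts merged during the map/combine step,
# so final aggregation has less work to do
```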
50. Query submitted
SQL is parsed
Query rewrite
for parallelism
yields multiple
queries
Gearman Jobs
(map/combine)
Final Aggregation
(reduce)
Return result
Shard-Query Data Flow
51. Final aggregation
• Shard-Query has to return a proper result, combining the results
in the result table together to return the correct answer
• Again, for example COUNT must be rewritten as SUM to combine
all the counts (from each shard) in the result table
• Aggregated result is returned to the client
52. Shard-Query Flow as SQL
[justin@localhost bin]$ ./run_query --verbose
select count(*) from lineorder;
Shard-Query optimizer messages:
SQL TO SEND TO SHARDS:
Array
(
[0] => SELECT COUNT(*) AS expr_2913896658
FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1
[1] => SELECT COUNT(*) AS expr_2913896658
FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1
[2] => SELECT COUNT(*) AS expr_2913896658
FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1
[3] => SELECT COUNT(*) AS expr_2913896658
FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1
)
SQL TO SEND TO COORDINATOR NODE:
SELECT SUM(expr_2913896658) AS ` count `
FROM `aggregation_tmp_58392079`
Array
(
[count ] => 0
)
1 rows returned
Exec time: 0.03083610534668
Initial query
Query rewrite / map
Final aggregation / reduce
Final result
53. Map/Combine example
select LO_OrderDateKey, count(*) from lineorder group by LO_OrderDateKey;
Shard-Query optimizer messages:
* The following projections may be selected for a UNIQUE CHECK on the storage node operation:
expr$0
* storage node result set merge optimization enabled:
ON DUPLICATE KEY UPDATE
expr_2445085448=expr_2445085448 + VALUES(expr_2445085448)
SQL TO SEND TO SHARDS:
Array
(
[0] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448
FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[1] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448
FROM lineorder PARTITION(p1) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[2] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448
FROM lineorder PARTITION(p2) AS `lineorder` WHERE 1=1 GROUP BY expr$0
[3] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448
FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 GROUP BY expr$0
)
SQL TO SEND TO COORDINATOR NODE:
SELECT expr$0 AS `LO_OrderDateKey`,SUM(expr_2445085448) AS ` count `
FROM `aggregation_tmp_12033903` GROUP BY expr$0
combine
reduce
55. Machine generated data
• Sensor readings
• Metrics
• Logs
• Any large table with short lookup tables
Star schema are ideal
56. Call detail records
• Shard-Query is used in the billing system of a large cellular provider
• CDRs generate a lot of data
• Shard-Query includes a fast PERCENTILE function
57. Green energy meter processing
• High volume of data means sharding is necessary
• With Shard-Query, reporting is possible over all the shards,
making queries possible that would not work with Fabric or other
sharding solutions
• Used in India for reporting on a green power grid
58. Log analysis
• Performance logs from a web application for example
• Aggregate many different statistics and shard if log volumes are
high enough
• Search text logs with regular expressions
60. Star Schema Benchmark – SF 20
• 119 million rows of data (12GB)
• Infobright Community Database
• Only 1st query from each “flight” selected
• Unsharded compared to four shards (box has 4 cpu - Amazon
m1.xlarge)
61. COLD
• MySQL – 35.39s
• Shard-Query – 11.62s
HOT
• MySQL – 10.99s
• Shard-Query – 2.95s
Query 1
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder join dim_date on lo_orderdatekey = d_datekey
where d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
62. COLD
• MySQL – 34.24s
• Shard-Query – 12.74s
HOT
• MySQL – 12.74s
• Shard-Query – 3.26s
Query 2
select sum(lo_revenue), d_year, p_brand
from lineorder
join dim_date on lo_orderdatekey = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand
order by d_year, p_brand;
63. COLD
• MySQL – 27.29s
• Shard-Query – 7.97s
HOT
• MySQL – 18.89s
• Shard-Query – 5.06s
Query 3
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer join lineorder
on lo_custkey = c_customerkey
join supplier on lo_suppkey = s_suppkey
join dim_date on lo_orderdatekey = d_datekey
where c_region = 'ASIA'
and s_region = 'ASIA'
and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;
64. COLD
• MySQL – 23.02s
• Shard-Query – 8.48s
HOT
• MySQL – 14.77s
• Shard-Query – 4.29s
Query 4
select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
from lineorder join dim_date on lo_orderdatekey = d_datekey
join customer on lo_custkey = c_customerkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA'
and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1'
or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;