Building a High-Volume Reporting System on Amazon AWS
with MySQL, Tungsten, and Vertica

GAMIFIED REWARDS

4/11/12                  @jpmalek
What I’ll cover:
       Our reporting/analytics growth stages,
       their pitfalls and what we’ve learned:

  1.      Custom MySQL ETL via shell scripts, visualizations in Tableau
  2.      ETL via a custom Tungsten applier into Vertica
  3.      New Tungsten Vertica applier, built by Continuent
  4.      Sharded transactional system, multiple Tungsten Vertica appliers




Stage 1 : Custom MySQL ETL via shell scripts, visualizations in Tableau




1.   On the slave, dump an hour’s worth of new rows via SELECT INTO OUTFILE
2.   Ship the data file to the aggregations host, discard the old hourly snapshot, and load the new one
3.   Run aggregation queries against the temporary snapshot and FEDERATED tables
4.   Tableau refreshes its extracts after the aggregated rows are inserted
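
The hourly dump in step 1 can be sketched roughly as follows. This is only an illustration of building a one-hour SELECT ... INTO OUTFILE statement; the table name `events`, the column `created_at`, and the output path are hypothetical and do not come from the original scripts.

```python
from datetime import datetime, timedelta


def hourly_outfile_query(table, ts_column, hour_start, outfile):
    """Build a SELECT ... INTO OUTFILE statement covering one hour of rows.

    All names passed in are hypothetical; the half-open window
    [hour_start, hour_start + 1h) avoids exporting a row twice.
    """
    hour_end = hour_start + timedelta(hours=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} >= '{hour_start.strftime(fmt)}' "
        f"AND {ts_column} < '{hour_end.strftime(fmt)}' "
        f"INTO OUTFILE '{outfile}' "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )


query = hourly_outfile_query(
    "events", "created_at", datetime(2012, 4, 11, 9), "/tmp/events_2012041109.csv"
)
print(query)
```

A shell script run from cron could pass a statement like this to the mysql client on the slave and then ship the resulting file to the aggregations host.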




Detour : RAID for the Win



                                        Big drop in
                                   API endpoint latency
                                         (writes)




Stage 2 : ETL via a custom Tungsten applier into Vertica




Stage 2 : Customized Tungsten Replication Setup



          MySQL
            -> Master Replicator: extract binlog to Tungsten log
            -> Slave Replicator: extract from master to log -> extract from log
               -> filter (DDL & unwanted tables) -> custom Vertica JDBC applier
            -> Vertica


Stage 2 : Issues with the Custom Tungsten Applier




          1. OLTP-style transactions on Vertica are very slow: about 10 transactions
             per second vs. around 1000 per second for a MySQL slave. The slave
             applier could not keep up with the MySQL master.
          2. The person who created the applier was no longer with the company.
          3. The Tungsten setup, including the custom applier, was difficult to
             maintain and hard to move to other hosts.
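
To see why an applier at roughly 10 transactions per second cannot keep up, here is a small back-of-the-envelope sketch. The 10 and 1000 figures come from the slide; the master write rate of 100 transactions per second is a hypothetical value chosen only for illustration.

```python
def backlog_after(seconds, master_tps, applier_tps):
    """Transactions still waiting to be applied after `seconds` of steady load."""
    return max(0, (master_tps - applier_tps) * seconds)


def catch_up_seconds(backlog, master_tps, applier_tps):
    """Seconds needed to drain `backlog`, or None if the applier never catches up."""
    surplus = applier_tps - master_tps
    if surplus <= 0:
        return None
    return backlog / surplus


# Hypothetical master rate of 100 tps against a 10 tps applier, for one hour.
backlog = backlog_after(3600, master_tps=100, applier_tps=10)
print(backlog)
print(catch_up_seconds(backlog, master_tps=100, applier_tps=10))
print(catch_up_seconds(backlog, master_tps=100, applier_tps=1000))
```

Under these assumptions the backlog only grows for as long as the master keeps writing, whereas an applier running at MySQL-slave speed would drain an hour of backlog in a few minutes.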




Detour : flexible APIs
          and baseball schedules




Stage 3 : New Tungsten Vertica Applier




Stage 3: A Template-Driven Batch Apply Process


          Tungsten Replicator Pipeline:
          MySQL -> Extract-Filter-Apply -> Extract-Filter-Apply -> Extract-Filter-Apply

          CSV Files -> COPY (Template) -> Staging Table -> DELETE, then INSERT (Template) -> Base Tables

          Staging Table              Base Tables
          233, d, 64, …, 1           63, ‘bob’, 23, …
          233, i, 64, …, 2           64, ‘sue’, 76, …
          239, I, 76, …, 3           67, ‘jim’, 1, …
                                     76, ‘dan’, 25, …
                                     98, ‘joe’, 66, …
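
The staging-table pattern above (load a batch, then DELETE, then INSERT into the base table) can be sketched with SQLite standing in for Vertica. This is a minimal illustration and not the actual SQL templates: the table and column names are hypothetical, the staging columns (sequence number, opcode, key, values, row number) are an assumed reading of the sample rows, and `executemany` stands in for the COPY of a CSV file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
    -- Staging rows carry an opcode ('D' or 'I') plus ordering columns.
    CREATE TABLE stage_users (seqno INTEGER, opcode TEXT, id INTEGER,
                              name TEXT, score INTEGER, row_id INTEGER);
    INSERT INTO users VALUES (63, 'bob', 23), (64, 'sue', 10);
""")

# An update of row 64 arrives as a delete followed by an insert;
# row 76 is a plain insert.
batch = [
    (233, "D", 64, None, None, 1),
    (233, "I", 64, "sue", 76, 2),
    (239, "I", 76, "dan", 25, 3),
]
conn.executemany("INSERT INTO stage_users VALUES (?, ?, ?, ?, ?, ?)", batch)

with conn:  # apply the whole batch in one transaction
    conn.execute("DELETE FROM users WHERE id IN "
                 "(SELECT id FROM stage_users WHERE opcode = 'D')")
    conn.execute("INSERT INTO users (id, name, score) "
                 "SELECT id, name, score FROM stage_users "
                 "WHERE opcode = 'I' ORDER BY row_id")
    conn.execute("DELETE FROM stage_users")  # clear staging for the next batch

rows = conn.execute("SELECT id, name, score FROM users ORDER BY id").fetchall()
print(rows)
```

Applying a batch as two set-based statements, rather than one statement per row change, is what lets a column store avoid the slow row-at-a-time transactions described in Stage 2.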

Stage 3 : Batch Applier Replication Setup



      MySQL
        -> Master Replicator: extract binlog to Tungsten log
        -> Slave Replicator: extract from master to log -> extract from log
           -> filter (use built-in filters; DDL ignored)
           -> batch applier using SQL template commands
              (write data to CSV disk files, then COPY / INSERT)
        -> Vertica
Stage 3 : Solving Problems to Get the New Applier to Work




          1. Testing – Developed a lightweight testing mechanism for heterogeneous
             replication
          2. Batch applier implementation – Took two tries to get right, including SQL
             templates and full datatype support
          3. Character sets – Ensuring consistent UTF-8 handling throughout the
             replication chain, including CSV files
          4. Time zones – Ensuring the Java VM handled time values correctly
          5. Performance – Tweaked SQL templates to get a 50x boost over the old applier
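
The character-set and time-zone items can be illustrated with a small sketch that writes CSV as explicit UTF-8 and normalizes timestamps to UTC. The replicator itself runs on the Java VM, so this Python version only shows the principle; the function name, the sample row, and the UTC-7 source offset are hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def to_csv_bytes(rows):
    """Serialize rows to UTF-8 CSV bytes, converting datetimes to UTC."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        out = []
        for value in row:
            if isinstance(value, datetime):
                if value.tzinfo is None:
                    # Refuse to guess: a naive timestamp has no defined zone.
                    raise ValueError("naive datetime: time zone is ambiguous")
                value = value.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
            out.append(value)
        writer.writerow(out)
    # Encode explicitly instead of relying on the platform default.
    return buf.getvalue().encode("utf-8")


source_tz = timezone(timedelta(hours=-7))  # hypothetical source time zone
data = to_csv_bytes([(64, "José", datetime(2012, 4, 11, 9, 30, tzinfo=source_tz))])
print(data.decode("utf-8"))
```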



Detour : Sharding
          or Learning How To Sleep In Any Position




Stage 4 : Sharded transactional system, multiple Tungsten Vertica appliers




Solving Problems to Scale Up The Replication Configuration




                                1. Implement remote batch apply so Tungsten can
                                   run off-board from Vertica
                                2. Convert replication to a direct pipeline with a single
                                   service between MySQL and Vertica
                                3. Create a script to deploy replicator in a single
                                   command
                                4. Create staging tables on Vertica server
Remaining Challenges to Complete Replication Setup




                               1. Configure replication for global and local DBShards
                                  data
                               2. Ensure performance is up to snuff (currently 500-1000
                                  transactions per second)
                               3. Introduce intermediate staging servers to reduce
                                  number of replication streams into Vertica

Thank You!

      In summary:
      1. Tungsten is a great tool for MySQL ETL automation, so
         check it out as an alternative to custom in-house
         scripts or other options.
      2. Vertica is a high-performance, scalable BI platform that
         now pairs well with Tungsten. Full360 offers a
         cloud-based solution.
      3. If you’re just getting started on the BI front, hire a BI
         developer to focus on this stuff, if you can.
      4. I see no reason why this framework couldn’t scale to
         easily handle whatever our business needs in the future.




More Related Content

What's hot

Multi Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureMulti Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureKenny Gryp
 
The Art of Doctrine Migrations
The Art of Doctrine MigrationsThe Art of Doctrine Migrations
The Art of Doctrine MigrationsRyan Weaver
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11Kenny Gryp
 
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...Lindsey Aitchison
 
symfony Live 2010 - Using Doctrine Migrations
symfony Live 2010 -  Using Doctrine Migrationssymfony Live 2010 -  Using Doctrine Migrations
symfony Live 2010 - Using Doctrine MigrationsD
 
Java troubleshooting thread dump
Java troubleshooting thread dumpJava troubleshooting thread dump
Java troubleshooting thread dumpejlp12
 
Apache Kafka Reliability
Apache Kafka Reliability Apache Kafka Reliability
Apache Kafka Reliability Jeff Holoman
 
Built-in Replication in PostgreSQL
Built-in Replication in PostgreSQLBuilt-in Replication in PostgreSQL
Built-in Replication in PostgreSQLMasao Fujii
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarkingSqrrl
 
Xenalyze: Finding meaning in the chaos
Xenalyze: Finding meaning in the chaosXenalyze: Finding meaning in the chaos
Xenalyze: Finding meaning in the chaosThe Linux Foundation
 
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010CLOUDIAN KK
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Santosh Kangane
 
Synchronous Log Shipping Replication
Synchronous Log Shipping ReplicationSynchronous Log Shipping Replication
Synchronous Log Shipping Replicationelliando dias
 
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVNWANdisco Plc
 
FlexSC: Exception-Less System Calls - presented @ OSDI 2010
FlexSC: Exception-Less System Calls - presented @ OSDI 2010FlexSC: Exception-Less System Calls - presented @ OSDI 2010
FlexSC: Exception-Less System Calls - presented @ OSDI 2010Livio Soares
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsSyed Zaid Irshad
 

What's hot (20)

Multi Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureMulti Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ Verisure
 
The Art of Doctrine Migrations
The Art of Doctrine MigrationsThe Art of Doctrine Migrations
The Art of Doctrine Migrations
 
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
MySQL Database Architectures - MySQL InnoDB ClusterSet 2021-11
 
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i...
 
symfony Live 2010 - Using Doctrine Migrations
symfony Live 2010 -  Using Doctrine Migrationssymfony Live 2010 -  Using Doctrine Migrations
symfony Live 2010 - Using Doctrine Migrations
 
121 Pdfsam
121 Pdfsam121 Pdfsam
121 Pdfsam
 
A32 Database Virtulization Technologies
A32 Database Virtulization TechnologiesA32 Database Virtulization Technologies
A32 Database Virtulization Technologies
 
Java troubleshooting thread dump
Java troubleshooting thread dumpJava troubleshooting thread dump
Java troubleshooting thread dump
 
Presentation
PresentationPresentation
Presentation
 
Apache Kafka Reliability
Apache Kafka Reliability Apache Kafka Reliability
Apache Kafka Reliability
 
Built-in Replication in PostgreSQL
Built-in Replication in PostgreSQLBuilt-in Replication in PostgreSQL
Built-in Replication in PostgreSQL
 
Exch2007 sp1 win2008
Exch2007 sp1 win2008Exch2007 sp1 win2008
Exch2007 sp1 win2008
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarking
 
Xenalyze: Finding meaning in the chaos
Xenalyze: Finding meaning in the chaosXenalyze: Finding meaning in the chaos
Xenalyze: Finding meaning in the chaos
 
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
Summary of "YCSB " paper for nosql summer reading in Tokyo" on Sep 15, 2010
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
Synchronous Log Shipping Replication
Synchronous Log Shipping ReplicationSynchronous Log Shipping Replication
Synchronous Log Shipping Replication
 
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN02.28.13 WANDisco SVN Training: Getting Info Out of SVN
02.28.13 WANDisco SVN Training: Getting Info Out of SVN
 
FlexSC: Exception-Less System Calls - presented @ OSDI 2010
FlexSC: Exception-Less System Calls - presented @ OSDI 2010FlexSC: Exception-Less System Calls - presented @ OSDI 2010
FlexSC: Exception-Less System Calls - presented @ OSDI 2010
 
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar ProcessorsInstruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar Processors
 

Similar to Building a High-Volume Reporting System on Amazon AWS with MySQL, Tungsten, and Vertica

Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Apache Tomcat 7 by Filip Hanik
Apache Tomcat 7 by Filip HanikApache Tomcat 7 by Filip Hanik
Apache Tomcat 7 by Filip HanikEdgar Espina
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档YUCHENG HU
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayDatabricks
 
Building Cloud Tools for Netflix with Jenkins
Building Cloud Tools for Netflix with JenkinsBuilding Cloud Tools for Netflix with Jenkins
Building Cloud Tools for Netflix with JenkinsGareth Bowles
 
The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012Lucas Jellema
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityConSanFrancisco123
 
Sql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffySql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffyAnuradha
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...LarryZaman
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)confluent
 
Scarlet - Scalable, Redundant, Cloud Enabled JIRA
Scarlet - Scalable, Redundant, Cloud Enabled JIRAScarlet - Scalable, Redundant, Cloud Enabled JIRA
Scarlet - Scalable, Redundant, Cloud Enabled JIRASanne Grinovero
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Nitin S
 
Solr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationSolr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationNitin Sharma
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingDibyendu Bhattacharya
 
ConFoo MySQL Replication Evolution : From Simple to Group Replication
ConFoo  MySQL Replication Evolution : From Simple to Group ReplicationConFoo  MySQL Replication Evolution : From Simple to Group Replication
ConFoo MySQL Replication Evolution : From Simple to Group ReplicationDave Stokes
 
MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016Dave Stokes
 
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Lucidworks
 
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningGuido Schmutz
 

Similar to Building a High-Volume Reporting System on Amazon AWS with MySQL, Tungsten, and Vertica (20)

Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Apache Tomcat 7 by Filip Hanik
Apache Tomcat 7 by Filip HanikApache Tomcat 7 by Filip Hanik
Apache Tomcat 7 by Filip Hanik
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin Murray
 
Building Cloud Tools for Netflix with Jenkins
Building Cloud Tools for Netflix with JenkinsBuilding Cloud Tools for Netflix with Jenkins
Building Cloud Tools for Netflix with Jenkins
 
The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012The Very Very Latest in Database Development - Oracle Open World 2012
The Very Very Latest in Database Development - Oracle Open World 2012
 
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
The Very Very Latest In Database Development - Lucas Jellema - Oracle OpenWor...
 
Clustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And AvailabilityClustered Architecture Patterns Delivering Scalability And Availability
Clustered Architecture Patterns Delivering Scalability And Availability
 
Sql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffySql server 2012 - always on deep dive - bob duffy
Sql server 2012 - always on deep dive - bob duffy
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
 
Scarlet - Scalable, Redundant, Cloud Enabled JIRA
Scarlet - Scalable, Redundant, Cloud Enabled JIRAScarlet - Scalable, Redundant, Cloud Enabled JIRA
Scarlet - Scalable, Redundant, Cloud Enabled JIRA
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
 
Solr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationSolr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin Presentation
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
ConFoo MySQL Replication Evolution : From Simple to Group Replication
ConFoo  MySQL Replication Evolution : From Simple to Group ReplicationConFoo  MySQL Replication Evolution : From Simple to Group Replication
ConFoo MySQL Replication Evolution : From Simple to Group Replication
 
MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016MySQL Replication Update -- Zendcon 2016
MySQL Replication Update -- Zendcon 2016
 
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
 
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & PartitioningApache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
Apache Kafka - Event Sourcing, Monitoring, Librdkafka, Scaling & Partitioning
 

Recently uploaded

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Recently uploaded (20)

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV

Building a High-Volume Reporting System on Amazon AWS with MySQL, Tungsten, and Vertica

  • 1. Building a High-Volume Reporting System on Amazon AWS with MySQL, Tungsten, and Vertica GAMIFIED REWARDS 4/11/12 @jpmalek
  • 2. What I’ll cover: Our reporting/analytics growth stages, their pitfalls and what we’ve learned: 1. Custom MySQL ETL via shell scripts, visualizations in Tableau 2. ETL via a custom Tungsten applier into Vertica 3. New Tungsten Vertica applier, built by Continuent 4. Sharded transactional system, multiple Tungsten Vertica appliers
  • 3. Stage 1 : Custom MySQL ETL via shell scripts, visualizations in Tableau 1. On slave, dump an hour’s worth of new rows via SELECT INTO OUTFILE 2. Ship data file to aggregations host, dump old hourly snapshot, load new 3. Perform aggregation queries against temporary snapshot and FEDERATED tables 4. Tableau refreshes its extracts after aggregated rows are inserted.
  • 4. Detour : RAID for the Win Big drop in API endpoint latency (writes)
  • 5. Stage 2 : ETL via a custom Tungsten applier into Vertica
  • 6. Stage 2 : Customized Tungsten Replication Setup [Diagram] Master Replicator: MySQL → extract binlog to Tungsten log. Slave Replicator: extract from master to log → extract from log → filter (DDL and unwanted tables) → custom Vertica JDBC applier → Vertica.
  • 7. Stage 2 : Issues with the Custom Tungsten Filter 1. OLTP transactions on Vertica are very slow! (10 transactions per second vs. around 1000 per second for a MySQL slave). Slave applier could not keep up with MySQL master. 2. Person who created the applier was no longer with the company. 3. Tungsten setup including custom applier was difficult to maintain and hard to move to other hosts.
  • 8. Detour : flexible APIs and baseball schedules
  • 9. Stage 3 : New Tungsten Vertica Applier
  • 10. Stage 3: A Template-Driven Batch Apply Process [Diagram] Tungsten Replicator pipeline (three Extract-Filter-Apply stages) reads from MySQL and writes CSV files. COPY (template) loads the CSV files into a staging table (example rows: 233, d, 64, …, 1; 233, i, 64, …, 2; 239, I, 76, …, 3). DELETE, then INSERT (template) applies the staged rows to the base tables (example rows: 63, ‘bob’, 23, …; 64, ‘sue’, 76, …; 67, ‘jim’, 1, …; 76, ‘dan’, 25, …; 98, ‘joe’, 66, …).
  • 11. Stage 3 : Batch Applier Replication Setup [Diagram] Master Replicator: MySQL → extract binlog to Tungsten log. Slave Replicator: extract from master to log → extract from log → filter (use built-in filters; DDL ignored) → batch applier using SQL template commands → write data to disk as CSV files → COPY / INSERT → Vertica.
  • 12. Stage 3 : Solving Problems to Get the New Applier to Work 1. Testing – Developed a lightweight testing mechanism for heterogeneous replication 2. Batch applier implementation – Two tries to get it right including SQL templates and full datatype support 3. Character sets – Ensuring consistent UTF-8 handling throughout the replication chain, including CSV files 4. Time zones – Ensuring Java VM handled time values correctly 5. Performance – Tweak SQL templates to get 50x boost over old applier
  • 13. Detour : Sharding or Learning How To Sleep In Any Position
  • 14. Stage 4 : Sharded transactional system, multiple Tungsten Vertica appliers
  • 15. Solving Problems to Scale Up The Replication Configuration 1. Implement remote batch apply so Tungsten can run off-board from Vertica 2. Convert replication to a direct pipeline with a single service between MySQL and Vertica 3. Create a script to deploy replicator in a single command 4. Create staging tables on Vertica server
  • 16. Remaining Challenges to Complete Replication Setup 1. Configure replication for global and local DBShards data 2. Ensure performance is up to snuff; currently at 500-1000 transactions per second 3. Introduce intermediate staging servers to reduce number of replication streams into Vertica
  • 17. Thank You! In summary: 1. Tungsten is a great tool when it comes to MySQL ETL automation, so check it out as an alternative to custom in-house scripts or other options. 2. Vertica is a high-performance, scalable BI platform that now pairs well with Tungsten. Full360 offers a cloud-based solution. 3. If you’re just getting started on the BI front, hire a BI developer to focus on this stuff, if you can. 4. I see no reason why this framework couldn’t scale to easily handle whatever our business needs in the future.
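
The batch apply process on slide 10 (CSV files, COPY into a staging table, then DELETE and INSERT against the base table) can be sketched roughly as follows. This is a minimal illustration, not Continuent's actual SQL templates: SQLite stands in for Vertica, `executemany` stands in for Vertica's bulk COPY, and the table names (`users`, `stage_users`), the column layout, and the uppercase opcodes are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical batch in a staging layout modeled on slide 10:
# (seqno, opcode, id, name, score, row_id). An UPDATE on the master
# arrives as a D row followed by an I row for the same key.
BATCH_CSV = """\
233,D,64,,,1
233,I,64,sue,77,2
239,I,99,amy,5,3
240,D,98,,,4
"""


def make_db():
    """Create an in-memory database with a seeded base table and a staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
    conn.execute(
        "CREATE TABLE stage_users ("
        "seqno INTEGER, opcode TEXT, id INTEGER, name TEXT, score INTEGER, row_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(63, "bob", 23), (64, "sue", 76), (67, "jim", 1), (76, "dan", 25), (98, "joe", 66)],
    )
    conn.commit()
    return conn


def apply_batch(conn, batch_csv):
    """Load one CSV batch into staging, then DELETE and INSERT against the base table."""
    rows = list(csv.reader(io.StringIO(batch_csv)))
    with conn:  # the whole batch commits or rolls back as one transaction
        conn.execute("DELETE FROM stage_users")
        # Stand-in for the bulk COPY step that loads the CSV file.
        conn.executemany("INSERT INTO stage_users VALUES (?, ?, ?, ?, ?, ?)", rows)
        # DELETE: remove every base row that the batch deleted or updated.
        conn.execute(
            "DELETE FROM users WHERE id IN "
            "(SELECT id FROM stage_users WHERE opcode = 'D')"
        )
        # INSERT: add a key only if its last change in the batch is an insert.
        conn.execute(
            """
            INSERT INTO users (id, name, score)
            SELECT s.id, s.name, s.score
            FROM stage_users AS s
            WHERE s.opcode = 'I'
              AND s.row_id = (SELECT MAX(t.row_id) FROM stage_users AS t WHERE t.id = s.id)
            """
        )


if __name__ == "__main__":
    db = make_db()
    apply_batch(db, BATCH_CSV)
    for row in db.execute("SELECT id, name, score FROM users ORDER BY id"):
        print(row)
```

The point of the design is that the target database sees a handful of set-based statements per batch instead of one transaction per row, which is the pattern the slides credit for the large speed-up over the per-row JDBC applier.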

Editor's Notes

  1. Thanks for coming! I'll try to make this quick so we can all grab the last part of the “backing up FB” session. About me: Hi, I’m Jeff Malek. This is my second startup, and before that, 14 years ago, I did everything possible to avoid a real, normal job, wandering like a vagabond. So startup junkie, or real-world flunkie: take your pick. I’m CTO and co-founder at BigDoor. About BigDoor and why it fits for this talk: BigDoor has been around for almost three years. We help sites create their own branded, gamified rewards programs. Think of it as your own engaging frequent flyer program. That means we give points to users. Lots of them, with all kinds of different flavors, attributes, types, and conversion rates. Between customers like Major League Baseball, Island Def Jam recording artists like Big K.R.I.T, another big national sports organization that I can’t mention but that you all know, and the rest of our customers, we accrue a lot of points for users. We store those and other event-like data in ledger tables, in MySQL. The tables just keep growing, and largely need to, to support business logic and associated API endpoints. For example, transaction histories need to be returned for users. We need to be able to report on this data, merging it with web log data, both for canned reports and on an ad-hoc basis. AWS: at my last startup all of our systems were on co-located hardware that we managed and maintained; all of BigDoor's systems are in AWS. I’d like to hear from you what you think when you hear ‘high-volume’. TODO : For this talk, 10K inserts/updates per second should suffice, which on our system creates x volume per hour...
  2. What I’m going to cover: Custom MySQL ETL via shell scripts, visualizations in Tableau. Note that this was something I knew wouldn’t scale, longer term. ETL via a custom Tungsten applier into Vertica. New Tungsten Vertica applier, built by Continuent. Sharded transactional system, multiple Tungsten Vertica appliers. But first, is there anything that folks came hoping to hear specifically, given the topic? If so, maybe I can try to hit those.
  3. Goals for this stage: Gather data for, and present aggregated reports about, the points, levels and badges people were earning through our systems. What we did: Go over four steps. Gotchas: Queries against slaves can slow replication. ETL scripts may be fun and interesting for the author but maybe not so much for maintainers. Why move from this stage? I knew from the beginning that this wasn’t going to be the longer-term solution; it can’t scale. Didn’t want anyone in the company maintaining the scripted ETL process. Longer term, wanted a place where we could do ad-hoc queries without affecting transactional layer performance and replication. This legacy system is still running, at end-of-life. Primary is M2.4xlarge, with software RAID, 4 EBS volumes, 16 KB chunk size. TODO : M2.4xlarge is equivalent of : TODO : add reporting screenshot
  4. EBS volumes: 100-150 IOPS are standard, but we see spikes much higher than that. After RAID, seeing spikes over 1K tps during slow times. RAID makes taking snapshots a bit more complicated. There’s a pretty well-defined process for single volumes. When doing performance benchmarks, beware: high variability (low consistency) when it comes to EBS write performance. We rolled back our first attempt because of this. In general, RAID is the best thing you can do for performance on AWS, since writes in particular suck. At least, now… they’re working on various fronts to improve this.
  5. Goals for this stage: Start using a more mature data warehouse platform. Needs: high-volume writes; performs OK for ad-hoc queries; OK to store de-normalized fact tables and aggregations; mostly SQL-compliant (vs. MDX, for example); support for scale-out, e.g. clustering. MySQL is a great DBMS for OLTP; not so much for answering BI questions. Remove the dependency on custom ETL with a tool that doesn’t affect replication or otherwise screw up the transactional layer. Replace it with a tool that allows for replication of all data if we like, and an easy means to filter out whatever we don’t want. What we did: Signed up with Full360 to implement Vertica. Full360 takes care of provisioning the instance, has management scripts for things like backups, and handles the onerous licensing issues that come with Vertica. Vertica takes writes at high volume very well, and is fast to return answers. Not web-fast, but fast enough for background processing. Started testing Continuent’s Tungsten replication product. Our DBA wrote a custom filter to forward data to Vertica. It connects via JDBC to Vertica, writing changes on a per-row basis, using row-based MySQL replication. Gotchas: Vertica costs go up with data volume. Vertica doesn’t accept single-row writes very quickly (but batch writes are very fast). Why we had to move from this on to the next stage: We had a few large publishers start using our API, including a large sports organization I’m sure you’re familiar with, MLB. The custom in-house applier that we wrote couldn’t keep up with the volume of writes.
  6. (For those of you who can't read this.) If so-and-so hit a single, every single user would get a badge, at the same time. I’m sure you can imagine how that raised some eyebrows on our side of things, in contrast to normal web traffic; this was ‘spike’ defined. We all started paying heavy attention to the game schedules. We were able to handle the load well, but it’s an example of how having an API can bring surprises. They're continuing to use us this year.
  7. Goals for this stage: Keep up with the rate of inserts and updates from the transactional system, applying changes directly to Vertica in near-real-time. As I mentioned, our custom applier wasn’t keeping up. So we called Continuent. They’d wanted to build an applier for Vertica for some time, anyway. To my knowledge, this hasn’t been done. Things that help when you’re working with a third party like this: a highly normalized schema; simple data types; a low rate of change to the schema. We had this because the API was designed to be very flexible right out of the gate. Gotchas: Time zones (Robert, do you want to touch on that a bit?): source data is stored in UTC; transforms should be done at the presentation layer, ideally. Character sets: use UTF-8 across the board. Why we had to move on from this stage: Another large publisher with a huge audience, also a large sports organization, loomed on the horizon. We needed to scale our systems 20x to handle their traffic. So we sharded our transactional layer.
  8. Six weeks of no sleep. Didn’t port ledger data, just configuration data. Thank you team!
  9. Goals for this stage: Change the Vertica applier so that it can be run from a MySQL slave, doing remote batch apply to Vertica. Run 16 appliers from a sharded MySQL cluster into Vertica. Maintain a backlog of data for ad-hoc queries, otherwise cull and archive older data to keep the Vertica data set small. Still in beta test, but looking promising.
  10. In summary: Tungsten is a great tool when it comes to MySQL ETL automation, so check it out as an alternative to custom in-house scripts or other options. Vertica is a high-performance, scalable BI platform that now pairs well with Tungsten. Full360 offers a cloud-based solution. If you’re just getting started on the BI front, hire a BI developer to focus on this stuff, if you can. I see no reason why this framework couldn’t scale to easily handle whatever our business needs in the future.
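
The time-zone advice in note 7 (store source data in UTC, transform at the presentation layer) can be illustrated with a short sketch. The talk's own fix was on the Java VM side of the replicator; the Python below only shows the general principle, and the function names and the fixed -07:00 viewer offset are hypothetical choices for the example. A real report would more likely look up a named zone, for example with the standard `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical viewer offset, for illustration only.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")


def to_stored_utc(local_dt):
    """Normalize an aware datetime to UTC before it is written or replicated."""
    if local_dt.tzinfo is None:
        raise ValueError("refusing a naive datetime: its zone is ambiguous")
    return local_dt.astimezone(timezone.utc)


def for_display(stored_utc, viewer_tz):
    """Convert a stored UTC value to the viewer's zone at presentation time."""
    return stored_utc.astimezone(viewer_tz)


if __name__ == "__main__":
    earned_at = datetime(2012, 4, 11, 18, 30, tzinfo=PACIFIC_DAYLIGHT)
    stored = to_stored_utc(earned_at)
    print("stored:", stored.isoformat())
    print("shown: ", for_display(stored, PACIFIC_DAYLIGHT).isoformat())
```

Keeping every stored and replicated value in UTC means the CSV files and the warehouse never carry an implicit local zone, so only the reporting layer has to know where the viewer is.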