Building a High-Volume Reporting System
on Amazon AWS with MySQL, Tungsten, and Vertica

GAMIFIED REWARDS




4/11/12                  @jpmalek
What I’ll cover:
       Our reporting/analytics growth stages,
       their pitfalls and what we’ve learned:

  1.      Custom MySQL ETL via shell scripts, visualizations in Tableau
  2.      ETL via a custom Tungsten applier into Vertica
  3.      New Tungsten Vertica applier, built by Continuent
  4.      Sharded transactional system, multiple Tungsten Vertica appliers




Stage 1 : Custom MySQL ETL via shell scripts, visualizations in Tableau




1.   On slave, dump an hour’s worth of new rows via SELECT INTO OUTFILE
2.   Ship data file to aggregations host, dump old hourly snapshot, load new
3.   Perform aggregation queries against temporary snapshot and FEDERATED tables
4.   Tableau refreshes its extracts after aggregated rows are inserted.
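The four steps above can be sketched as a shell script. This is a minimal sketch, not the deck's actual scripts: the table, time window, host, and paths are all hypothetical, and the SQL is generated as text rather than executed against a live slave.

```shell
#!/bin/sh
# Sketch of the Stage 1 hourly ETL (hypothetical table, window, and paths).

TABLE=points_ledger                 # hypothetical ledger table
HOUR_START='2012-04-11 09:00:00'    # hour being exported
HOUR_END='2012-04-11 10:00:00'
OUTFILE="/tmp/${TABLE}_hourly.csv"

# Step 1: on the slave, dump an hour's worth of new rows.
dump_sql() {
  cat <<SQL
SELECT * INTO OUTFILE '${OUTFILE}'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  FROM ${TABLE}
 WHERE created_at >= '${HOUR_START}' AND created_at < '${HOUR_END}';
SQL
}

# Steps 2-3: ship the file to the aggregations host, drop the old hourly
# snapshot, and load the new one for the aggregation queries to run against.
ship_and_load() {
  echo "scp ${OUTFILE} aggregations-host:${OUTFILE}"
  cat <<SQL
DROP TABLE IF EXISTS ${TABLE}_snapshot;
CREATE TABLE ${TABLE}_snapshot LIKE ${TABLE};
LOAD DATA INFILE '${OUTFILE}' INTO TABLE ${TABLE}_snapshot
  FIELDS TERMINATED BY ',' ENCLOSED BY '"';
SQL
}

dump_sql
ship_and_load
```

Step 4 (Tableau refreshing its extracts) happens on a schedule inside Tableau itself, after the aggregated rows land.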




Detour : RAID for the Win



(Chart: big drop in API endpoint latency for writes.)




Stage 2 : ETL via a custom Tungsten applier into Vertica




Stage 2 : Customized Tungsten Replication Setup

(Pipeline diagram:)

MySQL → Master Replicator (extract binlog to Tungsten log)
  → Slave Replicator (extract from master to log → extract from log → filter: drop DDL & unwanted tables → custom Vertica JDBC applier)
  → Vertica
Stage 2 : Issues with the Custom Tungsten Filter




1. OLTP transactions on Vertica are very slow (about 10 transactions per second vs. around 1,000 per second for a MySQL slave), so the slave applier could not keep up with the MySQL master.
2. The person who created the applier had left the company.
3. The Tungsten setup, including the custom applier, was difficult to maintain and hard to move to other hosts.
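The throughput gap in item 1 comes from write shape: the custom applier issued one JDBC INSERT per replicated row, while Vertica is built for bulk loads. A sketch of the contrast, with a hypothetical table and file; the statements are printed as text, not run:

```shell
#!/bin/sh
# Contrast the two write shapes behind issue 1 (hypothetical names).

# What the custom JDBC applier did: one INSERT per replicated row,
# which Vertica absorbs at roughly 10 transactions per second.
row_at_a_time() {
  i=1
  while [ "$i" -le 3 ]; do
    echo "INSERT INTO points_ledger VALUES (${i}, 'row ${i}');"
    i=$((i + 1))
  done
}

# What Vertica is fast at: one bulk COPY of a whole batch of rows.
bulk_load() {
  echo "COPY points_ledger FROM '/tmp/batch.csv' DELIMITER ',' DIRECT;"
}

row_at_a_time
bulk_load
```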




Detour : flexible APIs and baseball schedules




Stage 3 : New Tungsten Vertica Applier




Stage 3: A Template-Driven Batch Apply Process

(Pipeline diagram:)

MySQL → Tungsten Replicator pipeline (extract-filter-apply stages) → CSV files
  → staging table (rows carry seqno, opcode d/i, key, …, row number; e.g. 233, d, 64, …, 1)
  → base tables (e.g. 63, ‘bob’, 23, …)

The CSV files are bulk-loaded into the staging table via a COPY template, then merged into the base tables via a DELETE-then-INSERT template.
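A sketch of what those two templates might expand to for one base table. The table, columns, and staging layout here are hypothetical, and the real templates that ship with the Tungsten batch applier are more general; the SQL is printed rather than executed against Vertica:

```shell
#!/bin/sh
# Illustrative expansion of the COPY and DELETE-then-INSERT templates for a
# hypothetical base table users(id, name, score) with staging table
# stage_users(seqno, opcode, id, name, score, row_num).

BASE=users
STAGE=stage_users
CSV=/var/tmp/staging/users.csv

# COPY template: bulk-load this batch's CSV of row changes into staging.
copy_sql() {
  cat <<SQL
COPY ${STAGE} FROM '${CSV}' DELIMITER ',' DIRECT;
SQL
}

# DELETE-then-INSERT template: remove every base row touched in the batch,
# then re-insert rows whose opcode is an insert ('i'). (A real template
# also keeps only the last image per key, using the row number.)
merge_sql() {
  cat <<SQL
DELETE FROM ${BASE}
 WHERE id IN (SELECT id FROM ${STAGE});
INSERT INTO ${BASE}
 SELECT id, name, score FROM ${STAGE}
  WHERE opcode = 'i';
SQL
}

copy_sql
merge_sql
```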
Stage 3 : Batch Applier Replication Setup

(Pipeline diagram:)

MySQL → Master Replicator (extract binlog to Tungsten log)
  → Slave Replicator (extract from master to log → extract from log → built-in filters, DDL ignored → batch applier)
  → write data to disk as CSV files → Vertica, via SQL template commands (COPY / INSERT)
Stage 3 : Solving Problems to Get the New Applier to Work




1. Testing – developed a lightweight testing mechanism for heterogeneous replication
2. Batch applier implementation – two tries to get it right, including SQL templates and full datatype support
3. Character sets – ensuring consistent UTF-8 handling throughout the replication chain, including CSV files
4. Time zones – ensuring the Java VM handled time values correctly
5. Performance – tweaked SQL templates to get a 50x boost over the old applier
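For items 3 and 4, the usual lever is the replicator's JVM settings: pin the default time zone to UTC and the default encoding to UTF-8. A sketch, assuming JVM options are passed through a variable; the flags are real JVM flags, but the variable name and jar name are made up (the real Tungsten wrapper configuration differs):

```shell
#!/bin/sh
# Pin the replicator JVM to UTC and UTF-8 so time values and CSV files
# are not silently converted. Printed as a dry run, not executed.
JVM_OPTS="-Duser.timezone=GMT -Dfile.encoding=UTF-8"
CMD="java ${JVM_OPTS} -jar tungsten-replicator.jar"   # jar name illustrative
echo "$CMD"
```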



Detour : Sharding, or Learning How To Sleep In Any Position




Stage 4 : Sharded transactional system, multiple Tungsten Vertica appliers




Solving Problems to Scale Up The Replication Configuration




1. Implement remote batch apply so Tungsten can run off-board from Vertica
2. Convert replication to a direct pipeline with a single service between MySQL and Vertica
3. Create a script to deploy the replicator in a single command
4. Create staging tables on the Vertica server
Remaining Challenges to Complete Replication Setup




1. Configure replication for global and local DBShards data
2. Ensure performance is up to snuff – currently at 500–1,000 transactions per second
3. Introduce intermediate staging servers to reduce the number of replication streams into Vertica

Thank You!

In summary:
1. Tungsten is a great tool for MySQL ETL automation, so check it out as an alternative to custom in-house scripts or other options.
2. Vertica is a high-performance, scalable BI platform that now pairs well with Tungsten. Full360 offers a cloud-based solution.
3. If you're just getting started on the BI front, hire a BI developer to focus on this stuff, if you can.
4. I see no reason why this framework couldn't easily scale to handle whatever our business needs in the future.





Editor's Notes

  1. Thanks for coming! I'll try to make this quick so we can all grab the last part of the “backing up FB” session. About me: Hi, I'm Jeff Malek. This is my second startup, and before that, 14 years ago, I did everything possible to avoid a real, normal job, wandering like a vagabond. So startup junkie, or real-world flunkie: take your pick. I'm CTO and co-founder at BigDoor. About BigDoor and why it fits for this talk: BigDoor has been around for almost three years. We help sites create their own branded, gamified rewards programs. Think of it as your own engaging frequent-flyer program. That means we give points to users. Lots of them, with all kinds of different flavors, attributes, types, and conversion rates. Between customers like Major League Baseball, Island Def Jam recording artists like Big K.R.I.T., another big national sports organization that I can't mention but that you all know, and the rest of our customers, we accrue a lot of points for users. We store those and other event-like data in ledger tables, in MySQL. The tables just keep growing, and largely need to, to support business logic and associated API endpoints. For example, transaction histories need to be returned for users. We need to be able to report on this data, merging it with web log data, both for canned reports and on an ad-hoc basis. AWS: at my last startup all of our systems were on co-located hardware that we managed and maintained; all of BigDoor's systems are in AWS. I'd like to hear from you what you think when you hear “high-volume”. TODO : For this talk, 10K inserts/updates per second should suffice, which on our system creates x volume per hour...
  2. What I'm going to cover: custom MySQL ETL via shell scripts with visualizations in Tableau (note that this was something I knew wouldn't scale longer term); ETL via a custom Tungsten applier into Vertica; the new Tungsten Vertica applier, built by Continuent; and a sharded transactional system with multiple Tungsten Vertica appliers. But first, is there anything that folks came hoping to hear specifically, given the topic? If so, maybe I can try to hit those.
  3. Goals for this stage: gather data for, and present, aggregated reports about the points, levels, and badges people were earning through our systems. What we did: the four steps on the slide. Gotchas: queries against slaves can slow replication, and ETL scripts may be fun and interesting for the author but maybe not so much for maintainers. Why move from this stage? I knew from the beginning that this wasn't going to be the longer-term solution; it can't scale. I didn't want anyone in the company maintaining the scripted ETL process, and longer term we wanted a place where we could do ad-hoc queries without affecting transactional-layer performance and replication. This legacy system is still running, at end-of-life. The primary is an m2.4xlarge with software RAID: 4 EBS volumes, 16 KB chunk size. TODO : M2.4xlarge is equivalent of : TODO : add reporting screenshot
  4. EBS volumes: 100-150 IOPS is standard, but we see spikes much higher than that. After RAID, we're seeing spikes over 1K tps during slow times. RAID makes taking snapshots a bit more complicated; there's a pretty well-defined process for single volumes. When doing performance benchmarks, beware of high variability (low consistency) in EBS write performance. We rolled back our first attempt because of this. In general, RAID is the best thing you can do for performance on AWS, since writes in particular suck. At least for now; they're working on various fronts to improve this.
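That layout can be sketched with mdadm, assuming RAID 0 striping across the four EBS volumes (the notes give the volume count and the 16 KB chunk size but not the RAID level, so the level is an assumption); device names are hypothetical and the command is echoed as a dry run rather than executed:

```shell
#!/bin/sh
# Dry-run sketch: stripe four EBS volumes with a 16 KB chunk via mdadm.
# Device names are hypothetical; RAID 0 is an assumption.
CHUNK_KB=16
DEVICES="/dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi"
N=$(set -- ${DEVICES}; echo $#)   # count the devices portably
CMD="mdadm --create /dev/md0 --level=0 --chunk=${CHUNK_KB} --raid-devices=${N} ${DEVICES}"
echo "$CMD"
```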
  5. Goals for this stage: start using a more mature data warehouse platform. Needs: high-volume writes; performs OK for ad-hoc queries; OK to store de-normalized fact tables and aggregations; mostly SQL-compliant (vs. MDX, for example); support for scale-out, e.g. clustering. MySQL is a great DBMS for OLTP, not so much for answering BI questions. Remove the dependency on custom ETL with a tool that doesn't affect replication or otherwise screw up the transactional layer; replace it with a tool that allows for replication of all data if we like, and an easy means to filter out whatever we don't want. What we did: signed up with Full360 to implement Vertica. Full360 takes care of provisioning the instance, has management scripts for things like backups, and handles the onerous licensing issues that come with Vertica. Vertica takes writes at high volume very well, and is fast to return answers; not web-fast, but fast enough for background processing. Started testing Continuent's Tungsten replication product. Our DBA wrote a custom filter to forward data to Vertica; it connects via JDBC, writing changes on a per-row basis, using row-based MySQL replication. Gotchas: Vertica costs go up with data volume, and Vertica doesn't accept single-row writes very quickly (but batch writes are very fast). Why we had to move from this on to the next stage: we had a few large publishers start using our API, including a large sports organization I'm sure you're familiar with, MLB. The custom in-house applier that we wrote couldn't keep up with the volume of writes.
  6. (For those of you who can't read this:) if so-and-so hit a single, every single user would get a badge, at the same time. I'm sure you can imagine how that raised some eyebrows on our side of things; in contrast to normal web traffic, this was “spike” defined. We all started paying heavy attention to the game schedules. We were able to handle the load well, but it's an example of how having an API can bring surprises. They're continuing to use us this year.
  7. Goals for this stage: keep up with the rate of inserts and updates from the transactional system, applying changes directly to Vertica in near-real-time. As I mentioned, our custom applier wasn't keeping up, so we called Continuent. They'd wanted to build an applier for Vertica for some time anyway; to my knowledge, this hadn't been done before. Things that help when you're working with a third party like this: a highly normalized schema, simple data types, and a low rate of change to the schema. We had that because the API was designed to be very flexible right out of the gate. Gotchas: time zones (Robert, do you want to touch on that a bit?); source data stored in UTC; transforms should ideally be done at the presentation layer. Character sets: use UTF-8 across the board. Why we had to move on from this stage: another large publisher with a huge audience, also a large sports organization, loomed on the horizon, and we needed to scale our systems 20x to handle their traffic. So we sharded our transactional layer.
  8. Six weeks of no sleep. We didn't port ledger data, just configuration data. Thank you, team!
  9. Goals for this stage: change the Vertica applier so that it can be run from a MySQL slave, doing remote batch apply to Vertica; run 16 appliers from a sharded MySQL cluster into Vertica; and maintain a backlog of data for ad-hoc queries, otherwise culling and archiving older data to keep the Vertica data set small. Still in beta test, but looking promising.
  10. In summary: Tungsten is a great tool for MySQL ETL automation, so check it out as an alternative to custom in-house scripts or other options. Vertica is a high-performance, scalable BI platform that now pairs well with Tungsten; Full360 offers a cloud-based solution. If you're just getting started on the BI front, hire a BI developer to focus on this stuff, if you can. I see no reason why this framework couldn't easily scale to handle whatever our business needs in the future.