Designing Scalable Data Warehouse Using MySQL


O'Reilly MySQL Conference 2011 - Designing MySQL as a data warehouse solution to handle terabytes of data, comprising OLTP, ETL, OLAP and reporting

Published in: Technology
  • MySQL employee 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives such as Sherpa and Mobstor, and a platform. MySQL geek; still contribute occasionally to the MySQL source.
  • Ask how many people use MySQL and how many do not.
  • OLAP: On-Line Analytical Processing. MOLAP: multidimensional cubes (not applicable in a relational schema). ROLAP: relational OLAP (aggregation, MySQL ROLLUP). HOLAP: hybrid (combination of both).
  • DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.

    1. 1. Designing Scalable Data Warehouse with MySQL. Venu Anuganti, Apr 2011, O'Reilly MySQL Conference
    2. 2. Who am I <ul><li>Data Architect, Database Kernel / Internals Engineer </li></ul><ul><li>Part of 5 database kernels (MySQL - 2000 to 2004) </li></ul><ul><li>Design, Implement and Scale SQL, NoSQL, Analytics and Data Warehouse solutions </li></ul><ul><li>Large scale data handling for Social Networking, SaaS, Click Tracking, Games, Recommendation, Advertisement, Mobile and SEM marketing </li></ul><ul><li>Blog: </li></ul>
    3. 3. Agenda <ul><li>Role of MySQL In </li></ul><ul><ul><li>Data World </li></ul></ul><ul><ul><li>Data Warehouse & Analytics </li></ul></ul><ul><ul><li>Real-time Analytics </li></ul></ul><ul><ul><li>Large Data </li></ul></ul><ul><li>How To Build Scalable Data Warehouse </li></ul><ul><li>Q & A </li></ul>
    4. 4. <ul><li>Data & MySQL Everywhere </li></ul>
    5. 5. Data Is The Business <ul><li>Many new business models are data-centric </li></ul><ul><li>Web scale, social networking, real-time and interactive </li></ul><ul><ul><li>almost all companies talk about their data </li></ul></ul><ul><ul><li>millions of users, clients, customers, applications, … </li></ul></ul><ul><ul><li>terabytes to petabytes of data day to day </li></ul></ul><ul><ul><li>performance & scalability are key factors </li></ul></ul>
    6. 6. Data Drives Business <ul><li>Businesses can only grow if they make proper use of data </li></ul><ul><ul><li>statistics, mining, real-time </li></ul></ul><ul><ul><li>reporting, analytics </li></ul></ul><ul><ul><li>re-targeting </li></ul></ul><ul><ul><li>recommendation </li></ul></ul><ul><li>Examples of data-driven companies </li></ul><ul><ul><li>Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, FourSquare, AppStores, mobile/web analytics, … </li></ul></ul><ul><ul><li>Any API-driven company </li></ul></ul><ul><ul><li>Almost all newly emerging companies </li></ul></ul>
    7. 7. Data Solution Providers <ul><li>Companies emerging to solve data-centric problems </li></ul><ul><ul><li>Cloudera </li></ul></ul><ul><ul><li>Percona </li></ul></ul><ul><ul><li>Cloud and SaaS solutions </li></ul></ul><ul><li>Large companies solve problems as part of their business </li></ul><ul><ul><li>Google (Bigtable, App Engine, Megastore …) </li></ul></ul><ul><ul><li>Amazon (Dynamo, EC2, …) </li></ul></ul><ul><ul><li>Yahoo (Hadoop, Pig …) </li></ul></ul><ul><ul><li>Facebook (Cassandra, Messaging System Implementation, …) </li></ul></ul><ul><ul><li>Twitter (FlockDB, …) </li></ul></ul><ul><ul><li>Zynga (Membase) </li></ul></ul>
    8. 8. MySQL In Every Company <ul><li>90% of the companies that deal with data use MySQL in some form </li></ul><ul><ul><li>Mainly OLTP </li></ul></ul><ul><ul><li>LAMP, Website, Blog services </li></ul></ul><ul><ul><li>SaaS, Cloud services </li></ul></ul><ul><ul><li>Analytics, Stats, Reporting tools </li></ul></ul><ul><ul><li>Warehousing and … </li></ul></ul><ul><li>Terabytes of data, if not petabytes, in MySQL </li></ul><ul><li>MySQL is the de facto development model for all developers and startups, until a saturation point </li></ul>
    9. 9. <ul><li>Role of MySQL </li></ul><ul><li>In </li></ul><ul><li>Data Warehouse </li></ul>
    10. 10. What Is Data Warehousing <ul><li>Data store repository with complete view of the business data </li></ul><ul><ul><li>Aggregated data from multiple sources </li></ul></ul><ul><ul><li>Active users/customers </li></ul></ul><ul><ul><li>Total sales/orders for a given period </li></ul></ul><ul><ul><li>Growth and retention rate </li></ul></ul><ul><ul><li>Top/Poorly performing apps, pages, games, ads, … </li></ul></ul><ul><ul><li>Spend vs performance </li></ul></ul><ul><li>Driving Factors </li></ul><ul><ul><li>Business Intelligence, BI </li></ul></ul><ul><ul><li>Data Analysis & Mining </li></ul></ul><ul><ul><li>Reporting / Dashboards </li></ul></ul><ul><ul><li>Business Decisions </li></ul></ul>
    11. 11. Data Warehouse Buzzwords <ul><li>Common Buzzwords </li></ul><ul><ul><li>Dimension Tables </li></ul></ul><ul><ul><li>Fact Tables </li></ul></ul><ul><ul><li>Aggregate Tables </li></ul></ul><ul><ul><li>ETL </li></ul></ul><ul><ul><li>Staging </li></ul></ul><ul><ul><li>Production </li></ul></ul><ul><ul><li>OLAP </li></ul></ul><ul><ul><li>Data Mart </li></ul></ul><ul><ul><li>Data Source </li></ul></ul><ul><ul><li>MOLAP, ROLAP, HOLAP </li></ul></ul><ul><ul><li>Star schema & Snowflake </li></ul></ul>
    12. 12. Typical Architecture
    13. 13. Data Model – Star Schema
    14. 14. Data Model – Snowflake
    15. 15. Data Models <ul><li>Bottom-up </li></ul><ul><ul><li>Current system data, dimensions and log events dictate the model </li></ul></ul><ul><li>Top-down </li></ul><ul><ul><li>Business and reporting needs dictate the model </li></ul></ul><ul><li>Hybrid </li></ul><ul><ul><li>Compromise between bottom-up and top-down, then implement the model </li></ul></ul>
    16. 16. Data Source <ul><li>Identify the source of data </li></ul><ul><ul><li>Dimension data </li></ul></ul><ul><ul><ul><li>Typically from the OLTP system </li></ul></ul></ul><ul><ul><li>Fact data </li></ul></ul><ul><ul><ul><li>Weblogs </li></ul></ul></ul><ul><ul><ul><ul><li>Use Hadoop/MapReduce/Pig to transform them to CSV files </li></ul></ul></ul></ul><ul><ul><ul><li>CRM, SalesForce, Marketing </li></ul></ul></ul><ul><ul><ul><li>Click, Conversion Tracking </li></ul></ul></ul><ul><ul><ul><li>External feeds, reports, scraping etc </li></ul></ul></ul>
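The weblog-to-CSV step on this slide can be illustrated on a single machine. The slide itself uses Hadoop and Pig for this; the sketch below swaps in plain Python purely to show the shape of the transform. It assumes the logs are in Apache common log format, and the pattern, function name and column choice (host, timestamp, method, path, status) are hypothetical.

```python
import csv
import io
import re

# Assumed input: Apache common log format, e.g.
# 127.0.0.1 - - [12/Apr/2011:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+'
)


def weblog_to_csv(lines):
    """Return CSV text with host, timestamp, method, path, status per log line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # silently skip lines that do not parse
            writer.writerow(match.groups())
    return out.getvalue()
```

The resulting file is the kind of input a bulk loader would read into a staging fact table.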
    17. 17. Typical MySQL Setup <ul><li>Define star/snowflake/comb schema </li></ul><ul><li>Isolate production and staging </li></ul><ul><li>Production environment </li></ul><ul><ul><li>All OLTP happens here </li></ul></ul><ul><ul><li>Two schemas, OLTP and OLAP </li></ul></ul><ul><li>Staging environment </li></ul><ul><ul><li>Replicate OLTP tables </li></ul></ul><ul><ul><li>ETL - OLAP </li></ul></ul><ul><ul><ul><li>Map surrogate keys between dimension (OLTP) and fact tables </li></ul></ul></ul><ul><ul><ul><li>Load data into fact tables </li></ul></ul></ul><ul><ul><ul><li>Generate aggregate tables for frequent access </li></ul></ul></ul><ul><ul><li>Reload the final aggregated tables into production </li></ul></ul>
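The staging ETL steps on this slide (map surrogate keys, load facts, generate aggregates) can be sketched end to end. This is a minimal illustration, not the deck's actual schema: it uses Python's built-in sqlite3 as a stand-in for a MySQL staging server so that it runs standalone, and every table and column name (dim_user, stg_events, fact_sales, agg_daily_sales) is hypothetical.

```python
import sqlite3


def run_etl(users, events):
    """Load facts keyed by surrogate key, then build a daily aggregate table."""
    conn = sqlite3.connect(":memory:")  # stand-in for the staging database
    conn.executescript("""
        CREATE TABLE dim_user (
            user_key INTEGER PRIMARY KEY,  -- surrogate key
            user_id  TEXT UNIQUE           -- natural key replicated from OLTP
        );
        CREATE TABLE stg_events (user_id TEXT, event_date TEXT, amount REAL);
        CREATE TABLE fact_sales (user_key INTEGER, event_date TEXT, amount REAL);
        CREATE TABLE agg_daily_sales (
            event_date TEXT PRIMARY KEY, orders INTEGER, revenue REAL
        );
    """)
    conn.executemany("INSERT INTO dim_user (user_id) VALUES (?)",
                     [(u,) for u in users])
    conn.executemany("INSERT INTO stg_events VALUES (?, ?, ?)", events)
    # Map each natural key to its surrogate key while loading the fact table.
    conn.execute("""
        INSERT INTO fact_sales (user_key, event_date, amount)
        SELECT d.user_key, s.event_date, s.amount
        FROM stg_events s JOIN dim_user d ON d.user_id = s.user_id
    """)
    # Pre-compute the aggregate that reports read most often.
    conn.execute("""
        INSERT INTO agg_daily_sales
        SELECT event_date, COUNT(*), SUM(amount)
        FROM fact_sales GROUP BY event_date
    """)
    rows = conn.execute(
        "SELECT event_date, orders, revenue FROM agg_daily_sales "
        "ORDER BY event_date"
    ).fetchall()
    conn.close()
    return rows
```

In the setup the slide describes, only the final aggregate table would be reloaded into production; the staging and fact tables stay on the staging side.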
    18. 18. MySQL Configuration <ul><li>MySQL is mainly designed for OLTP workloads </li></ul><ul><li>For OLAP, it is sub-optimal </li></ul><ul><ul><li>Keep the dataset small (< 500 GB per server) </li></ul></ul><ul><ul><li>Pre-aggregate source data as much as possible </li></ul></ul><ul><ul><li>Use InnoDB for all staging tables, with upserts via INSERT ... ON DUPLICATE KEY UPDATE (or use Hive) </li></ul></ul><ul><ul><li>Use MyISAM for production read-only tables for faster loads and packed keys </li></ul></ul><ul><ul><li>Use partitions for easy purging </li></ul></ul><ul><ul><li>Build OLAP cubes (Pentaho Mondrian or any reporting solution) </li></ul></ul><ul><ul><li>Compression, sequential IO and read-ahead are key </li></ul></ul>
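Two of the recommendations above, pre-aggregating source data and upserting into InnoDB staging tables, combine naturally. The sketch below collapses raw hits into one row per (date, page) in memory and pairs the result with a MySQL upsert statement. The table stg_page_hits, its columns, and the unique key on (hit_date, page) that the upsert relies on are assumptions for illustration; the statement is only built here, not executed.

```python
from collections import Counter

# MySQL upsert: insert a new row, or add to the existing count when the
# (hit_date, page) unique key already exists. Table and columns are hypothetical.
UPSERT_SQL = (
    "INSERT INTO stg_page_hits (hit_date, page, hits) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)"
)


def preaggregate(raw_hits):
    """Collapse raw (hit_date, page) events into sorted (hit_date, page, hits) rows."""
    counts = Counter(raw_hits)
    return sorted((day, page, n) for (day, page), n in counts.items())
```

With a DB-API driver that uses the `%s` parameter style, the rows could then be applied with something like `cursor.executemany(UPSERT_SQL, preaggregate(raw_hits))`, so the database sees one statement per distinct key instead of one per raw event.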
    19. 19. Scale-out
    20. 20. Common Use Cases – MySQL way <ul><li>Small datasets </li></ul><ul><li>Real-time analytics – count, histogram etc </li></ul><ul><li>Standard reporting </li></ul><ul><li>Historical data </li></ul><ul><li>BI, analytics and OLAP applications </li></ul>
    21. 21. Limitations <ul><li>Understand the hard limits of MySQL </li></ul><ul><li>MySQL is not a scalable warehouse solution </li></ul><ul><ul><li>Fits well for small datasets or to get started initially </li></ul></ul><ul><ul><li>Use columnar, compression-enabled engines </li></ul></ul><ul><ul><li>Lacks support for analytic functions </li></ul></ul><ul><li>Large datasets – Stay away </li></ul><ul><ul><li>Unless you have small data models/marts that can be distributed across multiple nodes </li></ul></ul>
    22. 22. Cache <ul><li>For small datasets </li></ul><ul><ul><li>Page cache </li></ul></ul><ul><ul><li>Query cache </li></ul></ul><ul><ul><li>Buffer pool </li></ul></ul><ul><li>For advanced reporting </li></ul><ul><ul><li>Build OLAP cubes </li></ul></ul><ul><ul><li>Most BI and reporting solutions have in-memory OLAP </li></ul></ul>
    23. 23. Other Analytic Stores <ul><li>Columnar storage, compression and MPP are the de facto standard </li></ul><ul><li>Other data warehouse solutions ($$$$..) </li></ul><ul><ul><li>Greenplum (+ DCA appliance – part of EMC now) </li></ul></ul><ul><ul><li>Vertica (a breakthrough; currently my favorite data store, using it for more than a year now; part of HP now) </li></ul></ul><ul><ul><li>AsterData (now part of Teradata) </li></ul></ul><ul><ul><li>Oracle Exadata </li></ul></ul><ul><ul><li>ParAccel (Bruce Scott, co-founder of Oracle) </li></ul></ul><ul><ul><li>InfoBright (MySQL based) </li></ul></ul><ul><ul><li>InfiniDB (open source, Calpont appliance) </li></ul></ul><ul><ul><li>Netezza (appliance – IBM owns it now) </li></ul></ul><ul><ul><li>XtremeData dbX (appliance) </li></ul></ul><ul><ul><li>Teradata, and a few more </li></ul></ul>
    24. 24. Real-time Analytics <ul><li>The warehouse is not real-time </li></ul><ul><ul><li>Refreshed hourly or daily depending on business needs </li></ul></ul><ul><li>(Near) Real-time analytics </li></ul><ul><ul><li>Performance and scalability challenges </li></ul></ul><ul><ul><li>Identify common metric(s) for real-time, e.g.: </li></ul></ul><ul><ul><ul><li>Active users currently online, users playing, friends online etc </li></ul></ul></ul><ul><ul><li>Implement as counter-based atomic operations </li></ul></ul>
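The "counter-based atomic operation" idea can be shown with an in-process sketch. The slide does not say where the counters live (a database row, a cache, or application memory), so this example assumes a single Python process and uses a lock to make each increment atomic; the class and metric names are hypothetical.

```python
import threading


class AtomicCounters:
    """Named counters whose updates are atomic across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def add(self, metric, delta=1):
        """Atomically add delta to a metric and return the new value."""
        with self._lock:
            self._values[metric] = self._values.get(metric, 0) + delta
            return self._values[metric]

    def get(self, metric):
        """Return the current value of a metric (0 if never updated)."""
        with self._lock:
            return self._values.get(metric, 0)
```

A login handler would call `counters.add("users_online")` and a logout handler `counters.add("users_online", -1)`, so reading the metric is a constant-time lookup rather than a scan over session data.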
    25. 25. Questions ? <ul><li> </li></ul><ul><li>[email_address] </li></ul><ul><li>Twitter: @vanuganti </li></ul>