Designing Scalable Data Warehouse Using MySQL

O'Reilly MySQL Conference 2011 - Designing MySQL as a data warehouse solution to handle terabytes of data, which comprises OLTP, ETL, OLAP and reporting

Published in: Technology

  1. 1. Designing Scalable Data Warehouse with MySQL Venu Anuganti Apr 2011, O'Reilly MySQL Conference http://venublog.com/
  2. 2. Who am I <ul><li>Data Architect, Database Kernel / Internals Engineer </li></ul><ul><li>Part of 5 database kernels (MySQL - 2000 to 2004) </li></ul><ul><li>Design, Implement and Scale SQL, NoSQL, Analytics and Data Warehouse solutions </li></ul><ul><li>Large scale data handling for Social Networking, SaaS, Click Tracking, Games, Recommendation, Advertisement, Mobile and SEM marketing </li></ul><ul><li>Blog: http://venublog.com/ </li></ul>
  3. 3. Agenda <ul><li>Role of MySQL In </li></ul><ul><ul><li>Data World </li></ul></ul><ul><ul><li>Data Warehouse & Analytics </li></ul></ul><ul><ul><li>Real-time Analytics </li></ul></ul><ul><ul><li>Large Data </li></ul></ul><ul><li>How To Build Scalable Data Warehouse </li></ul><ul><li>Q & A </li></ul>
  4. 4. <ul><li>Data & MySQL Everywhere </li></ul>
  5. 5. Data Is The Business <ul><li>Many new business models are data-centric </li></ul><ul><li>Web scale, social networking, real-time and interactive </li></ul><ul><ul><li>almost all companies talk about their data </li></ul></ul><ul><ul><li>millions of users, clients, customers, applications, … </li></ul></ul><ul><ul><li>terabytes to petabytes of data day to day </li></ul></ul><ul><ul><li>performance and scalability are key factors </li></ul></ul>
  6. 6. Data Drives Business <ul><li>Businesses can only grow if they make proper use of their data </li></ul><ul><ul><li>statistics, mining, real-time </li></ul></ul><ul><ul><li>reporting, analytics </li></ul></ul><ul><ul><li>re-targeting </li></ul></ul><ul><ul><li>recommendation </li></ul></ul><ul><li>Examples of data-driven companies </li></ul><ul><ul><li>Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, FourSquare, AppStores, mobile/web analytics, … </li></ul></ul><ul><ul><li>Any API-driven company </li></ul></ul><ul><ul><li>Almost all newly emerging companies </li></ul></ul>
  7. 7. Data Solution Providers <ul><li>Companies emerging to solve data-centric problems </li></ul><ul><ul><li>Cloudera </li></ul></ul><ul><ul><li>Percona </li></ul></ul><ul><ul><li>Cloud and SaaS solutions </li></ul></ul><ul><li>Large companies solve problems as part of their business </li></ul><ul><ul><li>Google (Bigtable, App Engine, Megastore …) </li></ul></ul><ul><ul><li>Amazon (Dynamo, EC2, …) </li></ul></ul><ul><ul><li>Yahoo (Hadoop, Pig …) </li></ul></ul><ul><ul><li>Facebook (Cassandra, Messaging System Implementation, …) </li></ul></ul><ul><ul><li>Twitter (FlockDB, …) </li></ul></ul><ul><ul><li>Zynga (Membase) </li></ul></ul>
  8. 8. MySQL In Every Company <ul><li>90% of the companies that deal with data use MySQL in some form </li></ul><ul><ul><li>Mainly OLTP </li></ul></ul><ul><ul><li>LAMP, Website, Blog services </li></ul></ul><ul><ul><li>SaaS, Cloud services </li></ul></ul><ul><ul><li>Analytics, Stats, Reporting tools </li></ul></ul><ul><ul><li>Warehousing and … </li></ul></ul><ul><li>Terabytes of data, if not petabytes, in MySQL </li></ul><ul><li>MySQL is the de facto development model for developers and startups, until a saturation point </li></ul>
  9. 9. <ul><li>Role of MySQL </li></ul><ul><li>In </li></ul><ul><li>Data Warehouse </li></ul>
  10. 10. What Is Data Warehousing <ul><li>Data repository with a complete view of the business data </li></ul><ul><ul><li>Aggregated data from multiple sources </li></ul></ul><ul><ul><li>Active users/customers </li></ul></ul><ul><ul><li>Total sales/orders for a given period </li></ul></ul><ul><ul><li>Growth and retention rate </li></ul></ul><ul><ul><li>Top/Poorly performing apps, pages, games, ads, … </li></ul></ul><ul><ul><li>Spend vs performance </li></ul></ul><ul><li>Driving Factors </li></ul><ul><ul><li>Business Intelligence, BI </li></ul></ul><ul><ul><li>Data Analysis & Mining </li></ul></ul><ul><ul><li>Reporting / Dashboards </li></ul></ul><ul><ul><li>Business Decisions </li></ul></ul>
  11. 11. Data Warehouse Buzzwords <ul><li>Common Buzzwords </li></ul><ul><ul><li>Dimension Tables </li></ul></ul><ul><ul><li>Fact Tables </li></ul></ul><ul><ul><li>Aggregate Tables </li></ul></ul><ul><ul><li>ETL </li></ul></ul><ul><ul><li>Staging </li></ul></ul><ul><ul><li>Production </li></ul></ul><ul><ul><li>OLAP </li></ul></ul><ul><ul><li>Data Mart </li></ul></ul><ul><ul><li>Data Source </li></ul></ul><ul><ul><li>MOLAP, ROLAP, HOLAP </li></ul></ul><ul><ul><li>Star schema & Snowflake </li></ul></ul>
  12. 12. Typical Architecture
  13. 13. Data Model – Star Schema
  14. 14. Data Model – Snowflake
  15. 15. Data Models <ul><li>Bottom-up </li></ul><ul><ul><li>Current system data, dimensions and log events dictate the model </li></ul></ul><ul><li>Top-down </li></ul><ul><ul><li>Business and reporting needs dictate the model </li></ul></ul><ul><li>Hybrid </li></ul><ul><ul><li>Compromise between bottom-up and top-down, then implement the model </li></ul></ul>
  16. 16. Data Source <ul><li>Identify the source of data </li></ul><ul><ul><li>Dimension data </li></ul></ul><ul><ul><ul><li>Typically from the OLTP system </li></ul></ul></ul><ul><ul><li>Fact data </li></ul></ul><ul><ul><ul><li>Weblogs </li></ul></ul></ul><ul><ul><ul><ul><li>Use Hadoop/MapReduce/Pig to transform them into CSV files </li></ul></ul></ul></ul><ul><ul><ul><li>CRM, Salesforce, Marketing </li></ul></ul></ul><ul><ul><ul><li>Click, Conversion Tracking </li></ul></ul></ul><ul><ul><ul><li>External feeds, reports, scraping etc </li></ul></ul></ul>
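As an illustration of the weblog-to-CSV step, here is a minimal single-machine sketch in Python. The slide names Hadoop/Pig for this job; the script below only shows the shape of the transformation. The log layout (Common Log Format), the column names and the `weblog_to_csv` helper are assumptions made for the example, not part of the talk.

```python
import csv
import io
import re

# Assumed log layout (Common Log Format); real web logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def weblog_to_csv(lines):
    """Convert raw log lines to CSV text, skipping lines that do not parse."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ip", "ts", "method", "path", "status"])
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            writer.writerow([m["ip"], m["ts"], m["method"], m["path"], m["status"]])
    return out.getvalue()


# Hypothetical sample input: one valid line and one line of noise.
sample = [
    '10.0.0.1 - - [12/Apr/2011:10:00:00 -0700] "GET /game/42 HTTP/1.1" 200 512',
    "not a log line",
]
csv_text = weblog_to_csv(sample)
print(csv_text)
```

The resulting CSV could then be bulk-loaded into a staging fact table.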
  17. 17. Typical MySQL Setup <ul><li>Define star/snowflake/comb schema </li></ul><ul><li>Isolate production and staging </li></ul><ul><li>Production environment </li></ul><ul><ul><li>All OLTP happens here </li></ul></ul><ul><ul><li>Two schemas, OLTP and OLAP </li></ul></ul><ul><li>Staging environment </li></ul><ul><ul><li>Replicate OLTP tables </li></ul></ul><ul><ul><li>ETL - OLAP </li></ul></ul><ul><ul><ul><li>Map surrogate keys between dimension (OLTP) and fact tables </li></ul></ul></ul><ul><ul><ul><li>Load data into fact tables </li></ul></ul></ul><ul><ul><ul><li>Generate aggregate tables for frequent access </li></ul></ul></ul><ul><ul><li>Reload the final aggregated tables into production </li></ul></ul>
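The three ETL steps on this slide (map surrogate keys, load facts, build aggregates) can be sketched as follows. This is a toy illustration that uses Python's built-in `sqlite3` as a stand-in for a MySQL staging server so that it runs anywhere; the table names (`dim_game`, `fact_play`, `agg_daily_plays`) and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the staging database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: surrogate key (game_key) for the OLTP natural key.
    CREATE TABLE dim_game (
        game_key  INTEGER PRIMARY KEY,
        game_code TEXT UNIQUE NOT NULL
    );
    -- Fact table: rows reference the dimension by surrogate key.
    CREATE TABLE fact_play (
        game_key  INTEGER NOT NULL REFERENCES dim_game(game_key),
        play_date TEXT NOT NULL,
        plays     INTEGER NOT NULL
    );
    -- Aggregate table for frequently run reports.
    CREATE TABLE agg_daily_plays (
        game_key  INTEGER NOT NULL,
        play_date TEXT NOT NULL,
        plays     INTEGER NOT NULL,
        PRIMARY KEY (game_key, play_date)
    );
""")
conn.executemany("INSERT INTO dim_game (game_code) VALUES (?)",
                 [("poker",), ("farm",)])

# Hypothetical staged events keyed by the natural key (game_code).
staged = [("poker", "2011-04-12", 3),
          ("poker", "2011-04-12", 2),
          ("farm", "2011-04-12", 7)]

# Step 1: map natural keys to surrogate keys.
key_of = dict(conn.execute("SELECT game_code, game_key FROM dim_game"))

# Step 2: load the fact table using the surrogate keys.
conn.executemany("INSERT INTO fact_play VALUES (?, ?, ?)",
                 [(key_of[code], day, plays) for code, day, plays in staged])

# Step 3: generate the aggregate table from the facts.
conn.execute("""
    INSERT INTO agg_daily_plays
    SELECT game_key, play_date, SUM(plays)
    FROM fact_play
    GROUP BY game_key, play_date
""")

daily = conn.execute("""
    SELECT g.game_code, a.play_date, a.plays
    FROM agg_daily_plays a
    JOIN dim_game g ON g.game_key = a.game_key
    ORDER BY g.game_code
""").fetchall()
print(daily)
```

In the setup the slide describes, the aggregate table is what gets reloaded into production, so reporting queries never scan the raw facts.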
  18. 18. MySQL Configuration <ul><li>MySQL is mainly designed for OLTP workloads </li></ul><ul><li>For OLAP, it is sub-optimal </li></ul><ul><ul><li>Highly recommended to keep the dataset small (< 500 GB per server) </li></ul></ul><ul><ul><li>Pre-aggregate source data as much as possible </li></ul></ul><ul><ul><li>Use InnoDB for all staging tables, with upserts via INSERT ... ON DUPLICATE KEY UPDATE (or use Hive) </li></ul></ul><ul><ul><li>Use MyISAM for production read-only tables for faster loads and packed keys </li></ul></ul><ul><ul><li>Use partitions for easy purging </li></ul></ul><ul><ul><li>Build OLAP cubes (Pentaho Mondrian or any reporting solution) </li></ul></ul><ul><ul><li>Compression, sequential IO and read-ahead are key </li></ul></ul>
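The upsert pattern recommended for staging tables can be sketched as below. To keep the example runnable without a MySQL server, it uses Python's `sqlite3` and SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later) as a stand-in; on MySQL the equivalent statement would end in `ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)`. The `stage_daily_clicks` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for an InnoDB staging table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stage_daily_clicks (
        ad_id     INTEGER NOT NULL,
        click_day TEXT NOT NULL,
        clicks    INTEGER NOT NULL,
        PRIMARY KEY (ad_id, click_day)
    )
""")

# SQLite spelling of the upsert; MySQL would use
#   INSERT ... ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)
UPSERT = """
    INSERT INTO stage_daily_clicks (ad_id, click_day, clicks)
    VALUES (?, ?, ?)
    ON CONFLICT (ad_id, click_day)
    DO UPDATE SET clicks = clicks + excluded.clicks
"""

# Hypothetical pre-aggregated batches; ad 1 arrives twice for the same day.
batches = [(1, "2011-04-12", 10),
           (2, "2011-04-12", 4),
           (1, "2011-04-12", 5)]
conn.executemany(UPSERT, batches)

totals = conn.execute(
    "SELECT ad_id, clicks FROM stage_daily_clicks ORDER BY ad_id"
).fetchall()
print(totals)
```

Because the primary key identifies the aggregation grain, repeated batches accumulate into one row per ad and day instead of creating duplicates.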
  19. 19. Scale-out
  20. 20. Common Use Cases – MySQL way <ul><li>Small datasets </li></ul><ul><li>Real-time analytics – count, histogram etc </li></ul><ul><li>Standard reporting </li></ul><ul><li>Historical data </li></ul><ul><li>BI, analytics and OLAP applications </li></ul>
  21. 21. Limitations <ul><li>Understand the hard limits of MySQL </li></ul><ul><li>MySQL is not a scalable warehouse solution </li></ul><ul><ul><li>Fits well for small datasets or for getting started </li></ul></ul><ul><ul><li>Use columnar, compression-enabled engines </li></ul></ul><ul><ul><li>Lacks analytic function support </li></ul></ul><ul><li>Large datasets – Stay away </li></ul><ul><ul><li>Unless you have small data models/marts that can be distributed across multiple nodes </li></ul></ul>
  22. 22. Cache <ul><li>For small datasets </li></ul><ul><ul><li>Page cache </li></ul></ul><ul><ul><li>Query cache </li></ul></ul><ul><ul><li>Buffer pool </li></ul></ul><ul><li>For advanced reporting </li></ul><ul><ul><li>Build OLAP cubes </li></ul></ul><ul><ul><li>Most BI and reporting solutions have in-memory OLAP </li></ul></ul>
  23. 23. Other Analytic Stores <ul><li>Columnar storage, compression and MPP are the de facto standard </li></ul><ul><li>Other data warehouse solutions ($$$$..) </li></ul><ul><ul><li>Greenplum (+ DCA appliance – part of EMC now) </li></ul></ul><ul><ul><li>Vertica (breakthrough, currently my favorite data store, using it for more than a year now, part of HP now) </li></ul></ul><ul><ul><li>AsterData (now part of Teradata) </li></ul></ul><ul><ul><li>Oracle Exadata </li></ul></ul><ul><ul><li>ParAccel (Oracle co-founder Bruce Scott) </li></ul></ul><ul><ul><li>Infobright (MySQL based) </li></ul></ul><ul><ul><li>InfiniDB (open source, Calpont appliance) </li></ul></ul><ul><ul><li>Netezza (appliance – IBM owns it now) </li></ul></ul><ul><ul><li>XtremeData dbX (appliance) </li></ul></ul><ul><ul><li>Teradata, and a few more </li></ul></ul>
  24. 24. Real-time Analytics <ul><li>Warehouse is not real-time </li></ul><ul><ul><li>Refreshed hourly or daily depending on business needs </li></ul></ul><ul><li>(Near) Real-time analytics </li></ul><ul><ul><li>Performance and scalability challenges </li></ul></ul><ul><ul><li>Identify common metric(s) for real-time, e.g.: </li></ul></ul><ul><ul><ul><li>Active users currently online, users playing, friends online etc </li></ul></ul></ul><ul><ul><li>Implement as counter-based atomic operations </li></ul></ul>
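A counter-based atomic operation for a metric such as "users currently online" can be sketched in-process as below. The slide does not say where the counter lives; this example assumes a single Python process and protects the counter with a lock. The `AtomicCounter` class and `simulate_logins` helper are hypothetical names introduced for illustration. In a multi-server deployment the same idea would typically rely on an atomic increment in a shared store, for example a single-statement `UPDATE ... SET n = n + 1`.

```python
import threading


class AtomicCounter:
    """Counter whose increments are serialized by a lock (illustrative)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def incr(self, delta=1):
        # The read-modify-write happens entirely under the lock,
        # so concurrent callers never lose an update.
        with self._lock:
            self._value += delta
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value


users_online = AtomicCounter()


def simulate_logins(n):
    """Hypothetical workload: n login events, each bumping the counter."""
    for _ in range(n):
        users_online.incr()


# Eight concurrent workers, 1000 logins each.
threads = [threading.Thread(target=simulate_logins, args=(1000,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(users_online.value)
```

Reading such a counter is a constant-time lookup, which is what makes it suitable for near real-time dashboards where a scan of the fact table would be too slow.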
  25. 25. Questions ? <ul><li>http://venublog.com/ </li></ul><ul><li>[email_address] </li></ul><ul><li>Twitter: @vanuganti </li></ul>
