Role of MySQL in Data Analytics, Warehousing


Role of MySQL in Data Analytics, Data Warehouse and Large Data At Scale

Published in: Technology, Business

  • MySQL employee, 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives such as Sherpa and Mobstor, and a platform. MySQL geek; still contribute to the MySQL source now and then.
  • It's going to be a pretty short talk, about 20 minutes, leaving the rest for Q & A
  • Well, almost all companies have a MySQL deployment at one time or another…
  • If Facebook, Twitter or someone else uses NoSQL, that does not mean everyone has to use it. If someone scales using MySQL, that does not mean everyone can use the same approach.
  • OLAP: On-Line Analytical Processing. MOLAP: multi-dimensional cubes (not applicable in a relational schema). ROLAP: relational OLAP (aggregation, MySQL ROLLUP). HOLAP: a combination of both.
  • DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.

    1. 1. Role of MySQL In Data Analytics, Warehouse, Large Data At Scale. Venu Anuganti, Feb 2011, Percona Live
    2. 2. Who am I <ul><li>Data Architect, Database Kernel / Internals Engineer </li></ul><ul><li>Part of 5 database kernels (MySQL - 2000 to 2004) </li></ul><ul><li>Implement and Scale SQL, NoSQL, Analytics and Data Warehouse solutions </li></ul><ul><li>Large scale data handling for Social Networking, SaaS, Click Tracking, Games, Recommendation, Advertisement, Mobile and SEM marketing </li></ul><ul><li>Blog: </li></ul>
    3. 3. Agenda <ul><li>Buzz Around SQL and NoSQL </li></ul><ul><li>Role of MySQL In </li></ul><ul><ul><li>Data World </li></ul></ul><ul><ul><li>Data Warehouse & Analytics </li></ul></ul><ul><ul><li>Real-time Analytics </li></ul></ul><ul><ul><li>Large Data </li></ul></ul><ul><li>How To Build Scalable Data Warehouse </li></ul><ul><li>Q & A </li></ul>
    4. 4. Buzzzzzzz <ul><li>Lately everyone is talking about NoSQL </li></ul><ul><li>What is happening to (My)SQL? </li></ul><ul><li>Does that mean the end of (My)SQL? </li></ul><ul><li>Why does nobody talk about large (My)SQL implementations? </li></ul>
    5. 5. <ul><li>Data & MySQL Everywhere </li></ul>
    6. 6. Data Is The Business <ul><li>A lot of new business models are data-centric </li></ul><ul><li>Web scale, social networking, real-time and interactive </li></ul><ul><ul><li>almost all companies talk about their data </li></ul></ul><ul><ul><li>millions of users, clients, customers, applications, … </li></ul></ul><ul><ul><li>terabytes to petabytes of data day to day </li></ul></ul><ul><ul><li>performance and scalability are key factors </li></ul></ul>
    7. 7. Data Drives Business <ul><li>Businesses can only grow if they make proper use of their data </li></ul><ul><ul><li>statistics, mining, real-time </li></ul></ul><ul><ul><li>reporting, analytics </li></ul></ul><ul><ul><li>re-targeting </li></ul></ul><ul><ul><li>recommendation </li></ul></ul><ul><li>Examples of data-driven companies </li></ul><ul><ul><li>Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, FourSquare, AppStores, mobile/web analytics, … </li></ul></ul><ul><ul><li>Any API-driven company </li></ul></ul><ul><ul><li>Almost all newly emerging companies </li></ul></ul>
    8. 8. Data Solution Providers <ul><li>Companies are emerging to solve data-centric problems </li></ul><ul><ul><li>Cloudera </li></ul></ul><ul><ul><li>Percona </li></ul></ul><ul><ul><li>Cloud and SaaS solutions </li></ul></ul><ul><li>Large companies solve problems as part of their business </li></ul><ul><ul><li>Google (Bigtable, MySQL patches, App Engine, Megastore …) </li></ul></ul><ul><ul><li>Yahoo (Hadoop, Pig …) </li></ul></ul><ul><ul><li>Facebook (Cassandra, MySQL patches, messaging system implementation, etc.) </li></ul></ul><ul><ul><li>Twitter (FlockDB, …) </li></ul></ul><ul><ul><li>Zynga (Membase) </li></ul></ul>
    9. 9. MySQL In Every Company <ul><li>90% of the companies that deal with data use MySQL </li></ul><ul><ul><li>Mainly OLTP </li></ul></ul><ul><ul><li>LAMP, Website, Blog services </li></ul></ul><ul><ul><li>SaaS, Cloud services </li></ul></ul><ul><ul><li>Analytics, Stats </li></ul></ul><ul><ul><li>Warehousing and … </li></ul></ul><ul><li>Terabytes, if not petabytes, of data in MySQL </li></ul><ul><li>MySQL is the de facto development model for all developers and startups </li></ul>
    10. 10. MySQL Widely Adopted <ul><li>Simple, easy to learn and adopt </li></ul><ul><li>Widely in use for 10+ years </li></ul><ul><li>Very large community </li></ul><ul><li>Most developers know how to use MySQL </li></ul><ul><li>Lots of domain experts </li></ul><ul><li>Almost all tools support MySQL </li></ul><ul><li>Highly optimal and scalable [if you use it right] </li></ul><ul><li>It is even available on the cloud </li></ul><ul><li>Used by almost all big companies </li></ul><ul><li>When people do not know which data store to choose, they default to MySQL </li></ul>
    11. 11. <ul><li>Role of MySQL </li></ul><ul><li>In </li></ul><ul><li>Data Warehouse </li></ul>
    12. 12. Data Warehousing <ul><li>Data repository with a complete view of the business data </li></ul><ul><ul><li>Active users/customers </li></ul></ul><ul><ul><li>Total sales/orders for a given period </li></ul></ul><ul><ul><li>Growth and retention rate </li></ul></ul><ul><ul><li>Top performing, … </li></ul></ul><ul><li>Driving Factors </li></ul><ul><ul><li>Business Intelligence, BI </li></ul></ul><ul><ul><li>Data Analysis & Mining </li></ul></ul><ul><ul><li>Reporting / Dashboards </li></ul></ul><ul><ul><li>Business decisions </li></ul></ul>
    13. 13. DW Buzzwords <ul><li>Dimension Tables </li></ul><ul><li>Fact Tables </li></ul><ul><li>Aggregate Tables </li></ul><ul><li>ETL </li></ul><ul><li>Staging </li></ul><ul><li>Production </li></ul><ul><li>OLAP </li></ul><ul><li>Data Mart </li></ul><ul><li>MOLAP, ROLAP, HOLAP </li></ul><ul><li>Star schema & Snowflake </li></ul>
    14. 14. Typical Architecture
    15. 15. Data Model – Star Schema
    16. 16. Data Model – Snowflake
    17. 17. DW Data Models <ul><li>Bottom-up </li></ul><ul><ul><li>Current system data, dimensions and log events dictate the model </li></ul></ul><ul><li>Top-down </li></ul><ul><ul><li>Business and reporting needs dictate the model </li></ul></ul><ul><li>Hybrid </li></ul><ul><ul><li>Compromise between bottom-up and top-down to implement the model </li></ul></ul>
    18. 18. Data Source <ul><li>Identify the source of data </li></ul><ul><ul><li>Dimension data </li></ul></ul><ul><ul><ul><li>Typically from the OLTP system </li></ul></ul></ul><ul><ul><li>Fact data </li></ul></ul><ul><ul><ul><li>Weblogs </li></ul></ul></ul><ul><ul><ul><ul><li>Use Hadoop/MP/Pig to transform to CSV files </li></ul></ul></ul></ul><ul><ul><ul><li>CRM, Salesforce, Marketing </li></ul></ul></ul><ul><ul><ul><li>Click, Conversion Tracking </li></ul></ul></ul><ul><ul><ul><li>External feeds, reports, scraping, etc. </li></ul></ul></ul>
    19. 19. Typical MySQL Setup <ul><li>Isolate production and staging </li></ul><ul><li>Production environment </li></ul><ul><ul><li>All OLTP happens here </li></ul></ul><ul><ul><li>Two schemas, OLTP and OLAP </li></ul></ul><ul><li>Staging environment </li></ul><ul><ul><li>Replicate OLTP dim tables to a separate schema </li></ul></ul><ul><ul><li>ETL </li></ul></ul><ul><ul><ul><li>Map surrogate keys between dim and fact tables </li></ul></ul></ul><ul><ul><ul><li>Load data into fact tables </li></ul></ul></ul><ul><ul><ul><li>Generate aggregate tables for frequent access </li></ul></ul></ul><ul><ul><li>Reload the final aggregated tables into production </li></ul></ul>
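The staging ETL steps on this slide (map surrogate keys, load fact rows, generate an aggregate table) can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the speaker's pipeline; the sample rows and all names (`oltp_users`, `raw_events`, `build_dimension`, `load_facts`, `aggregate_daily`) are hypothetical, and a real job would read from and write to MySQL tables.

```python
from collections import defaultdict

# Hypothetical dimension rows replicated from the OLTP schema,
# identified by a natural key (the user's email).
oltp_users = [
    {"email": "a@example.com", "country": "US"},
    {"email": "b@example.com", "country": "IN"},
]

# Hypothetical raw events, e.g. web logs already transformed to CSV-like rows.
raw_events = [
    {"email": "a@example.com", "day": "2011-02-01", "clicks": 3},
    {"email": "b@example.com", "day": "2011-02-01", "clicks": 5},
    {"email": "a@example.com", "day": "2011-02-02", "clicks": 2},
]


def build_dimension(rows):
    """Assign a surrogate integer key to each natural key."""
    dim = {}
    for surrogate_key, row in enumerate(rows, start=1):
        dim[row["email"]] = {"user_key": surrogate_key, **row}
    return dim


def load_facts(events, dim):
    """Replace natural keys with surrogate keys; skip events for unknown users."""
    facts = []
    for event in events:
        user = dim.get(event["email"])
        if user is None:
            continue  # a real ETL job would log or quarantine this row
        facts.append(
            {"user_key": user["user_key"], "day": event["day"], "clicks": event["clicks"]}
        )
    return facts


def aggregate_daily(facts):
    """Build a small aggregate table: total clicks per day."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact["day"]] += fact["clicks"]
    return dict(totals)


dim_users = build_dimension(oltp_users)
fact_clicks = load_facts(raw_events, dim_users)
daily_clicks = aggregate_daily(fact_clicks)
print(daily_clicks)  # {'2011-02-01': 8, '2011-02-02': 2}
```

Only the small `daily_clicks` table would then be reloaded into production, which keeps the reporting queries away from the raw fact data.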
    20. 20. MySQL Configuration <ul><li>MySQL is mainly designed for OLTP workloads </li></ul><ul><li>For OLAP, it is sub-optimal </li></ul><ul><ul><li>Keeping the dataset small ( < 500 GB per server) is highly recommended </li></ul></ul><ul><ul><li>Pre-aggregate source data as much as possible </li></ul></ul><ul><ul><li>Use InnoDB for all staging tables, with upserts via INSERT ... ON DUPLICATE KEY UPDATE </li></ul></ul><ul><ul><li>Use MyISAM for production read-only tables for faster loads and packed keys </li></ul></ul><ul><ul><li>Use partitions for easy purging </li></ul></ul><ul><ul><li>Build OLAP cubes (Pentaho Mondrian or any reporting solution) </li></ul></ul><ul><ul><li>Compression, sequential IO and read-ahead are key </li></ul></ul>
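The pre-aggregation and upsert advice on this slide could look something like the sketch below. The table and column names (`stage_page_hits`, `day`, `page`, `hits`) are hypothetical, the table is assumed to have a unique key on `(day, page)`, and the `%s` placeholders assume a Python MySQL driver that uses that parameter style. Nothing here connects to a server: `apply_upserts` is only an in-memory stand-in for executing the statement once per row.

```python
from collections import Counter

# MySQL upsert of the kind the slide refers to. Hypothetical table and columns;
# assumes a PRIMARY or UNIQUE key on (day, page) in the staging table.
UPSERT_SQL = (
    "INSERT INTO stage_page_hits (day, page, hits) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)"
)


def pre_aggregate(events):
    """Collapse raw (day, page) events into one row per key before loading."""
    counts = Counter(events)
    return [(day, page, hits) for (day, page), hits in sorted(counts.items())]


def apply_upserts(table, rows):
    """In-memory stand-in for running UPSERT_SQL once per pre-aggregated row."""
    for day, page, hits in rows:
        table[(day, page)] = table.get((day, page), 0) + hits
    return table


# Two hypothetical batches of raw events arriving from the log pipeline.
batch_1 = [("2011-02-01", "/home"), ("2011-02-01", "/home"), ("2011-02-01", "/buy")]
batch_2 = [("2011-02-01", "/home"), ("2011-02-02", "/home")]

staging = {}
apply_upserts(staging, pre_aggregate(batch_1))
apply_upserts(staging, pre_aggregate(batch_2))
print(staging[("2011-02-01", "/home")])  # 3
```

Pre-aggregating each batch first means the staging table sees one write per distinct key rather than one per raw event, which is the point of the "pre-aggregate as much as possible" bullet.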
    21. 21. Scale-out
    22. 22. Common Use Cases – MySQL way <ul><li>Small datasets </li></ul><ul><li>Real-time analytics </li></ul><ul><li>Standard reporting </li></ul><ul><li>Historical data </li></ul><ul><li>BI, analytics and OLAP applications </li></ul>
    23. 23. Limitations <ul><li>Understand the hard limits of MySQL </li></ul><ul><li>MySQL is not a scalable warehouse solution </li></ul><ul><ul><li>Fits well for small datasets or for getting started </li></ul></ul><ul><ul><li>Use columnar, compression-enabled engines </li></ul></ul><ul><li>Large datasets – stay away </li></ul><ul><ul><li>Unless you have small data models/marts that can be distributed across multiple nodes </li></ul></ul>
    24. 24. Cache <ul><li>For small datasets </li></ul><ul><ul><li>Page cache </li></ul></ul><ul><ul><li>Query cache </li></ul></ul><ul><ul><li>Buffer pool </li></ul></ul><ul><li>For advanced reporting </li></ul><ul><ul><li>Build OLAP cubes </li></ul></ul><ul><ul><li>Most BI and reporting solutions have in-memory OLAP </li></ul></ul>
    25. 25. Analytic Stores <ul><li>Columnar storage, compression and MPP are the de facto standard </li></ul><ul><li>Other data warehouse solutions ($$$$..) </li></ul><ul><ul><li>Greenplum (+ DCA appliance – part of EMC now) </li></ul></ul><ul><ul><li>Vertica (breakthrough, currently my favorite data store, part of HP since Monday) </li></ul></ul><ul><ul><li>AsterData </li></ul></ul><ul><ul><li>Oracle Exadata </li></ul></ul><ul><ul><li>ParAccel (Bruce Scott, co-founder of Oracle) </li></ul></ul><ul><ul><li>Infobright (MySQL based) </li></ul></ul><ul><ul><li>InfiniDB (open source, Calpont appliance) </li></ul></ul><ul><ul><li>Netezza (appliance – IBM owns it now) </li></ul></ul><ul><ul><li>XtremeData dbX (appliance) </li></ul></ul><ul><ul><li>Teradata, and a few more </li></ul></ul>
    26. 26. Real-time Analytics <ul><li>Warehouse is not real-time </li></ul><ul><ul><li>Hourly or daily depending on business needs </li></ul></ul><ul><li>(Near) Real-time analytics </li></ul><ul><ul><li>Performance and scalability challenges </li></ul></ul><ul><ul><li>Identify common metric(s) for real-time, e.g.: </li></ul></ul><ul><ul><ul><li>Active users currently online, users playing, friends online, etc. </li></ul></ul></ul><ul><ul><li>Implement as counter-based atomic operations </li></ul></ul>
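As a rough illustration of "counter-based atomic operations", the sketch below keeps an "active users online" metric as a counter whose increments and decrements are serialized by a lock. It is an in-process Python example under assumed names (`OnlineCounter`, `simulate_logins`); the slide does not say where the counter should live, and a production system would typically keep it in a shared store.

```python
import threading


class OnlineCounter:
    """Counter whose updates are atomic with respect to other threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def incr(self, delta=1):
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self._value += delta
            return self._value

    def decr(self, delta=1):
        return self.incr(-delta)

    def value(self):
        with self._lock:
            return self._value


def simulate_logins(counter, n):
    """Hypothetical workload: n users come online."""
    for _ in range(n):
        counter.incr()


users_online = OnlineCounter()
threads = [
    threading.Thread(target=simulate_logins, args=(users_online, 1000))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

users_online.decr(10)  # ten users log out
print(users_online.value())  # 3990
```

Reading the metric is then a single cheap lookup, rather than a scan over session or event rows at query time.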
    27. 27. Questions? <ul><li>[email_address] </li></ul><ul><li>Twitter: @vanuganti </li></ul>