Role of MySQL in Data Analytics, Warehousing

  • 1.
    Role of MySQL in Data Analytics, Warehouse. Large Data at Scale. Venu Anuganti, Feb 2011, Percona Live. http://venublog.com/
  • 2.
    Who am I: Data Architect, Database Kernel / Internals Engineer. Part of 5 database kernels (MySQL, 2000 to 2004). Implement and scale SQL, NoSQL, analytics and data warehouse solutions. Large-scale data handling for social networking, SaaS, click tracking, games, recommendation, advertisement, mobile and SEM marketing. Blog: http://venublog.com/
  • 3.
    Agenda: Buzz around SQL and NoSQL; Role of MySQL in the data world; Data warehouse & analytics; Real-time analytics; Large data; How to build a scalable data warehouse; Q & A.
  • 4.
    Buzzzzzzz: Lately everyone is talking about NoSQL. What is happening to (My)SQL? Does that mean the end of (My)SQL? Why does nobody talk about large (My)SQL implementations?
  • 5.
    Data & MySQL Everywhere
  • 6.
    Data Is The Business: Many new business models are data-centric: web scale, social networking, real-time and interactive. Almost all companies talk about their data: millions of users, clients, customers, applications, …; terabytes to petabytes of data day to day. Performance and scalability are key factors.
  • 7.
    Data Drives Business: A business can only grow if it makes proper use of its data: statistics, mining, real-time reporting, analytics, re-targeting, recommendation. Examples of data-driven companies: Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, FourSquare, app stores, mobile/web analytics, …; anything API-driven; almost all newly emerging companies.
  • 8.
    Data Solution Providers: Companies emerging to solve data-centric problems: Cloudera, Percona, cloud and SaaS solutions. Large companies solve problems as part of their business: Google (Bigtable, MySQL patches, App Engine, Megastore …), Yahoo (Hadoop, Pig …), Facebook (Cassandra, MySQL patches, messaging system implementation, etc.), Twitter (FlockDB, …), Zynga (Membase).
  • 9.
    MySQL In Every Company: 90% of the companies that deal with data use MySQL. Mainly OLTP; LAMP, websites, blog services; SaaS, cloud services; analytics, stats; warehousing and … Terabytes, if not petabytes, of data in MySQL. MySQL is the de facto development model for developers and startups.
  • 10.
    MySQL Widely Adopted: Simple, easy to learn and adopt. Widely in use for 10+ years. Very large community. Most developers know how to use MySQL. Lots of domain experts. Almost all tools support MySQL. Highly optimal and scalable [if you use it right]. It is even available on the cloud. Used by almost all big companies. When people do not know which data store to choose, they default to MySQL.
  • 11.
    Role of MySQL In Data Warehouse
  • 12.
    Data Warehousing: A data store repository with a complete view of the business data: active users/customers; total sales/orders for a given period; growth and retention rate; top performing, … Driving factors: business intelligence (BI), data analysis & mining, reporting / dashboards, business decisions.
  • 13.
    DW Buzzwords: Dimension tables; fact tables; aggregate tables; ETL; staging; production; OLAP; data mart; MOLAP, ROLAP, HOLAP; star schema & snowflake.
  • 14.
    Typical Architecture
  • 15.
    Data Model – Star Schema
  • 16.
    Data Model – Snowflake
  • 17.
    DW Data Models: Bottom-up: current system data, dimensions and log events dictate the model. Top-down: business and reporting needs dictate the model. Hybrid: a compromise between bottom-up and top-down.
  • 18.
    Data Source: Identify the source of data. Dimension data: typically from the OLTP system. Fact data: weblogs (use Hadoop/MP/Pig to transform them to CSV files); CRM, SalesForce, marketing; click and conversion tracking; external feeds, reports, scraping, etc.
  • 19.
    Typical MySQL Setup: Isolate production and staging. Production environment: all OLTP happens here; two schemas, OLTP and OLAP. Staging environment: replicate OLTP dim tables to a separate schema. ETL: map surrogate keys between dim and fact tables; load data into fact tables; generate aggregate tables for frequent access; reload the final aggregated tables into production.
  • 20.
    MySQL Configuration: MySQL is mainly designed for OLTP workloads; for OLAP it is sub-optimal. Keeping the dataset small (< 500 GB per server) is highly recommended. Pre-aggregate source data as much as possible. Use InnoDB for all staging tables, with upserts (INSERT ... ON DUPLICATE KEY UPDATE). Use MyISAM for production read-only tables, for faster loads and packed keys. Use partitions for easy purging. Build OLAP cubes (Pentaho Mondrian or any reporting solution). Compression, sequential I/O and read-ahead are the key.
  • 21.
    Scale-out
  • 22.
    Common Use Cases – MySQL Way: Small datasets; real-time analytics; standard reporting; historical data; BI, analytics and OLAP applications.
  • 23.
    Limitations: Understand the hard limits of MySQL. MySQL is not a scalable warehouse solution; it fits well for small datasets or for getting started. Use columnar, compression-enabled engines. Large datasets: stay away, unless you have small data models/marts that can be distributed across multiple nodes.
  • 24.
    Cache: For small datasets: page cache, query cache, buffer pool. For advanced reporting: build OLAP cubes; most BI and reporting solutions have in-memory OLAP.
  • 25.
    Analytic Stores: Columnar, compression and MPP are the de facto standard. Other data warehouse solutions ($$$$..): Greenplum (+ DCA appliance; part of EMC now), Vertica (breakthrough, currently my favorite data store, part of HP since Monday), AsterData, Oracle Exadata, ParAccel (Oracle co-founder Bruce Scott), Infobright (MySQL based), InfiniDB (open source, Calpont appliance), Netezza (appliance; IBM owns it now), XtremeData dbX (appliance), Teradata, and a few more.
  • 26.
    Real-time Analytics: The warehouse is not real-time: it is refreshed hourly or daily depending on business needs. (Near) real-time analytics brings performance and scalability challenges. Identify common metric(s) for real-time, e.g. active users currently online, users playing, friends online, etc. Implement them as counter-based atomic operations.
  • 27.
    Questions? http://venublog.com/ [email_address] Twitter: @vanuganti
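
The ETL steps listed on slide 19 (map surrogate keys between dim and fact tables, load the fact table, generate an aggregate table) can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a MySQL staging schema, and every table and column name is hypothetical rather than taken from the talk.

```python
import sqlite3

# SQLite is a stand-in for the MySQL staging schema; all table and
# column names here are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user (
        user_key INTEGER PRIMARY KEY,  -- surrogate key used by fact tables
        user_id  TEXT UNIQUE,          -- natural key from the OLTP system
        country  TEXT
    );
    CREATE TABLE stg_events (user_id TEXT, event_date TEXT, amount REAL);
    CREATE TABLE fact_sales (user_key INTEGER, event_date TEXT, amount REAL);
    CREATE TABLE agg_sales_daily (
        event_date TEXT, country TEXT, orders INTEGER, revenue REAL
    );
""")

# Dimension rows replicated from OLTP, plus raw events (e.g. parsed from CSV).
conn.executemany("INSERT INTO dim_user (user_id, country) VALUES (?, ?)",
                 [("u1", "US"), ("u2", "IN")])
conn.executemany("INSERT INTO stg_events VALUES (?, ?, ?)",
                 [("u1", "2011-02-01", 10.0),
                  ("u2", "2011-02-01", 5.0),
                  ("u1", "2011-02-01", 2.5)])

# Step 1 and 2: map natural keys to surrogate keys while loading the fact table.
conn.execute("""
    INSERT INTO fact_sales (user_key, event_date, amount)
    SELECT d.user_key, s.event_date, s.amount
    FROM stg_events s
    JOIN dim_user d ON d.user_id = s.user_id
""")

# Step 3: build the aggregate table that reports will read.
conn.execute("""
    INSERT INTO agg_sales_daily (event_date, country, orders, revenue)
    SELECT f.event_date, d.country, COUNT(*), SUM(f.amount)
    FROM fact_sales f
    JOIN dim_user d ON d.user_key = f.user_key
    GROUP BY f.event_date, d.country
""")

daily_rows = conn.execute(
    "SELECT event_date, country, orders, revenue "
    "FROM agg_sales_daily ORDER BY country"
).fetchall()
print(daily_rows)
```

In the setup the slide describes, only the small aggregate table would then be reloaded into production, which keeps reporting queries away from the raw fact data.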

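Slide 20 recommends pre-aggregating source data and loading staging tables with an upsert. A minimal sketch of that pattern follows, again using sqlite3 as a stand-in: SQLite spells the upsert as INSERT ... ON CONFLICT ... DO UPDATE (available in SQLite 3.24 and later), which here plays the role of MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The table `stg_page_hits` and its columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging table: one row per (day, page), holding a running count.
conn.execute("""
    CREATE TABLE stg_page_hits (
        hit_date TEXT,
        page     TEXT,
        hits     INTEGER NOT NULL,
        PRIMARY KEY (hit_date, page)
    )
""")

# SQLite upsert (requires SQLite >= 3.24). In MySQL the same idea would be:
#   INSERT INTO stg_page_hits (hit_date, page, hits) VALUES (...)
#   ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);
UPSERT = """
    INSERT INTO stg_page_hits (hit_date, page, hits) VALUES (?, ?, ?)
    ON CONFLICT (hit_date, page) DO UPDATE SET hits = hits + excluded.hits
"""

# Partially aggregated batches arriving from the log pipeline.
batch = [
    ("2011-02-01", "/home", 3),
    ("2011-02-01", "/home", 4),   # same key: added to the existing row
    ("2011-02-01", "/about", 1),
]
conn.executemany(UPSERT, batch)

hit_rows = conn.execute(
    "SELECT page, hits FROM stg_page_hits ORDER BY page"
).fetchall()
print(hit_rows)
```

Because repeated keys fold into one row, the staging table stays at the size of the aggregate rather than the size of the raw log.
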
Editor's Notes

  • #3 MySQL employee, 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives such as Sherpa and Mobstor and a platform. MySQL geek; still contributes occasionally to the MySQL source.
  • #4 It's going to be a pretty short talk, about 20 minutes, leaving the rest for Q & A
  • #5 Well, almost all companies have a MySQL deployment at one time or another…
  • #10 If Facebook, Twitter or someone else uses NoSQL, it does not mean everyone has to use it. If someone scales using MySQL, it does not mean everyone can use the same concept.
  • #14 OLAP: On-Line Analytical Processing. MOLAP: multidimensional cubes (not applicable in a relational schema). ROLAP: relational OLAP (aggregation, MySQL ROLLUP). HOLAP: combination of both.
  • #26 DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.
  • #27 DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.
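
The real-time slide suggests implementing common metrics as counter-based atomic operations. One way to read that, sketched here under assumptions, is to keep one row per metric and change it with a single `UPDATE ... SET value = value + ?` statement, so the application never reads a value and writes it back. sqlite3 is used as a stand-in for MySQL, and the `rt_counters` table and helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical counter table: one row per real-time metric.
conn.execute(
    "CREATE TABLE rt_counters (metric TEXT PRIMARY KEY, value INTEGER NOT NULL)"
)
conn.execute("INSERT INTO rt_counters VALUES ('users_online', 0)")
conn.commit()


def bump(metric, delta):
    """Apply a delta in one statement; the increment happens inside the database."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE rt_counters SET value = value + ? WHERE metric = ?",
            (delta, metric),
        )


def current(metric):
    """Read the current counter value for a dashboard or API call."""
    row = conn.execute(
        "SELECT value FROM rt_counters WHERE metric = ?", (metric,)
    ).fetchone()
    return row[0]


for _ in range(3):
    bump("users_online", 1)    # three users log in
bump("users_online", -1)       # one user logs out

online = current("users_online")
print(online)
```

Reading the metric is then a single-row lookup, which is what makes it cheap enough to serve in (near) real time while the warehouse itself is refreshed hourly or daily.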