Designing Scalable Data Warehouse Using MySQL



O'Reilly MySQL Conference 2011 - Designing MySQL as a data warehouse solution to handle terabytes of data, comprising OLTP, ETL, OLAP and reporting








Usage Rights

© All Rights Reserved

  • MySQL employee 2000-2004. Database companies: MySQL, SOLID, ANTs Data Server, ScaleDB. Part of Yahoo’s cloud initiatives such as Sherpa and Mobstor, and a platform. MySQL geek; still contribute to the MySQL source from time to time.
  • Ask how many people use MySQL and how many do not.
  • OLAP: On-Line Analytical Processing. MOLAP: multidimensional cubes (not applicable in a relational schema). ROLAP: relational OLAP (aggregation, MySQL ROLLUP). HOLAP: a combination of both.
  • DCA: Data Computing Appliance. Talk about analytics and how crucial they are now.

Designing Scalable Data Warehouse Using MySQL: Presentation Transcript

  • Designing Scalable Data Warehouse with MySQL. Venu Anuganti, Apr 2011, O'Reilly MySQL Conference. http://venublog.com/
  • Who am I
    • Data Architect, Database Kernel / Internals Engineer
    • Part of 5 database kernels (MySQL - 2000 to 2004)
    • Design, Implement and Scale SQL, NoSQL, Analytics and Data Warehouse solutions
    • Large scale data handling for Social Networking, SaaS, Click Tracking, Games, Recommendation, Advertisement, Mobile and SEM marketing
    • Blog: http://venublog.com/
  • Agenda
    • Role of MySQL In
      • Data World
      • Data Warehouse & Analytics
      • Real-time Analytics
      • Large Data
    • How To Build Scalable Data Warehouse
    • Q & A
    • Data & MySQL Everywhere
  • Data Is The Business
    • Many new business models are data-centric
    • Web scale, social networking, real-time and interactive
      • almost all companies talk about their data
      • millions of users, clients, customers, applications, …
      • terabytes to petabytes of data handled day to day
      • performance and scalability are key factors
  • Data Drives Business
    • Businesses can only grow if they make proper use of their data
      • statistics, mining, real-time
      • reporting, analytics
      • re-targeting
      • recommendation
    • Examples of data driven companies
      • Facebook, Twitter, LinkedIn, Zynga, Groupon, Quora, FourSquare, AppStores, mobile/web analytics, …
      • Any API-driven company
      • Almost all newly emerging companies
  • Data Solution Providers
    • Companies emerging to solve data centric problems
      • Cloudera
      • Percona
      • Cloud and SaaS solutions
    • Large companies solve problems as part of their business
      • Google (Bigtable, App Engine, Megastore …)
      • Amazon (Dynamo, EC2, …)
      • Yahoo (Hadoop, Pig …)
      • Facebook (Cassandra, Messaging System Implementation, …)
      • Twitter (FlockDB, …)
      • Zynga (Membase)
  • MySQL In Every Company
    • 90% of the companies that deal with data use MySQL in some form
      • Mainly OLTP
      • LAMP, Website, Blog services
      • SaaS, Cloud services
      • Analytics, Stats, Reporting tools
      • Warehousing and …
    • Terabytes, if not petabytes, of data in MySQL
    • MySQL is the de facto development model for all developers and startups, until a saturation point
    • Role of MySQL in Data Warehouse
  • What Is Data Warehousing
    • Data store repository with complete view of the business data
      • Aggregated data from multiple sources
      • Active users/customers
      • Total sales/orders for a given period
      • Growth and retention rate
      • Top/Poorly performing apps, pages, games, ads, …
      • Spend vs performance
    • Driving Factors
      • Business Intelligence, BI
      • Data Analysis & Mining
      • Reporting / Dashboards
      • Business Decisions
  • Data Warehouse Buzzwords
    • Common buzzwords
      • Dimension Tables
      • Fact Tables
      • Aggregate Tables
      • ETL
      • Staging
      • Production
      • OLAP
      • Data Mart
      • Data Source
      • Star schema & Snowflake
  • Typical Architecture
  • Data Model – Star Schema
  • Data Model – Snowflake
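The star schema slide is an image in the original deck, so the following is only a minimal sketch of the idea: one fact table holding measures and surrogate keys, joined to small dimension tables. SQLite (via Python's standard `sqlite3` module) stands in for MySQL so the example runs without a server, and every table and column name here is hypothetical rather than taken from the talk.

```python
import sqlite3

# SQLite stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_app  (app_key  INTEGER PRIMARY KEY, app_name TEXT, category TEXT);
-- The fact table holds the measures plus one surrogate key per dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    app_key  INTEGER REFERENCES dim_app(app_key),
    orders   INTEGER,
    revenue  REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, "2011-04-01", "2011-04"), (2, "2011-04-02", "2011-04")])
conn.executemany("INSERT INTO dim_app VALUES (?, ?, ?)",
                 [(10, "farm", "games"), (11, "chess", "games"), (12, "news", "media")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 10, 3, 30.0), (1, 12, 1, 5.0), (2, 10, 2, 20.0), (2, 11, 4, 8.0)])

def revenue_by_category(conn):
    """Join the fact table to one dimension and aggregate by a dimension attribute."""
    return conn.execute("""
        SELECT a.category, SUM(f.orders), SUM(f.revenue)
        FROM fact_sales f JOIN dim_app a ON a.app_key = f.app_key
        GROUP BY a.category
        ORDER BY a.category
    """).fetchall()

result = revenue_by_category(conn)
print(result)  # [('games', 9, 58.0), ('media', 1, 5.0)]
```

A snowflake variant would, under the same assumptions, split `category` out of `dim_app` into its own table and add one more join.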
  • Data Models
    • Bottom-up
      • Current system data, dimensions and log events dictate the model
    • Top-down
      • Business and reporting needs dictate the model
    • Hybrid
      • Compromise between bottom-up and top-down, then implement the model
  • Data Source
    • Identify the source of data
      • Dimension data
        • Typically from OLTP System
      • Fact data
        • Weblogs
          • Use Hadoop/MapReduce/Pig to transform to CSV files
        • CRM, SalesForce, Marketing
        • Click, Conversion Tracking
        • External feeds, reports, scraping, etc.
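As a rough illustration of the "weblogs to CSV" step, here is a single-process Python sketch of the kind of transform a Hadoop or Pig job would apply at scale. The slides do not say what the log format is, so the Common Log Format input, the column choice, and the function name `logs_to_csv` are all assumptions.

```python
import csv
import io
import re

# Assumed input: Common Log Format lines. The real log layout is not given in the slides.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def logs_to_csv(lines):
    """Convert parsable log lines to CSV text, skipping malformed lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ip", "timestamp", "method", "path", "status", "bytes"])
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # a real pipeline would count or log the rejects
        size = 0 if m["size"] == "-" else int(m["size"])
        writer.writerow([m["ip"], m["ts"], m["method"], m["path"], int(m["status"]), size])
    return out.getvalue()

sample = [
    '10.0.0.1 - - [12/Apr/2011:10:00:00 -0700] "GET /play?id=7 HTTP/1.1" 200 512',
    'garbage line',
    '10.0.0.2 - - [12/Apr/2011:10:00:01 -0700] "POST /buy HTTP/1.1" 302 -',
]
csv_text = logs_to_csv(sample)
print(csv_text)
```

The resulting CSV is the sort of file that could then be bulk-loaded into a staging fact table.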
  • Typical MySQL Setup
    • Define star/snowflake/comb schema
    • Isolate production and staging
    • Production environment
      • All OLTP happens here
      • Two schemas, OLTP and OLAP
    • Staging environment
      • Replicate OLTP tables
      • ETL - OLAP
        • Map surrogate key between dimension (OLTP) and fact tables
        • Load data to fact tables
        • Generate aggregate tables for frequent access
      • Reload the final aggregated tables into production
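The staging flow above (map surrogate keys, load facts, build aggregates, reload into production) can be sketched roughly as follows. Two in-memory SQLite databases stand in for the staging and production MySQL servers, and the table names, the `run_etl` function, and the choice to create missing dimension rows on the fly are assumptions made for the example; in the setup described, the dimension rows would arrive through replication of the OLTP tables.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the staging and production MySQL servers.
staging = sqlite3.connect(":memory:")
production = sqlite3.connect(":memory:")

staging.executescript("""
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY AUTOINCREMENT, user_id TEXT UNIQUE);
CREATE TABLE fact_click (user_key INTEGER, day TEXT, clicks INTEGER);
CREATE TABLE agg_daily_clicks (day TEXT PRIMARY KEY, users INTEGER, clicks INTEGER);
""")
production.execute(
    "CREATE TABLE agg_daily_clicks (day TEXT PRIMARY KEY, users INTEGER, clicks INTEGER)"
)

def run_etl(raw_events):
    """raw_events: iterable of (user_id, day, clicks) tuples from the source extract."""
    for user_id, day, clicks in raw_events:
        # 1. Map the natural key (user_id) to a surrogate key, creating it if new
        #    (an assumption; replicated OLTP tables would normally supply these rows).
        staging.execute("INSERT OR IGNORE INTO dim_user (user_id) VALUES (?)", (user_id,))
        (user_key,) = staging.execute(
            "SELECT user_key FROM dim_user WHERE user_id = ?", (user_id,)
        ).fetchone()
        # 2. Load the fact row keyed by the surrogate key.
        staging.execute("INSERT INTO fact_click VALUES (?, ?, ?)", (user_key, day, clicks))
    # 3. Rebuild the aggregate table that reports read most often.
    staging.execute("DELETE FROM agg_daily_clicks")
    staging.execute("""
        INSERT INTO agg_daily_clicks
        SELECT day, COUNT(DISTINCT user_key), SUM(clicks) FROM fact_click GROUP BY day
    """)
    staging.commit()
    # 4. Reload the finished aggregates into production.
    rows = staging.execute("SELECT day, users, clicks FROM agg_daily_clicks").fetchall()
    production.execute("DELETE FROM agg_daily_clicks")
    production.executemany("INSERT INTO agg_daily_clicks VALUES (?, ?, ?)", rows)
    production.commit()

run_etl([("alice", "2011-04-12", 3), ("bob", "2011-04-12", 1), ("alice", "2011-04-13", 2)])
report = production.execute("SELECT * FROM agg_daily_clicks ORDER BY day").fetchall()
print(report)  # [('2011-04-12', 2, 4), ('2011-04-13', 1, 2)]
```

Production only ever sees the small, finished aggregate table, which keeps the heavy joins and scans on the staging side.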
  • MySQL Configuration
    • MySQL is mainly designed for OLTP workloads
    • For OLAP, it is sub-optimal
      • Highly recommended to keep the dataset small (< 500 GB per server)
      • Pre-aggregate source data as much as possible
      • Use InnoDB for all staging tables, with upserts via INSERT ... ON DUPLICATE KEY UPDATE (or use Hive)
      • Use MyISAM for production read-only tables, for faster loads and packed keys
      • Use partitions for easy purging
      • Build OLAP cubes (Pentaho Mondrian or any reporting solution)
      • Compression, sequential I/O and read-ahead are key
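To make the staging-table upsert concrete, here is a small sketch. The MySQL statement is shown as a string for reference only; the code actually executes the closest SQLite equivalent (`ON CONFLICT ... DO UPDATE`, which needs SQLite 3.24 or later) so that it runs without a MySQL server. The table `stg_daily_clicks` and its columns are hypothetical.

```python
import sqlite3

# MySQL form of the statement (shown for reference only, not executed here).
MYSQL_UPSERT = (
    "INSERT INTO stg_daily_clicks (day, app_id, clicks) VALUES (%s, %s, %s) "
    "ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)"
)

# SQLite equivalent (needs SQLite 3.24+), used so the sketch runs without a server.
SQLITE_UPSERT = (
    "INSERT INTO stg_daily_clicks (day, app_id, clicks) VALUES (?, ?, ?) "
    "ON CONFLICT(day, app_id) DO UPDATE SET clicks = clicks + excluded.clicks"
)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_daily_clicks (
        day TEXT, app_id INTEGER, clicks INTEGER,
        PRIMARY KEY (day, app_id)
    )
""")

def load_batch(conn, rows):
    """Insert new (day, app_id) rows; add to the running total for existing ones."""
    conn.executemany(SQLITE_UPSERT, rows)
    conn.commit()

load_batch(conn, [("2011-04-12", 1, 10), ("2011-04-12", 2, 5)])
load_batch(conn, [("2011-04-12", 1, 7)])  # same key: adds to the existing row
totals = conn.execute("SELECT * FROM stg_daily_clicks ORDER BY app_id").fetchall()
print(totals)  # [('2011-04-12', 1, 17), ('2011-04-12', 2, 5)]
```

Because the key is the aggregation grain (day plus app), batches can arrive as partial counts and still pre-aggregate in place.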
  • Scale-out
  • Common Use Cases – MySQL way
    • Small datasets
    • Real-time analytics – count, histogram etc
    • Standard reporting
    • Historical data
    • BI, analytics and OLAP applications
  • Limitations
    • Understand the hard limits of MySQL
    • MySQL is not a scalable warehouse solution
      • Fits well for small datasets or to get started initially
      • Use columnar + compression enabled engines
      • Lacks analytic functions support
    • Large datasets – Stay away
      • Unless you have small data model/marts that can be distributed across multiple nodes
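One possible way to distribute small data marts across nodes is to route each mart by a stable hash of its key. This is only a sketch of that idea, not the scheme used in the talk (the scale-out slide is an image in the original); the node names and the `customer:<n>` key format are made up for the example.

```python
import hashlib

# Hypothetical node names; in practice each would be a separate MySQL server.
NODES = ["dw-node-0", "dw-node-1", "dw-node-2"]

def node_for(mart_key, nodes=NODES):
    """Pick a node from a stable hash of the key so every process routes identically."""
    digest = hashlib.sha256(mart_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def group_by_node(mart_keys, nodes=NODES):
    """Return {node: [keys placed on that node]} for a list of mart keys."""
    placement = {n: [] for n in nodes}
    for key in mart_keys:
        placement[node_for(key, nodes)].append(key)
    return placement

placement = group_by_node(["customer:%d" % i for i in range(12)])
for node, keys in placement.items():
    print(node, keys)
```

Plain modulo hashing reshuffles most keys when the node count changes, so a lookup table or consistent hashing may be a better fit if nodes are added often.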
  • Cache
    • For small data-sets
      • Page cache
      • Query cache
      • Buffer pool
    • For advanced reporting
      • Build OLAP cubes
      • Most BI and reporting solutions have in-memory OLAP
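To give a feel for what an in-memory OLAP cube holds, the sketch below pre-aggregates one measure for every combination of dimensions, including the grand total. It is a toy illustration in plain Python, not how Mondrian or any particular BI tool is implemented, and the dimension and measure names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dims, measure):
    """Pre-aggregate `measure` for every subset of `dims` (including the grand total)."""
    cube = defaultdict(float)
    for row in rows:
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                # Key is a tuple of (dimension, value) pairs, in the order given by `dims`.
                key = tuple((d, row[d]) for d in subset)
                cube[key] += row[measure]
    return dict(cube)

rows = [
    {"country": "US", "app": "farm", "revenue": 30.0},
    {"country": "US", "app": "chess", "revenue": 8.0},
    {"country": "IN", "app": "farm", "revenue": 20.0},
]
cube = build_cube(rows, dims=("country", "app"), measure="revenue")

print(cube[()])                      # 58.0  (grand total)
print(cube[(("country", "US"),)])    # 38.0
print(cube[(("app", "farm"),)])      # 50.0
```

Each report query then becomes a dictionary lookup, at the cost of memory that grows with the number of dimension combinations.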
  • Other Analytic Stores
    • Columnar storage, compression and MPP are the de facto standard
    • Other data warehouse solutions ($$$$..)
      • GreenPlum (+ DCA appliance, part of EMC now)
      • Vertica (breakthrough, currently my favorite data store, using it for more than a year now, part of HP now)
      • AsterData (now part of Teradata)
      • Oracle Exadata
      • ParAccel (Bruce Scott, co-founder of Oracle)
      • InfoBright (MySQL based)
      • InfiniDB (open source, Calpont appliance)
      • Netezza (appliance, IBM owns it now)
      • XtremeData dbX (appliance)
      • Teradata, and a few more
  • Real-time Analytics
    • Warehouse is not real-time
      • Hourly or daily depending on business needs
    • (Near) Real-time analytics
      • Performance and scalability challenges
      • Identify common metric(s) for real-time, e.g.:
        • Active users currently online, users playing, friends online, etc.
      • Implement as counter-based atomic operations
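A counter-based metric can be sketched as below: each event is a single atomic increment instead of a row to be aggregated later. This in-process version uses a lock to make the increment atomic; the class name `MetricCounters` and the metric name `users_online` are hypothetical. On a MySQL backend the analogous operation would be a single `UPDATE ... SET value = value + 1` statement per event.

```python
import threading

class MetricCounters:
    """Named counters whose increments are atomic with respect to other threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incr(self, name, delta=1):
        # Read-modify-write happens under the lock, so no update is lost.
        with self._lock:
            self._values[name] = self._values.get(name, 0) + delta
            return self._values[name]

    def get(self, name):
        with self._lock:
            return self._values.get(name, 0)

counters = MetricCounters()

def simulate_logins(n):
    for _ in range(n):
        counters.incr("users_online")

threads = [threading.Thread(target=simulate_logins, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counters.get("users_online"))  # 8000
```

A logout would call `counters.incr("users_online", -1)`, so the current value can be read at any time without scanning event data.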
  • Questions ?
    • http://venublog.com/
    • [email_address]
    • Twitter: @vanuganti