VividCortex: Building a Time-Series Database in MySQL

MySQL is a flexible database that can support a large-scale, high-velocity time-series database in the AWS cloud. This presentation addresses the unique time-series data requirements at VividCortex. It shows how we built a solution, why we needed more than just MySQL, the good and bad aspects of the architecture, and thoughts for the future of our time-series database. The slides will leave you with a greater understanding of MySQL's capabilities related to time-series data.

Published in: Software
  • I'd love to see how you actually model your MySQL tables. Neither the timestamp on its own nor the metric name on its own is a unique key. And the metric name is a very wide choice for a clustered index, even if you added an autoincrement column to make the index unique.

  1. 1. TIME-SERIES DATA IN MYSQL VIVIDCORTEX WEBINAR DECEMBER 9, 2014
  2. 2. GET THE RECORDING! THIS WEBINAR WAS RECORDED LIVE ON DECEMBER 9, 2014. THE RECORDING IS AVAILABLE FREE AT VIVIDCORTEX.COM/WEBINARS/BUILDING-TIME-SERIES-DATABASE-IN-MYSQL/
  3. 3. WEBINAR LOGISTICS RECORDING & SLIDES WILL BE AVAILABLE AFTERWARDS TWEET YOUR QUESTIONS/COMMENTS TO #VIVIDCORTEX FOLLOW ME AT @XAPRB FOLLOW VIVIDCORTEX AT @VIVIDCORTEX ENJOY :-)
  4. 4. ABOUT VIVIDCORTEX VIVIDCORTEX IS THE BEST WAY TO SEE WHAT YOUR PRODUCTION MYSQL SERVERS ARE DOING CAPTURES THOUSANDS OF METRICS IN ONE-SECOND RESOLUTION FROM YOUR PRODUCTION SYSTEMS NO MORE SLOW-QUERY-LOG ANALYSIS AND PAINFUL MANUAL CONFIGURATION — GET INSIGHT IN SECONDS, NOT HOURS AWESOME USER INTERFACE FREE TRIAL, NO-RISK: VIVIDCORTEX.COM/
  5. 5. WHAT IS TIME-SERIES DATA? ANY MEASUREMENTS TAKEN AT A SPECIFIC POINT IN TIME STOCK TICKERS, WEATHER DATA, TWEETS (?) FOR TODAY'S PURPOSES, LOTS AND LOTS OF: A MEASUREMENT (VALUE) OF A SPECIFIC METRIC OF INTEREST FROM A PARTICULAR HOST/SOURCE AT A SPECIFIC MOMENT IN TIME
  6. 6. POPULAR TIME-SERIES DATABASES RRDTOOL GRAPHITE (WHISPER) HBASE, CASSANDRA, OPENTSDB, ETC INFLUXDB HOMEGROWN
  7. 7. VIVIDCORTEX’S TIME-SERIES DATA METRICS: {HOST, METRIC, TIMESTAMP, VALUE} E.G. {83, “OS.CPU.UTILIZATION”, 1418143666, 18.2%} QUERY METRICS DITTO, BUT THE METRIC NAME IS RELATED TO THE QUERY FAMILY E.G. “HOST.QUERIES.C.1374C6821EAD6F47.TPUT” METRICS PER-USER, PER-PROCESS, PER-DATABASE, ETC QUERY SAMPLES, EVENTS, FAULTS, SYSTEM VARIABLE CHANGES, ETC NOT THE SUBJECT OF THIS WEBINAR
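
The {host, metric, timestamp, value} tuple on slide 7 can be written down as a small record type. This is a minimal sketch, not VividCortex's schema: the class name `MetricPoint` and its field names are assumptions made for illustration, and the slide's 18.2% is stored as the plain float 18.2.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricPoint:
    """One observation: {host, metric, timestamp, value} (hypothetical type)."""
    host: int        # numeric host/source id
    metric: str      # dotted metric name
    timestamp: int   # Unix epoch seconds (1-second resolution)
    value: float     # the measurement itself


# The example tuple from the slide.
point = MetricPoint(host=83, metric="os.cpu.utilization",
                    timestamp=1418143666, value=18.2)

# Decode the epoch timestamp into a UTC datetime.
when = datetime.fromtimestamp(point.timestamp, tz=timezone.utc)
```

The example timestamp decodes to December 9, 2014, the day of the webinar.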
  8. 8. DENSE AND SPARSE METRICS DENSE METRICS ALWAYS EXIST AT EVERY POINT IN TIME EXAMPLE: SYSTEM FREE MEMORY EXAMPLE: CPU UTILIZATION SPARSE METRICS MAY ONLY OCCUR OCCASIONALLY EXAMPLE: METRICS RELATED TO A SPECIFIC QUERY
  9. 9. WHAT’S UNUSUAL AT VIVIDCORTEX HIGH RESOLUTION: EVERYTHING IN 1-SECOND GRANULARITY LARGE NUMBER OF METRICS (CARDINALITY, AND RATE) MANY METRICS ARE HIGHLY SPARSE
  10. 10. QUESTIONS WE ASK RETRIEVE METRIC A FROM TIMESTAMP B TO C AT RESOLUTION D RANK ALL METRICS MATCHING PATTERN X FROM B TO C, LIMIT N
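
The two question shapes on slide 10 can be sketched against a toy in-memory store. Everything here is an assumption for illustration: the `SERIES` data, the function names, averaging as the way to reduce resolution, glob patterns for matching, and ranking by the sum of values. The slides do not say which aggregate or pattern syntax is actually used.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical in-memory store: metric name -> list of (timestamp, value).
SERIES = {
    "os.cpu.user": [(100, 10.0), (101, 20.0), (102, 30.0), (103, 50.0)],
    "os.cpu.system": [(100, 1.0), (101, 2.0), (103, 3.0)],
    "os.mem.free": [(100, 512.0), (102, 256.0)],
}


def retrieve(metric, start, end, resolution):
    """Return [(bucket_start, mean)] for start <= ts < end, one row per bucket."""
    buckets = defaultdict(list)
    for ts, value in SERIES.get(metric, []):
        if start <= ts < end:
            bucket = start + ((ts - start) // resolution) * resolution
            buckets[bucket].append(value)
    return [(b, sum(v) / len(v)) for b, v in sorted(buckets.items())]


def rank(pattern, start, end, limit):
    """Return the top `limit` metrics matching a glob pattern, ranked by sum."""
    totals = {}
    for metric, points in SERIES.items():
        if fnmatchcase(metric, pattern):
            totals[metric] = sum(v for ts, v in points if start <= ts < end)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

For example, `retrieve("os.cpu.user", 100, 104, 2)` averages the four 1-second points into two 2-second buckets, and `rank("os.cpu.*", 100, 104, 1)` returns only the highest-total CPU metric.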
  11. 11. SCHEMA DESIGN + INDEXING MULTI-TENANT, SHARDED ARCHITECTURE EACH CUSTOMER’S DATA STORED IN A SEPARATE DATABASE STRONG ENCRYPTION IN-FLIGHT AND AT-REST (SEE BLOG POST) DATA IS PARTITIONED BY DAY USING MYSQL PARTITIONING WE USE INNODB STORAGE ENGINE (TRANSACTIONAL, CRASH AND CORRUPTION RESISTANT, CLUSTERED INDEXES)
  12. 12. SCHEMA DESIGN + INDEXING METRIC-FIRST OR TIMESTAMP-FIRST, THAT IS THE QUESTION. FOR THIS PURPOSE, A HOST/SOURCE IS ESSENTIALLY A METRIC PREFIX.
  13. 13. METRIC-FIRST ADVANTAGES: OPTIMIZED FOR FAST READS OF DENSE METRICS DRAWBACKS: ENUMERATING / READING LARGE CATEGORIES OF METRICS
  14. 14. TIMESTAMP-FIRST ADVANTAGES: OPTIMIZED FOR WRITING METRICS OPTIMIZED FOR READING ALL METRICS FOR A TIME RANGE DRAWBACKS: PENALIZES READING A DENSE METRIC FOR A TIME RANGE NOT OPTIMAL FOR STREAMING BY METRIC BY TIMESTAMP
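
The trade-off between the two key orders on slides 13 and 14 can be seen by sorting the same rows both ways and measuring how far apart the wanted rows end up. This is an illustrative sketch with made-up metric names and timestamps. `span` is a rough stand-in for how much of a clustered index a range scan would have to walk, not a model of InnoDB itself.

```python
# Hypothetical rows: 3 metrics sampled at 4 consecutive seconds.
METRICS = ("cpu", "disk", "mem")
rows = [(m, t) for m in METRICS for t in range(100, 104)]

metric_first = sorted(rows)                         # key order (metric, ts)
timestamp_first = sorted((t, m) for m, t in rows)   # key order (ts, metric)


def span(sorted_keys, wanted):
    """Index entries between the first and last wanted key, inclusive."""
    positions = [i for i, k in enumerate(sorted_keys) if k in wanted]
    return positions[-1] - positions[0] + 1


# Query 1: one metric ("cpu") over the whole time range.
q1_mf = {("cpu", t) for t in range(100, 104)}
q1_tf = {(t, "cpu") for t in range(100, 104)}

# Query 2: every metric at a single second.
q2_mf = {(m, 101) for m in METRICS}
q2_tf = {(101, m) for m in METRICS}

one_metric = (span(metric_first, q1_mf), span(timestamp_first, q1_tf))
all_metrics = (span(metric_first, q2_mf), span(timestamp_first, q2_tf))
```

With metric-first ordering the single-metric query touches 4 adjacent entries but the all-metrics query spans 9. With timestamp-first ordering the numbers flip to 10 and 3.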
  15. 15. SECONDARY INDEXING? BENEFITS: OPTIMIZED FOR BOTH USE CASES, THEORETICALLY HOWEVER, NO SIGNIFICANT DIFFERENCE IN OUR TESTS DRAWBACKS: WRITE AMPLIFICATION, SPACE AMPLIFICATION STILL DOESN’T COVER ALL NEEDED SCENARIOS (WE’D NEED AT LEAST SIX INDEXES) CREATES RANDOM ACCESS LOOKUPS IN THE PRIMARY KEY HMMMM…. TOKUDB? SOME OPERATIONAL CHALLENGES.
  16. 16. PARTITIONING ADVANTAGES: COARSE-GRAINED TIMESTAMP-FIRST INDEXING EASY PURGE OF OLD DATA TRANSPARENT TO THE APPLICATION DRAWBACKS: PARTITION MAINTENANCE CAN BE A DRAG OPERATIONAL HASSLES FOR ALTER TABLE AND SO FORTH IMPROVEMENTS IN MYSQL 5.6 ARE VERY HELPFUL THOUGH
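
Daily partitioning makes purging old data a metadata operation rather than a large DELETE. The helper below is a sketch under stated assumptions: the `pYYYYMMDD` naming scheme, the table name `metrics`, and both function names are hypothetical, and the code only builds statement strings rather than talking to a server. `ALTER TABLE ... DROP PARTITION` is standard MySQL syntax.

```python
from datetime import datetime, timedelta, timezone


def partition_name(ts):
    """Map a Unix timestamp to a daily partition name such as 'p20141209'."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return day.strftime("p%Y%m%d")


def purge_statements(existing, now_ts, keep_days, table="metrics"):
    """Build DROP PARTITION statements for partitions older than keep_days."""
    now = datetime.fromtimestamp(now_ts, tz=timezone.utc)
    cutoff_name = (now - timedelta(days=keep_days)).strftime("p%Y%m%d")
    # Names sort chronologically because the date part is zero-padded.
    return [f"ALTER TABLE {table} DROP PARTITION {name}"
            for name in sorted(existing) if name < cutoff_name]
```

A scheduled job could call `purge_statements` with the current partition list and a retention window, then run the resulting statements.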
  17. 17. CHALLENGE #1: HIGH INGEST RATE LARGE NUMBER OF METRICS/SEC ARRIVING AT OUR SYSTEMS CURRENTLY ABOUT 100K METRICS/SEC PER SHARD WRITE WORKLOAD, SPACE USAGE
  18. 18. CHALLENGE #1: HIGH INGEST RATE SOLUTION: BATCH METRICS INTO VECTORS DRAWBACK: LOSE ABILITY TO QUERY WITH SQL COMPROMISE: AGGREGATE METADATA PER VECTOR SOLUTION: STORE METRIC IDS, NOT NAMES, WITH VECTORS DRAWBACK: MUST “JOIN” TO METRIC DICTIONARY FOR PATTERN-MATCHING ETC
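
Slide 18's two ideas, batching points into vectors with per-vector summary metadata and storing metric IDs instead of names, can be sketched as follows. The row layout, the float64 encoding, the choice of min/max/sum/count as metadata, and all names here are assumptions for illustration. The slides do not describe the actual storage format. The sketch also assumes a dense vector with one value per second and no gaps.

```python
import struct

# Hypothetical metric dictionary: name -> compact integer id.
METRIC_IDS = {}


def metric_id(name):
    """Return a stable integer id for a metric name, assigning one if new."""
    return METRIC_IDS.setdefault(name, len(METRIC_IDS) + 1)


def pack_vector(name, start_ts, values):
    """Batch consecutive 1-second values into one row with summary metadata."""
    blob = struct.pack(f"<{len(values)}d", *values)  # little-endian float64s
    return {
        "metric_id": metric_id(name),
        "start_ts": start_ts,
        "count": len(values),
        "min": min(values),   # summary columns stay queryable with SQL
        "max": max(values),
        "sum": sum(values),
        "blob": blob,         # opaque to SQL; decoded by the application
    }


def unpack_vector(row):
    """Recover the (timestamp, value) pairs from a packed row."""
    values = struct.unpack(f"<{row['count']}d", row["blob"])
    return [(row["start_ts"] + i, v) for i, v in enumerate(values)]
```

One row now replaces many, which cuts per-row overhead on writes. The cost is the one the slide names: individual points are no longer visible to SQL, only the summary columns are.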
  19. 19. CHALLENGE #2: SPARSE METRICS HUGE CARDINALITY OF METRICS X HOSTS CAN BE TENS OF MILLIONS OF METRICS PER HOST MOST OF THEM INACTIVE DURING ANY GIVEN TIME RANGE QUERYING FOR ALL IS INEFFICIENT; MUST FILTER OUT INACTIVE NEED: TIMESTAMP-BASED INDEX OF “METRIC HAS DATA” INEFFICIENT IN MYSQL, WORKS WELL IN REDIS
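
The timestamp-based "metric has data" index from slide 19 can be sketched as one set of metric ids per time bucket. This is a pure-Python stand-in, not the Redis implementation: plain `set` objects play the role that Redis sets would (populated with `SADD` and combined with `SUNION`), and the one-hour bucket width and function names are assumptions, since the slides give no details.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed bucket width; the slides do not specify one

# bucket start -> set of metric ids with at least one point in that bucket.
active = defaultdict(set)


def mark_active(metric_id, ts):
    """Record that metric_id has data in the bucket containing ts."""
    active[ts - ts % BUCKET_SECONDS].add(metric_id)


def active_metrics(start, end):
    """Metric ids with data in any bucket overlapping [start, end)."""
    result = set()
    bucket = start - start % BUCKET_SECONDS
    while bucket < end:
        result |= active.get(bucket, set())
        bucket += BUCKET_SECONDS
    return result
```

A query for a time range then fetches vectors only for the ids returned by `active_metrics`, instead of probing millions of mostly inactive metrics.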
  20. 20. HOW WELL DOES IT WORK? DATA IS REASONABLY COMPACT, EVEN THOUGH NOT COMPRESSED FOR VIVIDCORTEX’S 50 PRODUCTION HOSTS: FOR 10 DAYS OF 1-SECOND DATA AND 90 DAYS OF 1-MIN 80GB OF TOTAL DATA MOST DATA IS IN QUERY SAMPLES, EVENT DATA, ETC (BLOBS)
  21. 21. HOW IS PERFORMANCE? WE USE “WEAK” AWS EC2 SERVERS; 8CPU, 26GB MEMORY WE INGEST ~28 BILLION METRICS PER DAY (332K/SEC) THESE ARE ESSENTIALLY HANDLED 100% BY 3 SERVERS (WE HAVE PASSIVE STANDBY SERVERS IN-REGION, CROSS-REGION, BACKUPS, ETC).
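
The throughput figures on slide 21 can be cross-checked with a few lines of arithmetic. Only the 332K/sec rate and the server count come from the slide. The variable names are illustrative, and the even split across servers is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60             # 86,400
rate_per_sec = 332_000                     # from the slide
servers = 3                                # from the slide

per_day = rate_per_sec * SECONDS_PER_DAY   # 28,684,800,000 metrics per day
per_server = rate_per_sec / servers        # roughly 110,667 metrics per second
```

The daily total matches the slide's "~28 billion", and the per-server rate is in line with the "about 100K metrics/sec per shard" quoted on slide 17.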
  22. 22. WHAT’S GOOD? RAW EFFICIENCY PER SERVER IS REASONABLY HIGH OUR INFRASTRUCTURE IS FAIRLY HOMOGENEOUS WE’RE RUNNING PRETTY LEAN
  23. 23. WHAT’S NOT SO GOOD? PROGRAMMER EFFICIENCY IS LOW CAN’T AD-HOC QUERY THE TIMESERIES DATA MUST USE INTERNAL TIMESERIES SERVICE INSTEAD MYSQL IS STILL NOT AS EFFICIENT AS I WANT INNODB OVERHEAD INDEXING MAY NOT BE A WIN FOR OUR USE CASE
  24. 24. ALTERNATIVES? CASSANDRA, CASSANDRA+SPARK, ELASTICSEARCH, INFLUXDB, HBASE, OPENTSDB, DRUID…? PROBLEMS: COMPLEXITY, PERFORMANCE, IMMATURITY, INEFFICIENCY, UNRELIABILITY... VENDOR PITCHES ARE OFTEN FAIRLY ABSURD RIGHT NOW, MYSQL’S RAW EFFICIENCY IS ENOUGH TO COMPENSATE FOR SOME OTHER SHORTCOMINGS. BETTER THE DEVIL YOU KNOW THAN THE DEVIL YOU DON’T?
  25. 25. UPCOMING STACKSCOPE INDUSTRY ROUND TABLE FREE ONLINE WEBINAR JANUARY 22ND FOCUS ON EMERGING TECHNOLOGY LANDSCAPE ADRIAN COCKCROFT, NATHEN HARVEY, JASON DIXON VIVIDCORTEX.COM/WEBINARS/INDUSTRY-ROUNDTABLE/ NEXT WEBINAR: WHAT SHOULD I MONITOR? VIVIDCORTEX.COM/NEWS-EVENTS/
  26. 26. QUESTIONS? TWEET TO #VIVIDCORTEX CONTACT US: - @VIVIDCORTEX - VIVIDCORTEX.COM HTTPS://WWW.FLICKR.COM/PHOTOS/OREGONDOT/14721613997/
