Running a Realtime Stats Service on MySQL

Slides used at Percona Performance Conference. Describes the optimizations and tweaks used in running pathtraq.com, one of Japan's largest web stats services.

Running a Realtime Stats Service on MySQL Presentation Transcript

  • 1. Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku
  • 2.
    • Background
    Apr. 23 2009 Running Realtime Stats Service on MySQL
  • 3. Who am I?
    • Name: Kazuho Oku ( 奥 一穂 )
    • Original Developer of Palmscape / Xiino
      • The oldest web browser for Palm OS
    • Working at Cybozu Labs since 2005
      • Research subsidiary of Cybozu, Inc.
      • Cybozu is a leading groupware vendor in Japan
      • My weblog: tinyurl.com/kazuho
  • 4.
    • Introduction of Pathtraq
  • 5. What is Pathtraq?
    • Started in Aug. 2007
    • Web ranking service
      • One of Japan’s largest
        • 〜 10,000 users submit access information
        • 〜 1,000,000 access records per day
      • like Alexa, but semi-realtime, and per-page
  • 6. What is Pathtraq? (cont'd)
    • Automated Social News Service
      • find what's hot
      • like Google News + Digg
      • calculate relevance from access stats
    • Search by...
      • no filtering (all the Internet)
      • by category
      • by keyword
      • by URL (per-domain, etc.)
  • 7.  
  • 8.  
  • 9. How to Provide Real-time Analysis?
    • Data Set (as of Apr. 23 2009)
      • # of URLs: 147,748,546
      • # of total accesses: 413,272,527
    • Sharding is not a good option
      • since we need to join the tables and aggregate
        • prefix-search by URL, search by keyword, then join with access data table
      • core tables should be stored on RAM
        • not on HDD, due to lots of random access
  • 10. Our Decision was to...
    • Keep URL and access stats on RAM
      • compression for size and speed
    • Create a new message queue
    • Limit Pre-computation Load
    • Create our own cache, with locks
      • to minimize database access
    • Fulltext-search database on SSD
  • 11. Our Servers
    • Main Server
      • Opteron 2218 x2, 64GB Mem
      • MySQL, Apache
    • Fulltext Search Server
      • Opteron 240EE, 2GB Mem, Intel SSD
      • MySQL (w. Tritonn/Senna)
    • Helper Servers
      • for Content Analysis
      • for Screenshot Generation
  • 12. The Long Tail of the Internet
    • y = C · x^(-0.44)
      • # of URLs with 1/10 the hits: x2.75
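
The x2.75 figure follows from the exponent alone. The sketch below assumes x is the number of hits per URL and y is the number of URLs with that many hits; the transcript does not label the axes, so that reading is an interpretation.

```python
# Power-law fit from the slide: y = C * x ** -0.44
# Assumption: x = hits per URL, y = number of URLs with that many hits.
EXPONENT = -0.44


def url_count_ratio(hit_ratio: float) -> float:
    """Factor by which the URL count changes when hits are scaled by hit_ratio."""
    return hit_ratio ** EXPONENT


# One tenth of the hits: the constant C cancels out of the ratio.
ratio = url_count_ratio(1 / 10)
print(f"{ratio:.2f}")  # 2.75
```

Because C cancels, the ratio holds at any point on the curve: each step down by a factor of ten in popularity brings about 2.75 times as many URLs.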
  • 13.
    • Compressing URLs
  • 14. Compressing URLs
    • The Challenges:
      • URLs are too short for gzip, etc.
      • URLs should be prefix-searchable in compressed form
        • How to run LIKE 'http://www.mysql.com/%' against a compressed URL?
    • The Answer:
      • Static PPM + Range Coder
  • 15. Static PPM
    • PPM: Prediction by Partial Matching
      • What is the next character after ".co"?
        • The answer is "m"!
      • PPM is used by 7-zip, etc.
    • Static PPM is PPM with a static probabilistic model
      • Many URLs (or English words) have common patterns
      • Suitable for short texts (like URLs)
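
As a rough illustration of a static model, the sketch below counts which character follows each fixed-length context in a small training set and turns the counts into probabilities. The context length, the sample URLs, and the function names are all assumptions made for the example; a real PPM model also falls back to shorter contexts when a context is unseen, which is omitted here.

```python
from collections import Counter, defaultdict

ORDER = 3  # context length; an illustrative choice, not Pathtraq's setting


def build_model(urls, order=ORDER):
    """Count which character follows each `order`-character context."""
    model = defaultdict(Counter)
    for url in urls:
        for i in range(order, len(url)):
            model[url[i - order:i]][url[i]] += 1
    return model


def probabilities(model, context):
    """Return {next_char: probability}; empty if the context was never seen."""
    counts = model[context]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical training set; the real table was built from stored URLs.
sample_urls = [
    "http://www.mysql.com/",
    "http://www.example.com/index.html",
    "http://www.percona.com/",
    "http://labs.cybozu.co.jp/",
]
model = build_model(sample_urls)
probs = probabilities(model, ".co")
best = max(probs, key=probs.get)
print(best, probs[best])  # m 0.75
```

The table is built once and then frozen, so the compressor and decompressor share the same predictions without having to learn them from each short string.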
  • 16. Range Coder
    • A fast variant of arithmetic coding
      • similar to Huffman coding, but better
      • If the probability of the next character being "m" is 75%, it is encoded in about 0.42 bits
    • Compressed strings preserve the sort order of uncompressed form
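
The 0.42-bit figure is the information content of a 75% event, which an arithmetic or range coder can approach because it is not forced to spend a whole number of bits per character. A small check follows; the list of per-character probabilities is made up for illustration.

```python
import math


def ideal_bits(probability: float) -> float:
    """Information content, in bits, of a symbol with the given probability."""
    return -math.log2(probability)


bits_for_m = ideal_bits(0.75)
print(round(bits_for_m, 2))  # 0.42

# Hypothetical probabilities that a model assigns to four successive characters.
char_probs = [0.75, 0.5, 0.9, 0.25]
total_bits = sum(ideal_bits(p) for p in char_probs)

# A symbol-by-symbol Huffman code spends at least one bit per character,
# so it could not go below len(char_probs) bits for the same string.
print(total_bits < len(char_probs))  # True
```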
  • 17. Create Compression Functions
    • Build prediction table from stored URLs
    • Implement range coder
      • took an open-source impl. and optimized it
        • original impl. added some unnecessary bits at the tail
        • use SSE instructions for faster operation
        • coderepos.org/share/browser/lang/cplusplus/range_coder
    • Link the coder and the table to create MySQL UDFs
  • 18. Rewriting the Server Logic
    • Change schema
      • before: url varchar(255) not null # with unique index
      • after: urlc varbinary(767) not null # with unique index
    • Change prefix-search form
      • before: url like 'http://example.com/%'
      • after: url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
      • Note: '0' is the character that follows '/' in ASCII
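
The range form works on any byte-ordered index: every string that starts with a prefix sorts at or after the prefix and before the prefix with its last byte incremented. The sketch below shows this on uncompressed byte strings with a sorted list standing in for the index; the helper names are hypothetical, and applying the same bounds to urlc relies on the slides' statement that compression preserves sort order.

```python
import bisect


def prefix_upper_bound(prefix: bytes) -> bytes:
    """Smallest byte string greater than every string starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")  # a trailing 0xff cannot be incremented
    if not trimmed:
        raise ValueError("no upper bound exists for this prefix")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_search(sorted_keys, prefix: bytes):
    """Return the keys in [prefix, upper_bound), i.e. those starting with prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix_upper_bound(prefix))
    return sorted_keys[lo:hi]


# A sorted list plays the role of the unique index (sample URLs are made up).
keys = sorted([
    b"http://example.com/",
    b"http://example.com/a",
    b"http://example.com0",
    b"http://example.org/",
    b"http://example.co.jp/",
])
upper = prefix_upper_bound(b"http://example.com/")
hits = prefix_search(keys, b"http://example.com/")
print(upper)  # b'http://example.com0'
print(hits)   # [b'http://example.com/', b'http://example.com/a']
```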
  • 19. Compression Ratio
    • Compression ratio: 37%
      • Size of prediction table: 4MB
    • Benchmark of the compression functions
      • compression: 40MB/sec. (570k URLs/sec.)
      • decompression: 19.3MB/sec. (280k URLs/sec.)
      • fast enough, since URLs are searchable in compressed form
    • Prefix-search became faster
      • shorter indexes lead to faster operation
  • 20. Re InnoDB Compression
    • URL compression can coexist with InnoDB compression
      • though we aren't using InnoDB compression in our production environment
    • Table size by compression method:
        Compression          Table Size
        N/A                  100%
        URL compression      57%
        InnoDB compression   50%
        using both           33%
  • 21. Compressing the Stats Table
    • Used to have two int columns: at, cnt
      • it was a waste of space, since...
        • most cnt values are very small numbers
        • most accesses to each URL occur within a short period (ex. the day the blog entry was written)
        • the at field had to be part of the indexes
    • Example rows:
        at (hours since epoch)   cnt (# of hits)
        330168                   1
        330169                   2
        330173                   1
        330197                   1
  • 22. Compressing the Stats Table (cont'd)
    • Merge the rows into a sparse array
      • example on the prev. page becomes:
      • (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
    • Then compress the array
      • the example becomes a blob of 8 bytes
      • originally it was 8 bytes x 4 rows, plus the index
    • And store the array in a single column
      • fewer rows lead to smaller table, faster access
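
A minimal sketch of the merge step, applied to the four example rows. Storing the newest hour first is inferred from the slide's offset; the zero-run tokens are only a stand-in for the compression step, since the slides do not describe the byte-level format that yields the 8-byte blob. The sketch emits the three empty hours between 330169 and 330173 explicitly.

```python
def to_sparse_array(rows):
    """Merge (hour, count) rows into (offset, counts), newest hour first."""
    by_hour = dict(rows)
    newest, oldest = max(by_hour), min(by_hour)
    counts = [by_hour.get(h, 0) for h in range(newest, oldest - 1, -1)]
    return newest, counts


def run_length_zeros(counts):
    """Collapse runs of zeros into ("zeros", n) tokens; keep other values as-is."""
    tokens, zeros = [], 0
    for c in counts:
        if c == 0:
            zeros += 1
            continue
        if zeros:
            tokens.append(("zeros", zeros))
            zeros = 0
        tokens.append(c)
    if zeros:
        tokens.append(("zeros", zeros))
    return tokens


rows = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
offset, counts = to_sparse_array(rows)
tokens = run_length_zeros(counts)
print(offset, len(counts))  # 330197 30
print(tokens)               # [1, ('zeros', 23), 1, ('zeros', 3), 2, 1]
```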
  • 23. Compressing the Stats Table (cont'd)
    • Write MySQL UDFs to access the sparse array
      • cnt_add(column,at,cnt) -- adds cnt at the given index (at)
      • cnt_between(column,from,to) -- returns # of hits between the given hours
      • and more...
    • We use int[N] arrays for vectorized calc.
      • especially when creating access charts
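
To make the semantics of the two UDFs concrete, here is a Python sketch that mimics them on an uncompressed (offset, counts) pair with the newest hour first. Only the function names and argument order come from the slide; the in-memory representation, the handling of an empty column, and the inclusive range in cnt_between are assumptions.

```python
def cnt_add(column, at, cnt):
    """Return a new (offset, counts) array with cnt added at hour `at`."""
    if column is None:                      # empty column: start a new array
        return (at, [cnt])
    offset, counts = column
    counts = list(counts)
    if at > offset:                         # newer hour: grow at the front
        counts = [0] * (at - offset) + counts
        offset = at
    index = offset - at
    if index >= len(counts):                # older hour: grow at the tail
        counts.extend([0] * (index - len(counts) + 1))
    counts[index] += cnt
    return (offset, counts)


def cnt_between(column, start, end):
    """Sum hits for hours start..end (treating both ends as inclusive)."""
    if column is None:
        return 0
    offset, counts = column
    return sum(c for i, c in enumerate(counts) if start <= offset - i <= end)


column = None
for at, cnt in [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]:
    column = cnt_add(column, at, cnt)

print(cnt_between(column, 330168, 330169))  # 3
print(cnt_between(column, 330170, 330197))  # 2
```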
  • 24.
    • Create a new Message Queue
  • 25. Q4M
    • A simple, reliable, fast message queue
      • runs as a pluggable storage engine of MySQL
      • GPL License; q4m.31tools.com
      • presented yesterday at MySQL Conference :-p
        • slides at tinyurl.com/q4m2009
    • Used for relaying messages between our servers
  • 26.
    • Limiting Pre-computation Load
  • 27. Limit # of CPU-intensive Pre-computations
    • Use cron & setlock
      • setlock is part of daemontools by djb
    • setlock
      • serializes processes by using flock
      • -n option: use trylock; if locked, do nothing
    • crontab entries:
        # use only one CPU core for pre-computation
        */2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
        5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
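
The try-lock behaviour of the -n option can be reproduced with flock directly. The sketch below is a Unix-only Python illustration of that behaviour, not setlock's implementation; the helper name and the temporary lock path are made up for the example.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Take an exclusive flock on path without blocking.

    Returns the open file on success, or None if the lock is already held.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


lock_path = os.path.join(tempfile.mkdtemp(), "tasks.lock")
first = try_lock(lock_path)    # acquires the lock
second = try_lock(lock_path)   # lock is held, so this gives up immediately
first.close()                  # closing the file releases the lock
third = try_lock(lock_path)    # succeeds again
print(first is not None, second is None, third is not None)
```

Dropping LOCK_NB gives the blocking form used by the second cron entry: the job waits for the lock instead of skipping its run.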
  • 28. Limit # of Disk-intensive Pre-computations
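    • Divide pre-computation into blocks and sleep in proportion to the elapsed time
        use List::Util qw(max);
        use Time::HiRes qw(time sleep);  # sub-second timing and sleeping
        my $LOAD = 0.25;                 # target fraction of time spent working
        while (1) {
          my $start = time();
          precompute_block();
          # sleep long enough that work / (work + sleep) stays near $LOAD
          sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
        }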
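
The same duty-cycle idea, restated in Python as a sketch: after each block, sleep for elapsed * (1 - load) / load so that the busy time stays near the target share of wall-clock time. The function name and the injectable clock and sleep parameters are additions made so the arithmetic can be checked without waiting.

```python
import time


def run_throttled(blocks, load=0.25, clock=time.monotonic, sleep=time.sleep):
    """Run each callable in blocks, sleeping so busy time is about `load` of total."""
    for block in blocks:
        start = clock()
        block()
        elapsed = max(clock() - start, 0.0)
        sleep(elapsed * (1 - load) / load)


# Simulated time: each block "takes" 2 seconds, so each sleep should be 6 seconds.
now = [0.0]
naps = []


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    naps.append(seconds)
    now[0] += seconds


def work():
    now[0] += 2.0


run_throttled([work, work], load=0.25, clock=fake_clock, sleep=fake_sleep)
print(naps, now[0])  # [6.0, 6.0] 16.0
```

With load = 0.25 the two blocks account for 4 of the 16 simulated seconds, i.e. a quarter of the time.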
  • 29.
    • Creating our own Cache System
  • 30. The Problem
    • Query cache is flushed on table update
      • access stats can be (should be) cached for a certain period
    • Memcached has a thundering-herd problem
      • all clients try to read the database when a cached-entry expires
      • critical for us since our queries do joins, aggregations, and sort operations
  • 31. Swifty and KeyedMutex
    • Swifty is a mmap-based cache
      • cached data shared between processes
      • lock-free on read, flock on write
      • notifies a single client that the accessed entry is going to expire within a few seconds
      • notified client can start updating a cache entry before it expires
    • KeyedMutex
      • a daemon used to block multiple clients issuing same SQL queries
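
The early-notification idea can be sketched in a few lines. The class below is a single-process, non-thread-safe stand-in and not Swifty's API: the class name, parameters, and return shape are hypothetical. It tells exactly one caller to refresh an entry that is about to expire, while everyone else keeps reading the cached value.

```python
import time


class EarlyRefreshCache:
    """Toy cache that asks one caller to refresh an entry shortly before expiry."""

    def __init__(self, ttl, refresh_window, clock=time.monotonic):
        self.ttl = ttl
        self.refresh_window = refresh_window
        self.clock = clock
        self.entries = {}  # key -> [value, expires_at, refresh_claimed]

    def set(self, key, value):
        self.entries[key] = [value, self.clock() + self.ttl, False]

    def get(self, key):
        """Return (value, should_refresh); value is None on a miss or after expiry."""
        entry = self.entries.get(key)
        now = self.clock()
        if entry is None or now >= entry[1]:
            return None, True
        value, expires_at, claimed = entry
        if not claimed and expires_at - now <= self.refresh_window:
            entry[2] = True  # only the first caller in the window is notified
            return value, True
        return value, False


now = [0.0]
cache = EarlyRefreshCache(ttl=60, refresh_window=5, clock=lambda: now[0])
cache.set("hot_entries", "rows")
fresh = cache.get("hot_entries")      # ("rows", False)
now[0] = 56.0
notified = cache.get("hot_entries")   # ("rows", True): this caller refreshes
others = cache.get("hot_entries")     # ("rows", False): served from cache
now[0] = 61.0
expired = cache.get("hot_entries")    # (None, True)
```

On a plain miss every caller is still told to recompute, which is the case a per-query lock such as KeyedMutex is meant to serialize.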
  • 32. Swifty and KeyedMutexd (cont'd)
    • Source code is available:
      • coderepos.org/share/browser/lang/c/swifty
      • coderepos.org/share/browser/lang/perl/Cache-Swifty
      • coderepos.org/share/browser/lang/perl/KeyedMutex
  • 33.
    • Fulltext-search on SSD
  • 34. Senna / Tritonn
    • Senna is a FTS engine popular in Japan
      • might not work well with European languages
    • Tritonn is a replacement for MyISAM FTS
      • uses Senna as backend
      • faster than MyISAM FTS
    • Wrote patches to support SSD
      • during our transition from RAM to SSD
      • patches accepted in Senna 1.1.4 / Tritonn 1.0.12
  • 35. FTS: RAM-based vs. SSD-based
    • Size of FTS data: 〜 20GB
    • Downgraded hardware to see if SSD-based FTS is feasible
    • Speed became ¼
      • but search latency is well below one second
    • Hardware comparison:
                  Old Hardware               New Hardware
        CPU       Opteron 2218 (2.6GHz) x2   Opteron 240 (1.4GHz)
        Memory    32GB                       2GB
        Storage   7200rpm SATA HDD           SSD (Intel X25-M)
  • 36.
    • Summary
  • 37. Summary
    • Use UDFs for optimization
    • Sometimes it is easier to scale UP
      • esp. when you can estimate your data growth
    • Use SSD for FTS
      • Baidu (China's leading search engine) uses SSD
    • Most of the things introduced are OSS
      • We plan to open-source our URL compression table as well
  • 38. We are Looking for...
    • If you are interested in localizing Pathtraq to your country, please contact us
      • we do not have resources outside of Japan
        • to translate the web interface
        • to ask people to install our browser extension
        • to follow local regulations, etc.
  • 39.
    • Thank you for listening
    • tinyurl.com/kazuho