Running a Realtime Stats Service on MySQL

Slides used at the Percona Performance Conference. Describes the optimizations and tweaks used in running pathtraq.com, one of Japan's largest web stats services.

  1. 1. Running a Realtime Stats Service on MySQL Cybozu Labs, Inc. Kazuho Oku
  2. 2. <ul><li>Background </li></ul>Apr. 23 2009 Running Realtime Stats Service on MySQL
  3. 3. Who am I? <ul><li>Name: Kazuho Oku (奥 一穂) </li></ul><ul><li>Original developer of Palmscape / Xiino </li></ul><ul><ul><li>The oldest web browser for Palm OS </li></ul></ul><ul><li>Working at Cybozu Labs since 2005 </li></ul><ul><ul><li>Research subsidiary of Cybozu, Inc. </li></ul></ul><ul><ul><li>Cybozu is a leading groupware vendor in Japan </li></ul></ul><ul><ul><li>My weblog: tinyurl.com/kazuho </li></ul></ul>
  4. 4. <ul><li>Introduction to Pathtraq </li></ul>
  5. 5. What is Pathtraq? <ul><li>Started in Aug. 2007 </li></ul><ul><li>Web ranking service </li></ul><ul><ul><li>One of Japan's largest </li></ul></ul><ul><ul><ul><li>~10,000 users submit access information </li></ul></ul></ul><ul><ul><ul><li>~1,000,000 access records submitted per day </li></ul></ul></ul><ul><ul><li>like Alexa, but semi-realtime, and per-page </li></ul></ul>
  6. 6. What is Pathtraq? (cont'd) <ul><li>Automated Social News Service </li></ul><ul><ul><li>find what's hot </li></ul></ul><ul><ul><li>like Google News + Digg </li></ul></ul><ul><ul><li>calculate relevance from access stats </li></ul></ul><ul><li>Search by... </li></ul><ul><ul><li>no filtering (the whole Internet) </li></ul></ul><ul><ul><li>by category </li></ul></ul><ul><ul><li>by keyword </li></ul></ul><ul><ul><li>by URL (per-domain, etc.) </li></ul></ul>
  7. 9. How to Provide Real-time Analysis? <ul><li>Data Set (as of Apr. 23 2009) </li></ul><ul><ul><li># of URLs: 147,748,546 </li></ul></ul><ul><ul><li># of total accesses: 413,272,527 </li></ul></ul><ul><li>Sharding is not a good option </li></ul><ul><ul><li>since we need to join the tables and aggregate </li></ul></ul><ul><ul><ul><li>prefix-search by URL, search by keyword, then join with the access data table </li></ul></ul></ul><ul><ul><li>core tables should be stored in RAM </li></ul></ul><ul><ul><ul><li>not on HDD, because of the heavy random access </li></ul></ul></ul>
  8. 10. Our Decision was to... <ul><li>Keep URL and access stats in RAM </li></ul><ul><ul><li>compression for size and speed </li></ul></ul><ul><li>Create a new message queue </li></ul><ul><li>Limit pre-computation load </li></ul><ul><li>Create our own cache, with locks </li></ul><ul><ul><li>to minimize database access </li></ul></ul><ul><li>Fulltext-search database on SSD </li></ul>
  9. 11. Our Servers <ul><li>Main Server </li></ul><ul><ul><li>Opteron 2218 x2, 64GB Mem </li></ul></ul><ul><ul><li>MySQL, Apache </li></ul></ul><ul><li>Fulltext Search Server </li></ul><ul><ul><li>Opteron 240EE, 2GB Mem, Intel SSD </li></ul></ul><ul><ul><li>MySQL (w. Tritonn/Senna) </li></ul></ul><ul><li>Helper Servers </li></ul><ul><ul><li>for Content Analysis </li></ul></ul><ul><ul><li>for Screenshot Generation </li></ul></ul>
  10. 12. The Long Tail of the Internet <ul><li>y = C · x^(-0.44) </li></ul><ul><ul><li># of URLs with 1/10 the hits: x2.75 </li></ul></ul>
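The x2.75 figure follows directly from the exponent. A minimal Python check, assuming y is the number of URLs that received x hits (the function name is ours, not from the slides):

```python
def url_count_ratio(hit_ratio: float, exponent: float = -0.44) -> float:
    """Ratio y(hit_ratio * x) / y(x) for the power law y = C * x**exponent."""
    # The constant C cancels out, so only the exponent matters.
    return hit_ratio ** exponent

# How many times more URLs have one tenth of the hits?
ratio = url_count_ratio(1 / 10)
print(f"{ratio:.2f}")
```

In other words, every tenfold drop in popularity roughly triples the number of URLs to store, which is why per-row size matters so much in this data set.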
  11. 13. <ul><li>Compressing URLs </li></ul>
  12. 14. Compressing URLs <ul><li>The Challenges: </li></ul><ul><ul><li>URLs are too short for gzip, etc. </li></ul></ul><ul><ul><li>URLs should be prefix-searchable in compressed form </li></ul></ul><ul><ul><ul><li>How to run LIKE 'http://www.mysql.com/%' against a compressed URL? </li></ul></ul></ul><ul><li>The Answer: </li></ul><ul><ul><li>Static PPM + Range Coder </li></ul></ul>
  13. 15. Static PPM <ul><li>PPM: Prediction by Partial Matching </li></ul><ul><ul><li>What is the next character after &quot;.co&quot;? </li></ul></ul><ul><ul><ul><li>The answer is &quot;m&quot;! </li></ul></ul></ul><ul><ul><li>PPM is used by 7-zip, etc. </li></ul></ul><ul><li>Static PPM is PPM with a static probabilistic model </li></ul><ul><ul><li>Many URLs (or English words) have common patterns </li></ul></ul><ul><ul><li>Suitable for short texts (like URLs) </li></ul></ul>
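To make the idea of a static prediction table concrete, here is a minimal Python sketch. It is not the author's implementation: the fixed context length of 3, the function names, and the sample URLs are all assumptions, and a real PPM model also falls back to shorter contexts when a context has never been seen.

```python
from collections import Counter, defaultdict

def build_table(urls, order=3):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for url in urls:
        for i in range(order, len(url)):
            table[url[i - order:i]][url[i]] += 1
    return table

def predict(table, context):
    """Return {next_char: probability} for a context seen in training."""
    counts = table[context]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical training sample; the real table was built from stored URLs.
SAMPLE_URLS = [
    "http://www.mysql.com/",
    "http://example.com/",
    "http://example.com/about",
    "http://www.example.co.jp/",
]

table = build_table(SAMPLE_URLS)
probs = predict(table, ".co")
print(probs)
```

Because the table is built once and then frozen, compressor and decompressor share the same model from the first byte, which is what makes the approach workable for strings as short as URLs.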
  14. 16. Range Coder <ul><li>A fast variant of arithmetic coding </li></ul><ul><ul><li>similar to Huffman coding, but better </li></ul></ul><ul><ul><li>If the probability of the next character being &quot;m&quot; is 75%, it is encoded in about 0.42 bits </li></ul></ul><ul><li>Compressed strings preserve the sort order of the uncompressed form </li></ul>
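Both claims on this slide can be illustrated with a toy arithmetic coder in Python. This is a sketch under stated assumptions, not the range coder used in Pathtraq: it uses exact fractions instead of fixed-precision integers, a context-free model with made-up probabilities, and an end-of-string marker that sorts before every real character.

```python
import math
from fractions import Fraction

def bits_for(probability: float) -> float:
    """Ideal code length, in bits, of a symbol with the given probability."""
    return -math.log2(probability)

# Static model (hypothetical numbers): symbols listed in byte order, with an
# end-of-string marker that sorts before every real character.
EOS = ""
MODEL = [(EOS, Fraction(1, 8)), ("/", Fraction(1, 8)),
         ("a", Fraction(1, 4)), ("b", Fraction(1, 4)), ("c", Fraction(1, 4))]

def encode(text: str) -> Fraction:
    """Return the lower bound of the interval that arithmetic coding assigns."""
    low, width = Fraction(0), Fraction(1)
    for sym in list(text) + [EOS]:
        cum = Fraction(0)
        for candidate, prob in MODEL:
            if candidate == sym:
                low += width * cum   # skip the ranges of all smaller symbols
                width *= prob        # narrow to this symbol's share
                break
            cum += prob
        else:
            raise ValueError(f"symbol {sym!r} not in model")
    return low

words = ["b", "ab", "abc", "a/c", "ca", "a"]
by_code = sorted(words, key=encode)
print(by_code == sorted(words))
print(round(bits_for(0.75), 3))
```

The order is preserved because each symbol's sub-range is laid out in the same order as the symbols themselves, so a string that sorts earlier always lands in an earlier interval.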
  15. 17. Create Compression Functions <ul><li>Build prediction table from stored URLs </li></ul><ul><li>Implement range coder </li></ul><ul><ul><li>took an open-source impl. and optimized it </li></ul></ul><ul><ul><ul><li>original impl. appended some unnecessary bits at the tail </li></ul></ul></ul><ul><ul><ul><li>use SSE instructions for faster operation </li></ul></ul></ul><ul><ul><ul><li>coderepos.org/share/browser/lang/cplusplus/range_coder </li></ul></ul></ul><ul><li>Link the coder and the table to create MySQL UDFs </li></ul>
  16. 18. Rewriting the Server Logic <ul><li>Change schema </li></ul><ul><ul><li>url varchar(255) not null # with unique index </li></ul></ul><ul><ul><li>↓ </li></ul></ul><ul><ul><li>urlc varbinary(767) not null # with unique index </li></ul></ul><ul><li>Change prefix-search form </li></ul><ul><ul><li>url like 'http://example.com/%' </li></ul></ul><ul><ul><li>↓ </li></ul></ul><ul><ul><li>url_compress('http://example.com/') &lt;= urlc and urlc &lt; url_compress('http://example.com0') </li></ul></ul><ul><ul><li>Note: &quot;0&quot; is the character that follows '/' in ASCII order </li></ul></ul>
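The upper bound of the range is obtained by incrementing the last byte of the prefix. A minimal Python sketch of that rewrite, run against uncompressed byte strings for readability (in the slides both bounds are passed through url_compress(), which works because the compressed form keeps the sort order); the function names here are ours:

```python
import bisect

def prefix_upper_bound(prefix: bytes) -> bytes:
    """Smallest byte string greater than every string starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")   # a trailing 0xff byte cannot be incremented
    if not trimmed:
        raise ValueError("no upper bound exists for this prefix")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def prefix_range(sorted_keys, prefix: bytes):
    """Emulate `lo <= key AND key < hi` on a sorted index."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix_upper_bound(prefix))
    return sorted_keys[lo:hi]

urls = sorted([
    b"http://example.com/",
    b"http://example.com/a",
    b"http://example.com0",
    b"http://example.org/",
    b"http://example.co.jp/",
])
matches = prefix_range(urls, b"http://example.com/")
print(matches)
```

A half-open range like this lets the index be scanned once from the lower bound, with no per-row pattern matching.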
  17. 19. Compression Ratio <ul><li>Compression ratio: 37% </li></ul><ul><ul><li>Size of prediction table: 4MB </li></ul></ul><ul><li>Benchmark of the compression functions </li></ul><ul><ul><li>compression: 40MB/sec. (570k URLs/sec.) </li></ul></ul><ul><ul><li>decompression: 19.3MB/sec. (280k URLs/sec.) </li></ul></ul><ul><ul><li>fast enough, since URLs can be searched in compressed form </li></ul></ul><ul><li>Prefix-search became faster </li></ul><ul><ul><li>shorter indexes lead to faster operation </li></ul></ul>
  18. 20. Re: InnoDB Compression <ul><li>URL compression can coexist with InnoDB compression </li></ul><ul><ul><ul><li>though we aren't using InnoDB compression in our production environment </li></ul></ul></ul><ul><li>Table size by compression method: </li></ul><ul><ul><li>none: 100% </li></ul></ul><ul><ul><li>URL compression: 57% </li></ul></ul><ul><ul><li>InnoDB compression: 50% </li></ul></ul><ul><ul><li>both: 33% </li></ul></ul>
  19. 21. Compressing the Stats Table <ul><li>Used to have two int columns: at, cnt </li></ul><ul><ul><li>it was a waste of space, since... </li></ul></ul><ul><ul><ul><li>most cnt values are very small numbers </li></ul></ul></ul><ul><ul><ul><li>most accesses to each URL occur within a short period (e.g. the day the blog entry was written) </li></ul></ul></ul><ul><ul><ul><li>the at field has to be part of the indexes </li></ul></ul></ul><ul><li>Example rows (at = hours since epoch, cnt = # of hits): </li></ul><ul><ul><li>at=330168, cnt=1 </li></ul></ul><ul><ul><li>at=330169, cnt=2 </li></ul></ul><ul><ul><li>at=330173, cnt=1 </li></ul></ul><ul><ul><li>at=330197, cnt=1 </li></ul></ul>
  20. 22. Compressing the Stats Table (cont'd) <ul><li>Merge the rows into a sparse array </li></ul><ul><ul><li>the example on the prev. page becomes: </li></ul></ul><ul><ul><li>(offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1 </li></ul></ul><ul><li>Then compress the array </li></ul><ul><ul><li>the example becomes a blob of 8 bytes </li></ul></ul><ul><ul><li>originally it was 8 bytes x 4 rows, plus the index </li></ul></ul><ul><li>And store the array in a single column </li></ul><ul><ul><li>fewer rows lead to a smaller table and faster access </li></ul></ul>
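The merge step can be sketched in a few lines of Python. The slides do not describe the actual on-disk encoding of the 8-byte blob, so the run-length step below is only a stand-in to show why the array compresses well; all names are ours. Note that the gap between hours 330173 and 330169 contributes three zeros of its own.

```python
def to_sparse_array(rows):
    """Merge (at, cnt) rows into (offset, counts), newest hour first."""
    by_hour = dict(rows)
    offset = max(by_hour)
    oldest = min(by_hour)
    counts = [by_hour.get(hour, 0) for hour in range(offset, oldest - 1, -1)]
    return offset, counts

def run_length(counts):
    """Collapse consecutive equal values into [value, repeat] pairs."""
    runs = []
    for value in counts:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# The four example rows from the previous slide.
ROWS = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
offset, counts = to_sparse_array(ROWS)
runs = run_length(counts)
print(offset, runs)
```

Thirty hourly slots collapse into six runs of small integers, which is the property any compact encoding of the array can exploit.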
  21. 23. Compressing the Stats Table (cont'd) <ul><li>Write MySQL UDFs to access the sparse array </li></ul><ul><ul><ul><li>cnt_add(column,at,cnt) </li></ul></ul></ul><ul><ul><ul><li>-- adds cnt on given index (at) </li></ul></ul></ul><ul><ul><ul><li>cnt_between(column,from,to) </li></ul></ul></ul><ul><ul><ul><li>-- returns # of hits between given hours </li></ul></ul></ul><ul><ul><ul><li>and more... </li></ul></ul></ul><ul><li>We use int[N] arrays for vectorized calc. </li></ul><ul><ul><li>especially when creating access charts </li></ul></ul>
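The behavior of the two UDFs can be modeled in Python on an uncompressed array. This is a sketch of the semantics only: the real functions are MySQL UDFs operating on a compressed blob, the class name is ours, and treating both ends of the cnt_between range as inclusive is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class HitCounts:
    """Dense per-hour counters; counts[i] is the hits at hour (offset - i)."""
    offset: int = 0
    counts: list = field(default_factory=list)

    def cnt_add(self, at: int, cnt: int) -> None:
        """Add cnt hits to the slot for hour `at`, growing the array if needed."""
        if not self.counts:
            self.offset, self.counts = at, [0]
        if at > self.offset:                      # newer hour: grow at the front
            self.counts = [0] * (at - self.offset) + self.counts
            self.offset = at
        index = self.offset - at
        if index >= len(self.counts):             # older hour: grow at the back
            self.counts.extend([0] * (index + 1 - len(self.counts)))
        self.counts[index] += cnt

    def cnt_between(self, start: int, end: int) -> int:
        """Total hits with start <= hour <= end (inclusive bounds assumed)."""
        first = max(self.offset - end, 0)
        last = min(self.offset - start, len(self.counts) - 1)
        if last < first:                          # range misses the array entirely
            return 0
        return sum(self.counts[first:last + 1])

stats = HitCounts()
for at, cnt in [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]:
    stats.cnt_add(at, cnt)
print(stats.cnt_between(330168, 330173))
```

Keeping the newest hour at index 0 means the common case, a hit in the current hour, touches only the head of the array.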
  22. 24. <ul><li>Create a new Message Queue </li></ul>
  23. 25. Q4M <ul><li>A simple, reliable, fast message queue </li></ul><ul><ul><li>runs as a pluggable storage engine of MySQL </li></ul></ul><ul><ul><li>GPL License; q4m.31tools.com </li></ul></ul><ul><ul><li>presented yesterday at MySQL Conference :-p </li></ul></ul><ul><ul><ul><li>slides at tinyurl.com/q4m2009 </li></ul></ul></ul><ul><li>Used for relaying messages between our servers </li></ul>
  24. 26. <ul><li>Limiting Pre-computation Load </li></ul>
  25. 27. Limit # of CPU-intensive Pre-computations <ul><li>Use cron and setlock </li></ul><ul><ul><li>setlock is part of daemontools by djb </li></ul></ul><ul><li>setlock </li></ul><ul><ul><li>serializes processes by using flock </li></ul></ul><ul><ul><li>-n option: use trylock; if locked, do nothing </li></ul></ul><ul><li># use only one CPU core for pre-computation </li></ul><ul><li>*/2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries </li></ul><ul><li>5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data </li></ul>
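The trylock behavior of the -n option can be reproduced with a non-blocking flock. The following Python sketch shows the idea on a Unix system; it is not setlock's source, and the helper name and temporary lock path are ours.

```python
import fcntl
import os
import tempfile

def try_lock(path: str):
    """Return an open, exclusively locked file, or None if the lock is taken."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

lock_path = os.path.join(tempfile.mkdtemp(), "tasks.lock")
first = try_lock(lock_path)     # acquires the lock
second = try_lock(lock_path)    # lock is held, so this returns None
first.close()                   # closing the file releases the lock
third = try_lock(lock_path)     # succeeds again
```

With both cron entries sharing one lock file, the frequent job simply skips a run while the nightly job holds the lock, so at most one pre-computation runs at a time.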
  26. 28. Limit # of Disk-intensive Pre-computations <ul><li>Divide the pre-computation into blocks and sleep depending on the elapsed time </li></ul><ul><li>use List::Util qw(max); </li></ul><ul><li>my $LOAD = 0.25; # fraction of time spent working </li></ul><ul><li>while (1) { </li></ul><ul><li>my $start = time(); </li></ul><ul><li>precompute_block(); </li></ul><ul><li>sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD); </li></ul><ul><li>} </li></ul>
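The sleep formula keeps the job busy for a fixed fraction of wall-clock time: with a load of 0.25, each block is followed by a pause three times as long as the block took. A small Python sketch of the same rule (function names and the injectable clock are ours, added to make the arithmetic easy to check):

```python
import time

def throttle_delay(elapsed: float, load: float) -> float:
    """Seconds to sleep so that work takes up `load` of wall-clock time."""
    if not 0 < load <= 1:
        raise ValueError("load must be in (0, 1]")
    # Clamp at zero in case the clock stepped backwards.
    return max(elapsed, 0.0) * (1 - load) / load

def run_throttled(blocks, load=0.25, sleep=time.sleep, clock=time.monotonic):
    """Run each callable in `blocks`, sleeping after each to cap the duty cycle."""
    for block in blocks:
        start = clock()
        block()
        sleep(throttle_delay(clock() - start, load))

print(throttle_delay(2.0, 0.25))
```

Because the pause scales with the measured time, the job backs off automatically when the disk is already busy and each block takes longer.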
  27. 29. <ul><li>Creating our own Cache System </li></ul>
  28. 30. The Problem <ul><li>Query cache is flushed on table update </li></ul><ul><ul><li>access stats can be (should be) cached for a certain period </li></ul></ul><ul><li>Memcached has a thundering-herd problem </li></ul><ul><ul><li>all clients try to read the database when a cached entry expires </li></ul></ul><ul><ul><li>critical for us since our queries do joins, aggregations, and sort operations </li></ul></ul>
  29. 31. Swifty and KeyedMutex <ul><li>Swifty is an mmap-based cache </li></ul><ul><ul><li>cached data shared between processes </li></ul></ul><ul><ul><li>lock-free on read, flock on write </li></ul></ul><ul><ul><li>notifies a single client that the accessed entry is going to expire within a few seconds </li></ul></ul><ul><ul><li>the notified client can start updating the cache entry before it expires </li></ul></ul><ul><li>KeyedMutex </li></ul><ul><ul><li>a daemon used to block multiple clients issuing the same SQL queries </li></ul></ul>
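The early-expiry notification can be sketched as a tiny in-process cache in Python. This only illustrates the idea and is not Swifty's design: Swifty shares data between processes through mmap and is lock-free on read, whereas this sketch uses a plain dictionary and a thread lock, and the class name, parameters, and explicit `now` argument are ours.

```python
import threading

class EarlyRefreshCache:
    """In-process sketch: warn exactly one reader shortly before expiry."""

    def __init__(self, ttl: float, refresh_window: float):
        self.ttl = ttl
        self.refresh_window = refresh_window
        self._entries = {}          # key -> [value, expires_at, refresher_chosen]
        self._lock = threading.Lock()

    def set(self, key, value, now: float) -> None:
        with self._lock:
            self._entries[key] = [value, now + self.ttl, False]

    def get(self, key, now: float):
        """Return (value, should_refresh); value is None on a miss."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None or now >= entry[1]:
                return None, True
            value, expires_at, chosen = entry
            if not chosen and expires_at - now <= self.refresh_window:
                entry[2] = True     # only this caller is asked to refresh
                return value, True
            return value, False

cache = EarlyRefreshCache(ttl=60, refresh_window=5)
cache.set("top", "ranking-v1", now=0)
early = cache.get("top", now=10)      # fresh entry: no refresh requested
first = cache.get("top", now=57)      # inside the window: asked to refresh
second = cache.get("top", now=58)     # someone is already refreshing
```

A true miss still returns should_refresh to every caller, which is the case the slide assigns to KeyedMutex: clients issuing the same query wait on a shared key instead of all hitting the database.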
  30. 32. Swifty and KeyedMutex (cont'd) <ul><li>Source code is available: </li></ul><ul><ul><li>coderepos.org/share/browser/lang/c/swifty </li></ul></ul><ul><ul><li>coderepos.org/share/browser/lang/perl/Cache-Swifty </li></ul></ul><ul><ul><li>coderepos.org/share/browser/lang/perl/KeyedMutex </li></ul></ul>
  31. 33. <ul><li>Fulltext-search on SSD </li></ul>
  32. 34. Senna / Tritonn <ul><li>Senna is an FTS engine popular in Japan </li></ul><ul><ul><li>might not work well with European languages </li></ul></ul><ul><li>Tritonn is a replacement for MyISAM FTS </li></ul><ul><ul><li>uses Senna as backend </li></ul></ul><ul><ul><li>faster than MyISAM FTS </li></ul></ul><ul><li>Wrote patches to support SSD </li></ul><ul><ul><li>during our transition from RAM to SSD </li></ul></ul><ul><ul><li>patches accepted in Senna 1.1.4 / Tritonn 1.0.12 </li></ul></ul>
  33. 35. FTS: RAM-based vs. SSD-based <ul><li>Size of FTS data: ~20GB </li></ul><ul><li>Downgraded hardware to see if SSD-based FTS is feasible </li></ul><ul><ul><li>CPU: Opteron 2218 (2.6GHz) x2 → Opteron 240 (1.4GHz) </li></ul></ul><ul><ul><li>Memory: 32GB → 2GB </li></ul></ul><ul><ul><li>Storage: 7200rpm SATA HDD → SSD (Intel X25-M) </li></ul></ul><ul><li>Speed dropped to ¼ </li></ul><ul><ul><li>but search latency is still well below one second </li></ul></ul>
  34. 36. <ul><li>Summary </li></ul>
  35. 37. Summary <ul><li>Use UDFs for optimization </li></ul><ul><li>Sometimes it is easier to scale up </li></ul><ul><ul><li>esp. when you can estimate your data growth </li></ul></ul><ul><li>Use SSD for FTS </li></ul><ul><ul><li>Baidu (China's leading search engine) uses SSD </li></ul></ul><ul><li>Most of the tools introduced here are OSS </li></ul><ul><ul><li>We plan to open-source our URL compression table as well </li></ul></ul>
  36. 38. We are Looking for... <ul><li>If you are interested in localizing Pathtraq to your country, please contact us </li></ul><ul><ul><li>we do not have resources outside of Japan </li></ul></ul><ul><ul><ul><li>to translate the web interface </li></ul></ul></ul><ul><ul><ul><li>to ask people to install our browser extension </li></ul></ul></ul><ul><ul><ul><li>to follow local regulations, etc. </li></ul></ul></ul>
  37. 39. <ul><li>Thank you for listening </li></ul><ul><li>tinyurl.com/kazuho </li></ul>