Running a Realtime Stats Service on MySQL

Published in: Technology, Design

  1. Running a Realtime Stats Service on MySQL
     Cybozu Labs, Inc.
     Kazuho Oku
  2. Background
     Apr. 23 2009, Running Realtime Stats Service on MySQL
  3. Who am I?
       • Name: Kazuho Oku
       • Original developer of Palmscape / Xiino
           • the oldest web browser for Palm OS
       • Working at Cybozu Labs since 2005
           • research subsidiary of Cybozu, Inc.
           • Cybozu is a leading groupware vendor in Japan
       • My weblog: tinyurl.com/kazuho
  4. Introduction of Pathtraq
  5. What is Pathtraq?
       • Started in Aug. 2007
       • Web ranking service
           • one of Japan's largest
           • 10,000 users submit access information
           • 1,000,000 access records per day
       • Like Alexa, but semi-realtime and per-page
  6. What is Pathtraq? (cont'd)
       • Automated social news service
           • find what's hot
           • like Google News + Digg
           • calculate relevance from access stats
       • Search by...
           • no filtering (all the Internet)
           • by category
           • by keyword
           • by URL (per-domain, etc.)
  7. How to Provide Real-time Analysis?
       • Data set (as of Apr. 23 2009)
           • # of URLs: 147,748,546
           • # of total accesses: 413,272,527
       • Sharding is not a good option
           • since we need to join the tables and aggregate
           • prefix-search by URL, search by keyword, then join with the access data table
       • Core tables should be stored in RAM
           • not on HDD, due to lots of random access
  8. Our Decision was to...
       • Keep URL and access stats in RAM
           • compression for size and speed
       • Create a new message queue
       • Limit pre-computation load
       • Create our own cache, with locks
           • to minimize database access
       • Fulltext-search database on SSD
  9. Our Servers
       • Main server
           • Opteron 2218 x2, 64GB memory
           • MySQL, Apache
       • Fulltext search server
           • Opteron 240EE, 2GB memory, Intel SSD
           • MySQL (w. Tritonn/Senna)
       • Helper servers
           • for content analysis
           • for screenshot generation
  10. The Long Tail of the Internet
       • y = C x^{-0.44}
       • # of URLs with 1/10 the hits: ×2.75
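The ×2.75 figure follows from the exponent, assuming y is the number of URLs that received x hits (the reading under which the two numbers on the slide agree):

```latex
\frac{y(x/10)}{y(x)}
  = \frac{C\,(x/10)^{-0.44}}{C\,x^{-0.44}}
  = 10^{0.44}
  \approx 2.75
```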
  11. Compressing URLs
  12. Compressing URLs
       • The challenges:
           • URLs are too short for gzip, etc.
           • URLs should be prefix-searchable in compressed form
           • how to run like 'http://www.mysql.com/%' on a compressed URL?
       • The answer:
           • Static PPM + Range Coder
  13. Static PPM
       • PPM: Prediction by Partial Matching
           • What is the next character after ".co"?
           • The answer is "m"!
           • PPM is used by 7-zip, etc.
       • Static PPM is PPM with a static probabilistic model
           • many URLs (or English words) have common patterns
           • suitable for short texts (like URLs)
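A minimal sketch of the idea behind a static model, assuming a fixed context length of two characters and a tiny hypothetical training set in place of the stored URLs. It is not the prediction table used by Pathtraq, and it omits the fallback to shorter contexts that a real PPM model performs when a context has never been seen.

```python
from collections import Counter, defaultdict

ORDER = 2  # context length in characters; an assumption for this sketch


def build_table(samples, order=ORDER):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for text in samples:
        for i in range(order, len(text)):
            table[text[i - order:i]][text[i]] += 1
    return table


def probability(table, context, char):
    """Estimated P(char | context); 0.0 if the context was never seen."""
    counts = table.get(context)
    if not counts:
        return 0.0
    return counts[char] / sum(counts.values())


# Hypothetical training data standing in for the stored URLs.
urls = [
    "http://www.mysql.com/",
    "http://example.com/",
    "http://example.co.jp/",
    "http://www.example.com/news",
]
table = build_table(urls)
p_m = probability(table, "co", "m")  # how likely is "m" after "co"?
```

Because the table is built once and then frozen, the compressor and decompressor can share it without transmitting any per-string statistics, which is what makes the approach workable for strings as short as URLs.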
  14. Range Coder
       • A fast variant of arithmetic compression
           • similar to Huffman encoding, but better
           • if the probability of the next character being "m" is 75%, it is encoded in 0.42 bits
       • Compressed strings preserve the sort order of the uncompressed form
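The two claims on this slide can be checked with a toy arithmetic coder. This is a sketch under stated assumptions: a four-symbol model with equal probabilities, symbols assigned to sub-intervals in sort order, and an end marker that sorts first. It uses exact fractions rather than the bit-level output of a real range coder.

```python
import math
from fractions import Fraction


def ideal_bits(p):
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)


# Toy static model (an assumption): symbols listed in sort order, with
# "\0" as an end marker that sorts before every real character.
MODEL = [("\0", Fraction(1, 4)), ("a", Fraction(1, 4)),
         ("b", Fraction(1, 4)), ("c", Fraction(1, 4))]


def interval_low(text):
    """Lower bound of the coding interval for text followed by the end marker."""
    low, width = Fraction(0), Fraction(1)
    for ch in text + "\0":
        cumulative = Fraction(0)
        for symbol, p in MODEL:
            if symbol == ch:
                low += width * cumulative  # move to this symbol's sub-interval
                width *= p                 # and narrow the interval to it
                break
            cumulative += p
        else:
            raise ValueError(f"symbol {ch!r} not in model")
    return low


bits_m = ideal_bits(0.75)  # cost of a symbol predicted with 75% probability

words = ["cab", "a", "abc", "b", "ab"]
by_code = sorted(words, key=interval_low)  # order of the encoded values
```

Sorting by the encoded value gives the same order as sorting the plain strings, because at the first differing character the smaller symbol always receives the lower sub-interval.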
  15. Create Compression Functions
       • Build prediction table from stored URLs
       • Implement range coder
           • took an open-source impl. and optimized it
           • original impl. added some unnecessary bits at the tail
           • use SSE instructions for faster operation
           • coderepos.org/share/browser/lang/cplusplus/range_coder
       • Link the coder and the table to create MySQL UDFs
  16. Rewriting the Server Logic
       • Change schema
           • before: url varchar(255) not null # with unique index
           • after:  urlc varbinary(767) not null # with unique index
       • Change prefix-search form
           • before: url like 'http://example.com/%'
           • after:  url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
       • Note: "0" is the next character after '/'
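The range form of a prefix search can be sketched on uncompressed byte strings. The helper name `prefix_upper_bound` is hypothetical. On the slide the same two bounds are passed through `url_compress`, which relies on the compressed form keeping the sort order.

```python
def prefix_upper_bound(prefix: bytes) -> bytes:
    """Smallest byte string greater than every string starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")  # a trailing 0xFF byte cannot be incremented
    if not trimmed:
        raise ValueError("prefix has no upper bound")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


lower = b"http://example.com/"
upper = prefix_upper_bound(lower)  # '/' (0x2F) becomes '0' (0x30)


def matches(url: bytes) -> bool:
    """Equivalent of url LIKE 'http://example.com/%' as a half-open range."""
    return lower <= url < upper
```

A half-open range like this can be answered by a single B-tree index scan, which is why the rewritten predicate still uses the unique index on the compressed column.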
  17. Compression Ratio
       • Compression ratio: 37%
       • Size of prediction table: 4MB
       • Benchmark of the compression functions
           • compression: 40MB/sec. (570k URLs/sec.)
           • decompression: 19.3MB/sec. (280k URLs/sec.)
           • fast enough, since URLs are searchable in compressed form
       • Prefix-search became faster
           • shorter indexes lead to faster operation
  18. Re InnoDB Compression
       • URL compression can coexist with InnoDB compression
           • though we aren't using InnoDB compression in our production environment

       Compression          Table Size
       N/A                  100%
       URL compression      57%
       InnoDB compression   50%
       using both           33%
  19. Compressing the Stats Table
       • Used to have two int columns: at, cnt
           • it was a waste of space, since...
           • most cnt values are very small numbers
           • most accesses to each URL occur in a short period (e.g. the day the blog entry was written)
           • the at field should be part of the indexes

       at (hours since epoch)   cnt (# of hits)
       330168                   1
       330169                   2
       330173                   1
       330197                   1
  20. Compressing the Stats Table (cont'd)
       • Merge the rows into a sparse array
           • the example on the prev. page becomes: (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
       • Then compress the array
           • the example becomes a blob of 8 bytes
           • originally it was 8 bytes x 4 rows, with index
       • And store the array in a single column
           • fewer rows lead to a smaller table and faster access
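The merge step can be sketched as follows, using the four rows from slide 19. The array is laid out newest hour first, starting at the offset, and hours without a row become zeros. The run-length pass is only a stand-in to show why such an array compresses well. The slides do not describe the actual byte format of the 8-byte blob.

```python
# at (hours since epoch) -> cnt, taken from the table on slide 19
rows = {330168: 1, 330169: 2, 330173: 1, 330197: 1}


def to_sparse_array(rows):
    """Return (offset, counts) with counts[i] = hits at hour offset - i."""
    offset = max(rows)
    oldest = min(rows)
    counts = [rows.get(hour, 0) for hour in range(offset, oldest - 1, -1)]
    return offset, counts


def run_lengths(counts):
    """Collapse runs of equal values into [value, repeat] pairs."""
    runs = []
    for value in counts:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


offset, counts = to_sparse_array(rows)
runs = run_lengths(counts)
```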
  21. Compressing the Stats Table (cont'd)
       • Write MySQL UDFs to access the sparse array
           • cnt_add(column,at,cnt) -- adds cnt at the given index (at)
           • cnt_between(column,from,to) -- returns # of hits between the given hours
           • and more...
       • We use int[N] arrays for vectorized calc.
           • especially when creating access charts
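A rough model of what the two UDFs named on the slide might do, written against an uncompressed (offset, counts) pair. The in-memory representation, the treatment of an empty column as `None`, and the inclusive bounds of `cnt_between` are all assumptions of this sketch. The real UDFs operate on the compressed blob inside MySQL.

```python
def cnt_add(column, at, cnt):
    """Return a new (offset, counts) array with cnt added at hour `at`."""
    if column is None:
        return at, [cnt]
    offset, counts = column
    counts = list(counts)
    if at > offset:  # newer hour: grow at the front and move the offset
        counts = [0] * (at - offset) + counts
        offset = at
    index = offset - at
    if index >= len(counts):  # older hour: grow at the back
        counts += [0] * (index - len(counts) + 1)
    counts[index] += cnt
    return offset, counts


def cnt_between(column, start, end):
    """Total hits with start <= hour <= end (inclusive bounds assumed)."""
    if column is None:
        return 0
    offset, counts = column
    return sum(c for i, c in enumerate(counts) if start <= offset - i <= end)


column = None
for at, cnt in [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]:
    column = cnt_add(column, at, cnt)

hits_first_day = cnt_between(column, 330168, 330168 + 23)
```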
  22. Create a new Message Queue
  23. Q4M
       • A simple, reliable, fast message queue
           • runs as a pluggable storage engine of MySQL
           • GPL License; q4m.31tools.com
       • Presented yesterday at MySQL Conference :-p
           • slides at tinyurl.com/q4m2009
       • Used for relaying messages between our servers
  24. Limiting Pre-computation Load
  25. Limit # of CPU-intensive Pre-computations
       • Use cron & setlock
           • setlock is part of daemontools by djb
       • setlock
           • serializes processes by using flock
           • -n option: use trylock; if locked, do nothing

         # use only one CPU core for pre-computation
         */2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
         5 0 * * *   setlock /tmp/tasks.lock precompute_yesterday_data
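The trylock behaviour of `setlock -n` can be approximated with `flock` directly. This is a Unix-only sketch: the function names and the temporary lock path are hypothetical, and it illustrates the locking pattern rather than how setlock itself is implemented.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive flock, or None if busy."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_exclusively(path, task):
    """Run task() only if nobody else holds the lock; report whether it ran."""
    handle = try_lock(path)
    if handle is None:
        return False  # same outcome as `setlock -n`: skip this run
    try:
        task()
        return True
    finally:
        handle.close()  # closing the file releases the lock


# Hypothetical lock file for the demonstration
lock_path = os.path.join(tempfile.mkdtemp(), "tasks.lock")
```

Sharing one lock file between the two cron entries is what limits pre-computation to one process at a time: the frequent job gives up when the lock is busy, while the daily job (run without `-n`) waits its turn.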
  26. Limit # of Disk-intensive Pre-computations
       • Divide pre-computation into blocks and sleep depending on the elapsed time

         use List::Util qw(max);
         use Time::HiRes qw(time sleep);   # sub-second timing and sleeping

         my $LOAD = 0.25;   # fraction of wall-clock time spent working
         while (1) {
             my $start = time();
             precompute_block();
             # sleep so that the block just run accounts for $LOAD of the cycle
             sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
         }
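The sleep length in the loop follows from the target duty cycle. Writing t for the time one block took, s for the sleep that follows, and L for $LOAD (symbols introduced here for illustration):

```latex
L = \frac{t}{t + s}
\quad\Longrightarrow\quad
s = t \cdot \frac{1 - L}{L},
\qquad
L = 0.25 \;\Rightarrow\; s = 3t
```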
  27. Creating our own Cache System
  28. The Problem
       • Query cache is flushed on table update
           • access stats can be (should be) cached for a certain period
       • Memcached has a thundering-herd problem
           • all clients try to read the database when a cached entry expires
           • critical for us, since our queries do joins, aggregations, and sort operations
  29. Swifty and KeyedMutex
       • Swifty is a mmap-based cache
           • cached data is shared between processes
           • lock-free on read, flock on write
           • notifies a single client that the accessed entry is going to expire within a few seconds
           • the notified client can start updating the cache entry before it expires
       • KeyedMutex
           • a daemon used to block multiple clients issuing the same SQL queries
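The early-expiry notification can be illustrated with a small in-process cache. This is only a sketch of the idea: the class name, method names, and parameters are hypothetical and are not Swifty's API, and the real cache shares its data between processes through mmap.

```python
import time


class EarlyRefreshCache:
    """Toy cache: one caller per entry is told to refresh before expiry."""

    def __init__(self, ttl, refresh_window, clock=time.monotonic):
        self.ttl = ttl
        self.refresh_window = refresh_window
        self.clock = clock
        self.entries = {}  # key -> [value, expires_at, refresh_claimed]

    def set(self, key, value):
        self.entries[key] = [value, self.clock() + self.ttl, False]

    def get(self, key):
        """Return (value, should_refresh); value is None on a miss."""
        entry = self.entries.get(key)
        now = self.clock()
        if entry is None or now >= entry[1]:
            return None, True
        value, expires_at, claimed = entry
        if not claimed and expires_at - now <= self.refresh_window:
            entry[2] = True  # only the first caller in the window is notified
            return value, True
        return value, False
```

While the notified caller recomputes the value, every other caller keeps receiving the still-valid cached copy, so the expensive query runs once per expiry rather than once per client.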
  30. Swifty and KeyedMutexd (cont'd)
       • Source code is available:
           • coderepos.org/share/browser/lang/c/swifty
           • coderepos.org/share/browser/lang/perl/Cache-Swifty
           • coderepos.org/share/browser/lang/perl/KeyedMutex
  31. Fulltext-search on SSD
  32. Senna / Tritonn
       • Senna is an FTS engine popular in Japan
           • might not work well with European languages
       • Tritonn is a replacement for MyISAM FTS
           • uses Senna as backend
           • faster than MyISAM FTS
       • Wrote patches to support SSD
           • during our transition from RAM to SSD
           • patches accepted in Senna 1.1.4 / Tritonn 1.0.12
  33. FTS: RAM-based vs. SSD-based
       • Size of FTS data: 20GB
       • Downgraded hardware to see if SSD-based FTS is feasible
           • speed became ¼
           • but latency of searches is well below one second

                  Old Hardware               New Hardware
       CPU        Opteron 2218 (2.6GHz) x2   Opteron 240 (1.4GHz)
       Memory     32GB                       2GB
       Storage    7200rpm SATA HDD           SSD (Intel X25-M)
  34. Summary
  35. Summary
       • Use UDFs for optimization
       • Sometimes it is easier to scale UP
           • esp. when you can estimate your data growth
       • Use SSD for FTS
           • Baidu (China's leading search engine) uses SSD
       • Most of the things introduced are OSS
           • we plan to open-source our URL compression table as well
  36. We are Looking for...
       • If you are interested in localizing Pathtraq to your country, please contact us
           • we do not have resources outside of Japan
               • to translate the web interface
               • to ask people to install our browser extension
               • to follow local regulations, etc.
  37. Thank you for listening
      tinyurl.com/kazuho
