
Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache


Maintaining a constantly updated large data set is a big challenge not only for database administrators but also for developers, as it is hard to maintain and expand. The stress increases when the requirement is to serve real-time data to heavy-traffic websites.

In this presentation, we first examine the initial characteristics of AOL’s Real Time News system, the design strategy, and how MySQL fits into the overall architecture. We then review the issues encountered and the solutions applied when the system characteristics changed due to ever growing data set size and new query patterns.

In addition to common MySQL design, trouble-shooting, and performance tuning techniques, we will also share a heuristic algorithm implemented in the application servers to reduce the response time of complex queries from hours to a few milliseconds.

Published in: Technology

  1. 1. Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache
      Presented to MySQL Conference, Apr. 13, 2011
  2. 2. Who am I?
      • Tao Cheng, AOL Real Time News (RTN).
      • Worked on Mail and Browser clients in the ’90s, then moved to web backend servers.
      • Not an expert, but happy to share my experience and brainstorm solutions.
  3. 3. Agenda
      • AOL Real Time News (RTN): what is it?
      • Requirements
      • Technical solutions with focus on MySQL
      • Deployment Topology
      • Operational Monitoring
      • Metrics Collection
  4. 4. Agenda (cont.)
      • Tips for query tuning and optimization
      • Heuristic Query Optimization Algorithm
      • Lessons learned
      • Q & A
  5. 5. Real Time News: background
      AOL deployed its large scale Real Time News (RTN) system in 2007. This system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, has accumulated several billion rows and terabytes of data. Even so, news is delivered to end users in close to real time. This presentation shares how it is done and the lessons learned.
      Presentation for AOLU Un-University
  6. 6. Brief Intro: sample features
      • Data presentation: return most recent news in
          • flat view – most recent news about an entity. An entity could be a person, a company, a sports team, etc.
          • topic clusters – most recent news grouped by topics. A topic is a group of news about an event, headline news, etc.
      • News filtering by
          • source types such as news, blogs, press releases, regional, etc.
          • relevancy level (high, medium, low, etc.) to the entities.
      • Data delivery: push (to subscribers) and pull
      • Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
  7. 7. Requirements for Phase I (2006)
      • Commodity hardware: 4 CPU, 16 GB MEM, 600 GB disk space.
      • Data ingestion rate = 250K docs/day; average document size = 5 KB.
      • Data retention period: 7 days to forever
      • Est. data set size: (1.25 GB/day or 456 GB/year) + space for indexes, schema change, and optimization.
      • Response time: < 30 ms/query
      • Throughput: > 400 queries/sec/server
      • Uptime: 99.999%
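The data set size estimate on the requirements slide follows directly from the ingestion rate and the average document size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 KB), which is what the slide's round numbers imply; the function names are illustrative and not from the presentation:

```python
# Figures taken from the requirements slide.
DOCS_PER_DAY = 250_000
AVG_DOC_KB = 5
DISK_GB = 600

# Assumption: decimal units, so 1 GB = 1,000,000 KB.
KB_PER_GB = 1_000_000


def raw_growth_gb(days: int) -> float:
    """Raw document volume after `days` of ingestion, excluding indexes."""
    return DOCS_PER_DAY * AVG_DOC_KB * days / KB_PER_GB


def days_until_full(disk_gb: float) -> float:
    """Days of raw data a disk can hold if nothing else is stored on it."""
    return disk_gb / raw_growth_gb(1)


if __name__ == "__main__":
    print(raw_growth_gb(1))          # GB per day
    print(raw_growth_gb(365))        # GB per year
    print(days_until_full(DISK_GB))  # days before raw data alone fills the disk
```

Indexes, schema changes, and table optimization all need extra headroom, so the real limit arrives earlier than the raw figure suggests. That is one reason a single 600 GB host cannot hold a "forever" retention period.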
  8. 8. Solutions: MySQL + Bucky
      • MySQL
          • Serve raw/distinct queries
          • Back fill
      • Bucky Technology (AOL’s distributed cache & computing framework)
          • Write-ahead cache: pre-compute query results and push them into cache.
          • Messaging (optional): push data directly to subscribers
          • Updates are pushed to data consumers or browsers via AIM Complex.
      • Updates go to both database and cache.
  9. 9. Architecture Diagram (over-simplified)
      [Diagram. Labeled components: Relegence, Ingestor, Distributed Cache, Gateway, Asset DB, WWW, AIM. Labeled paths: push, pull.]
  10. 10. Data Model: SOR vs. Query DB
      • Separate query from storage to keep tables small and queries fast.
      • System of Record (SOR): has all raw data
          • The authoritative data store; designed for data storage
          • Normalized schema: for simple key look-up; no table join.
      • Query DB – de-normalized for query speed
          • Avoid JOIN, reduce # of trips to DB, increase throughput.
      • Read/write small chunks of data at a time so the database can get requests out quickly and process more.
      • Use replication to achieve linear scalability for reads.
  11. 11. Design Strategies: partitioning (Why)
      • Dataset too big to fit on one host
      • Performance consideration: divide and conquer
          • Write: more masters (Nx) to take writes
          • Read: smaller tables + more (NxM) slaves to handle reads.
      • Fault tolerance – distribute the risk and reduce the impact of system failure
      • Easier maintenance – size does matter
          • Faster nightly backup, disaster recovery, schema change, etc.
          • Faster optimization – need optimization to reclaim disk space after deletion, rebuild indexes to improve query speed.
  12. 12. Design Strategies: partitioning (How)
      • Partition on most used keys (look at query patterns)
          • Document table – on document ID
          • Entity table – on entity ID
      • Simple hash on IDs – no partition map; thus no competition for read/write locks on yet another table
      • Managing growth: add another partition set
          • New documents are written into both old and new partition sets for a few weeks. Then, stop writing into the old partitions.
          • Queries go to the new partitions first and then the old ones if insufficient results are found.
      • Works great in our case but might not for everyone.
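The partitioning scheme on the slide above (simple hash on IDs, plus a second partition set added for growth) can be sketched as follows. This is an illustration under assumptions, not AOL's implementation: plain dictionaries stand in for database partitions, the hash is assumed to be a modulo on the integer ID, and the names `PartitionSet` and `Router` are hypothetical.

```python
from typing import Any, Dict, List, Optional


class PartitionSet:
    """A fixed group of partitions addressed by a simple hash on the ID."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        # Each dict stands in for one database partition (hypothetical).
        self.partitions: List[Dict[int, Any]] = [{} for _ in range(num_partitions)]

    def partition_of(self, doc_id: int) -> int:
        # Modulo hash: no partition map table to look up or lock.
        return doc_id % len(self.partitions)

    def put(self, doc_id: int, doc: Any) -> None:
        self.partitions[self.partition_of(doc_id)][doc_id] = doc

    def get(self, doc_id: int) -> Optional[Any]:
        return self.partitions[self.partition_of(doc_id)].get(doc_id)


class Router:
    """Routes writes and reads across an old and a new partition set."""

    def __init__(self, first_set: PartitionSet) -> None:
        self.read_order = [first_set]   # newest set first
        self.write_sets = [first_set]

    def add_set(self, new_set: PartitionSet) -> None:
        # Start the overlap period: write to both the new and the old set.
        self.read_order.insert(0, new_set)
        self.write_sets = [new_set, self.write_sets[0]]

    def end_overlap(self) -> None:
        # After the overlap period, stop writing into the old set.
        self.write_sets = self.write_sets[:1]

    def write(self, doc_id: int, doc: Any) -> None:
        for partition_set in self.write_sets:
            partition_set.put(doc_id, doc)

    def read(self, doc_id: int) -> Optional[Any]:
        # Try the new partitions first, then fall back to the old ones.
        for partition_set in self.read_order:
            doc = partition_set.get(doc_id)
            if doc is not None:
                return doc
        return None
```

Because the partition is computed from the ID alone, no lookup table has to be consulted or locked on each request. The cost is that changing the partition count means adding a whole new set, as the slide describes, rather than rebalancing in place.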
  13. 13. Schema design: De-normalization
      • Make query tables small:
          • put only essential attributes in the de-normalized tables
          • store long text attributes in separate tables.
      • De-normalization: how to store and match attributes
          • Single value attributes (1:1): document ID, short string, date time, etc. – one column, one row.
          • Multi-value attributes (1:many): tricky but feasible
              • Use multiple rows with composite index/key: (c1, c2, etc.)
              • One row one column: CSV string, e.g., “id1, id2, id3” – SQL: “val like ‘%id2%’”
              • One row but multiple columns, e.g., group1, group2, etc. – SQL: group1=val1 OR group2=val2 ...
  14. 14. Tips for indexing
      • Simple key – for metadata retrieval
      • Composite key – find matching documents
          • Start with low cardinality and most used columns
          • Order matters: (c1, c2, c3) != (c2, c3, c1)
      • InnoDB – all secondary indexes contain the primary key
          • Make the primary key short to keep index size small
          • Queries using a secondary index reference the primary key too.
      • Integer vs. string – comparison of numeric values is faster => index hash values of long strings instead.
      • Index length – title:varchar(255) => idx_title(32)
      • Enforce referential integrity on the application side.
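The "index hash values of long strings" tip above can be sketched as follows. The slide does not say which hash function was used; CRC32 is assumed here purely for illustration, and `HashedTitleIndex` is a hypothetical in-memory stand-in for an integer-indexed column.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def title_hash(title: str) -> int:
    """32-bit integer hash of a long string (CRC32 is an assumption here)."""
    return zlib.crc32(title.encode("utf-8"))


class HashedTitleIndex:
    """In-memory stand-in for an index on an integer hash column."""

    def __init__(self) -> None:
        self._by_hash: DefaultDict[int, List[Tuple[int, str]]] = defaultdict(list)

    def add(self, doc_id: int, title: str) -> None:
        self._by_hash[title_hash(title)].append((doc_id, title))

    def find(self, title: str) -> List[int]:
        # Match on the integer hash first, then confirm the full string,
        # because two different strings can share the same hash value.
        candidates = self._by_hash.get(title_hash(title), [])
        return [doc_id for doc_id, stored in candidates if stored == title]
```

In SQL terms this corresponds to filtering on both the indexed hash column and the original string column, so the index narrows the candidates and the string comparison removes any hash collisions.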
  15. 15. MySQL configuration
      • Storage engine: InnoDB – row level locking
      • Table space – one file per table
          • Easier to maintain (schema change, optimization, etc.)
      • Character set: ‘UTF-8’
          • Disable persistent connection (5.0.x)
          • skip-character-set-client-handshake
      • Enable slow query log to identify bad queries.
      • System variables for memory buffer size
          • innodb_buffer_pool_size: data and indexes
          • sort_buffer_size, max_heap_table_size, tmp_table_size
          • Query cache size = 0; tables are updated constantly
  16. 16. Runtime statistics (per server)
      • Average write rate:
          • daily: < 40 tps
          • max at 400 tps during recovery
          • Performs best when write rate < 100 tps
      • Query rate: 20~80 qps
      • Query response time – shorter when indexes and data are in memory
          • 75%: ~3 ms when qps < 15; ~2 ms when qps ~= 60
          • 95%: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60
      • CPU Idle %: > 99%.
  18. 18. Deployment Topology Consideration
      • Minimum configuration: host/DC redundancy
          • DC1: host 1 (master), host 3 (slave)
          • DC2: host 2 (failover master), host 4 (slave)
      • Data locality: significant when network latency is a concern (100 Mbps)
          • 3,000 qps when DB is on remote host.
          • 15,000 qps when DB is on local host.
      • Linking dependent servers across data centers
          • Push cross link up as far as possible (Topology 3): link to dependent servers in the same data center.
  19. 19. Deployment Topology 1: minimum config
      [Diagram. Labels: Data Center 1, Data Center 2, DB (four hosts), WWW, Data Consumer.]
  20. 20. Topology 2: link across DCs (bad)
      [Diagram. Labels: WWW, GSLB, VIP, Data Consumer, DB.]
  21. 21. Topology 3: link to same DC (better)
      [Diagram. Labels: WWW, GSLB, VIP, Data Consumer, DB.]
  22. 22. Topology 4: use local UNIX socket
      [Diagram. Labels: WWW, GSLB, VIP, Data Consumer, DB.]
  23. 23. Production Monitoring
      • Operational monitoring: logcheck, Scout/NOC alert, etc.
      • DB monitoring on replication failure, latency, read/write rate, performance metrics.
  24. 24. Metrics Collection
      • Graphing collected metrics: visualize and collate operational metrics.
          • Helps analyze and fine-tune server performance.
          • Helps trace production issues and identify points of failure.
      • What metrics are important?
          • Host: CPU, MEM, disk I/O, network I/O, # of processes, CPU swap/paging
          • Server: throughput, response time
      • Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
  28. 28. Tuning and Optimizing Queries
      • Explain: mysql> explain SELECT ... FROM …
      • Watch out for tmp table usage, table scan, etc.
      • SQL_NO_CACHE
      • MySQL Query profiler
          • mysql> set profiling=1;
      • Linux OS cache: leave enough memory on host
      • USE INDEX hint to choose INDEX explicitly
          • Use wisely: most of the time, MySQL chooses the right index for you. But when table size grows, index cardinality might change.
  29. 29. Important MySQL statistics
      • SHOW GLOBAL STATUS…
          • Qcache_free_blocks
          • Qcache_free_memory
          • Qcache_hits
          • Qcache_inserts
          • Qcache_lowmem_prunes
          • Qcache_not_cached
          • Qcache_queries_in_cache
          • Select_scan
          • Sort_scan
  30. 30. Important MySQL statistics (cont.)
          • Table_locks_waited
          • Innodb_row_lock_current_waits
          • Innodb_row_lock_time
          • Innodb_row_lock_time_avg
          • Innodb_row_lock_time_max
          • Innodb_row_lock_waits
          • Select_scan
          • Slave_open_temp_tables
  31. 31. Heuristic Query Optimization Algorithm
      • Primarily for complex cluster queries: find latest N topics and related stories.
      • Strategy: reduce the number of records the database needs to load from disk to perform a query.
          • Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
          • If none are returned => sparse data => drop the range and retry.
          • Save query range for future reference.
      • Result: reduce the number of rows to process from millions to hundreds => cut query time down from minutes to less than 10 ms.
  32. 32. [Flowchart of the query range heuristic. Labels recovered from the slide:]
      • Cluster query: query range look up; NumOfTripToDB = 0.
      • Has query range? No: use default range. Yes: compute docs to range ratio and prorate it to a range that would return a sufficient amount of docs.
      • Bound query with the range and send it to DB (Query Engine); NumOfTripToDB++.
      • Sufficient results from query engine? Yes: compute docs to range ratio and save it back to the look up table for future use; return query results to clients.
      • NumOfTripToDB >= 2? numOfResults == 0? Send original query to DB.
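The heuristic described on the two slides above can be sketched in a few lines. This is an interpretation under assumptions, not the presenters' code: the default range of 3600, the limit of two bounded trips (read from the "NumOfTripToDB >= 2" label), and the `run_query` callback are all hypothetical, and the range is treated as a single number such as a look-back window in seconds.

```python
from typing import Callable, Dict, List, Optional

DEFAULT_RANGE = 3600    # hypothetical default query range
MAX_BOUNDED_TRIPS = 2   # after this many bounded trips, send the original query

# run_query(query_range) returns matching docs, newest first.
# A range of None means the original, unbounded query.
QueryFn = Callable[[Optional[float]], List[int]]


class RangeOptimizer:
    def __init__(self) -> None:
        # Look-up table: query key -> docs returned per unit of range.
        self.ratios: Dict[str, float] = {}

    def fetch(self, key: str, wanted: int, run_query: QueryFn) -> List[int]:
        ratio = self.ratios.get(key)
        # Prorate a saved ratio into a range expected to return `wanted` docs.
        query_range: float = wanted / ratio if ratio else DEFAULT_RANGE
        trips = 0
        while True:
            docs = run_query(query_range)
            trips += 1
            if len(docs) >= wanted:
                # Save the observed docs-to-range ratio for future queries.
                self.ratios[key] = len(docs) / query_range
                return docs[:wanted]
            if not docs or trips >= MAX_BOUNDED_TRIPS:
                # Sparse data or too many trips: drop the range and retry.
                return run_query(None)[:wanted]
            # Some docs, but not enough: expand the range proportionally.
            query_range = query_range * wanted / len(docs)
```

For a frequently queried entity the saved ratio shrinks the range, so the database touches only the most recent rows. For a sparse entity the first bounded query returns nothing and the code falls straight back to the unbounded query, at the cost of one extra round trip.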
  33. 33. Lessons Learned
      • Always load test well ahead of launch (2 weeks) to avoid fire drills.
      • Don’t rely solely on cache. The database needs to be able to serve a reasonable amount of queries on its own.
      • Separate cache from applications to avoid cold start.
      • Keep transactions/queries simple and return fast.
      • Avoid table joins; limit to 2 if really needed.
      • Avoid stored procedures: results are not cached; need a DBA when altering the implementation.
  34. 34. Lessons Learned (cont.)
      • Avoid using ‘offset’ in the LIMIT clause; use application-based pagination instead.
      • Avoid ‘SQL_CALC_FOUND_ROWS’ in SELECT
      • If possible, exclude text/blob columns from query results to avoid disk I/O.
      • Store text/blob in a separate table to speed up backup, optimization, and schema change.
      • Separate real time vs. archive data for better performance and easier maintenance.
      • Keep table size under control (< 100 GB); optimize periodically.
  35. 35. Lessons Learned (cont.)
      • Put SQL statements (templates) in resource files so you can tune them without a binary change.
      • Set up replication in dev & QA to catch replication issues earlier
          • Transactional (MySQL 5.0.x) vs. data/mixed (5.1 or above)
          • Auto-increment + (INSERT … ON DUPLICATE KEY UPDATE …)
          • Date time column: default to NOW()
          • Oversized data: increase max_allowed_packet
          • Replication lag: transactions that involve index update/deletion often take longer to complete.
      • Host and data center redundancy is important – don’t put all eggs in one basket.
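The lessons above advise against ‘offset’ in the LIMIT clause in favor of application-based pagination, without saying how it was done. One common way to do it is keyset pagination, where the application remembers the last ID it saw and asks for rows beyond it. The sketch below uses Python's built-in sqlite3 module with an in-memory table only so that it runs standalone; the `docs` table and the function names are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[int, str]


def make_demo_db(num_docs: int) -> sqlite3.Connection:
    """Create an in-memory table with IDs 1..num_docs (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO docs (id, title) VALUES (?, ?)",
        [(i, f"doc {i}") for i in range(1, num_docs + 1)],
    )
    return conn


def fetch_page(
    conn: sqlite3.Connection, page_size: int, before_id: Optional[int] = None
) -> Tuple[List[Row], Optional[int]]:
    """Return one page of newest-first rows plus the cursor for the next page."""
    if before_id is None:
        rows = conn.execute(
            "SELECT id, title FROM docs ORDER BY id DESC LIMIT ?", (page_size,)
        ).fetchall()
    else:
        # Seek by primary key instead of skipping rows with OFFSET.
        rows = conn.execute(
            "SELECT id, title FROM docs WHERE id < ? ORDER BY id DESC LIMIT ?",
            (before_id, page_size),
        ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

With OFFSET the database still has to read and discard every skipped row, so deep pages get slower. Seeking on the primary key keeps each page at roughly the same cost, provided the sort column is indexed and unique.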
  36. 36. RTN 3 Redesign
      • Free text search with SOLR
          • Real time vs. archive shards.
          • 1 minute latency w/o Ramdisk.
      • Asset DB partitioned – 5 rows/doc -> 25 rows/doc
      • Avoid (System) Virtual Machine; instead, stack high end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
          • Better network and system resource utilization – cost effective.
          • Data locality
      • More processors (< 12) help when under load.
  37. 37. Q&A
      • Questions or comments?
  38. 38. THANK YOU!!