Feed Burner Scalability
- 2. What is FeedBurner? 2
• Market-leading feed management provider
• 170,000 bloggers, podcasters and commercial
publishers including Reuters, USA TODAY,
Newsweek, Ars Technica, BoingBoing…
• 11 million subscribers in 190 countries.
• Web-based services help publishers expand
their reach online, attract subscribers and
make money from their content
• The largest advertising network for feeds
© 2006 FeedBurner
- 3. Scaling history 3
• July 2004
– 300Kbps, 5,600 feeds
– 3 app servers, 3 web servers, 2 DB servers
• April 2005
– 5Mbps, 47,700 feeds
– My first MySQL Users Conference
– 6 app servers, 6 web servers (same machines)
• September 2005
– 20Mbps, 109,200 feeds
• Currently
– 115 Mbps, 270,000 feeds, 100 Million hits per day
- 4. Scalability Problem 1: Plain old reliability 4
• August 2004
• 3 web servers, 3 app servers, 2 DB servers.
Round Robin DNS
• Single-server failure, seen by 1/3 of all users
- 5. Solution: Load Balancers, Monitoring 5
• Health Check pages
– Round trip all the way back to the database
– Same page monitored by load balancers
and monitoring
• Monitoring
– Cacti (http://www.cacti.net/)
– Nagios (http://www.nagios.org)
- 6. Health Check 6
UserComponent uc = UserComponentFactory.getUserComponent();
User user = uc.getUser("monitor-user");
// If first load, mark as down.
// Let FeedServlet mark things as up in its init method (load-on-startup).
String healthcheck = (String) application.getAttribute("healthcheck");
if (healthcheck == null || healthcheck.length() < 1) {
    healthcheck = "DOWN";
    application.setAttribute("healthcheck", healthcheck);
}
// getUser() returns null in case of a problem, or if the user doesn't exist
if (user == null) {
    healthcheck = "DOWN";
    application.setAttribute("healthcheck", healthcheck);
}
System.out.print(healthcheck);
- 7. Cacti 7
- 8. Start/Stop scripts 8
#!/bin/bash
# Source the environment
. ${HOME}/fb.env
# Start Tomcat
cd ${FB_APPHOME}
# Remove stale temp files
find ~/rsspp/catalina/temp/ -type f -exec rm -f {} \;
# Remove the work directory
#rm -rf ~/rsspp/catalina/work/*
${CATALINA_HOME}/bin/startup.sh
- 9. Start/Stop scripts 9
#!/bin/bash
FB_APPHOME=/opt/fb/fb-app
JAVA_HOME=/usr
CATALINA_HOME=/opt/tomcat
CATALINA_BASE=${FB_APPHOME}/catalina
CATALINA_OPTS="-Xmx768m -Xms768m -Dnetworkaddress.cache.ttl=0"
WEBROOT=/opt/fb/webroot
export JAVA_HOME CATALINA_HOME CATALINA_BASE CATALINA_OPTS WEBROOT
- 10. Scalability Problem 2: Stats recording/mgmt 10
• Every hit is recorded
• Certain hits mean more than others
• Flight recorder
• Any table-management operation locks the table
• Inserts slow way down (90GB table)
- 11. Solution: Executor Pool 11
• Executor Pool
– Doug Lea’s concurrency library
– Use a PooledExecutor so stats inserts happen in a
separate thread
– Spring bean definition:
<bean id="StatsExecutor"
      class="EDU.oswego.cs.dl.util.concurrent.PooledExecutor">
  <constructor-arg>
    <bean class="EDU.oswego.cs.dl.util.concurrent.LinkedQueue"/>
  </constructor-arg>
  <property name="minimumPoolSize" value="10" />
  <property name="keepAliveTime" value="5000" />
</bean>
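The deck wires up Doug Lea's PooledExecutor via Spring so stats inserts leave the request thread. A minimal sketch of that hand-off pattern, using java.util.concurrent (the JDK descendant of Lea's library) and an AtomicInteger standing in for the real stats DAO insert — class and variable names here are illustrative, not from the deck:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class StatsExecutorSketch {
    // Stand-in for the real stats DAO insert; here we just count calls.
    static final AtomicInteger inserted = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        // Mirrors the bean above: a 10-thread pool fed by an unbounded queue.
        ExecutorService statsExecutor = Executors.newFixedThreadPool(10);

        // The request thread hands each insert off and returns immediately.
        for (int i = 0; i < 100; i++) {
            statsExecutor.execute(inserted::incrementAndGet);
        }

        statsExecutor.shutdown();
        statsExecutor.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(inserted.get()); // 100
    }
}
```

The queue absorbs bursts of hits, so a slow insert delays stats recording rather than the page response.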
- 12. Solution: Lazy rollup 12
• Only today’s detailed stats need to go against
real-time table
• Roll up previous days into sparse summary
tables on-demand
• First access to a day's stats is slow;
subsequent requests are fast
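The lazy roll-up amounts to compute-on-first-read with a cached summary row. A minimal sketch, with in-memory maps standing in for the real-time detail table and the sparse summary table (all names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class LazyRollupSketch {
    // Stand-ins for the real-time detail table and the summary table.
    static final Map<String, int[]> realtimeHits = new HashMap<>();
    static final Map<String, Integer> dailySummary = new HashMap<>(); // key: feed|day

    static int statsFor(String feed, String day) {
        // First access: roll the detail rows up and store the summary row.
        return dailySummary.computeIfAbsent(feed + "|" + day, k -> {
            int total = 0;
            for (int hit : realtimeHits.getOrDefault(k, new int[0])) {
                total += hit;
            }
            return total; // subsequent requests read this summary directly
        });
    }

    public static void main(String[] args) {
        realtimeHits.put("myfeed|2006-04-24", new int[]{3, 4, 5});
        System.out.println(statsFor("myfeed", "2006-04-24")); // 12 (slow path)
        System.out.println(statsFor("myfeed", "2006-04-24")); // 12 (summary hit)
    }
}
```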
- 13. Scalability Problem 3: Primary DB overload 13
• Mostly used master DB server for everything
• Read vs. read/write load didn't matter in the
beginning
• Slow inserts would block reads, when using
MyISAM
- 14. Solution: Balance read and read/write load 14
• Looked at workload
– Found where we could break up read vs. read/write
– Created Spring ExtendedDaoObjects
– Tomcat-managed DataSources
• Balanced master vs. slave load (Duh)
– Slave becomes perfect place for snapshot backups
• Watch for replication problems
– Merge table problems (later)
– Slow queries slow down replication
- 16. ExtendedDaoObject 16
• Application code extends this class and uses
getHibernateTemplate() or getReadOnlyHibernateTemplate()
depending upon requirements
• Similar class for JDBC
public class ExtendedHibernateDaoSupport extends HibernateDaoSupport {
    private HibernateTemplate readOnlyHibernateTemplate;

    public void setReadOnlySessionFactory(SessionFactory sessionFactory) {
        this.readOnlyHibernateTemplate = new HibernateTemplate(sessionFactory);
        readOnlyHibernateTemplate.setFlushMode(HibernateTemplate.FLUSH_NEVER);
    }

    protected HibernateTemplate getReadOnlyHibernateTemplate() {
        return (readOnlyHibernateTemplate == null) ? getHibernateTemplate()
                                                   : readOnlyHibernateTemplate;
    }
}
- 17. Scalability Problem 4: Total DB overload 17
• Everything slowing down
• Using DB as cache
• Database is the ‘shared’ part of all app servers
• Ran into table size limit defaults on MyISAM
(4GB). We were lazy.
– Had to use Merge tables as a bridge to newer
larger tables
- 18. Solution: Stop using the database 18
• Where possible :)
• Multi-level caching
– Local VM caching (EHCache, memory only)
– Memcached (http://www.danga.com/memcached/)
– And finally, database.
• Memcached
– Fault-tolerant, but client handles that.
– Shared nothing
– Data is transient, can be recreated
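The multi-level lookup reads the closest layer first and repopulates the layers above on a miss. A minimal sketch with plain maps standing in for EHCache, memcached, and the database (keys and values are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class MultiLevelCacheSketch {
    // Level 1: per-VM cache (EHCache in the deck). Level 2: memcached,
    // shared by all app servers. Level 3: the database. Maps stand in.
    static final Map<String, String> localVm = new HashMap<>();
    static final Map<String, String> memcached = new HashMap<>();
    static final Map<String, String> database = new HashMap<>();

    static String get(String key) {
        String v = localVm.get(key);
        if (v != null) return v;                  // fastest: in-process
        v = memcached.get(key);
        if (v == null) {
            v = database.get(key);                // last resort: the DB
            if (v != null) memcached.put(key, v); // repopulate shared cache
        }
        if (v != null) localVm.put(key, v);       // repopulate local cache
        return v;
    }

    public static void main(String[] args) {
        database.put("feed:42:title", "Burning Questions");
        System.out.println(get("feed:42:title"));              // misses both caches, hits DB
        System.out.println(localVm.containsKey("feed:42:title")); // now cached locally
    }
}
```

Because the cached data can always be recreated from the database, losing a memcached node costs latency, not correctness.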
- 19. Scalability Problem 5: Lazy initialization 19
• Our stats get rolled up on demand
– Popular feeds slowed down the whole system
• FeedCount chicklet calculation
– Every feed gets its circulation calculated at the
same time
– Contention on the table
- 20. Solution: BATCH PROCESSING 20
• For FeedCount, we staggered the calculation
– Still would run into contention
– Stats stuff again slowed down at 1AM Chicago time.
• We now process the rolled-up data every night
– Delay showing the previous circulation in the
FeedCount until roll-up is done.
• Still wasn’t enough
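The deck doesn't spell out how the FeedCount calculation was staggered; one plausible scheme (an assumption, not the deck's) is a deterministic per-feed offset within a processing window, so each feed always lands in the same slot and the load spreads evenly:

```java
public class StaggerSketch {
    // Hypothetical: spread each feed's circulation calculation across a
    // window instead of firing them all at the same instant (e.g. 1 AM).
    static final int WINDOW_MINUTES = 6 * 60; // assumed 6-hour window

    static int offsetMinutesFor(long feedId) {
        // Deterministic per-feed slot; floorMod guards against negative ids.
        return (int) Math.floorMod(feedId, WINDOW_MINUTES);
    }

    public static void main(String[] args) {
        System.out.println(offsetMinutesFor(42L));   // 42
        System.out.println(offsetMinutesFor(1042L)); // 322
    }
}
```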
- 21. Scalability Problem 6: Stats writes, again 21
• Too much writing to master DB
• More and more data stored associated with
each feed
• More stats tracking
– Ad Stats
– Item Stats
– Circulation Stats
- 22. Solution: Merge Tables 22
• After the nightly rollup, we truncate the
subtable from 2 days ago
• Gotcha with truncating a subtable:
– FLUSH TABLES; TRUNCATE TABLE ad_stats0;
– Could succeed on master, but fail on slave
• The right way to truncate a subtable:
– ALTER TABLE ad_stats TYPE=MERGE
UNION=(ad_stats1,ad_stats2);
– TRUNCATE TABLE ad_stats0;
– ALTER TABLE ad_stats TYPE=MERGE
UNION=(ad_stats0,ad_stats1,ad_stats2);
- 23. Solution: Horizontal Partitioning 23
• Constantly identifying hot spots in the
database
– Ad serving
– Flare serving
– Circulation (constant writes, occasional reads)
• Move hottest tables/queries off to own clusters
– Hibernate and certain lazy patterns allow this
– Keeps the driving tables from slowing down
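At its simplest, horizontal partitioning like this is a lookup from hot table group to cluster, with everything else falling through to the main cluster. A sketch of that routing idea — the cluster URLs and table names here are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionRouterSketch {
    // Hypothetical mapping from hot table groups to their own DB clusters.
    static final Map<String, String> clusterByTable = new HashMap<>();
    static {
        clusterByTable.put("ad_stats", "jdbc:mysql://ads-db/fb");
        clusterByTable.put("flare", "jdbc:mysql://flare-db/fb");
        clusterByTable.put("circulation", "jdbc:mysql://circ-db/fb");
    }

    static String urlFor(String table) {
        // Tables without a dedicated cluster stay on the main cluster.
        return clusterByTable.getOrDefault(table, "jdbc:mysql://main-db/fb");
    }

    public static void main(String[] args) {
        System.out.println(urlFor("ad_stats")); // dedicated ads cluster
        System.out.println(urlFor("users"));    // main cluster
    }
}
```

In practice each entry would be a DataSource rather than a URL string; keeping the mapping in one place makes it cheap to peel off the next hot spot.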
- 24. Scalability Problem 7: Master DB Failure 24
• Still using just a primary and slave
• Master crash: Single point of failure
• No easy way to promote a slave to a master
- 25. Solution: No easy answer 25
• Still using auto_increment
– Multi-master replication is out
• Tried DRBD + HeartBeat
– Disk is replicated block-by-block
– Hot primary, cold secondary
• Didn’t work as we hoped
– Myisamchk takes too long after failure
– I/O + CPU overhead
• InnoDB is supposedly better
- 26. Our multi-master solution 26
• Low-volume master cluster
– Uses DRBD + HeartBeat
– Works well under smaller load
– Does mapping to feed data clusters
• Feed Data Cluster
– Standard Master + Slave(s) structure
– Can be added as needed
- 28. Scalability Problem 8: Power Failure 28
• Chicago has ‘questionable’ infrastructure.
• Battery backup, generators can be problematic
• Colo techs have been known to hit the Big
Red Switch
• Needed a disaster recovery/secondary site
– Active/Active not possible for us. Yet.
– Would have to keep fast connection to redundant
site
– Would require 100% of current hardware, but it
would sit idle
- 29. Code Name: Panic App 29
• Product Name: Feed Insurance
• Elegant, simple solution
• Not Java (sorry)
• Perl-based feed fetcher
– Downloads copies of feeds, saved as flat XML files
– Synchronized out to local and remote servers
– Special rules for click tracking, dynamic GIFs, etc
- 30. General guidelines 30
• Know your DB workload
– Cacti really helps with this
• ‘EXPLAIN’ all of your queries
– Helps keep crushing queries out of the system
• Cache everything that you can
• Profile your code
– Usually only needed on hard-to-find leaks
- 31. Our settings / what we use 31
• Don’t always need the latest and greatest
– Hibernate 2.1
– Spring
– DBCP
– MySQL 4.1
– Tomcat 5.0.x
• Let the container manage DataSources
- 32. JDBC 32
• Hibernate/iBatis/Name-Your-ORM-Here
– Use ORM when appropriate
– Watch the queries that your ORM generates
– Don't be afraid to drop to JDBC
• Driver parameters we use:
# For Internationalization of Ads, multi-byte characters in general
useUnicode=true
characterEncoding=UTF-8
# Biggest performance bits
cacheServerConfiguration=true
useLocalSessionState=true
# Some other settings that we've needed as things have evolved
useServerPrepStmts=false
jdbcCompliantTruncation=false
- 33. Thank You 33
Questions?
joek@feedburner.com