Executing Queries on a Sharded Database


Published on

Determining a data storage solution as your web application scales can be the most difficult part of web development, and takes time away from developing application features. MongoDB, Redis, Postgres, Riak, Cassandra, Voldemort, NoSQL, MySQL, NewSQL — the options are overwhelming, and all claim to be elastic, fault-tolerant, durable, and give great performance for both reads and writes. In the first portion of this talk I’ll discuss these different storage solutions and explain what is really important when choosing a datastore — your application data schema and feature requirements.

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • First, a bit about me.Former Software Engineer at Google – worked on Froogle, a product aggregator and shopping website, Blobstore, a storage system for large binary objects like gmail attachments and images, and Native Client, a way of running fast native code safely in the browser.Left to go be a PhD student at MIT, studying distributed systems at CSAIL (computer science and AI lab). I’ve been there for four years!This summer I decided to do an internship at news.me, as a data scientist, because I’m really interested on how we get our news and content these days. News.me then bought digg.com (interesting story) so what I really did last summer was work on powering real-time news by looking at social sharing on twitter and facebook, and relaunchingdigg in six weeks.
  • Here’s what I’m going to cover in this talk.First,Then, some myths I’ve heard about NoSQL, SQL, and partitioning on the internet.
  • Split?
  • Split?
  • Split?
  • Split?
  • Split?
  • Split?
  • Split?
  • But I kind of understand where they are coming from – there are so many options out there.
  • Apologies if I left your favorite out.It’s totally overwhelming, and there is a lot of hype.
  • So I’m here to tell you what to think about when you’re choosing a datastore. It’s definitely not scaling!
  • We can see what is really important by looking at the development cycle of a project.When prototyping, we really care about iterating quickly, and how easy the storage system is to use and how soon you can get up and running with itWhen developing, we care a lot about flexibility, the ability to develop new features, and having a codebase that multiple developers can use.When we’re running your system in the real world, our first priority is reliability and performance.If we even get to the stage where we need to think about sharding, it means we’ve really succeeded, and it’s a great problem to have!But we might not even get there.
  • One thing a lot of developers pay attention to is how easy it is to get started with a datastore.This is something I saw in someone else’s presentation, and I thought it was really great – what are the first five minutes like with your datastore?
  • Here we have redis, and it’s pretty awesome. This guy has created a 30 second guide -- A few commands and I’m storing keys and getting them back out. Very little startup overhead.
  • Contrast that with MySQL – I didn’t even show you the actual commands to run to get to the same point, because we have to go through so much other stuff first. Namely, setting up permissions, creating a database, and choosing a schema, which is something I might not even be ready to do yet!!!Redis seems like a big win here.
  • But – what happens when we get to the next stage, developing? At this point, we might have multiple people working on the code, we’re testing new features and accessing the data in different ways.Someone new might want to see how your data is structured.Let’s go back and look at Redis and MySQL again
  • This is a redis server I used at Digg for testing. I came in not knowing what was in this datastore – some of the key patterns were strewn about the code, which is not ideal. And the only way to tell the type of the key (set, sorted set, etc) was to look at the operations on it – zadd vs. set and get.How do I find out what’s going on?Redis has a keys command…. But this datastore is 24 GB of smallish keys. Running “keys” will take forever (click) and overload my server. This isn’t a good thing to do on a production system.So it’s hard to find out what’s going on here.
  • Contrast that with MySQL – because SQL is a flexible query language, I can learn about the data really quickly. Here I’m learning about the table “cards”. I can describe it, and select a few rows to see what’s in it.I can also easily add an index if I want to look up stuff in different ways, no need to completely rewrite everything in the database or change my code a lot.So if it’s not scaling and it’s not initial impressions that are important, what is?
  • What we should really think about are some of the following key questions. I’m going to present this as a list, and then go into detail on a few of themFirst off, we would want to choose a datastore optimized for how many reads/writes the app is doing.The size of our data is incredibly important when choosing from in-memory vs. disk optimized data stores. Do we require multi-statement transactions? If so, there are only a few datastores that provide that out of the box. However, it is possible to build out this layer in the application.Our access patterns are important too, as is how you predict your workload will grow and change, and I’ll describe this in more detail later in the talk when I talk about scaling.Finally, what do you already know? The best datastore is the one you’re most familiar with. A lot of them have pretty funny semantics, and I’ll give some examples.
  • It’s important to figure out whether you need a read-optimized or a write-optimizeddatastore.For example, if your application needs to do a lot of writes, you might be in trouble if you choose MongoDB – [click]MongoDB has a global write lock per process. What this means is that you can’t do concurrent writes, so your write-heavy application will be really slowed down!
  • So let’s talk about the size of our data. Disk based solutions are slower, but it’s really important to pick a datastore that can handle our data size if it might not fit in memory. And figuring this out can be a bit tricky – a store like Redis might actually require 2X the memory of our data size! This is because it forks a process to do background persistence to disk, and in the worst case, we might need to replicate all of that memory between the two processes. If Redis doesn’t get this memory, it will become totally unresponsive, and we might wish we went with a slower disk-based solution.There are ways around this, by changing the way our virtual memory manager allocates memory, but that’s scary. Also, it might not work if we’re changing all the keys while this happens
  • Another great thing to do is lay out our application requirements. What can we tolerate, if we have to make a tradeoff? Here are some axes we should think about.First, performance, or latency tolerance. How slow is ok?Second, durability or data loss tolerance. How bad is it if we lose a few writes, in the case of a crash?Freshness, or staleness tolerance – really this means, can we serve stale data from a cache? For how long?Finally, uptime, or downtime tolerance. Do we need to be able to switch over to a replica in the case of failure immediately, or can we handle being read-only or down for a while?
  • First, let’s discuss performance. Oftentimes relational databases are considered very slow.But, they are doing a lot of work! Some of which you might not be interested in. As an example, consider the base case for an application, doing very simple primary key lookups
  • The application needs the data fast, since it is responding to user requests in real time. The data mostly fits in memory [click], with a high read rate [click] and it’s primary key lookups, so there is no need [click] to go to disk or scan lots of data.What this means is that we have removed the constraints of disk and network, and really the CPU is the bottleneck here.What does this mean for a relational database? How much does it cost in server CPU time to do one of these lookups?
  • If we don’t need all of the fancy query flexibility, we have some options. First off, the query cache. Skips optimization and parsing, good if we are doing the same query over and overPrepared statements avoid parsing, but not optimization phase, and can’t use query cache, but work for the same query with different arguments.Handler Socket – made by YoshinoriMatsunobu, incredibly fast. Bypasses all of the querying machinery and talks directly to the storage engine so we still have all of the durability and error recovery that the storage engine guarantees! Also what’s really cool is we have the flexibility of using SQL on that data if we want. Lesson:
  • Now, you might be wondering WHY we would bother with a relational database if we’re just not going to use half of it. CLICK. Well, we might be willing to pay that query overhead for flexibility, so we can access our data in many different ways.[Read 2nd point] we have to know queries up front, otherwise we might find ourselves in a situation where we can’t execute certain queries because they require scanning all the data. [click] SQL makes it easier to change what queries we are issuing, which is especially important if we don’t fully know all the features of our application yet.
  • Cut this?
  • Cut this?
  • Another really important issue is Durability. How is this defined? Well, in a persistent datastore, here is what you should expect.When a client does a write, the server flushes that right to disk BEFORE sending “done” to the client.That way, if the server were to crash, when it recovers we can see the write. [click, read 2nd point] In fact, it doesn’t even return to the client AT ALL after a write, so the client has no idea if the write succeeded or not without making another request. [CLICK, mysql]The thing is, some applications might not care about every single write making it to disk. Maybe they are just logging things to a database. But I really think making this the default, for supposedly persistent datastores, was a bad idea. This is something we really have to be aware of when choosing a datastore and building our application
  • What happens when we’ve gotten to the point where we have so much traffic that we need to shard? This is the time when we start thinking about customized data store solutions.At this point we know our application very well, [click]and we know where we use simple queries and where we need something more complex. We know where to create copies for fast reads and where we need to keep one copy for consistency. We can use a combination of datastores [click] to get the best performance, but we know what tradeoffs we can make to use one datastore for ease of developmentHopefully we will reach this stage of tremendous growth. I’ve told before you don’t need to worry about scaling in the beginning, and here’s why: No matter what we chose initially, we can scale it! I’ll show you what I mean.
  • First, let’s discuss some ways of scaling – when scaling reads, we have a few options. We can set up a cache, like memcached. We can replicate your servers, so reads can be spread across all of them, like with MongoDB replica sets or MySQL master-slave replication, or we can partition the data amongst different severs.For writes we only have the latter option, so I’m going to talk about partitioning when I talk about scaling.
  • Now, there’s a lot of folklore out there about which systems do what. I’m going to address two myths around scaling.
  • First, the idea that we had to have chosen a NoSQL database from the beginning, because relational databases don’t scale well. This is not true.Second, that you can’t do a join if you have a sharded database. Also not true!
  • Scaling isn’t about the datastore, it’s about the application.[click]Plenty of applications have scaled using a relational database. They just use them in a different way. Friendfeed and Quora both appreciated the tested solution of MySQL, and built partitioning into their applications. Facebook has thousands of MySQL servers. (though I guess Stonebraker would say they are trapped in a fate worse than death… handling 1B users is nothing to laugh at)The issue here is complex vs. simple queries.[click]Complex queries that go to all shards don’t scale. [click]. Simple … read [PAUSE]Great. So all we need to do is pick a good partition key and send every lookup and query to one server! But…
  • We still have a problem. LOTS of applications need to look up data in different ways. Nobody really talks about this. What do I mean? Well, let’s look at an example.
  • This is a post page in Hacker News. We need to get all of the comments on a post. Let’s even ignore nested comments for now.Here is a SQL query, getting all the comments for post 100.Here’s how we might do it in redis if you had comments stored in a sorted set by post id.I’d also just like to point out that I got this snapshot from hacker news today, and it’s about a guy getting bitten by the fact that mongo doesn’t return anything to the client after a write. HA.
  • Read slideNow it becomes a bit confusing to figure out how to use these table copies. Particularly if you are still using SQL.
  • To solve this problem we created a system called Dixie (Read 1st point).SQL queries originally written as though for a single database with no table copies[read rest]Pause after this.So does Dixie actually help?
  • Well, to evaluate this, we ran an experiment using Wikipedia.Here is a graph showing how throughput scales as we add more servers, using one copy per table or using Dixie and table copies.First off, a bit about this experiment – we used a Wikipedia database dump from 2008, and real HTTP traces of traffic. We partitioned this across 1, 2, 5, and 10 servers, both with one copy of each table and then with an added copy of one of the tables, the page table, partitioned in a different way.The yellow bars are most queries going to all servers [click] while the blue bars are most queries going to one server [click] because of the extra table copyOn 10 servers, Dixie plus a copy provides a 3.2 X improvement [click] in throughput over not having table copies.This shows that as we scale to more partitions, the benefit of table copies grows. When there are only one or two servers table copies don’t help much.
  • So we’ve shown that a SQL datastore can scale perfectly well, if we add table copies and make sure queries only need to go to one server. Now let’s talk about JOINs.
  • [read slide]Let’s look at an example
  • A little background on distributed query planning – there has been a lot of research in this area, and systems like Oracle and DB2 have sophisticated distributed query planners. Not a ton of open source solutions. Some things out there are very simple, like Gizzard.
  • This is what Dixie does!Dixie has a [read 1st point]And a [read 2nd point]
  • [read slide]So, now I’m going to change track and talk about research I’m currently working on, involving cacheing
  • And this is research with people from Harvard and MIT.[read slide]The problem is it’s often confusing as to whether we should set an expire time on the data, or if we should pay the cost of invalidating it on writes.
  • [last point] And we do tracking at the level of ranges, so that if a result is based on a range, it will get invalidated when new things are added to that range. Even if the range was originally empty!
  • [click] So here’s an example of how this might work. We cache a user’s view of their twitter homepage, their timeline.[click] We only invalidate this cached object when another user the original user is following tweets[click]So if a user reads their timeline infrequently, the invalidations don’t matter, we only recalculate the timeline once on a read[click] And if the people the user follows don’t tweet often, the timeline won’t change and won’t need to be recalculated
  • There are many challenges involved in this work. First off, we want to offer applications the option of using invalidation and expiration times together, if they are willing to occasionally serve stale data.Next, we are creating a distributed cache, so there is all of the work that goes along with that.Finally, we need to properly do cache coherence on the original database rows, which are spread out amongst all of these distributed caches.
  • But this coule be really helpful – it would mean that an application could get all of the benefits of cachingWhich include saving on computation, like not having to regenerate HTML pages, and sending less traffic to the backend databaseWhile it doesn’t have to worry about freshness or invalidating resultsThis is still a work in progress. Hopefully we’ll have a paper soon.
  • So in summary, I’ve told you about choosing adastore, scaling it, Dixie, and a dependency tracking cacheI’d love to talk further about any of these things if you’re interested.
  • Thank you so much for your time! I’m really excited to talk about any of this stuff, and I don’t know anyone at strange loop, so please feel free to say hi!Here’s my contact information, with pointers to some of the things I talked about.
  • Cut this?
  • Executing Queries on a Sharded Database

    1. 1. Executing Queries on a Sharded Database Neha Narula September 25, 2012
    2. 2. Choosing and Scaling Your Datastore Neha Narula September 25, 2012
    3. 3. Who Am I? Froogle Blobstore Native Client
    4. 4. In This Talk• What to think about when choosing a datastore for a web application• Myths and legends• Executing distributed queries• My research: a consistent cache
    5. 5. Every so often…Friends ask me for advice when they are building a new application
    6. 6. Friend
    7. 7. “Hi Neha! I am making a new application.
    8. 8. I have heard MySQL sucks and I should use NoSQL.
    9. 9. I am going to be iterating on my app a lot
    10. 10. I dont have any customers yet
    11. 11. and my current data set could fit on a thumb drive from 2004, but…
    12. 12. Can you tell me which NoSQL database I should use?
    13. 13. And how to shard it?"
    14. 14. Neha
    15. 15. http://knowyourmeme.com/memes/facepalm
    16. 16. What to Think about When You’re Choosing a Datastore Hint: Not Scaling
    17. 17. Development Cycle• Prototyping – Ease of use• Developing – Flexibility, multiple developers• Running a real production system – Reliability• SUCCESS! Sharding and specialized datastorage – We should only be so lucky
    18. 18. Getting Started First five minutes – what’s it like?Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011 via Justin Sheehy from Basho
    19. 19. First Five Minutes: Redis http://simonwillison.net/2009/Oct/22/redis/
    20. 20. First Five Minutes: MySQL
    21. 21. Developing• Multiple people working on the same code• Testing new features => new access patterns• New person comes along…
    22. 22. RedisTime to go get lunch
    23. 23. MySQL
    24. 24. Questions• What is your mix of reads and writes?• How much data do you have?• Do you need transactions?• What are your access patterns?• What will grow and change?• What do you already know?
    25. 25. Reads and Writes• Write optimized vs. Read optimized• MongoDB has a global write lock per process• No concurrent writes!
    26. 26. Size of Data• Does it fit in memory?• Disk-based solutions will be slower• Worst case, Redis needs 2X the memory of your data! – It forks with copy-on-writesudo echo 1 > /proc/sys/vm/overcommit_memory
    27. 27. Requirements• Performance – Latency tolerance• Durability – Data loss tolerance• Freshness – Staleness tolerance• Uptime – Downtime tolerance
    28. 28. Performance• Relational databases are considered slow• But they are doing a lot of work!• Sometimes you need that work, sometimes you don’t.
    29. 29. Simplest Scenario• Responding to user requests in real time• Frequently read data fits in memory• High read rate• No need to go to disk or scan lots of data Datastore CPU is the bottleneck
    30. 30. Cost of Executing a Primary Key Lookup • Receiving message • Instantiating thread • Parsing query Query • Optimizing Cost • Take out read locks • (Sometimes) lookup in an index btree • Responding to the clientActually retrieving data
    31. 31. Options to Make This Fast• Query cache• Prepared statements• Handler Socket – 750K lookups/sec on a single serverLesson: Primary key lookups can be fast nomatter what datastore you use
    32. 32. Flexibility vs. Performance• We might want to pay the overhead for query flexibility• In a primary key datastore, we can only ask queries on primary key• SQL gives us flexibility to change our queries
    33. 33. Durability• Persistent datastores – Client: write – Server: flush to disk, then send “I completed your write” – CRASH – Recover: See the write• By default, MongoDB does not fsync() before returning to client on write – Need j:true• By default, MySQL uses MyISAM instead of InnoDB
    34. 34. Specialization• You know your – query access patterns and traffic – consistency requirements• Design specialized lookups – Transactional datastore for consistent data – Memcached for static, mostly unchanging content – Redis for a data processing pipeline• Know what tradeoffs to make
    35. 35. Ways to Scale• Reads – Cache – Replicate – Partition• Writes – Partition data amongst multiple servers
    36. 36. Lots of Folklore
    37. 37. Myths of Sharded Datastores• NoSQL scales better than a relational database• You can’t do a JOIN on a sharded datastore
    38. 38. MYTH: NoSQL scales better than a relational database• Scaling isn’t about the datastore, it’s about the application – Examples: FriendFeed, Quora, Facebook• Complex queries that go to all shards don’t scale• Simple queries that partition well and use only one shard do scale
    39. 39. Problem: Applications Look Up Data in Different Ways
    40. 40. HA!Post PageSELECT *FROM commentsWHERE post_id = 100zrange comments:100 0 -1
    41. 41. Example PartitionedDatabase Database comments table 0-99 post_id Webservers Database 100-199 Database 200-199
    42. 42. Query Goes to OnePartition MySQL 0-99 MySQL Comments on post 100 100-199 MySQL 200-299
    43. 43. Many ConcurrentQueries MySQL Comments on post 52 0-99 MySQL Comments on post 100 100-199 Comments MySQL on post 289 200-299
    44. 44. User Comments PageFetch all of a users comments:SELECT * FROM comments WHERE user = sam
    45. 45. Query Goes to AllPartitions MySQL 0-99Query goes to all servers Sam’s MySQL comments 100-199 MySQL 200-299
    46. 46. Costs for Query Go UpWhen Adding a New MySQLServer 0-99 MySQL 100-199 Sam’s comments MySQL 200-299 MySQL 300-399
    47. 47. CPU Cost on Server of Retrieving One RowTwo costs: Query overhead and row retrieval Query 97% Overhead MySQL
    48. 48. Idea: Multiple PartitioningsPartition comments table on post_id and user.Reads can choose appropriate copy so queries go toonly one server.Writes go to all copies.
    49. 49. Multiple Partitionings Database comments table 0-99 A-J post_id user Database 100-199 K-R Database 200-199 S-Z
    50. 50. All Read Queries Go To OnlyOne Partition MySQL 0-99 Comments on post 100 MySQL 100-199 Sam’s comments MySQL S-Z
    51. 51. Writes Go to Both TableCopies MySQL Write a new 0-99 comment by Vicki on post 100 MySQL 100-199 MySQL S-Z
    52. 52. Scaling• Reads scale by N – the number of servers• Writes are slowed down by T – the number of table copies• Big win if N >> T• Table copies use more memory – Often don’t have to copy large tables – usually metadata• How to execute these queries?
    53. 53. Dixie• Is a query planner which executes SQL queries over a partitioned database• Optimizes read queries to minimize query overhead and data retrieval costs• Uses a novel cost formula which takes advantage of data partitioned in different ways
    54. 54. Wikipedia Workload with Dixie 25000 Each Wikipedia One copy of page table query going to 1 20000 • DatabaseDixie + page.title copy dump from 2008 server • Real HTTP traces 3.2 X 15000 • Sharded across 1, 2, 5, and 10 serversQPS 10000 • Compare by adding a copy of the page Many queries table (< 100 MB), sharded another way going to all 5000 servers 0 1 2 5 10 Number of Partitions
    55. 55. Myths of Sharded Datastores• NoSQL scales better than a relational database• You can’t do a JOIN on a sharded datastore
    56. 56. MYTH: You can’t execute a JOIN on a sharded database• What are the JOIN keys? What are your partition keys?• Bad to move lots of data• Not bad to lookup a few pieces of data• Index lookup JOINs expressed as primary key lookups are fast
    57. 57. Join QueryFetch all of Alices comments on Maxs posts. comments table posts table post_id user text id author link 3 Alice First post! 1 Max http://… 6 Alice Like. 3 Max www. 7 Alice You think? 22 Max Ask Hacker 22 Alice Nice work! 37 Max http://…
    58. 58. Join QueryFetch all of Alices comments on Maxs posts. comments table posts table post_id user text id author link 3 Alice First post! 1 Max http://… 6 Alice Like. 3 Max www. 7 Alice You think? 22 Max Ask Hacker 22 Alice Nice work! 37 Max http://…
    59. 59. Distributed Query Planning• R*• Shore• The State of the Art in Distributed Query Processing, by Donald Kossman• Gizzard
    60. 60. Conventional WisdomPartition on JOIN keys, Send query 0-99 0-99as is to each server. SELECT comments.text FROM posts, comments WHERE comments.post_id = posts.id 100- 100- AND posts.author = Max 199 199 AND comments.user = Alice 200- 200- 299 299
    61. 61. Conventional Plan Scaling Alice’s comments on Max’s posts MySQL MySQL MySQL …
    62. 62. Index Lookup Join 0-99Partition on filter keys, retrieve Maxs A-I 0-99posts, then retrieve Alices comments on A-IMaxs posts. 100- SELECT posts.id 199 J-R 100- FROM posts 199 J-R WHERE posts.author = Max’ 200- 299 S-Z 200- 299 S-Z
    63. 63. Index Lookup Join 0-99Partition on filter keys, retrieve Maxs A-I 0-99 A-Iposts, then retrieve Alices comments onMaxs posts. 100- ids of Max’s 199 J-R 100- posts 199 J-R 200- 299 S-Z 200- 299 S-Z
    64. 64. Index Lookup Join 0-99Partition on filter keys, retrieve Maxs A-I 0-99posts, then retrieve Alices comments on A-IMaxs posts. SELECT comments.text 100- FROM comments 199 J-R 100- WHERE comments.post_id IN […] 199 J-R AND comments.user = Alice 200- 299 S-Z 200- 299 S-Z
    65. 65. Index Lookup Join 0-99Partition on filter keys, retrieve Maxs A-I 0-99posts, then retrieve Alices comments on A-IMaxs posts. 100- Alice’s comments on 199 J-R 100- 199 J-R Max’s posts 200- 299 S-Z 200- 299 S-Z
    66. 66. Index Lookup Join Plan Scaling Alice’s comments on Max’s posts MySQL MySQL MySQL …
    67. 67. Comparison Conventional Plan Index Lookup Plan Intermediate data, Max’s posts Alice DIDN’T comment on
    68. 68. Comparison Conventional Plan Index Lookup Plan Intermediate data, Max’s posts Alice DIDN’T comment on
    69. 69. Dixie• Cost model and query optimizer which predicts the costs of the two plans• Query executor which executes the original query, designed for a single database, on a partitioned database with table copies
    70. 70. Dixies Predictions for the Two Plans Index Lookup Plan, each 10000 query going to two servers 2-Step Join Estimate 8000 • Varying the size of the intermediate dataQueries/Sec 6000 • Max’s posts, Alice didn’t comment on Conventional Pushdown Join Estimate 4000 • Fixing the number of results returned Plan, each query going • Partitioned over 10 servers to all servers 2000 0 0 50 100 150 Posts/Author
    71. 71. Performance of the Two Plans 10000 2-Step Join 8000 Pushdown Join 2-Step JoinQueries/Sec 6000 Estimate Pushdown Join 4000 Estimate 2000 Blue beats yellow after this. 0 Dixie predicts it! 0 50 100 150 Posts/Author
    72. 72. JOINs Are Not Evil• When only transferring small amounts data• And with carefully partitioned data
    73. 73. Lessons Learned• Don’t worry at the beginning: Use what you know• Not about SQL vs. NoSQL systems – you can scale any system• More about complex vs. simple queries• When you do scale, try to make every query use one (or a few) shards
    74. 74. Research: Caching • Applications use a cache to store results computed from a webserver or database rows • Expiration or invalidation? • Annoying and difficult for app developers to invalidate cache items correctlyWork in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert Morris Harvard and MIT
    75. 75. Solution: Dependency Tracking Cache• Application indicates what data went into a cached result• The system will take care of invalidations• Don’t need to recalculate expired data if it has not changed• Always get the freshest data when data has changed• Track ranges, even if empty!
    76. 76. Example Application: Twitter• Cache a user’s view of their twitter homepage (timeline)• Only invalidate when a followed user tweets• Even if many followed users tweet, only recalculate the timeline once when read• No need to recalculate on read if no one has tweeted
    77. 77. Challenges• Invalidation + expiration times – Reading stale data• Distributed caching• Cache coherence on the original database rows
    78. 78. Benefits• Application gets all the benefits of caching – Saves on computation – Less traffic to the backend database• Doesn’t have to worry about freshness
    79. 79. Summary• Choosing a datastore• Dixie• Dependency tracking cache
    80. 80. Thanks! narula@mit.eduhttp://nehanaru.la @neha