NoSQL at Twitter (Devoxx 2010)

Notes
  • Small talk -- how many of you use Twitter? How many tweeted today? How many checked it today? Feel free to tweet during the talk; I won’t get offended.

    Who am I? I went to a couple of universities, worked in a few places, and now I work on the analytics infrastructure at Twitter.
  • The NoSQL term is a bad one because it defines something by what it is not, conflating a number of different technologies.
    I will be talking about scaling problems and big-data problems.

  • Will check if there is time left over.





  • Events that happen during the same millisecond are ordered semi-arbitrarily (it depends on which DC/worker they hit). We are OK with that.

    DC and worker IDs come from config plus a ZooKeeper sanity check.


  • VoltDB independently came up with basically the same approach;
    it’s amusing to look through their code and find the same solutions to the same weird corner cases.









  • This is a slow query, even if you have indexes.
    We’ll talk about the indexes in a sec, but first let’s consider whether it even makes sense to run this query.





  • Knowing we are dealing with a list saves a lot of client-side code, and merging lists in the store allows consistency control.
































  • Logs are immutable; HDFS is great for them. Tables have mutable data.
    Ignore updates? Bad data. Pull updates and resolve at read time? Pain, time.
    Pull updates and resolve in batches? Pain, time. Let someone else do the resolving? Helloooo, HBase!
    Bonus: lookups, projection push-downs.










Transcript

  • 1. Hadoop and NoSQL at Twitter. Questions? bit.ly/devoxx-twitter. Dmitriy Ryaboy, Twitter Inc, @squarecog. November 18, 2010
  • 2. Data Management at Twitter-Scale Hadoop and NoSQL at Twitter Dmitriy Ryaboy, Twitter Inc @squarecog November 18, 2010
  • 3. Post Your Questions here: http://bit.ly/devoxx-twitter
  • 4. Lots of “nosql”-ish systems • I will go over many systems – Snowflake, Haplocheirus, FlockDB, Gizzard – Cassandra, Rainbird, Cuckoo – Hadoop, HBase, Pig, Elephant-Bird • You don’t need most of them. 4
  • 5. Main Take-aways • Lots of different scale problems • General Principles can be applied to solving yours. • There’s a good chance something already solves your problem. Use and improve existing tools. 5
  • 6. Snowflake • Scalable, Fast, Distributed UUID generator • http://github.com/twitter/snowflake 6
  • 7. A Tweet. • 95 Million tweets per day • Highs of roughly 3,000 Tweets Per Second – Yes, we really do have a TPS report. 7
  • 8. Creating a Tweet • Insert <user_id, tweet_id, timestamp, tweet> at 3K TPS highs, and growing. • Single master with many read slaves? – single point of failure – write speed bottleneck – does not play well with multiple datacenters • Partition by user_id? – Need globally unique tweet_id 8
  • 9. Snowflake Design. ID layout: 42-bit timestamp (ms precision, counted from the Twepoch) | 5-bit DC id | 5-bit worker id | 12-bit sequence • Time-dominant, so K-sorted. • Twepoch starts on Nov. 04, 2010 • Almost no coordination necessary. • BTW, JavaScript only has 53-bit numbers. 9
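The layout above is easiest to see as bit arithmetic. The sketch below is only an illustration of the packing, not Twitter's Snowflake code; the shift widths follow the slide, and the Twepoch constant should be treated as illustrative.

    // Illustration of the Snowflake ID layout described on this slide:
    // 42-bit timestamp since the Twepoch, 5-bit DC id, 5-bit worker id,
    // 12-bit per-millisecond sequence. Not the actual Twitter implementation.
    object SnowflakeLayoutSketch {
      val Twepoch      = 1288834974657L // ~Nov 04 2010 in epoch millis (illustrative constant)
      val SequenceBits = 12
      val WorkerBits   = 5
      val DcBits       = 5

      val WorkerShift = SequenceBits                       // 12
      val DcShift     = SequenceBits + WorkerBits          // 17
      val TimeShift   = SequenceBits + WorkerBits + DcBits // 22

      /** Pack the four fields into one Long. Because the timestamp occupies the
        * high bits, IDs sort roughly by creation time (K-sorted). */
      def makeId(timestampMs: Long, dcId: Long, workerId: Long, sequence: Long): Long =
        ((timestampMs - Twepoch) << TimeShift) | (dcId << DcShift) | (workerId << WorkerShift) | sequence

      def main(args: Array[String]): Unit =
        println(makeId(System.currentTimeMillis(), dcId = 1L, workerId = 7L, sequence = 0L))
    }

Within one millisecond the sequence field disambiguates IDs from the same worker, and ordering across workers inside that millisecond is arbitrary, which is the trade-off the notes above call acceptable.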
  • 10. Bottom Line: Use It! • Common problem • Generally applicable solution – Factored out and abstracted for your forking pleasure • Thrift interface -- use with any language 10
  • 11. Gizzard • A framework for sharding • http://github.com/twitter/gizzard 11
  • 12. Scalability, Reliability • Sharding – Spread keyspace across many nodes – Scale reads and writes • Replication – Keep multiple copies of same data – Scale reads, survive failures • Very common pattern. • Why write from scratch every time? 12
  • 13. Gizzard: Top Level • Messages mapped to Shards • Shards mapped to replication trees • Shards are abstract – MySQL Shards – Lucene Shards – Redis Shards – Logical Shards (Shard shards?) • Used by FlockDB, Haplo, others. 13
  • 14. Gizzard: Middleware. [Diagram: several stateless Web App nodes each talk to a Gizzard instance, and every Gizzard instance routes Partition 1 to its replicas, Copy 1 and Copy 2.] • Stateless • Add more nodes as needed 14
  • 15. Gizzard: Partitioning • Define a function F(key) • Map ranges of co-domain of F to shards • Ranges do not have to be equal 15
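A hedged sketch of those three bullets, with a made-up hash standing in for F(key) and hand-picked, deliberately unequal range boundaries; Gizzard's real forwarding tables and API are richer than this.

    // "Define F(key), map ranges of its co-domain to shards, ranges need not be equal."
    // The hash, boundaries, and shard names are illustrative only.
    object ForwardingSketch {
      def f(key: Long): Long =
        (key * 2654435761L) & Long.MaxValue // cheap mixing hash, kept non-negative

      // lower bound of the range (inclusive) -> shard id; note the unequal ranges
      val ranges: Seq[(Long, String)] = Seq(
        0L                -> "shard-a",
        Long.MaxValue / 4 -> "shard-b",
        Long.MaxValue / 2 -> "shard-c"
      )

      def shardFor(key: Long): String = {
        val h = f(key)
        ranges.takeWhile(_._1 <= h).last._2
      }

      def main(args: Array[String]): Unit =
        Seq(12L, 20L, 34L).foreach(k => println(s"user $k -> ${shardFor(k)}"))
    }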
  • 16. Gizzard: Replication • Shards can be: – Physical (actual datastore) – Logical (tree of shards) – Edges are replication policies. • e.g., Replicating, Write-Only, Read-Only 16
  • 17. Gizzard: Fault Tolerance • Partition, Replicate • Failing writes are re-enqueued – Writes must be commutative – Writes must be idempotent • Be tolerant of eventual consistency – You should be OK with stale reads • CALM: Consistency As Logical Monotonicity http://db.cs.berkeley.edu/jmh/calm-cidr-short.pdf 17
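Why the commutative/idempotent requirement matters is easier to see with a toy model. The sketch below is not Gizzard code: it stamps each edge write with a time and applies it last-writer-wins, so a write that gets re-enqueued and replayed, or applied out of order, still converges to the same state (equal timestamps would need an explicit tiebreaker).

    // Toy model of retry-safe writes: apply each timestamped write with
    // last-writer-wins, so duplicates are no-ops and order does not matter.
    case class EdgeWrite(src: Long, dst: Long, state: String, at: Long)

    object RetrySafeWrites {
      type Edges = Map[(Long, Long), EdgeWrite]

      def apply1(edges: Edges, w: EdgeWrite): Edges =
        edges.get((w.src, w.dst)) match {
          case Some(existing) if existing.at >= w.at => edges // stale or duplicate: no-op (idempotent)
          case _ => edges + ((w.src, w.dst) -> w)             // newest write wins (order-insensitive)
        }

      def main(args: Array[String]): Unit = {
        val writes = Seq(EdgeWrite(20, 12, "follow", 100), EdgeWrite(20, 12, "unfollow", 200))
        val once      = writes.foldLeft(Map.empty: Edges)(apply1)
        val reordered = (writes.reverse ++ writes).foldLeft(Map.empty: Edges)(apply1)
        println(once == reordered) // true: replayed and reordered writes converge
      }
    }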
  • 18. Bottom Line: Use it! • In production at Twitter. • Lots of hardening over time. • There are lots of gotchas. • Concentrate on your app, not on this. 18
  • 19. Haplocheirus • Message vector cache • http://github.com/twitter/haplocheirus 19
  • 20. A Timeline. SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20 • Billions of total tweets. • Billions of edges in graph. • Yeah, Right. 20
  • 21. Some numbers (date | average tps | peak tps | fanout ratio | peak deliveries/sec) – 2008-10-07 | 30 | 120 | 175:1 | 21,000 – 2010-04-15 | 700 | 2,000 | 600:1 | 1,200,000 21
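The deliveries column is just peak TPS multiplied by the fanout ratio: 120 × 175 = 21,000 peak deliveries per second in late 2008, and 2,000 × 600 = 1,200,000 by April 2010, which is the number the next slide dwells on.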
  • 22. 1,200,000 • 1.2M peak deliveries per second • 38B deliveries per day • ... 6 months ago • ... when we were doing 55 million tweets per day and 2K max TPS.
  • 23. Push vs Pull • Assemble on read? – Many more reads than writes. – Assembling timeline is expensive. – Try not to do this • Assemble on write? – High storage (memory) costs – Can make tweeting slow for popular users • fix that by doing async writes • Keep it simple, use an LRU 23
  • 24. Sizing your cache Highly Scientific Diagram • Don’t forget growth projections 24
  • 25. Timeline cache: Haplo(cheirus) • Current: Memcache. – Memcache stores binary blobs – Serialize/Deserialize lists of ints • Future: Haplo, Redis-based timeline store – Data-type aware – Methods for working with lists of ints, instead of binary blobs. – We added a bunch more methods (Redis 2.2) – Partitioning / Replication via Gizzard 25
  • 26. Bottom Line: Precompute (wisely) • An option when the query space is very limited • Eventual consistency helps scale giant batch writes. – Make sure it is eventually consistent! • Efficiency can be modeled as P(cached)*C(cached) + P(!cached)*C(!cached) • Content-aware stores are helpful. 26
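As a worked illustration of that model (the numbers are hypothetical, not from the talk): with a 99% hit rate, a 1 ms cached read, and a 50 ms read-time timeline assembly, the expected read cost is 0.99 × 1 ms + 0.01 × 50 ms ≈ 1.5 ms; at a 90% hit rate it rises to 0.9 × 1 ms + 0.1 × 50 ms = 5.9 ms, which is why cache sizing and the growth projections from the previous slide matter.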
  • 27. FlockDB • Social graph store • aka customized distributed index • http://github.com/twitter/flockdb • slides shamelessly stolen from @nk. 27
  • 28. 28
  • 29. Temporal enumeration 28
  • 30. Inclusion, Temporal enumeration 28
  • 31. Inclusion, Temporal enumeration, Cardinality 28
  • 32. 29
  • 33. Intersection: Deliver to people who follow both @aplusk and @foursquare 29
  • 34. Original Implementation. A single edges table (source_id, destination_id) with rows such as (20, 12), (29, 12), (34, 16). • Single table, vertically scaled • Master-Slave replication 30
  • 35. [Build: the same table, with one index highlighted] • Single table, vertically scaled • Master-Slave replication 30
  • 36. [Build: the same table, with both indexes highlighted] • Single table, vertically scaled • Master-Slave replication 30
  • 37. Problems with solution • Poor write throughput • Poor reads once indexes not in RAM – ... and indexes are really big – ... so that happens. 31
  • 38. Current solution. Forward table (source_id, destination_id, updated_at, x): (20, 12, 20:50:14, x), (20, 13, 20:51:32), (20, 16). Backward table (destination_id, source_id, updated_at, x): (12, 20, 20:50:14, x), (12, 32, 20:51:32), (12, 16). • Partitioned by user id • Edges stored in “forward” and “backward” directions • Indexed by time • Indexed by element (for set algebra) • Denormalized cardinality 32
  • 39. [Build: the same tables, with the “Partitioned by user” callout] 32
  • 40. [Build: the same tables, with the “Edges stored in both directions” callout] 32
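A hedged sketch of that forward/backward layout, using plain Scala collections in place of FlockDB's partitioned MySQL shards. The point is that both "who does X follow" and "who follows X" become lookups keyed by a single user id, counts can be kept denormalized, and set algebra such as slide 33's intersection falls out of the backward index.

    // Both directions of each edge are indexed, newest first. Illustrative only.
    import scala.collection.immutable.SortedSet

    object EdgeIndexSketch {
      case class Edge(otherId: Long, updatedAt: Long)
      implicit val newestFirst: Ordering[Edge] =
        Ordering.by((e: Edge) => (-e.updatedAt, e.otherId))

      var forward  = Map.empty[Long, SortedSet[Edge]] // source_id -> who they follow
      var backward = Map.empty[Long, SortedSet[Edge]] // destination_id -> their followers

      def add(src: Long, dst: Long, at: Long): Unit = {
        forward  = forward.updated(src, forward.getOrElse(src, SortedSet.empty[Edge]) + Edge(dst, at))
        backward = backward.updated(dst, backward.getOrElse(dst, SortedSet.empty[Edge]) + Edge(src, at))
      }

      def followerIds(dst: Long): Set[Long] =
        backward.getOrElse(dst, SortedSet.empty[Edge]).map(_.otherId).toSet
      def followerCount(dst: Long): Int = followerIds(dst).size // denormalized in the real store

      // e.g. "deliver to people who follow both @aplusk and @foursquare"
      def followsBoth(a: Long, b: Long): Set[Long] = followerIds(a) intersect followerIds(b)
    }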
  • 41. Bottom Line: Stay CALM • Partition, Replicate, Index • Same applies to indexes • Be tolerant of eventual consistency • Use Gizzard 33
  • 42. Cassandra • I told you so. • http://cassandra.apache.org 34
  • 43. Current Uses • Results of large-scale data mining – P(is_a_bot), P(has_green_hair), P(will_buy_porsche) • Geo database – Nearby search, Global search, Reverse geocode • Realtime analytics – “Once I've started counting it's really hard to stop / Faster, faster. It is so exciting! / I could count forever, count until I drop” (Count von Count, Muppets) 35
  • 44. Geo: Place Database • Place DB: one place, many sources, many edits • Cassandra used because of free replication, sharding 36
  • 45. Geo: Search Indices • Lucene to serve search by place attribute • Place edit history kept in Cassandra • Want: reindex all edits since last snapshot • Key by edit time? – Bad locality for “changes since” queries • Use Order Preserving Partitioner? – Data skew issues, hot spots. 37
  • 46. Geo: changelog data model • Better: choose a random row, place into edit_time column – Columns in a row are ordered! • Indexers perform a multiget_slice(time1, time2) 38
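A conceptual stand-in for that changelog layout, with plain Scala collections in place of Cassandra rows and columns (this is not the Thrift client API): edits land in one of a fixed set of rows chosen pseudo-randomly, each row keeps its columns ordered by edit time, and "everything edited between time1 and time2" is a slice over each row, which is what the indexers' multiget_slice achieves.

    // Rows hold columns keyed by edit time; columns within a row stay sorted,
    // so re-indexing "edits since the last snapshot" is a per-row time slice.
    import scala.collection.immutable.TreeMap

    object ChangelogSketch {
      type EditTimeMs = Long
      type Row = TreeMap[EditTimeMs, String] // column name = edit time, value = serialized edit

      val rowCount = 16 // spreading edits over several rows avoids the hot-spot problem above

      def rowFor(placeId: Long, editTime: EditTimeMs): Int =
        (((placeId ^ editTime) % rowCount).toInt + rowCount) % rowCount // pseudo-random row choice

      /** All edits with time1 <= t < time2, gathered from every row. */
      def editsBetween(rows: Map[Int, Row], time1: EditTimeMs, time2: EditTimeMs): Seq[String] =
        rows.values.flatMap(_.range(time1, time2).values).toSeq
    }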
  • 47. Rainbird: Time-series Analytics • Uses distributed counters in Cassandra – Targeted for release in 0.7.1 • Listen to event stream • Aggregate by time granularities and pre- determined hierarchies – e.g., “5 minutes, DC1.rack12.node15.http_requests” • Buffer and Flush to Cassandra 39
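A rough sketch of that fan-out, using the slide's own example key; the granularities and the code are illustrative, not Rainbird's internals. A single observed event increments one counter per (time bucket, hierarchy prefix) pair, which is what makes a query like "5 minutes, DC1.rack12.node15.http_requests" a single read later.

    // One event -> many counter increments: every time granularity crossed with
    // every prefix of the dotted hierarchy. Buffering and the Cassandra write
    // are out of scope here.
    object CounterFanoutSketch {
      val granularitiesMs = Seq(60000L, 300000L, 3600000L, 86400000L) // 1 min, 5 min, 1 h, 1 day

      def prefixes(key: String): Seq[String] = {
        val parts = key.split('.')
        (1 to parts.length).map(i => parts.take(i).mkString("."))
      }

      /** Every (time bucket, key) counter a single event contributes to. */
      def countersFor(key: String, eventTimeMs: Long): Seq[(Long, String)] =
        for {
          g      <- granularitiesMs
          bucket  = eventTimeMs - (eventTimeMs % g)
          prefix <- prefixes(key)
        } yield (bucket, prefix)

      def main(args: Array[String]): Unit =
        countersFor("DC1.rack12.node15.http_requests", System.currentTimeMillis()).foreach(println)
    }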
  • 48. Cuckoo: cluster monitoring • We use Ganglia extensively • Downsampling a problem • Rainbird gives us scalable storage for timeseries • Cuckoo: service on top of Rainbird for cluster metrics – sort of like RRDTool 40
  • 49. Bottom Line: Use existing tools • Powerful data model • Automatic partitioning • Automatic replication • Tunable consistency levels • (soon) Distributed counters • Exposes metrics via JMX • What’s not to like? 41
  • 50. Hadoop • and friends • http://hadoop.apache.org 42
  • 51. Daily Workload 1000s of Front End machines Billions of API requests 12 TB of ingested data 95 Million tweets 43
  • 52. Mo’ Data, Mo’ Problems • Option 1: Specialist OLAP database – Pro: • Heavily optimized for aggregation queries • SQL is awesome for counting stuff! – Cons: • Relational model not always best fit • We have a ridonculous amount of data • Can they even handle 500 Petabytes? • Some analysis does not translate to SQL – although I’ve seen PageRank in SQL. – but that way lies madness. 44
  • 53. Mo’ Data, Mo’ Problems • Option 2: Roll Hadoop – Pro: • Scales to infinity and beyond • Flexible data formats • Very complex workflows possible – Cons: • Learning curve for analysts • Slower, less efficient than dedicated solutions 45
  • 54. Mo’ Data, Mo’ Problems • Option 3: Use Both! – Use Vertica for table aggregations – Use Hadoop for • log parsing • extra-large aggregations • complex analysis • all offline, large-scale processing 46
  • 55. Architecture (Simplified) 47
  • 56. Bottom Line: Right Tool for the Right Job • Special-purpose tools are very powerful • Don’t force them to do things they are not meant for • Weigh the benefits of using 2 systems vs. the cost of maintaining 2 systems 48
  • 57. Elephant-Bird • Library for working with data in Hadoop • http://github.com/kevinweil/elephant-bird 49
  • 58. Data Formats matter • This is insane. 50
  • 59. Use a serialization framework • Thrift, Avro, Protocol Buffers • Compact description of your data • Backwards compatible as schema evolves • Codegen for (most) languages 51
  • 60. Codegen++ • Elephant-Bird support for Protocol Buffers – Hadoop Input/Output Formats – Hadoop Writables – Pig Load / Store Funcs – Pig deserialization UDFs – Hive SerDes • Working on doing same for Thrift • Working on Pig 0.8 support 52
  • 61. 53
  • 62. Lazy Deserialization FTW • At first: converted Protocol Buffers into Pig tuples at read time. • Now: a Tuple wrapper that deserializes fields upon request. • Huge performance boost for wide tables when only a few columns are used. 54
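The wrapper idea reads like this in miniature (a sketch, not elephant-bird's actual code): hold on to the raw bytes and decode an individual field only the first time something asks for it, so a script that touches two columns of a fifty-column record pays for two decodes.

    // Decode a field lazily on first access and cache it; untouched fields of a
    // wide record are never deserialized. The decodeField function stands in
    // for the real protobuf/Thrift field decoding.
    class LazyRecord(rawBytes: Array[Byte], decodeField: (Array[Byte], Int) => Any) {
      private val cache = scala.collection.mutable.Map.empty[Int, Any]

      def get(i: Int): Any = cache.getOrElseUpdate(i, decodeField(rawBytes, i))
    }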
  • 63. HBase • Mutability. • Random lookups. • Decent scan performance. • http://hbase.apache.org 55
  • 64. Pig • Data Flow language for Big Data • http://pig.apache.org 56
  • 65. Pig Latin example 57
  • 66. Why Pig? • Familiar Data Processing Primitives – Filter, Group, Join, Order... • Complex dataflows better expressed imperatively – SQL great for basic summarization. • Optimized Join strategies • Very powerful UDFs 58
  • 67. New in Pig 0.8 • UDFs in other languages • Enhanced HBaseStorage • Detailed Job counters • Arbitrary MapReduce jobs as part of flow • PigUnit simplifies writing unit tests • ... and much more. 59
  • 68. Why Not Pig? • Simple queries much more natural in SQL – People already know SQL • No Data dictionary – Have to learn where data is and what loaders to use • Very excited about Howl – Abstraction to seamlessly work with Pig and Hive – http://www.github.com/yahoo/howl 60
  • 69. Conclusion • Deep Thoughts. • www.deepthoughtsbyjackhandey.com/ 61
  • 70. Main Points: Online • Precompute results if query space is limited. – Model materialization cost vs read-time computation • Provide narrow query interfaces. Optimize them. • Staying CALM for eventual consistency • Sharding and replicating is a pattern. – Use a framework. • Use existing tools. Open-source rocks. 62
  • 71. Main Points: Offline • Hadoop is great when you need: – Multiple TBs of data. – Flexible offline analysis. • Choose your serialization format wisely. • HBase is the peas to Hadoop’s carrots 63
  • 72. Big Thanks. Nick Kallen, @nk Ryan King, @rk Kevin Weil, @kevinweil Marius Eriksen, @marius 64
  • 73. Be excellent to each other. SELECT * FROM questions ORDER BY rand()