NoSQL at Twitter (Devoxx 2010)

 

Comments

  • I watched this one @ Devoxx and it was certainly one of the best presentations out there, I think.
  • Good.
  • Any chance of a PDF upload for offline reading? No Mac, no Keynote.
  • I was in the presentation at Devoxx. Great content.
  • Awesome presentation on Twitter infrastructure and scalability.
Speaker Notes

  • Small talk -- how many use Twitter? How many tweeted today? How many checked today? Feel free to tweet during the talk, I won’t get offended. Who am I? I went to a couple of universities, worked in a few places; now I work on the analytics infrastructure at Twitter.
  • The NoSQL term is bad because it defines something by what it is not, conflating a number of different technologies. I will be talking about scaling problems and big-data problems.
  • Will check if there is time left over.
  • Events that happen during the same millisecond are ordered semi-arbitrarily (depends on which DC/worker they hit). We are ok with that. DC and worker ids come from config plus a ZooKeeper sanity check.
  • VoltDB independently came up with basically the same approach; it’s amusing to look through their code and find the same solutions to the same weird corner cases.
  • This is a slow query, even if you have indexes. We’ll talk about the indexes in a sec, but first let’s consider whether it even makes sense to run this query.
  • Knowing we are dealing with a list saves a lot of client-side code; merging lists in the store allows consistency control.
  • Logs are immutable; HDFS is great. Tables have mutable data. Ignore updates? Bad data. Pull updates, resolve at read time? Pain, time. Pull updates, resolve in batches? Pain, time. Let someone else do the resolving? Helloooo, HBase! Bonus: lookups, projection push-downs.

NoSQL at Twitter (Devoxx 2010): Presentation Transcript

  • Data Management at Twitter-Scale Hadoop and NoSQL at Twitter Dmitriy Ryaboy, Twitter Inc @squarecog November 18, 2010
  • Post Your Questions here: http://bit.ly/devoxx-twitter
  • Lots of “nosql”-ish systems • I will go over many systems – Snowflake, Haplocheirus, FlockDB, Gizzard – Cassandra, Rainbird, Cuckoo – Hadoop, HBase, Pig, Elephant-Bird • You don’t need most of them. 4
  • Main Take-aways • Lots of different scale problems • General Principles can be applied to solving yours. • There’s a good chance something already solves your problem. Use and improve existing tools. 5
  • Snowflake • Scalable, Fast, Distributed UUID generator • http://github.com/twitter/snowflake 6
  • A Tweet. • 95 Million tweets per day • Highs of roughly 3,000 Tweets Per Second – Yes, we really do have a TPS report. 7
  • Creating a Tweet • Insert <user_id, tweet_id, timestamp, tweet> at 3K TPS highs, and growing. • Single master with many read slaves? – single point of failure – write speed bottleneck – does not play well with multiple datacenters • Partition by user_id? – Need globally unique tweet_id 8
  • Snowflake Design: 42-bit timestamp (ms precision, counted from the Twepoch) | 5-bit DC id | 5-bit Worker id | 12-bit Sequence. • Time-dominant, so K-sorted. • Twepoch starts on Nov. 04 2010. • Almost no coordination necessary. • BTW, JavaScript only has 53-bit numbers. 9
  • Bottom Line: Use It! • Common problem • Generally applicable solution – Factored out and abstracted for your forking pleasure • Thrift interface -- use with any language 10
  • Gizzard • A framework for sharding • http://github.com/twitter/gizzard 11
  • Scalability, Reliability • Sharding – Spread keyspace across many nodes – Scale reads and writes • Replication – Keep multiple copies of same data – Scale reads, survive failures • Very common pattern. • Why write from scratch every time? 12
  • Gizzard: Top Level • Messages mapped to Shards • Shards mapped to replication trees • Shards are abstract – MySQL Shards – Lucene Shards – Redis Shards – Logical Shards (Shard shards?) • Used by FlockDB, Haplo, others. 13
  • Gizzard: Middleware. Diagram: several stateless Web App nodes each talk to a Gizzard node, which routes each message to Partition 1 and its replicas (Copy 1, Copy 2). • Stateless • Add more nodes as needed 14
  • Gizzard: Partitioning • Define a function F(key) • Map ranges of co-domain of F to shards • Ranges do not have to be equal 15
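A minimal sketch of this scheme, with an arbitrary hash standing in for F and an invented, deliberately unequal range table (none of these names come from Gizzard):

```python
import hashlib

def f(key):
    # F(key): any stable function works; md5 here is an arbitrary choice.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 1024

# Ranges of F's co-domain mapped to shards; the ranges need not be equal.
RANGES = [(0, 512, "shard_a"), (512, 896, "shard_b"), (896, 1024, "shard_c")]

def shard_for(key):
    v = f(key)
    for lo, hi, shard in RANGES:
        if lo <= v < hi:
            return shard
    raise ValueError("no shard covers F(key)=%d" % v)
```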
  • Gizzard: Replication • Shards can be: – Physical (actual datastore) – Logical (tree of shards) – Edges are replication policies. • e.g., Replicating, Write-Only, Read-Only 16
  • Gizzard: Fault Tolerance • Partition, Replicate • Failing writes are re-enqueued – Writes must be commutative – Writes must be idempotent • Be tolerant of eventual consistency – You should be ok with stale reads • CALM: Consistency As Logical Monotonicity http://db.cs.berkeley.edu/jmh/calm-cidr-short.pdf 17
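A toy illustration of why re-enqueued writes need those two properties: if a write is modeled as adding an edge to a set, replaying it (idempotence) or reordering it (commutativity) cannot change the final state, so retries are always safe.

```python
def apply_writes(writes):
    # Each write is "add edge (a, b)" applied to a set: replay-safe and
    # order-insensitive, so a failed write can simply be retried later.
    edges = set()
    for w in writes:
        edges.add(w)
    return edges
```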
  • Bottom Line: Use it! • In production at Twitter. • Lots of hardening over time. • There are lots of gotchas. • Concentrate on your app, not on this. 18
  • Haplocheirus • Message vector cache • http://github.com/twitter/haplocheirus 19
  • A Timeline. SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20 • Billions of total tweets. • Billions of edges in graph. • Yeah, Right. 20
  • Some numbers:

    date        average tps  peak tps  fanout ratio  deliveries tps
    2008-10-07           30       120         175:1          21,000
    2010-04-15          700     2,000         600:1       1,200,000

    21
  • 1,200,000 • 1.2M peak deliveries per second • 38B deliveries per day • ... 6 months ago • ... when we were doing 55 million tweets per day and 2K max TPS.
  • Push vs Pull • Assemble on read? – Many more reads than writes. – Assembling timeline is expensive. – Try not to do this • Assemble on write? – High storage (memory) costs – Can make tweeting slow for popular users • fix that by doing async writes • Keep it simple, use an LRU 23
  • Sizing your cache Highly Scientific Diagram • Don’t forget growth projections 24
  • Timeline cache: Haplo(cheirus) • Current: Memcache. – Memcache stores binary blobs – Serialize/Deserialize lists of ints • Future: Haplo, Redis-based timeline store – Data-type aware – Methods for working with lists of ints, instead of binary blobs. – We added a bunch more methods (Redis 2.2) – Partitioning / Replication via Gizzard 25
  • Bottom Line: Precompute (wisely) • An option when query space very limited • Eventual consistency helps scale giant batch writes. – Make sure it is eventually consistent! • Efficiency can be modeled - P(cached)*C(cached) + P(!cached)*C(!cached) • Content-aware stores are helpful. 26
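The efficiency model on this slide is a plain expected-value computation; a quick worked example with invented latencies (1 ms for a cache hit, 200 ms for a full timeline rebuild):

```python
def expected_cost(p_cached, c_cached, c_uncached):
    # P(cached)*C(cached) + P(!cached)*C(!cached), as on the slide.
    return p_cached * c_cached + (1 - p_cached) * c_uncached

# With these made-up numbers, a 99% hit rate gives ~2.99 ms per read,
# while 90% gives ~20.9 ms: a small drop in hit rate dominates the
# average, which is why the cache must be sized against growth projections.
```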
  • FlockDB • Social graph store • aka customized distributed index • http://github.com/twitter/flockdb • slides shamelessly stolen from @nk. 27
  • Temporal enumeration. Inclusion. Cardinality. 28
  • Intersection: Deliver to people who follow both @aplusk and @foursquare 29
  • Original Implementation: a single table of (source_id, destination_id) rows, e.g. (20, 12), (29, 12), (34, 16), with an index on each column. • Single table, vertically scaled • Master-Slave replication 30
  • Problems with solution • Poor write throughput • Poor reads once indexes not in RAM – ... and indexes are really big – ... so that happens. 31
  • Current solution: a Forward table (source_id, destination_id, updated_at) and a Backward table (destination_id, source_id, updated_at), so every edge is stored in both directions. • Partitioned by user id • Indexed by time • Indexed by element (for set algebra) • Denormalized cardinality 32
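A toy in-memory model of this schema (names invented, not FlockDB's API): every edge is written twice, once per direction, so both "who follows X" and "whom does X follow" are single-partition lookups, and set algebra like the earlier intersection example stays cheap.

```python
from collections import defaultdict

forward = defaultdict(dict)   # source_id -> {destination_id: updated_at}
backward = defaultdict(dict)  # destination_id -> {source_id: updated_at}

def add_edge(source_id, destination_id, updated_at):
    # Write the edge in both directions, as in the Forward/Backward tables.
    forward[source_id][destination_id] = updated_at
    backward[destination_id][source_id] = updated_at

def followers_of(user_id):
    return set(backward[user_id])

def intersect_followers(a, b):
    # e.g. deliver to people who follow both @aplusk and @foursquare
    return followers_of(a) & followers_of(b)
```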
  • Bottom Line: Stay CALM • Partition, Replicate, Index • Same applies to indexes • Be tolerant of eventual consistency • Use Gizzard 33
  • Cassandra • I told you so. • http://cassandra.apache.org 34
  • Current Uses • Results of large-scale data mining – P(is_a_bot), P(has_green_hair), P(will_buy_porsche) • Geo database – Nearby search, Global search, Reverse geocode • Realtime analytics – Once I've started counting it's really hard to stop – Faster, faster. It is so exciting! – I could count forever, count until I drop Count Von Count, Muppets. 35
  • Geo: Place Database • Place DB: one place, many sources, many edits • Cassandra used because of free replication, sharding 36
  • Geo: Search Indices • Lucene to serve search by place attribute • Place edit history kept in Cassandra • Want: reindex all edits since last snapshot • Key by edit time? – Bad locality for “changes since” queries • Use Order Preserving Partitioner? – Data skew issues, hot spots. 37
  • Geo: changelog data model • Better: choose a random row, place into edit_time column – Columns in a row are ordered! • Indexers perform a multiget_slice(time1, time2) 38
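A rough in-memory analogue of this data model (structure and names invented): writes land on a random row to avoid hot spots, but within each row entries stay sorted by edit time, so a "changes since" query is a cheap ordered slice of every row.

```python
import bisect
import random

ROWS = 16
changelog = [[] for _ in range(ROWS)]  # each row: sorted list of (edit_time, edit)

def record_edit(edit_time, edit):
    row = random.randrange(ROWS)            # random row spreads write load
    bisect.insort(changelog[row], (edit_time, edit))

def edits_between(t1, t2):
    # Analogue of multiget_slice(time1, time2): one ordered slice per row.
    out = []
    for row in changelog:
        lo = bisect.bisect_left(row, (t1,))
        hi = bisect.bisect_left(row, (t2,))
        out.extend(row[lo:hi])
    return out
```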
  • Rainbird: Time-series Analytics • Uses distributed counters in Cassandra – Targeted for release in 0.7.1 • Listen to event stream • Aggregate by time granularities and pre- determined hierarchies – e.g., “5 minutes, DC1.rack12.node15.http_requests” • Buffer and Flush to Cassandra 39
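The aggregation step might look roughly like this (key format and granularity choices invented): each event increments a counter for every prefix of its hierarchy at several time granularities, so coarse rollups are pre-answered at write time.

```python
from collections import Counter

counters = Counter()

def record(ts_minutes, path, count=1):
    parts = path.split(".")
    for depth in range(1, len(parts) + 1):
        prefix = ".".join(parts[:depth])        # DC1, DC1.rack12, ...
        for granularity in (1, 5, 60):          # 1-min, 5-min, hourly buckets
            bucket = ts_minutes - ts_minutes % granularity
            counters[(granularity, bucket, prefix)] += count
```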
  • Cuckoo: cluster monitoring • We use Ganglia extensively • Downsampling a problem • Rainbird gives us scalable storage for timeseries • Cuckoo: service on top of Rainbird for cluster metrics – sort of like RRDTool 40
  • Bottom Line: Use existing tools • Powerful data model • Automatic partitioning • Automatic replication • Tunable consistency levels • (soon) Distributed counters • Exposes metrics via JMX • What’s not to like? 41
  • Hadoop • and friends • http://hadoop.apache.org 42
  • Daily Workload 1000s of Front End machines Billions of API requests 12 TB of ingested data 95 Million tweets 43
  • Mo’ Data, Mo’ Problems • Option 1: Specialist OLAP database – Pro: • Heavily optimized for aggregation queries • SQL is awesome for counting stuff! – Cons: • Relational model not always best fit • We have a ridonculous amount of data • Can they even handle 500 Petabytes? • Some analysis does not translate to SQL – although I’ve seen PageRank in SQL. – but that way lies madness. 44
  • Mo’ Data, Mo’ Problems • Option 2: Roll Hadoop – Pro: • Scales to infinity and beyond • Flexible data formats • Very complex workflows possible – Cons: • Learning curve for analysts • Slower, less efficient than dedicated solutions 45
  • Mo’ Data, Mo’ Problems • Option 3: Use Both! – Use Vertica for table aggregations – Use Hadoop for • log parsing • extra-large aggregations • complex analysis • all offline, large-scale processing 46
  • Architecture (Simplified) 47
  • Bottom Line: Right tool for the Right Job • Special-purpose tools are very powerful • Don’t force them to do things they are not meant for • Weigh the benefits of using two systems vs. the cost of maintaining two systems 48
  • Elephant-Bird • Library for working with data in Hadoop • http://github.com/kevinweil/elephant-bird 49
  • Data Formats matter • This is insane. 50
  • Use a serialization framework • Thrift, Avro, Protocol Buffers • Compact description of your data • Backwards compatible as schema evolves • Codegen for (most) languages 51
  • Codegen++ • Elephant-Bird support for Protocol Buffers – Hadoop Input/Output Formats – Hadoop Writables – Pig Load / Store Funcs – Pig deserialization UDFs – Hive SerDes • Working on doing same for Thrift • Working on Pig 0.8 support 52
  • Lazy Deserialization FTW • At first: converted Protocol Buffers into Pig tuples at read time. • Now, a Tuple wrapper that deserializes fields upon request. • Huge performance boost for wide tables with only a few used columns. 54
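The idea can be sketched with any serialization format. The wrapper below uses JSON for brevity where the real code wrapped Protocol Buffers in Pig tuples, and it is a coarser variant of the slide's per-field laziness: it decodes the whole record, but only on first access (class name invented):

```python
import json

class LazyTuple:
    def __init__(self, raw_bytes):
        self._raw = raw_bytes
        self._decoded = None

    def get(self, field):
        if self._decoded is None:                 # pay the deserialization
            self._decoded = json.loads(self._raw) # cost once, only if asked
        return self._decoded[field]
```

Records that are filtered out or whose fields are never touched skip deserialization entirely, which is where the performance boost on wide tables comes from.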
  • HBase • Mutability. • Random lookups. • Decent scan performance. • http://hbase.apache.org 55
  • Pig • Data Flow language for Big Data • http://pig.apache.org 56
  • Pig Latin example 57
  • Why Pig? • Familiar Data Processing Primitives – Filter, Group, Join, Order... • Complex dataflows better expressed imperatively – SQL great for basic summarization. • Optimized Join strategies • Very powerful UDFs 58
  • New in Pig 0.8 • UDFs in other languages • Enhanced HBaseStorage • Detailed Job counters • Arbitrary MapReduce jobs as part of flow • PigUnit simplifies writing unit tests • ... and much more. 59
  • Why Not Pig? • Simple queries much more natural in SQL – People already know SQL • No Data dictionary – Have to learn where data is and what loaders to use • Very excited about Howl – Abstraction to seamlessly work with Pig and Hive – http://www.github.com/yahoo/howl 60
  • Conclusion • Deep Thoughts. • www.deepthoughtsbyjackhandey.com/ 61
  • Main Points: Online • Precompute results if query space is limited. – Model materialization cost vs read-time computation • Provide narrow query interfaces. Optimize them. • Staying CALM for eventual consistency • Sharding and replicating is a pattern. – Use a framework. • Use existing tools. Open-source rocks. 62
  • Main Points: Offline • Hadoop is great when you need: – Multiple TBs of data. – Flexible offline analysis. • Choose your serialization format wisely. • HBase is the peas to Hadoop’s carrots 63
  • Big Thanks. Nick Kallen, @nk Ryan King, @rk Kevin Weil, @kevinweil Marius Eriksen, @marius 64
  • Be excellent to each other. SELECT * FROM questions ORDER BY rand()