NoSQL Findings
                                 Christian van der Leeden




Thursday, September 23, 2010
Our problem
                    • Growth is not linear and not predictable

                          • e.g. History::Session table now > 30 Mio entries

                          • Activities > 26 Mio entries

                    • Postgres will be the performance bottleneck




Criteria
                    • Allow us to scale from 100k Daily Active Users (DAU)
                      to 1 Mio DAU and then up to 10 Mio DAU

                    • Scale horizontally (“Just add servers”)

                    • Good Ruby performance

                    • Good transition from Rails/Postgres -> Rails/NoSQL

                    • Actively developed




Goal
                    • Scores (@ 10 Mio Daily Active Users)

                          • 10 Mio Scores/day == 350 inserts/second
                            (a flat daily average would be ~116/second)

                          • around same read rate for Leaderboards

                    • Game with 10 Mio Players

                          • Leaderboard with 10 Mio entries

                    • Session (@ 10 Mio DAU)

                          • > 10 Mio session handshakes/day
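
As a rough check of the rates on this slide, the arithmetic can be sketched as follows. The 350 inserts/second figure is taken from the slide; reading it as headroom over a flat daily average is an assumption, since the slides do not say how it was derived.

```ruby
# Back-of-the-envelope rates for the goals on this slide.
SECONDS_PER_DAY = 24 * 60 * 60 # 86_400

scores_per_day = 10_000_000    # "10 Mio Scores/day" from the slide
target_rate    = 350           # inserts/second, the figure quoted on the slide

# Average rate if the inserts were spread evenly over the whole day.
average_rate = scores_per_day / SECONDS_PER_DAY.to_f

# How much headroom the quoted target leaves over that flat average.
headroom = target_rate / average_rate

puts format("flat average: %.1f inserts/s", average_rate)
puts format("target is %.1fx the flat average", headroom)
```

Real traffic is rarely flat over 24 hours, so sizing for a multiple of the average is a common choice.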



Data Patterns
                    • Most data access is time-based (the most recent
                      data is accessed most often)

                    • Write and read rates are about the same

                    • Eventual consistency is good enough most of the
                      time




Rating criteria
                    •     Type (Document Store, Key/Value Store, Big Table)

                    •     Deployment

                          •    How easy is it to scale?

                    •     Existing installations

                          •    How big are known installations?

                    •     Heritage and activity

                          •    Where does the solution come from, who develops it,
                               and how actively?




Products evaluated
                    • MongoDB

                    • Redis

                    • Cassandra

                    • HBase

                    • Membase




MongoDB
                    • document store

                    • “SQL DB” without relations

                    • easy transition with MongoMapper, Mongoid

                    • supports sharding over replica sets (since August
                      2010)

                    • Haven’t found a large sharded installation




Experience with Mongo
          • nice/easy to program with

          • deployment woes we’ve encountered (1.6.0)

                • segmentation fault

                • cannot read because: invalid BSON object

                • performance degrades when the index is larger than RAM
                  (queries went from 20 ms to 200 ms)

                • Global write lock makes data migrations slow




Cassandra
                    •     Big Table data store

                    •     Was developed by Facebook and is actively maintained

                    •     Easy to add servers and to set up (peer-to-peer concept)

                    •     Thrift API to Ruby was slow in tests (Our tests: around 150 write
                          ops/second)

                    •     Avro API promises to be faster (will be an option in 0.7)

                    •     Used by Facebook

                    •     Not using it because it is too slow with Ruby




Redis
                    • Memcache with simple persistence

                    • Supports many different data types and atomic
                      operations on them

                    • Sharding is done client side (difficult to add new
                      servers)

                    • We’re using it for indexes on SQL data

                    • Very fast (Our tests: 4000 write operations/second)
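
Why client-side sharding makes adding servers difficult can be sketched as follows. This is an illustration only: the `shard_for` helper, the CRC32-modulo scheme, and the `score:N` key names are assumptions, not the setup used in the tests above.

```ruby
require "zlib"

# Hypothetical client-side sharding: the client picks a server by
# hashing the key and taking the remainder by the number of servers.
def shard_for(key, server_count)
  Zlib.crc32(key) % server_count
end

keys = (1..10_000).map { |i| "score:#{i}" }

# Count the keys that land on a different server once a 4th server
# is added to an existing pool of 3.
moved = keys.count { |key| shard_for(key, 3) != shard_for(key, 4) }

puts "#{moved} of #{keys.size} keys map to a different server"
```

With a uniform hash, roughly three quarters of the keys are expected to change servers in this 3-to-4 case, so existing data has to be migrated or looked up in two places. Consistent hashing is the usual way to reduce that movement.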



HBase
                    • Big Table Database

                    • Complex to set up and to maintain

                    • Very often used for analytics jobs with Hadoop/Hive,
                      e.g. as Amazon Elastic MapReduce

                    • For Analytics also look at Scribe for data collection




Membase
                    • Key-Value Store

                    • Distributed, persistent Memcache

                    • Easy to add nodes

                    • Used by Zynga




Example Leaderboards
                    • User has many scores

                    • Each score has one result (integer)

                    • Game has many scores

                    •      Query: the leaderboard for one game

                          • Insert one score into the leaderboard

                          • What is my rank?

                          • Give me 10 scores starting at position 100,000
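
The three operations can be pinned down with a naive in-memory model before comparing stores. This is a sketch for illustration; the `Leaderboard` class and its method names are hypothetical and not part of the original application.

```ruby
# Naive in-memory model of one game's leaderboard (illustration only).
Score = Struct.new(:id, :user_id, :result)

class Leaderboard
  def initialize
    @scores = []
  end

  # Insert one score into the leaderboard.
  def insert(score)
    @scores << score
    self
  end

  # Rank of a score: 1 + the number of strictly better results.
  def rank(score)
    @scores.count { |s| s.result > score.result } + 1
  end

  # `count` scores starting at 0-based `position`, best result first.
  def page(position, count)
    @scores.sort_by { |s| -s.result }.slice(position, count) || []
  end
end

board = Leaderboard.new
board.insert(Score.new(2563, 52345, 100))
board.insert(Score.new(96877, 2541, 99))
board.insert(Score.new(6752, 3652, 96))

puts board.rank(Score.new(nil, nil, 99))
puts board.page(1, 2).map(&:id).inspect
```

Both `rank` and `page` scan or sort the whole list here. At 10 Mio entries that work has to move into an index maintained by the data store, which is what the following slides compare.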



SQL vs NoSQL
                    SQL                        NoSQL

                    • Think about Data         • Think about Queries

                    • Redundancy is bad        • Redundancy is ok

                    • Indexes are managed by   • Roll your own indexes
                      the DB                     depending on queries

                    • Query over relations     • No Joins and connecting
                                                 entities

                    • Always exact results     • Query results don’t have to
                                                 return latest write
                                                 operation



SQL vs NoSQL
                    SQL                    NoSQL

                    • standardized query   • some solutions share
                      language and DDL       standards

                    • All DBs are “the     • Many different
                      same”                  approaches

                                             • Document store

                                             • Big Table

                                             • Key Value



Postgres
                                   User  1 ---- n  Score  n ---- 1  Game




                    •     Create new score:
                          score = Score.new(attributes)
                          score.save => insert into scores;

                    •     What is my rank?
                          select count(*) from scores inner join games on (games.id =
                          scores.game_id)
                          where result > #{my_score.result} and games.name = '#{game_name}'

                    •     Give me 10 scores in leaderboard from position 100000
                          select * from scores inner join games on (games.id = scores.game_id)
                          where games.name = '#{game_name}'
                          order by result desc
                          offset 100000 limit 10;




Redis
    SortedSet

    key: game_name
    score: result
    value: score_id

      key: "Jewels"
             100            99            96
           <2563>        <96877>        <6752>       ...
      key: "Bug Landing"
      key: "Toss It"
     ...

    KeyValue Store

    key: score_id
    value: marshalled score object

              2563: { result : 100, user_id : 52345, game_id: 57142 }
              96877: { result : 99, user_id : 2541, game_id: 57142 }
              6752: { result : 96, user_id : 3652, game_id: 57142 }

                    • New Score
                      redis.zadd("Jewels", result, score_id)

                    • My Rank? (looked up by member, 0-based)
                      redis.zrevrank("Jewels", score_id)

                    • 10 scores from position 100000 (start and stop index, inclusive)
                      redis.zrevrange("Jewels", 100000, 100009)
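
Two details of the sorted-set commands are easy to get wrong: ZREVRANK looks up a member (here the score_id), not a score value, and ZREVRANGE takes an inclusive start and stop index rather than an offset and a count. The sketch below illustrates both with a plain Hash standing in for the sorted set, so it runs without a Redis server. The `zrevrange_bounds` helper is hypothetical.

```ruby
# Convert "count items starting at offset" into the inclusive
# start/stop index pair that ZREVRANGE expects.
def zrevrange_bounds(offset, count)
  [offset, offset + count - 1]
end

# Stand-in for the "Jewels" sorted set: member (score_id) => score (result),
# mirroring the argument roles of ZADD key score member.
jewels = { "2563" => 100, "96877" => 99, "6752" => 96 }

# Members ordered from highest to lowest score, as ZREVRANGE returns them.
ordered = jewels.sort_by { |_member, score| -score }.map(&:first)

# ZREVRANK-style lookup: 0-based position of a member in that order.
rank_of_96877 = ordered.index("96877")

# ZREVRANGE-style page: 2 members starting at position 1.
start, stop = zrevrange_bounds(1, 2)
page = ordered[start..stop]

puts rank_of_96877
puts page.inspect
```

Against a real server the page would then be resolved to score objects with one lookup per returned score_id in the key-value store (or a single multi-key read).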




Mongo
                                 Collection

                                 key: Scores


                                       { _id: 2563, result : 100, user_id : 52345, game_id: 57142 }
                                       { _id: 96877, result : 99, user_id : 2541, game_id: 57142 }
                                        { _id: 6752, result : 96, user_id : 3652, game_id: 57142 }




                    •     New Score
                          Score.create!(attributes)
                          db.scores.insert( { result: 100, user_id: 52345, game_id: 57142 } )

                    •     What is my rank?
                          db.scores.count( { game_id: 57142, result: { $gt: #{my_score.result} } } )

                    •     10 scores from position 100000
                          db.scores.find( { game_id: 57142 } ).sort({ result: -1 }).skip(100000).limit(10)




Cassandra
    ColumnFamily: Leaderboards

    row_key: game_name

       row_key: "Jewels"
            100: 2563       99: 96877   96: 6752
       row_key: "Bug Landing"
       row_key: "Toss It"
      ...

    ColumnFamily: Scores

    row_key: score_id

       row_key: 2563
            game_id: 57142   result: 100   user_id: 6325
       row_key: 96877
            game_id: 57142   result: 99    user_id: 2375
       row_key: 6752
            game_id: 57142   result: 96    user_id: 2311
      ...




Cassandra
    ColumnFamily: Leaderboards

    row_key: game_name

       row_key: "Jewels"
            100: 2563       99: 96877   96: 6752
       row_key: "Bug Landing"
       row_key: "Toss It"
      ...

                    • Insert new score:
                          client.insert("Leaderboards", "Jewels", result => id)
                          client.insert("Scores", id, :result => result, :user_id =>
                          user_id, :game_id => game_id)

                    • What is my rank?
                      => not easy, need help from other tools

                    • Give me the next 10 scores starting at score X
                          client.get("Leaderboards", "Jewels", :start =>
                          X.result, :count => 10)
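
Why paging is easy but rank is not can be seen in a small in-memory model of one Leaderboards row. This is a sketch under assumptions: the row is modelled as a Hash of column name to column value, columns are taken to sort by name in descending order, and the helper names are hypothetical. It does not use the Cassandra client.

```ruby
# Stand-in for the "Jewels" row: column name (result) => column value (score_id).
row = { 100 => 2563, 99 => 96877, 96 => 6752 }

# Column slice: `count` columns starting at column name `start`,
# walking in descending name order.
def column_slice(row, start, count)
  row.keys.sort.reverse
     .select { |name| name <= start }
     .first(count)
     .map { |name| [name, row[name]] }
end

# Rank has no slice-style shortcut: every column that sorts before
# the given result has to be counted.
def rank_in_row(row, result)
  row.keys.count { |name| name > result } + 1
end

# Caveat of using the bare result as the column name: a second score
# with the same result replaces the first instead of adding a column.
collided = row.merge(99 => 11111)

puts column_slice(row, 99, 2).inspect
puts rank_in_row(row, 96)
puts collided.size
```

A slice only needs a start point and a count, so it stays cheap however long the row is, while rank depends on the number of better scores. A column name that combines result and score_id is one way to avoid the overwrite shown in the last step.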




Findings
                    • Use and test the tools you want to use at the scale
                      you are going to use them

                    • There is no “Best NoSQL” solution

                    • Mix and match the tools you need

                    • NoSQL requires a lot of rethinking and changes to
                      your Ruby code.




Links
                    •     Cassandra: http://cassandra.apache.org/

                    •     Cassandra API: http://wiki.apache.org/cassandra/API

                    •     Twitter on Cassandra: http://github.com/ericflo/twissandra

                    •     Redis: http://code.google.com/p/redis/

                    •     Redis API: http://code.google.com/p/redis/wiki/CommandReference

                    •     Membase: http://www.membase.org/

                    •     HBase: http://hbase.apache.org/

                    •     Scribe: http://github.com/facebook/scribe

                    •     Mongo: http://www.mongodb.org/



