Big data

A quick talk I gave at the meetup in Boulder, Colorado.

Published in: Technology, Design

Cassandra and Hadoop
  • Kevin Cawley, Engineer, Linksmart
  • Cassandra: actively using it for 2+ years
  • Hadoop: 1 year of experience, sort of
Problem For Today
  • Running a survey to discover NoSQL preferences
  • Options: mongodb, redis, cassandra, couch, hbase, riak, voldemort, dynamodb
  • We're gonna get billions of responses; an RDBMS is going to fall over
  • We need NoSQL... what the hell is that??
Problem For Today, cont.
    Kevin, response=cassandra, kevin@foo.com
    Emma, response=redis, emma@foo.com
    Asher, response=cassandra, [email_address]
    …
  BILLIONS AND BILLIONS OF THESE!!!!
Cassandra
  • Linearly scalable, highly available, performant database
  • Key-value store
  • Ring architecture with replication; 2^127 tokens
  [Diagram: ring of four nodes: Node 1, Node 2, Node 3, Node 4]
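
The ring idea on this slide can be sketched in a few lines. This is a minimal illustration, not Cassandra's actual code: it assumes an MD5-based partitioner with a token range of 0 to 2^127, evenly spaced node tokens, and a simple "next nodes clockwise" replica placement. The names `Ring`, `token_for`, and `replicas_for` are made up for the example.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # assumed token range of an MD5-based partitioner


def token_for(key: str) -> int:
    """Hash a row key onto the ring (illustrative, not Cassandra's exact algorithm)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


class Ring:
    """Hypothetical token ring with evenly spaced node tokens."""

    def __init__(self, nodes, replication_factor=2):
        step = TOKEN_SPACE // len(nodes)
        self.tokens = [i * step for i in range(len(nodes))]
        self.nodes = list(nodes)
        self.replication_factor = replication_factor

    def replicas_for(self, key):
        # The first node whose token is >= the key's token owns the key;
        # wrap around to the first node when the key hashes past the last token.
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.nodes)
        # Additional replicas are simply the next nodes clockwise on the ring.
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replication_factor)]


ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=2)
print(ring.replicas_for("kevin"))
```

Adding a node only changes ownership for the keys between it and its neighbour, which is where the linear-scalability claim comes from.
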
Cassandra
  • Keyspace
  • Column families: standard and dynamic (mo better)

    Standard column family:
          name           preference
    100   kevin cawley   cassandra
    101   asher cawley   cassandra
    102   emma cawley    redis

    Dynamic column family:
                202                  201
    redis       ['joe', 'bob']       ['matthias']
    cassandra   ['kevin', 'asher']   ['tom']
    mongodb     ['holly']            ['dan']

    assume User keys as utf8;
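
One way to picture the difference between the two kinds of column family is with plain dictionaries. This is only a sketch: the `users` rows come from the slide, while the `by_preference` layout (row key = preference, column name = user key) is an assumed example of a dynamic column family, not the exact table shown on the slide.

```python
# Standard column family: every row has the same, known column names.
users = {
    "100": {"name": "kevin cawley", "preference": "cassandra"},
    "101": {"name": "asher cawley", "preference": "cassandra"},
    "102": {"name": "emma cawley", "preference": "redis"},
}

# Dynamic column family: the column names are themselves data.
# Hypothetical layout: one wide row per preference, one column per user key.
by_preference = {}
for user_key, columns in users.items():
    row = by_preference.setdefault(columns["preference"], {})
    row[user_key] = columns["name"]

# Answering "who prefers cassandra?" is now a single row read.
print(sorted(by_preference["cassandra"].values()))
```

The point of the dynamic layout is that the question you want to ask is baked into the row key, so the answer is one lookup instead of a scan.
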
Super columns
  • Not so super
      • Nice on paper, can be catastrophic in practice
      • Fanning: not the cool, refreshing kind
      • Getting phased out

    redis
      202: {'joe' => 'joe@foo.com', 'bob' => 'bob@boo.com'}
      201: {'matthias' => 'matthias@foo.cm', 'tom' => 'tom@boo.com'}
Secondary Indexes
  • Indexes on column values
  • Replacement for the not-so-super super columns
      • Composite columns: US:colorado:cassandra => kevin
  • Demo 1
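
The composite-column idea (`US:colorado:cassandra => kevin`) amounts to keeping multi-part column names sorted so that a prefix can be read as one contiguous slice. The sketch below models that with sorted tuples. It assumes the user name is appended as a last name component so that several users can share a prefix, and every location other than kevin's is invented sample data.

```python
import bisect

# Composite column names: (country, state, preference, user), kept sorted
# the way columns within a wide row would be. Sample data is hypothetical.
columns = sorted([
    ("US", "colorado", "cassandra", "kevin"),
    ("US", "colorado", "cassandra", "asher"),
    ("US", "colorado", "redis", "emma"),
    ("US", "texas", "mongodb", "holly"),
])


def slice_prefix(cols, prefix):
    """Return every column whose name starts with `prefix`."""
    # Binary search to the first candidate, then read until the prefix stops matching.
    start = bisect.bisect_left(cols, prefix)
    matches = []
    for col in cols[start:]:
        if col[:len(prefix)] != prefix:
            break
        matches.append(col)
    return matches


# Everyone in Colorado who prefers cassandra:
print([col[-1] for col in slice_prefix(columns, ("US", "colorado", "cassandra"))])
```

Shorter prefixes work the same way: `("US", "colorado")` returns every Colorado response regardless of preference.
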
Counters
  • Yes! Yum. Counters good
  • We built our own; now they come for free
  • Cassandra is eventually consistent, which makes this hard
  • Be clever and you will win
  • Demo 2
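
Why eventual consistency makes counters hard, and one way to "be clever": if every replica only ever increments its own shard, replicas can exchange state in any order, any number of times, and still converge. The sketch below shows that general technique (a grow-only sharded counter). It is an illustration only, not the home-built counter from the talk and not Cassandra's internal implementation; `ShardedCounter` is a made-up name.

```python
class ShardedCounter:
    """Grow-only counter: one shard per replica, merged by taking the max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.shards = {}  # replica id -> count contributed by that replica

    def increment(self, amount=1):
        # A replica only ever writes to its own shard.
        self.shards[self.replica_id] = self.shards.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Taking the per-shard max makes merging idempotent and order-independent.
        for replica_id, count in other.shards.items():
            self.shards[replica_id] = max(self.shards.get(replica_id, 0), count)

    def value(self):
        return sum(self.shards.values())


a, b = ShardedCounter("node1"), ShardedCounter("node2")
a.increment()
a.increment()
b.increment()
a.merge(b)
b.merge(a)
a.merge(b)  # repeating a merge does not double-count
print(a.value(), b.value())
```

A naive "read, add one, write back" counter loses updates when two replicas do it concurrently; the per-replica shards avoid that race entirely.
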
Counters
    Counter
    cassandra   30333
    redis       22098
    mongodb     24567
    couch       12340
    ...
Hadoop
  • Distributed processing of large data sets across clusters of computers. Too good to be true?
Map Reduce
  • Acronym soup
      • hadoop common, hdfs, map reduce
      • hbase, pig, hive, zookeeper
  • Map Reduce is at the heart
  • Map: processes a key/value pair to generate a set of intermediate key/value pairs
  • Reduce: merges all intermediate values associated with the same intermediate key
Map Reduce: Our Example
    Kevin, response=cassandra, kevin@foo.com
    Emma, response=redis, emma@foo.com
    Asher, response=cassandra, [email_address]
    ...
  • Map:
      • cassandra kevin
      • cassandra asher
      • redis emma
  • Reduce:
      • cassandra 2
      • redis 1
  AND the winner is cassandra w/ 2 votes!!!
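
The survey example can be run end to end in plain Python to make the map, shuffle, and reduce phases concrete. This is a single-process sketch of the idea, not Hadoop code; the function names `map_fn`, `reduce_fn`, and `run_job` are invented, and the redacted address is kept as the placeholder it appears as on the slide.

```python
from collections import defaultdict

records = [
    "Kevin, response=cassandra, kevin@foo.com",
    "Emma, response=redis, emma@foo.com",
    "Asher, response=cassandra, [email_address]",
]


def map_fn(record):
    """Emit one intermediate (preference, name) pair per survey response."""
    name, response, _email = [field.strip() for field in record.split(",")]
    yield response.split("=", 1)[1], name.lower()


def reduce_fn(preference, names):
    """Merge all values that share a key into a single vote count."""
    return preference, len(names)


def run_job(inputs):
    # Shuffle: group the intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


votes = run_job(records)
winner = max(votes, key=votes.get)
print(votes, winner)
```

On a real cluster the map calls run in parallel on different machines, and the shuffle moves each key's values to the reducer responsible for it.
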
Cassandra Hive
  • Hadoop/Brisk on Cassandra: no luck
  • Hive
      • Data warehouse built on top of CassandraFS, leveraging map reduce
      • Query the data using a SQL-like language called HiveQL
  • Demo 3
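
To give a feel for what "SQL-like" buys you, here is the survey count written as a single aggregate query. It runs against Python's built-in `sqlite3` purely as a stand-in, since a Hive cluster cannot be assumed here. The table name `survey` and its columns are hypothetical, and although a HiveQL query would have roughly this shape, Hive may differ in details.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (name TEXT, response TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [("kevin", "cassandra"), ("emma", "redis"), ("asher", "cassandra")],
)

# One declarative query replaces the hand-written map and reduce functions.
query = """
SELECT response, COUNT(*) AS votes
FROM survey
GROUP BY response
ORDER BY votes DESC
"""
results = conn.execute(query).fetchall()
print(results)
```

With Hive, a query like this is compiled into map reduce jobs behind the scenes, which is why it suits ad hoc questions.
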
Summary
  • Cassandra
      • Awesome for storing massive amounts of data
      • Dangerous if you don't know what you are doing
      • Schemaless, yet ironically data modelling is extremely important
      • Ad hoc questions are hard to answer fast
  • Hadoop/Brisk
      • Great for answering ad hoc questions reasonably fast
  • What you really want is Cassandra ↔ Hadoop ↔ RDBMS