Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

772 views

Published on

This is the story of how https://teowaki.com added bigdata analytics using a very economic approach.

Bigdata is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. Problem is big data usually means many servers, a complex set up, intensive monitoring and a steep learning curve. All those things cost money. If you don't have the money, you are losing all the fun. In my talk I will show you how you can use Redis, Google Bigquery and Apps Script to manage big data from your application for under $1 per month. Don't you feel like running a RegExp over 300 million rows in just 5 seconds?

Published in: Software
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
772
On SlideShare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Bigdata for small pockets, by Javier Ramirez from teowaki. RubyC Kiev 2014

  1. 1. javier ramirez @supercoco9 BigData for small pockets Redis, BigQuery and Apps Scripts RubyC Kiev 2014
  2. 2. moral of the story you can do big, if you know how
  3. 3. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 REST API (Ruby on Rails) + Web on top (AngularJS)
  4. 4. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  5. 5. data that’s an order of magnitude greater than data you’re accustomed to javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014 Doug Laney VP Research, Business Analytics and Performance Management at Gartner
  6. 6. data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. Ed Dumbill program chair for the O’Reilly Strata Conference javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  7. 7. bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014 Javier Ramirez impresionable teowaki founder
  8. 8. 1. non intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  9. 9. javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  10. 10. open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets and hyperloglogs. http://redis.io started in 2009 by Salvatore Sanfilippo @antirez 111 contributors at https://github.com/antirez/redis javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  11. 11. twitter stackoverflow pinterest booking.com World of Warcraft YouPorn HipChat Snapchat javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 ntopng LogStash
  12. 12. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining) $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q SET: 552,028 requests per second GET: 707,463 requests per second LPUSH: 767,459 requests per second LPOP: 770,119 requests per second Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining) $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q SET: 122,556 requests per second GET: 123,601 requests per second LPUSH: 136,752 requests per second LPOP: 132,424 requests per second javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  13. 13. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 Non intrusive metrics Capture data really fast. Then send the data on the background
  14. 14. javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  15. 15. Redis keeps everything in memory all the time javier ramirez @supercoco9 https://teowaki.com rubyc kiev2014
  16. 16. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  17. 17. Gzip to AWS S3/Glacier or Google Cloud Storage javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  18. 18. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  19. 19. Hadoop (map/reduce) javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013 http://hadoop.apache.org/ started in 2005 by Doug Cutting and Mike Cafarella
  20. 20. cassandra javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013 http://cassandra.apache.org/ released in 2008 by facebook.
  21. 21. other big data solutions: hadoop+voldemort+kafka hbase javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013 http://engineering.linkedin.com/projects http://hbase.apache.org/
  22. 22. Amazon Redshift javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013 http://aws.amazon.com/redshift/
  23. 23. but... not interactive enough hard to set up and monitor expensive cluster javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  24. 24. Our choice: Google BigQuery Data analysis as a service http://developers.google.com/bigquery javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  25. 25. Based on “Dremel” Specifically designed for interactive queries over petabytes of real-time data javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  26. 26. loading data You just send the data in text (or JSON) format javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  27. 27. SQL javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 select name from USERS order by date; select count(*) from users; select max(date) from USERS; select sum(total) from ORDERS group by user;
  28. 28. specific extensions for analytics javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 within flatten nest stddev top first last nth variance var_pop var_samp covar_pop covar_samp quantiles
  29. 29. Things you always wanted to try but were too scared to javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0; 223,163,387 Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  30. 30. columnar storage javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  31. 31. highly distributed execution using a tree javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  32. 32. web console screenshot javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  33. 33. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 window functions
  34. 34. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 our most active user
  35. 35. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 country segmented traffic
  36. 36. javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14 10 request we should be caching
  37. 37. correlations. not to mistake with causality javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  38. 38. javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014 5 most created resources select uri, count(*) total from stats where method = 'POST' group by URI;
  39. 39. javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014 ...but /users/javier/shouts /users/rgo/shouts /teams/javier-community/links /teams/nosqlmatters-cgn/links
  40. 40. javier ramirez @supercoco9 http://teowaki.com rubyc kiev2014 5 most created resources
  41. 41. new users per month
  42. 42. SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url FROM github.timeline WHERE type="WatchEvent" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00") AND repository_url IN ( SELECT repository_url FROM github.timeline WHERE type="CreateEvent" AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00') AND repository_fork = "false" AND payload_ref_type = "repository" GROUP BY repository_url ) GROUP BY repository_name, repository_language, repository_description, repository_url HAVING cnt >= 5 ORDER BY cnt DESC LIMIT 25
  43. 43. gem install bigbroda https://github.com/michelson/BigBroda javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  44. 44. Automation with Apps Script Read from bigquery Create a spreadsheet on Drive E-mail it everyday as a PDF javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  45. 45. bigquery pricing $26 per stored TB 1000000 rows => $0.00416 / month £0.00243 / month $5 per processed TB 1 full scan = 160 MB 1 count = 0 MB 1 full scan over 1 column = 5.4 MB 100 GB => $0.05 / month £0.03javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  46. 46. £0.054307 / month* per 1MM rows *the 1st 100GB every month are free of charge javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  47. 47. 1. non intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
  48. 48. ig
  49. 49. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Thanks! спасибі Javier Ramírez @supercoco9 rubyc kiev 14

×