api analytics using Redis, BigQuery and AppsScripts by Javier Ramirez from teowaki (Apidays Mediterranea)

This is the story of how we implemented big data analytics for teowaki's API (https://teowaki.com).

Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. The problem is that big data usually means many servers, a complex setup, intensive monitoring and a steep learning curve. All those things cost money. If you don't have the money, you miss out on all the fun. In my talk I will show you how you can use Redis, Google BigQuery and Apps Script to manage big data from your application for under $1 per month. Don't you feel like running a RegExp over 300 million rows in just 5 seconds?

  1. API analytics with Redis and BigQuery. javier ramirez @supercoco9
  2. REST API (Ruby on Rails) + Web on top (AngularJS). javier ramirez @supercoco9 https://teowaki.com api days 14
  3. Use a hosted solution
  4. Questions?
  5. (no text on this slide)
  6. “data that’s an order of magnitude greater than data you’re accustomed to” (Doug Laney, VP Research, Business Analytics and Performance Management at Gartner)
  7. “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.” (Edd Dumbill, program chair for the O’Reilly Strata Conference)
  8. “big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds” (Javier Ramirez, impressionable teowaki founder)
  9. (1) non-intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  10. (no text on this slide)
  11. Redis: open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets and hyperloglogs. http://redis.io. Started in 2009 by Salvatore Sanfilippo (@antirez). 112 contributors at https://github.com/antirez/redis
  12. twitter, stackoverflow, pinterest, booking.com, World of Warcraft, YouPorn, HipChat, Snapchat, ntopng, LogStash
  13. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining): $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q => SET: 552,028 requests per second; GET: 707,463 requests per second; LPUSH: 767,459 requests per second; LPOP: 770,119 requests per second. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining): $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q => SET: 122,556 requests per second; GET: 123,601 requests per second; LPUSH: 136,752 requests per second; LPOP: 132,424 requests per second
  14. Non-intrusive metrics: capture data really fast, then send the data in the background
  15. (no text on this slide)
  16. Redis keeps everything in memory all the time
  17. (no text on this slide)
  18. Gzip to AWS S3/Glacier or Google Cloud Storage
  19. (no text on this slide)
  20. Tools we considered: Hadoop, Cassandra, Amazon Redshift, ...
  21. but... hard to set up and monitor, expensive cluster, not interactive enough
  22. Our choice: Google BigQuery. Data analysis as a service. http://developers.google.com/bigquery
  23. Based on “Dremel”. Specifically designed for interactive queries over petabytes of real-time data
  24. Loading data: you just send the data in text (or JSON) format
  25. SQL: select name from USERS order by date; select count(*) from users; select max(date) from USERS; select sum(total) from ORDERS group by user;
  26. Specific extensions for analytics: within, flatten, nest, stddev, top, first, last, nth, variance, var_pop, var_samp, covar_pop, covar_samp, quantiles
  27. web console screenshot
  28. window functions
  29. our most active user
  30. country segmented traffic
  31. 10 requests we should be caching
  32. Correlations. Not to be mistaken for causality
  33. Things you always wanted to try but were too scared to: select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0; Result: 223,163,387. Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  34. 5 most created resources: select uri, count(*) total from stats where method = 'POST' group by URI;
  35. ...but /users/javier/shouts, /users/rgo/shouts, /teams/javier-community/links, /teams/nosqlmatters-cgn/links
  36. 5 most created resources
  37. new users per month
  38. SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url FROM github.timeline WHERE type="WatchEvent" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00") AND repository_url IN ( SELECT repository_url FROM github.timeline WHERE type="CreateEvent" AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00') AND repository_fork = "false" AND payload_ref_type = "repository" GROUP BY repository_url ) GROUP BY repository_name, repository_language, repository_description, repository_url HAVING cnt >= 5 ORDER BY cnt DESC LIMIT 25
  39. Automation with Apps Script: read from BigQuery, create a spreadsheet on Drive, e-mail it every day as a PDF
  40. BigQuery pricing. $26 per stored TB: 1000000 rows => $0.00416 / month (£0.00243 / month). $5 per processed TB: 1 full scan = 160 MB; 1 count = 0 MB; 1 full scan over 1 column = 5.4 MB; 100 GB => $0.05 / month (£0.03)
  41. £0.054307 / month* per 1MM rows. *The first 1 TB every month is free of charge
  42. (1) non-intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  43. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Thanks! Gràcies! Javier Ramírez @supercoco9
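
Slide 14 describes capturing metrics quickly in the request path and shipping them later in the background. The following is a minimal sketch of that split, not the talk's actual code: a `collections.deque` stands in for a Redis list (with `appendleft` in the role of LPUSH and `pop` in the role of RPOP), and the function names and event fields are hypothetical.

```python
import json
from collections import deque

# Stand-in for a Redis list: appendleft plays the role of LPUSH and
# pop the role of RPOP. A real setup would talk to Redis instead.
events = deque()


def record_request(queue, method, uri, status, elapsed_ms):
    """Called in the request path: serialise one event and push it."""
    event = {"method": method, "uri": uri, "status": status, "elapsed_ms": elapsed_ms}
    queue.appendleft(json.dumps(event))


def drain(queue, batch_size=1000):
    """Called by a background job: pop up to batch_size events, oldest first."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(json.loads(queue.pop()))
    return batch


record_request(events, "POST", "/users/javier/shouts", 201, 12.5)
record_request(events, "GET", "/teams/javier-community/links", 200, 3.1)
batch = drain(events)
print(len(batch), batch[0]["method"])
```

The request path only pays for one serialisation and one push; everything slower (compression, upload, loading) can happen in the job that calls `drain`.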
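
Slide 18 mentions gzipping the captured data before archiving it to S3/Glacier or Google Cloud Storage. As a rough illustration, assuming the events are kept as newline-delimited JSON (an assumption, the slides do not state the archive format), the compression step could look like this. The upload itself is left out because it depends on the storage client in use; the function names are hypothetical.

```python
import gzip
import json


def compress_batch(events):
    """Serialise events as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(e, sort_keys=True) for e in events)
    return gzip.compress(payload.encode("utf-8"))


def decompress_batch(blob):
    """Inverse of compress_batch, useful for checking an archive."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


sample = [{"method": "GET", "uri": "/users/rgo/shouts", "status": 200}] * 500
blob = compress_batch(sample)
print(len(blob) < len(json.dumps(sample)))
```

Request logs are highly repetitive, so they tend to compress well, which keeps archive storage small.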
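
Slide 24 says data is loaded by sending it in text (or JSON) format. Assuming "text" means CSV, a small sketch of preparing rows for a load might be the following. The column names, their order and the sample timestamps are made up for illustration and are not taken from the talk.

```python
import csv
import io

# Hypothetical column order for a request-log table.
COLUMNS = ["ts", "method", "uri", "status", "elapsed_ms"]


def to_csv(events):
    """Render events as CSV text, one row per event, in COLUMNS order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    for event in events:
        writer.writerow(event)
    return out.getvalue()


rows = [
    {"ts": "2014-05-30 10:00:00", "method": "POST", "uri": "/users/javier/shouts",
     "status": 201, "elapsed_ms": 12.5},
    {"ts": "2014-05-30 10:00:01", "method": "GET", "uri": "/teams/javier-community/links",
     "status": 200, "elapsed_ms": 3.1},
]
print(to_csv(rows), end="")
```

No header row is written here, so the column order in the file has to match the table schema.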
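
Slides 34 to 36 point at a problem with grouping by raw URI: /users/javier/shouts and /users/rgo/shouts are the same kind of resource but count as different groups. The transcript does not include the query that solved this, so the sketch below only illustrates one possible approach: replace the identifier segment with a placeholder before counting. The route pattern is an assumption based on the four example paths on slide 35, and the same idea could be expressed inside the query with a regular expression.

```python
import re
from collections import Counter

# Assumed route shapes, taken from the example paths on the slide:
# /users/<id>/<resource> and /teams/<id>/<resource>.
PATTERN = re.compile(r"^/(users|teams)/[^/]+/")


def normalise(uri):
    """Replace the user or team identifier with a fixed placeholder."""
    return PATTERN.sub(r"/\1/:id/", uri)


def most_created(uris, n=5):
    """Count POSTed URIs by normalised form and return the top n."""
    return Counter(normalise(u) for u in uris).most_common(n)


posted = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
    "/users/javier/shouts",
]
print(most_created(posted))
# → [('/users/:id/shouts', 3), ('/teams/:id/links', 2)]
```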
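
The storage figure on slide 40 can be checked with simple arithmetic. The sketch below assumes that 1MM rows take 160 MB (the full-scan size quoted on the slide) and that 1 TB is 1,000,000 MB; both prices come from the slide. It covers only the per-TB arithmetic, ignores the free monthly allowance, and does not attempt to reproduce the sterling figures.

```python
STORAGE_USD_PER_TB = 26.0  # from the slide: $26 per stored TB per month
QUERY_USD_PER_TB = 5.0     # from the slide: $5 per processed TB
MB_PER_TB = 1_000_000      # assumption: decimal units


def storage_cost(mb_stored):
    """Monthly storage cost in USD for the given number of MB."""
    return mb_stored / MB_PER_TB * STORAGE_USD_PER_TB


def query_cost(mb_processed):
    """Cost in USD of a query that processes the given number of MB."""
    return mb_processed / MB_PER_TB * QUERY_USD_PER_TB


# 160 MB stored for a month, matching the slide's $0.00416.
print(round(storage_cost(160), 5))
# One full scan of the same 160 MB.
print(round(query_cost(160), 4))
```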