API analytics with Bigquery, by Javier Ramirez from teowaki

At https://teowaki.com we have a system for API usage analytics, with Redis as a fast intermediate store and Google BigQuery as a big data backend. As a result, we can run aggregated queries on our traffic/usage data in just a few seconds and look for usage patterns that wouldn’t be obvious otherwise.

In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.

Published in: Technology, Business

  1. javier ramirez @supercoco9: API Analytics with Redis, BigQuery, and AppsScripts
  2. a two-person start-up
  3. a different league...
  4. .. or maybe not
  5. moral of the story: you can do big, if you know how
  6. Set a distance. Set an expiration time. Bye bye noise.
  7. https://teowaki.com: REST API (Ruby on Rails) + Web on top (AngularJS)
  8. [image only]
  9. “data that’s an order of magnitude greater than data you’re accustomed to” (Doug Laney, VP Research, Business Analytics and Performance Management at Gartner)
  10. “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.” (Edd Dumbill, program chair for the O’Reilly Strata Conference)
  11. “bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds” (Javier Ramirez, impressionable teowaki founder)
  12. (1) non-intrusive metrics, (2) keep the history, (3) avoid vendor lock-in, (4) interactive queries, (5) cheap, (6) extra ball: real time
  13. Twitter, Stack Overflow, Pinterest, Booking.com, World of Warcraft, YouPorn, HipChat, Snapchat, ntopng, LogStash
  14. Non-intrusive metrics: capture data really fast, then process the data in the background
  15. [image only]
  16. [image only]
  17. Gzip to AWS S3/Glacier or Google Cloud Storage
  18. [image only]
  19. tools we considered: Hadoop, Cassandra, Hadoop + Voldemort + Kafka, HBase, …, Amazon Redshift
  20. but... hard to set up and monitor; not interactive enough; expensive cluster
  21. Our choice: Google BigQuery. Data analysis as a service. http://developers.google.com/bigquery
  22. Based on “Dremel”. Specifically designed for interactive queries over petabytes of real-time data
  23. loading data: you just send the data in text (or JSON) format
  24. SQL: select name from USERS order by date; select count(*) from users; select max(date) from USERS; select sum(total) from ORDERS group by user;
  25. specific extensions for analytics: within, flatten, nest, stddev, top, first, last, nth, variance, var_pop, var_samp, covar_pop, covar_samp, quantiles, correlations
  26. Things you always wanted to try but were too scared to: select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0; Result: 223,163,387. Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  27. columnar storage
  28. highly distributed execution using a tree
  29. web console screenshot
  30. country segmented traffic
  31. window functions
  32. our most active user
  33. new users per month
  34. 10 requests we should be caching
  35. 5 most created resources: select uri, count(*) total from stats where method = 'POST' group by URI;
  36. ...but /users/javier/shouts, /users/rgo/shouts, /teams/javier-community/links, /teams/nosqlmatters-cgn/links
  37. 5 most created resources
  38. SELECT repository_name, repository_language, repository_description,
      COUNT(repository_name) as cnt, repository_url
      FROM github.timeline
      WHERE type="WatchEvent"
      AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
      AND repository_url IN (
      SELECT repository_url FROM github.timeline
      WHERE type="CreateEvent"
      AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
      AND repository_fork = "false"
      AND payload_ref_type = "repository"
      GROUP BY repository_url )
      GROUP BY repository_name, repository_language, repository_description, repository_url
      HAVING cnt >= 5
      ORDER BY cnt DESC
      LIMIT 25
  39. NO
  40. Automation with Apps Script: read from BigQuery, create a spreadsheet on Drive, e-mail it every day as a PDF
  41. BigQuery pricing: $26 per stored TB; 1000000 rows => $0.00416 / month (£0.00243 / month). $5 per processed TB; 1 full scan = 160 MB; 1 count = 0 MB; 1 full scan over 1 column = 5.4 MB; 100 GB => $0.05 / month (£0.03)
  42. £0.054307 / month* per 1MM rows (*the 1st 1TB every month is free of charge)
  43. (1) non-intrusive metrics, (2) keep the history, (3) avoid vendor lock-in, (4) interactive queries, (5) cheap, (6) extra ball: real time
  44. ig
  45. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Thanks! תודה Javier Ramírez @supercoco9
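
Slide 14 describes capturing metrics quickly on the request path and processing them later in the background. A minimal Python sketch of that split follows. The slides do not show the real implementation, so the `MetricsBuffer` class, its method names, and the event fields are all hypothetical, and an in-memory `deque` stands in for the Redis list that the talk uses as its fast intermediate store.

```python
import json
import time
from collections import deque


class MetricsBuffer:
    """In-memory stand-in for a Redis list (assumption: the real store is Redis)."""

    def __init__(self):
        self._queue = deque()

    def capture(self, method, uri, status):
        # Request path: serialize the event and append it, nothing else.
        event = {"ts": time.time(), "method": method, "uri": uri, "status": status}
        self._queue.append(json.dumps(event))

    def drain(self, batch_size=1000):
        # Background path: pop up to batch_size events in arrival order.
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(json.loads(self._queue.popleft()))
        return batch


buffer = MetricsBuffer()
buffer.capture("POST", "/users/javier/shouts", 201)
buffer.capture("GET", "/teams/javier-community/links", 200)
pending = buffer.drain()  # what a background worker would pick up
```

The point of the design is that the request handler only pays for one append, while aggregation and export happen elsewhere.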
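
Slides 17 and 23 mention gzipping the history to cloud storage and loading data as text or JSON. The sketch below shows one way to pack a batch of events, assuming newline-delimited JSON as the layout. The `pack` and `unpack` helpers and the record fields are hypothetical, and the upload step is left out because the slides do not show how it is done.

```python
import gzip
import json


def pack(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    lines = "\n".join(json.dumps(record, sort_keys=True) for record in records)
    return gzip.compress(lines.encode("utf-8"))


def unpack(blob):
    """Inverse of pack: gunzip and parse one JSON object per line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


events = [
    {"method": "POST", "uri": "/users/javier/shouts", "status": 201},
    {"method": "GET", "uri": "/teams/javier-community/links", "status": 200},
]
archive = pack(events)  # bytes ready to be written to a file or an object store
```

Keeping the archive in a plain, compressed text format is also in line with the "avoid vendor lock-in" goal on slide 12.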
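
Slides 35 and 36 point out that grouping by raw URI splits one kind of resource across many rows, because each URI carries a user or team identifier. One possible fix is to normalize the URI before counting, sketched here in Python. The rule that the segment after `users` or `teams` is an identifier, and the `:id` placeholder, are assumptions made for illustration; the slides do not show how the final query handles this.

```python
import re
from collections import Counter

# Assumption: the path segment right after "users" or "teams" is an identifier.
ID_SEGMENT = re.compile(r"^/(users|teams)/[^/]+")


def normalize(uri):
    """Replace the identifier segment with a fixed placeholder."""
    return ID_SEGMENT.sub(r"/\1/:id", uri)


def most_created(uris, limit=5):
    """Count normalized URIs and return the most frequent ones."""
    return Counter(normalize(uri) for uri in uris).most_common(limit)


posts = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]
ranking = most_created(posts)
```

With the four URIs from slide 36, the two `shouts` requests collapse into one group and the two `links` requests into another.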
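
The storage figure on slide 41 can be reproduced with simple arithmetic. The sketch below uses the two prices from that slide and assumes decimal units (1 TB = 1,000,000 MB). The function names are hypothetical, and the free monthly quota mentioned on slide 42 is not modeled.

```python
STORAGE_USD_PER_TB_MONTH = 26.0  # from slide 41
QUERY_USD_PER_TB = 5.0           # from slide 41
MB_PER_TB = 1_000_000            # assumption: decimal units


def storage_cost_usd(size_mb):
    """Monthly cost of keeping size_mb megabytes stored."""
    return size_mb / MB_PER_TB * STORAGE_USD_PER_TB_MONTH


def query_cost_usd(scanned_mb):
    """Cost of a query that processes scanned_mb megabytes."""
    return scanned_mb / MB_PER_TB * QUERY_USD_PER_TB


# One million rows take about 160 MB according to slide 41.
monthly_storage = round(storage_cost_usd(160), 5)  # 0.00416
```

Because only the columns a query touches are processed, a scan over a single 5.4 MB column costs far less than a full 160 MB scan, which is where the columnar storage from slide 27 pays off.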
