API analytics with Redis and BigQuery. LRUG

At teowaki we have a system for API usage analytics, with Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can run aggregated queries on our traffic/usage data in just a few seconds and try to find usage patterns that wouldn’t be obvious otherwise.

In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and BigQuery to solve our problem.

Transcript

  • 1. API Analytics with Redis and Google Bigquery javier ramirez @supercoco9
  • 2. REST API + AngularJS web as an API client javier ramirez @supercoco9 https://teowaki.com
  • 3. obvious solution: use a ready-made service such as 3scale or Apigee
  • 4. (1) non-intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  • 5. (no text transcribed)
  • 6. "data that’s an order of magnitude greater than data you’re accustomed to" (Doug Laney, VP Research, Business Analytics and Performance Management at Gartner)
  • 7. "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures." (Edd Dumbill, program chair for the O’Reilly Strata Conference)
  • 8. "big data is doing a full scan over 330 million rows, matching them against a regexp, and getting the result (223 million rows) in just 5 seconds" (Javier Ramirez, impressionable teowaki founder)
  • 9. (no text transcribed)
  • 10. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
    $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
    SET: 552,028 requests per second
    GET: 707,463 requests per second
    LPUSH: 767,459 requests per second
    LPOP: 770,119 requests per second
    Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
    $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
    SET: 122,556 requests per second
    GET: 123,601 requests per second
    LPUSH: 136,752 requests per second
    LPOP: 132,424 requests per second
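The two benchmark runs on slide 10 differ only in the `-P 16` flag, which pipelines 16 requests per round trip. As a rough illustration of what that buys, the sketch below divides the pipelined figures by the non-pipelined ones. The numbers are copied from the slide; the function and variable names are made up for this example.

```python
# Requests per second copied from the redis-benchmark output on slide 10.
with_pipelining = {"SET": 552_028, "GET": 707_463, "LPUSH": 767_459, "LPOP": 770_119}
without_pipelining = {"SET": 122_556, "GET": 123_601, "LPUSH": 136_752, "LPOP": 132_424}


def pipelining_speedup(command):
    """Ratio of pipelined to non-pipelined throughput for one command."""
    return with_pipelining[command] / without_pipelining[command]


speedups = {cmd: round(pipelining_speedup(cmd), 2) for cmd in with_pipelining}
for cmd, ratio in speedups.items():
    print(f"{cmd}: {ratio}x faster with -P 16")
```

On these figures every command ends up somewhere between four and six times faster with pipelining.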
  • 11. Redis: open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets (http://redis.io). Started in 2009 by Salvatore Sanfilippo (@antirez). 100 contributors at https://github.com/antirez/redis
  • 12. what is it used for
  • 13. twitter: every timeline (800 tweets per user) is stored in Redis. 5000 writes per second avg; 300K reads per second
  • 14. nginx + lua + redis; apache + mruby + redis
  • 15. (no text transcribed)
  • 16. Redis keeps everything in memory all the time
  • 17. (no text transcribed)
  • 18. easy: store GZIPPED files into S3/Glacier (* we are moving to Google Cloud now)
  • 19. (no text transcribed)
  • 20. Hadoop (map/reduce) (http://hadoop.apache.org/). Started in 2005 by Doug Cutting and Mike Cafarella
  • 21. Cassandra (http://cassandra.apache.org/). Released in 2008 by Facebook
  • 22. other big data solutions: hadoop+voldemort+kafka (http://engineering.linkedin.com/projects); hbase (http://hbase.apache.org/)
  • 23. Amazon Redshift (http://aws.amazon.com/redshift/)
  • 24. Our choice: Google BigQuery. Data analysis as a service (http://developers.google.com/bigquery)
  • 25. Based on Dremel. Specifically designed for interactive queries over petabytes of real-time data
  • 26. Columnar storage. Easy to compress. Convenient for querying long series over a single column
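To make the "single column" point on slide 26 concrete, here is a toy model in Python, not a description of how BigQuery is implemented. It pivots a small, invented request log from rows into columns and counts how many values each layout has to touch to average one field. All field names and values are hypothetical.

```python
# Toy request log; the field names and values are hypothetical.
rows = [
    {"path": "/api/links", "status": 200, "duration_ms": 12},
    {"path": "/api/teams", "status": 200, "duration_ms": 30},
    {"path": "/api/links", "status": 404, "duration_ms": 6},
]

# A row layout keeps each record together; a columnar layout keeps all the
# values of one field together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def values_touched_row_layout(rows):
    """A scan over rows reads every field of every record."""
    return sum(len(row) for row in rows)


def values_touched_column_layout(columns, name):
    """A scan over one column reads only that column's values."""
    return len(columns[name])


average_ms = sum(columns["duration_ms"]) / len(columns["duration_ms"])
print(values_touched_row_layout(rows), values_touched_column_layout(columns, "duration_ms"))
```

The same effect shows up in the pricing on slide 44, where a full scan processes 84 MB but a full scan over one column processes only 5.4 MB.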
  • 27. loading data: you can feed flat CSV files or nested JSON objects
  • 28. bq cli
    bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz
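The `bq load` command on slide 28 expects a gzipped, tab-delimited file named after a UTC timestamp. The talk does not show how those files are produced, so the following is only a minimal sketch of one way to build such a payload in Python. The column layout, the sample rows and the helper names `dump_filename` and `gzip_tsv` are assumptions made for this example.

```python
import csv
import gzip
import io
from datetime import datetime, timezone

# Hypothetical column layout; the talk does not list the fields of api.stats.
COLUMNS = ("timestamp", "method", "path", "status", "duration_ms")


def dump_filename(moment):
    """Name a dump after its UTC timestamp, in the style of 20131014T11-42-05Z.gz."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H-%M-%SZ") + ".gz"


def gzip_tsv(rows):
    """Serialise rows as tab-delimited text and gzip the result in memory."""
    text = io.StringIO()
    writer = csv.writer(text, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


rows = [
    ("2013-10-14 11:42:01", "GET", "/api/links", 200, 12),
    ("2013-10-14 11:42:03", "POST", "/api/links", 201, 48),
]
name = dump_filename(datetime(2013, 10, 14, 11, 42, 5, tzinfo=timezone.utc))
payload = gzip_tsv(rows)
print(name, len(payload), "bytes")
```

Writing `payload` to a file called `name` would give something of the same shape as the file passed to `bq load` above.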
  • 29. web console screenshot
  • 30. almost SQL
    select from join where group by having order limit
    avg count max min sum
    + * / % & | ^ << >> ~ = != <> > < >= <= IN IS NULL BETWEEN AND OR NOT
  • 31. Functions overview
    current_date current_time now datediff day day_of_week day_of_year hour minute quarter year...
    abs acos atan ceil floor degrees log log2 log10 PI SQRT...
    concat contains left length lower upper lpad rpad right substr
  • 32. analytics specific extensions: within flatten nest stddev variance top first last nth var_pop var_samp covar_pop covar_samp quantiles
  • 33. window functions
  • 34. correlations
  • 35. Things you always wanted to try but were too scared to
    select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
    223,163,387
    Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
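The 32¢ reported on slide 35 appears to follow from the $35 per processed TB price quoted on slide 44. The sketch below reproduces that arithmetic. It assumes decimal units (1 TB = 1000 GB), which is the convention that matches the slide's figure; the function name is invented for this example.

```python
PRICE_PER_PROCESSED_TB = 35.0  # dollars, from the BigQuery pricing slide
GB_PER_TB = 1000  # assumption: decimal units, which reproduce the slide's figure


def query_cost_dollars(gb_processed):
    """Cost of a query that processes the given number of gigabytes."""
    return gb_processed / GB_PER_TB * PRICE_PER_PROCESSED_TB


cost = query_cost_dollars(9.13)  # the Wikipedia regexp query
cents = round(cost * 100)
print(f"{cost:.4f} dollars, about {cents} cents")
```

The same function gives $0.35 for 10 GB of processed data, which is the monthly query estimate on slide 44.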
  • 36. SELECT repository_name, repository_language, repository_description,
      COUNT(repository_name) as cnt, repository_url
    FROM github.timeline
    WHERE type="WatchEvent"
      AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
      AND repository_url IN (
        SELECT repository_url
        FROM github.timeline
        WHERE type="CreateEvent"
          AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
          AND repository_fork = "false"
          AND payload_ref_type = "repository"
        GROUP BY repository_url
      )
    GROUP BY repository_name, repository_language, repository_description, repository_url
    HAVING cnt >= 5
    ORDER BY cnt DESC
    LIMIT 25
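The `#{yesterday}` placeholders in the query on slide 36 are Ruby string interpolation, so the date is filled in before the query is sent. The talk does not show that step. As a hedged sketch, the Python below does the equivalent for a shortened stand-in query, assuming the date is formatted as YYYY-MM-DD. `QUERY_TEMPLATE` and `build_query` are hypothetical names.

```python
from datetime import date, timedelta

# Shortened stand-in for the slide's query; {yesterday} plays the role of
# Ruby's #{yesterday} interpolation.
QUERY_TEMPLATE = (
    "SELECT repository_url FROM github.timeline "
    'WHERE type="CreateEvent" '
    "AND PARSE_UTC_USEC(repository_created_at) >= "
    "PARSE_UTC_USEC('{yesterday} 20:00:00')"
)


def build_query(today):
    """Fill in the previous day's date, assumed to be in YYYY-MM-DD form."""
    yesterday = (today - timedelta(days=1)).isoformat()
    return QUERY_TEMPLATE.format(yesterday=yesterday)


query = build_query(date(2013, 10, 15))
print(query)
```

Plain string formatting is fine here because the only interpolated value is a date computed by the program; anything coming from user input would need proper escaping.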
  • 37. country segmented traffic
  • 38. our most active user
  • 39. 10 requests we should be caching
  • 40. 5 most created resources
  • 41. redis pricing: 2 machines* (master/slave) at Digital Ocean, $10 monthly (* we were already using these instances for a lot of redis use cases)
  • 42. s3 pricing: $0.095 per GB. A gzipped 1.6 MB file stores 300K rows: $0.0001541698 monthly
  • 43. glacier pricing: $0.01 per GB: $0.000016 monthly
  • 44. bigquery pricing: $80 per stored TB; 300000 rows => $0.007629392 / month. $35 per processed TB; 1 full scan = 84 MB, 1 count = 0 MB, 1 full scan over 1 column = 5.4 MB; 10 GB => $0.35 / month
  • 45. redis              $10.0000000000
        s3 storage         $00.0001541698
        s3 transfer        $00.0050000000
        glacier transfer   $00.0500000000
        glacier storage    $00.0000160000
        bigquery storage   $00.0076293920
        bigquery queries   $00.3500000000
        $10.41 / month for our first 330000 rows
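The $10.41 total on slide 45 can be checked by adding up the seven line items. The sketch below does that with `Decimal` so the small amounts are not lost to floating-point rounding. The amounts are copied from the slide; the dictionary layout is just one convenient way to hold them.

```python
from decimal import Decimal

# Monthly amounts in dollars, copied from slide 45.
monthly_costs = {
    "redis": Decimal("10"),
    "s3 storage": Decimal("0.0001541698"),
    "s3 transfer": Decimal("0.005"),
    "glacier transfer": Decimal("0.05"),
    "glacier storage": Decimal("0.000016"),
    "bigquery storage": Decimal("0.007629392"),
    "bigquery queries": Decimal("0.35"),
}

total = sum(monthly_costs.values(), Decimal("0"))
rounded = total.quantize(Decimal("0.01"))
redis_share = monthly_costs["redis"] / total

print(f"total: ${rounded} / month, redis share: {redis_share:.0%}")
```

Almost all of the total is the Redis machines, which slide 41 notes were already in use for other purposes.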
  • 46. (1) non-intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  • 47. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Cheers! Javier Ramírez @supercoco9