Big Data Analytics 
with 
Google BigQuery 
by javier ramirez 
@supercoco9 
https://teowaki.com 
https://datawaki.com
INPUT 
/ 
OUTPUT 
Big Data's 
#1 Enemy
Read one 
terabyte of 
data in 
one second 
javier ramirez @supercoco9 https://teowaki.com
data that exceeds the 
processing capacity of 
conventional database 
systems. The data is too big, 
moves too fast, or doesn’t fit 
the structures of your 
database architectures. 
Edd Dumbill 
program chair for the O’Reilly Strata Conference 
bigdata is doing a full scan 
of 330MM rows, matching 
them against a regexp, 
and getting the result 
(223MM rows) in just 5 
seconds 
Javier Ramirez 
impressionable teowaki founder
REST API 
+ 
AngularJS web as 
an API client 
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
1. non intrusive metrics 
2. keep the history 
3. interactive queries 
4. cheap 
5. extra ball: real time 
Apache Hadoop 
Apache Cassandra 
Apache Spark 
Apache Storm 
Amazon Redshift 
bigdata is cool but... 
expensive cluster 
hard to set up and monitor 
not interactive enough
Our choice: 
Google BigQuery 
Data analysis as a service 
http://developers.google.com/bigquery 
Based on Dremel 
Specifically designed for 
interactive queries over 
petabytes of real-time 
data 
What Dremel is used for in Google 
• Analysis of crawled web documents. 
• Tracking install data for applications on Android Market. 
• Crash reporting for Google products. 
• OCR results from Google Books. 
• Spam analysis. 
• Debugging of map tiles on Google Maps. 
• Tablet migrations in managed Bigtable instances. 
• Results of tests run on Google’s distributed build system. 
• Disk I/O statistics for hundreds of thousands of disks. 
• Resource monitoring for jobs run in Google’s data centers. 
• Symbols and dependencies in Google’s codebase.
INDEXES 
Data 
Scientists' 
#1 Enemy
in BigQuery 
everything is 
a full-scan* 
*Over a ridiculously fast distributed filesystem. 
Dremel design goal: 1TB/sec. It was exceeded 
BigQuery delivers ~ 50Gb/Sec. 
Columnar 
storage 
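A minimal sketch of why columnar storage suits full scans, assuming a toy in-memory table. The rows, field names, and helper functions below are made up for illustration and say nothing about BigQuery's real on-disk format. A query that touches one column only needs to read that column's values:

```python
# Toy illustration only: hypothetical rows, not BigQuery's storage format.
rows = [
    {"uri": "/users/javier/shouts", "method": "POST", "status": 201},
    {"uri": "/users/rgo/shouts", "method": "GET", "status": 200},
    {"uri": "/teams/javier-community/links", "method": "POST", "status": 201},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def cells_read_row_store(records):
    """A row-oriented scan reads every field of every row."""
    return sum(len(record) for record in records)


def cells_read_column_store(columns, needed):
    """A column-oriented scan reads only the columns the query names."""
    return sum(len(columns[name]) for name in needed)


columns = to_columns(rows)

# Equivalent of: select count(*) from rows where method = 'POST'
post_count = sum(1 for method in columns["method"] if method == "POST")
```

Counting the POST requests reads 3 cells from the column layout, against 9 from the row layout.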
Colossus filesystem 
Distributed/redundant 
Parallel reads 
Ultra fast network 
highly distributed 
execution using a tree 
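A rough sketch of tree-shaped execution for a count query. It assumes each leaf scans its own shard and that intermediate nodes only add up partial counts. The shard contents, the `fanout` parameter, and the function names are hypothetical, not Dremel's actual design:

```python
import re


def leaf_count(shard, pattern):
    """Leaf node: full scan of one shard, counting matching titles."""
    return sum(1 for title in shard if pattern.search(title))


def tree_count(shards, pattern, fanout=2):
    """Combine leaf results level by level until one total remains."""
    partials = [leaf_count(shard, pattern) for shard in shards]
    while len(partials) > 1:
        # Each parent node sums the partial counts of up to `fanout` children.
        partials = [
            sum(partials[i:i + fanout])
            for i in range(0, len(partials), fanout)
        ]
    return partials[0] if partials else 0


# Hypothetical shards, each one scanned by a different leaf.
shards = [
    ["Route 66", "Madrid"],
    ["1984", "Ruby"],
    ["Apollo 11"],
    ["Kiev", "Area 51"],
]
total = tree_count(shards, re.compile(r"[0-9]+"))
```

Only small partial counts travel up the tree, so the root never sees the raw rows.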
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
Getting started...
create dataset and tables
loading data 
You can feed flat CSV-like 
files or nested JSON 
objects 
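A small sketch of preparing both input shapes before loading. It assumes tab-delimited text for the flat case and one JSON object per line (newline-delimited JSON) for the nested case. The record fields are invented for illustration:

```python
import csv
import io
import json


def to_tsv(rows):
    """Serialise flat rows as tab-delimited text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def to_ndjson(records):
    """Serialise nested records as newline-delimited JSON."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


# Hypothetical API stats: the same request, flat and nested.
flat_rows = [("POST", "/users/javier/shouts", 201)]
nested_records = [
    {"method": "POST", "uri": "/users/javier/shouts", "response": {"status": 201}}
]

tsv_text = to_tsv(flat_rows)
ndjson_text = to_ndjson(nested_records)
```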
bq cli 
bq load --nosynchronous_mode 
--encoding UTF-8 
--field_delimiter 'tab' 
--max_bad_records 100 
--source_format CSV 
api.stats 20131014T11-42-05Z.gz 
web console screenshot 
it's just sql, plus... 
analytical SQL functions. 
correlations. 
window functions. 
views. 
JSON fields. 
timestamped tables. 
Things you always wanted to 
try but were too scared to 
select count(*) from 
publicdata:samples.wikipedia 
where REGEXP_MATCH(title, "[0-9]*") 
AND wp_namespace = 0; 
223,163,387 
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢) 
Global Database of Events, 
Language and Tone 
quarter billion rows 
30 years 
updated daily 
http://gdeltproject.org/data.html#googlebigquery
SELECT Year, Actor1Name, Actor2Name, Count FROM ( 
SELECT Actor1Name, Actor2Name, Year, 
COUNT(*) Count, RANK() OVER(PARTITION BY YEAR ORDER BY 
Count DESC) rank 
FROM 
(SELECT Actor1Name, Actor2Name, Year FROM 
[gdelt-bq:full.events] WHERE Actor1Name < Actor2Name 
and Actor1CountryCode != '' and Actor2CountryCode != '' 
and Actor1CountryCode!=Actor2CountryCode), 
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, 
Year FROM [gdelt-bq:full.events] WHERE 
Actor1Name > Actor2Name and Actor1CountryCode != '' and 
Actor2CountryCode != '' and 
Actor1CountryCode!=Actor2CountryCode), 
WHERE Actor1Name IS NOT null 
AND Actor2Name IS NOT null 
GROUP EACH BY 1, 2, 3 
HAVING Count > 100 
) 
WHERE rank=1 
ORDER BY Year
our most active user 
10 requests we should be caching 
5 most created resources 
select uri, count(*) total from 
stats where method = 'POST' 
group by uri 
order by total desc limit 5; 
...but 
/users/javier/shouts 
/users/rgo/shouts 
/teams/javier-community/links 
/teams/nosqlmatters-cgn/links 
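The URIs above embed user and team names, so grouping by the raw URI splits one kind of resource into many rows. One possible fix, sketched here in Python, is to replace the variable segment with a placeholder before counting. The `normalize_uri` helper and the `:id` placeholder are hypothetical. Inside BigQuery the same idea would be expressed with a regexp function in the query:

```python
import re
from collections import Counter

# Assumption: the segment right after /users or /teams is the identifier.
_ID_SEGMENT = re.compile(r"^/(users|teams)/[^/]+")


def normalize_uri(uri):
    """Replace the user or team name with a fixed placeholder."""
    return _ID_SEGMENT.sub(r"/\1/:id", uri)


posts = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]

# Count POSTs per normalised resource and keep the five most frequent.
top_resources = Counter(normalize_uri(uri) for uri in posts).most_common(5)
```

With the four sample URIs this yields two resource types, each created twice.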
5 most created resources 
what is it 
being used for?
Analysing weather information 
Finding patterns in e-commerce 
Match online/offline behaviour 
Log analysis 
Analysing inventory/booking data 
...
warning: BigQuery is not 
open source and not 
free 
$26 per stored TB 
$5 per processed TB 
*the 1st TB processed every month is free of charge 
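A back-of-the-envelope estimator for the prices on this slide. It assumes the storage price is charged per month and that the free first terabyte applies only to processed data. The function name and constants are illustrative, and current prices may differ:

```python
# Prices as quoted on the slide, in US dollars.
STORAGE_USD_PER_TB = 26.0
QUERY_USD_PER_TB = 5.0
FREE_QUERY_TB = 1.0  # first processed TB each month is free


def monthly_cost(stored_tb, processed_tb):
    """Estimate one month's bill from stored and processed terabytes."""
    billable_tb = max(0.0, processed_tb - FREE_QUERY_TB)
    return stored_tb * STORAGE_USD_PER_TB + billable_tb * QUERY_USD_PER_TB


# Example: 2 TB stored, 3 TB processed in the month.
example_bill = monthly_cost(2, 3)
```

For 2 TB stored and 3 TB processed the estimate is 2 × 26 + (3 − 1) × 5 = 62 dollars.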
Find related links at 
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk 
Thanks 
javier ramirez 
@supercoco9 
https://teowaki.com 
https://datawaki.com

Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf
