Big Data Analytics with Google BigQuery, by Javier Ramirez, datawaki, at Span Conf

Big Data Analytics
with
Google BigQuery
by javier ramirez
@supercoco9
https://teowaki.com
https://datawaki.com

INPUT
/
OUTPUT
Big Data's
#1 Enemy

Read one
terabyte of
data in
one second

data that exceeds the
processing capacity of
conventional database
systems. The data is too big,
moves too fast, or doesn’t fit
the structures of your
database architectures.
Edd Dumbill
program chair for the O’Reilly Strata Conference

big data is doing a full scan
of 330MM rows, matching
them against a regexp,
and getting the result
(223MM rows) in just 5
seconds
Javier Ramirez
impressionable teowaki founder

REST API
+
AngularJS web as
an API client

1. non intrusive metrics
2. keep the history
3. interactive queries
4. cheap
5. extra ball: real time

Apache Hadoop
Apache Cassandra
Apache Spark
Apache Storm
Amazon Redshift

bigdata is cool but...
expensive cluster
hard to set up and monitor
not interactive enough

Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery

Based on Dremel
Specifically designed for
interactive queries over
petabytes of real-time
data

What Dremel is used for in Google
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.

INDEXES
Data
Scientists'
#1 Enemy

in BigQuery
everything is
a full-scan*
*Over a ridiculously fast distributed filesystem.
Dremel design goal: 1 TB/sec. It was exceeded.
BigQuery delivers ~50 Gb/sec.

Columnar
storage
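A toy sketch of why columnar storage suits full scans: a query that names one column only has to read that column. The table, its column names, and the "values read" cost measure are all hypothetical and only illustrate the idea; they say nothing about BigQuery's real storage format.

```python
# Toy comparison of row-oriented vs column-oriented layouts.
# The rows and column names below are made up for illustration.
rows = [
    {"uri": "/users/javier/shouts", "method": "POST", "status": 201},
    {"uri": "/teams/javier-community/links", "method": "GET", "status": 200},
    {"uri": "/users/rgo/shouts", "method": "POST", "status": 201},
]

# Column-oriented layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def values_read_row_store(rows):
    """A row store touches every field of every row during a full scan."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column store touches only the columns the query names."""
    return sum(len(columns[name]) for name in wanted)


wanted = ["method"]  # e.g. a query that only counts requests by method
row_cost = values_read_row_store(rows)
col_cost = values_read_column_store(columns, wanted)
print(row_cost, col_cost)
```

With three columns, the column layout reads a third of the values here; the gap grows with the number of columns the query does not need.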

Colossus filesystem
Distributed/redundant
Parallel reads
Ultra fast network
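A back-of-the-envelope sketch of why parallel reads matter for the "one terabyte in one second" goal. The per-disk throughput of 100 MB/s is an assumption chosen for illustration, not a figure from the talk.

```python
TB = 10**12
MB = 10**6

# Assumption: sequential read speed of a single disk (illustrative only).
ASSUMED_DISK_BYTES_PER_SEC = 100 * MB


def disks_needed(target_bytes_per_sec, disk_bytes_per_sec):
    """Disks that must be read in parallel to reach the target (ceiling division)."""
    return -(-target_bytes_per_sec // disk_bytes_per_sec)


def seconds_on_one_disk(total_bytes, disk_bytes_per_sec):
    """Time to read everything sequentially from a single disk."""
    return total_bytes / disk_bytes_per_sec


disks = disks_needed(1 * TB, ASSUMED_DISK_BYTES_PER_SEC)
single = seconds_on_one_disk(1 * TB, ASSUMED_DISK_BYTES_PER_SEC)
print(disks, single)
```

Under that assumption a single disk needs close to three hours for one terabyte, while reading it in one second takes thousands of disks working at once.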

highly distributed
execution using a tree
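A minimal sketch of tree-shaped execution, assuming a simplified model: leaf workers scan their own shard and return partial counts, and intermediate nodes merge partials on the way up to the root. The function names (`leaf_scan`, `mixer`, `run_tree`), the fanout, and the data are hypothetical; this is not how Dremel is actually implemented.

```python
from collections import Counter


def leaf_scan(shard, key, predicate):
    """Leaf: scan one shard and return a partial count per key value."""
    return Counter(row[key] for row in shard if predicate(row))


def mixer(partials):
    """Intermediate node: merge the partial counts of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_tree(shards, key, predicate, fanout=2):
    """Aggregate bottom-up, merging `fanout` children at each level."""
    level = [leaf_scan(shard, key, predicate) for shard in shards]
    while len(level) > 1:
        level = [mixer(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()


# Hypothetical data split across four shards.
shards = [
    [{"method": "POST"}, {"method": "GET"}],
    [{"method": "GET"}],
    [{"method": "POST"}, {"method": "POST"}],
    [{"method": "DELETE"}, {"method": "GET"}],
]
result = run_tree(shards, "method", lambda row: row["method"] != "DELETE")
print(dict(result))
```

Each node only ever handles small partial results, so the scan itself can be spread over as many leaves as there are shards.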

loading data
You can feed flat CSV-like
files or nested JSON
objects
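A sketch of producing both input shapes from Python: a gzipped, tab-delimited flat file and a newline-delimited JSON file with a nested object. The record fields, file names, and helper names are hypothetical.

```python
import csv
import gzip
import json
import os
import tempfile

# Hypothetical request records; the nested "user" object only fits the JSON form.
records = [
    {"uri": "/users/javier/shouts", "method": "POST", "status": 201,
     "user": {"name": "javier"}},
    {"uri": "/teams/javier-community/links", "method": "GET", "status": 200,
     "user": {"name": "rgo"}},
]


def write_tsv_gz(path, records, fieldnames):
    """Write the flat fields as a gzipped, tab-delimited file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for record in records:
            writer.writerow([record[name] for name in fieldnames])


def write_ndjson(path, records):
    """Write one JSON object per line, keeping nested fields."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


out_dir = tempfile.mkdtemp()
tsv_path = os.path.join(out_dir, "stats.tsv.gz")
ndjson_path = os.path.join(out_dir, "stats.json")
write_tsv_gz(tsv_path, records, ["uri", "method", "status"])
write_ndjson(ndjson_path, records)
```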

bq cli
bq load --nosynchronous_mode \
  --encoding UTF-8 \
  --field_delimiter 'tab' \
  --max_bad_records 100 \
  --source_format CSV \
  api.stats 20131014T11-42-05Z.gz

web console screenshot

it's just sql, plus...
analytical SQL functions.
correlations.
window functions.
views.
JSON fields.
timestamped tables.

Things you always wanted to
try but were too scared to
select count(*) from
publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*")
AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
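A side note on the pattern in the query above, sketched with Python's `re` module. It assumes REGEXP_MATCH performs an unanchored search the way `re.search` does, which is an assumption about BigQuery's regex semantics and not something the slides state. Under that assumption `[0-9]*` can match the empty string, so every title passes the filter; `[0-9]+` would keep only titles that contain a digit. The sample titles are made up.

```python
import re

# Hypothetical sample titles, including an empty one.
titles = ["Madrid", "1984", "Apollo 11", ""]

# "*" allows zero digits, so an unanchored search succeeds on any string.
star = [t for t in titles if re.search(r"[0-9]*", t) is not None]

# "+" requires at least one digit somewhere in the title.
plus = [t for t in titles if re.search(r"[0-9]+", t) is not None]

print(star)
print(plus)
```

Either way the point of the slide stands: the engine applies the regexp to every row of the scanned column.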

Global Database of Events,
Language and Tone
quarter billion rows
30 years
updated daily
http://gdeltproject.org/data.html#googlebigquery

SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year,
    COUNT(*) Count,
    RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name
       AND Actor1CountryCode != '' AND Actor2CountryCode != ''
       AND Actor1CountryCode != Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name
       AND Actor1CountryCode != '' AND Actor2CountryCode != ''
       AND Actor1CountryCode != Actor2CountryCode),
  WHERE Actor1Name IS NOT null
    AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year

our most active user

10 requests we should be caching

5 most created resources
select uri, count(*) total from
stats where method = 'POST'
group by uri
order by total desc
limit 5;

...but
/users/javier/shouts
/users/rgo/shouts
/teams/javier-community/links
/teams/nosqlmatters-cgn/links
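The URIs above embed user and team identifiers, so grouping by the raw URI gives one bucket per user or team. One possible way around it, sketched here as an assumption and not as the approach used in the talk, is to collapse the identifier segment into a placeholder before counting. The rule that the segment after `users` or `teams` is an identifier, and the `:id` placeholder, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical rule: the path segment after "users" or "teams" is an identifier.
ID_SEGMENT = re.compile(r"^/(users|teams)/[^/]+")


def normalize(uri):
    """Replace the identifier segment with a fixed placeholder."""
    return ID_SEGMENT.sub(r"/\1/:id", uri)


uris = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]

counts = Counter(normalize(uri) for uri in uris)
print(counts.most_common(5))
```

After normalizing, the four URIs fall into two resource types, which is the level at which "most created resources" makes sense.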

5 most created resources

Analysing weather information
Finding patterns in e-commerce
Match online/offline behaviour
Log analysis
Analysing inventory/booking data
...

warning: BigQuery is not
open source and not
free
$26 per stored TB
$5 per processed TB
*the 1st TB processed every month is free of charge
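A small sketch of the arithmetic behind these prices. The three constants come from the slide; treating the storage price as a monthly charge is an assumption, since the slide does not state the billing period, and `monthly_cost` is a hypothetical helper.

```python
# Prices as listed on the slide (USD).
STORAGE_USD_PER_TB = 26
QUERY_USD_PER_TB = 5
FREE_QUERY_TB = 1  # the first TB processed each month is free


def monthly_cost(stored_tb, processed_tb):
    """Estimated bill, assuming storage is charged per TB per month."""
    billable_tb = max(0, processed_tb - FREE_QUERY_TB)
    return stored_tb * STORAGE_USD_PER_TB + billable_tb * QUERY_USD_PER_TB


# 2 TB stored, 3 TB processed: 2 * 26 + (3 - 1) * 5
print(monthly_cost(2, 3))
# Small project that stays inside the free query tier.
print(monthly_cost(0.5, 0.2))
```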

Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks
javier ramirez
@supercoco9
https://teowaki.com
https://datawaki.com
