An in-depth look at Google BigQuery Architecture by Felipe Hoffa of Google

Exploring big data on the cloud
What we know, what is happening, and what we are made of:
Open Data, BigQuery, SQL
Felipe Hoffa
Developer Advocate
@felipehoffa, +FelipeHoffa

Google confidential │ Do not distribute

BIG DATA

1979 - 250 MB

Big Data is
LARGE?

Big Data is
HEAVY?

Big Data is
SLOW?

Big Data is
COMPLEX?

What if Big Data was
Sizeless
Weightless
Fast
Easy

What has more data?
el hrlaPg0sn uotthshok2e nsyr ph r.tosl us.2usd3extg a.
nor h dfe1tpi c.ria28r0se. .Mrrm iidCt8so0hsih0moi n
crld5n igA ti n0cd ew 1l,scu bonopg50gl.ek f
arbAm2endlswF3ioisn,la http .disb n ,tml
,,,........
000000112222335588AACFMPaaaaabbbccccdddddddeeee
eeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnnnnnoo
ooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuuwwxy

What has more data?
,,,........
000000112222335588AACFMPaaaaabbbccccdddddddee
eeeeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnn
nnnooooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuu
wwxy
29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a,
5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f,
2:8,1:M,1:C,1:F,1:x,1:P,1:y

What has more data?
29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a,
5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f,
2:8,1:M,1:C,1:F,1:x,1:P,1:y
Melt 300g chocolate, 3tbsp golden syrup, 125g butter. Add
50g marshmallows, 200g crushed shortbread and
sprinkles. And to finish... Pour into 18x28cm tin. Finish with
sprinkles. Chill for 20 mins.
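
The four "What has more data?" slides show the same characters shuffled, sorted, counted, and finally in their original order. As a rough illustration, the Python sketch below applies the same three transformations to a short, hypothetical sample string (not the slide's full recipe). All three keep the character counts intact; only the original ordering carries the message.

```python
import random
from collections import Counter

# Hypothetical sample text; the slide uses a longer recipe.
message = "Chill for 20 mins."

# Shuffled: same characters, random order (seed fixed for repeatability).
rng = random.Random(0)
shuffled = "".join(rng.sample(message, len(message)))

# Sorted: same characters, in code-point order.
ordered = "".join(sorted(message))

# Counted: "count:char" pairs, most frequent first, like the slide.
counts = Counter(message)
summary = ",".join(f"{n}:{ch}" for ch, n in counts.most_common())

print(shuffled)
print(ordered)
print(summary)
```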

The 3 big steps for
the data revolution

1. Price

Byte Magazine, 1980
https://archive.org/stream/byte-magazine-1980-08/1980_08_B

2. Access

http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

3. Speed

Peter Dutton
https://www.flickr.com/photos/joeshlabotnik/305410327/in/photostream/

?

"Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996

Google - 1998

How Google was built
1. PageRank: A new idea
2. Collect the web
3. Create the technology

[Timeline figure, 2002 to 2013: GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel]

Google BigQuery

Google BigQuery
• Fast: terabytes in seconds
• Simple: SQL
• Scalable: From bytes to petabytes
• No CAPEX: Always on
• Interoperable: Tableau, R, Python...
• Instant sharing
• Free monthly quota

How many pageviews does Wikipedia have in a month?
SELECT SUM(requests)
FROM [fh-bigquery:wikipedia.pagecounts_201412]
bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_201412

How to load data into BigQuery
bq load -F" " --quote "" \
  fh-bigquery:wikipedia.pagecounts_20150828_10 \
  pagecounts-20150828-100000.gz \
  language,title,requests:integer,content_size:integer
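
To make the file layout concrete, here is a minimal Python sketch that reads space-delimited rows with the four columns named in the schema and totals the requests column, which is what SUM(requests) computes. The sample rows and the read_pagecounts helper are hypothetical; this is a local illustration, not how BigQuery ingests the file.

```python
import csv
import io

# Hypothetical sample rows in the space-delimited layout declared by the schema:
# language,title,requests:integer,content_size:integer
sample = io.StringIO(
    "en Main_Page 120 4096\n"
    "en BigQuery 30 2048\n"
    "es Chile 7 1024\n"
)

FIELDS = ["language", "title", "requests", "content_size"]


def read_pagecounts(handle):
    """Yield one dict per line, casting the two integer columns."""
    # QUOTE_NONE mirrors --quote "": quote characters are treated as plain text.
    reader = csv.reader(handle, delimiter=" ", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(FIELDS, row))
        record["requests"] = int(record["requests"])
        record["content_size"] = int(record["content_size"])
        yield record


rows = list(read_pagecounts(sample))
total_requests = sum(r["requests"] for r in rows)  # SELECT SUM(requests)
print(total_requests)
```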

How BigQuery works
Tree Structured Query Dispatch and Aggregation
Mixer 0: COUNT(*) GROUP BY state, ORDER BY count_babies DESC, LIMIT 10; O(50 states)
Mixer 1 (x2): COUNT(*) GROUP BY state; O(50 states)
Leaf (x4): SELECT state, year; WHERE year >= 1980 and year < 1990; COUNT(*) GROUP BY state; O(Rows ~140M)
Distributed Storage

SELECT state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
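
The dispatch tree can be imitated in a few lines of Python. This is a toy sketch under stated assumptions, not BigQuery's implementation: the rows are made up, and leaf, mixer, and root are hypothetical helper names. Each leaf filters and counts its own shard, each mixer merges partial counts, and the root sorts and applies the limit.

```python
from collections import Counter

# Hypothetical rows of (state, year); the slide's table has ~140M rows.
ROWS = [
    ("CA", 1981), ("CA", 1985), ("TX", 1983), ("NY", 1979),
    ("TX", 1989), ("CA", 1990), ("NY", 1984), ("CA", 1982),
]


def leaf(rows):
    """Scan one shard: apply the WHERE filter and count per state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)


def mixer(partials):
    """Merge partial per-state counts coming from the level below."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


def root(partials, limit=10):
    """Final merge, then ORDER BY count DESC and LIMIT."""
    merged = mixer(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Four leaves, two intermediate mixers, one root, as in the diagram.
shards = [ROWS[i::4] for i in range(4)]
leaf_results = [leaf(s) for s in shards]
level1 = [mixer(leaf_results[:2]), mixer(leaf_results[2:])]
result = root(level1)
print(result)
```

Because per-state counts can be added in any grouping, each level only passes up at most one number per state, which is why the mixers handle O(50 states) instead of O(rows).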

Start-ups, multinationals, consultants...
"Streaming data into Google BigQuery with special guest Streak"
"BigQuery, IPython, Pandas and R for data science, starring Pearson"
"Shine with BigQuery: The 30 Terabyte challenge"

Many more
Games and social media analytics
Advertising campaign optimization
Web Logs, machine logs, infrastructure monitoring
POS-Retail analytics
Sensor data analysis
Mobile games, application analytics

https://www.youtube.com/watch?v=GpCarC_I3Ao

https://www.reddit.com/r/bigquery/comments/33bgx9/heatmap_of_24_hour

http://blog.gdeltproject.org/mapping-gdelt-in-cartodb-a-tutorial-part-2/

SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='BR') Brasil
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Brasil)

SUM(ActionGeo_CountryCode='BR')/COUNT(*) Brasil
WHERE MonthYear>0
GDELT: Rows per month (Brasil, normalized)
June 1992
July 2013
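
The normalized variant divides the matching rows by all rows in the same month, presumably so that changes in total event volume do not dominate the chart. A small Python sketch of that arithmetic, using made-up rows and a hypothetical monthly_share helper:

```python
from collections import defaultdict

# Hypothetical (MonthYear, ActionGeo_CountryCode) rows.
events = [
    (199205, "BR"), (199205, "US"), (199205, "US"), (199205, "CI"),
    (199206, "BR"), (199206, "BR"), (199206, "BR"), (199206, "US"),
    (201307, "BR"), (201307, "US"),
]


def monthly_share(rows, country):
    """Return {month: matching rows / all rows}, like SUM(cond)/COUNT(*)."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for month, code in rows:
        totals[month] += 1
        matches[month] += code == country  # True counts as 1, False as 0
    return {m: matches[m] / totals[m] for m in sorted(totals)}


shares = monthly_share(events, "BR")
print(shares)
```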

SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile
WHERE MonthYear>0
GDELT: Rows per month (Chile, normalized)

October 1988
March 2010
October 2010

Live data

http://wikipedia-live-monitor.herokuapp.com/

https://www.reddit.com/r/bigquery/comments/3io7wj/qotd_wikipedia_most_revised_pages_per_day_query

[Timeline figure, 2002 to 2013: GFS, MapReduce, Dremel, Flume, MillWheel; cloud products shown: Cloud Dataflow, BigQuery]

Pipeline is a Directed Acyclic Graph
(DAG) of data flow, just like:
MapReduce/Hadoop jobs
Apache Spark jobs
Multiple Inputs and Outputs
Cloud Storage
BigQuery tables
Cloud Datastore
Cloud Pub/Sub
Cloud Bigtable
… and any external systems
Pipeline

The Auto Complete Example
Tweets
read: #argentina scores, my #art project, watching #armenia vs #argentina
ExtractTags: #argentina #art #armenia #argentina
Count: (argentina, 5M) (art, 9M) (armenia, 2M)
ExpandPrefixes: a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ...
Top(3): a->[apple, art, argentina] ar->[art, argentina, armenia]
write
Predictions
Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from(...))
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.create())
  .apply(ParDo.of(new ExpandPrefixes()))
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to(...));
p.run();
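
The same four steps can be traced locally in plain Python. This is only an illustrative sketch, not the Dataflow SDK: the tweets are invented, and extract_tags, expand_prefixes, and top_per_key are hypothetical stand-ins for the ExtractTags, ExpandPrefixes, and Top(3) stages.

```python
import heapq
import re
from collections import Counter, defaultdict

# Hypothetical input; the slide reads tweets from a text source.
tweets = [
    "#argentina scores",
    "my #art project",
    "watching #armenia vs #argentina",
    "#art again",
    "more #art",
]


def extract_tags(text):
    """ExtractTags: emit each hashtag without the leading '#'."""
    return re.findall(r"#(\w+)", text.lower())


def expand_prefixes(tag, count):
    """ExpandPrefixes: emit (prefix, (tag, count)) for every prefix of tag."""
    for end in range(1, len(tag) + 1):
        yield tag[:end], (tag, count)


def top_per_key(pairs, n):
    """Top(n): keep the n highest-count tags for each prefix."""
    grouped = defaultdict(list)
    for prefix, tag_count in pairs:
        grouped[prefix].append(tag_count)
    return {
        prefix: [tag for tag, _ in heapq.nlargest(n, items, key=lambda tc: tc[1])]
        for prefix, items in grouped.items()
    }


counts = Counter(tag for t in tweets for tag in extract_tags(t))  # Count
pairs = [p for tag, c in counts.items() for p in expand_prefixes(tag, c)]
predictions = top_per_key(pairs, 3)
print(predictions["ar"])
```

Looking up a typed prefix in predictions then returns its suggestions directly, at the cost of storing one entry per prefix of every tag.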

Worker Scaling
800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS

Living cities

http://www.danielforsyth.me/mapping-nyc-taxi-data/

https://www.reddit.com/r/bigquery/comments/3fo9ao/nyc_taxi_trips_now_officially_shared_by_the_nyc/ctt5js

https://www.youtube.com/watch?v=l8MLIfU21pk

https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/

The weather

SELECT TIMESTAMP(year+mo+da) day, min, max, IF(prcp=99.99,0,prcp) prcp
FROM [fh-bigquery:weather_gsod.gsod2014] a
JOIN [fh-bigquery:weather_gsod.stations] b
ON a.wban=b.wban and b.usaf=a.stn
WHERE country='BZ' AND name='RIO DE JANEIRO /AER'
ORDER BY 1
Weather in Rio de Janeiro, 2014
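
Two details in the query are easy to miss: the date is built by concatenating the year, mo, and da columns, and a prcp value of 99.99 is replaced with 0. The Python sketch below mirrors both steps on made-up rows. It assumes the three date columns are strings and that 99.99 marks a missing precipitation reading; the clean helper is hypothetical.

```python
from datetime import date

MISSING_PRCP = 99.99  # assumed sentinel for "no precipitation value"

# Hypothetical rows: year, mo, da as strings, then min, max, prcp.
rows = [
    ("2014", "01", "02", 75.2, 91.4, 0.12),
    ("2014", "01", "01", 73.4, 95.0, 99.99),
]


def clean(row):
    """Build the day and zero out the sentinel precipitation value."""
    year, mo, da, tmin, tmax, prcp = row
    day = date(int(year), int(mo), int(da))      # TIMESTAMP(year+mo+da)
    prcp = 0 if prcp == MISSING_PRCP else prcp   # IF(prcp=99.99,0,prcp)
    return day, tmin, tmax, prcp


cleaned = sorted(clean(r) for r in rows)  # ORDER BY 1
for record in cleaned:
    print(record)
```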

What if Big Data was
Fast
&
Easy

Questions?
News: reddit.com/r/bigquery
Ask: stackoverflow.com
Felipe Hoffa
@felipehoffa +FelipeHoffa
Rate me?
bit.ly/bqfeedback

Freebase: What we know

Free and open. Licensed as CC-BY
Open for anyone to contribute.
A source for Google’s Knowledge Graph
Download the entire graph as RDF
42.9M people, places, and things
2.4B triples about those things
DEC 2014 Update: Migrating to Wikidata

Age distribution
SELECT age, COUNT(*) c
FROM [compute_ages]
WHERE age BETWEEN 1 AND 80
GROUP BY 1
ORDER BY 1

Exploring the Notability Gender Gap

SELECT title, count, iso FROM (
SELECT title, count, c.iso iso, RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank
FROM (
SELECT a.title title, SUM(requests) count, b.person person
FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a
JOIN (
SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person
FROM [fh-bigquery:freebase20140119.triples_nolang] a
JOIN (
SELECT sub FROM [fh-bigquery:freebase20140119.people_gender]
WHERE gender='/m/02zsn') b
ON a.sub=b.sub
WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key'
GROUP BY 1,2) b
ON a.title = b.title
GROUP BY 1,3) a
JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b
ON a.person=b.sub
JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c
ON b.place_of_birth=c.place)
WHERE rank=1 ORDER BY count DESC
http://devnook.github.io/GenderMaps/maplabels/
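
The outer query keeps only rows with rank=1 inside each iso partition, that is, the most viewed page per country. A minimal Python sketch of that "top row per group" step, using invented rows and a hypothetical top_ranked_per_country helper:

```python
# Hypothetical (title, count, iso) rows standing in for the joined subquery.
rows = [
    ("Person_A", 900, "US"), ("Person_B", 400, "US"),
    ("Person_C", 300, "CL"), ("Person_D", 750, "BR"),
    ("Person_E", 750, "BR"),
]


def top_ranked_per_country(rows):
    """Keep rows whose count equals the maximum within their iso group.

    RANK() assigns rank 1 to every tied leader, so ties are all returned.
    """
    best = {}
    for _, count, iso in rows:
        best[iso] = max(best.get(iso, count), count)
    winners = [r for r in rows if r[1] == best[r[2]]]
    # ORDER BY count DESC; the title tiebreak is added here for determinism.
    return sorted(winners, key=lambda r: (-r[1], r[0]))


result = top_ranked_per_country(rows)
print(result)
```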
