Exploring big data on the cloud
What we know, what is happening, and what we are made of:
Open Data, BigQuery, SQL
Felipe Hoffa
Developer Advocate
@felipehoffa, +FelipeHoffa
Google confidential │ Do not distribute
Google confidential │ Do not distribute
BIG DATA
Google confidential │ Do not distribute
1979 - 250 MB
Google confidential │ Do not distribute
Big Data is
LARGE?
Google confidential │ Do not distribute
Big Data is
HEAVY?
Google confidential │ Do not distribute
Big Data is
SLOW?
Google confidential │ Do not distribute
Big Data is
COMPLEX?
Google confidential │ Do not distribute
What if Big Data was
Sizeless
Weightless
Fast
Easy
Google confidential │ Do not distribute
What has more data?
el hrlaPg0sn uotthshok2e nsyr ph r.tosl us.2usd3extg a.
nor h dfe1tpi c.ria28r0se. .Mrrm iidCt8so0hsih0moi n
crld5n igA ti n0cd ew 1l,scu bonopg50gl.ek f
arbAm2endlswF3ioisn,la http .disb n ,tml
,,,........
000000112222335588AACFMPaaaaabbbccccdddddddeeee
eeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnnnnnoo
ooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuuwwxy
Google confidential │ Do not distribute
What has more data?
,,,........
000000112222335588AACFMPaaaaabbbccccdddddddee
eeeeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnn
nnnooooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuu
wwxy
29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a,
5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f,
2:8,1:M,1:C,1:F,1:x,1:P,1:y
Google confidential │ Do not distribute
What has more data?
29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a,
5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f,
2:8,1:M,1:C,1:F,1:x,1:P,1:y
Melt 300g chocolate, 3tbsp golden syrup, 125g butter. Add
50g marshmallows, 200g crushed shortbread and
sprinkles. And to finish... Pour into 18x28cm tin. Finish with
sprinkles. Chill for 20 mins.
Google confidential │ Do not distribute
The 3 big steps for
the data revolution
Google confidential │ Do not distribute
1. Price
Google confidential │ Do not distribute
Byte Magazine, 1980
https://archive.org/stream/byte-magazine-1980-08/1980_08_B
Google confidential │ Do not distribute
2. Access
Google confidential │ Do not distribute
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
Google confidential │ Do not distribute
3. Speed
Google confidential │ Do not distribute
Google confidential │ Do not distribute
Peter Dutton
https://www.flickr.com/photos/joeshlabotnik/305410327/in/photostream/
Google confidential │ Do not distribute
?
Search in 1996
Story time
"Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
"Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
Google - 1998
How Google was built
1. PageRank: A new idea
2. Collect the web
3. Create the technology
Idea
Data Tech
Google confidential │ Do not distribute
SpannerDremelMapReduce
Big Table Colossus
2012 20132002 2004 2006 2008 2010
GFS MillWheel
Flume
Google confidential │ Do not distribute
SpannerDremelMapReduce
Big Table Colossus
2012 20132002 2004 2006 2008 2010
GFS MillWheel
Flume
Google confidential │ Do not distribute
SpannerDremelMapReduce
Big Table Colossus
2012 20132002 2004 2006 2008 2010
GFS MillWheel
Flume
Google confidential │ Do not distribute
Google BigQuery
Google BigQuery
• Fast: terabytes in seconds
• Simple: SQL
• Scaleable: From bytes to petabytes
• No CAPEX: Always on
• Interoperable: Tableau, R, Python...
• Instant sharing
• Free monthly quota
Google confidential │ Do not distribute
How many pageviews does Wikipedia
have in a month?
SELECT SUM(requests)FROM
[fh-bigquery:wikipedia.pagecounts_201412]
bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_201412
Google confidential │ Do not distribute
How to load data into BigQuery
bq load -F" " --quote "" fh-bigquery:wikipedia.
pagecounts_20150828_10 pagecounts-20150828-100000.gz
language,title,requests:integer,content_size:integer
How BigQuery works
Tree Structured Query Dispatch and Aggregation
Distributed Storage
SELECT state, year
Leaf Leaf Leaf Leaf
O(Rows ~140M)
COUNT(*)
GROUP BY state
WHERE year >= 1980 and year < 1990
Mixer 1 Mixer 1
O(50 states)
COUNT(*)
GROUP BY state
Mixer 0
O(50 states)
LIMIT 10
ORDER BY count_babies DESC
COUNT(*)
GROUP BY state SELECT
state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
Google confidential │ Do not distribute
Start-ups, multinationals, consultants...
"Streaming data into Google BigQuery with
special guest Streak"
"BigQuery, IPython, Pandas and R for data
science, starring Pearson"
"Shine with BigQuery: The 30
Terabyte challenge"
Many more
Games and social media analytics
Advertising campaign optimization
Web Logs, machine logs, infrastructure monitoring
POS-Retail analytics
Sensor data analysis
Mobile games, application analytics
https://www.youtube.com/watch?v=GpCarC_I3Ao
https://www.youtube.com/watch?v=GpCarC_I3Ao
https://www.reddit.
com/r/bigquery/comments/33bgx9/heatmap_of_24_hour
http://blog.gdeltproject.org/mapping-gdelt-in-cartodb-a-tutorial-part-2/
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='BR') Brasil
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Brasil)
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='BR') Brasil
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Brasil)
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='BR')/COUNT(*) Brasil
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Brasil, normalized)
June 1992
July 2013
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Chile, normalized)
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Chile, normalized)
October 1988
March 2010
October 2010
Google confidential │ Do not distribute
Live data
Google confidential │ Do not distribute
http://wikipedia-live-monitor.herokuapp.com/
Google confidential │ Do not distributehttps://www.reddit.com/r/bigquery/comments/3io7wj/qotd_wikipedia_most_revised_pages_per_day_query
Dremel
MapReduce
2012 20132002 2004 2006 2008 2010
GFS
MillWheelFlume
Cloud Dataflow
Cloud Dataflow
BigQuery
Pipeline is a Directed Acyclic Graph
(DAG) of data flow, just like:
MapReduce/Hadoop jobs
Apache Spark jobs
Multiple Inputs and Outputs
Cloud Storage
BigQuery tables
Cloud Datastore
Cloud Pub/Sub
Cloud Bigtable
… and any external systems
Pipeline
The Auto Complete Example
Tweets
Predictions
read #argentina scores, my #art project,
watching #armenia vs #argentina
ExtractTags #argentina #art #armenia #argentina
Count (argentina, 5M) (art, 9M) (armenia, 2M)
ExpandPrefixes
a->(argentina,5M) ar->(argentina,5M)
arg->(argentina,5M) ar->(art, 9M) ...
Top(3)
write
a->[apple, art, argentina]
ar->[art, argentina, armenia]
.apply(TextIO.Read.from(...))
.apply(ParDo.of(new ExtractTags()))
.apply(Count.create())
.apply(ParDo.of(new ExpandPrefixes())
.apply(Top.largestPerKey(3)
Pipeline p = Pipeline.create();
p.begin();
.apply(TextIO.Write.to(...));
p.run()
800 RPS 1,200 RPS 5,000 RPS 50 RPS
Worker Scaling
Real-time Monitoring UI
Google confidential │ Do not distribute
Living cities
Google confidential │ Do not distributehttp://www.danielforsyth.me/mapping-nyc-taxi-data/
Google confidential │ Do not distribute
Google confidential │ Do not distributehttps://www.reddit.com/r/bigquery/comments/3fo9ao/nyc_taxi_trips_now_officially_shared_by_the_nyc/ctt5js
Google confidential │ Do not distributehttps://www.youtube.com/watch?v=l8MLIfU21pk
Google confidential │ Do not distribute
Google confidential │ Do not distribute
Reddit
Google confidential │ Do not distributehttps://www.youtube.com/watch?v=l8MLIfU21pk
Google confidential │ Do not distribute
Google confidential │ Do not distribute
Google confidential │ Do not distributehttps://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/
Google confidential │ Do not distribute
The weather
SELECT TIMESTAMP(year+mo+da) day, min, max, IF(prcp=99.99,0,prcp) prcp
FROM [fh-bigquery:weather_gsod.gsod2014] a
JOIN [fh-bigquery:weather_gsod.stations] b
ON a.wban=b.wban and b.usaf=a.stn
WHERE country='BZ' AND name='RIO DE JANEIRO /AER'
ORDER BY 1
Weather in Rio de Janeiro, 2014
Google confidential │ Do not distribute
What if Big Data was
Fast
&
Easy
Questions?
News: reddit.com/r/bigquery
Ask: stackoverflow.com
Felipe Hoffa
@felipehoffa +FelipeHoffa
Rate me?
bit.ly/bqfeedback
Google confidential │ Do not distribute
Freebase: What we know
Free and open. Licensed as CC-BY
Open for anyone to contribute.
A source for Google’s Knowledge Graph
Download the entire graph as RDF
42.9M people places and things
2.4B triples about those things
DEC 2014 Update: Migrating to Wikidata
Age distribution
SELECT age, COUNT(*) c
FROM [compute_ages ]
WHERE age
BETWEEN 1 AND 80
GROUP BY 1
ORDER BY 1
Exploring the Notability Gender Gap
SELECT title, count, iso FROM (
SELECT title, count, c.iso iso, RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank
FROM (
SELECT a.title title, SUM(requests) count, b.person person
FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a
JOIN (
SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person
FROM [fh-bigquery:freebase20140119.triples_nolang] a
JOIN (
SELECT sub FROM [fh-bigquery:freebase20140119.people_gender]
WHERE gender='/m/02zsn') b
ON a.sub=b.sub
WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key'
GROUP BY 1,2) b
ON a.title = b.title
GROUP BY 1,3) a
JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b
ON a.person=b.sub
JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c
ON b.place_of_birth=c.place)
WHERE rank=1 ORDER BY count DESC
http://devnook.github.io/GenderMaps/maplabels/

An indepth look at Google BigQuery Architecture by Felipe Hoffa of Google

  • 1.
    Exploring big dataon the cloud What we know, what is happening, and what we are made of: Open Data, BigQuery, SQL Felipe Hoffa Developer Advocate @felipehoffa, +FelipeHoffa
  • 2.
    Google confidential │Do not distribute
  • 3.
    Google confidential │Do not distribute BIG DATA
  • 4.
    Google confidential │Do not distribute 1979 - 250 MB
  • 5.
    Google confidential │Do not distribute Big Data is LARGE?
  • 6.
    Google confidential │Do not distribute Big Data is HEAVY?
  • 7.
    Google confidential │Do not distribute Big Data is SLOW?
  • 8.
    Google confidential │Do not distribute Big Data is COMPLEX?
  • 9.
    Google confidential │Do not distribute What if Big Data was Sizeless Weightless Fast Easy
  • 10.
    Google confidential │Do not distribute What has more data? el hrlaPg0sn uotthshok2e nsyr ph r.tosl us.2usd3extg a. nor h dfe1tpi c.ria28r0se. .Mrrm iidCt8so0hsih0moi n crld5n igA ti n0cd ew 1l,scu bonopg50gl.ek f arbAm2endlswF3ioisn,la http .disb n ,tml ,,,........ 000000112222335588AACFMPaaaaabbbccccdddddddeeee eeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnnnnnoo ooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuuwwxy
  • 11.
    Google confidential │Do not distribute What has more data? ,,,........ 000000112222335588AACFMPaaaaabbbccccdddddddee eeeeeeffggggghhhhhhhhiiiiiiiiiiikklllllllllmmmmnnnnnnn nnnooooooooopppprrrrrrrrrrsssssssssssssttttttttttuuuu wwxy 29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a, 5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f, 2:8,1:M,1:C,1:F,1:x,1:P,1:y
  • 12.
    Google confidential │Do not distribute What has more data? 29: ,13:s,11:i,10:t,10:r,10:n,9:l,9:o,8:e,8:.,8:h,7:d,6:0,5:a, 5:g,4:m,4:p,4:c,4:2,4:u,3:b,3:,,2:1,2:5,2:A,2:w,2:3,2:k,2:f, 2:8,1:M,1:C,1:F,1:x,1:P,1:y Melt 300g chocolate, 3tbsp golden syrup, 125g butter. Add 50g marshmallows, 200g crushed shortbread and sprinkles. And to finish... Pour into 18x28cm tin. Finish with sprinkles. Chill for 20 mins.
  • 13.
    Google confidential │Do not distribute The 3 big steps for the data revolution
  • 14.
    Google confidential │Do not distribute 1. Price
  • 15.
    Google confidential │Do not distribute Byte Magazine, 1980 https://archive.org/stream/byte-magazine-1980-08/1980_08_B
  • 16.
    Google confidential │Do not distribute 2. Access
  • 17.
    Google confidential │Do not distribute http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
  • 18.
    Google confidential │Do not distribute 3. Speed
  • 19.
    Google confidential │Do not distribute
  • 20.
    Google confidential │Do not distribute Peter Dutton https://www.flickr.com/photos/joeshlabotnik/305410327/in/photostream/
  • 21.
    Google confidential │Do not distribute ?
  • 22.
  • 23.
    "Search is solved"-- 1996 Excite – Born in 1993 Yahoo! - Born in 1994 WebCrawler – Born in 1994 Lycos – Born in 1994 Infoseek – Born in 1994 AltaVista – Born in 1995 Inktomi – Born in 1996
  • 24.
    "Search is solved"-- 1996 Excite – Born in 1993 Yahoo! - Born in 1994 WebCrawler – Born in 1994 Lycos – Born in 1994 Infoseek – Born in 1994 AltaVista – Born in 1995 Inktomi – Born in 1996 Google - 1998
  • 26.
    How Google wasbuilt 1. PageRank: A new idea 2. Collect the web 3. Create the technology
  • 27.
  • 28.
    Google confidential │Do not distribute SpannerDremelMapReduce Big Table Colossus 2012 20132002 2004 2006 2008 2010 GFS MillWheel Flume
  • 29.
    Google confidential │Do not distribute SpannerDremelMapReduce Big Table Colossus 2012 20132002 2004 2006 2008 2010 GFS MillWheel Flume
  • 30.
    Google confidential │Do not distribute SpannerDremelMapReduce Big Table Colossus 2012 20132002 2004 2006 2008 2010 GFS MillWheel Flume
  • 31.
    Google confidential │Do not distribute Google BigQuery
  • 32.
    Google BigQuery • Fast:terabytes in seconds • Simple: SQL • Scaleable: From bytes to petabytes • No CAPEX: Always on • Interoperable: Tableau, R, Python... • Instant sharing • Free monthly quota
  • 33.
    Google confidential │Do not distribute How many pageviews does Wikipedia have in a month? SELECT SUM(requests)FROM [fh-bigquery:wikipedia.pagecounts_201412] bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_201412
  • 34.
    Google confidential │Do not distribute How to load data into BigQuery bq load -F" " --quote "" fh-bigquery:wikipedia. pagecounts_20150828_10 pagecounts-20150828-100000.gz language,title,requests:integer,content_size:integer
  • 35.
    How BigQuery works TreeStructured Query Dispatch and Aggregation Distributed Storage SELECT state, year Leaf Leaf Leaf Leaf O(Rows ~140M) COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990 Mixer 1 Mixer 1 O(50 states) COUNT(*) GROUP BY state Mixer 0 O(50 states) LIMIT 10 ORDER BY count_babies DESC COUNT(*) GROUP BY state SELECT state, COUNT(*) count_babies FROM [publicdata:samples.natality] WHERE year >= 1980 AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10
  • 36.
    Google confidential │Do not distribute Start-ups, multinationals, consultants... "Streaming data into Google BigQuery with special guest Streak" "BigQuery, IPython, Pandas and R for data science, starring Pearson" "Shine with BigQuery: The 30 Terabyte challenge"
  • 37.
    Many more Games andsocial media analytics Advertising campaign optimization Web Logs, machine logs, infrastructure monitoring POS-Retail analytics Sensor data analysis Mobile games, application analytics
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
    SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='BR')Brasil FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Brasil)
  • 43.
    SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='BR')Brasil FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Brasil)
  • 44.
    SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='BR')/COUNT(*)Brasil FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Brasil, normalized) June 1992 July 2013
  • 45.
    SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='CI')/COUNT(*)Chile FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Chile, normalized)
  • 46.
    SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='CI')/COUNT(*)Chile FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Chile, normalized) October 1988 March 2010 October 2010
  • 47.
    Google confidential │Do not distribute Live data
  • 48.
    Google confidential │Do not distribute http://wikipedia-live-monitor.herokuapp.com/
  • 49.
    Google confidential │Do not distributehttps://www.reddit.com/r/bigquery/comments/3io7wj/qotd_wikipedia_most_revised_pages_per_day_query
  • 50.
    Dremel MapReduce 2012 20132002 20042006 2008 2010 GFS MillWheelFlume Cloud Dataflow Cloud Dataflow BigQuery
  • 51.
    Pipeline is aDirected Acyclic Graph (DAG) of data flow, just like: MapReduce/Hadoop jobs Apache Spark jobs Multiple Inputs and Outputs Cloud Storage BigQuery tables Cloud Datastore Cloud Pub/Sub Cloud Bigtable … and any external systems Pipeline
  • 52.
    The Auto CompleteExample Tweets Predictions read #argentina scores, my #art project, watching #armenia vs #argentina ExtractTags #argentina #art #armenia #argentina Count (argentina, 5M) (art, 9M) (armenia, 2M) ExpandPrefixes a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art, 9M) ... Top(3) write a->[apple, art, argentina] ar->[art, argentina, armenia] .apply(TextIO.Read.from(...)) .apply(ParDo.of(new ExtractTags())) .apply(Count.create()) .apply(ParDo.of(new ExpandPrefixes()) .apply(Top.largestPerKey(3) Pipeline p = Pipeline.create(); p.begin(); .apply(TextIO.Write.to(...)); p.run()
  • 53.
    800 RPS 1,200RPS 5,000 RPS 50 RPS Worker Scaling
  • 54.
  • 55.
    Google confidential │Do not distribute Living cities
  • 56.
    Google confidential │Do not distributehttp://www.danielforsyth.me/mapping-nyc-taxi-data/
  • 57.
    Google confidential │Do not distribute
  • 58.
    Google confidential │Do not distributehttps://www.reddit.com/r/bigquery/comments/3fo9ao/nyc_taxi_trips_now_officially_shared_by_the_nyc/ctt5js
  • 59.
    Google confidential │Do not distributehttps://www.youtube.com/watch?v=l8MLIfU21pk
  • 60.
    Google confidential │Do not distribute
  • 61.
    Google confidential │Do not distribute Reddit
  • 62.
    Google confidential │Do not distributehttps://www.youtube.com/watch?v=l8MLIfU21pk
  • 63.
    Google confidential │Do not distribute
  • 64.
    Google confidential │Do not distribute
  • 65.
    Google confidential │Do not distributehttps://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/
  • 66.
    Google confidential │Do not distribute The weather
  • 67.
    SELECT TIMESTAMP(year+mo+da) day,min, max, IF(prcp=99.99,0,prcp) prcp FROM [fh-bigquery:weather_gsod.gsod2014] a JOIN [fh-bigquery:weather_gsod.stations] b ON a.wban=b.wban and b.usaf=a.stn WHERE country='BZ' AND name='RIO DE JANEIRO /AER' ORDER BY 1 Weather in Rio de Janeiro, 2014
  • 68.
    Google confidential │Do not distribute What if Big Data was Fast & Easy
  • 69.
    Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com FelipeHoffa @felipehoffa +FelipeHoffa Rate me? bit.ly/bqfeedback
  • 70.
    Google confidential │Do not distribute Freebase: What we know
  • 71.
    Free and open.Licensed as CC-BY Open for anyone to contribute. A source for Google’s Knowledge Graph Download the entire graph as RDF 42.9M people places and things 2.4B triples about those things DEC 2014 Update: Migrating to Wikidata
  • 72.
    Age distribution SELECT age,COUNT(*) c FROM [compute_ages ] WHERE age BETWEEN 1 AND 80 GROUP BY 1 ORDER BY 1
  • 73.
  • 74.
    SELECT title, count,iso FROM ( SELECT title, count, c.iso iso, RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank FROM ( SELECT a.title title, SUM(requests) count, b.person person FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a JOIN ( SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person FROM [fh-bigquery:freebase20140119.triples_nolang] a JOIN ( SELECT sub FROM [fh-bigquery:freebase20140119.people_gender] WHERE gender='/m/02zsn') b ON a.sub=b.sub WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key' GROUP BY 1,2) b ON a.title = b.title GROUP BY 1,3) a JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b ON a.person=b.sub JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c ON b.place_of_birth=c.place) WHERE rank=1 ORDER BY count DESC http://devnook.github.io/GenderMaps/maplabels/