Beyond open data: empowering citizens to understand their cities
TICTeC 2016, Barcelona
Felipe Hoffa
Developer Advocate
@felipehoffa
Google confidential │ Do not distribute
The 3 big steps for the data revolution
1. Price
Byte Magazine, 1980
https://archive.org/stream/byte-magazine-1980-08/1980_08_B
2. Access
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
3. Speed
Peter Dutton
https://www.flickr.com/photos/joeshlabotnik/305410327/in/photostream/
?
"Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
Google - 1998
How Google was built
1. PageRank: A new idea
2. Collect the web
3. Create the technology
Data based start-ups
Idea
Data Tech
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
The Global Open Data Index
http://index.okfn.org/
[Timeline figure, 2002-2013: GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel]
Google BigQuery
BigQuery
• Fast: terabytes in seconds
• Simple: SQL
• Scalable: From bytes to petabytes
• No CAPEX: Always on
• Interoperable: Tableau, R, Python...
• Instant sharing
• Free monthly quota
How many pageviews does Wikipedia have in a month?
SELECT COUNT(*)
FROM [fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2323
--2013-06-27 22:13:21-- http://datos.gob.cl/recursos/download/2323
Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.001 [following]
--2013-06-27 22:13:22-- http://www.ine.cl/openData/censo2012/persona.sav.gz.001
Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94371840 (90M) [application/x-gzip]
Saving to: `2323'
100%[==========================================================================================================>] 94,371,840 7.69M/s in 22s
2013-06-27 22:13:44 (4.15 MB/s) - `2323' saved [94371840/94371840]
hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2324
--2013-06-27 22:14:02-- http://datos.gob.cl/recursos/download/2324
Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.002 [following]
--2013-06-27 22:14:03-- http://www.ine.cl/openData/censo2012/persona.sav.gz.002
Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94371840 (90M) [application/x-gzip]
Saving to: `2324'
100%[==========================================================================================================>] 94,371,840 9.08M/s in 29s
2013-06-27 22:14:33 (3.11 MB/s) - `2324' saved [94371840/94371840]
hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2325
--2013-06-27 22:14:37-- http://datos.gob.cl/recursos/download/2325
Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.003 [following]
--2013-06-27 22:14:37-- http://www.ine.cl/openData/censo2012/persona.sav.gz.003
Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9221838 (8.8M) [application/x-gzip]
Saving to: `2325'
100%[==========================================================================================================>] 9,221,838 2.27M/s in 5.8s
2013-06-27 22:14:44 (1.53 MB/s) - `2325' saved [9221838/9221838]
22 + 29 + 5 seconds
hoffa@hoffa:~/census$ ls -sh
total 189M
91M 2323
91M 2324
8.8M 2325
hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
10 seconds (on a very fast multi-core, solid-state computer)
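Why does `cat 232* | gunzip` work? A minimal sketch, assuming the three downloads are consecutive byte ranges of a single gzip stream (the `.gz.001`, `.gz.002`, `.gz.003` names suggest this, but the slides don't say so). The payload and part size below are made up for illustration.

```python
import gzip

# Made-up payload standing in for the census file.
original = b"persona;P17;P18\n" * 1000

# One gzip stream, cut into fixed-size parts (like .gz.001, .gz.002, ...).
compressed = gzip.compress(original)
part_size = 16  # arbitrary, small enough to force several parts
parts = [compressed[i:i + part_size] for i in range(0, len(compressed), part_size)]

# Equivalent of `cat part1 part2 ... | gunzip`: join the bytes, then decompress.
restored = gzip.decompress(b"".join(parts))
print(len(parts), restored == original)
```

Under that assumption no single part is a valid gzip file on its own; only the concatenation in the original order decompresses.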
hoffa@hoffa:~/census$ ls -sh persona.sav
1.1G persona.sav
hoffa@hoffa:~/census$ file persona.sav
persona.sav: SPSS System File TICS DATA FILE MS Windows 20.0.0
002
14 minutes (+ lots of research and debugging)
hoffa@hoffa:~/census$ R
R version 2.14.2 (2012-02-29)
...
> options(max.print=10)
> library(foreign)
> census <- read.spss('persona.sav', reencode='utf-8')
re-encoding from utf-8
> library(ff)
Loading package ff2.2-7
> for(i in ls(census)) {print(write.csv(census[i], file=paste('x',i,'.csv', sep='')))}
...
hoffa@hoffa:~/census$ ls -sh persona
total 6.6G
177M xHN.csv
179M xP17.csv (relationship to house owner)
177M xP18.csv (gender)
189M xP19.csv (age)
...
...
178M xP35.csv (# of alive offspring)
180M xP36A.csv (birth month)
191M xP36B.csv (birth year)
37 ~180 MB files
hoffa@hoffa:~/census$ python merge_csv.py persona/*.csv > output/persona.csv
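# merge_csv.py (Python 3): paste column 1 of every input CSV side by side
import csv
import sys

input_files = sys.argv[1:]
files = [open(x, newline='') for x in input_files]
files_csv = [csv.reader(x) for x in files]
writer = csv.writer(sys.stdout)
# zip() stops cleanly when the shortest input file is exhausted
for rows in zip(*files_csv):
    row = [r[1] for r in rows]  # column 0 holds the row name written by R
    row = [(x if x.isdigit() else '') for x in row]  # 1-line ETL: keep numeric codes only
    writer.writerow(row)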
hoffa@hoffa:~/census$ ls -sh output/persona.csv
1.4G output/persona.csv
6 minutes (+ coding)
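The merge step can be tried on toy data. The sketch below is not the talk's script; it applies the same idea (take the second column of each per-variable CSV, blank out anything non-numeric) to in-memory files. The helper name `merge_columns` and the sample values are hypothetical.

```python
import csv
import io


def merge_columns(sources):
    """Zip per-variable CSVs row by row, keeping column 1 of each.

    Non-numeric cells (headers, labels, missing values) become '',
    mirroring the one-line ETL in the merge script.
    """
    readers = [csv.reader(src) for src in sources]
    merged = []
    for rows in zip(*readers):  # stops at the shortest input
        values = [r[1] for r in rows]
        merged.append([v if v.isdigit() else '' for v in values])
    return merged


# Hypothetical stand-ins for two of the per-variable files written by R.
gender = io.StringIO('"","P18"\n"1","2"\n"2","1"\n')
age = io.StringIO('"","P19"\n"1","34"\n"2","NA"\n')

result = merge_columns([gender, age])
print(result)
```

Note that the header row is blanked too, since column names are not digits; the combined CSV therefore carries no usable header and the schema has to be supplied at import time.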
From data discovery to CSV
- Download data: 1 minute
- Decompress data: 10 seconds (+ figure it out)
- Transform it to CSV: 14 minutes (+ learn R)
- Combine in one CSV: 6 minutes (+ Python)
= ~ 22 minutes (+ a lot of work)
What's next?
From CSV to accessible data
- Spreadsheet? (max 65,536 rows)
- Write code? (HD and RAM bounded)
(7 seconds to run in memory operations)
- MySQL? (SQL is easier, same bounds)
- Hadoop? (What's a cluster?)
- BigQuery?
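For comparison, the "write code" option in the list above might look like the following. This is a rough sketch, not code from the talk: it streams the CSV row by row so memory stays flat, but it is still bounded by one machine's disk and CPU. The function name, the column position and the sample rows are assumptions.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Stream rows and tally one column without loading the file in memory."""
    counts = Counter()
    for row in csv.reader(lines):
        value = row[column]
        if value:  # skip cells blanked out by the ETL step
            counts[value] += 1
    return counts


# Hypothetical three-column sample; a real run would pass open('output/persona.csv').
sample = io.StringIO("1,34,1\n2,51,2\n1,,1\n2,8,9\n")
counts = count_by_column(sample, column=2)
print(counts.most_common())
```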
Upload to Google
hoffa@hoffa:~/censo$ gzip -c output/persona.csv | gsutil cp - gs://io13-hoffa/persona.csv.gz
Copying from <STDIN> [Content-Type=application/octet-stream]...
1 minute
Import to BigQuery via web UI
3-5 minutes
http://bigquery.cloud.google.com/
Ready to query!
Religions in Chile
SELECT COUNT(*) AS COUNT,
p28 AS religion
FROM
[data-sensing-lab:hoffa.persona]
WHERE p28 != 0
GROUP BY religion
ORDER BY COUNT DESC
COUNT religion
7853428 1 (Catholicism)
1699725 2 (Protestantism)
931990 9 (None, atheist, agnostic)
493147 8 (Other)
119455 3 (Jehovah's Witnesses)
103735 5 (Mormon)
14976 4 (Judaism)
6959 7 (Orthodox)
2894 6 (Muslim)
1.5 seconds
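The raw counts are easier to read as shares. A small sketch using only the numbers from the result above; the denominator is the set of rows with a non-zero `p28` code, as in the query's WHERE clause, not the whole population.

```python
# Counts copied from the query result above, keyed by the labels shown there.
counts = {
    "Catholicism": 7853428,
    "Protestantism": 1699725,
    "None, atheist, agnostic": 931990,
    "Other": 493147,
    "Jehovah's Witnesses": 119455,
    "Mormon": 103735,
    "Judaism": 14976,
    "Orthodox": 6959,
    "Muslim": 2894,
}

total = sum(counts.values())  # rows with a non-zero p28 code
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<26}{pct:6.2f}%")
```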
Avg children per religion in Chile
SELECT
p28 AS religion,
AVG(p34) AS avg_children
FROM
[data-sensing-lab:hoffa.persona]
WHERE p28 != 0
GROUP BY religion
ORDER BY avg_children DESC
religion avg_children
3 1.48 (Jehovah's Witnesses)
2 1.41 (Protestantism)
5 1.19 (Mormon)
1 1.14 (Catholicism)
4 0.94 (Judaism)
7 0.89 (Orthodox)
9 0.59 (None, atheist, agnostic)
8 0.58 (Other)
6 0.56 (Muslim)
1.5 seconds
Avg children per mother occupation
COUNT work avg_children
313098 2 1.90 (Domestic service)
5463709 0 1.71 (Non working)
78647 5 0.76 (Working for family, non remunerated)
903566 3 0.56 (Independent worker)
244137 4 0.54 (Business owner)
4223152 1 0.47 (Employee)
1.5 seconds
catalogo.datos.gob.mx/dataset/nacimientos-ocurridos
2008-2012 NYC Taxi: cash vs credit
GDELT: What is happening
https://www.youtube.com/watch?v=GpCarC_I3Ao
https://www.reddit.com/r/bigquery/comments/33bgx9/heatmap_of_24_hour
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='IT') Italy
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Italy)
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='IT')/COUNT(*) Italy
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Italy, normalized)
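The normalized query sums a boolean condition and divides by the row count, which gives the country's share of all events in each month rather than a raw count that also moves with overall volume. A sketch of the same arithmetic, with hypothetical rows rather than real GDELT data:

```python
from collections import defaultdict

# Hypothetical (MonthYear, ActionGeo_CountryCode) pairs, not real GDELT rows.
events = [
    (198510, "IT"), (198510, "US"), (198510, "IT"), (198510, "FR"),
    (200107, "IT"), (200107, "US"),
]

matches = defaultdict(int)  # SUM(ActionGeo_CountryCode='IT') per month
totals = defaultdict(int)   # COUNT(*) per month
for month, country in events:
    matches[month] += country == "IT"  # True counts as 1, False as 0
    totals[month] += 1

share = {month: matches[month] / totals[month] for month in sorted(totals)}
print(share)
```

In this toy sample the raw count falls from 2 to 1 between the two months while the share stays at one half, which is the kind of distortion the normalization removes.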
October 1985
July 2001
SELECT TIMESTAMP(STRING(MonthYear)+'01') month,
SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile
FROM [gdelt-bq:full.events]
WHERE MonthYear>0
GROUP BY 1 ORDER BY 1
GDELT: Rows per month (Chile, normalized)
October 1988
March 2010
October 2010
Weather data: Power to predict
Data empowerment
Ideas
Data Tech
Questions?
News: reddit.com/r/bigquery
Ask: stackoverflow.com
Felipe Hoffa
@felipehoffa
Rate me?
bit.ly/bqfeedback
End
Twitter: @felipehoffa
G+: +FelipeHoffa
Encore demo
A big join: Freebase + Wikipedia fresh logs
Exploring the Notability Gender Gap
SELECT title, count, iso FROM (
SELECT title, count, c.iso iso, RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank
FROM (
SELECT a.title title, SUM(requests) count, b.person person
FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a
JOIN (
SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person
FROM [fh-bigquery:freebase20140119.triples_nolang] a
JOIN (
SELECT sub FROM [fh-bigquery:freebase20140119.people_gender]
WHERE gender='/m/02zsn') b
ON a.sub=b.sub
WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key'
GROUP BY 1,2) b
ON a.title = b.title
GROUP BY 1,3) a
JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b
ON a.person=b.sub
JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c
ON b.place_of_birth=c.place)
WHERE rank=1 ORDER BY count DESC
http://devnook.github.io/GenderMaps/maplabels/
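The outer part of the query keeps the top-ranked title per country: `RANK() OVER (PARTITION BY iso ORDER BY count DESC)` followed by `WHERE rank=1`. A minimal sketch of that step alone, with hypothetical rows standing in for the joined Freebase and Wikipedia data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (title, count, iso) rows; not taken from the real tables.
rows = [
    ("Person_A", 120, "CL"),
    ("Person_B", 300, "CL"),
    ("Person_C", 80, "ES"),
    ("Person_D", 80, "ES"),
    ("Person_E", 50, "ES"),
]

top = []
rows_by_iso = sorted(rows, key=itemgetter(2))  # PARTITION BY iso
for iso, group in groupby(rows_by_iso, key=itemgetter(2)):
    group = list(group)
    best = max(count for _, count, _ in group)
    # RANK() gives every row tied for the highest count rank 1.
    top.extend(r for r in group if r[1] == best)

top.sort(key=itemgetter(1), reverse=True)  # ORDER BY count DESC
print(top)
```

Because RANK() assigns the same rank to ties, a country can contribute more than one row, as the two tied "ES" entries do here.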
How to last in civic tech (especially now) - Matthew Stempeck & Micah L. Sifr...
 
Future of tech and democracy at the city of Reykjavík - Sigurlaug Anna Jóhann...
Future of tech and democracy at the city of Reykjavík - Sigurlaug Anna Jóhann...Future of tech and democracy at the city of Reykjavík - Sigurlaug Anna Jóhann...
Future of tech and democracy at the city of Reykjavík - Sigurlaug Anna Jóhann...
 
Lessons learned from building democracy’s database - Stacy Henderson (Cicero,...
Lessons learned from building democracy’s database - Stacy Henderson (Cicero,...Lessons learned from building democracy’s database - Stacy Henderson (Cicero,...
Lessons learned from building democracy’s database - Stacy Henderson (Cicero,...
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Beyond open data: empowering citizens to understand their cities

  • 1. Beyond open data: empowering citizens to understand their cities TICTeC 2016, Barcelona Felipe Hoffa Developer Advocate @felipehoffa
  • 2. Google confidential │ Do not distribute
  • 3. The 3 big steps for the data revolution
  • 4. 1. Price
  • 5. Byte Magazine, 1980 https://archive.org/stream/byte-magazine-1980-08/1980_08_B
  • 6. 2. Access
  • 7. http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
  • 8. 3. Speed
  • 9.
  • 10. Peter Dutton https://www.flickr.com/photos/joeshlabotnik/305410327/in/photostream/
  • 11. ?
  • 12. "Search is solved" -- 1996 Excite – Born in 1993 Yahoo! - Born in 1994 WebCrawler – Born in 1994 Lycos – Born in 1994 Infoseek – Born in 1994 AltaVista – Born in 1995 Inktomi – Born in 1996
  • 13. "Search is solved" -- 1996 Excite – Born in 1993 Yahoo! - Born in 1994 WebCrawler – Born in 1994 Lycos – Born in 1994 Infoseek – Born in 1994 AltaVista – Born in 1995 Inktomi – Born in 1996 Google - 1998
  • 14.
  • 15. How Google was built 1. PageRank: A new idea 2. Collect the web 3. Create the technology
  • 17. http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
  • 18. The Global Open Data Index http://index.okfn.org/
  • 19.
  • 20. Timeline figure (axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013): GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel
  • 21. Timeline figure (axis: 2002, 2004, 2006, 2008, 2010, 2012, 2013): GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Spanner, MillWheel
  • 22. Google BigQuery
  • 23. BigQuery • Fast: terabytes in seconds • Simple: SQL • Scalable: From bytes to petabytes • No CAPEX: Always on • Interoperable: Tableau, R, Python... • Instant sharing • Free monthly quota
  • 24. How many pageviews does Wikipedia have in a month? SELECT COUNT(*) FROM [fh-bigquery:wikipedia.wikipedia_views_201308] https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
  • 25.
  • 26.
  • 27.
  • 29. hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2323
    --2013-06-27 22:13:21-- http://datos.gob.cl/recursos/download/2323
    Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
    Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
    HTTP request sent, awaiting response... 302 Moved Temporarily
    Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.001 [following]
    --2013-06-27 22:13:22-- http://www.ine.cl/openData/censo2012/persona.sav.gz.001
    Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
    Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 94371840 (90M) [application/x-gzip]
    Saving to: `2323'
    100%[===========================>] 94,371,840 7.69M/s in 22s
    22 seconds
  • 30. hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2323
    --2013-06-27 22:13:21-- http://datos.gob.cl/recursos/download/2323
    Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
    Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
    HTTP request sent, awaiting response... 302 Moved Temporarily
    Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.001 [following]
    --2013-06-27 22:13:22-- http://www.ine.cl/openData/censo2012/persona.sav.gz.001
    Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
    Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 94371840 (90M) [application/x-gzip]
    Saving to: `2323'
    100%[==========================================================================================================>] 94,371,840 7.69M/s in 22s
    2013-06-27 22:13:44 (4.15 MB/s) - `2323' saved [94371840/94371840]
    hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2324
    --2013-06-27 22:14:02-- http://datos.gob.cl/recursos/download/2324
    Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
    Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
    HTTP request sent, awaiting response... 302 Moved Temporarily
    Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.002 [following]
    --2013-06-27 22:14:03-- http://www.ine.cl/openData/censo2012/persona.sav.gz.002
    Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
    Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 94371840 (90M) [application/x-gzip]
    Saving to: `2324'
    100%[==========================================================================================================>] 94,371,840 9.08M/s in 29s
    2013-06-27 22:14:33 (3.11 MB/s) - `2324' saved [94371840/94371840]
    hoffa@hoffa:~/census$ wget http://datos.gob.cl/recursos/download/2325
    --2013-06-27 22:14:37-- http://datos.gob.cl/recursos/download/2325
    Resolving datos.gob.cl (datos.gob.cl)... 198.41.35.100
    Connecting to datos.gob.cl (datos.gob.cl)|198.41.35.100|:80... connected.
    HTTP request sent, awaiting response... 302 Moved Temporarily
    Location: http://www.ine.cl/openData/censo2012/persona.sav.gz.003 [following]
    --2013-06-27 22:14:37-- http://www.ine.cl/openData/censo2012/persona.sav.gz.003
    Resolving www.ine.cl (www.ine.cl)... 200.72.195.236
    Connecting to www.ine.cl (www.ine.cl)|200.72.195.236|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 9221838 (8.8M) [application/x-gzip]
    Saving to: `2325'
    100%[==========================================================================================================>] 9,221,838 2.27M/s in 5.8s
    2013-06-27 22:14:44 (1.53 MB/s) - `2325' saved [9221838/9221838]
    22 + 29 + 5 seconds
  • 31. hoffa@hoffa:~/census$ ls -sh
    total 189M
    91M 2323
    91M 2324
    8.8M 2325
  • 32. hoffa@hoffa:~/census$ ls -sh
    total 189M
    91M 2323
    91M 2324
    8.8M 2325
    hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
    10 seconds (on a very fast multi-core, solid-state computer)
  • 33. hoffa@hoffa:~/census$ ls -sh
    total 189M
    91M 2323
    91M 2324
    8.8M 2325
    hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
    hoffa@hoffa:~/census$ ls -sh persona.sav
    1.1G persona.sav
  • 34. hoffa@hoffa:~/census$ ls -sh
    total 189M
    91M 2323
    91M 2324
    8.8M 2325
    hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
    hoffa@hoffa:~/census$ ls -sh persona.sav
    1.1G persona.sav
    hoffa@hoffa:~/census$ file persona.sav
    persona.sav: SPSS System File TICS DATA FILE MS Windows 20.0.0 002
  • 35. hoffa@hoffa:~/census$ ls -sh
    total 189M
    91M 2323
    91M 2324
    8.8M 2325
    hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
    hoffa@hoffa:~/census$ ls -sh persona.sav
    1.1G persona.sav
    hoffa@hoffa:~/census$ file persona.sav
    persona.sav: SPSS System File TICS DATA FILE MS Windows 20.0.0 002
  • 36.
  • 37. 14 minutes (+ lots of research and debugging)
    hoffa@hoffa:~/census$ R
    R version 2.14.2 (2012-02-29)
    ...
    > options(max.print=10)
    > library(foreign)
    > census <- read.spss('persona.sav', reencode='utf-8')
    re-encoding from utf-8
    > library(ff)
    Loading package ff2.2-7
    > for(i in ls(census)) {print(write.csv(census[i], file=paste('x',i,'.csv', sep='')))}
    ... ... ... ... ... ... ...
  • 38. hoffa@hoffa:~/census$ ls -sh persona
    total 6.6G
    177M xHN.csv
    179M xP17.csv (relationship to house owner)
    177M xP18.csv (gender)
    189M xP19.csv (age)
    ... ...
    178M xP35.csv (# of alive offspring)
    180M xP36A.csv (birth month)
    191M xP36B.csv (birth year)
    37 ~180 MB files
  • 39. hoffa@hoffa:~/census$ python merge_csv.py persona/*.csv > output/persona.csv
    # merge_csv.py: each input file holds one census column; the value is field 1 of every row
    import csv
    import sys
    input_files = sys.argv[1:]
    files = [open(x) for x in input_files]
    files_csv = [csv.reader(x) for x in files]
    writer = csv.writer(sys.stdout)
    while True:
        try:
            row = [next(x)[1] for x in files_csv]  # one value per column file
        except StopIteration:
            break  # a column file ran out of rows: done
        row = [(x if x.isdigit() else '') for x in row]  # 1-line etl
        writer.writerow(row)
    hoffa@hoffa:~/census$ ls -sh output/persona.csv
    1.4G output/persona.csv
    6 minutes (+ coding)
  • 40. From data discovery to CSV
    - Download data: 1 minute
    - Decompress data: 10 seconds (+ figure it out)
    - Transform it to CSV: 14 minutes (+ learn R)
    - Combine in one CSV: 6 minutes (+ Python)
    = ~ 22 minutes (+ a lot of work)
    What's next?
  • 41. From CSV to accessible data
    - Spreadsheet? (max 65,536 rows)
    - Write code? (HD and RAM bounded) (7 seconds to run in-memory operations)
    - MySQL? (SQL is easier, same bounds)
    - Hadoop? (What's a cluster?)
    - BigQuery?
  • 42. Upload to Google
    hoffa@hoffa:~/censo$ gzip -c output/persona.csv | gsutil cp - gs://io13-hoffa/persona.csv.gz
    Copying from <STDIN> [Content-Type=application/octet-stream]...
    1 minute
  • 43. Import to BigQuery via web UI 3-5 minutes
  • 45. Religions in Chile
    SELECT COUNT(*) AS COUNT, p28 AS religion FROM [data-sensing-lab:hoffa.persona] WHERE p28 != 0 GROUP BY religion ORDER BY COUNT DESC
    COUNT religion
    7853428 1 (Catholicism)
    1699725 2 (Protestantism)
    931990 9 (None, atheist, agnostic)
    493147 8 (Other)
    119455 3 (Jehovah's Witnesses)
    103735 5 (Mormon)
    14976 4 (Judaism)
    6959 7 (Orthodox)
    2894 6 (Muslim)
    1.5 seconds
  • 46. Avg children per religion in Chile
    SELECT p28 AS religion, AVG(p34) AS avg_children FROM [data-sensing-lab:hoffa.persona] WHERE p28 != 0 GROUP BY religion ORDER BY avg_children DESC
    religion avg_children
    3 1.48 (Jehovah's Witnesses)
    2 1.41 (Protestantism)
    5 1.19 (Mormon)
    1 1.14 (Catholicism)
    4 0.94 (Judaism)
    7 0.89 (Orthodox)
    9 0.59 (None, atheist, agnostic)
    8 0.58 (Other)
    6 0.56 (Muslim)
    1.5 seconds
  • 47. Avg children per mother occupation
    COUNT work avg_children
    313098 2 1.90 (Domestic service)
    5463709 0 1.71 (Non working)
    78647 5 0.76 (Working for family, non remunerated)
    903566 3 0.56 (Independent worker)
    244137 4 0.54 (Business owner)
    4223152 1 0.47 (Employee)
    1.5 seconds
  • 49.
  • 50. 2008-2012 NYC Taxi: cash vs credit
  • 51.
  • 52.
  • 53.
  • 54. GDELT: What is happening
  • 59. SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='IT') Italy FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Italy)
  • 60. SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='IT')/COUNT(*) Italy FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Italy, normalized) October 1985 July 2001
  • 61. SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Chile, normalized)
  • 62. SELECT TIMESTAMP(STRING(MonthYear)+'01') month, SUM(ActionGeo_CountryCode='CI')/COUNT(*) Chile FROM [gdelt-bq:full.events] WHERE MonthYear>0 GROUP BY 1 ORDER BY 1 GDELT: Rows per month (Chile, normalized) October 1988 March 2010 October 2010
  • 63. Weather data: Power to predict
  • 64.
  • 65.
  • 68. Questions? News: reddit.com/r/bigquery Ask: stackoverflow.com Felipe Hoffa @felipehoffa Rate me? bit.ly/bqfeedback
  • 70. Encore demo A big join: Freebase + Wikipedia fresh logs
  • 71. Exploring the Notability Gender Gap
  • 72. SELECT title, count, iso FROM ( SELECT title, count, c.iso iso, RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank FROM ( SELECT a.title title, SUM(requests) count, b.person person FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a JOIN ( SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person FROM [fh-bigquery:freebase20140119.triples_nolang] a JOIN ( SELECT sub FROM [fh-bigquery:freebase20140119.people_gender] WHERE gender='/m/02zsn') b ON a.sub=b.sub WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key' GROUP BY 1,2) b ON a.title = b.title GROUP BY 1,3) a JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b ON a.person=b.sub JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c ON b.place_of_birth=c.place) WHERE rank=1 ORDER BY count DESC http://devnook.github.io/GenderMaps/maplabels/
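
The outer part of the final query (slide 72) keeps, for each country code, the person whose Wikipedia page had the most requests: RANK() OVER (PARTITION BY iso ORDER BY count DESC) followed by WHERE rank=1. The following is a minimal Python sketch of that one step only, not of the Freebase and Wikipedia joins. The function name `top_per_group` and the sample rows are hypothetical and are not taken from the talk's data.

```python
def top_per_group(rows):
    """Keep the rows ranked first within their country, like RANK() ... WHERE rank=1.

    rows: iterable of (title, count, iso) tuples (illustrative shape, an assumption).
    Ties are kept, because RANK() gives equal counts the same rank.
    """
    rows = list(rows)

    # Highest count seen for each iso code (the PARTITION BY iso part).
    max_count = {}
    for _title, count, iso in rows:
        if count > max_count.get(iso, float("-inf")):
            max_count[iso] = count

    # Rows whose count equals their partition maximum have rank 1.
    winners = [row for row in rows if row[1] == max_count[row[2]]]

    # ORDER BY count DESC (sorted() is stable, so ties keep input order).
    return sorted(winners, key=lambda row: row[1], reverse=True)


# Hypothetical sample data, for illustration only.
sample = [
    ("Person_A", 120, "CL"),
    ("Person_B", 300, "CL"),
    ("Person_C", 50, "IT"),
    ("Person_D", 45, "IT"),
    ("Person_E", 500, "US"),
]
result = top_per_group(sample)
for title, count, iso in result:
    print(iso, title, count)
```

On a single machine this needs every row in memory, which is the limit slide 41 raises for hand-written code; the SQL version leaves that partitioning work to BigQuery.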