This was presented by Felipe Hoffa from Google at the Impacts of Civic Technology Conference (TICTeC2016) in Barcelona on 28th April 2016. You can find out more information about the conference here: https://www.mysociety.org/research/tictec-2016/
12. "Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
13. "Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
Google - 1998
15. How Google was built
1. PageRank: A new idea (see the sketch after this list)
2. Collect the web
3. Create the technology
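As a hedged illustration of step 1 (not code from the talk): PageRank's core idea is that a page is important if important pages link to it, computed by power iteration. A minimal Python sketch, assuming a made-up four-page link graph and the commonly cited 0.85 damping factor:

# Minimal PageRank sketch (illustrative only; `links` is a hypothetical graph).
links = {
    'a': ['b', 'c'],
    'b': ['c'],
    'c': ['a'],
    'd': ['c'],
}
damping = 0.85
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # power iteration until ranks roughly converge
    new_rank = {page: (1 - damping) / n for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share  # each page passes rank to its outlinks
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # 'c' ranks highest here

In this toy graph 'c' wins because three pages link to it, which is the intuition the slide names.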
23. BigQuery
• Fast: terabytes in seconds
• Simple: SQL
• Scalable: From bytes to petabytes
• No CAPEX: Always on
• Interoperable: Tableau, R, Python...
• Instant sharing
• Free monthly quota
24. How many pageviews does Wikipedia have in a month?
SELECT COUNT(*) FROM
[fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
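As a hedged sketch of the "Interoperable: ... Python" bullet above, the same count could be run from Python with the google-cloud-bigquery client library (an assumption; the talk shows only the web UI), using the client's standard-SQL table syntax:

# Sketch: the pageview count from Python, assuming google-cloud-bigquery
# is installed and application-default credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT COUNT(*) AS n
    FROM `fh-bigquery.wikipedia.wikipedia_views_201308`
"""
for row in client.query(sql).result():  # runs the job and waits for it
    print(row.n)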
32-35. hoffa@hoffa:~/census$ ls -sh
total 189M
 91M 2323
 91M 2324
8.8M 2325
hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
10 seconds (in a very fast multi-core, solid-state computer)
hoffa@hoffa:~/census$ ls -sh persona.sav
1.1G persona.sav
hoffa@hoffa:~/census$ file persona.sav
persona.sav: SPSS System File TICS DATA FILE MS Windows 20.0.0 002
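The cat | gunzip step works because gzip streams can be concatenated into one. A hedged Python equivalent of the same decompression (file names as on the slide; not code from the talk):

# Sketch: rebuild persona.sav from the downloaded gzip parts.
import glob
import gzip
import shutil

with open('persona.sav', 'wb') as out:
    for part in sorted(glob.glob('232*')):      # the 2323, 2324, 2325 chunks
        with gzip.open(part, 'rb') as f:        # decompress each chunk
            shutil.copyfileobj(f, out)          # append to the single SPSS file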
37. 14 minutes (+ lots of research and debugging)
hoffa@hoffa:~/census$ R
R version 2.14.2 (2012-02-29)
...
> options(max.print=10)
> library(foreign)
> census <- read.spss('persona.sav', reencode='utf-8')
re-encoding from utf-8
> library(ff)
Loading package ff2.2-7
> for(i in ls(census)) {print(write.csv(census[i], file=paste('x', i, '.csv', sep='')))}
...
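For comparison, a hedged modern alternative to the R step (not what the talk used): pandas can read SPSS files directly via pandas.read_spss, which requires the pyreadstat package and enough RAM to hold the ~1.1 GB file:

# Sketch: SPSS -> one CSV per column, mirroring the R loop above.
# Assumes pandas>=0.25 with pyreadstat installed.
import pandas as pd

census = pd.read_spss('persona.sav')        # load the SPSS file
for col in census.columns:
    census[[col]].to_csv('x%s.csv' % col)   # e.g. xP19.csv for age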
38. hoffa@hoffa:~/census$ ls -sh persona
total 6.6G
177M xHN.csv
179M xP17.csv (relationship to house owner)
177M xP18.csv (gender)
189M xP19.csv (age)
...
178M xP35.csv (# of alive offspring)
180M xP36A.csv (birth month)
191M xP36B.csv (birth year)
37 files of ~180 MB each
39. hoffa@hoffa:~/census$ python merge_csv.py persona/*.csv > output/persona.csv
import csv
import sys

# Merge the per-column CSVs side by side; this relies on every input
# file keeping its rows in the same order.
input_files = sys.argv[1:]
files = [open(x) for x in input_files]
files_csv = [csv.reader(x) for x in files]
writer = csv.writer(sys.stdout)
while True:
    try:
        row = [next(x)[1] for x in files_csv]  # take the value column from each file
    except StopIteration:
        break  # all files exhausted
    row = [(x if x.isdigit() else '') for x in row]  # 1-line etl
    writer.writerow(row)
hoffa@hoffa:~/census$ ls -sh output/persona.csv
1.4G output/persona.csv
6 minutes (+ coding)
40. From data discovery to CSV
- Download data: 1 minute
- Decompress data: 10 seconds (+ figure it out)
- Transform it to CSV: 14 minutes (+ learn R)
- Combine in one CSV: 6 minutes (+ Python)
= ~ 22 minutes (+ a lot of work)
What's next?
41. From CSV to accessible data
- Spreadsheet? (max 65,536 rows)
- Write code? (HD and RAM bounded)
(7 seconds to run in memory operations)
- MySQL? (SQL is easier, same bounds)
- Hadoop? (What's a cluster?)
- BigQuery?
42. Upload to Google
hoffa@hoffa:~/censo$ gzip -c output/persona.csv | gsutil cp - gs://io13-hoffa/persona.csv.gz
Copying from <STDIN> [Content-Type=application/octet-stream]...
1 minute
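A hedged Python equivalent of the gsutil upload, using the google-cloud-storage client library (an assumption, not shown in the talk) and assuming the CSV was first gzipped to a local file rather than streamed from stdin:

# Sketch: upload the gzipped CSV to Cloud Storage from Python.
from google.cloud import storage

client = storage.Client()                    # application-default credentials
bucket = client.bucket('io13-hoffa')         # bucket name from the slide
blob = bucket.blob('persona.csv.gz')
blob.upload_from_filename('output/persona.csv.gz')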
72. Per country, the most-viewed Wikipedia article about a woman born there (one hour of pageviews joined with Freebase gender and place-of-birth data):
SELECT title, count, iso FROM (
  SELECT title, count, c.iso iso,
         RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank
  FROM (
    SELECT a.title title, SUM(requests) count, b.person person
    FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a
    JOIN (
      SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person
      FROM [fh-bigquery:freebase20140119.triples_nolang] a
      JOIN (
        SELECT sub FROM [fh-bigquery:freebase20140119.people_gender]
        WHERE gender='/m/02zsn') b
      ON a.sub=b.sub
      WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key'
      GROUP BY 1,2) b
    ON a.title = b.title
    GROUP BY 1,3) a
  JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b
  ON a.person=b.sub
  JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c
  ON b.place_of_birth=c.place)
WHERE rank=1 ORDER BY count DESC
http://devnook.github.io/GenderMaps/maplabels/