This was presented by Felipe Hoffa from Google at the Impacts of Civic Technology Conference (TICTeC2016) in Barcelona on 28th April 2016. You can find out more information about the conference here: https://www.mysociety.org/research/tictec-2016/
12. "Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
13. "Search is solved" -- 1996
Excite – Born in 1993
Yahoo! - Born in 1994
WebCrawler – Born in 1994
Lycos – Born in 1994
Infoseek – Born in 1994
AltaVista – Born in 1995
Inktomi – Born in 1996
Google - 1998
15. How Google was built
1. PageRank: A new idea (see the sketch after this list)
2. Collect the web
3. Create the technology
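As a hedged illustration of step 1 (not code from the talk): PageRank's core idea is that a page is important if important pages link to it, computed by power iteration. A minimal Python sketch, assuming a made-up four-page link graph and the commonly cited 0.85 damping factor:

# Minimal PageRank sketch (illustrative only; `links` is a hypothetical graph).
links = {
    'a': ['b', 'c'],
    'b': ['c'],
    'c': ['a'],
    'd': ['c'],
}
damping = 0.85
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # power iteration until ranks roughly converge
    new_rank = {page: (1 - damping) / n for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share  # each page passes rank to its outlinks
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # 'c' ranks highest here

In this toy graph 'c' wins because three pages link to it, which is the intuition the slide names.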
23. BigQuery
• Fast: terabytes in seconds
• Simple: SQL
• Scalable: From bytes to petabytes
• No CAPEX: Always on
• Interoperable: Tableau, R, Python...
• Instant sharing
• Free monthly quota
24. How many pageviews does Wikipedia have in a month?
SELECT COUNT(*) FROM
[fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
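As a hedged sketch of the "Interoperable: ... Python" bullet above, the same count could be run from Python with the google-cloud-bigquery client library (an assumption; the talk shows only the web UI), using the client's standard-SQL table syntax:

# Sketch: the pageview count from Python, assuming google-cloud-bigquery
# is installed and application-default credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT COUNT(*) AS n
    FROM `fh-bigquery.wikipedia.wikipedia_views_201308`
"""
for row in client.query(sql).result():  # runs the job and waits for it
    print(row.n)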
32-35. hoffa@hoffa:~/census$ ls -sh
total 189M
 91M 2323
 91M 2324
8.8M 2325
hoffa@hoffa:~/census$ cat 232* | gunzip > persona.sav
10 seconds (in a very fast multi-core, solid-state computer)
hoffa@hoffa:~/census$ ls -sh persona.sav
1.1G persona.sav
hoffa@hoffa:~/census$ file persona.sav
persona.sav: SPSS System File TICS DATA FILE MS Windows 20.0.0 002
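The cat | gunzip step works because gzip streams can be concatenated into one. A hedged Python equivalent of the same decompression (file names as on the slide; not code from the talk):

# Sketch: rebuild persona.sav from the downloaded gzip parts.
import glob
import gzip
import shutil

with open('persona.sav', 'wb') as out:
    for part in sorted(glob.glob('232*')):      # the 2323, 2324, 2325 chunks
        with gzip.open(part, 'rb') as f:        # decompress each chunk
            shutil.copyfileobj(f, out)          # append to the single SPSS file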
37. 14 minutes (+ lots of research and debugging)
hoffa@hoffa:~/census$ R
R version 2.14.2 (2012-02-29)
...
> options(max.print=10)
> library(foreign)
> census <- read.spss('persona.sav', reencode='utf-8')
re-encoding from utf-8
> library(ff)
Loading package ff2.2-7
> for(i in ls(census)) {print(write.csv(census[i], file=paste('x', i, '.csv', sep='')))}
...
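For comparison, a hedged modern alternative to the R step (not what the talk used): pandas can read SPSS files directly via pandas.read_spss, which requires the pyreadstat package and enough RAM to hold the ~1.1 GB file:

# Sketch: SPSS -> one CSV per column, mirroring the R loop above.
# Assumes pandas>=0.25 with pyreadstat installed.
import pandas as pd

census = pd.read_spss('persona.sav')        # load the SPSS file
for col in census.columns:
    census[[col]].to_csv('x%s.csv' % col)   # e.g. xP19.csv for age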
38. hoffa@hoffa:~/census$ ls -sh persona
total 6.6G
177M xHN.csv
179M xP17.csv (relationship to house owner)
177M xP18.csv (gender)
189M xP19.csv (age)
...
178M xP35.csv (# of alive offspring)
180M xP36A.csv (birth month)
191M xP36B.csv (birth year)
37 files of ~180 MB each
39. hoffa@hoffa:~/census$ python merge_csv.py persona/*.csv > output/persona.csv
import csv
import sys

# Merge the per-column CSVs side by side; this relies on every input
# file keeping its rows in the same order.
input_files = sys.argv[1:]
files = [open(x) for x in input_files]
files_csv = [csv.reader(x) for x in files]
writer = csv.writer(sys.stdout)
while True:
    try:
        row = [next(x)[1] for x in files_csv]  # take the value column from each file
    except StopIteration:
        break  # all files exhausted
    row = [(x if x.isdigit() else '') for x in row]  # 1-line etl
    writer.writerow(row)
hoffa@hoffa:~/census$ ls -sh output/persona.csv
1.4G output/persona.csv
6 minutes (+ coding)
40. From data discovery to CSV
- Download data: 1 minute
- Decompress data: 10 seconds (+ figure it out)
- Transform it to CSV: 14 minutes (+ learn R)
- Combine in one CSV: 6 minutes (+ Python)
= ~ 22 minutes (+ a lot of work)
What's next?
41. From CSV to accessible data
- Spreadsheet? (max 65,536 rows)
- Write code? (HD and RAM bounded)
(7 seconds to run in memory operations)
- MySQL? (SQL is easier, same bounds)
- Hadoop? (What's a cluster?)
- BigQuery?
42. Upload to Google
hoffa@hoffa:~/censo$ gzip -c output/persona.csv | gsutil cp - gs://io13-hoffa/persona.csv.gz
Copying from <STDIN> [Content-Type=application/octet-stream]...
1 minute
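A hedged Python equivalent of the gsutil upload, using the google-cloud-storage client library (an assumption, not shown in the talk) and assuming the CSV was first gzipped to a local file rather than streamed from stdin:

# Sketch: upload the gzipped CSV to Cloud Storage from Python.
from google.cloud import storage

client = storage.Client()                    # application-default credentials
bucket = client.bucket('io13-hoffa')         # bucket name from the slide
blob = bucket.blob('persona.csv.gz')
blob.upload_from_filename('output/persona.csv.gz')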
72. Per country, the most-viewed Wikipedia article about a woman born there (one hour of pageviews joined with Freebase gender and place-of-birth data):
SELECT title, count, iso FROM (
  SELECT title, count, c.iso iso,
         RANK() OVER (PARTITION BY iso ORDER BY count DESC) rank
  FROM (
    SELECT a.title title, SUM(requests) count, b.person person
    FROM [fh-bigquery:wikipedia.pagecounts_20140410_150000] a
    JOIN (
      SELECT REGEXP_REPLACE(obj, '/wikipedia/id/', '') title, a.sub person
      FROM [fh-bigquery:freebase20140119.triples_nolang] a
      JOIN (
        SELECT sub FROM [fh-bigquery:freebase20140119.people_gender]
        WHERE gender='/m/02zsn') b
      ON a.sub=b.sub
      WHERE obj CONTAINS '/wikipedia/id/' AND pred = '/type/object/key'
      GROUP BY 1,2) b
    ON a.title = b.title
    GROUP BY 1,3) a
  JOIN EACH [fh-bigquery:freebase20140119.people_place_of_birth] b
  ON a.person=b.sub
  JOIN [fh-bigquery:freebase20140119.place_of_birth_to_country] c
  ON b.place_of_birth=c.place)
WHERE rank=1 ORDER BY count DESC
http://devnook.github.io/GenderMaps/maplabels/