© Hortonworks Inc. 2011
Bring Cartography to the Cloud
with Apache Hadoop
Nick Dimiduk
Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23
Beginnings…
Architecting the Future of Big Data
mapbox.com/blog/rendering-the-world/
bmander.com/dotmap/index.html
Definitions
car•tog•ra•phy
|kärˈtägrəfē|
noun

the science or practice of drawing maps.

rendering map tiles from some kind of
geographic data.
cloud
|kloud|
noun

a visible mass of condensed water vapor
floating in the atmosphere, typically high
above the ground.

on-demand consumption of
computation and storage resources.
Background
Apache Hadoop in Review
•  Apache Hadoop Distributed Filesystem (HDFS)
–  Distributed, fault-tolerant, throughput-optimized data storage
–  Uses a filesystem analogy, not structured tables
–  The Google File System, 2003, Ghemawat et al.
–  http://research.google.com/archive/gfs.html
•  Apache Hadoop MapReduce (MR)
–  Distributed, fault-tolerant, batch-oriented data processing
–  Line- or record-oriented processing of the entire dataset *
–  “[Application] schema on read”
–  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
–  http://research.google.com/archive/mapreduce.html
* For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
MapReduce in Detail
highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
What we care about
$ map < input | sort | reduce > output
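The pipeline above is the whole programming model in miniature. As a minimal sketch (not code from tilebrute), the same three stages can be simulated in a single Python process. The names `run_pipeline`, `word_mapper`, and `count_reducer` are hypothetical and exist only for this illustration; word counting stands in for the tile workload.

```python
from itertools import groupby
from operator import itemgetter


def run_pipeline(records, mapper, reducer):
    """Simulate `map < input | sort | reduce > output` in one process."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: bring equal keys together, as Hadoop's shuffle/sort phase does.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key with all of its values.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1


def count_reducer(key, values):
    # Hypothetical reducer: sum the counts collected for one word.
    yield key, sum(values)


result = run_pipeline(["a b a", "b a"], word_mapper, count_reducer)
print(result)
```

The sort step is what lets the reducer treat its input as a stream of per-key groups, which is the property the tile-drawing reducer relies on later.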
How Seamlessly?
$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"
$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"
To the Code!
github.com/ndimiduk/tilebrute
Our Tools
•  Python + GIS
–  GDAL
–  Shapely
–  Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob
Prepare the Input
TIGER/Line Shapefiles
www.census.gov/geo/maps-data/data/tiger-line.html
$ tail -n6 bin/00_prepare_input.sh
ogr2ogr `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326 `: reproject the data` \
  -f CSV `: in CSV format` \
  $OUTPUT `: producing output file` \
  $INPUT `: from input file` \
  -lco GEOMETRY=AS_WKT `: including geometries as WKT`
$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
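A minimal sketch of how a mapper might read this CSV, using only the standard library. `read_feature` here is an assumption about what a function of that name could look like, not the tilebrute implementation, and it assumes the header row is present in the input it sees (a real streaming job would have to cope with input splits that lack the header). The sample row is copied from the output above, elision included.

```python
import csv
import io


def read_feature(lines):
    """Yield (wkt, population) for each data row of the ogr2ogr CSV."""
    # DictReader handles the quoted WKT column, which itself contains commas.
    for row in csv.DictReader(lines):
        yield row["WKT"], int(row["POP10"])


# The two lines shown by `head -n2 /tmp/input.csv`, kept verbatim.
SAMPLE = io.StringIO(
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,"
    "PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((-118.81473 47.233499,...))",'
    "53,001,950100,1042,530019501001042,N,1,5\n"
)

features = list(read_feature(SAMPLE))
print(features)
```

Only the geometry and the POP10 count are carried forward, since those are the two inputs the sampling step needs.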
Map: Sample Geometries
[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']
def main():
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)
$ map < input | sort | reduce > output
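The mapper's sampling step turns each census block into one point per resident. A minimal sketch of one way to do that, assuming rejection sampling inside the polygon's bounding box with a ray-casting containment test; the talk does not say which method tilebrute uses (Shapely is listed among its tools), and `point_in_ring` and `sample_ring` are hypothetical names. The triangle is made-up test data, not census geometry.

```python
import random


def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of vertices?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_ring(ring, count, rng=random):
    """Yield `count` uniformly distributed points inside the ring.

    Assumes the ring has non-zero area; otherwise this never terminates.
    """
    xs, ys = zip(*ring)
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    produced = 0
    while produced < count:
        # Draw from the bounding box and keep only points inside the ring.
        x, y = rng.uniform(min_x, max_x), rng.uniform(min_y, max_y)
        if point_in_ring(x, y, ring):
            produced += 1
            yield x, y


# Hypothetical closed ring (first vertex repeated last) with population 5.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (0.0, 0.0)]
points = list(sample_ring(triangle, 5, random.Random(42)))
print(points)
```

Rejection sampling wastes draws on long, thin shapes, which is one place the "smarter sampling logic" item in the TODOs could apply.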
Map: Sample Geometries
$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214
$ map < input | sort | reduce > output
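The output above pairs one projected point with the tile that contains it at each zoom level from 4 to 17. A minimal sketch of that key generation, assuming the standard Web Mercator projection and XYZ tile numbering with row 0 at the north edge; this reproduces the tile numbers shown, but the function bodies, the zoom defaults, and the comma inside the value are assumptions, not the tilebrute source.

```python
import math

EARTH_RADIUS = 6378137.0               # WGS84 semi-major axis, metres
ORIGIN_SHIFT = math.pi * EARTH_RADIUS  # half the Web Mercator world width


def lnglat_to_mercator(lng, lat):
    """Project WGS84 degrees (EPSG:4326) to Web Mercator metres."""
    x = math.radians(lng) * EARTH_RADIUS
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * EARTH_RADIUS
    return x, y


def mercator_to_tile(x, y, zoom):
    """Return the XYZ tile (tx, ty) containing a Web Mercator point."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    tx = int((x + ORIGIN_SHIFT) / (2 * ORIGIN_SHIFT) * n)
    ty = int((ORIGIN_SHIFT - y) / (2 * ORIGIN_SHIFT) * n)  # row 0 is north
    # Clamp so a point exactly on the east or south edge stays in range.
    return min(tx, n - 1), min(ty, n - 1)


def make_kv(lat, lng, min_zoom=4, max_zoom=17):
    """Yield ('tx,ty,z', 'px,py') once per zoom level for one point."""
    px, py = lnglat_to_mercator(lng, lat)
    for zoom in range(min_zoom, max_zoom + 1):
        tx, ty = mercator_to_tile(px, py, zoom)
        yield "%d,%d,%d" % (tx, ty, zoom), "%.5f,%.5f" % (px, py)


# The projected point from the output above, checked at the deepest zoom.
print(mercator_to_tile(-13224181.65427, 5981084.37214, 17))
```

Emitting the same point once per zoom level multiplies the map output by the number of zoom levels, which is why the shuffle and sort carry most of the data volume in this job.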
Sort
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581
$ map < input | sort | reduce > output
Reduce: Draw Tiles
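def main():
    # stdin arrives sorted by key, so groupby sees each tile's points together
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        map = init_map(zoom, points)
        map.zoom_all()
        # render one 256x256 tile and emit it as encoded text
        im = mapnik.Image(256, 256)
        mapnik.render(map, im)
        emit(tile, encode_image(im))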
$ map < input | sort | reduce > output
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 |
python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
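The reducer depends on sorted input: `itertools.groupby` only merges adjacent records with equal keys. A minimal sketch of the grouping half of the reducer, leaving out the Mapnik rendering so it runs with the standard library alone. The line format (a tab between key and value, a comma inside the value) is an assumption based on Hadoop streaming's default separator and the `'px,py'` signature shown earlier; `group_by_tile` and these function bodies are hypothetical.

```python
from itertools import groupby


def read_points(lines):
    """Parse 'tx,ty,z<TAB>px,py' lines into (tile, (px, py)) pairs."""
    for line in lines:
        tile, value = line.rstrip("\n").split("\t")
        px, py = value.split(",")
        yield tile, (float(px), float(py))


def get_zoom(tile):
    """Extract the zoom level, the third field of a 'tx,ty,z' key."""
    return int(tile.split(",")[2])


def group_by_tile(lines):
    """Yield (tile, [points]) once per tile; input must be sorted by key."""
    for tile, pairs in groupby(read_points(lines), lambda kv: kv[0]):
        yield tile, [point for _, point in pairs]


# Three records with coordinates taken from the sorted output shown earlier.
sorted_lines = [
    "10,22,6\t-13224414.42332,5983539.01581\n",
    "10,22,6\t-13225723.87449,5981201.60336\n",
    "11138,22981,16\t-13226331.90155,5984272.31303\n",
]
tiles = list(group_by_tile(sorted_lines))
print([(tile, len(points)) for tile, points in tiles])
```

Because each group is consumed as a stream, the reducer never needs more than one tile's points in memory at a time.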
Write Output
public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
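For readers more at home in Python, here is a rough local-filesystem analogue of the Java `write` method above. It is a sketch, not part of tilebrute: `write_tile` is a hypothetical name, and it writes to a temporary directory instead of going through the Hadoop `FileSystem` API. The payload is just the 8-byte PNG signature, which is enough to show why every encoded tile begins with `iVBORw0KGgo`.

```python
import base64
import os
import tempfile


def write_tile(output_dir, tile_id, tile_b64):
    """Decode one base64 tile and store it as <zoom>/<tx>/<ty>.png."""
    tx, ty, zoom = tile_id.split(",")
    tile_path = os.path.join(output_dir, zoom, tx, ty + ".png")
    # Create the <zoom>/<tx> directories on first use.
    os.makedirs(os.path.dirname(tile_path), exist_ok=True)
    with open(tile_path, "wb") as fout:
        fout.write(base64.b64decode(tile_b64))
    return tile_path


# The 8 bytes every PNG file starts with, standing in for a real tile.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(PNG_SIGNATURE).decode("ascii")

out_dir = tempfile.mkdtemp()
path = write_tile(out_dir, "10,22,6", encoded)
print(encoded, path)
```

The zoom/x/y directory layout matches what slippy-map clients request, so the output directory can be served as-is.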
To the Cloud!
Basic Services: EC2, S3
•  EC2: Elastic Compute Cloud
–  Virtual machines on demand
–  Different “instance types” with different hardware profiles
–  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
•  S3: Simple Storage Service
–  Distributed, replicated storage
–  Native Hadoop integration
–  Also exposed over http(s), easy tile hosting
Add-on Service: EMR
•  EMR: Elastic MapReduce
–  “Hadoop as a Service”
–  On-demand, pre-installed and configured Hadoop clusters
–  +1: standardized provisioning, deployment, monitoring
–  -1: “stable” (old) software
MrJob: Python for EMR
class TileBrute(MRJob):

    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')
github.com/Yelp/mrjob
Results
14z, 2624x, 5722y
How much code?
$ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           4             69            105            299
Bourne Shell                     8             51             85            210
Java                             2             25             16             74
-------------------------------------------------------------------------------
SUM:                            14            145            206            583
-------------------------------------------------------------------------------
Performance
•  1 x m1.large (2 cores)
–  195575 input features (WA state)
–  3 zoom levels (6, 7, 8)
–  1 hour
•  19 x c1.xlarge (152 cores)
–  308745538 input features (all data)
–  3 zoom levels (6, 7, 8)
–  3 hours 15 minutes
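One rough way to compare the two runs above is input features processed per core-hour. This metric is an assumption introduced here for illustration, not one used in the talk, and it ignores differences between the instance types and between Washington state and the full dataset, so it should not be read as a scaling measurement.

```python
def features_per_core_hour(features, cores, hours):
    """Crude throughput: input features handled per core per hour."""
    return features / (cores * hours)


# Figures taken directly from the two runs listed above.
single = features_per_core_hour(195575, cores=2, hours=1.0)
cluster = features_per_core_hour(308745538, cores=152, hours=3.25)

print("single node: %.1f features per core-hour" % single)
print("cluster:     %.1f features per core-hour" % cluster)
print("ratio:       %.2f" % (cluster / single))
```

By this measure the cluster run does several times more work per core-hour than the single-node run, which suggests fixed per-job overhead weighs heavily on the small run.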
TODOs
•  Macro-level performance optimizations (configuration)
–  Balancing mappers and reducers, memory allocation, &c.
–  On-demand Hadoop means tuning the cluster to the application
•  Micro-level performance optimizations (code)
–  Smarter sampling logic
–  Mapnik API considerations
–  Multi-threaded S3 PUTs
–  https://forums.aws.amazon.com/thread.jspa?threadID=125135
•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?
Thanks!
HBase in Action (Manning), by Nick Dimiduk and Amandeep Khurana, foreword by Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com
