Bring Cartography to the Cloud

If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images, each rendering one piece of the mosaic that is the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even highly optimized, it spans a couple of TBs and a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.

Published in: Technology

  1. Bring Cartography to the Cloud with Apache Hadoop
     Nick Dimiduk, Member of Technical Staff, HBase
     FOSS4G-NA, 2013-05-23
     © Hortonworks Inc. 2011
  2. Beginnings…
     mapbox.com/blog/rendering-the-world/
     bmander.com/dotmap/index.html
     Architecting the Future of Big Data
  3. Definitions
     car•tog•ra•phy |kärˈtägrəfē| noun
       the science or practice of drawing maps.
       rendering map tiles from some kind of geographic data.
     cloud |kloud| noun
       a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground.
       on demand consumption of computation and storage resources.
  4. Background
  5. Apache Hadoop in Review
     •  Apache Hadoop Distributed Filesystem (HDFS)
        –  Distributed, fault-tolerant, throughput-optimized data storage
        –  Uses a filesystem analogy, not structured tables
        –  The Google File System, 2003, Ghemawat et al.
        –  http://research.google.com/archive/gfs.html
     •  Apache Hadoop MapReduce (MR)
        –  Distributed, fault-tolerant, batch-oriented data processing
        –  Line- or record-oriented processing of the entire dataset *
        –  “[Application] schema on read”
        –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
        –  http://research.google.com/archive/mapreduce.html
     * For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases”
       http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  6. MapReduce in Detail
     highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  7. MapReduce in Detail
     highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  8. What we care about
       $ map < input | sort | reduce > output
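The shell pipeline on slide 8 can be emulated in a single process, which may help when reading the later slides. The following is a minimal sketch, not code from the talk: `run_pipeline`, `count_mapper`, and `count_reducer` are hypothetical names, and the word-count job is only a stand-in for the tile-rendering job.

```python
from itertools import groupby
from operator import itemgetter


def run_pipeline(records, mapper, reducer):
    """Emulate `map < input | sort | reduce > output` in one process."""
    # Map: each input record may emit zero or more (key, value) pairs.
    mapped = [kv for record in records for kv in mapper(record)]
    # Sort: bring equal keys together, as the shuffle/sort phase does.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with that key's values.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def count_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word, 1


def count_reducer(key, values):
    # Hypothetical reducer: sum the counts for one word.
    yield key, sum(values)


if __name__ == "__main__":
    print(run_pipeline(["a b a", "b c"], count_mapper, count_reducer))
```

The sort step is what lets the reducer see all values for a key together; Hadoop performs the same grouping across machines.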
  9. How Seamlessly?
       $ git show e65731e:bin/10_simulated_hadoop.sh
       gzcat "$INPUT_FILES" \
         | python "${PYTHON_DIR}/sample_shapes.py" \
         | sort \
         | python "${PYTHON_DIR}/draw_tiles.py"

       $ git show e65731e:bin/11_hadoop_local.sh
       hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
         -input /tmp/input.csv \
         -output "$OUTPUT_DIR" \
         -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
         -reducer "python ${PYTHON_DIR}/draw_tiles.py"
  10. To the Code!
      github.com/ndimiduk/tilebrute
  11. Our Tools
      •  Python + GIS
         –  GDAL
         –  Shapely
         –  Mapnik
      •  Java
      •  Apache Hadoop
      •  Bash
      •  MrJob
  12. Prepare the Input
      TIGER/Line Shapefiles
      www.census.gov/geo/maps-data/data/tiger-line.html
        $ tail -n6 bin/00_prepare_input.sh
        ogr2ogr `: invoke gdal tool ogr2ogr` \
          -t_srs epsg:4326 `: reproject the data` \
          -f CSV `: in CSV format` \
          $OUTPUT `: producing output file` \
          $INPUT `: from input file` \
          -lco GEOMETRY=AS_WKT `: including geometries as WKT`

        $ head -n2 /tmp/input.csv
        WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
        "POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
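Reading the CSV produced on slide 12 needs little more than the standard `csv` module, because the WKT geometry is a quoted field. A minimal sketch follows, assuming only the column names shown in the slide's header row. `read_features` is a hypothetical helper (the talk's own `read_feature` is not shown), and the polygon is a made-up square because the slide elides the real coordinates.

```python
import csv
import io


def read_features(lines):
    """Yield (wkt, population) pairs from CSV text with a header row."""
    for row in csv.DictReader(lines):
        yield row["WKT"], int(row["POP10"])


# Header copied from the slide; the polygon below is a made-up square,
# since the slide elides the real coordinates.
SAMPLE = io.StringIO(
    "WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10\n"
    '"POLYGON ((0 0,1 0,1 1,0 1,0 0))",53,001,950100,1042,530019501001042,N,1,5\n'
)

if __name__ == "__main__":
    for wkt, population in read_features(SAMPLE):
        print(population, wkt)
```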
  14. Map: Sample Geometries
      [,[WKT, population]] => mapper => [tx,ty,z, px,py]

        def main():
            for geom, population in read_feature(stdin):
                for lng, lat in sample_geometry(geom, population):
                    for key, val in make_kv(lat, lng):
                        emit(key, val)

        $ map < input | sort | reduce > output
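Slide 14 does not show how `sample_geometry` picks its points. One plausible approach, offered here only as an assumption, is rejection sampling: draw random points in the polygon's bounding box and keep those that fall inside, one per person. The sketch below uses only the standard library; `point_in_ring` and `sample_ring` are hypothetical names, and it handles a single closed ring without holes.

```python
import random


def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of (x, y) vertices?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Count edges that a horizontal ray from (x, y) to +infinity crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_ring(ring, count, rng):
    """Draw `count` random points inside the ring by rejection sampling."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_ring(x, y, ring):
            points.append((x, y))
    return points


if __name__ == "__main__":
    # Made-up triangle standing in for a census block with population 5.
    triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (0.0, 0.0)]
    for lng, lat in sample_ring(triangle, 5, random.Random(42)):
        print(lng, lat)
```

The ring must repeat its first vertex at the end, as WKT polygons do. A zero-area ring would never accept a point, so real code would need a guard for that case.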
  15. Map: Sample Geometries
        $ head -n1 input.csv | python -m tilebrute.sample_shapes
        2,5,4 -13224181.65427 5981084.37214
        5,11,5 -13224181.65427 5981084.37214
        10,22,6 -13224181.65427 5981084.37214
        21,44,7 -13224181.65427 5981084.37214
        43,89,8 -13224181.65427 5981084.37214
        87,179,9 -13224181.65427 5981084.37214
        174,359,10 -13224181.65427 5981084.37214
        348,718,11 -13224181.65427 5981084.37214
        696,1436,12 -13224181.65427 5981084.37214
        1392,2873,13 -13224181.65427 5981084.37214
        2785,5746,14 -13224181.65427 5981084.37214
        5571,11493,15 -13224181.65427 5981084.37214
        11142,22986,16 -13224181.65427 5981084.37214
        22284,45973,17 -13224181.65427 5981084.37214

        $ map < input | sort | reduce > output
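The keys on slide 15 have the form `tx,ty,z`, and the values look like Web Mercator coordinates in metres. The talk does not show `make_kv`, so the following is a sketch under that assumption, using the common XYZ tile convention (rows counted from the top). `tile_for_point` and `make_keys` are hypothetical names. For the sample point this formula gives 2,5 at zoom 4 and 22284,45973 at zoom 17, which matches the slide.

```python
import math

# Half the width of the Web Mercator (EPSG:3857) world, in metres.
ORIGIN_SHIFT = math.pi * 6378137.0


def tile_for_point(mx, my, zoom):
    """Return (tx, ty) of the tile containing a Web Mercator point."""
    tiles_per_side = 2 ** zoom
    world = 2 * ORIGIN_SHIFT
    # Tile columns count eastward from the antimeridian; rows count
    # southward from the top edge (the XYZ convention).
    tx = int((mx + ORIGIN_SHIFT) / world * tiles_per_side)
    ty = int((ORIGIN_SHIFT - my) / world * tiles_per_side)
    return tx, ty


def make_keys(mx, my, zooms=range(4, 18)):
    """Yield one 'tx,ty,z' key per zoom level, as in the slide's output."""
    for zoom in zooms:
        tx, ty = tile_for_point(mx, my, zoom)
        yield "%d,%d,%d" % (tx, ty, zoom)


if __name__ == "__main__":
    for key in make_keys(-13224181.65427, 5981084.37214):
        print(key)
```

Emitting one key per zoom level for every point is what lets a single pass over the data produce tiles for all zoom levels at once.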
  16. Sort
        $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
        10,22,6 -13224414.42332 5983539.01581
        10,22,6 -13225723.87449 5981201.60336
        10,22,6 -13225793.67181 5983127.53706
        10,22,6 -13226046.70101 5983375.66839
        10,22,6 -13226331.90155 5984272.31303
        11138,22981,16 -13226331.90155 5984272.31303
        11139,22983,16 -13225793.67181 5983127.53706
        11139,22983,16 -13226046.70101 5983375.66839
        11139,22986,16 -13225723.87449 5981201.60336
        11141,22982,16 -13224414.42332 5983539.01581

        $ map < input | sort | reduce > output
  17. Reduce: Draw Tiles
        def main():
            for tile, points in groupby(read_points(stdin), lambda x: x[0]):
                zoom = get_zoom(tile)
                map = init_map(zoom, points)
                map.zoom_all()
                im = mapnik.Image(256, 256)
                mapnik.render(map, im)
                emit(tile, encode_image(im))

        $ map < input | sort | reduce > output

        $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 | python -m tilebrute.draw_tiles
        10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
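The reducer on slide 17 relies on two ideas: `groupby` over already-sorted input, and encoding the rendered image as text so it can travel through a line-oriented stream. The sketch below illustrates both without Mapnik. It assumes tab-separated key and value, which the slide's whitespace does not confirm, and `reduce_tiles` and `fake_render` are hypothetical stand-ins for the talk's functions.

```python
import base64
from itertools import groupby


def read_points(lines):
    """Parse 'tx,ty,z<TAB>px py' lines into (tile_key, (px, py)) pairs."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        px, py = value.split()
        yield key, (float(px), float(py))


def reduce_tiles(lines, render):
    """Group sorted input by tile key and emit (key, base64 text) pairs."""
    for key, group in groupby(read_points(lines), key=lambda kv: kv[0]):
        points = [point for _, point in group]
        zoom = int(key.split(",")[2])
        payload = render(zoom, points)  # bytes, e.g. a PNG
        yield key, base64.b64encode(payload).decode("ascii")


def fake_render(zoom, points):
    # Stand-in for the Mapnik rendering step: returns a tiny byte string.
    return ("z%d:%d points" % (zoom, len(points))).encode("ascii")


if __name__ == "__main__":
    sample = ["10,22,6\t-1.0 2.0\n", "10,22,6\t-3.0 4.0\n", "11,23,6\t0.0 0.0\n"]
    for key, encoded in reduce_tiles(sample, fake_render):
        print(key, encoded)
```

`groupby` only merges adjacent equal keys, so this works only because the sort phase has already run.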
  18. Write Output
        public void write(Text tileId, Text tile) throws IOException {
          String[] tileIdSplits = tileId.toString().split(",");
          assert tileIdSplits.length == 3;
          String tx = tileIdSplits[0];
          String ty = tileIdSplits[1];
          String zoom = tileIdSplits[2];
          Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
          fs.mkdirs(tilePath.getParent());
          byte[] buf = Base64.decodeBase64(tile.toString());
          final FSDataOutputStream fout = fs.create(tilePath, progress);
          fout.write(buf);
          fout.close();
        }
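The Java writer on slide 18 turns each `tx,ty,z` key into a `zoom/tx/ty.png` path and decodes the base64 payload before writing. The same logic can be sketched in Python against the local filesystem instead of HDFS or S3. This is an illustration only; `tile_path` and `write_tile` are hypothetical names.

```python
import base64
import os
import tempfile


def tile_path(output_dir, tile_id):
    """Map a 'tx,ty,z' key to <output_dir>/<z>/<tx>/<ty>.png."""
    tx, ty, zoom = tile_id.split(",")
    return os.path.join(output_dir, zoom, tx, ty + ".png")


def write_tile(output_dir, tile_id, tile_b64):
    """Decode a base64 tile and write it under the z/x/y layout."""
    path = tile_path(output_dir, tile_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fout:
        fout.write(base64.b64decode(tile_b64))
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    encoded = base64.b64encode(b"not a real PNG").decode("ascii")
    print(write_tile(out_dir, "10,22,6", encoded))
```

A `z/x/y.png` directory layout is the one most web map clients expect, which is why tiles written this way can be served directly as static files.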
  19. To the Cloud!
  20. Basic Services: EC2, S3
      •  EC2: Elastic Compute Cloud
         –  Virtual machines on demand
         –  Different “instance types” with different hardware profiles
         –  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
      •  S3: Simple Storage Service
         –  Distributed, replicated storage
         –  Native Hadoop integration
         –  Also exposed over http(s), easy tile hosting
  21. Add-on Service: EMR
      •  EMR: Elastic MapReduce
         –  “Hadoop as a Service”
         –  On-demand, pre-installed and configured Hadoop clusters
         –  +1: standardization of provisioning, deployment, monitoring
         –  -1: “stable” (old) software
  22. MrJob: Python for EMR
        class TileBrute(MRJob):

            HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

            def mapper_cmd(self):
                return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

            def reducer_cmd(self):
                return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

      github.com/Yelp/mrjob
  23. Results
  25. 14z, 2624x, 5722y
  26. 14z, 2624x, 5722y
  27. How much code?
        $ find -f src -f bin | egrep '\.(java|sh|py)$' | grep -v test | xargs cloc --quiet

        http://cloc.sourceforge.net v 1.56  T=0.5 s (28.0 files/s, 1868.0 lines/s)
        -------------------------------------------------------------------------------
        Language                     files          blank        comment           code
        -------------------------------------------------------------------------------
        Python                           4             69            105            299
        Bourne Shell                     8             51             85            210
        Java                             2             25             16             74
        -------------------------------------------------------------------------------
        SUM:                            14            145            206            583
        -------------------------------------------------------------------------------
  28. Performance
      •  1 x m1.large (2 cores)
         –  195575 input features (WA state)
         –  3 zoom levels (6, 7, 8)
         –  1 hour
      •  19 x c1.xlarge (152 cores)
         –  308745538 input features (all data)
         –  3 zoom levels (6, 7, 8)
         –  3 hours 15 minutes
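The two runs on slide 28 can be compared by normalizing to input features per core-hour. The arithmetic below uses only the slide's figures. It is a rough comparison rather than a benchmark, since the runs use different instance types and different datasets; `features_per_core_hour` is a hypothetical helper.

```python
def features_per_core_hour(features, cores, hours):
    """Average input features processed per core per hour."""
    return features / (cores * hours)


# Figures taken from the slide.
single = features_per_core_hour(195575, cores=2, hours=1.0)
cluster = features_per_core_hour(308745538, cores=19 * 8, hours=3.25)

if __name__ == "__main__":
    print("1 x m1.large:   %.0f features per core-hour" % single)
    print("19 x c1.xlarge: %.0f features per core-hour" % cluster)
    print("ratio: %.1f" % (cluster / single))
```

By this measure the single m1.large processed about 98 thousand features per core-hour and the 19-node cluster about 625 thousand, roughly a sixfold difference per core.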
  29. TODOs
      •  Macro-level performance optimizations (configuration)
         –  Balancing mappers and reducers, memory allocation, &c.
         –  On-demand Hadoop means tuning the cluster to the application
      •  Micro-level performance optimizations (code)
         –  Smarter sampling logic
         –  Mapnik API considerations
         –  Multi-threaded S3 PUTs
         –  https://forums.aws.amazon.com/thread.jspa?threadID=125135
      •  Write tiles in MBTiles format
      •  Write tiles to HBase
      •  Compression!
      •  Ogrbrute?
  30. Thanks!
      Manning: Nick Dimiduk, Amandeep Khurana; foreword by Michael Stack
      hbaseinaction.com
      Nick Dimiduk: github.com/ndimiduk, @xefyr, n10k.com
