Bring Cartography to the Cloud

If you've used a modern, interactive map such as Google or Bing Maps, you've consumed "map tiles". Map tiles are small images, each rendering one piece of the mosaic that makes up the whole map. Using conventional means, rendering tiles for the whole globe at multiple resolutions is a huge data processing effort. Even when highly optimized, it spans a couple of terabytes of data and a few days of computation. Enter Hadoop. In this talk, I'll show you how to generate your own custom tiles using Hadoop. There will be pretty pictures.

Transcript

  • 1. Bring Cartography to the Cloud with Apache Hadoop
      Nick Dimiduk, Member of Technical Staff, HBase
      FOSS4G-NA, 2013-05-23
      © Hortonworks Inc. 2011
  • 2. Beginnings…
      Architecting the Future of Big Data
      mapbox.com/blog/rendering-the-world/
      bmander.com/dotmap/index.html
  • 3. Definitions
      car•tog•ra•phy |kärˈtägrəfē| noun
        – the science or practice of drawing maps.
        – rendering map tiles from some kind of geographic data.
      cloud |kloud| noun
        – a visible mass of condensed water vapor floating in the atmosphere, typically high above the ground.
        – on demand consumption of computation and storage resources.
  • 4. Background
  • 5. Apache Hadoop in Review
      • Apache Hadoop Distributed Filesystem (HDFS)
        – Distributed, fault-tolerant, throughput-optimized data storage
        – Uses a filesystem analogy, not structured tables
        – The Google File System, 2003, Ghemawat et al.
        – http://research.google.com/archive/gfs.html
      • Apache Hadoop MapReduce (MR)
        – Distributed, fault-tolerant, batch-oriented data processing
        – Line- or record-oriented processing of the entire dataset *
        – “[Application] schema on read”
        – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
        – http://research.google.com/archive/mapreduce.html
      * For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases”: http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 6. MapReduce in Detail
      highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 7. MapReduce in Detail
      highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 8. What we care about
        $ map < input | sort | reduce > output
  • 9. How Seamlessly?
        $ git show e65731e:bin/10_simulated_hadoop.sh
        gzcat "$INPUT_FILES" \
          | python "${PYTHON_DIR}/sample_shapes.py" \
          | sort \
          | python "${PYTHON_DIR}/draw_tiles.py"

        $ git show e65731e:bin/11_hadoop_local.sh
        hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
          -input /tmp/input.csv \
          -output "$OUTPUT_DIR" \
          -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
          -reducer "python ${PYTHON_DIR}/draw_tiles.py"
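
The `map < input | sort | reduce > output` pipeline from slides 8 and 9 can be imitated inside a single Python process, which may help when reasoning about what Hadoop does between the two scripts. The following is a minimal sketch and not code from tilebrute: `run_pipeline`, `word_mapper`, and `count_reducer` are hypothetical names, and a word count stands in for the tile job.

```python
from itertools import groupby
from operator import itemgetter


def run_pipeline(records, mapper, reducer):
    """Imitate `map < input | sort | reduce > output` in one process."""
    # Map: every input record may yield any number of (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: stands in for Hadoop's shuffle, which brings equal keys together.
    pairs.sort(key=itemgetter(0))
    # Reduce: each key is handed to the reducer once, with all of its values.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for each word in a line.
    for word in line.split():
        yield word, 1


def count_reducer(word, counts):
    # Hypothetical reducer: sum the counts collected for one word.
    yield word, sum(counts)


result = run_pipeline(["a b a", "b a"], word_mapper, count_reducer)
print(result)
```

The sort step is what lets the reducer see all values for a key together, which is the same reason the shell version places `sort` between the two Python scripts.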
  • 10. To the Code!
      github.com/ndimiduk/tilebrute
  • 11. Our Tools
      • Python + GIS
        – GDAL
        – Shapely
        – Mapnik
      • Java
      • Apache Hadoop
      • Bash
      • MrJob
  • 12. Prepare the Input
      TIGER/Line Shapefiles
      www.census.gov/geo/maps-data/data/tiger-line.html
        $ tail -n6 bin/00_prepare_input.sh
        ogr2ogr `: invoke gdal tool ogr2ogr` \
          -t_srs epsg:4326 `: reproject the data` \
          -f CSV `: in CSV format` \
          $OUTPUT `: producing output file` \
          $INPUT `: from input file` \
          -lco GEOMETRY=AS_WKT `: including geometries as WKT`

        $ head -n2 /tmp/input.csv
        WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
        "POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
  • 13. Prepare the Input (slide text identical to slide 12)
  • 14. Map: Sample Geometries
      [,[WKT, population]] => mapper => [tx,ty,z, px,py]
        def main():
            for geom, population in read_feature(stdin):
                for lng, lat in sample_geometry(geom, population):
                    for key, val in make_kv(lat, lng):
                        emit(key, val)

        $ map < input | sort | reduce > output
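
The slide does not show what `sample_geometry(geom, population)` does internally. One plausible reading is that it draws `population` random points inside the polygon. The sketch below illustrates that idea with rejection sampling and a ray-casting point-in-polygon test in pure Python. It is an assumption for illustration only: `point_in_ring` and `sample_ring` are hypothetical names, the ring is a plain list of `(x, y)` vertices rather than parsed WKT, and holes are ignored.

```python
import random


def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the ring of (x, y) vertices?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count the crossing if it lies to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


def sample_ring(ring, count, rng=random):
    """Yield `count` random points inside the ring by rejection sampling."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    produced = 0
    while produced < count:
        # Draw from the bounding box and keep only points inside the ring.
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_ring(x, y, ring):
            produced += 1
            yield x, y


# Hypothetical input: a triangle with 5 inhabitants.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = list(sample_ring(triangle, 5, random.Random(0)))
```

Rejection sampling is simple but wasteful for thin or very concave shapes, and it never terminates for a zero-area ring, so a real mapper would need a guard for degenerate input.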
  • 15. Map: Sample Geometries
        $ head -n1 input.csv | python -m tilebrute.sample_shapes
        2,5,4 -13224181.65427 5981084.37214
        5,11,5 -13224181.65427 5981084.37214
        10,22,6 -13224181.65427 5981084.37214
        21,44,7 -13224181.65427 5981084.37214
        43,89,8 -13224181.65427 5981084.37214
        87,179,9 -13224181.65427 5981084.37214
        174,359,10 -13224181.65427 5981084.37214
        348,718,11 -13224181.65427 5981084.37214
        696,1436,12 -13224181.65427 5981084.37214
        1392,2873,13 -13224181.65427 5981084.37214
        2785,5746,14 -13224181.65427 5981084.37214
        5571,11493,15 -13224181.65427 5981084.37214
        11142,22986,16 -13224181.65427 5981084.37214
        22284,45973,17 -13224181.65427 5981084.37214

        $ map < input | sort | reduce > output
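
The keys in this output (`tx,ty,z`) look consistent with the common XYZ ("slippy map") tiling scheme over Web Mercator, where each zoom level doubles the number of tiles per axis and row 0 is the northernmost. The sketch below shows that arithmetic under this assumption. It is not taken from tilebrute, and `to_web_mercator`, `tile_for_point`, and `make_keys` are hypothetical names.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, as used by Web Mercator


def to_web_mercator(lng, lat):
    """Project WGS84 degrees (EPSG:4326) to Web Mercator meters (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lng)
    y = EARTH_RADIUS_M * math.asinh(math.tan(math.radians(lat)))
    return x, y


def tile_for_point(lng, lat, zoom):
    """Return (tx, ty) of the XYZ tile containing the point at this zoom."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    tx = int((lng + 180.0) / 360.0 * n)
    ty = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge still land in a valid tile.
    return min(tx, n - 1), min(ty, n - 1)


def make_keys(lng, lat, zooms=range(4, 18)):
    """Yield one 'tx,ty,z' key per zoom level, in the layout shown on the slide."""
    for zoom in zooms:
        tx, ty = tile_for_point(lng, lat, zoom)
        yield f"{tx},{ty},{zoom}"
```

For the first vertex of the polygon on slide 12, `tile_for_point(-118.81473, 47.233499, 6)` gives `(10, 22)`, which agrees with the `10,22,6` key above.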
  • 16. Sort
        $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
        10,22,6 -13224414.42332 5983539.01581
        10,22,6 -13225723.87449 5981201.60336
        10,22,6 -13225793.67181 5983127.53706
        10,22,6 -13226046.70101 5983375.66839
        10,22,6 -13226331.90155 5984272.31303
        11138,22981,16 -13226331.90155 5984272.31303
        11139,22983,16 -13225793.67181 5983127.53706
        11139,22983,16 -13226046.70101 5983375.66839
        11139,22986,16 -13225723.87449 5981201.60336
        11141,22982,16 -13224414.42332 5983539.01581

        $ map < input | sort | reduce > output
  • 17. Reduce: Draw Tiles
        def main():
            for tile, points in groupby(read_points(stdin), lambda x: x[0]):
                zoom = get_zoom(tile)
                map = init_map(zoom, points)
                map.zoom_all()
                im = mapnik.Image(256,256)
                mapnik.render(map,im)
                emit(tile, encode_image(im))

        $ map < input | sort | reduce > output

        $ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 | python -m tilebrute.draw_tiles
        10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
  • 18. Write Output
        public void write(Text tileId, Text tile) throws IOException {
            String[] tileIdSplits = tileId.toString().split(",");
            assert tileIdSplits.length == 3;
            String tx = tileIdSplits[0];
            String ty = tileIdSplits[1];
            String zoom = tileIdSplits[2];
            Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
            fs.mkdirs(tilePath.getParent());
            byte[] buf = Base64.decodeBase64(tile.toString());
            final FSDataOutputStream fout = fs.create(tilePath, progress);
            fout.write(buf);
            fout.close();
        }
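
The Java `write` method on this slide turns a `tx,ty,zoom` key into a `zoom/tx/ty.png` path and decodes the base64 payload emitted by the reducer. For testing the reducer output on a local disk, the same steps can be sketched in Python. This is an illustrative equivalent, not part of tilebrute: `tile_path` and `write_tile` are hypothetical names, and the local filesystem stands in for the Hadoop `FileSystem` API.

```python
import base64
from pathlib import Path


def tile_path(output_dir, tile_id):
    """Map a 'tx,ty,zoom' key to output_dir/zoom/tx/ty.png."""
    parts = tile_id.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 'tx,ty,zoom', got {tile_id!r}")
    tx, ty, zoom = parts
    return Path(output_dir) / zoom / tx / f"{ty}.png"


def write_tile(output_dir, tile_id, encoded_png):
    """Decode a base64 payload and write it to the tile's path."""
    path = tile_path(output_dir, tile_id)
    # Create zoom/tx/ on first use, like fs.mkdirs(tilePath.getParent()).
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(base64.b64decode(encoded_png))
    return path
```

The `zoom/tx/ty.png` layout is what makes the output directory directly usable as a static tile source, since web map clients typically request tiles by a `{z}/{x}/{y}` URL template.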
  • 19. To the Cloud!
  • 20. Basic Services: EC2, S3
      • EC2: Elastic Compute Cloud
        – Virtual machines on demand
        – Different “instance types” with different hardware profiles
        – m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
      • S3: Simple Storage Service
        – Distributed, replicated storage
        – Native Hadoop integration
        – Also exposed over http(s), easy tile hosting
  • 21. Add-on Service: EMR
      • EMR: Elastic MapReduce
        – “Hadoop as a Service”
        – On-demand, pre-installed and configured Hadoop clusters
        – +1: standardization of provisioning, deployment, monitoring
        – -1: “stable” (old) software
  • 22. MrJob: Python for EMR
        class TileBrute(MRJob):

            HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

            def mapper_cmd(self):
                return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

            def reducer_cmd(self):
                return bash_wrap('$PYTHON -m tilebrute.draw_tiles')

      github.com/Yelp/mrjob
  • 23. Results
  • 24. (no text on slide)
  • 25. 14z, 2624x, 5722y
  • 26. 14z, 2624x, 5722y
  • 27. How much code?
        $ find -f src -f bin | egrep '.(java|sh|py)$' | grep -v test | xargs cloc --quiet
        http://cloc.sourceforge.net v 1.56 T=0.5 s (28.0 files/s, 1868.0 lines/s)
        -------------------------------------------------------------------------------
        Language                     files          blank        comment           code
        -------------------------------------------------------------------------------
        Python                           4             69            105            299
        Bourne Shell                     8             51             85            210
        Java                             2             25             16             74
        -------------------------------------------------------------------------------
        SUM:                            14            145            206            583
        -------------------------------------------------------------------------------
  • 28. Performance
      • 1 x m1.large (2 cores)
        – 195575 input features (WA state)
        – 3 zoom levels (6, 7, 8)
        – 1 hour
      • 19 x c1.xlarge (152 cores)
        – 308745538 input features (all data)
        – 3 zoom levels (6, 7, 8)
        – 3 hours 15 minutes
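
The two runs on the Performance slide can be put on a common footing by dividing input features by core-hours. The short calculation below uses only the figures from the slide; `features_per_core_hour` is a hypothetical helper name. Treat the result as a rough indicator only, since the two runs use different instance types and different input data.

```python
def features_per_core_hour(features, cores, hours):
    """Input features processed per core per hour."""
    return features / (cores * hours)


# Figures from the slide: one m1.large versus nineteen c1.xlarge instances.
single = features_per_core_hour(195_575, cores=2, hours=1.0)
cluster = features_per_core_hour(308_745_538, cores=152, hours=3.25)

print(f"1 x m1.large:   {single:,.0f} features per core-hour")
print(f"19 x c1.xlarge: {cluster:,.0f} features per core-hour")
print(f"ratio:          {cluster / single:.1f}x")
```

This works out to roughly 98,000 features per core-hour on the single machine and roughly 625,000 on the cluster, a ratio of about 6.4. The slide does not explain the gap, and the number of points sampled per feature (which depends on population) may differ considerably between the two inputs.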
  • 29. TODOs
      • Macro-level performance optimizations (configuration)
        – Balancing mappers and reducers, memory allocation, &c.
        – On-demand Hadoop means tuning the cluster to the application
      • Micro-level performance optimizations (code)
        – Smarter sampling logic
        – Mapnik API considerations
        – Multi-threaded S3 PUTs
        – https://forums.aws.amazon.com/thread.jspa?threadID=125135
      • Write tiles in MBTiles format
      • Write tiles to HBase
      • Compression!
      • Ogrbrute?
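
The "Write tiles in MBTiles format" item is listed as future work, so the talk shows no code for it. As a rough idea of what it could involve: an MBTiles file is a SQLite database with a `metadata` table and a `tiles` table, and its tile rows use the TMS convention, so the row index is flipped relative to the XYZ keys shown earlier. The sketch below uses Python's standard `sqlite3` module. `open_mbtiles` and `put_tile` are hypothetical names, and only a minimal subset of the metadata is written.

```python
import sqlite3


def open_mbtiles(path, name="tiles"):
    """Create the metadata and tiles tables and return the connection."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS metadata (name TEXT, value TEXT)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        "zoom_level INTEGER, tile_column INTEGER, tile_row INTEGER, tile_data BLOB)"
    )
    # A unique index lets INSERT OR REPLACE overwrite a re-rendered tile.
    db.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS tile_index "
        "ON tiles (zoom_level, tile_column, tile_row)"
    )
    db.executemany(
        "INSERT INTO metadata (name, value) VALUES (?, ?)",
        [("name", name), ("format", "png")],
    )
    return db


def put_tile(db, tx, ty, zoom, png_bytes):
    """Store one XYZ-addressed tile; the row is flipped to the TMS convention."""
    tms_row = (2 ** zoom - 1) - ty
    db.execute(
        "INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
        (zoom, tx, tms_row, png_bytes),
    )
```

A single SQLite file avoids creating one small file per tile, but SQLite expects one writer at a time, so many reducers could not write to the same file concurrently without an extra merge step.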
  • 30. Thanks!
      Manning: Nick Dimiduk, Amandeep Khurana; foreword by Michael Stack
      hbaseinaction.com
      Nick Dimiduk
      github.com/ndimiduk
      @xefyr
      n10k.com