Bring Cartography to the Cloud

© Hortonworks Inc. 2011
Bring Cartography to the Cloud
with Apache Hadoop
Nick Dimiduk
Member of Technical Staff, HBase
FOSS4G-NA, 2013-05-23
Beginnings…
Architecting the Future of Big Data
mapbox.com/blog/rendering-the-world/
bmander.com/dotmap/index.html
Definitions
car•tog•ra•phy
|kärˈtägrəfē|
noun

the science or practice of drawing maps.

rendering map tiles from some kind of
geographic data.
cloud
|kloud|
noun

a visible mass of condensed water vapor
floating in the atmosphere, typically high
above the ground.

on demand consumption of
computation and storage resources.
Background
Apache Hadoop in Review
•  Apache Hadoop Distributed Filesystem (HDFS)
–  Distributed, fault-tolerant, throughput-optimized data storage
–  Uses a filesystem analogy, not structured tables
–  The Google File System, 2003, Ghemawat et al.
–  http://research.google.com/archive/gfs.html
•  Apache Hadoop MapReduce (MR)
–  Distributed, fault-tolerant, batch-oriented data processing
–  Line- or record-oriented processing of the entire dataset *
–  “[Application] schema on read”
–  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
–  http://research.google.com/archive/mapreduce.html
* For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
MapReduce in Detail
highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
What we care about
$ map < input | sort | reduce > output
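The one-liner above can be mimicked inside a single Python process. What follows is only a sketch of the idea: `run_pipeline`, `count_mapper`, and `count_reducer` are hypothetical names chosen for illustration and are not part of tilebrute or Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def run_pipeline(records, mapper, reducer):
    """Mimic `map < input | sort | reduce > output` in one process."""
    # Map: each input record may yield any number of (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: bring equal keys next to each other, as the sort step does.
    pairs.sort(key=itemgetter(0))
    # Reduce: hand each key and all of its values to the reducer once.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, [value for _, value in group])


def count_mapper(record):
    # Hypothetical mapper: the record is already a tile id; emit a count of 1.
    yield record, 1


def count_reducer(key, values):
    # Hypothetical reducer: total the counts seen for one tile id.
    yield key, sum(values)


if __name__ == "__main__":
    tiles = ["10,22,6", "21,44,7", "10,22,6"]
    for key, total in run_pipeline(tiles, count_mapper, count_reducer):
        print(key, total)
```

The mapper and reducer never see each other; the sort in the middle is the only coordination, which is what lets the same two programs run unchanged under a shell pipe or under Hadoop.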
How Seamlessly?
$ git show e65731e:bin/10_simulated_hadoop.sh
gzcat "$INPUT_FILES" \
  | python "${PYTHON_DIR}/sample_shapes.py" \
  | sort \
  | python "${PYTHON_DIR}/draw_tiles.py"
$ git show e65731e:bin/11_hadoop_local.sh
hadoop jar target/tile-brute-0.1.0-SNAPSHOT.jar \
  -input /tmp/input.csv \
  -output "$OUTPUT_DIR" \
  -mapper "python ${PYTHON_DIR}/sample_shapes.py" \
  -reducer "python ${PYTHON_DIR}/draw_tiles.py"
To the Code!
github.com/ndimiduk/tilebrute
Our Tools
•  Python + GIS
–  GDAL
–  Shapely
–  Mapnik
•  Java
•  Apache Hadoop
•  Bash
•  MrJob
Prepare the Input
TIGER/Line Shapefiles
www.census.gov/geo/maps-data/data/tiger-line.html
$ tail -n6 bin/00_prepare_input.sh
ogr2ogr `: invoke gdal tool ogr2ogr` \
  -t_srs epsg:4326 `: reproject the data` \
  -f CSV `: in CSV format` \
  $OUTPUT `: producing output file` \
  $INPUT `: from input file` \
  -lco GEOMETRY=AS_WKT `: including geometries as WKT`
$ head -n2 /tmp/input.csv
WKT,STATEFP10,COUNTYFP10,TRACTCE10,BLOCKCE,BLOCKID10,PARTFLG,HOUSING10,POP10
"POLYGON ((-118.81473 47.233499,...))",53,001,950100,1042,530019501001042,N,1,5
Map: Sample Geometries
[,[WKT, population]] => mapper => ['tx,ty,z', 'px,py']
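from sys import stdin

def main():
    # One input feature fans out to many sampled points, and each point
    # to one key/value pair per zoom level.
    for geom, population in read_feature(stdin):
        for lng, lat in sample_geometry(geom, population):
            for key, val in make_kv(lat, lng):
                emit(key, val)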
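The slides do not show the helpers the mapper calls, so here is a minimal sketch of what reading and sampling could look like. It assumes one random dot per person placed inside the census block, in the spirit of the dot map linked under Beginnings; that is an assumption, not a description of tilebrute. All function names below are hypothetical, the WKT parser handles only single-ring polygons, and the point-in-polygon test is plain ray casting written in pure Python, where the real code more plausibly leans on Shapely.

```python
import csv
import io
import random


def read_features(lines, geom_field="WKT", pop_field="POP10"):
    """Yield (wkt, population) pairs from CSV lines that start with a header."""
    for row in csv.DictReader(lines):
        yield row[geom_field], int(row[pop_field])


def parse_simple_polygon(wkt):
    """Parse 'POLYGON ((x y,x y,...))' with a single ring into (x, y) tuples."""
    body = wkt.strip()[len("POLYGON"):].strip().strip("()")
    return [tuple(float(n) for n in pair.split()) for pair in body.split(",")]


def point_in_ring(x, y, ring):
    """Ray casting: is (x, y) inside the closed ring (first vertex == last)?"""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_ring(ring, count, rng=random):
    """Yield `count` random points inside `ring` by rejection sampling.

    A ring with zero area would loop forever; real input needs a guard.
    """
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    produced = 0
    while produced < count:
        # Draw from the bounding box and keep only draws that land inside.
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_ring(x, y, ring):
            produced += 1
            yield x, y


def sample_features(lines, rng=random):
    """Yield one random (lng, lat) point per person in each feature."""
    for wkt, population in read_features(lines):
        yield from sample_ring(parse_simple_polygon(wkt), population, rng)


if __name__ == "__main__":
    # Made-up triangle; the slide elides the real block geometry.
    text = 'WKT,POP10\n"POLYGON ((0 0,4 0,0 4,0 0))",5\n'
    for lng, lat in sample_features(io.StringIO(text), random.Random(0)):
        print("%.5f %.5f" % (lng, lat))
```

Rejection sampling is the brute-force choice: long, thin, or oddly shaped blocks waste most draws, which is one reading of the "smarter sampling logic" item in the TODOs.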
$ head -n1 input.csv | python -m tilebrute.sample_shapes
2,5,4 -13224181.65427 5981084.37214
5,11,5 -13224181.65427 5981084.37214
10,22,6 -13224181.65427 5981084.37214
21,44,7 -13224181.65427 5981084.37214
43,89,8 -13224181.65427 5981084.37214
87,179,9 -13224181.65427 5981084.37214
174,359,10 -13224181.65427 5981084.37214
348,718,11 -13224181.65427 5981084.37214
696,1436,12 -13224181.65427 5981084.37214
1392,2873,13 -13224181.65427 5981084.37214
2785,5746,14 -13224181.65427 5981084.37214
5571,11493,15 -13224181.65427 5981084.37214
11142,22986,16 -13224181.65427 5981084.37214
22284,45973,17 -13224181.65427 5981084.37214
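The keys above read as `tx,ty,z` tile coordinates and the values as a point in projected metres. A sketch under the assumption of spherical Web Mercator and the common XYZ tile scheme (rows counted from the top of the map) reproduces the keys shown for this sample point. The names `lnglat_to_mercator`, `tile_key`, and `keys_for_point` are hypothetical, and tilebrute's own `make_kv` may well be written differently.

```python
import math

EARTH_RADIUS = 6378137.0                 # metres, spherical Mercator
ORIGIN_SHIFT = math.pi * EARTH_RADIUS    # half the width of the projected world


def lnglat_to_mercator(lng, lat):
    """Project WGS84 degrees (EPSG:4326) to Web Mercator metres."""
    x = lng * ORIGIN_SHIFT / 180.0
    y = math.log(math.tan(math.radians(90.0 + lat) / 2.0)) * EARTH_RADIUS
    return x, y


def tile_key(x, y, zoom):
    """Return the 'tx,ty,z' key of the tile containing Mercator point (x, y)."""
    tiles_per_side = 2 ** zoom
    # Fractions of the world width/height, measured from the top-left corner.
    fx = (x + ORIGIN_SHIFT) / (2 * ORIGIN_SHIFT)
    fy = (ORIGIN_SHIFT - y) / (2 * ORIGIN_SHIFT)
    # Clamp so points exactly on the far edge stay in the last tile.
    tx = min(int(fx * tiles_per_side), tiles_per_side - 1)
    ty = min(int(fy * tiles_per_side), tiles_per_side - 1)
    return "%d,%d,%d" % (tx, ty, zoom)


def keys_for_point(x, y, zooms=range(4, 18)):
    """Yield one (key, value) pair per zoom level for a single point."""
    value = "%.5f %.5f" % (x, y)
    for zoom in zooms:
        yield tile_key(x, y, zoom), value


if __name__ == "__main__":
    for key, value in keys_for_point(-13224181.65427, 5981084.37214):
        print(key, value)
```

Emitting the same point once per zoom level multiplies the mapper output, but it means each reducer call sees every point it needs for one tile and nothing else.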
Sort
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort
10,22,6 -13224414.42332 5983539.01581
10,22,6 -13225723.87449 5981201.60336
10,22,6 -13225793.67181 5983127.53706
10,22,6 -13226046.70101 5983375.66839
10,22,6 -13226331.90155 5984272.31303
11138,22981,16 -13226331.90155 5984272.31303
11139,22983,16 -13225793.67181 5983127.53706
11139,22983,16 -13226046.70101 5983375.66839
11139,22986,16 -13225723.87449 5981201.60336
11141,22982,16 -13224414.42332 5983539.01581
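The sort matters because a streaming reducer only ever compares a line with the one before it. A small sketch of that behaviour, using `itertools.groupby` as the reducer on the next slide does; `group_sizes` is a hypothetical helper and the values `a`, `b`, `c` are placeholders:

```python
from itertools import groupby


def group_sizes(lines):
    """Return [(key, count)] for each run of consecutive lines sharing a key."""
    # Split each line into key and value on the first run of whitespace.
    keyed = (line.split(None, 1) for line in lines)
    return [(key, len(list(group)))
            for key, group in groupby(keyed, key=lambda kv: kv[0])]


if __name__ == "__main__":
    lines = ["10,22,6 a", "11139,22983,16 b", "10,22,6 c"]
    print(group_sizes(lines))          # tile 10,22,6 appears as two groups
    print(group_sizes(sorted(lines)))  # tile 10,22,6 appears as one group
```

The order is lexicographic rather than numeric, which is harmless here: the reducer needs equal keys to be adjacent, not tiles to be in any particular order.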
Reduce: Draw Tiles
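from itertools import groupby
from sys import stdin

import mapnik

def main():
    # Input arrives sorted, so all points for a tile are consecutive.
    for tile, points in groupby(read_points(stdin), lambda x: x[0]):
        zoom = get_zoom(tile)
        tile_map = init_map(zoom, points)
        tile_map.zoom_all()
        im = mapnik.Image(256, 256)
        mapnik.render(tile_map, im)
        emit(tile, encode_image(im))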
$ head -n1 input.csv | python -m tilebrute.sample_shapes | sort | head -n5 |
python -m tilebrute.draw_tiles
10,22,6 iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAYAAABccqhmAAADJ...+aBAAAAAElFTkSuQmCC
Write Output
public void write(Text tileId, Text tile) throws IOException {
  String[] tileIdSplits = tileId.toString().split(",");
  assert tileIdSplits.length == 3;
  String tx = tileIdSplits[0];
  String ty = tileIdSplits[1];
  String zoom = tileIdSplits[2];
  Path tilePath = new Path(outputPath, zoom + "/" + tx + "/" + ty + ".png");
  fs.mkdirs(tilePath.getParent());
  byte[] buf = Base64.decodeBase64(tile.toString());
  final FSDataOutputStream fout = fs.create(tilePath, progress);
  fout.write(buf);
  fout.close();
}
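For readers more at home in Python, the same key-to-path mapping and base64 decoding can be sketched against the local filesystem. This is an illustrative analogue only: it does not touch HDFS or S3, and `tile_path` and `write_tile` are hypothetical names.

```python
import base64
import os
import tempfile


def tile_path(output_dir, tile_id):
    """Map a 'tx,ty,z' key to '<output_dir>/<z>/<tx>/<ty>.png'."""
    # Unpacking raises ValueError unless the key has exactly three parts.
    tx, ty, zoom = tile_id.split(",")
    return os.path.join(output_dir, zoom, tx, ty + ".png")


def write_tile(output_dir, tile_id, encoded_tile):
    """Decode a base64 tile, write it at its z/x/y path, and return the path."""
    path = tile_path(output_dir, tile_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fout:
        fout.write(base64.b64decode(encoded_tile))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        # Placeholder bytes stand in for a rendered PNG.
        encoded = base64.b64encode(b"not-a-real-png").decode("ascii")
        print(write_tile(out_dir, "10,22,6", encoded))
```

Base64 keeps each tile on a single line of text between reducer and output format; the z/x/y directory layout is what lets the output directory be served directly as a tile set.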
To the Cloud!
Basic Services: EC2, S3
•  EC2: Elastic Compute Cloud
–  Virtual machines on demand
–  Different “instance types” with different hardware profiles
–  m1.large (2 cores, 7.5G), c1.xlarge (8 cores, 7G)
•  S3: Simple Storage Service
–  Distributed, replicated storage
–  Native Hadoop integration
–  Also exposed over http(s), easy tile hosting
Add-on Service: EMR
•  EMR: Elastic MapReduce
–  “Hadoop as a Service”
–  On-demand, pre-installed and configured Hadoop clusters
–  +1: standardized provisioning, deployment, monitoring
–  -1: “stable” (old) software
MrJob: Python for EMR
class TileBrute(MRJob):

    HADOOP_OUTPUT_FORMAT = 'tilebrute.hadoop.mapred.MapTileOutputFormat'

    def mapper_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.sample_shapes')

    def reducer_cmd(self):
        return bash_wrap('$PYTHON -m tilebrute.draw_tiles')
github.com/Yelp/mrjob
Results
14z, 2624x, 5722y
How much code?
$ find -f src -f bin | egrep '.(java|sh|py)$' | grep -v test | xargs cloc --quiet
http://cloc.sourceforge.net v 1.56 T=0.5 s (28.0 files/s, 1868.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           4             69            105            299
Bourne Shell                     8             51             85            210
Java                             2             25             16             74
-------------------------------------------------------------------------------
SUM:                            14            145            206            583
-------------------------------------------------------------------------------
Performance
•  1 x m1.large (2 cores)
–  195575 input features (WA state)
–  3 zoom levels (6, 7, 8)
–  1 hour
•  19 x c1.xlarge (152 cores)
–  308745538 input features (all data)
–  3 zoom levels (6, 7, 8)
–  3 hours 15 minutes
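Dividing the reported feature counts by core-hours gives a rough way to compare the two runs. This is back-of-the-envelope arithmetic on the numbers above, not a benchmark: it treats m1.large and c1.xlarge cores as interchangeable and ignores cluster startup and other fixed costs, both of which are simplifying assumptions. `features_per_core_hour` is a hypothetical helper.

```python
def features_per_core_hour(features, cores, hours):
    """Input features processed per core per hour of wall-clock time."""
    return features / (cores * hours)


if __name__ == "__main__":
    # Figures taken from the two runs listed above.
    single = features_per_core_hour(195575, cores=2, hours=1.0)
    cluster = features_per_core_hour(308745538, cores=152, hours=3.25)
    print("1 x m1.large:   %.0f features per core-hour" % single)
    print("19 x c1.xlarge: %.0f features per core-hour" % cluster)
    print("ratio:          %.1f" % (cluster / single))
```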
TODOs
•  Macro-level performance optimizations (configuration)
–  Balancing mappers and reducers, memory allocation, &c.
–  On-demand Hadoop means tuning the cluster to the application
•  Micro-level performance optimizations (code)
–  Smarter sampling logic
–  Mapnik API considerations
–  Multi-threaded S3 PUTs
–  https://forums.aws.amazon.com/thread.jspa?threadID=125135
•  Write tiles in MBTiles format
•  Write tiles to HBase
•  Compression!
•  Ogrbrute?
Thanks!
HBase in Action (Manning), by Nick Dimiduk and Amandeep Khurana, foreword by Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com