How to Process Spatial Big Data 
on Hadoop/MapReduce 
Department of Computer and Information Engineering 
Kunsan National University 
kwnam@kunsan.ac.kr 
Kwang Woo Nam 
http://mcalab.kunsan.ac.kr or http://u-gis.com
Introduction : History 
• FOSS4G 2012 
– Storing Twitter streams and real-time spatial knowledge discovery with open-source NoSQL (MongoDB) 
• FOSS4G 2013 
– Twitter spatial information analysis on Hadoop/HBase 
– National Research Foundation of Korea 
• 'Continuous Mining of Spatial Knowledge from Social Media Streams' (2013.06.01 - 2016.05.31) 
• Today : FOSS4G 2014 
– Tutorial: How to Process Spatial Big Data on MapReduce 
Spatial Data Processing on Hadoop 2
Contents 
• Hadoop and MapReduce 
• GeoHash : Simple Way to Process Spatial Big Data 
• Constructing a Hadoop Cluster 
• Spatial Big Data Systems 
Spatial Data Processing on Hadoop 3
Introduction : Architecture 
Social Data GeoAnalytics 
[Architecture diagram: Social Network (Twitter), Streaming API, Social Media Collector, MongoDB (Raw Data / Abstract Data / Word Dictionary), Social Data Stream Mining Manager, Social Data Analytics Manager, MapReduce, Hadoop/HBase Cluster] 
Spatial Data Processing on Hadoop 4 
Introduction : How Big? 
• Social Data Set 
– Twitter Data within North America(Twitter Streaming API) 
– Dec. 13, 2013 ~ Mar. 31, 2014 (109 days, 2,616 hours) 
– Number of Social Transactions : 350,019,315 transactions 
– Transactions/Hour : 133,799 tweets per hour 
– Dictionary : 80,098,306 words 
[Map of the collection area with corner coordinates (15, -150) and (90, -60) in (lat, long)] 
Spatial Data Processing on Hadoop 5
Introduction : How Big? 
• Data File Size 
– Total : 1.1 TB(in MongoDB) 
• Mediadata 
– 951.04 GB 
• Abstracted 
– 160.08 GB 
• Dictionary 
– 6.37 GB 
– HDD to HDD Copy Time : minimum 3 hours 
– HBase Insertion Time : about 20 hours 
• 18,000,000 rows/hour 
• Data characteristics 
– More than 90% of records include lat/long coordinates 
Spatial Data Processing on Hadoop 6
Hadoop/MapReduce 
• Hadoop 1.0 vs. Hadoop 2.0 
http://hortonworks.com/blog/apache-hadoop-2-is-ga/ 
Spatial Data Processing on Hadoop 7
Hadoop/MapReduce 
• Cluster and HDFS 
http://www.datascience-labs.com/hadoop/hdfs/ 
[Stack diagram: HBase, YARN/MapReduce, HDFS] 
Spatial Data Processing on Hadoop 8
Hadoop/MapReduce 
• MapReduce Process 
[Diagram: a user program submits a MapReduce job; map tasks read input from Hadoop (HDFS), and their output is grouped and passed to reduce tasks] 
http://xiaochongzhang.me/blog/?p=338 
Spatial Data Processing on Hadoop 9 
MapReduce 
• Collected Tweet Data 
• Twitter Inner Data Model 
[Data model diagram: user-tweet relationships with 1:n and n:m cardinalities, plus follower/following counts] 
Spatial Data Processing on Hadoop 10 
{ 
  "mid" : "1234567", 
  "filter_level":"medium", 
  "contributors":null, 
  "text":"역시 다 멋있다 #EVERYBODY", 
  "geo":{ 
    "type":"Point", 
    "coordinates":[37.3604652, 127.9554015] 
  }, 
  "retweeted":false, 
  "created_at":"Fri Oct 11 10:32:52 +0000 2013", 
  "lang":"ko", 
  "id":388613160678088700, 
  "retweet_count":0, 
  "favorite_count":0, 
  "id_str":"388613160678088704", 
  "user":{ 
    "lang":"ko", 
    "id":1394439386, 
    "verified":false, 
    "contributors_enabled":false, 
    "name":"은지", 
    "created_at":"Wed May 01 11:32:39 +0000 2013", 
    "geo_enabled":true, 
    "time_zone":"Seoul", 
    "follower_count":93, 
    "following_count":30, 
    "favorate_count": 105, 
    "id_str":"1394439386" 
  } 
} 
• Real Collected Twitter Data 
[Diagram labels: user, userlog, tw (tweet), fol, fav] 
Spatial Aggregation on MapReduce 
• Simplified Tweet Data Model 
tweet : (mid, userid, x, y, time, text) 
tweet : (mid, userid, x, y, time, {word,…}) 
• Simple MapReduce Example (a Java sketch follows below) 
– Input : (mid, userid, x, y, time, {word,…}) 
– Map : (userid1, mid1) (userid2, mid2) (userid3, mid3) (userid2, mid4) (userid3, mid5) (userid4, mid6) 
– Shuffle : (userid1, {mid1}) (userid2, {mid2, mid4}) (userid3, {mid3, mid5}) (userid4, {mid6}) 
– Reduce : (userid1, 1) (userid2, 2) (userid3, 2) (userid4, 1) 
Spatial Data Processing on Hadoop 11 
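A minimal Hadoop MapReduce sketch of the per-user count above (illustration only, not the presenter's code). It assumes each tweet arrives as one tab-separated line (mid, userid, x, y, time, words...); the mapper emits (userid, mid) and the reducer counts the grouped mids.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserTweetCount {

  // Map : (mid, userid, ...) -> (userid, mid)
  public static class TweetMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text userId = new Text();
    private final Text mid = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length < 2) return;                 // skip malformed lines
      mid.set(f[0]);
      userId.set(f[1]);
      context.write(userId, mid);
    }
  }

  // Reduce : (userid, {mid, mid, ...}) -> (userid, n)
  public static class CountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int n = 0;
      for (Text ignored : values) n++;
      context.write(key, new IntWritable(n));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "user tweet count");
    job.setJarByClass(UserTweetCount.class);
    job.setMapperClass(TweetMapper.class);
    job.setReducerClass(CountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the usual hadoop jar command, passing the HDFS input and output paths as the two arguments.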
Spatial Aggregation on MapReduce 
• Counting Tweets by Hour 
– Use time as a row key 
• Convert Time into TimeId (see the sketch below) 
– Map : (mid, userid, x, y, time, {word,…}) → (timeid, mid) 
– Shuffle : (timeid, {mid, mid, …}) 
– Reduce : (timeid, n) 
– Counting User Tweets by Hour 
• Map : (mid, userid, x, y, time, {word,…}) → (<userid,timeid>, mid) 
• Shuffle : (<userid,timeid>, {mid, mid, …}) 
• Reduce : (<userid,timeid>, n) 
• Finding Home/Visitor Position 
– Problem : How to make spatial keys? 
Spatial Data Processing on Hadoop 12 
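A small helper sketch (assumed class and method names, not from the slides) for the "Convert Time into TimeId" step: the tweet's created_at string is parsed with the Twitter date format shown earlier and truncated to whole hours since the epoch, and an underscore-joined string serves as the composite <userid,timeid> key.

import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class TimeIdUtil {
  // Twitter's created_at format, e.g. "Fri Oct 11 10:32:52 +0000 2013"
  private static final String PATTERN = "EEE MMM dd HH:mm:ss Z yyyy";

  // Returns the number of whole hours since the Unix epoch (UTC) as the TimeId.
  public static long toTimeId(String createdAt) throws java.text.ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    long epochMillis = fmt.parse(createdAt).getTime();
    return epochMillis / (60L * 60L * 1000L);          // truncate to the hour
  }

  // Composite key for "Counting User Tweets by Hour": userid and timeid joined in one string.
  public static String userTimeKey(String userId, String createdAt) throws java.text.ParseException {
    return userId + "_" + toTimeId(createdAt);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(toTimeId("Fri Oct 11 10:32:52 +0000 2013"));              // hourly bucket id
    System.out.println(userTimeKey("1394439386", "Fri Oct 11 10:32:52 +0000 2013"));
  }
}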
GEOHASH : spatial key for Hadoop 
• Real-world Social Data 
– (lat, long) => key value 
[Figure (a) social data in real world : objects o1-o20 and labels a-f plotted by (lat, long)] 
(b) Social Database 
oid wordset 
o1 Italian, restaurant, expensive 
o2 coffee, expensive, restaurant 
o3 Italian, pizza, expensive 
o4 restaurant, pizza, expensive 
o5 Italian, pizza, restaurant, expensive 
o6 coffee, restaurant, low-priced 
o7 Italian, coffee, low-priced, restaurant, pizza 
o8 coffee, restaurant, expensive 
o9 expensive, restaurant 
o10 pasta, pizza, expensive 
o11 pasta, low-priced, restaurant 
o12 Italian, restaurant, expensive 
o13 pizza, low-priced 
o14 tea, expensive, restaurant 
o15 Italian, restaurant 
o16 pasta, restaurant, expensive 
o17 pizza, restaurant, low-priced 
o18 Italian, pizza, restaurant 
o19 Italian, pasta, restaurant, expensive 
o20 pasta, expensive 
Spatial Data Processing on Hadoop 13 
GEOHASH : spatial key for Hadoop 
[Figure (b) Spatial social data and grid space : objects o1-o20 placed in a single-level grid with cells numbered 0-63] 
(c) Spatial Wordset Transaction Database 
oid wordset geo 
o1 Italian, restaurant, expensive 13 
o2 coffee, expensive, restaurant 18 
o3 Italian, pizza, expensive 12 
o4 restaurant, pizza, expensive 60 
o5 Italian, pizza, restaurant, expensive 12 
o6 coffee, restaurant, low-priced 35 
o7 Italian, coffee, low-priced, restaurant, pizza 44 
o8 coffee, restaurant, expensive 62 
o9 expensive, restaurant 15 
o10 pasta, pizza, expensive 12 
o11 pasta, low-priced, restaurant 13 
o12 Italian, restaurant, expensive 11 
o13 pizza, low-priced 44 
o14 tea, expensive, restaurant 15 
o15 Italian, restaurant 35 
o16 pasta, restaurant, expensive 15 
o17 pizza, restaurant, low-priced 18 
o18 Italian, pizza, restaurant 7 
o19 Italian, pasta, restaurant, expensive 60 
o20 pasta, expensive 62 
Single Level Grid Approach 
Spatial Data Processing on Hadoop 14 
GEOHASH : spatial key for Hadoop 
• GeoHash 
– A geohash is represented as a Base32-encoded string (see the encoder sketch below) 
Base32 : the interleaved lat/long bits are written with a 32-character alphabet 
(geohash uses the digits 0-9 and the letters b-z excluding i, l, o); every 5 bits become one character 
Spatial Data Processing on Hadoop 15 
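A minimal geohash encoder sketch in Java (an assumed helper, not the presenter's code): it bisects the longitude and latitude ranges alternately, starting with longitude, and writes every 5 interleaved bits as one character of the geohash Base32 alphabet.

public class GeoHash {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Encodes (lat, lon) into a geohash string of the given length (precision).
  public static String encode(double lat, double lon, int length) {
    double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
    StringBuilder hash = new StringBuilder();
    boolean evenBit = true;                    // bits alternate, starting with longitude
    int bit = 0, ch = 0;
    while (hash.length() < length) {
      if (evenBit) {                           // longitude bit
        double mid = (minLon + maxLon) / 2;
        if (lon >= mid) { ch = (ch << 1) | 1; minLon = mid; } else { ch = ch << 1; maxLon = mid; }
      } else {                                 // latitude bit
        double mid = (minLat + maxLat) / 2;
        if (lat >= mid) { ch = (ch << 1) | 1; minLat = mid; } else { ch = ch << 1; maxLat = mid; }
      }
      evenBit = !evenBit;
      if (++bit == 5) {                        // 5 bits -> one Base32 character
        hash.append(BASE32.charAt(ch));
        bit = 0;
        ch = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // The sample tweet earlier is geotagged at (37.3604652, 127.9554015)
    System.out.println(encode(37.3604652, 127.9554015, 6));   // a 6-character geohash starting with "wy"
  }
}

A call like encode(lat, lon, 5) gives a short cell key of the kind used as geoid in the aggregation slides that follow.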
GEOHASH : spatial key for Hadoop 
• GeoHash = Quad-tree 
[Quadtree figure: Grid Levels 1-4, each level splitting a cell into quadrants labeled 00, 01, 10, 11] 
ID of Miami, Florida? 11 11 10 = geohash : 7 
Base32 Table 
Spatial Data Processing on Hadoop 16 
GEOHASH : spatial key for Hadoop 
• GeoHash as a Rowkey 
– Why a geohash is a good choice for the row key 
① It is easy to compute 
② Its prefix plays a key role in finding nearest neighbors (a range-scan sketch follows below) 
– Drawbacks : prefix precision and boundary problems 
Spatial Data Processing on Hadoop 17
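A hedged sketch of point ② above: because HBase rows sort by key, one range scan over a geohash prefix returns every row inside the enclosing cell. The table name tweet_count, the column family cf, and the count column are assumptions modeled on the table design shown later, and the calls use the HBase 0.9x-era client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoPrefixScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "tweet_count");   // assumed table keyed by geohash

    // Rows whose geohash starts with "wydm" sort between "wydm" (inclusive) and "wydn" (exclusive),
    // so one scan covers the whole enclosing cell.
    Scan scan = new Scan(Bytes.toBytes("wydm"), Bytes.toBytes("wydn"));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // assumes the count is stored as an 8-byte long in cf:count
        long count = Bytes.toLong(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
        System.out.println(Bytes.toString(r.getRow()) + " : " + count);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}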
Spatial Word Aggregation 
• Spatial Aggregation on MapReduce 
– Counting Tweets by Spatial Cell (a mapper sketch follows below) 
• Map : (mid, userid, x, y, time, {word,…}) → (geoid, mid) 
• Shuffle : (geoid, {mid, mid, …}) 
• Reduce : (geoid, n) 
– Two Mapping Approaches for Extending the Spatial Area 
• original data : ('aaaaa', mid1) ('aaaab', mid2) ('aaaac', mid3) ('aaaad', mid4) ('aaaae', mid5) ('aaaba', mid6) ('aaabb', mid7) 
• Approach 1 (6bit) : ('aaaa', mid1) ('aaaa', mid2) ('aaaa', mid3) ('aaaa', mid4) ('aaaa', mid5) ('aaab', mid6) ('aaab', mid7) 
• Approach 2 (2bit) : ('aaaa', mid1) ('aaaa', mid2) ('aaaa', mid3) ('aaaa', mid4) ('aaae', mid5) ('aaab', mid6) ('aaab', mid7) 
Spatial Data Processing on Hadoop 18 
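A sketch of the cell-counting map step under the same assumed record layout as before, reusing the GeoHash.encode helper sketched earlier. Encoding directly at a coarser precision is equivalent to truncating a longer geohash, which is the idea behind the two approaches above; the mapper emits 1 per tweet instead of the mid, which yields the same count.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step of "Counting Tweets by Spatial Cell": emit (geoid, 1) per tweet.
// Assumes tab-separated records (mid, userid, x, y, time, words...).
public class GeoCellMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int PRECISION = 4;              // fewer characters -> larger spatial cell
  private static final IntWritable ONE = new IntWritable(1);
  private final Text geoId = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t");
    if (f.length < 4) return;                           // skip malformed records
    double x = Double.parseDouble(f[2]);                // longitude
    double y = Double.parseDouble(f[3]);                // latitude
    geoId.set(GeoHash.encode(y, x, PRECISION));         // truncated geohash as the spatial key
    context.write(geoId, ONE);
  }
}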
Spatial Word Aggregation 
• Spatial Aggregation on MapReduce 
– Finding Home/Visitor Position 
• Map : (mid, userid, x, y, time, {word,…}) → (<userid,geoid>, mid) 
• Shuffle : (<userid,geoid>, {mid, mid, …}) 
• Reduce : (<userid,geoid>, n) 
– Finding Spatial Word (a mapper sketch follows below) 
• Map : (mid, userid, x, y, time, {word,…}) → (<geoid, word>, mid) for each word in the tweet 
• Shuffle : (<geoid, word>, {mid, mid, …}) 
• Reduce : (<geoid, word>, n) 
Spatial Data Processing on Hadoop 19 
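A sketch of the Finding Spatial Word map step (same assumed record layout, with the words space-separated in the last field): one (<geoid, word>, 1) pair is emitted per word, and the composite key is packed into a single tab-joined string.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step of "Finding Spatial Word": one output pair per (cell, word).
// Assumes records (mid, userid, x, y, time, word1 word2 ...) and the GeoHash helper above.
public class GeoWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text compositeKey = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t");
    if (f.length < 6) return;
    String geoid = GeoHash.encode(Double.parseDouble(f[3]), Double.parseDouble(f[2]), 5);
    for (String word : f[5].split(" ")) {               // emit once per word in the tweet
      if (word.isEmpty()) continue;
      compositeKey.set(geoid + "\t" + word);             // <geoid, word> as a composite key
      context.write(compositeKey, ONE);
    }
  }
}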
Spatial Word Aggregation 
• Tables in Hadoop/HBase (a write sketch follows below) 
– Media Data : (ID, GeoTag, Timestamp, Content, GeoHash, Additional data) 
– Word Count : (Word, Count, Timestamp) 
– Tweet Count : (GeoHash, Count, Timestamp) 
– GeoWord Count : (GeoHash, Word, Count, Timestamp) 
Spatial Data Processing on Hadoop 20 
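A hedged sketch of writing one GeoWord Count row. The table name geoword_count and the column family cf are assumptions; the rowkey concatenates the geohash and the word, so all words of one spatial cell stay adjacent and remain reachable by the prefix scan shown earlier (again using the HBase 0.9x-era client API).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoWordWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "geoword_count");   // assumed table name

    String geohash = "wydm6";
    String word = "coffee";
    long count = 42L;

    // Rowkey = geohash + word: all words of one spatial cell share a common prefix.
    Put put = new Put(Bytes.toBytes(geohash + "_" + word));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(word));
    table.put(put);
    table.close();
  }
}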
GeoSocial Analytics System 
[System diagram: Client, Web Server (Vert.x), Query Application (MetaTable / Ranking / Count), Spatial Word Aggregation tables for Global (Tweet, Word, GeoWord) and Local (Tweet, GeoWord), on Hadoop with an ONLINE layer (HBase) and a BATCH layer (YARN / MapReduce) over HDFS2] 
Spatial Data Processing on Hadoop 21 
Constructing a Hadoop Cluster 
• A slightly different topic, but a supercomputer is also a cluster! 
Tianhe-2 (MilkyWay-2) 
China 
E5-2690: 12 cores 
Passmark : about 16,000 
CPUs : 260,000ea 
Spatial Data Processing on Hadoop 22 
Constructing a Hadoop Cluster 
• Inside the Milky Way-2 supercomputer 
Spatial Data Processing on Hadoop 23
Constructing a Hadoop Cluster 
• Cluster(Hadoop or Whatever) 
Source: Yahoo! Hadoop Cluster 
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 
Spatial Data Processing on Hadoop 24 
Constructing a Hadoop Cluster 
Spatial Data Processing on Hadoop 25 
Source: Google! 
http://www.wired.com/2012/10/ff-inside-google-data-center/all/ 
Video: http://youtu.be/avP5d16wEp0
Constructing a Hadoop Cluster 
Raspberry Hadoop Cluster(6 node) 
board: 41,850, SD(32G) :  13,950 
4 node : 223,200 
6 node : 334,800 
8 node : 446,400 
Cubieboard3 Hadoop Cluster(8 node) 
board: 137,000, SD(32G) :  13,950 
4 node : 603,800 
6 node : 905,700 
8 node : 1,207,600 
Spatial Data Processing on Hadoop 26
Constructing a Hadoop Cluster 
• DIY Hadoop Cluster 
http://serverfault.com/questions/463670/diy-hadoop-cluster-heat-dust-issues 
http://www.scl.ameslab.gov/Projects/parallel_computing/cluster_examples.html 
Spatial Data Processing on Hadoop 27
Constructing a Hadoop Cluster 
• We have no money! 
Spatial Data Processing on Hadoop 28
Hadoop Cluster : Our Approach 
Seagate Barracuda 3TB : 117,310 
Spatial Data Processing on Hadoop 29 
2cpu.co.kr(2014.07.04) 
CPU Passmark 
Xeon X3440 : 4,429(200,000) 
i7 4790 : 10,198(320,000) 
i5 4690 : 7,741(227,000) 
Ultra Cheap Hadoop Cluster 
4 nodes : (220,000 + 117,310) KRW × 4 = 1,349,240 KRW 
8 nodes : (220,000 + 117,310) KRW × 8 = 2,698,480 KRW 
1G switch : 195,000 KRW 
Hadoop Cluster : Our Approach 
Spatial Data Processing on Hadoop 30 
2cpu.co.kr(2014.07.14) 
CPU Passmark 
Xeon X5675 : 8,095(200,000) 
i7 4790 : 10,198(320,000) 
i5 4690 : 7,741(227,000) 
Memory 2G : 10,000
Hadoop Cluster : Our Approach 
• Cluster Configuration 
– 1 name node : i7, 16GB RAM, 3TB HDD 
– 5 data nodes : i5, 4GB RAM, 3TB HDD 
Spatial Data Processing on Hadoop 31
Test 
• Simple Result 
– Input : 20,000,000 tweets 
– Number of Words : 7,500,000 words 
– Per-phase running time in seconds (reconstructed from the chart) 
Job : Map / Shuffle / Merge / Reduce (Total) 
Num of GeoTweet : 4,119 / 0 / 669 / 657 (about 90 min) 
Num of Word : 167 / 2 / 148 / 149 (about 8 min) 
Num of GeoWord : 400 / 0 / 92 / 110 (about 10 min) 
Spatial Data Processing on Hadoop 32 
Systems: ESRI GIS tools for Hadoop 
• Three GIS Tools by ESRI 
– Geoprocessing Tools 
– Spatial Framework for Hadoop 
• Extension of Hive SQL 
– ESRI Geometry API 
Spatial Data Processing on Hadoop 33
Systems: ESRI GIS tools for Hadoop 
• ESRI Spatial Framework for Hadoop 
– Extension of Hive SQL 
• https://github.com/Esri/gis-tools-for-hadoop/ 
tree/master/samples/point-in-polygon-aggregation-hive 
– Spatial Function Import 
add jar 
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar 
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar; 
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'; 
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains'; 
– CREATE TABLE 
CREATE EXTERNAL TABLE earthquakes 
(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/earthquake-data'; 
Spatial Data Processing on Hadoop 34
Systems: ESRI GIS tools for Hadoop 
• ESRI Spatial Framework for Hadoop 
– CREATE TABLE (input/output format) 
CREATE EXTERNAL TABLE IF NOT EXISTS counties 
(Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary) 
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde' 
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/counties-data'; 
– HIVE Spatial SQL Queries 
SELECT counties.name, count(*) cnt 
FROM counties JOIN earthquakes 
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude)) 
GROUP BY counties.name 
ORDER BY cnt desc; 
– Hive does not support CRUD (INSERT/UPDATE/DELETE) in the current 1.3 version 
• PLAN : 1.4 
• No Spatial Index 
Spatial Data Processing on Hadoop 35
Systems: SpatialHadoop 
• SpatialHadoop : Spatial Support on Hadoop 
– University of Minnesota (Prof. Mohamed Mokbel) 
• http://spatialhadoop.cs.umn.edu/ 
– using ESRI Geometry API 
A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data (VLDB 2013) 
Spatial Data Processing on Hadoop 36
Systems: SpatialHadoop 
– Generation of Test Data 
%shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect 
– Creation of Spatial Index(R-tree and GRID) 
%shadoop index test.rects sindex:grid test.grid 
– Range Query 
%shadoop rangequery test.grid rq_results rect:500,500,1000,1000 
Spatial Data Processing on Hadoop 37
Systems: SpatialHadoop 
– k-NN Query 
%shadoop knn test.grid knn_results point:1000,1000 k:1000 
– Join Query(Distributed Join) 
%shadoop dj test.grid test2.grid sj_results 
– Other Supported Operations 
• sjmr : parallelized spatial join with MapReduce 
• convexhull 
• skyline 
• closestpair/farthestpair 
• plot 
Spatial Data Processing on Hadoop 38
Systems: Pigeon 
• Pigeon : Spatial Support in Pig 
– University of Minnesota (Prof. Mohamed Mokbel) 
– Pig Query (Non Spatial) 
points = LOAD 'points' AS (id:long, lon:double, lat:double); 
results = FILTER points BY 
lon < -93.158 AND lon > -93.175 AND 
lat > 45.0077 AND lat < 45.0164; 
STORE results INTO 'results'; 
– Pigeon Query (Spatial) 
IMPORT 'pigeon_import.pig'; 
points = LOAD 'points-pigeon' AS (id:long, location); 
results = FILTER points BY 
ST_Contains(ST_MakeBox(-93.175, 45.0077, -93.158, 45.0164), location); 
STORE results INTO 'results-pigeon'; 
Spatial Data Processing on Hadoop 39 
Conclusion 
• Hadoop and MapReduce 
• GeoHash : Simple Way to Process Spatial Big Data 
• Constructing a Hadoop Cluster 
• Spatial Big Data Systems 
Spatial Data Processing on Hadoop 40
