How to Process Spatial Big Data 
on Hadoop/MapReduce 
Department of Computer and Information Engineering 
Kunsan National University 
kwnam@kunsan.ac.kr 
Kwang Woo Nam 
http://mcalab.kunsan.ac.kr or http://u-gis.com
Introduction : History 
• FOSS4G 2012 
– Storing Twitter streams and real-time spatial knowledge discovery with open-source NoSQL (MongoDB) 
• FOSS4G 2013 
– Twitter spatial information analysis on Hadoop/HBase 
– National Research Foundation of Korea 
• 'Continuous Mining of Spatial Knowledge from Social Media Streams' (2013.06.01 - 2016.05.31) 
• Today : FOSS4G 2014 
– Tutorial: How to Process Spatial Big Data on MapReduce 
Spatial Data Processing on Hadoop 2
Contents 
• Hadoop and MapReduce 
• GeoHash : Simple Way to Process Spatial Big Data 
• Constructing a Hadoop Cluster 
• Spatial Big Data Systems 
Spatial Data Processing on Hadoop 3
Introduction : Architecture 
Social Data GeoAnalytics 
[Architecture diagram: Social Network (Twitter), Streaming API, Social Media Collector, MongoDB (Raw Data / Abstract Data / Word Dictionary), Social Data Stream Mining Manager, Social Data Analytics Manager, MapReduce, Hadoop/HBase Cluster] 
Spatial Data Processing on Hadoop 4 
Introduction : How Big? 
• Social Data Set 
– Twitter Data within North America(Twitter Streaming API) 
– Dec. 13, 2013 ~ Mar. 31, 2014 (109 days, 2,616 hours) 
– Number of Social Transactions : 350,019,315 transactions 
– Transactions/Hour : 133,799 tweets per hour 
– Dictionary : 80,098,306 words 
[Map of the collection area with corner coordinates (15, -150) and (90, -60) in (lat, long)] 
Spatial Data Processing on Hadoop 5
Introduction : How Big? 
• Data File Size 
– Total : 1.1 TB(in MongoDB) 
• Mediadata 
– 951.04 GB 
• Abstracted 
– 160.08 GB 
• Dictionary 
– 6.37 GB 
– HDD to HDD Copy Time : minimum 3 hours 
– HBase Insertion Time : about 20 hours 
• 18,000,000 rows/hour 
• Data characteristics 
– More than 90% of records include lat/long coordinates 
Spatial Data Processing on Hadoop 6
Hadoop/MapReduce 
• Hadoop 1.0 vs. Hadoop 2.0 
http://hortonworks.com/blog/apache-hadoop-2-is-ga/ 
Spatial Data Processing on Hadoop 7
Hadoop/MapReduce 
• Cluster and HDFS 
http://www.datascience-labs.com/hadoop/hdfs/ 
[Stack diagram: HBase, YARN/MapReduce, HDFS] 
Spatial Data Processing on Hadoop 8
Hadoop/MapReduce 
• MapReduce Process 
[Diagram: a user program submits a MapReduce job; map tasks read input from Hadoop (HDFS), and their output is grouped and passed to reduce tasks] 
http://xiaochongzhang.me/blog/?p=338 
Spatial Data Processing on Hadoop 9 
MapReduce 
• Collected Tweet Data 
• Twitter Inner Data Model 
[Data model diagram: user-tweet relationships with 1:n and n:m cardinalities, plus follower/following counts] 
Spatial Data Processing on Hadoop 10 
{ 
  "mid" : "1234567", 
  "filter_level":"medium", 
  "contributors":null, 
  "text":"역시 다 멋있다 #EVERYBODY", 
  "geo":{ 
    "type":"Point", 
    "coordinates":[37.3604652, 127.9554015] 
  }, 
  "retweeted":false, 
  "created_at":"Fri Oct 11 10:32:52 +0000 2013", 
  "lang":"ko", 
  "id":388613160678088700, 
  "retweet_count":0, 
  "favorite_count":0, 
  "id_str":"388613160678088704", 
  "user":{ 
    "lang":"ko", 
    "id":1394439386, 
    "verified":false, 
    "contributors_enabled":false, 
    "name":"은지", 
    "created_at":"Wed May 01 11:32:39 +0000 2013", 
    "geo_enabled":true, 
    "time_zone":"Seoul", 
    "follower_count":93, 
    "following_count":30, 
    "favorate_count": 105, 
    "id_str":"1394439386" 
  } 
} 
• Real Collected Twitter Data 
[Diagram labels: user, userlog, tw (tweet), fol, fav] 
Spatial Aggregation on MapReduce 
• Simplified Tweet Data Model 
tweet : (mid, userid, x, y, time, text) 
tweet : (mid, userid, x, y, time, {word,…}) 
• Simple MapReduce Example (a Java sketch follows below) 
– Input : (mid, userid, x, y, time, {word,…}) 
– Map : (userid1, mid1) (userid2, mid2) (userid3, mid3) (userid2, mid4) (userid3, mid5) (userid4, mid6) 
– Shuffle : (userid1, {mid1}) (userid2, {mid2, mid4}) (userid3, {mid3, mid5}) (userid4, {mid6}) 
– Reduce : (userid1, 1) (userid2, 2) (userid3, 2) (userid4, 1) 
Spatial Data Processing on Hadoop 11 
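A minimal Hadoop MapReduce sketch of the per-user count above (illustration only, not the presenter's code). It assumes each tweet arrives as one tab-separated line (mid, userid, x, y, time, words...); the mapper emits (userid, mid) and the reducer counts the grouped mids.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserTweetCount {

  // Map : (mid, userid, ...) -> (userid, mid)
  public static class TweetMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text userId = new Text();
    private final Text mid = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length < 2) return;                 // skip malformed lines
      mid.set(f[0]);
      userId.set(f[1]);
      context.write(userId, mid);
    }
  }

  // Reduce : (userid, {mid, mid, ...}) -> (userid, n)
  public static class CountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int n = 0;
      for (Text ignored : values) n++;
      context.write(key, new IntWritable(n));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "user tweet count");
    job.setJarByClass(UserTweetCount.class);
    job.setMapperClass(TweetMapper.class);
    job.setReducerClass(CountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the usual hadoop jar command, passing the HDFS input and output paths as the two arguments.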
Spatial Aggregation on MapReduce 
• Counting Tweets by Hour 
– Use time as a row key 
• Convert Time into TimeId (see the sketch below) 
– Map : (mid, userid, x, y, time, {word,…}) → (timeid, mid) 
– Shuffle : (timeid, {mid, mid, …}) 
– Reduce : (timeid, n) 
– Counting User Tweets by Hour 
• Map : (mid, userid, x, y, time, {word,…}) → (<userid,timeid>, mid) 
• Shuffle : (<userid,timeid>, {mid, mid, …}) 
• Reduce : (<userid,timeid>, n) 
• Finding Home/Visitor Position 
– Problem : How to make spatial keys? 
Spatial Data Processing on Hadoop 12 
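A small helper sketch (assumed class and method names, not from the slides) for the "Convert Time into TimeId" step: the tweet's created_at string is parsed with the Twitter date format shown earlier and truncated to whole hours since the epoch, and an underscore-joined string serves as the composite <userid,timeid> key.

import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class TimeIdUtil {
  // Twitter's created_at format, e.g. "Fri Oct 11 10:32:52 +0000 2013"
  private static final String PATTERN = "EEE MMM dd HH:mm:ss Z yyyy";

  // Returns the number of whole hours since the Unix epoch (UTC) as the TimeId.
  public static long toTimeId(String createdAt) throws java.text.ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    long epochMillis = fmt.parse(createdAt).getTime();
    return epochMillis / (60L * 60L * 1000L);          // truncate to the hour
  }

  // Composite key for "Counting User Tweets by Hour": userid and timeid joined in one string.
  public static String userTimeKey(String userId, String createdAt) throws java.text.ParseException {
    return userId + "_" + toTimeId(createdAt);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(toTimeId("Fri Oct 11 10:32:52 +0000 2013"));              // hourly bucket id
    System.out.println(userTimeKey("1394439386", "Fri Oct 11 10:32:52 +0000 2013"));
  }
}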
GEOHASH : spatial key for Hadoop 
• Real-world Social Data 
– (lat, long) => key value 
[Figure (a) social data in real world : objects o1-o20 and labels a-f plotted by (lat, long)] 
(b) Social Database 
oid wordset 
o1 Italian, restaurant, expensive 
o2 coffee, expensive, restaurant 
o3 Italian, pizza, expensive 
o4 restaurant, pizza, expensive 
o5 Italian, pizza, restaurant, expensive 
o6 coffee, restaurant, low-priced 
o7 Italian, coffee, low-priced, restaurant, pizza 
o8 coffee, restaurant, expensive 
o9 expensive, restaurant 
o10 pasta, pizza, expensive 
o11 pasta, low-priced, restaurant 
o12 Italian, restaurant, expensive 
o13 pizza, low-priced 
o14 tea, expensive, restaurant 
o15 Italian, restaurant 
o16 pasta, restaurant, expensive 
o17 pizza, restaurant, low-priced 
o18 Italian, pizza, restaurant 
o19 Italian, pasta, restaurant, expensive 
o20 pasta, expensive 
Spatial Data Processing on Hadoop 13 
GEOHASH : spatial key for Hadoop 
[Figure (b) Spatial social data and grid space : objects o1-o20 placed in a single-level grid with cells numbered 0-63] 
(c) Spatial Wordset Transaction Database 
oid wordset geo 
o1 Italian, restaurant, expensive 13 
o2 coffee, expensive, restaurant 18 
o3 Italian, pizza, expensive 12 
o4 restaurant, pizza, expensive 60 
o5 Italian, pizza, restaurant, expensive 12 
o6 coffee, restaurant, low-priced 35 
o7 Italian, coffee, low-priced, restaurant, pizza 44 
o8 coffee, restaurant, expensive 62 
o9 expensive, restaurant 15 
o10 pasta, pizza, expensive 12 
o11 pasta, low-priced, restaurant 13 
o12 Italian, restaurant, expensive 11 
o13 pizza, low-priced 44 
o14 tea, expensive, restaurant 15 
o15 Italian, restaurant 35 
o16 pasta, restaurant, expensive 15 
o17 pizza, restaurant, low-priced 18 
o18 Italian, pizza, restaurant 7 
o19 Italian, pasta, restaurant, expensive 60 
o20 pasta, expensive 62 
Single Level Grid Approach 
Spatial Data Processing on Hadoop 14 
GEOHASH : spatial key for Hadoop 
• GeoHash 
– A geohash is represented as a Base32-encoded string (see the encoder sketch below) 
Base32 : the interleaved lat/long bits are written with a 32-character alphabet 
(geohash uses the digits 0-9 and the letters b-z excluding i, l, o); every 5 bits become one character 
Spatial Data Processing on Hadoop 15 
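A minimal geohash encoder sketch in Java (an assumed helper, not the presenter's code): it bisects the longitude and latitude ranges alternately, starting with longitude, and writes every 5 interleaved bits as one character of the geohash Base32 alphabet.

public class GeoHash {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  // Encodes (lat, lon) into a geohash string of the given length (precision).
  public static String encode(double lat, double lon, int length) {
    double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
    StringBuilder hash = new StringBuilder();
    boolean evenBit = true;                    // bits alternate, starting with longitude
    int bit = 0, ch = 0;
    while (hash.length() < length) {
      if (evenBit) {                           // longitude bit
        double mid = (minLon + maxLon) / 2;
        if (lon >= mid) { ch = (ch << 1) | 1; minLon = mid; } else { ch = ch << 1; maxLon = mid; }
      } else {                                 // latitude bit
        double mid = (minLat + maxLat) / 2;
        if (lat >= mid) { ch = (ch << 1) | 1; minLat = mid; } else { ch = ch << 1; maxLat = mid; }
      }
      evenBit = !evenBit;
      if (++bit == 5) {                        // 5 bits -> one Base32 character
        hash.append(BASE32.charAt(ch));
        bit = 0;
        ch = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // The sample tweet earlier is geotagged at (37.3604652, 127.9554015)
    System.out.println(encode(37.3604652, 127.9554015, 6));   // a 6-character geohash starting with "wy"
  }
}

A call like encode(lat, lon, 5) gives a short cell key of the kind used as geoid in the aggregation slides that follow.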
GEOHASH : spatial key for Hadoop 
• GeoHash = Quad-tree 
[Quadtree figure: Grid Levels 1-4, each level splitting a cell into quadrants labeled 00, 01, 10, 11] 
ID of Miami, Florida? 11 11 10 = geohash : 7 
Base32 Table 
Spatial Data Processing on Hadoop 16 
GEOHASH : spatial key for Hadoop 
• GeoHash as a Rowkey 
– Why a geohash is a good choice for the row key 
① It is easy to compute 
② Its prefix plays a key role in finding nearest neighbors (a range-scan sketch follows below) 
– Drawbacks : prefix precision and boundary problems 
Spatial Data Processing on Hadoop 17
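A hedged sketch of point ② above: because HBase rows sort by key, one range scan over a geohash prefix returns every row inside the enclosing cell. The table name tweet_count, the column family cf, and the count column are assumptions modeled on the table design shown later, and the calls use the HBase 0.9x-era client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoPrefixScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "tweet_count");   // assumed table keyed by geohash

    // Rows whose geohash starts with "wydm" sort between "wydm" (inclusive) and "wydn" (exclusive),
    // so one scan covers the whole enclosing cell.
    Scan scan = new Scan(Bytes.toBytes("wydm"), Bytes.toBytes("wydn"));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // assumes the count is stored as an 8-byte long in cf:count
        long count = Bytes.toLong(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count")));
        System.out.println(Bytes.toString(r.getRow()) + " : " + count);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}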
Spatial Word Aggregation 
• Spatial Aggregation on MapReduce 
– Counting Tweets by Spatial Cell (a mapper sketch follows below) 
• Map : (mid, userid, x, y, time, {word,…}) → (geoid, mid) 
• Shuffle : (geoid, {mid, mid, …}) 
• Reduce : (geoid, n) 
– Two Mapping Approaches for Extending the Spatial Area 
• original data : ('aaaaa', mid1) ('aaaab', mid2) ('aaaac', mid3) ('aaaad', mid4) ('aaaae', mid5) ('aaaba', mid6) ('aaabb', mid7) 
• Approach 1 (6bit) : ('aaaa', mid1) ('aaaa', mid2) ('aaaa', mid3) ('aaaa', mid4) ('aaaa', mid5) ('aaab', mid6) ('aaab', mid7) 
• Approach 2 (2bit) : ('aaaa', mid1) ('aaaa', mid2) ('aaaa', mid3) ('aaaa', mid4) ('aaae', mid5) ('aaab', mid6) ('aaab', mid7) 
Spatial Data Processing on Hadoop 18 
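A sketch of the cell-counting map step under the same assumed record layout as before, reusing the GeoHash.encode helper sketched earlier. Encoding directly at a coarser precision is equivalent to truncating a longer geohash, which is the idea behind the two approaches above; the mapper emits 1 per tweet instead of the mid, which yields the same count.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step of "Counting Tweets by Spatial Cell": emit (geoid, 1) per tweet.
// Assumes tab-separated records (mid, userid, x, y, time, words...).
public class GeoCellMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int PRECISION = 4;              // fewer characters -> larger spatial cell
  private static final IntWritable ONE = new IntWritable(1);
  private final Text geoId = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t");
    if (f.length < 4) return;                           // skip malformed records
    double x = Double.parseDouble(f[2]);                // longitude
    double y = Double.parseDouble(f[3]);                // latitude
    geoId.set(GeoHash.encode(y, x, PRECISION));         // truncated geohash as the spatial key
    context.write(geoId, ONE);
  }
}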
Spatial Word Aggregation 
• Spatial Aggregation on MapReduce 
– Finding Home/Visitor Position 
• Map : (mid, userid, x, y, time, {word,…}) → (<userid,geoid>, mid) 
• Shuffle : (<userid,geoid>, {mid, mid, …}) 
• Reduce : (<userid,geoid>, n) 
– Finding Spatial Word (a mapper sketch follows below) 
• Map : (mid, userid, x, y, time, {word,…}) → (<geoid, word>, mid) for each word in the tweet 
• Shuffle : (<geoid, word>, {mid, mid, …}) 
• Reduce : (<geoid, word>, n) 
Spatial Data Processing on Hadoop 19 
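A sketch of the Finding Spatial Word map step (same assumed record layout, with the words space-separated in the last field): one (<geoid, word>, 1) pair is emitted per word, and the composite key is packed into a single tab-joined string.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step of "Finding Spatial Word": one output pair per (cell, word).
// Assumes records (mid, userid, x, y, time, word1 word2 ...) and the GeoHash helper above.
public class GeoWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text compositeKey = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t");
    if (f.length < 6) return;
    String geoid = GeoHash.encode(Double.parseDouble(f[3]), Double.parseDouble(f[2]), 5);
    for (String word : f[5].split(" ")) {               // emit once per word in the tweet
      if (word.isEmpty()) continue;
      compositeKey.set(geoid + "\t" + word);             // <geoid, word> as a composite key
      context.write(compositeKey, ONE);
    }
  }
}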
Spatial Word Aggregation 
• Tables in Hadoop/HBase (a write sketch follows below) 
– Media Data : (ID, GeoTag, Timestamp, Content, GeoHash, Additional data) 
– Word Count : (Word, Count, Timestamp) 
– Tweet Count : (GeoHash, Count, Timestamp) 
– GeoWord Count : (GeoHash, Word, Count, Timestamp) 
Spatial Data Processing on Hadoop 20 
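A hedged sketch of writing one GeoWord Count row. The table name geoword_count and the column family cf are assumptions; the rowkey concatenates the geohash and the word, so all words of one spatial cell stay adjacent and remain reachable by the prefix scan shown earlier (again using the HBase 0.9x-era client API).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class GeoWordWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "geoword_count");   // assumed table name

    String geohash = "wydm6";
    String word = "coffee";
    long count = 42L;

    // Rowkey = geohash + word: all words of one spatial cell share a common prefix.
    Put put = new Put(Bytes.toBytes(geohash + "_" + word));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("word"), Bytes.toBytes(word));
    table.put(put);
    table.close();
  }
}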
GeoSocial Analytics System 
[System diagram: Client, Web Server (Vert.x), Query Application (MetaTable / Ranking / Count), Spatial Word Aggregation tables for Global (Tweet, Word, GeoWord) and Local (Tweet, GeoWord), on Hadoop with an ONLINE layer (HBase) and a BATCH layer (YARN / MapReduce) over HDFS2] 
Spatial Data Processing on Hadoop 21 
Constructing a Hadoop Cluster 
• A slightly different topic, but a supercomputer is also a cluster! 
Tianhe-2 (MilkyWay-2) 
China 
E5-2690: 12 cores 
Passmark : about 16,000 
CPUs : 260,000ea 
Spatial Data Processing on Hadoop 22 
Constructing a Hadoop Cluster 
• Inside the Milky Way-2 supercomputer 
Spatial Data Processing on Hadoop 23
Constructing a Hadoop Cluster 
• Cluster(Hadoop or Whatever) 
Source: Yahoo! Hadoop Cluster 
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 
Spatial Data Processing on Hadoop 24 
Constructing a Hadoop Cluster 
Spatial Data Processing on Hadoop 25 
Source: Google! 
http://www.wired.com/2012/10/ff-inside-google-data-center/all/ 
Video: http://youtu.be/avP5d16wEp0
Constructing a Hadoop Cluster 
Raspberry Hadoop Cluster(6 node) 
board: 41,850, SD(32G) :  13,950 
4 node : 223,200 
6 node : 334,800 
8 node : 446,400 
Cubieboard3 Hadoop Cluster(8 node) 
board: 137,000, SD(32G) :  13,950 
4 node : 603,800 
6 node : 905,700 
8 node : 1,207,600 
Spatial Data Processing on Hadoop 26
Constructing a Hadoop Cluster 
• DIY Hadoop Cluster 
http://serverfault.com/questions/463670/diy-hadoop-cluster-heat-dust-issues 
http://www.scl.ameslab.gov/Projects/parallel_computing/cluster_examples.html 
Spatial Data Processing on Hadoop 27
Constructing a Hadoop Cluster 
• We have no money! 
Spatial Data Processing on Hadoop 28
Hadoop Cluster : Our Approach 
Seagate Barracuda 3TB : 117,310 
Spatial Data Processing on Hadoop 29 
2cpu.co.kr(2014.07.04) 
CPU Passmark 
Xeon X3440 : 4,429(200,000) 
i7 4790 : 10,198(320,000) 
i5 4690 : 7,741(227,000) 
Ultra Cheap Hadoop Cluster 
4 nodes : (220,000 + 117,310) KRW × 4 = 1,349,240 KRW 
8 nodes : (220,000 + 117,310) KRW × 8 = 2,698,480 KRW 
1G switch : 195,000 KRW 
Hadoop Cluster : Our Approach 
Spatial Data Processing on Hadoop 30 
2cpu.co.kr(2014.07.14) 
CPU Passmark 
Xeon X5675 : 8,095(200,000) 
i7 4790 : 10,198(320,000) 
i5 4690 : 7,741(227,000) 
Memory 2G : 10,000
Hadoop Cluster : Our Approach 
• Cluster Configuration 
– 1 name node : i7, 16GB RAM, 3TB HDD 
– 5 data nodes : i5, 4GB RAM, 3TB HDD 
Spatial Data Processing on Hadoop 31
Test 
• Simple Result 
– Input : 20,000,000 tweets 
– Number of Words : 7,500,000 words 
– Per-phase running time in seconds (reconstructed from the chart) 
Job : Map / Shuffle / Merge / Reduce (Total) 
Num of GeoTweet : 4,119 / 0 / 669 / 657 (about 90 min) 
Num of Word : 167 / 2 / 148 / 149 (about 8 min) 
Num of GeoWord : 400 / 0 / 92 / 110 (about 10 min) 
Spatial Data Processing on Hadoop 32 
Systems: ESRI GIS tools for Hadoop 
• Three GIS Tools by ESRI 
– Geoprocessing Tools 
– Spatial Framework for Hadoop 
• Extension of Hive SQL 
– ESRI Geometry API 
Spatial Data Processing on Hadoop 33
Systems: ESRI GIS tools for Hadoop 
• ESRI Spatial Framework for Hadoop 
– Extension of Hive SQL 
• https://github.com/Esri/gis-tools-for-hadoop/ 
tree/master/samples/point-in-polygon-aggregation-hive 
– Spatial Function Import 
add jar 
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar 
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar; 
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'; 
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains'; 
– CREATE TABLE 
CREATE EXTERNAL TABLE earthquakes 
(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/earthquake-data'; 
Spatial Data Processing on Hadoop 34
Systems: ESRI GIS tools for Hadoop 
• ESRI Spatial Framework for Hadoop 
– CREATE TABLE (input/output format) 
CREATE EXTERNAL TABLE IF NOT EXISTS counties 
(Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary) 
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde' 
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/counties-data'; 
– HIVE Spatial SQL Queries 
SELECT counties.name, count(*) cnt 
FROM counties JOIN earthquakes 
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude)) 
GROUP BY counties.name 
ORDER BY cnt desc; 
– Hive does not support CRUD (INSERT/UPDATE/DELETE) in the current 1.3 version 
• PLAN : 1.4 
• No Spatial Index 
Spatial Data Processing on Hadoop 35
Systems: SpatialHadoop 
• SpatialHadoop : Spatial Support on Hadoop 
– University of Minnesota (Prof. Mohamed Mokbel) 
• http://spatialhadoop.cs.umn.edu/ 
– using ESRI Geometry API 
A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data (VLDB 2013) 
Spatial Data Processing on Hadoop 36
Systems: SpatialHadoop 
– Generation of Test Data 
%shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect 
– Creation of Spatial Index(R-tree and GRID) 
%shadoop index test.rects sindex:grid test.grid 
– Range Query 
%shadoop rangequery test.grid rq_results rect:500,500,1000,1000 
Spatial Data Processing on Hadoop 37
Systems: SpatialHadoop 
– k-NN Query 
%shadoop knn test.grid knn_results point:1000,1000 k:1000 
– Join Query(Distributed Join) 
%shadoop dj test.grid test2.grid sj_results 
– Other Supported Operations 
• sjmr : parallelized spatial join with MapReduce 
• convexhull 
• skyline 
• closestpair/farthestpair 
• plot 
Spatial Data Processing on Hadoop 38
Systems: Pigeon 
• Pigeon : Spatial Support in Pig 
– University of Minnesota (Prof. Mohamed Mokbel) 
– Pig Query (Non Spatial) 
points = LOAD 'points' AS (id:long, lon:double, lat:double); 
results = FILTER points BY 
lon < -93.158 AND lon > -93.175 AND 
lat > 45.0077 AND lat < 45.0164; 
STORE results INTO 'results'; 
– Pigeon Query (Spatial) 
IMPORT 'pigeon_import.pig'; 
points = LOAD 'points-pigeon' AS (id:long, location); 
results = FILTER points BY 
ST_Contains(ST_MakeBox(-93.175, 45.0077, -93.158, 45.0164), location); 
STORE results INTO 'results-pigeon'; 
Spatial Data Processing on Hadoop 39 
Conclusion 
• Hadoop and MapReduce 
• GeoHash : Simple Way to Process Spatial Big Data 
• Constructing a Hadoop Cluster 
• Spatial Big Data Systems 
Spatial Data Processing on Hadoop 40
