High Performance Spatial-Temporal Trajectory Analysis with Spark

© 2016 IBM Corporation
YongHua (Henry) Zeng
zengyh@cn.ibm.com
Big Data & Analytics Solution Architect
Analytics Platform Services, IBM China Lab

Agenda
• Background
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and next steps

Background Introduction
Studying human trajectories with mobile signaling data
Problem
• Traditional planning cannot handle the variety of data now available
• Much of this data has the characteristics of big data (volume, velocity, variety)
• Cellular signaling data is a typical example; it can enable new types of applications that facilitate smarter urban planning
• Analyzing cellular signaling data can help urban planners and city governing bodies better understand the city
Data Set
• Cellular signal data
• 5M mobile users
• 25M to 50M of data every minute; 30G of data daily
• ~400M cellular signal records daily
• More data to come from GPS and RFID for 4M vehicles

Solution Architecture
[Figure: solution architecture diagram. Labels: Data sources; Shp file, etc.; Data Ingestion; Distributed File System (HDFS); Streaming; Batch; Computation Engine; Resource Management (YARN); Orchestration; API Services; Relational Database w/ Spatial Extension; Visualization & Report; JavaScript; Flex; Cluster Management; Security Service; LDAP Service.]

Business Architecture
[Figure: business architecture diagram. Labels: Source Data; Data Collection; Pre-processing; Data Cleansing; Data Aggregation; Coordinates Formalization; Abnormal Detections; Base Model Computing; Data Quality Metrics; Application Model Computing; Residential Statistics; Working Region Statistics; Regional Commuting Analysis; Final Computing; The Big Data Platform; Application Views; GIS Server; GIS Database; Residential, Community Data.]

Architecture Decision Points
[Figure: architecture decision diagram. Labels: Mobile phone signaling (online/offline); Data collection; ELT; Streaming; Spark Streaming; Bigdata Platform; Data Fusion Standard; Data Store & Analysis; Spark/HDFS; Base Alg; App Alg; OD analysis; Index Computing; Data Quality computing; Home-office analysis; Home-Office DW/Market; Data Export; Sqoop; Job and resource Schedule; Oozie/YARN shell scripts; Java; Database (business, spatial); GIS spatial DB; Spatial DB (spatial extension); ArcGIS; GIS application presentation; Flex/JS; heat map ("thermodynamic diagram" in the original); User 1; User 2; User 3.]

System front-end architecture
[Figure: front-end architecture diagram. Labels: Geospatial Analysis Big Data Platform (HDFS); Sqoop; FTP.]

Agenda
• Background
• Architecture
• Algorithm model
• Scenarios demo

Items on Big Data Platform Design
• Planning and product selection
• Deployment and operation
• Application deployment
• Job scheduling
• Resource management
• Spark within BigInsights

IBM BigInsights for Apache Hadoop and Spark
[Figure: IBM ANALYTICS PLATFORM ("Built on Spark. Hybrid. Trusted."). Labels: Discovery & Exploration; Prescriptive Analytics; Predictive Analytics; Content Analytics; Business Intelligence; Data Mgmt; Hadoop & NoSQL; Content Mgmt; Data Warehouse; Information Integration & Governance; Spark Analytics Operating System; Machine Learning; On premises; On cloud. Caption: Data at rest & in-motion. Inside & outside the firewall. Structured & unstructured.]
• Analytical platform for persistent Big Data
− 100% open source core with IBM add-ons for analysts, data scientists, and admins
− On site or cloud
• Distinguishing characteristics
− Built-in analytics: enhances business knowledge
− Enterprise software integration: complements and extends existing capabilities
− Production-ready: speeds time-to-value
• IBM advantage
− Combination of software, hardware, services and research

Overview of BigInsights
IBM Open Platform: 100% open source platform compliant with ODPi
• Apache Hadoop ecosystem
• Apache Spark ecosystem
IBM-specific BigInsights features
• Big SQL (industry standard SQL)
• Text analytics
• BigSheets (spreadsheet-style tool)
• Big R (R support)
• IBM Streams, Cognos (limited use licenses)
Free Quick Start (non production):
• IBM Open Platform
• IBM added value features
• Community support

Big data platform job scheduling and resource mgmt
- Dedicated slave nodes for computing: almost all CPU and memory resources on each slave node are managed by YARN
- Capacity Scheduler with dedicated queues per business usage: production (batch and streaming processing, data movement) and development
- Elastic resource capacity for each queue, achieved by specifying a large maximum capacity, for high resource utilization
- Fine-grained YARN container allocation, achieved by specifying small vcore/memory increment sizes, to support big, medium and small jobs
- No CGroups-based CPU resource isolation, because it caused system stability issues in our IOP 4.1/RHEL 6.5 environments
Job scheduling with Oozie
Resource mgmt with YARN
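
The queue elasticity and fine-grained container points can be illustrated with a short Python sketch. It assumes the usual Capacity Scheduler behavior: each queue has a guaranteed share plus a larger maximum share, and container memory requests are rounded up to a multiple of an allocation increment. The function names, the 512 MB increment and the percentages are hypothetical, not values from this deployment.

```python
def normalize_request(requested_mb, increment_mb=512, maximum_mb=65536):
    """Round a container memory request up to the next multiple of the
    allocation increment, capped at the largest allowed container.
    (increment_mb and maximum_mb are illustrative assumptions.)"""
    if requested_mb <= 0:
        raise ValueError("requested_mb must be positive")
    steps = (requested_mb + increment_mb - 1) // increment_mb  # integer ceiling
    return min(steps * increment_mb, maximum_mb)


def queue_limits(cluster_mb, capacity_pct, max_capacity_pct):
    """Return (guaranteed_mb, elastic_ceiling_mb) for one queue.
    The queue is guaranteed its capacity and may borrow idle resources
    up to its maximum capacity."""
    if not 0 <= capacity_pct <= max_capacity_pct <= 100:
        raise ValueError("need 0 <= capacity <= max_capacity <= 100")
    return (cluster_mb * capacity_pct // 100,
            cluster_mb * max_capacity_pct // 100)


if __name__ == "__main__":
    # A 1300 MB request wastes far less memory with a small increment.
    print(normalize_request(1300, increment_mb=512))   # small increment
    print(normalize_request(1300, increment_mb=4096))  # coarse increment
    print(queue_limits(1_000_000, capacity_pct=60, max_capacity_pct=90))
```

With a coarse increment every small job occupies a large container, which is why a small increment suits a mix of big, medium and small jobs.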

Spark within BigInsights
• Deployment
− Ambari for installation and deployment
− Spark compute nodes are co-located with HDFS data nodes
− Cluster mode with YARN as the resource manager
• Runtime Configuration
− Bad configuration may cause jobs to under-perform or fail, or make the cluster unstable
− Methodology to configure the number of partitions, cores/memory per executor, and number of executors
• Monitoring and Tuning
− Spark Streaming stability (monitoring log, checkpoint)
− Handling massive numbers of small files
− Shuffle, partition, IO utilization, etc.
− Job execution, GC time, etc. via dashboard
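
The slides do not spell out the sizing methodology. As a rough illustration only, the Python sketch below applies a commonly cited rule of thumb: reserve one core and 1 GB per node for the OS and daemons, use about five cores per executor, leave one executor slot for the YARN application master, and set aside roughly 10% of each container for off-heap overhead. All parameter names and defaults are assumptions, not the author's method.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors: int           # total executors across the cluster
    cores_per_executor: int
    heap_mb: int             # JVM heap per executor
    partitions: int          # suggested partition count


def plan_executors(nodes, cores_per_node, mem_mb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10,
                   partitions_per_core=3):
    """Rule-of-thumb executor sizing; every default here is an assumption."""
    usable_cores = cores_per_node - 1        # one core for OS and daemons
    usable_mem_mb = mem_mb_per_node - 1024   # 1 GB for OS and daemons
    per_node = usable_cores // cores_per_executor
    if per_node < 1:
        raise ValueError("node too small for one executor")
    total = max(1, per_node * nodes - 1)     # one slot for the app master
    container_mb = usable_mem_mb // per_node
    heap_mb = int(container_mb / (1 + overhead_fraction))
    partitions = total * cores_per_executor * partitions_per_core
    return ExecutorPlan(total, cores_per_executor, heap_mb, partitions)


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes, 16 cores and 64 GB each.
    print(plan_executors(nodes=10, cores_per_node=16, mem_mb_per_node=65536))
```

The output is a starting point for tuning against the dashboard metrics (GC time, shuffle, IO), not a final configuration.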

Data Perspective Considerations
• Data process flow
• Data management
− Capacity sizing, layout in HDFS, lifecycle mgmt
• Data movement
− Between the big data platform and other systems
[Figure: Data Process Flow. Labels: RDBMS.]

DIST
Data lifecycle management
• 5 layers of data in the system
− L1: raw data ingested into HDFS
− L2: ELT data (pre-processing with streaming) in HDFS
− L3: result data from the algorithm model in HDFS
− L4: data for visualization (in HDFS or RDBMS)
− L5: archived data in external storage (compressed)
• Design the data layout, number of copies, retention, etc. in HDFS
• Jobs to prune outdated data in HDFS (Oozie)
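
A pruning job of this kind needs a rule for deciding which data is outdated. The Python sketch below shows one possible rule, assuming a hypothetical layout of one directory per layer and day (for example `/data/L1/dt=2016-05-01`) and made-up retention periods; neither the layout nor the periods come from the slides. It only selects paths; deleting them (for instance from a scheduled Oozie action) is left out.

```python
from datetime import date, timedelta

# Hypothetical retention policy in days, per data layer.
RETENTION_DAYS = {"L1": 30, "L2": 90, "L3": 365}


def expired_partitions(paths, today, retention=RETENTION_DAYS):
    """Return the partition paths older than their layer's retention.

    Paths are expected to end in '<layer>/dt=YYYY-MM-DD'; anything else,
    or any layer without a retention entry, is skipped rather than pruned.
    """
    expired = []
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) < 2:
            continue
        layer, dt_part = parts[-2], parts[-1]
        if layer not in retention or not dt_part.startswith("dt="):
            continue
        day = date.fromisoformat(dt_part[3:])
        if today - day > timedelta(days=retention[layer]):
            expired.append(path)
    return expired


if __name__ == "__main__":
    sample = [
        "/data/L1/dt=2016-01-01",
        "/data/L1/dt=2016-05-20",
        "/data/L3/dt=2016-01-01",
    ]
    print(expired_partitions(sample, today=date(2016, 6, 1)))
```

Skipping unknown layers is a deliberate choice: a pruning job should fail safe and keep data it does not recognize.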

Data movement
• Data ingestion into the big data platform
− Offline/online data ingestion: HDFS loading from external storage/FTP server, plus streaming
− Future: Kafka + Spark Streaming (more data sources, analytics path)
• Data export from the big data platform
− Near real-time heatmap generation
− Algorithm model results exported to RDBMS with Sqoop
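[Figure: data movement diagram. Labels: FTP push; HDFS load from FTP; Spark Streaming; Basic algorithm (stay point) in HDFS; ArcGIS Server (generates heatmap based on shp file); real-time display; retrospective query; pushed to the database every 30 minutes.]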
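
The near real-time heatmap path, with results pushed to the database every 30 minutes, can be pictured as a tumbling-window aggregation. The Python sketch below counts distinct users per cell in each 30-minute window. The record shape `(user_id, timestamp_seconds, cell_id)` and the distinct-user metric are assumptions for illustration; in the system described here this step would run in Spark Streaming rather than in plain Python.

```python
from collections import Counter

WINDOW_S = 30 * 60  # 30-minute tumbling window, in seconds


def window_start(ts, window_s=WINDOW_S):
    """Start timestamp of the window that contains ts."""
    return ts - ts % window_s


def heatmap_counts(records, window_s=WINDOW_S):
    """Count distinct users per (window_start, cell_id).

    records: iterable of (user_id, timestamp_seconds, cell_id) tuples
    (a hypothetical record shape).
    """
    seen = set()
    counts = Counter()
    for user, ts, cell in records:
        key = (window_start(ts, window_s), cell, user)
        if key not in seen:          # count each user once per window/cell
            seen.add(key)
            counts[(key[0], key[1])] += 1
    return counts


if __name__ == "__main__":
    demo = [("u1", 10, "A"), ("u1", 20, "A"), ("u2", 100, "A"), ("u1", 1900, "B")]
    for (start, cell), n in sorted(heatmap_counts(demo).items()):
        print(start, cell, n)
```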

Algorithm Model of Trajectory(OD) Analysis
[Figure: algorithm model diagram. Labels: Cellular signal data; ELT; Trajectory Sequence; Stay Points; Multi-Day Stable Point; OD Identification; Commute; OD Statistics; OD Index Stats; Commute Stats; People Flow Stats; Data Quality Index; statistics data export; distance labels ">1km"; GROUP1, GROUP2, GROUP3 (by different area type); Base Algorithm; Application Algorithm.]
• Algorithm accuracy validation
• Algorithm performance
• Algorithm stability
• Algorithm extensibility
• Algorithm configurability
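
The slides name the stages (trajectory sequence, stay points, OD identification) without giving the rules. The Python sketch below is a simplified illustration, not the author's algorithm: a stay point is assumed to be a run of consecutive records at the same cell lasting at least a minimum duration, and an OD pair is any two consecutive stay points at different cells. The record shape and the 30-minute threshold are assumptions.

```python
from typing import List, NamedTuple, Tuple


class StayPoint(NamedTuple):
    cell: str
    arrive: int   # first timestamp seen at the cell (seconds)
    leave: int    # last timestamp seen at the cell (seconds)


def stay_points(track: List[Tuple[int, str]], min_stay_s: int = 1800) -> List[StayPoint]:
    """Extract stay points from one user's (timestamp, cell_id) records."""
    track = sorted(track)              # build the time-ordered trajectory sequence
    stays = []
    i = 0
    while i < len(track):
        j = i
        # Extend the run while the user remains at the same cell.
        while j + 1 < len(track) and track[j + 1][1] == track[i][1]:
            j += 1
        arrive, leave = track[i][0], track[j][0]
        if leave - arrive >= min_stay_s:
            stays.append(StayPoint(track[i][1], arrive, leave))
        i = j + 1
    return stays


def od_pairs(stays: List[StayPoint]) -> List[Tuple[str, str]]:
    """Pair consecutive stay points at different cells as (origin, destination)."""
    return [(a.cell, b.cell) for a, b in zip(stays, stays[1:]) if a.cell != b.cell]


if __name__ == "__main__":
    track = [(0, "A"), (1200, "A"), (2400, "A"), (2700, "B"),
             (3000, "C"), (3600, "C"), (6000, "C")]
    stays = stay_points(track)
    print(stays)
    print(od_pairs(stays))
```

In this example the brief pass through cell B is treated as transit, so the only trip found is from A to C. A production version would also need to handle signal ping-pong between neighboring cells, which this sketch ignores.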

Geospatial Computation with Spark
• Requirements
− Spark must directly support SDE/Shp/GeoJSON files
− Most geospatial computation runs in the Spark cluster (point-area relationship, spherical distance, geospatial statistics, etc.)
− Performance challenge: 20M records per iteration of geospatial computation
• Solution Design
[Figure: solution design diagram. Labels: SDE; shp file; SDE interface; SHP interface; Spark-GIS lib; Geospatial API; Grid API; Grid definition; Spark Cluster; Basic Algorithm (geospatial computation); Application Algorithm (geospatial statistics).]
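
Two of the computations named in the requirements, spherical distance and the point-area relationship, can be sketched in plain Python. The code uses the standard haversine formula and ray-casting point-in-polygon test; it is not the Spark-GIS library from the diagram, whose API is not shown in the slides. Function names are illustrative, and the polygon test treats coordinates as planar, which is an approximation suited to small areas.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon ring?

    ring: list of (lon, lat) vertices; the closing edge is implied.
    Points exactly on an edge may land on either side.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):   # edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:          # crossing lies to the east of the point
                inside = not inside
    return inside


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    print(haversine_km(0.0, 0.0, 1.0, 0.0))
    print(point_in_polygon(0.5, 0.5, square))
    print(point_in_polygon(1.5, 0.5, square))
```

Both functions are pure and operate on one record at a time, so they could be applied inside a Spark map over the records of each iteration.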

Spatial Grid Design for Spark
[Figure: spatial grid design diagram. Labels: Spark Base Algorithm; Home-Office Model; Spark App Statistic; Statistic by Group; Statistic by Grid; Group-Grid Mapping; relation; "Convert formula expansion formula"; Relation Database; Grid Home-Office Statistic Table; Grid Statistic Table; Web GIS Application; User Define Query; Pre-define Query; Web GIS Front-end.]
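
The idea behind statistics by grid and the group-grid mapping can be shown in a few lines of Python: assign each point to a fixed grid cell, count per cell, then roll the cell counts up to larger groups through a lookup table. The degree-based square grid, the origin, the cell size and the group names are all assumptions for illustration; the actual grid definition is not given in the slides.

```python
import math
from collections import Counter


def grid_id(lon, lat, origin_lon, origin_lat, cell_deg):
    """Map a lon/lat point to a (row, col) cell of a regular degree grid."""
    col = math.floor((lon - origin_lon) / cell_deg)
    row = math.floor((lat - origin_lat) / cell_deg)
    return (row, col)


def count_by_grid(points, origin_lon, origin_lat, cell_deg):
    """Count points per grid cell."""
    return Counter(grid_id(lon, lat, origin_lon, origin_lat, cell_deg)
                   for lon, lat in points)


def count_by_group(grid_counts, grid_to_group):
    """Roll grid counts up to groups; cells without a mapping are ignored."""
    totals = Counter()
    for cell, n in grid_counts.items():
        group = grid_to_group.get(cell)
        if group is not None:
            totals[group] += n
    return totals


if __name__ == "__main__":
    points = [(121.41, 31.21), (121.42, 31.22), (121.55, 31.21)]
    grids = count_by_grid(points, origin_lon=121.0, origin_lat=31.0, cell_deg=0.1)
    mapping = {(2, 4): "district-1", (2, 5): "district-2"}  # hypothetical groups
    print(dict(grids))
    print(dict(count_by_group(grids, mapping)))
```

Precomputing per-cell statistics this way lets user-defined regions be answered by summing cells instead of repeating point-in-polygon tests over the raw records.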

Agenda
• Background – problem and data
• Architecture
• Algorithm model
• Scenarios demo

Scenario 1
[Figures: Population Heatmap; Commute OD Route]
Better Understanding of Key Metrics for Urban City Planning with Big Data (sampled data vs. all data; historical data vs. current data)
• Urban planners can plan the city more reasonably based on the current population distribution
• Traffic planning institutes can leverage this to optimize the traffic network
• City management units can better plan city service facilities and detect abnormal city events based on population flow
New Methodology & New Applications Using Big Data for Better Urban City Planning, Monitoring & Decision Making
• Quickly understand the current commute traffic volume and directions, and identify bottlenecks
• Optimize the traffic plan and scheduling during peak commute time
• More new applications can be built for planners and administrators, and new data services can be provided to city residents so they can participate in city management

Scenario 2
[Figures: Commute Time Cost; Office-Residence Imbalance]

Big Data Architecture Key Point – System
• Big data product selection
− ODPi (Open Data Platform)
• Big data component selection
− Data movement, data store, computing, SQL interface…
− …
• Deployment mode selection
− Local cluster
− IaaS
− Big data cloud
• Separate deployment env and data exploration env
Big Data Architecture Key Point – Data
• Data collection
• Data ELT
• Data pipeline
• Data lifecycle governance
• Data volume plan
• Data fusion
• Spatial data analysis and visualization
Big Data Architecture Key Point – Platform
• HA
• Security
• Monitoring & stability
• Scale-out and upgrade
• Resource management
• Job scheduling
• Multi-tenancy
Big Data Architecture Key Point – Algorithm & Model
• Business analysis
• Algorithm model design
• Model verification
• Model adjustment
• Model validity assurance

Road Ahead…
Deep analysis with more scenarios
• Traffic prediction
• Trip prediction
• Commute methods
• etc
More data sources for trajectory/traffic
• GPS for taxi, bus
• RFID on road
• Road monitoring data
• Subway stop check-in/out info
• Parking Lot
• Fusion with weather, social data
Data exploration environment to support data science & continuous engineering of new features
Leverage more Spark ML for traffic prediction
Cluster scale-out with more data and algorithms
Data ingestion with Kafka/Flume (message hub)
SQL on Hadoop
Graph computation for nearest path and road matching
[Figure: roadmap diagram. Labels: Current Deployment; Big Data Platform; Scale-out; New Scenarios w/ new data; Data Exploration Environment; Engineering and deployment; Data movement.]

Spark GeoSpatial Analysis for Other Scenarios
[Figure: layered platform diagram. Labels: Smart Transportation; Smart Logistics; Smart Tourism; others; Spatial-Temporal Trajectory Analysis for human; Spatial-Temporal Trajectory Analysis for vehicle; Trajectory Data Management; Trajectory Analysis Function; Common API (geo-spatial data pre-processing, geo-spatial geometry computing, surface mesh computing); Distributed geo-spatial calculating API (based on Spark); IBM's Big Data Analytics Platform.]

Big Data University and Data Science Workbench
− A community initiative led by IBM
− @yourpace, @yourplace online courses about data
− Developed by industry experts
− Free courses by the community with hands-on labs
− Certificate of completion and badges
− Looking for contributors!
Integrated Set of Tools, Languages and Execution Environments
Clean and Prepare Data
• OpenRefine
Experiment with and Analyze Data
• Jupyter Notebooks, RStudio, Seahorse
Connect to data processing engines:
• Spark, Hadoop, dashDB, BigSQL, BigR
http://DataScientistWorkbench.com
http://bigdatauniversity.com
