© 2016 IBM Corporation
High Performance Spatial-Temporal Trajectory
Analysis with Spark
YongHua (Henry) Zeng
zengyh@cn.ibm.com
Big Data & Analytics Solution Architect
Analytics Platform Services, IBM China Lab
Agenda
• Background
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Background Introduction
-- studying human trajectories with mobile signaling data
Problem
• Many varieties of data that traditional
planning methods cannot handle
• Much of the data has the characteristics
of big data (volume, velocity, variety)
• Cellular signaling data is a typical
example that can enable new types of
applications to facilitate smarter urban
planning
• Analyzing cellular signaling data can help
urban planners & city governing bodies to
better understand the city
Data Set
• Cellular signal data
• Mobile users 5M
• 25M to 50M data every minute; 30G of data daily
• ~ 400M cellular signal records daily
• More data coming with GPS, RFID for 4M vehicles
Solution Architecture
Data
sources
Distributed File System
Streaming
Resource Management
YARN
API Services
Orchestration
Batch
Relational
Database
w/ Spatial
Extension
Computation Engine
Visualization
& Report
Data
Ingestion
HDFS
LDAP
Service
Cluster
Management
Security
Service
javascript
Flex
Shp
file
etc
Data
Collection Data
Aggregation
Coordinates
Formalization
Abnormal
Detections
Final
Computing
Source
Data
Pre-processing Base Model Computing
Data Quality
Metrics
Application Model Computing
Residential
Statistics
Working
Region
Statistics
Regional
Commuting
Analysis
The Big Data Platform
Application Views
GIS Server
GIS
Database
Residential,
Community
Data
Data
Cleansing
Business Architecture
Architecture Decision Points
GIS spatial DB
Data Fusion Standard
Bigdata Platform
ELT
Data Store & Analysis
OD analysis
Index
Computing
Data Quality
computing
Home-office
analysis
Streaming
Home-Office	
DW/Market
Data Export
Heatmap
User 2
User 3
User1
GIS application presentation
Base Alg App Alg
Mobile signaling data
(online/offline)
Data collection
Database(business,
spatial)
Home-Office	
DW/Market
Job and resource scheduling
Flex/JS
Spatial DB
(spatial
extension)
ArcGIS
Spark
Streaming
Oozie/YARN, shell scripts
Spark/HDFS
Sqoop
Java
System front-end architecture
Geospatial
Analysis Big
Data Platform
(HDFS)
Sqoop
FTP
Agenda
• Background
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Items on Big Data Platform Design
✓ Planning and product selection
✓ Deployment and operation
✓ Application deployment
✓ Job scheduling
✓ Resource management
✓ Spark within BigInsights
IBM BigInsights for Apache Hadoop and Spark
Discovery
& Exploration
Prescriptive
Analytics
Predictive
Analytics
Content
Analytics
Business Intelligence
Data
Mgmt
Hadoop &
NoSQL
Content
Mgmt
Data
Warehouse
Information Integration & Governance
IBM ANALYTICS PLATFORM
Built on Spark. Hybrid. Trusted.
Spark Analytics Operating System
Machine Learning
On premises. On cloud.
Data at rest & in-motion. Inside & outside the firewall. Structured & unstructured.
▪ Analytical platform for persistent Big Data
– 100% open source core with IBM add-ons for analysts, data scientists, and admins
– On site or cloud
▪ Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends existing capabilities
– Production-ready . . . . Speeds time-to-value
▪ IBM advantage
– Combination of software, hardware, services and research
IBM Open Platform
100% open source platform compliant with ODPi
Apache Hadoop ecosystem
Apache Spark ecosystem
IBM-specific BigInsights features
Big SQL (industry standard SQL)
Text analytics
BigSheets (spreadsheet-style tool)
Big R (R support)
IBM Streams, Cognos (limited use licenses)
Overview of BigInsights
Free Quick Start (non production):
• IBM Open Platform
• IBM added value features
• Community support
Big data platform job scheduling and resource mgmt
- Dedicated slave nodes for computing - almost all CPU & memory
resources in each slave node are managed by YARN
- Capacity scheduler using dedicated queues for different business uses -
production (batch & streaming processing, data movement), development
- Elastic resource capacity for each queue by specifying a large maximum
capacity, to achieve high resource utilization
- Fine-grained YARN container allocation by specifying small increment
vcore/memory sizes, to support various workload types - big, medium and
small jobs
- No CGroups-based CPU resource isolation, because it caused system
stability issues in our IOP 4.1/RHEL 6.5 environments
Job scheduling with
Oozie
Resource mgmt with
YARN
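The queue layout described on this slide can be sketched as a set of scheduler properties. This is a minimal illustration, not the deployment's actual configuration: the queue names and percentages are assumptions, and it assumes the "small increment" is controlled through the YARN minimum-allocation settings. The property keys themselves are standard YARN CapacityScheduler settings.

```python
# Sketch only: queue names and percentages passed in are illustrative
# assumptions, not values from the deployment described in the slides.
SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.CapacityScheduler"
)


def capacity_scheduler_properties(queues, min_mb=512, min_vcores=1):
    """Build YARN property key/value pairs for dedicated, elastic queues.

    queues maps a queue name to its guaranteed share ("capacity") and its
    elastic ceiling ("maximum_capacity"), both as percentages.
    """
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    props = {
        "yarn.resourcemanager.scheduler.class": SCHEDULER_CLASS,
        "yarn.scheduler.capacity.root.queues": ",".join(queues),
        # Small minimum allocations give fine-grained container sizes.
        "yarn.scheduler.minimum-allocation-mb": str(min_mb),
        "yarn.scheduler.minimum-allocation-vcores": str(min_vcores),
    }
    for name, q in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(q["capacity"])
        # A large maximum lets a queue borrow idle capacity from the others.
        props[f"{prefix}.maximum-capacity"] = str(q["maximum_capacity"])
    return props


if __name__ == "__main__":
    example = {
        "production": {"capacity": 70, "maximum_capacity": 100},
        "development": {"capacity": 30, "maximum_capacity": 90},
    }
    for key, value in capacity_scheduler_properties(example).items():
        print(f"{key}={value}")
```

Keeping the guaranteed shares summing to 100 while setting each maximum well above the guarantee is what lets an idle queue's resources be used by a busy one.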
Spark within BigInsights
✓ Deployment
▪ Ambari for installation and deployment
▪ Spark compute nodes co-exist with HDFS data nodes
▪ Cluster mode with YARN for resource mgmt
✓ Runtime Configuration
▪ Bad configuration may cause jobs to under-perform or fail, or make the
cluster unstable
▪ Methodology to configure the # of partitions, cores/memory per executor, #
of executors
✓ Monitoring and Tuning
▪ Spark Streaming stability (monitoring log, checkpoint)
▪ Handling massive numbers of small files
▪ Shuffle, partition, IO utilization etc
▪ Job execution, GC time etc via dashboard
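The slide does not spell out its sizing methodology, so the following is only a sketch of one widely used rule of thumb for choosing the number of executors, cores and memory per executor, and partition count on YARN. The defaults (5 cores per executor, one core and 1 GB reserved per node, 10% memory overhead, 2 partitions per core) are assumptions for illustration.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10,
                   partitions_per_core=2):
    """Derive Spark settings from node resources with a common heuristic.

    All default values are illustrative assumptions, not figures from
    the slides.
    """
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Leave one executor slot for the YARN application master.
    total_executors = max(1, nodes * executors_per_node - 1)

    # Each YARN container holds the executor heap plus off-heap overhead.
    container_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = int(container_gb / (1 + overhead_fraction))

    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
        "spark.default.parallelism":
            total_executors * cores_per_executor * partitions_per_core,
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
    for key, value in size_executors(10, 16, 64).items():
        print(f"{key}={value}")
```

The resulting values are a starting point only; the monitoring items listed on the slide (shuffle, GC time, IO utilization) are what would drive further tuning.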
Data Perspective Considerations
▪ Data process flow
▪ Data management
− capacity sizing, layout in HDFS, lifecycle mgmt
▪ Data movement
− Between big data platform and other systems
RDBMS
Data Process Flow
• 5 Layers of Data in the System
• L1 raw data ingested into HDFS
• L2 ELT data (pre-processing with streaming) in HDFS
• L3 result data via algorithm model in HDFS
• L4 data for visualization (in HDFS or RDBMS)
• L5 archived data in external storage (compressed)
• Design the data layout, # of copies, retention etc in HDFS
• Jobs to prune out-dated data in HDFS (Oozie)
Data lifecycle management
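The pruning jobs mentioned above can be sketched as a retention check over date-partitioned directories. The retention periods per layer and the idea of one partition per day are assumptions for illustration; the slides do not state them. In practice the selected partitions would be deleted by a job scheduled through Oozie.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data layer, in days.
RETENTION_DAYS = {"L1": 30, "L2": 90, "L3": 365, "L4": 365}


def partitions_to_prune(layer, partition_dates, today):
    """Return the daily partitions of a layer that are past retention.

    A partition is kept while it is at most RETENTION_DAYS[layer] days old.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS[layer])
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    existing = [date(2016, 5, 1), date(2016, 5, 31), date(2016, 6, 29)]
    for d in partitions_to_prune("L1", existing, today=date(2016, 6, 30)):
        # Hypothetical path layout; a real job would remove this directory.
        print(f"/data/L1/dt={d.isoformat()}")
```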
• Data Ingestion into Big Data platform
• Offline/online data ingestion -- HDFS loading from external storage/FTP
server + streaming
• Future – Kafka + Spark Streaming (more data sources, analytics path)
• Data Export from Big Data platform
• Near real-time heatmap generation
• Algorithm model results exported to RDBMS -- Sqoop
Data movement
HDFS
load
from
FTP
Spark
streaming
ArcGIS Server (generates heatmap based on shp file)
Real-time display
Retrospective query
Pushed to the database every 30 minutes
Basic Algorithm -- stay point
in HDFS
FTP
push
Algorithm Model of Trajectory(OD) Analysis
Statistics export
Cellular
signal data
ELT
Trajectory
Sequence
Multi-Day
Stable Point
OD
Identification
Commute
Stay Points
OD Statistics
OD Index Stats
Commute Stats
People Flow
Stats
Data Quality
Index
Statistics export
>1 km >1 km
GROUP1 GROUP2 GROUP3
By different
area type
Algorithm Accuracy
Validation
Algorithm
Performance
Algorithm Stability
Algorithm
Extensibility
Algorithm
Configurable
Application
Algorithm
Base
Algorithm
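The flow from trajectory sequence to stay points to OD identification can be illustrated with a generic stay-point detection sketch. This is a common textbook technique, not necessarily the model used in the project: the 0.5 km radius and 30-minute duration are assumed thresholds, and the 1 km minimum trip length is one possible reading of the ">1 km" annotation in the diagram.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0088
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def stay_points(track, max_radius_km=0.5, min_duration_s=1800):
    """Find stay points in a track of (timestamp_s, lon, lat) sorted by time.

    A stay is a run of consecutive points within max_radius_km of the
    run's first point, lasting at least min_duration_s.
    Returns (start_ts, end_ts, mean_lon, mean_lat) tuples.
    """
    stays, i, n = [], 0, len(track)
    while i < n:
        j = i + 1
        while j < n and haversine_km(track[i][1], track[i][2],
                                     track[j][1], track[j][2]) <= max_radius_km:
            j += 1
        if track[j - 1][0] - track[i][0] >= min_duration_s:
            run = track[i:j]
            stays.append((run[0][0], run[-1][0],
                          sum(p[1] for p in run) / len(run),
                          sum(p[2] for p in run) / len(run)))
            i = j
        else:
            i += 1
    return stays


def od_pairs(stays, min_trip_km=1.0):
    """Pair consecutive stays that are more than min_trip_km apart."""
    return [(a, b) for a, b in zip(stays, stays[1:])
            if haversine_km(a[2], a[3], b[2], b[3]) > min_trip_km]


if __name__ == "__main__":
    # Hypothetical track: a stay, one point in transit, then another stay.
    track = [(0, 121.4000, 31.2000), (1200, 121.4005, 31.2001),
             (2400, 121.4002, 31.2000), (3000, 121.4500, 31.2100),
             (3600, 121.5000, 31.2200), (4800, 121.5003, 31.2201),
             (6000, 121.5001, 31.2200)]
    stays = stay_points(track)
    print(len(stays), "stays,", len(od_pairs(stays)), "OD pair(s)")
```

Per-user tracks are independent, so in a Spark job this logic would typically run once per user after grouping records by user ID and sorting by time.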
Geospatial Computation with Spark
▪ Requirements
− Direct Spark support for SDE/Shp/GeoJSON files
− Most geospatial computation runs in the Spark cluster (point-area relationship, spherical distance, geospatial
stats etc)
− Performance challenge – 20M records per iteration of geospatial computation
▪ Solution Design
SDE
shp
file
Spark
Cluster
Basic Algorithm (geospatial
computation)
Application Algorithm (geospatial
statistics)
SDE interface SHP interface
Geospatial
API
Grid API
Spark-GIS lib
Grid definition
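The point-area relationship named in the requirements can be sketched with a ray-casting test plus a bounding-box pre-filter. This is a generic illustration, not the Spark-GIS library's API: the function names and the hard-coded polygon are hypothetical, and real areas would come from the SDE or shp sources.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon ring?

    ring is a list of (lon, lat) vertices; the closing edge from the last
    vertex back to the first is implied.
    """
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that cross the horizontal ray going east of the point.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def bounding_box(ring):
    """Return (min_lon, min_lat, max_lon, max_lat) for a ring."""
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return min(lons), min(lats), max(lons), max(lats)


def assign_area(lon, lat, areas):
    """Return the id of the first area containing the point, else None.

    areas is a list of (area_id, ring, bbox); the cheap bbox check runs
    before the exact polygon test.
    """
    for area_id, ring, (min_lon, min_lat, max_lon, max_lat) in areas:
        if (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
                and point_in_polygon(lon, lat, ring)):
            return area_id
    return None


if __name__ == "__main__":
    # Hypothetical L-shaped area in arbitrary coordinates.
    ring = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
    areas = [("area-1", ring, bounding_box(ring))]
    print(assign_area(0.5, 3, areas), assign_area(2, 2, areas))
```

With tens of millions of records per iteration, a sketch like this would usually be applied per record inside a Spark transformation, with the comparatively small area list shared to every executor.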
Spatial Grid Design for Spark
Relationship
Home-Office
Model
Statistic by
Group
Group-Grid
Mapping
Statistic by
Grid
Grid Home-
Office Statistic
Table
Grid Statistic
Table
User Define
Query
Pre-define
Query
Convert formula expansion formula
Spark
Base Algorithm
Spark
App Statistic
Relation
Database
Web GIS
Application
Web GIS Front-end
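The grid-based statistics in this diagram can be sketched as follows: each coordinate is mapped to a grid cell, counts are aggregated per cell, and a group-grid mapping rolls cells up into groups. The cell size, grid origin, and group names are assumptions for illustration; the slides do not give the actual grid definition.

```python
import math
from collections import Counter


def grid_id(lon, lat, origin_lon, origin_lat, cell_deg):
    """Map a coordinate to the (row, col) of a regular lon/lat grid."""
    col = math.floor((lon - origin_lon) / cell_deg)
    row = math.floor((lat - origin_lat) / cell_deg)
    return row, col


def count_by_grid(points, origin_lon, origin_lat, cell_deg):
    """Count points per grid cell (the per-grid statistic)."""
    return Counter(grid_id(lon, lat, origin_lon, origin_lat, cell_deg)
                   for lon, lat in points)


def count_by_group(grid_counts, group_of_grid):
    """Roll per-grid counts up to groups via a group-grid mapping."""
    totals = Counter()
    for cell, n in grid_counts.items():
        group = group_of_grid.get(cell)
        if group is not None:
            totals[group] += n
    return totals


if __name__ == "__main__":
    # Hypothetical origin and a 0.01-degree cell size.
    points = [(121.005, 31.005), (121.006, 31.004), (121.015, 31.005)]
    per_grid = count_by_grid(points, 121.0, 31.0, 0.01)
    mapping = {(0, 0): "group-A", (0, 1): "group-A"}
    print(dict(per_grid), dict(count_by_group(per_grid, mapping)))
```

Pre-computing per-grid tables means a user-defined query over an arbitrary region only has to sum the cells it covers, rather than rescan the raw records.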
Agenda
• Background – problem and data
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Scenario --1
Population Heatmap Commute OD Route
Better Understanding of Key Metrics for Urban City
Planning with Big Data (Sampled data vs all data; History
data vs current data)
➢ Urban planners can plan the city more reasonably
based on the current population distribution
➢ Traffic planning institutes can leverage this to optimize the traffic
network
➢ City mgmt. units can better plan city service facilities & detect
abnormal city events based on population flow
New Methodology & New Applications Using Big Data for
Better Urban City Planning, Monitoring & Decision
Making
❖ Quickly understand the current commute traffic
volume and directions, and identify bottlenecks
❖ Optimize the traffic plan and scheduling during commute
peak time
❖ More new applications can be built for planners and
administrators, and new data services can be provided to
city residents so they can participate in city management
Scenario --2
Commute Time Cost
Office-Residence Imbalance
Big Data Architecture Key Point – System
❖ Big data product selection
• ODPi (Open Data Platform initiative)
❖ Big data component selection
• Data movement, data store, computing, SQL interface…
• …
❖ Deployment mode selection
• Local cluster
• IaaS
• Big data cloud
❖ Separate deployment env and data exploration env
Big Data Architecture Key Point – Data
➢ Data collection
➢ Data ELT
➢ Data pipeline
➢ Data lifecycle governance
➢ Data volume plan
➢ Data fusion
➢ Spatial data analysis and visualization
Big Data Architecture Key Point – Platform
➢ HA
➢ Security
➢ Monitoring & stability
➢ Scale-out and upgrade
➢ Resource management
➢ Job scheduling
➢ Multi-tenancy
Big Data Architecture Key Point –
Algorithm & Model
➢ Business analysis
➢ Alg model design
➢ Model verification
➢ Model adjustment
➢ Model validity assurance
Road Ahead…
Deep analysis with more scenarios
• Traffic prediction
• Trip prediction
• Commute methods
• etc
More data sources for trajectory/traffic
• GPS for taxi, bus
• RFID on road
• Road monitoring data
• Subway stop check-in/out info
• Parking Lot
• Fusion with weather, social data
Data exploration environment to support data science &
continuous engineering of new features
Leverage more SparkML for traffic prediction
Cluster scale-out with more data and algorithms
Data ingestion with Kafka/Flume (message hub)
SQL on Hadoop
Graph computation for shortest path and road matching
Current
Deployment
Big Data Platform
Scale-out
Scale-out
New
Scenarios
w/ new data
Data Exploration
Environment
Engineering
and
deployment
Data movement
Spark GeoSpatial Analysis for Other Scenarios
Spatial-Temporal Trajectory
Analysis for human
Trajectory Data Management
Trajectory Analysis Function
Spatial-Temporal Trajectory
Analysis for vehicle
Common
API
Geospatial data pre-processing, geospatial geometry computing, surface mesh
computing
Distributed geospatial computation API (based on Spark)
IBM’s Big Data Analytics Platform
Smart
Transportation
Smart
Logistics
Smart
Tourism
others
Big Data University and Data Science Workbench
− A community initiative led by IBM
− @yourpace, @yourplace online courses about data
− Developed by industry experts
− Free courses by the community with hands-on labs
− Certificate of completion and badges
− Looking for contributors!
Integrated Set of Tools, Languages and Execution Environments
Clean and Prepare Data
• OpenRefine
Experiment with and Analyze Data
• Jupyter Notebooks, R Studio, SeaHorse
Connect to data processing engines:
• Spark, Hadoop, dashDB, BigSQL, BigR
http://DataScientistWorkbench.com
http://bigdatauniversity.com

More Related Content

What's hot

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
BATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery MLDan Sullivan, Ph.D.
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityElasticsearch
 
Informatica Data Quality Training
Informatica Data Quality TrainingInformatica Data Quality Training
Informatica Data Quality Trainingtekslate1
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
New OpManager v12
New OpManager v12New OpManager v12
New OpManager v12Inuit AB
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan Misirli
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Dataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDatadog
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Search and analyze your data with elasticsearch
Search and analyze your data with elasticsearchSearch and analyze your data with elasticsearch
Search and analyze your data with elasticsearchAnton Udovychenko
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowAnant Corporation
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLMárton Kodok
 

What's hot (20)

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
BATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data Mesh
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
 
Informatica Data Quality Training
Informatica Data Quality TrainingInformatica Data Quality Training
Informatica Data Quality Training
 
Prometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome MonitoringPrometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome Monitoring
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
New OpManager v12
New OpManager v12New OpManager v12
New OpManager v12
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusion
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Dataday Texas 2016 - Datadog
Dataday Texas 2016 - DatadogDataday Texas 2016 - Datadog
Dataday Texas 2016 - Datadog
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Search and analyze your data with elasticsearch
Search and analyze your data with elasticsearchSearch and analyze your data with elasticsearch
Search and analyze your data with elasticsearch
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 

Viewers also liked

SexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial DataSexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial DataCharalampos (Babis) Nikolaou
 
Neo4j Spatial - FooCafe September 2015
Neo4j Spatial - FooCafe September 2015Neo4j Spatial - FooCafe September 2015
Neo4j Spatial - FooCafe September 2015Craig Taverner
 
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integrationIndexing 3-dimensional trajectories: Apache Spark and Cassandra integration
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integrationCesare Cugnasco
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 
Using python to analyze spatial data
Using python to analyze spatial dataUsing python to analyze spatial data
Using python to analyze spatial dataKudos S.A.S
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingZbigniew Jerzak
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingZbigniew Jerzak
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Multilevel aggregation for Hadoop/MapReduce
Multilevel aggregation for Hadoop/MapReduceMultilevel aggregation for Hadoop/MapReduce
Multilevel aggregation for Hadoop/MapReduceTsuyoshi OZAWA
 
HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)Eron Wright
 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaSpark Summit
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkRob Emanuele
 
Monitoring temporary populations through cellular core network data
Monitoring temporary populations through cellular core network dataMonitoring temporary populations through cellular core network data
Monitoring temporary populations through cellular core network dataBeniamino Murgante
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
High Scalability Network Monitoring for Communications Service Providers
High Scalability Network Monitoring for Communications Service ProvidersHigh Scalability Network Monitoring for Communications Service Providers
High Scalability Network Monitoring for Communications Service ProvidersCA Technologies
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsNavina Ramesh
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Kerstin Forsberg
 
Modern Applications Demand Network Analytics
Modern Applications Demand Network AnalyticsModern Applications Demand Network Analytics
Modern Applications Demand Network AnalyticsPluribus Networks
 
Is your MQTT broker IoT ready?
Is your MQTT broker IoT ready?Is your MQTT broker IoT ready?
Is your MQTT broker IoT ready?Eurotech
 

Viewers also liked (20)

SexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial DataSexTant: Visualizing Time-Evolving Linked Geospatial Data
SexTant: Visualizing Time-Evolving Linked Geospatial Data
 
Neo4j Spatial - FooCafe September 2015
Neo4j Spatial - FooCafe September 2015Neo4j Spatial - FooCafe September 2015
Neo4j Spatial - FooCafe September 2015
 
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integrationIndexing 3-dimensional trajectories: Apache Spark and Cassandra integration
Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Using python to analyze spatial data
Using python to analyze spatial dataUsing python to analyze spatial data
Using python to analyze spatial data
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream Processing
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream Processing
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Multilevel aggregation for Hadoop/MapReduce
Multilevel aggregation for Hadoop/MapReduceMultilevel aggregation for Hadoop/MapReduce
Multilevel aggregation for Hadoop/MapReduce
 
HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)
 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
 
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and SparkFOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
FOSDEM 2015: Distributed Tile Processing with GeoTrellis and Spark
 
Monitoring temporary populations through cellular core network data
Monitoring temporary populations through cellular core network dataMonitoring temporary populations through cellular core network data
Monitoring temporary populations through cellular core network data
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 
High Scalability Network Monitoring for Communications Service Providers
High Scalability Network Monitoring for Communications Service ProvidersHigh Scalability Network Monitoring for Communications Service Providers
High Scalability Network Monitoring for Communications Service Providers
 
Spatial Data Model 2
Spatial Data Model 2Spatial Data Model 2
Spatial Data Model 2
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015Linked data presentation for who umc 21 jan 2015
Linked data presentation for who umc 21 jan 2015
 
Modern Applications Demand Network Analytics
Modern Applications Demand Network AnalyticsModern Applications Demand Network Analytics
Modern Applications Demand Network Analytics
 
Is your MQTT broker IoT ready?
Is your MQTT broker IoT ready?Is your MQTT broker IoT ready?
Is your MQTT broker IoT ready?
 

Similar to High Performance Spatial-Temporal Trajectory Analysis with Spark

Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016Anand Haridass
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
High Performance Spatial-Temporal Trajectory Analysis with Spark

  • 1. High Performance Spatial-Temporal Trajectory Analysis with Spark. YongHua (Henry) Zeng (zengyh@cn.ibm.com), Big Data & Analytics Solution Architect, Analytics Platform Services, IBM China Lab. © 2016 IBM Corporation
  • 2. Agenda: Background; Architecture; Technical Design; Big Data Platform design; Data governance design; Algorithm model; Spark spatial computing; Scenarios demo; Conclusion and Next step
  • 3. Background Introduction: studying human trajectories with mobile signal data
      Problem:
      - Urban planning now faces a variety of data that traditional planning cannot handle
      - Much of this data has big data characteristics (volume, velocity, variety)
      - Cellular signaling data is a typical example; it can enable new types of applications that support smarter urban planning
      - Analyzing cellular signal data can help urban planners and city governing bodies better understand the city
      Data set:
      - Cellular signal data
      - 5M mobile users
      - 25M to 50M of data every minute; 30G of data daily
      - About 400M cellular signal records daily
      - More data coming from GPS and RFID for 4M vehicles
  • 4. Solution Architecture (diagram labels): Data sources; Distributed File System; Streaming; Resource Management (YARN); API Services; Orchestration; Batch; Relational Database w/ Spatial Extension; Computation Engine; Visualization & Report; Data Ingestion; HDFS; LDAP Service; Cluster Management; Security Service; JavaScript; Flex; Shp file, etc.
  • 5. Business Architecture (diagram labels): Data Collection; Data Aggregation; Coordinates Formalization; Abnormal Detections; Final Computing; Source Data; Pre-processing; Base Model Computing; Data Quality Metrics; Application Model Computing; Residential Statistics; Working Region Statistics; Regional Commuting Analysis; The Big Data Platform; Application Views; GIS Server; GIS Database; Residential, Community Data; Data Cleansing
  • 6. Architecture Decision Points (diagram labels): GIS spatial DB; Data Fusion Standard; Big Data Platform; ELT; Data Store & Analysis; OD analysis; Index Computing; Data Quality computing; Home-office analysis; Streaming; Home-Office DW/Market; Data Export; thermodynamic diagram (heatmap); User 1; User 2; User 3; GIS application presentation; Base Alg; App Alg; Mobile signaling data (online/offline); Data collection; Database (business, spatial); Job and resource scheduling; Flex/JS; Spatial DB (spatial extension); ArcGIS; Spark Streaming; Oozie/YARN; Shell scripts; Spark/HDFS; Sqoop; Java
  • 7. System front-end architecture (diagram labels): Geospatial Analysis; Big Data Platform (HDFS); Sqoop; FTP
  • 8. Agenda: Background; Architecture; Technical Design; Big Data Platform design; Data governance design; Algorithm model; Spark spatial computing; Scenarios demo; Conclusion and Next step
  • 9. Items on Big Data Platform Design: planning and product selection; deployment and operation; application deployment; job scheduling; resource management; Spark within BigInsights
  • 10. IBM BigInsights for Apache Hadoop and Spark
      IBM Analytics Platform (diagram labels): Discovery & Exploration; Prescriptive Analytics; Predictive Analytics; Content Analytics; Business Intelligence; Data Mgmt; Hadoop & NoSQL; Content Mgmt; Data Warehouse; Information Integration & Governance; Spark Analytics Operating System; Machine Learning. Built on Spark. Hybrid. Trusted. On premises and on cloud. Data at rest and in motion. Inside and outside the firewall. Structured and unstructured.
      Analytical platform for persistent Big Data:
      - 100% open source core with IBM add-ons for analysts, data scientists, and admins
      - On site or cloud
      Distinguishing characteristics:
      - Built-in analytics: enhances business knowledge
      - Enterprise software integration: complements and extends existing capabilities
      - Production-ready: speeds time-to-value
      IBM advantage:
      - Combination of software, hardware, services and research
  • 11. Overview of BigInsights
      IBM Open Platform: 100% open source platform compliant with ODPi (Apache Hadoop ecosystem, Apache Spark ecosystem)
      IBM-specific BigInsights features:
      - Big SQL (industry standard SQL)
      - Text analytics
      - BigSheets (spreadsheet-style tool)
      - Big R (R support)
      - IBM Streams, Cognos (limited use licenses)
      Free Quick Start (non production): IBM Open Platform; IBM added value features; community support
  • 12. Big data platform job scheduling and resource mgmt: job scheduling with Oozie, resource mgmt with YARN
      - Dedicated slave nodes for computing: almost all CPU and memory resources on each slave node are managed by YARN
      - Capacity scheduler with dedicated queues for different business uses: production (batch and streaming processing, data movement) and development
      - Elastic resource capacity for each queue, achieved by specifying a large maximum capacity, for high resource utilization
      - Fine-grained YARN container allocation, achieved by specifying small vcore/memory increments, to support various workload types (big, medium and small jobs)
      - No CGroups-based CPU resource isolation, because it caused system stability issues in our IOP 4.1/RHEL 6.5 environments
  • 13. Spark within BigInsights
      Deployment:
      - Ambari for installation and deployment
      - Spark (compute nodes) co-exists with the HDFS data nodes
      - Cluster mode with YARN for resource mgmt
      Runtime configuration:
      - Bad configuration may cause jobs to under-perform or fail, or make the cluster unstable
      - Methodology to configure the number of partitions, the cores/memory per executor, and the number of executors
      Monitoring and tuning:
      - Spark Streaming stability (log monitoring, checkpoints)
      - Handling massive numbers of small files
      - Shuffle, partitioning, IO utilization, etc.
      - Job execution, GC time, etc. via dashboard
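The slide names a sizing methodology for partitions and executors but does not spell it out. As a rough illustration only, the sketch below applies a commonly quoted rule of thumb for Spark on YARN: reserve a little CPU and memory per node for the OS and daemons, use about five cores per executor, leave room for the YARN application master, and plan two to three tasks per core. The function name `size_executors` and every default value are assumptions, not the authors' method.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10,
                   tasks_per_core=3):
    """Rule-of-thumb Spark-on-YARN sizing (illustrative assumptions only)."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Leave one container's worth of resources for the YARN application master.
    num_executors = nodes * executors_per_node - 1
    if num_executors < 1:
        raise ValueError("cluster too small for this layout")

    # Memory per YARN container, minus the share set aside for off-heap overhead.
    container_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    executor_memory_gb = int(container_gb / (1 + overhead_fraction))

    # Aim for a few tasks per core so that slow tasks do not idle the cluster.
    partitions = num_executors * cores_per_executor * tasks_per_core

    return {
        "spark.executor.instances": num_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.default.parallelism": partitions,
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
    for key, value in size_executors(10, 16, 64).items():
        print(key, value)
```

The returned keys are standard Spark configuration properties; the values are only a starting point and would still need the monitoring and tuning the slide describes.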
  • 14. Data Perspective Considerations
      - Data process flow
      - Data management: capacity sizing, layout in HDFS, lifecycle mgmt
      - Data movement: between the big data platform and other systems
      Diagram labels: RDBMS; Data Process Flow
  • 15. Data lifecycle management
      Five layers of data in the system:
      - L1: raw data ingested into HDFS
      - L2: ELT data (pre-processing with streaming) in HDFS
      - L3: result data from the algorithm model in HDFS
      - L4: data for visualization (in HDFS or RDBMS)
      - L5: archived data in external storage (compressed)
      Design the data layout, number of copies, retention, etc. in HDFS
      Jobs to prune out-dated data in HDFS (Oozie)
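The pruning step above can be sketched as a small selection function that a scheduled job might run before deleting anything. The retention periods, the layer keys and the idea of one partition per layer and day are all hypothetical; the slides do not give the actual layout or retention values.

```python
from datetime import date, timedelta

# Hypothetical retention per data layer, in days (not from the slides).
RETENTION_DAYS = {"l1": 7, "l2": 30, "l3": 90}


def expired_partitions(partitions, today, retention=RETENTION_DAYS):
    """Return the (layer, day) partitions that are older than their retention.

    partitions: iterable of (layer, day) tuples, where day is a datetime.date.
    Layers without a retention rule are never selected for pruning.
    """
    expired = []
    for layer, day in partitions:
        keep_days = retention.get(layer)
        if keep_days is not None and day < today - timedelta(days=keep_days):
            expired.append((layer, day))
    return expired


if __name__ == "__main__":
    sample = [
        ("l1", date(2016, 6, 20)),
        ("l1", date(2016, 6, 25)),
        ("l2", date(2016, 6, 20)),
        ("l4", date(2015, 1, 1)),
    ]
    print(expired_partitions(sample, today=date(2016, 6, 30)))
```

Keeping the selection logic separate from the delete call makes it easy to dry-run the job and review what would be removed.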
  • 16. Data movement
      Data ingestion into the big data platform:
      - Offline/online data ingestion: HDFS loading from external storage or an FTP server, plus streaming
      - Future: Kafka + Spark Streaming (more data sources, analytics path)
      Data export from the big data platform:
      - Near real-time heatmap generation
      - Algorithm model results exported to RDBMS with Sqoop
      Diagram labels: HDFS load from FTP; FTP push; Spark Streaming; ArcGIS Server (generates heatmap based on shp file); real-time display; retrospective query; pushed to the database every 30 minutes; Basic Algorithm (stay point) in HDFS
  • 17. Algorithm Model of Trajectory (OD) Analysis
      Base Algorithm (diagram labels): Cellular signal data; ELT; Trajectory Sequence; Stay Points; Multi-Day Stable Point; OD Identification; Commute; >1km; GROUP1; GROUP2; GROUP3
      Application Algorithm (diagram labels): OD Statistics; OD Index Stats; Commute Stats; People Flow Stats; Data Quality Index; by different area type; statistics data export
      Algorithm qualities: accuracy validation; performance; stability; extensibility; configurability
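To make the stay point and OD identification steps more concrete, here is a minimal single-user sketch. It is not the authors' algorithm: the 0.5 km radius and 30 minute dwell time are assumed thresholds, and treating the slide's ">1km" label as the minimum origin-to-destination distance is also an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stay_points(records, radius_km=0.5, min_minutes=30):
    """Detect stay points in one user's time-sorted (minute, lon, lat) records.

    A stay is a run of consecutive records within radius_km of the run's
    first record that lasts at least min_minutes.
    Returns (start_minute, end_minute, mean_lon, mean_lat) tuples.
    """
    stays = []
    i, n = 0, len(records)
    while i < n:
        j = i + 1
        while j < n and haversine_km(records[i][1], records[i][2],
                                     records[j][1], records[j][2]) <= radius_km:
            j += 1
        if records[j - 1][0] - records[i][0] >= min_minutes:
            run = records[i:j]
            mean_lon = sum(r[1] for r in run) / len(run)
            mean_lat = sum(r[2] for r in run) / len(run)
            stays.append((records[i][0], records[j - 1][0], mean_lon, mean_lat))
            i = j  # continue after the stay
        else:
            i += 1  # record was part of a movement, not a stay
    return stays


def od_pairs(stays, min_km=1.0):
    """Pair consecutive stay points that are more than min_km apart."""
    return [(a, b) for a, b in zip(stays, stays[1:])
            if haversine_km(a[2], a[3], b[2], b[3]) > min_km]
```

In a Spark job, records would typically be grouped by user and sorted by time first, with these functions applied per user; that wiring is omitted here.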
  • 18. Geospatial Computation with Spark
      Requirements:
      - Direct Spark support for SDE/Shp/GeoJSON files
      - Most of the geospatial computation runs in the Spark cluster (point-area relationship, spherical distance, geospatial stats, etc.)
      - Performance challenge: 20M records per iteration of geospatial computation
      Solution design (diagram labels): SDE; shp file; Spark Cluster; Basic Algorithm (geospatial computation); Application Algorithm (geospatial statistics); SDE interface; SHP interface; Geospatial API; Grid API; Spark-GIS lib; Grid definition
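The two geospatial primitives named in the requirements, spherical distance and the point-area relationship, can be sketched in plain Python as below. This is not the Spark-GIS library from the slide, whose API is not shown; it uses the haversine formula and a ray-casting test on a simple polygon without holes, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)


def haversine_km(lon1, lat1, lon2, lat2):
    """Spherical (great-circle) distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon given by ring?

    ring is a list of (lon, lat) vertices; the last vertex is implicitly
    connected back to the first. Points exactly on an edge are not handled
    specially.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside
```

Because both functions are pure and depend only on their arguments, they can be called from a per-record map over a distributed dataset, with the polygon set shared with the workers.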
  • 19. Spatial Grid Design for Spark (diagram labels): Relation; Home-Office Model; Statistic by Group; Group-Grid Mapping; Statistic by Grid; Grid Home-Office Statistic Table; Grid Statistic Table; User-defined Query; Pre-defined Query; Convert formula; expansion formula; Spark Base Algorithm; Spark App Statistic; Relational Database; Web GIS Application; Web GIS Front-end
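The slide does not define the grid itself, so the following is only a sketch of the general idea behind "Statistic by Grid": map each point to a cell of a regular longitude/latitude grid and aggregate per cell. The grid origin, the cell size in degrees and the function names are assumptions.

```python
import math
from collections import Counter


def grid_cell(lon, lat, origin_lon, origin_lat, cell_deg):
    """Map a point to the (row, col) index of a regular lon/lat grid."""
    col = math.floor((lon - origin_lon) / cell_deg)
    row = math.floor((lat - origin_lat) / cell_deg)
    return (row, col)


def count_by_grid(points, origin_lon, origin_lat, cell_deg):
    """Count how many (lon, lat) points fall into each grid cell."""
    return Counter(
        grid_cell(lon, lat, origin_lon, origin_lat, cell_deg)
        for lon, lat in points
    )


if __name__ == "__main__":
    # Hypothetical grid with origin (120.0, 30.0) and 0.5 degree cells.
    pts = [(120.1, 30.1), (120.2, 30.3), (120.6, 30.1)]
    print(count_by_grid(pts, 120.0, 30.0, 0.5))
```

Pre-computing the cell index turns later per-area statistics into a cheap group-by on an integer key, which is one plausible reason to pair a grid with pre-defined queries.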
  • 20. Agenda: Background (problem and data); Architecture; Technical Design; Big Data Platform design; Data governance design; Algorithm model; Spark spatial computing; Scenarios demo; Conclusion and Next step
  • 21. Scenario 1: Population Heatmap; Commute OD Route • Better Understanding of Key Metrics for Urban City Planning with Big Data (sampled data vs. all data; historical data vs. current data) ➢ Urban planners can plan the city more reasonably based on the current population distribution ➢ Traffic planning institutes can use this to optimize the traffic network ➢ City management units can better plan city service facilities and detect abnormal city events based on population flow • New Methodology & New Applications Using Big Data for Better Urban City Planning, Monitoring & Decision Making ❖ Quickly understand the current commute traffic volume and directions, and identify bottlenecks ❖ Optimize traffic planning and scheduling during commute peak time ❖ More new applications can be built for planners and administrators, and new data services can be provided to city residents so they can participate in city management
  • 22. Scenario 2: Commute Time Cost; Office-Residence Imbalance
  • 23. Big Data Architecture Key Point: System ❖ Big data product selection • ODPi (Open Data Platform) ❖ Big data component selection • Data movement, data store, computing, SQL interface… • … ❖ Deployment mode selection • Local cluster • IaaS • Big data cloud ❖ Separate deployment environment and data exploration environment Big Data Architecture Key Point: Data ➢ Data collection ➢ Data ELT ➢ Data pipeline ➢ Data lifecycle governance ➢ Data volume plan ➢ Data fusion ➢ Spatial data analysis and visualization Big Data Architecture Key Point: Platform ➢ HA ➢ Security ➢ Monitoring & stability ➢ Scale-out and upgrade ➢ Resource management ➢ Job scheduling ➢ Multi-tenancy Big Data Architecture Key Point: Algorithm & Model ➢ Business analysis ➢ Algorithm model design ➢ Model verification ➢ Model adjustment ➢ Model validity assurance
  • 24. Road Ahead… Deep analysis with more scenarios • Traffic prediction • Trip prediction • Commute methods • etc. More data sources for trajectory/traffic • GPS for taxi, bus • RFID on road • Road monitoring data • Subway stop check-in/out info • Parking lots • Fusion with weather, social data Data exploration environment to support data science & continuous engineering of new features; Leverage more Spark ML for traffic prediction; Cluster scale-out with more data and algorithms; Data ingestion with Kafka/Flume (message hub); SQL on Hadoop; Graph computation for nearest path and road matcher Diagram labels: Current Deployment; Big Data Platform; Scale-out; New Scenarios w/ new data; Data Exploration Environment; Engineering and deployment; Data movement
  • 25. Spark GeoSpatial Analysis for Other Scenarios • Diagram labels: Smart Transportation; Smart Logistics; Smart Tourism; others; Spatial-Temporal Trajectory Analysis for human; Spatial-Temporal Trajectory Analysis for vehicle; Trajectory Data Management; Trajectory Analysis Function; Common API (geospatial data pre-processing, geospatial geometry computing, surface mesh computing); Distributed geospatial calculation API (based on Spark); IBM's Big Data Analytics Platform
  • 26. Big Data University and Data Science Workbench − A community initiative led by IBM − @yourpace, @yourplace online courses about data − Developed by industry experts − Free courses by the community with hands-on labs − Certificate of completion and badges − Looking for contributors! Integrated Set of Tools, Languages and Execution Environments • Clean and prepare data: OpenRefine • Experiment with and analyze data: Jupyter Notebooks, RStudio, Seahorse • Connect to data processing engines: Spark, Hadoop, dashDB, BigSQL, BigR http://DataScientistWorkbench.com http://bigdatauniversity.com