© 2016 IBM Corporation
High Performance Spatial-Temporal Trajectory
Analysis with Spark
YongHua (Henry) Zeng
zengyh@cn.ibm.com
Big Data & Analytics Solution Architect
Analytics Platform Services,IBM China Lab
Agenda
• Background
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Background Introduction
-- studying human trajectories with mobile signaling data
Problem
• Urban planning now faces varieties of data that traditional planning methods cannot handle
• Much of this data has big-data characteristics (volume, velocity, variety)
• Cellular signaling data is a typical example; it can enable new types of applications that facilitate smarter urban planning
• Analyzing cellular signaling data can help urban planners and city governing bodies better understand the city
Data Set
• Cellular signal data
• 5M mobile users
• 25M to 50M of data every minute; 30G of data daily
• ~400M cellular signal records daily
• More data to come from GPS and RFID for 4M vehicles
Solution Architecture
[Architecture diagram; component labels as extracted]
• Data sources
• Data Ingestion
• Distributed File System
• HDFS
• Streaming
• Batch
• Computation Engine
• Resource Management
• YARN
• Orchestration
• API Services
• Relational Database w/ Spatial Extension
• Visualization & Report
• LDAP Service
• Cluster Management
• Security Service
• JavaScript, Flex
• Shp file, etc.
Business Architecture
[Diagram; labels as extracted]
• Source Data
• Data Collection
• Data Cleansing
• Data Aggregation
• Coordinates Formalization
• Abnormal Detections
• Final Computing
• Pre-processing
• Base Model Computing
• Data Quality Metrics
• Application Model Computing
• Residential Statistics
• Working Region Statistics
• Regional Commuting Analysis
• The Big Data Platform
• Application Views
• GIS Server
• GIS Database
• Residential, Community Data
Architecture Decision Points
[Diagram; labels as extracted]
• GIS spatial DB
• Data Fusion Standard
• Big data platform
• ELT
• Data Store & Analysis
• OD analysis
• Index computing
• Data quality computing
• Home-office analysis
• Streaming
• Home-Office DW/Market
• Data Export
• Heatmap
• User 1, User 2, User 3
• GIS application presentation
• Base Alg, App Alg
• Mobile signaling data (online/offline)
• Data collection
• Database (business, spatial)
• Job and resource scheduling
• Flex/JS
• Spatial DB (spatial extension)
• ArcGIS
• Spark Streaming
• Oozie/YARN, Shell scripts
• Spark/HDFS
• Sqoop
• Java
System front-end architecture
Geospatial Analysis Big Data Platform (HDFS)
Sqoop
FTP
Agenda
• Background
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Items on Big Data Platform Design
✓ Planning and product selection
✓ Deployment and operation
✓ Application deployment
✓ Job scheduling
✓ Resource management
✓ Spark within BigInsights
IBM BigInsights for Apache Hadoop and Spark
[Diagram: IBM Analytics Platform. Built on Spark. Hybrid. Trusted.]
• Discovery & Exploration, Prescriptive Analytics, Predictive Analytics, Content Analytics, Business Intelligence
• Data Mgmt, Hadoop & NoSQL, Content Mgmt, Data Warehouse
• Information Integration & Governance
• Spark Analytics Operating System, Machine Learning
• On premises, on cloud. Data at rest & in motion. Inside & outside the firewall. Structured & unstructured.
• Analytical platform for persistent Big Data
  – 100% open source core with IBM add-ons for analysts, data scientists, and admins
  – On site or cloud
• Distinguishing characteristics
  – Built-in analytics: enhances business knowledge
  – Enterprise software integration: complements and extends existing capabilities
  – Production-ready: speeds time-to-value
• IBM advantage
  – Combination of software, hardware, services and research
Overview of BigInsights
• IBM Open Platform: 100% open source platform compliant with ODPi
  – Apache Hadoop ecosystem
  – Apache Spark ecosystem
• IBM-specific BigInsights features
  – Big SQL (industry standard SQL)
  – Text analytics
  – BigSheets (spreadsheet-style tool)
  – Big R (R support)
  – IBM Streams, Cognos (limited use licenses)
• Free Quick Start (non production)
  – IBM Open Platform
  – IBM added value features
  – Community support
Big data platform job scheduling and resource mgmt
Job scheduling with Oozie; resource mgmt with YARN
- Dedicated slave nodes for computing: almost all CPU & memory resources on each slave node are managed by YARN
- Capacity scheduler with dedicated queues for different business uses: production (batch & streaming processing, data movement) and development
- Elastic resource capacity for each queue, set by specifying a large maximum capacity, to achieve high resource utilization
- Fine-grained YARN container allocation, set by specifying small increment vcore/memory sizes, to support big, medium and small jobs
- No CGroups-based CPU resource isolation, because it caused system stability issues in our IOP 4.1/RHEL 6.5 environments
Spark within BigInsights
✓ Deployment
  • Ambari for installation and deployment
  • Spark (compute nodes) co-located with the HDFS data nodes
  • Cluster mode with YARN as the resource manager
✓ Runtime Configuration
  • A bad configuration may cause jobs to under-perform or fail, or make the cluster unstable
  • Methodology to configure the number of partitions, the cores/memory per executor, and the number of executors
✓ Monitoring and Tuning
  • Spark Streaming stability (log monitoring, checkpoints)
  • Handling massive numbers of small files
  • Shuffle, partitioning, IO utilization, etc.
  • Job execution, GC time, etc. via dashboard
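The runtime-configuration item names three knobs: partitions, cores/memory per executor, and executor count. The talk does not spell out its methodology, so the sketch below uses a widely cited rule of thumb instead (about 5 cores per executor, one core and 1 GB reserved per node, one executor slot left for the YARN ApplicationMaster). The function name `size_executors` and all default values are assumptions for illustration; only the Spark property names in the returned dict are real settings.

```python
import math


def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10):
    """Rule-of-thumb Spark-on-YARN sizing; every default is an assumption."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Leave one executor slot free for the YARN ApplicationMaster.
    num_executors = nodes * executors_per_node - 1

    # Each slot's memory is split into heap plus off-heap overhead,
    # where overhead is modelled as heap * overhead_fraction.
    slot_mem_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = math.floor(slot_mem_gb / (1 + overhead_fraction))

    # Starting point for parallelism: 2 tasks per executor core.
    partitions = num_executors * cores_per_executor * 2

    return {
        "spark.executor.instances": num_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
        "spark.default.parallelism": partitions,
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
    for key, value in size_executors(10, 16, 64).items():
        print(key, value)
```

Numbers produced this way are only a starting point; shuffle behaviour and GC time observed on the dashboard should drive further tuning.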
Data Perspective Considerations
• Data process flow
• Data management
  − capacity sizing, layout in HDFS, lifecycle mgmt
• Data movement
  − between the big data platform and other systems
[Diagram: Data Process Flow; label: RDBMS]
Data lifecycle management
• 5 layers of data in the system
  • L1: raw data ingested into HDFS
  • L2: ELT data (pre-processing with streaming) in HDFS
  • L3: result data from the algorithm model in HDFS
  • L4: data for visualization (in HDFS or RDBMS)
  • L5: archived data in external storage (compressed)
• Design the data layout, # of copies, retention, etc. in HDFS
• Jobs to prune outdated data in HDFS (Oozie)
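A pruning job of the kind listed above has to decide which date partitions have outlived their layer's retention. The sketch below shows that decision only, in plain Python. The path layout `/data/<layer>/dt=YYYY-MM-DD` and the retention values are hypothetical; the talk gives neither.

```python
from datetime import date, timedelta

# Hypothetical retention per layer, in days (not taken from the talk).
RETENTION_DAYS = {"L1": 7, "L2": 30, "L3": 90, "L4": 365}


def expired_partitions(partitions, today, retention=RETENTION_DAYS):
    """Return the partition paths older than their layer's retention.

    Each path is assumed to look like '/data/<layer>/dt=YYYY-MM-DD'.
    Layers without a retention entry are never reported as expired.
    """
    expired = []
    for path in partitions:
        parts = path.strip("/").split("/")
        layer, dt_part = parts[-2], parts[-1]
        day = date.fromisoformat(dt_part.split("=", 1)[1])
        keep_days = retention.get(layer)
        if keep_days is not None and today - day > timedelta(days=keep_days):
            expired.append(path)
    return expired


if __name__ == "__main__":
    paths = [
        "/data/L1/dt=2016-01-01",
        "/data/L1/dt=2016-01-09",
        "/data/L3/dt=2016-01-01",
    ]
    print(expired_partitions(paths, today=date(2016, 1, 10)))
```

In a scheduled job, the returned paths would be handed to whatever delete or archive step the platform uses; that step is left out here.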
Data movement
• Data ingestion into the big data platform
  • Offline/online data ingestion: HDFS loading from external storage/FTP server, plus streaming
  • Future: Kafka + Spark Streaming (more data sources, analytics path)
• Data export from the big data platform
  • Near real-time heatmap generation
  • Algorithm model results exported to RDBMS with Sqoop
[Diagram; labels as extracted]
• HDFS load from FTP
• Spark Streaming
• ArcGIS Server (generates heatmap based on shp file)
• Real-time display
• Retrospective query
• Pushed to the database every 30 minutes
• Basic Algorithm: stay points in HDFS
• FTP push
• HDFS
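The diagram mentions results being pushed to the database every 30 minutes for heatmap display. A minimal sketch of that kind of windowed aggregation follows, in plain Python rather than Spark Streaming. The record shape `(timestamp_s, user_id, cell_id)` and the choice to count distinct users per cell are assumptions, not details from the talk.

```python
from collections import Counter

WINDOW_S = 30 * 60  # 30-minute window, matching the push interval on the slide


def window_start(timestamp_s):
    """Floor a Unix timestamp (seconds) to the start of its 30-minute window."""
    return timestamp_s - timestamp_s % WINDOW_S


def heatmap_counts(records):
    """Count distinct users per (window_start, cell_id).

    records: iterable of (timestamp_s, user_id, cell_id) tuples.
    """
    seen = set()
    counts = Counter()
    for timestamp_s, user_id, cell_id in records:
        key = (window_start(timestamp_s), cell_id, user_id)
        if key not in seen:  # count each user once per window and cell
            seen.add(key)
            counts[(key[0], cell_id)] += 1
    return counts


if __name__ == "__main__":
    sample = [(0, "u1", "c1"), (100, "u1", "c1"),
              (200, "u2", "c1"), (1800, "u1", "c1")]
    print(dict(heatmap_counts(sample)))
```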
Algorithm Model of Trajectory (OD) Analysis
[Diagram; labels as extracted]
• Cellular signal data
• ELT
• Trajectory Sequence
• Multi-Day Stable Point
• Stay Points
• OD Identification
• Commute
• OD Statistics
• OD Index Stats
• Commute Stats
• People Flow Stats
• Data Quality Index
• Statistics data export
• >1km, >1km
• GROUP1, GROUP2, GROUP3 (by different area type)
• Base Algorithm
• Application Algorithm
• Algorithm Accuracy Validation
• Algorithm Performance
• Algorithm Stability
• Algorithm Extensibility
• Algorithm Configurable
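The diagram lists stay points and OD identification as base-algorithm steps without giving the algorithm itself. The sketch below uses the classic distance-and-duration stay-point heuristic as a stand-in, not as the authors' method. The 1 km distance threshold echoes the ">1km" label on the slide, but how that label is actually used is an assumption, as are the 30-minute minimum stay and all function names.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle (spherical) distance in km, Earth radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def stay_points(track, max_dist_km=1.0, min_stay_s=1800):
    """Detect stay points in one user's time-sorted track.

    track: list of (timestamp_s, lon, lat).
    Returns a list of (lon, lat, arrive_s, leave_s).
    """
    result = []
    i, n = 0, len(track)
    while i < n:
        j = i + 1
        # Extend while records stay within max_dist_km of the anchor record i.
        while j < n and haversine_km(track[i][1], track[i][2],
                                     track[j][1], track[j][2]) <= max_dist_km:
            j += 1
        if track[j - 1][0] - track[i][0] >= min_stay_s:
            cluster = track[i:j]
            lon = sum(r[1] for r in cluster) / len(cluster)
            lat = sum(r[2] for r in cluster) / len(cluster)
            result.append((lon, lat, track[i][0], track[j - 1][0]))
            i = j
        else:
            i += 1
    return result


def od_pairs(stays):
    """Treat each pair of consecutive stay points as one origin-destination trip."""
    return list(zip(stays, stays[1:]))
```

For example, a track with three records near (116.40, 39.90) over 2000 seconds followed by three records near (116.50, 39.90) yields two stay points and one OD pair.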
Geospatial Computation with Spark
• Requirements
  − Spark to directly support SDE/Shp/GeoJSON files
  − Most of the geospatial computation runs in the Spark cluster (point-area relationship, spherical distance, geospatial stats, etc.)
  − Performance challenge: 20M records per iteration of geospatial computation
• Solution Design
[Diagram; labels as extracted]
  − SDE, shp file
  − SDE interface, SHP interface
  − Spark-GIS lib: Geospatial API, Grid API
  − Grid definition
  − Spark Cluster
  − Basic Algorithm (geospatial computation)
  − Application Algorithm (geospatial statistics)
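The point-area relationship named in the requirements can be illustrated with a ray-casting test plus a bounding-box pre-filter. This is a plain-Python sketch of the technique, not the Spark-GIS library from the talk; the names `locate` and `point_in_polygon` are made up, and coordinates are treated as planar, which is an approximation for small areas.

```python
def bounding_box(polygon):
    """Return (min_lon, min_lat, max_lon, max_lat) of a list of (lon, lat)."""
    lons = [p[0] for p in polygon]
    lats = [p[1] for p in polygon]
    return min(lons), min(lats), max(lons), max(lats)


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count edge crossings of a ray going east from the point.

    polygon: list of (lon, lat) vertices; repeating the first vertex
    at the end is optional.
    """
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate(lon, lat, areas):
    """Return the name of the first area containing the point, or None.

    areas: dict mapping area name to polygon.
    """
    for name, polygon in areas.items():
        min_lon, min_lat, max_lon, max_lat = bounding_box(polygon)
        # Cheap rejection before the more expensive crossing test.
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

With tens of millions of records per iteration, the bounding boxes would normally be computed once and reused (for example as a broadcast lookup) instead of being rebuilt per point as they are in this sketch.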
Spatial Grid Design for Spark
[Diagram; labels as extracted]
• Home-Office Model
• Statistic by Group
• Group-Grid Mapping
• Statistic by Grid
• Grid Home-Office Statistic Table
• Grid Statistic Table
• Relation
• User-defined Query
• Pre-defined Query
• Convert formula, expansion formula
• Spark Base Algorithm
• Spark App Statistic
• Relational Database
• Web GIS Application
• Web GIS Front-end
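The "Statistic by Grid" step in the diagram implies mapping each coordinate to a grid cell and aggregating per cell. A minimal sketch of one way to do that with a regular degree-based grid is shown below. The grid origin, the cell size and the `(row, col)` cell id are assumptions; the talk does not describe its grid definition.

```python
import math
from collections import Counter


def grid_id(lon, lat, origin_lon, origin_lat, cell_deg):
    """Map a coordinate to the (row, col) of a regular grid.

    The grid starts at (origin_lon, origin_lat) and every cell is
    cell_deg degrees wide and tall.
    """
    col = math.floor((lon - origin_lon) / cell_deg)
    row = math.floor((lat - origin_lat) / cell_deg)
    return row, col


def count_by_grid(points, origin_lon, origin_lat, cell_deg):
    """Count points per grid cell; points is an iterable of (lon, lat)."""
    return Counter(
        grid_id(lon, lat, origin_lon, origin_lat, cell_deg)
        for lon, lat in points
    )


if __name__ == "__main__":
    pts = [(116.405, 39.905), (116.406, 39.906), (116.425, 39.905)]
    print(count_by_grid(pts, origin_lon=116.0, origin_lat=39.0, cell_deg=0.01))
```

Pre-aggregating by cell keeps the per-query work in the relational database small: a user-defined area can be answered by summing the cells it covers instead of rescanning raw records.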
Agenda
• Background – problem and data
• Architecture
• Technical Design
• Big Data Platform design
• Data governance design
• Algorithm model
• Spark spatial computing
• Scenarios demo
• Conclusion and Next step
Scenario --1
Population Heatmap, Commute OD Route
Better Understanding of Key Metrics for Urban City Planning with Big Data (sampled data vs. all data; historical data vs. current data)
• Urban planners can plan the city more reasonably based on the current population distribution
• Traffic planning institutes can leverage this to optimize the traffic network
• City management units can better plan city service facilities and detect abnormal city events based on population flow
New Methodology & New Applications Using Big Data for Better Urban City Planning, Monitoring & Decision Making
• Quickly understand current commute traffic volume and directions, and identify bottlenecks
• Optimize traffic planning and scheduling during commute peak time
• More new applications can be built for planners and administrators, and new data services can be provided to city residents so they can participate in city management
Scenario --2
Commute Time Cost
Office-Residence Imbalance
Big Data Architecture Key Point – System
• Big data product selection
  • ODPi (Open Data Platform)
• Big data component selection
  • Data movement, data store, computing, SQL interface…
  • …
• Deployment mode selection
  • Local cluster
  • IaaS
  • Big data cloud
• Separate deployment env and data exploration env
Big Data Architecture Key Point – Data
• Data collection
• Data ELT
• Data pipeline
• Data lifecycle governance
• Data volume plan
• Data fusion
• Spatial data analysis and visualization
Big Data Architecture Key Point – Platform
• HA
• Security
• Monitoring & stability
• Scale-out and upgrade
• Resource management
• Job scheduling
• Multi-tenancy
Big Data Architecture Key Point – Algorithm & Model
• Business analysis
• Alg model design
• Model verification
• Model adjustment
• Model validity assurance
Road Ahead…
Deep analysis with more scenarios
• Traffic prediction
• Trip prediction
• Commute methods
• etc
More data sources for trajectory/traffic
• GPS for taxi, bus
• RFID on road
• Road monitoring data
• Subway stop check-in/out info
• Parking Lot
• Fusion with weather, social data
Data exploration environment to support data science &
continuous engineering of new features
Leverage more SparkML for traffic prediction
Cluster scale-out with more data and algorithms
Data ingestion with Kafka/Flume (message hub)
SQL on Hadoop
Graph computation for nearest path and road matching
[Diagram; labels as extracted]
• Current Deployment
• Big Data Platform (scale-out)
• New Scenarios w/ new data
• Data Exploration Environment
• Engineering and deployment
• Data movement
Spark GeoSpatial Analysis for Other Scenarios
[Diagram; labels as extracted]
• Smart Transportation, Smart Logistics, Smart Tourism, others
• Spatial-Temporal Trajectory Analysis for humans
• Spatial-Temporal Trajectory Analysis for vehicles
• Trajectory Data Management
• Trajectory Analysis Function
• Common API: geospatial data pre-processing, geospatial geometry computing, surface mesh computing
• Distributed geospatial computing API (based on Spark)
• IBM’s Big Data Analytics Platform
Big Data University and Data Science Workbench
− A community initiative led by IBM
− @yourpace, @yourplace online courses about data
− Developed by industry experts
− Free courses by the community with hands-on labs
− Certificate of completion and badges
− Looking for contributors!
Integrated Set of Tools, Languages and Execution Environments
Clean and Prepare Data
• OpenRefine
Experiment with and Analyze Data
• Jupyter Notebooks, R Studio, SeaHorse
Connect to data processing engines:
• Spark, Hadoop, dashDB, BigSQL, BigR
http://DataScientistWorkbench.com
http://bigdatauniversity.com
