Vector Databases 101 - An introduction to the world of Vector Databases
Time Series in Driverless AI by Marios Michailidis
1. Background
• Competitive data scientist at H2O.ai
• PhD in ensemble methods at UCL
• Former kaggle #1 – over 150+ competitions
2. H2O.ai Product Suite
Automatic feature engineering,
machine learning and interpretability
• 100% open source – Apache V2 licensed
• Built for data scientists – interface using R, Python or H2O Flow
(interactive notebook interface)
• Enterprise support subscriptions
• Fully automated machine learning
from ingest to deployment
• User licenses on a per seat basis
(annual subscription)
H2O AI open source engine
integration with Spark
Lightning fast machine
learning on GPUs
In-memory, distributed
machine learning algorithms
with H2O Flow GUI
3. Confidential3
Automate
Features Target
Data Quality and
Transformation Modeling Table
Model
Building
Model
Deployment
Data Integration
+
Driverless AI Workflow
visualizations
Machine Learning
Interpretability
FAST!!
Objective
Optimization
4. 0
50
100
150
200
250
12/21/2017 12/31/2017 1/10/2018 1/20/2018 1/30/2018 2/9/2018 2/19/2018
Sales over time
0
10
20
30
40
50
60
70
80
12/21/2017 12/31/2017 1/10/2018 1/20/2018 1/30/2018 2/9/2018 2/19/2018
Sales over time
0
50
100
150
200
250
300
350
400
12/31/2017 1/2/2018 1/4/2018 1/6/2018 1/8/2018 1/10/2018 1/12/2018 1/14/2018
Sales over time
Linear relationshipNonlinear (seasonal) relationship
What is a Time Series Problem?
5. 0
100
200
300
400
500
600
700
800
12/21/2017 12/31/2017 1/10/2018 1/20/2018 1/30/2018 2/9/2018 2/19/2018 3/1/2018 3/11/2018
sales per per day (all groups)
0
100
200
300
400
500
600
700
800
12/21/2017 12/31/2017 1/10/2018 1/20/2018 1/30/2018 2/9/2018 2/19/2018 3/1/2018 3/11/2018
sales by group
group 1 group 2 group 3
time groups sales
01/01/2018 group1 30
01/01/2018 group2 100
01/01/2018 group3 10
02/01/2018 group1 60.2
02/01/2018 group2 200.2
02/01/2018 group3 20.2
03/01/2018 group1 90.3
03/01/2018 group2 300.3
03/01/2018 group3 30.3
04/01/2018 group1 120.4
04/01/2018 group2 400.4
04/01/2018 group3 40.4
Time Groups
6. Modeling Foundation
1 2 3 4 5 6 7 8 9 10 11 12
[Gap]
1 2 3 4 5 6 7 8 9 10 11 12
[Gap] [Gap]
testtrain
tvs train tvs valid test
time:
Gap | Forecast Horizon
invalid lag size
valid lag size
time:
7. • Single time split (most recent training data becomes validation)
Validation Schemas #1
0
10000
20000
30000
40000
50000
60000
70000
2/5/20103/5/20104/5/20105/5/20106/5/20107/5/20108/5/20109/5/201010/5/201011/5/201012/5/20101/5/20112/5/20113/5/20114/5/20115/5/20116/5/20117/5/20118/5/20119/5/201110/5/201111/5/201112/5/20111/5/20122/5/20123/5/20124/5/2012
8. Validation Schemas #2
Rolling window with adjusting training size Rolling window with constant training size
• Multi window validation
15. • Ranking based on autocorrelation
• Pre-defined intervals (based on estimated frequency)
Daily data
• [7, 14, 21, …]
• [14, 28, 32, …]
• …
Weekly data
• [2, 4, 6, 8, …]
• [4, 8, 12, 16, …]
• …
…
Candidates for Lag-Sizes
16. • Dropouts
• Random replacement of actual lag-values by „n.a.“
• Align frequency of available lag information between train and validation/test
Regularization of Lag-Features
20. Bring Your own recipe!
• Bring in your domain knowledge and
achieve even better results.
• Add additional transformers from the
open source git repo :
https://github.com/h2oai/driverlessai-
recipes
• You can contribute too!
• They follow sklearn type of api
• Add models, scorers or transformers
21. The Prophet model
y(t)= g(t) + s(t) + h(t)
g(t) Piecewise linear or logistic
regressor to calculate trend
s(t) models periodic changes (e.g.
weekly/yearly seasonality)
h(t) holiday component