Time Series Analysis: Build It with TensorFlow and Take On Kaggle
Taegyun Jeon and Seunghyun Jeon
Developer of Satrec Initiative
Contents
● Time Series Analysis
● Introduction to Kaggle
● KaggleZeroToAll
After finishing this codelab, you will be able to:
1. Understand time series problems
2. Solve problems on Kaggle
3. Submit your own model to the Kaggle leaderboard
Time Series Analysis
● Time Series Analysis
● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN
● TensorFlow TimeSeries API (TFTS)
Time Series Data
● Stock values
● Economic variables
● Weather
● Sensor: Internet-of-Things
● Energy demand
● Signal processing
● Sales forecasting
The Problem
● Standard Supervised Learning
○ IID assumption
○ Same distribution for training and test data
○ Distributions fixed over time (stationarity)
● Time Series
○ None of these assumptions hold!
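One practical consequence: because observations are ordered and dependent, a held-out set is usually taken from the end of the series rather than by random shuffling. The sketch below is a minimal illustration; `chronological_split` and its `test_fraction` parameter are hypothetical names, not part of the original material.

```python
import numpy as np

def chronological_split(values, test_fraction=0.2):
    """Split a series without shuffling, so the test part is strictly later in time."""
    values = np.asarray(values)
    n_test = max(1, int(round(len(values) * test_fraction)))
    return values[:-n_test], values[-n_test:]

series = np.arange(10)  # stand-in for 10 consecutive observations
train_part, test_part = chronological_split(series, test_fraction=0.3)
print(train_part)
print(test_part)
```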
Autoregressive (AR) Models
● AR(p) model: $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$
: Linear generative model based on the pth order Markov assumption
○ $\epsilon_t$: zero mean uncorrelated random variables with variance $\sigma^2$
○ $a_1, \dots, a_p$: autoregressive coefficients
○ $Y_t$: observed stochastic process
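To make the AR(p) recursion $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$ concrete, here is a minimal NumPy sketch that simulates an AR(2) series and recovers its coefficients by ordinary least squares. The function names, the Gaussian noise, and the coefficient values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ar(coeffs, n_steps, sigma=1.0, seed=0):
    """Simulate Y_t = sum_i a_i * Y_{t-i} + eps_t with Gaussian noise."""
    rng = np.random.default_rng(seed)
    coeffs = np.asarray(coeffs, dtype=float)
    p = len(coeffs)
    y = np.zeros(n_steps + p)  # the first p entries are zero initial values
    eps = rng.normal(0.0, sigma, size=n_steps + p)
    for t in range(p, n_steps + p):
        # y[t-p:t][::-1] is (Y_{t-1}, ..., Y_{t-p}), matching (a_1, ..., a_p)
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
    return y[p:]

def fit_ar_least_squares(y, p):
    """Estimate (a_1, ..., a_p) by regressing Y_t on its p previous values."""
    # Column i holds Y_{t-i-1} for t = p, ..., len(y) - 1
    lagged = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coeffs

true_coeffs = np.array([0.6, -0.3])
series = simulate_ar(true_coeffs, n_steps=5000)
estimated = fit_ar_least_squares(series, p=2)
print(estimated)  # expected to land near the true coefficients
```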
Moving Average (MA)
● MA(q) model: $Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
: Linear generative model for the noise term based on the qth order Markov assumption
○ $b_1, \dots, b_q$: moving average coefficients
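A small sketch of the MA(q) model $Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$: it simulates an MA(1) series and checks that the sample autocorrelation is clearly non-zero at lag 1 and near zero at lag 2, which is the signature of a finite-order moving average. `simulate_ma` and `autocorr` are hypothetical helper names, and the Gaussian noise is an assumption.

```python
import numpy as np

def simulate_ma(coeffs, n_steps, sigma=1.0, seed=0):
    """Simulate Y_t = eps_t + sum_j b_j * eps_{t-j} with Gaussian noise."""
    rng = np.random.default_rng(seed)
    q = len(coeffs)
    eps = rng.normal(0.0, sigma, size=n_steps + q)
    # Weights (1, b_1, ..., b_q) are applied to (eps_t, eps_{t-1}, ..., eps_{t-q})
    weights = np.concatenate(([1.0], coeffs))
    return np.convolve(eps, weights, mode="valid")

def autocorr(y, lag):
    """Sample autocorrelation at a positive lag."""
    y = y - y.mean()
    return float(np.dot(y[:-lag], y[lag:]) / np.dot(y, y))

series = simulate_ma([0.5], n_steps=20000)
# For MA(1) the theoretical lag-1 autocorrelation is b / (1 + b^2) = 0.4
print(autocorr(series, 1), autocorr(series, 2))
```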
ARMA Model
● ARMA(p,q) model: $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
: generative linear model that combines AR(p) and MA(q) models
Stationarity
● Definition: a sequence of random variables is stationary if its
distribution is invariant to shifting in time.
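A quick way to see what this means in practice is to compare the variance across many simulated paths at an early and a late time step. The sketch below assumes Gaussian noise and uses a random walk as the non-stationary example and an AR(1) process with coefficient 0.5 as the stationary one; all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 2000, 400
noise = rng.normal(size=(n_paths, n_steps))

# Random walk: Y_t = Y_{t-1} + eps_t (variance keeps growing with t)
random_walk = np.cumsum(noise, axis=1)

# AR(1): Y_t = 0.5 * Y_{t-1} + eps_t (variance settles after a short burn-in)
ar1 = np.zeros_like(noise)
for t in range(1, n_steps):
    ar1[:, t] = 0.5 * ar1[:, t - 1] + noise[:, t]

for name, paths in [("random walk", random_walk), ("AR(1)", ar1)]:
    var_early = paths[:, 100].var()
    var_late = paths[:, 399].var()
    print(f"{name}: variance at t=100 is {var_early:.2f}, at t=399 is {var_late:.2f}")
```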
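Lag Operator
● Definition: Lag operator $L$ is defined by $L Y_t = Y_{t-1}$
● ARMA model in terms of the lag operator: $\left(1 - \sum_{i=1}^{p} a_i L^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j L^j\right) \epsilon_t$
● Characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$
can be used to study properties of this stochastic process.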
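One property commonly read off the characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ is stationarity: the AR part is stationary when every root of $P$ lies strictly outside the unit circle. The sketch below checks this with `np.roots`; `characteristic_roots` and `is_stationary` are hypothetical helper names.

```python
import numpy as np

def characteristic_roots(ar_coeffs):
    """Roots of P(z) = 1 - a_1 z - ... - a_p z^p."""
    ar_coeffs = np.asarray(ar_coeffs, dtype=float)
    # np.roots expects the highest-degree coefficient first: (-a_p, ..., -a_1, 1)
    poly = np.concatenate((-ar_coeffs[::-1], [1.0]))
    return np.roots(poly)

def is_stationary(ar_coeffs):
    """True when all characteristic roots lie strictly outside the unit circle."""
    return bool(np.all(np.abs(characteristic_roots(ar_coeffs)) > 1.0))

print(is_stationary([0.6, -0.3]))  # AR(2) with complex roots of modulus about 1.83
print(is_stationary([1.0]))        # random walk: P(z) = 1 - z has a unit root
```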
ARIMA Model
● Definition: Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.
● Characteristic polynomial with unit roots can be factored: $P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots
● ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - L)^D Y_t$
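The operator $(1 - L)^D$ is simply D rounds of differencing, which `np.diff(y, n=D)` computes. As a minimal sketch with made-up data: differencing a random walk once recovers its white-noise increments, and differencing a quadratic trend twice leaves a constant.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = rng.normal(size=1000)
random_walk = np.cumsum(eps)  # Y_t = Y_{t-1} + eps_t, i.e. ARIMA(0, 1, 0)

# (1 - L) Y_t = Y_t - Y_{t-1}
differenced = np.diff(random_walk, n=1)
print(np.allclose(differenced, eps[1:]))

# (1 - L)^2 applied to a quadratic trend t^2
second_diff = np.diff(np.array([0, 1, 4, 9, 16]), n=2)
print(second_diff)
```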
Other Extensions
● Further variants:
○ Models with seasonal components (SARIMA)
○ Models with side information (ARIMAX)
○ Models with long-memory (ARFIMA)
○ Multivariate time series models (VAR)
○ Models with time-varying coefficients
○ Other non-linear models
Recurrent Neural Networks
Is there an easy way to implement this?
TensorFlow TimeSeries
● tf.contrib.timeseries
○ Classic model (state space, autoregressive)
○ Flexible infrastructure
○ Data management
■ Chunking
■ Batching
■ Saving model
■ Truncated backpropagation
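To picture what chunking and batching mean for a time series, the sketch below cuts random fixed-length windows out of one long series and stacks them into a batch. It is a plain NumPy illustration under assumed names (`random_window_batch` is hypothetical), not the TFTS implementation.

```python
import numpy as np

def random_window_batch(times, values, window_size, batch_size, seed=0):
    """Sample batch_size contiguous windows of length window_size from one series."""
    rng = np.random.default_rng(seed)
    # Valid start positions leave room for a full window
    starts = rng.integers(0, len(values) - window_size + 1, size=batch_size)
    index = starts[:, None] + np.arange(window_size)[None, :]
    return times[index], values[index]

times = np.arange(200)
values = np.sin(times / 10.0)  # made-up series
batch_times, batch_values = random_window_batch(
    times, values, window_size=40, batch_size=32)
print(batch_times.shape, batch_values.shape)
```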
Is it really that easy?
Let's start with an example
Introduction to Kaggle
https://www.kaggle.com/
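What is Kaggle?
A data playground where you can play with data to your heart's content
How to play on Kaggle
1. Pick a competition
2. Examine and analyze the problem and the data
3. See how other people approach it
4. Build your own solution
Competition types
1. Featured: companies and organizations put up prize money for competitors
2. Research: competitions for research purposes
3. Playground: practice problems
4. Getting Started: practice problems
Some common competition rules
1. Limited number of submissions per day
2. Only a fixed fraction of the test set counts toward the public score
3. Final scores are revealed when the competition ends
4. Datasets remain accessible even after the competition ends!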
https://www.kaggle.com/c/favorita-grocery-sales-forecasting
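Forecasting sales at brick-and-mortar grocery stores
If it looks complicated...
Use someone else's good analysis:
https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda
In most competitions, the most upvoted kernels are EDA (exploratory data analysis) kernels
When you first join a competition, start by reading an EDA kernel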
https://www.kaggle.com/towever/devfest
KaggleZeroToAll
Prepare
# -*- coding: utf-8 -*-
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
import tensorflow as tf  # tf.contrib exists only in TensorFlow 1.x
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model
import matplotlib
import matplotlib.pyplot as plt
# Notebook magic (Jupyter / Kaggle kernel): render plots inline
%matplotlib inline
Read Dataset
dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1, 2, 3, 4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip the earliest rows, keep the header
                    )
train.loc[train.unit_sales < 0, 'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = np.log1p(train['unit_sales'])  # log(1 + x) conversion
train['dow'] = train['date'].dt.dayofweek  # day of week, Monday=0 ... Sunday=6
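The log conversion is undone at submission time with its inverse, so it helps to see the pair side by side. A minimal sketch with made-up sales values:

```python
import numpy as np

unit_sales = np.array([0.0, 1.0, 9.0, 99.0])  # made-up values
logged = np.log1p(unit_sales)   # log(1 + x); zero sales stay at zero
restored = np.expm1(logged)     # exp(x) - 1; the inverse of log1p

print(logged)
print(np.allclose(restored, unit_sales))
```

Working in log space compresses large sales values, so a few very large days have less influence on the averages computed later.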
Preprocess data
# Create a record for every item, in every store, on every date,
# so that daily unit sales averages are calculated correctly.
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date', 'store_nbr', 'item_nbr']
    )
)
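The reindexing step can be easier to follow on a toy table. In this sketch (made-up dates, stores, and sales; two key columns instead of three), the combination that never appears in the data shows up as a new row, which is then filled with zero sales.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02"]),
    "store_nbr": [1, 2, 1],
    "unit_sales": [3.0, 5.0, 4.0],
})

# Every (date, store) combination, whether or not it appears in the data
full_index = pd.MultiIndex.from_product(
    (sales["date"].unique(), sales["store_nbr"].unique()),
    names=["date", "store_nbr"],
)
dense = (
    sales.set_index(["date", "store_nbr"])
    .reindex(full_index)           # missing pairs become NaN rows
    .fillna({"unit_sales": 0})     # treat a missing record as zero sales
    .reset_index()
)
print(dense)
```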
Preprocess data
train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs
train.reset_index(inplace=True)  # reset index, restoring the key columns
lastdate = train['date'].iloc[-1]  # last day in the data
train.head()
Preprocess data
# Mean log sales per (item, store, day of week)
tmp = train[['item_nbr', 'store_nbr', 'dow', 'unit_sales']]
ma_dw = tmp.groupby(['item_nbr', 'store_nbr', 'dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()
Preprocess data
# Mean of the day-of-week averages per (item, store)
tmp = ma_dw[['item_nbr', 'store_nbr', 'madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()
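A toy version of the two groupby steps, with made-up numbers, shows what `madw` and `mawk` hold and how their ratio acts as a day-of-week factor. The `dow_factor` column is a name introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_nbr": [1] * 4,
    "store_nbr": [1] * 4,
    "dow": [0, 0, 1, 1],
    "unit_sales": [1.0, 3.0, 4.0, 8.0],
})

# Average per (item, store, day of week)
madw = (toy.groupby(["item_nbr", "store_nbr", "dow"])["unit_sales"]
           .mean().to_frame("madw").reset_index())
# Average of those day-of-week averages per (item, store)
mawk = (madw.groupby(["item_nbr", "store_nbr"])["madw"]
            .mean().to_frame("mawk").reset_index())

merged = madw.merge(mawk, on=["item_nbr", "store_nbr"])
merged["dow_factor"] = merged["madw"] / merged["mawk"]
print(merged)
```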
Moving Average using Pandas
# Mean log sales per (item, store) over the whole loaded period
tmp = train[['item_nbr', 'store_nbr', 'unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')
Moving Average using Pandas
# Add means over the most recent 112, 56, 28, 14, 7, 3 and 1 days
for i in [112, 56, 28, 14, 7, 3, 1]:
    tmp = train[train.date > lastdate - timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais' + str(i))
    ma_is = ma_is.join(tmpg, how='left')
del tmp, tmpg
Moving Average using Pandas
ma_is['mais'] = ma_is.median(axis=1)  # median across all window means
ma_is.reset_index(inplace=True)
ma_is.head()
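The same idea on a toy series (made-up dates and values, shorter windows): compute the mean over several trailing windows, then take the median of those means as a single robust baseline. Column names such as `mean_4` and `median_of_means` are introduced here for illustration.

```python
from datetime import timedelta
import pandas as pd

dates = pd.date_range("2017-08-01", periods=8, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "store_nbr": 1,
    "item_nbr": 1,
    "unit_sales": [0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 4.0, 6.0],
})
last_date = toy["date"].max()
keys = ["item_nbr", "store_nbr"]

# Mean over the full history, then over the trailing 4, 2 and 1 days
windows = toy.groupby(keys)["unit_sales"].mean().to_frame("mean_all")
for days in [4, 2, 1]:
    recent = toy[toy["date"] > last_date - timedelta(days)]
    windows = windows.join(
        recent.groupby(keys)["unit_sales"].mean().to_frame(f"mean_{days}"),
        how="left",
    )

# The median of the window means is less sensitive to any single window
windows["median_of_means"] = windows.median(axis=1)
print(windows)
```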
Make data trainable
def data_to_npreader(store_nbr: int, item_nbr: int):
    """Return (times, values, NumpyReader) for one store/item series."""
    unit_sales = train[np.logical_and(train['store_nbr'] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.arange(len(unit_sales))  # integer time steps 0, 1, 2, ...
    y = np.asarray(unit_sales)
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader
TensorFlow TimeSeries - ARRegressor
x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
# window_size 40 = 30 input steps + 10 output steps
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)
TensorFlow TimeSeries - ARRegressor
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
# Forecast 16 steps past the end of the evaluated series
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
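Forecasting past the end of the data means the model has to consume its own predictions. For a plain linear AR(p) model that idea can be sketched in a few lines of NumPy. This is a conceptual illustration only, not a description of how `ARRegressor` works internally; `ar_forecast` and the coefficients are hypothetical.

```python
import numpy as np

def ar_forecast(history, coeffs, steps):
    """Roll a linear AR(p) model forward, feeding each prediction back in."""
    p = len(coeffs)
    window = [float(v) for v in history[-p:]]  # oldest value first
    forecasts = []
    for _ in range(steps):
        # window[::-1] is (Y_{t-1}, ..., Y_{t-p}), matching (a_1, ..., a_p)
        next_value = float(np.dot(coeffs, window[::-1]))
        forecasts.append(next_value)
        window = window[1:] + [next_value]
    return np.array(forecasts)

forecast = ar_forecast(np.array([1.0, 2.0]), coeffs=[0.5, 0.25], steps=3)
print(forecast)
```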
TensorFlow TimeSeries - ARRegressor
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
TensorFlow TimeSeries - LSTM
Get the _LSTMModel class from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py
TensorFlow TimeSeries - LSTM
x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)
TensorFlow TimeSeries - LSTM
(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
TensorFlow TimeSeries - LSTM
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
Forecasting test data
# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek
Forecasting test data
# Moving Average: use the median of window means as the baseline forecast
test = pd.merge(test, ma_is, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr', 'store_nbr', 'dow'])
test['unit_sales'] = test.mais
# Autoregressive: overwrite the series that was modeled with ARRegressor
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
ar_mask = np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574)
test.loc[ar_mask, 'unit_sales'] = ar_predictions['mean'].reshape(-1)
# LSTM: overwrite the series that was modeled with the LSTM
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
lstm_mask = np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574)
test.loc[lstm_mask, 'unit_sales'] = lstm_predictions['mean'].reshape(-1)
Forecasting test data
# Scale by the day-of-week factor madw / mawk where the weekly mean is positive
pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = np.expm1(test['unit_sales'])  # restore unit values (inverse of log1p)
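A small worked example, with made-up numbers, of the day-of-week scaling followed by the inverse transform. Because the scaling happens in log space, multiplying by a factor f is equivalent to raising (1 + sales) to the power f before subtracting 1. The frame `test_toy` and the mask name `has_history` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

test_toy = pd.DataFrame({
    # Baseline forecasts in log1p space; the first two correspond to 9 units
    "unit_sales": [np.log1p(9.0), np.log1p(9.0), 0.5],
    "madw": [1.2, 0.8, np.nan],
    "mawk": [1.0, 1.0, 0.0],
})

# Only rows with a positive weekly mean get the day-of-week adjustment
has_history = test_toy["mawk"] > 0
scaled = (test_toy.loc[has_history, "unit_sales"]
          * test_toy.loc[has_history, "madw"]
          / test_toy.loc[has_history, "mawk"])
test_toy.loc[has_history, "unit_sales"] = scaled

# Back to unit values: expm1 is the inverse of log1p
test_toy["unit_sales"] = np.expm1(test_toy["unit_sales"].fillna(0))
print(test_toy["unit_sales"].round(2).tolist())
```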
Forecasting test data
holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not transferred
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # dates with no holiday row
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # holiday: +20%
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # promotion: +15%
test[['id', 'unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')
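The holiday and promotion adjustments can be traced on a toy frame. The dates, sales values, and frame names below are made up; this variant also skips the `fillna(True)` step by comparing with `.eq(False)`, which treats the missing values left by the merge as "not a holiday".

```python
import pandas as pd

test_toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-23", "2017-08-24", "2017-08-24"]),
    "onpromotion": [False, False, True],
    "unit_sales": [10.0, 10.0, 10.0],
})
holidays_toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-24"]),  # hypothetical holiday date
    "transferred": [False],
})

merged = test_toy.merge(holidays_toy, how="left", on="date")
# Rows without a matching holiday have a missing 'transferred' value, which is not equal to False
is_holiday = merged["transferred"].eq(False)
merged.loc[is_holiday, "unit_sales"] *= 1.2               # holiday: +20%
merged.loc[merged["onpromotion"], "unit_sales"] *= 1.15   # promotion: +15%
print(merged[["date", "onpromotion", "unit_sales"]])
```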
Thank you!

Time Series Analysis: Challenge Kaggle with TensorFlow

  • 1. 전태균, 전승현 Developer of Satrec Initiative Taegyun Jeon and Seunghyun Jeon 시계열 분석: TensorFlow로 짜보고 Kaggle 도전하기
  • 2. Time Series Analysis Introduction to Kaggle KaggleZeroToAll Contents
  • 3. 코드랩을 다 듣고 나시면 1.시계열 문제에 대해 이해! 2.Kaggle에서 문제 풀기 가능! 3.Kaggle Leaderboard에 본인의 모델 업로드!
  • 5. 시계열 분석 ● Time Series Analysis ● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN ● TensorFlow TimeSeries API (TFTS)
  • 6. 시계열 분석 ● Time Series Analysis ● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN ● TensorFlow TimeSeries API (TFTS)
  • 9. 시계열 데이터 ● Stock values ● Economic variables ● Weather ● Sensor: Internet-of-Things ● Energy demand ● Signal processing ● Sales forecasting
  • 10.
  • 11.
  • 12. 문제점 ● Standard Supervised Learning ○ IID assumption ○ Same distribution for training and test data ○ Distributions fixed over time (stationarity) ● Time Series ○ 모두 해당 되지 않음!!
  • 13. 시계열 분석 ● Time Series Analysis ● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN ● TensorFlow TimeSeries API (TFTS)
  • 14. Autoregressive (AR) Models ● AR(p) model : Linear generative model based on the pth order Markov assumption ○ : zero mean uncorrelated random variables with variance ○ : autoregressive coefficients ○ : observed stochastic process
  • 15. Moving Average (MA) ● MA(q) model : Linear generative model for noise term on the qth order Markov assumption ○ : moving average coefficients
  • 16. ARMA Model ● ARMA(p,q) model : generative linear model that combines AR(p) and MA(q) models
  • 17. Stationarity ● Definition: a sequence of random variables is stationary if its distribution is invariant to shifting in time.
  • 18. Lag Operator ● Definition: Lag operator is defined by ● ARMA model in terms of the lag operator: ● Characteristic polynomial can be used to study properties of this stochastic process.
  • 19. ARIMA Model ● Definition: Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots. ● Characteristic polynomial with unit roots can be factored: ● ARIMA(p, D, q) model is an ARMA(p,q) model for
  • 20. Other Extensions ● Further variants: ○ Models with seasonal components (SARIMA) ○ Models with side information (ARIMAX) ○ Models with long-memory (ARFIMA) ○ Multi-variate time series model (VAR) ○ Models with time-varing coefficients ○ other non-linear models
  • 30. 시계열 분석 ● Time Series Analysis ● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN ● TensorFlow TimeSeries API (TFTS)
  • 31. 쉽게 구현 할 수 있는 방법?
  • 32.
  • 33.
  • 34. TensorFlow TimeSeries ● tf.contrib.timeseries ○ Classic model (state space, autoregressive) ○ Flexible infrastructure ○ Data management ■ Chunking ■ Batching ■ Saving model ■ Truncated backpropagation
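  • 59. What is Kaggle?
  • 60. A data playground where you can play with data to your heart's content
  • 61. How to play on Kaggle 1. Pick a competition 2. Examine and analyze the problem and the data 3. See how other people approach it 4. Build your own solution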
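  • 64. Competition types 1. Featured: companies and institutions put up prize money 2. Research: research-oriented competitions 3. Playground: practice problems 4. Getting Started: practice problems
  • 65. Some common competition rules 1. Limited number of submissions per day 2. Only a fixed fraction of the test set counts toward the public score 3. Final scores are revealed when the competition ends 4. The dataset stays accessible even after the competition ends!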
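  • 73. If it looks complicated... build on someone else's good analysis: https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda
  • 74. In most competitions, the most upvoted kernel is an EDA. When you first join a competition, start by reading the EDA kernels.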
  • 79. Prepare
      # -*- coding: utf-8 -*-
      import datetime
      from datetime import timedelta
      import numpy as np
      import pandas as pd
      import tensorflow as tf
      from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
      from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
      from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model
      import matplotlib
      import matplotlib.pyplot as plt
      %matplotlib inline
  • 80. Read Dataset
      dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8'}
      train = pd.read_csv('../input/train.csv', usecols=[1, 2, 3, 4], dtype=dtypes,
                          parse_dates=['date'],
                          skiprows=range(1, 101688780))  # skip initial dates
      train.loc[train.unit_sales < 0, 'unit_sales'] = 0  # eliminate negatives
      train['unit_sales'] = np.log1p(train['unit_sales'])  # log1p transform (np directly; pd.np is gone in recent pandas)
      train['dow'] = train['date'].dt.dayofweek
  • 81. Preprocess data
      # Create records for all items, in all stores, on all dates
      # so that the daily unit sales averages are computed correctly.
      u_dates = train.date.unique()
      u_stores = train.store_nbr.unique()
      u_items = train.item_nbr.unique()
      train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
      train = train.reindex(
          pd.MultiIndex.from_product(
              (u_dates, u_stores, u_items),
              names=['date', 'store_nbr', 'item_nbr']
          )
      )
  • 82. Preprocess data
      train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs (assignment instead of chained inplace fillna)
      train.reset_index(inplace=True)  # reset the index, restoring the key columns
      lastdate = train.iloc[train.shape[0] - 1].date  # last day in the data
      train.head()
  • 84. Preprocess data
      # Mean unit sales per item, store, and day of week
      tmp = train[['item_nbr', 'store_nbr', 'dow', 'unit_sales']]
      ma_dw = tmp.groupby(['item_nbr', 'store_nbr', 'dow'])['unit_sales'].mean().to_frame('madw')
      ma_dw.reset_index(inplace=True)
      ma_dw.head()
  • 85. Preprocess data
      # Mean of the day-of-week averages per item and store
      tmp = ma_dw[['item_nbr', 'store_nbr', 'madw']]
      ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
      ma_wk.reset_index(inplace=True)
      ma_wk.head()
  • 86. Moving Average using Pandas
      # Mean over the whole loaded period per item and store
      tmp = train[['item_nbr', 'store_nbr', 'unit_sales']]
      ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')
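  • 87. Moving Average using Pandas
      # Means over the trailing 112, 56, ..., 1 days per item and store
      for i in [112, 56, 28, 14, 7, 3, 1]:
          tmp = train[train.date > lastdate - timedelta(int(i))]
          tmpg = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais' + str(i))
          ma_is = ma_is.join(tmpg, how='left')
      del tmp, tmpg
      # Assumption (not shown on the slide): the later forecasting code reads a
      # 'mais' column, so the windows are combined here by their median.
      ma_is['mais'] = ma_is.median(axis=1)
      ma_is.reset_index(inplace=True)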
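The trailing-window averaging above can be tried on a toy frame. This is a sketch on made-up data: the column `ma_median` and the choice of combining windows by their median are assumptions for illustration, and only a single grouping key is used to keep it short.

```python
from datetime import timedelta

import pandas as pd

# Toy sales data (hypothetical values): two items over four days.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-12", "2017-08-13", "2017-08-14", "2017-08-15"] * 2),
    "item_nbr": [1] * 4 + [2] * 4,
    "unit_sales": [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0],
})
last = sales["date"].max()

# Mean over the full period, then means over the trailing 3 and 1 days.
ma = sales.groupby("item_nbr")["unit_sales"].mean().to_frame("ma_all")
for days in [3, 1]:
    recent = sales[sales["date"] > last - timedelta(days)]
    window_mean = recent.groupby("item_nbr")["unit_sales"].mean().to_frame(f"ma{days}")
    ma = ma.join(window_mean, how="left")

# Combine the windows into one robust estimate per item.
ma["ma_median"] = ma.median(axis=1)
print(ma)
```

Shorter windows react faster to recent changes while longer ones are less noisy; taking a median across them is one simple way to balance the two.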
  • 89. Make data trainable
      def data_to_npreader(store_nbr: int, item_nbr: int):
          """Return (times, values, NumpyReader) for one store/item pair."""
          mask = np.logical_and(train['store_nbr'] == store_nbr,
                                train['item_nbr'] == item_nbr)
          unit_sales = train[mask].unit_sales
          x = np.asarray(range(len(unit_sales)))  # integer time steps
          y = np.asarray(unit_sales)              # log1p unit sales
          dataset = {
              tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
              tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
          }
          reader = NumpyReader(dataset)
          return x, y, reader
  • 90. TensorFlow TimeSeries - ARRegressor
      x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
      train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
          reader, batch_size=32, window_size=40)
      ar = tf.contrib.timeseries.ARRegressor(
          periodicities=21, input_window_size=30, output_window_size=10,
          num_features=1,
          loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS)
      ar.train(input_fn=train_input_fn, steps=16000)
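In the slide above, `window_size=40` matches `input_window_size + output_window_size` (30 + 10), so each training window holds the inputs and the targets for one example. The sketch below imitates that kind of random window batching in plain NumPy; it is an illustration of the idea, not the implementation of `RandomWindowInputFn`, and `random_windows` is a hypothetical helper.

```python
import numpy as np

def random_windows(times, values, window_size, batch_size, seed=0):
    """Sample batch_size contiguous windows of length window_size."""
    rng = np.random.default_rng(seed)
    n = len(values)
    # Valid start positions are 0 .. n - window_size inclusive.
    starts = rng.integers(0, n - window_size + 1, size=batch_size)
    idx = starts[:, None] + np.arange(window_size)[None, :]
    return times[idx], values[idx]

toy_times = np.arange(100)
toy_values = np.sin(toy_times / 5.0)
t_batch, v_batch = random_windows(toy_times, toy_values, window_size=40, batch_size=32)

# Split each window the way an AR model with 30 inputs and 10 outputs would use it.
inputs, targets = v_batch[:, :30], v_batch[:, 30:]
```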
  • 91. TensorFlow TimeSeries - ARRegressor
      evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
      # keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple', 'times', 'global_step']
      evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
      # Continue from the end of the evaluated series for 16 more steps
      (ar_predictions,) = tuple(ar.predict(
          input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
              evaluation, steps=16)))
  • 92. TensorFlow TimeSeries - ARRegressor
      plt.figure(figsize=(15, 5))
      plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
      plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
      plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1), label='prediction')
      plt.xlabel('time_step')
      plt.ylabel('values')
      plt.legend(loc=4)
      plt.show()
  • 94. TensorFlow TimeSeries - LSTM ● Get the LSTM model class (_LSTMModel) from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py
  • 95. TensorFlow TimeSeries - LSTM
      # _LSTMModel is the class defined in the lstm.py example
      x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
      train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
          reader, batch_size=16, window_size=21)
      estimator = tfts_estimators.TimeSeriesRegressor(
          model=_LSTMModel(num_features=1, num_units=32),
          optimizer=tf.train.AdamOptimizer(0.001))
      estimator.train(input_fn=train_input_fn, steps=16000)
      evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
      evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)
  • 96. TensorFlow TimeSeries - LSTM
      (lstm_predictions,) = tuple(estimator.predict(
          input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
              evaluation, steps=16)))
  • 97. TensorFlow TimeSeries - LSTM
      plt.figure(figsize=(15, 5))
      plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
      plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
      plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1), label='prediction')
      plt.xlabel('time_step')
      plt.ylabel('values')
      plt.legend(loc=4)
      plt.show()
  • 99. Forecasting test data
      # Read test dataset
      test = pd.read_csv('../input/test.csv', dtype=dtypes, parse_dates=['date'])
      test['dow'] = test['date'].dt.dayofweek
  • 100. Forecasting test data
      # Moving Average
      test = pd.merge(test, ma_is, how='left', on=['item_nbr', 'store_nbr'])
      test = pd.merge(test, ma_wk, how='left', on=['item_nbr', 'store_nbr'])
      test = pd.merge(test, ma_dw, how='left', on=['item_nbr', 'store_nbr', 'dow'])
      test['unit_sales'] = test.mais  # requires a 'mais' column in ma_is
      # Autoregressive: clip negatives, then overwrite one store/item pair
      ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
      ar_mask = np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574)
      test.loc[ar_mask, 'unit_sales'] = ar_predictions['mean'].reshape(-1)  # flatten to 1-D before assigning
      # LSTM: same for a second store/item pair
      lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
      lstm_mask = np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574)
      test.loc[lstm_mask, 'unit_sales'] = lstm_predictions['mean'].reshape(-1)
  • 101. Forecasting test data
      # Scale by the day-of-week ratio wherever the weekly average is positive
      pos_idx = test['mawk'] > 0
      test_pos = test.loc[pos_idx]
      test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
      test['unit_sales'] = test['unit_sales'].fillna(0)
      test['unit_sales'] = np.expm1(test['unit_sales'])  # invert log1p to restore unit values
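The scaling step above can be isolated into a small function. This is a sketch with a hypothetical helper name, `forecast_units`; it mirrors the slide in applying the day-of-week ratio `madw / mawk` in log1p space and only then inverting the transform with `expm1`.

```python
import numpy as np

def forecast_units(base_log, madw, mawk):
    """Scale a log1p-space forecast by madw / mawk, then invert log1p."""
    base_log = np.asarray(base_log, dtype=float)
    madw = np.asarray(madw, dtype=float)
    mawk = np.asarray(mawk, dtype=float)
    scaled = base_log.copy()
    pos = mawk > 0  # leave rows without a positive weekly average unscaled
    scaled[pos] = base_log[pos] * madw[pos] / mawk[pos]
    return np.expm1(scaled)

# Row 0: ratio 1 keeps the forecast; row 1: zero weekly average is left as-is.
units = forecast_units([np.log1p(9.0), 0.0], madw=[2.0, 0.0], mawk=[2.0, 0.0])
# Row with a busier-than-average day of week: the forecast is scaled up.
busy = forecast_units([np.log1p(9.0)], madw=[3.0], mawk=[2.0])
```

Because `log1p` and `expm1` are inverses, a ratio of 1 returns the original unit value, which makes the transform easy to sanity-check.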
  • 102. Forecasting test data
      holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
      holiday = holiday.loc[holiday['transferred'] == False]
      # note: if a date has several holiday rows, this merge duplicates test rows
      test = pd.merge(test, holiday, how='left', on=['date'])
      test['transferred'] = test['transferred'].fillna(True)  # no holiday row: treat as not a holiday
      test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # holiday uplift
      test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # promotion uplift
      test[['id', 'unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')