3. Business Understanding
[Payoff diagram: profit and loss versus the underlying asset price at expiry, annotated with the strike price, the option price (premium), and time to maturity]
Definition: A European call option gives the owner the right to acquire the underlying security at expiry.
For an investor to profit from a European call option, the stock's price at expiry has to be trading high enough above the strike price to cover the cost of the option premium.
➔ The market price of an option sometimes deviates from its fair price, so we need a tool that can help us judge pricing.
8. Data Cleaning
Handling Outliers
Three primary methods of treating outliers:
● Trimming/removing the outlier
● Quantile-based flooring and capping
● Mean/median imputation
[Boxplot after data cleaning]
Handling Missing Values
Two primary ways of handling missing values:
● Deleting the missing values
● Imputing the missing values
10. Regression Model Building
Cross Validation Scores of Regression Models
1. Use 5 statistical/ML models to predict option value on training data
2. Use GridSearchCV to tune the parameters of the models
3. Given the cross-validation scores (R-squared as criterion), we finally choose the random forest model
11. Regression Champion Model - Random Forest Regression
Why does random forest get a higher R-squared?
Lasso and Ridge are types of linear models. According to the cross-validation results, random forest has a much greater advantage in predicting option value than the linear models.
Random forest is able to discover more complex dependencies, at the cost of longer fitting time.
But random forest still has some drawbacks…
“Random forests are black boxes derived by machine learning.”
12. Classification Results
Cross Validation Scores of Classification Models
● Use 7 statistical/ML models to predict the BS over/under label on training data
● Use GridSearchCV to tune the parameters of the models
● Given the cross-validation scores (accuracy as criterion), we finally choose the random forest model
● We do not choose gradient boosting because it has a much larger variance than random forest, which indicates it is unstable
13. Classification Champion Model - Random Forest Classification
Why does random forest get a higher accuracy rate?
The random forest algorithm is based on decision trees. It has a higher accuracy rate than distance-based classification methods like KNN and SVM because:
1. It can judge the importance of each feature
2. It can judge the interaction between different features
3. It is less prone to overfitting than a single decision tree
[Example tree from the forest: successive binary splits, mostly on K (e.g. K<=427.5, gini=0.126; K<=452.5, gini=0.306), with one split on S (S<=443.411, gini=0.459)]
14. Model Selection Criteria
Accuracy vs. Interpretation

When accuracy matters:
➔ Only a score to pass to an automated process
➔ Large amount of data being processed
Eg: spam detection
Suitable models: Random Forest Regressor ✓, Random Forest Classifier ✓

When interpretation matters:
➔ Results may need further modification
➔ Increases social acceptance
Eg: medical cases
Suitable models: Linear Regression ✓, Decision Tree ✓

Option pricing: thousands of options trading every day -> a huge amount of data, so accuracy matters more.
15. Four Feature Understanding
● Strike Price (Negative): lower strike price, lower risk of loss
● Asset Value (Positive): higher current asset value, lower risk of loss
● Time to Maturity (Positive): higher time to maturity, more freedom for buyers to make decisions
● Interest Rate (Positive): higher interest rate, higher value for buyers’ cash
16. Why Machine Learning Models Outperform Black-Scholes
BS is based on the following assumptions:
1. No dividends are paid out during the life of the option.
2. The risk-free rate and volatility of the underlying asset are known and constant.
3. Market movements are random; there are no emotion-driven decisions.
4. There are no transaction costs in buying the option.
5. The returns on the underlying asset are log-normally distributed.
In contrast, machine learning models:
✓ Do not rely on pre-assumptions
✓ Calculate from historical data
✓ Can reproduce most of the empirical characteristics of option prices
17. Using the trained model to predict option values for Tesla stock?

S&P 500:
➔ 500 companies
➔ Less fluctuation
➔ Reflects overall stock market performance
Factors affecting the stock market:
1. Supply and demand
2. Investor sentiment
3. Interest rates
4. Politics
5. Current events
6. Natural calamities
7. Exchange rates

Tesla:
➔ Only 1 company
➔ Less stability
➔ Reflects company performance
Factors affecting the stock:
1. Product
2. Revenue & debt
3. Investor capital
4. Management
5. Mergers & acquisitions
6. …
Hello everyone, we are Group 14, presenting our understanding and machine learning models for the options pricing project.
We will follow a typical machine learning project workflow by starting with business understanding, and we will conclude our presentation by answering the 4 business questions.
A European call option gives the owner the right to acquire the underlying security at expiry.
For an investor to profit from a European call option, the stock's price at expiry has to be trading high enough above the strike price to cover the cost of the option premium.
But the market price of the options sometimes deviates from the fair price, so we need a tool that can help us judge pricing.
Therefore, we decided to explore machine learning algorithms in calculating fair option prices.
Our dataset has 1680 records and consists of 2 dependent variables and 4 independent variables.
‘K’ stands for the strike price of the option.
‘S’ stands for the current asset value.
‘tau’ stands for days remaining to expiration converted to the percentage of the year. So the legal range should be between 0 and 1.
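As a small illustration, the day-count conversion described for ‘tau’ might be sketched as follows; the 365-day year and the function name are our assumptions (the project may have used trading days instead):

```python
def days_to_tau(days_to_expiry: float, days_per_year: int = 365) -> float:
    """Convert days remaining to expiration into a fraction of a year."""
    tau = days_to_expiry / days_per_year
    # Enforce the legal range [0, 1] stated in the data description.
    if not 0.0 <= tau <= 1.0:
        raise ValueError(f"tau={tau:.3f} is outside the legal range [0, 1]")
    return tau
```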
‘R’ stands for the annual interest rate.
The value field stands for the current European call option value.
By applying the BS formula to the feature data, we get a predicted option value.
If the predicted value is greater than the current value, we label that option ‘Over’; otherwise, we label it ‘Under’.
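The labeling rule just described can be sketched with the standard Black-Scholes call formula. Note that the dataset as described has no volatility column, so `sigma` below is an assumed extra input, and the function names are ours:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call_price(S: float, K: float, tau: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call.
    S: current asset value, K: strike price, tau: time to maturity in years,
    r: annual risk-free rate, sigma: volatility (assumed known and constant)."""
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * N(d1) - K * exp(-r * tau) * N(d2)

def label_option(bs_value: float, market_value: float) -> str:
    """'Over' if the BS prediction exceeds the current market value, else 'Under'."""
    return "Over" if bs_value > market_value else "Under"
```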
From the count plot, we can see the ratio between ‘over’ and ‘under’ is relatively balanced, so there is no need to upsample or downsample our dataset.
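A minimal sketch of that balance check; the counts are made up, and the 1.5 ratio threshold is a hypothetical stand-in for "relatively balanced":

```python
import pandas as pd

# Hypothetical frame mirroring the dataset's 'BS' label column.
df = pd.DataFrame({"BS": ["Over"] * 870 + ["Under"] * 810})

counts = df["BS"].value_counts(normalize=True)  # class proportions
imbalance_ratio = counts.max() / counts.min()
# A ratio near 1 means the classes are balanced enough to skip resampling.
needs_resampling = imbalance_ratio > 1.5
```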
From the boxplot of ‘S’, we can identify one extreme outlier ‘0’.
Also, we can identify 2 extreme outliers from the box plot of ‘tau’; those 2 outliers might be due to human error.
From this table we can see observation 292 has 3 missing values, observation 818 has 2 missing values, and one of the missing values is located in the target field.
Since the missing values are concentrated in just 2 records, and imputing them might distort the data, we choose to delete the two observations.
It is obvious that these outliers are due to incorrectly entered or measured data, so we choose to simply drop them.
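In pandas, the two cleaning steps described (deleting the records carrying missing values, then dropping the impossible outliers) might look like this; the toy values are invented to mirror the outliers mentioned above:

```python
import pandas as pd

# Toy frame standing in for the option dataset.
df = pd.DataFrame({
    "S": [431.0, 0.0, 445.2, 438.9],    # 0.0 is the extreme outlier from the boxplot
    "tau": [0.12, 0.30, 3.50, 0.25],    # 3.50 is outside the legal [0, 1] range
    "value": [21.3, None, 18.7, 25.1],  # a missing target value
})

# Drop the records that carry missing values...
df = df.dropna()
# ...then drop rows whose values are clearly data-entry errors.
df = df[(df["S"] > 0) & df["tau"].between(0, 1)]
```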
In the model preparation stage, we first normalize our four feature variables since they are on different scales. Normalization rescales numeric columns to a common scale without distorting differences in the ranges of values.
Some of the regression and classification models we will use are distance-based, and for those algorithms normalization helps reduce the scale difference between features. For tree-based algorithms such as random forest, normalization does not change the ranks of the data, so it makes no difference. For easier comparison between models, we use normalized data for all of them.
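A sketch of the normalization step, assuming min-max scaling to [0, 1] (the write-up does not say which scaler was used); the feature values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Features on very different scales: K, S, tau, R (illustrative values).
X = np.array([
    [420.0, 431.6, 0.12, 0.03],
    [450.0, 445.2, 0.50, 0.05],
    [435.0, 438.9, 0.88, 0.04],
])

scaler = MinMaxScaler()              # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)   # within-column ordering is preserved
```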
For classification models, we took the extra step of converting the target variable to dummies so that the algorithms can read it.
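That encoding step might look like the following; the 0/1 mapping direction is our assumption:

```python
import pandas as pd

# Hypothetical slice of the target column.
labels = pd.Series(["Over", "Under", "Over", "Under"], name="BS")
# Map the two classes to 0/1 so the classifiers can consume the target.
y = labels.map({"Under": 0, "Over": 1})
```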
In the first part of model building, we used 5 regression models to predict option value and used GridSearchCV to tune each model's parameters, such as the number of trees in the random forest. We then ran cross-validation to evaluate the models and compare their performance. As the box plot shows, the random forest model is the most robust: it has the highest R-squared, and the standard deviation of its R-squared is also small.
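A condensed sketch of this tuning-plus-cross-validation loop with scikit-learn, on synthetic data and with a deliberately small parameter grid (the project's actual grid is not given):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))                         # stand-ins for scaled K, S, tau, R
y = np.maximum(X[:, 1] - X[:, 0], 0) + 0.1 * X[:, 2]   # call-payoff-like synthetic target

# Tune the number of trees, then score the winner with 5-fold CV (R-squared).
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)
cv_scores = cross_val_score(grid.best_estimator_, X, y, scoring="r2", cv=5)
```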
After choosing random forest as our final regression model, we wanted to find out why it performs better. The performance differs significantly between linear models like Lasso and tree-based models like random forest. Tree-based models can discover more complex dependencies, whereas linear models can only produce functions with a linear "shape". Therefore, if the relationship between the input variables and the option value is non-linear, a tree-based model can capture it but a linear model cannot. However, random forest still has some drawbacks. Almost everyone can understand and interpret linear models easily; in contrast, random forest is like a black box, and it is very hard to get such a straightforward interpretation. We will dive into the tradeoff between model accuracy and interpretability in the insights part.
We used 7 machine learning models with GridSearchCV to tune their parameters and predicted the BS over/under label on the training data. We tested each model with 10-fold cross-validation, using accuracy as the criterion, and found that gradient boosting and random forest have higher accuracy than all the rest. We finally chose random forest over gradient boosting because gradient boosting has a much larger variance, which indicates it is less stable.
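The comparison described here, 10-fold cross-validation with accuracy as the criterion, looking at both the mean and the spread, can be sketched on synthetic data as follows:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 4))          # stand-ins for the four scaled features
y = (X[:, 1] > X[:, 0]).astype(int)     # stand-in for the Over/Under label

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}
# 10-fold CV accuracy; the std of the fold scores measures stability.
results = {
    name: cross_val_score(model, X, y, scoring="accuracy", cv=10)
    for name, model in models.items()
}
summary = {name: (scores.mean(), scores.std()) for name, scores in results.items()}
```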
After selecting the models, we found 3 possible reasons to explain why decision tree based models have much higher accuracy rate compared to distance based models like KNN and SVM.
First, it can judge the importance of each feature.
Second, it can judge the interaction between different features.
Third, it is less prone to overfitting than a single decision tree.
Here is one of the decision trees from our random forest model. Because a decision tree uses a greedy approach to minimize Gini impurity, we can see the first binary split on the K value dramatically decreases the Gini impurity, which implies that feature K is much more important than the other features. This lets the model make better classifications and achieve a higher accuracy rate.
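The impurity-based feature importance this paragraph appeals to can be read directly off a fitted forest in scikit-learn; here on synthetic data where the label depends only on the stand-ins for K and S:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 4))          # columns stand in for K, S, tau, R
y = (X[:, 1] > X[:, 0]).astype(int)     # label driven only by the K and S columns

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Normalized mean decrease in Gini impurity per feature, across all trees.
importances = dict(zip(["K", "S", "tau", "R"], clf.feature_importances_))
```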
We decided to use the model with the highest accuracy, which is the random forest model. We think problems that require more interpretation are case-by-case problems that need further model modifications or judgments based on the result. For example, in tumor detection, doctors need to understand the model to avoid misclassification, so the model should be highly interpretable. What's more, simple models find it easier to earn trust from others, such as patients.
Problems that require a high accuracy rate are ones that need to process large amounts of data at a fast pace to deliver the results to another process. There is no need to understand the model logic because high accuracy is good enough for the purpose. Even if they sometimes generate some errors, the influence will not be detrimental. For example, the users do not need to understand the complicated spam detection model. Even if spam is not detected, the overall influence is not huge, so no human interpretation and intervention is needed.
For our option pricing situation, there are thousands of options trading every day. The amount of data being processed is so huge that a good overall prediction of prices, even with some errors, is good enough. Therefore, we think accuracy matters more in this case.
Under the model, we think all 4 features we used are necessary and important for predicting the option price. A higher current asset value or a lower strike price means the asset's price needs to rise less for the option to pay off, so there is a lower risk of loss and the option price will be higher. A higher interest rate means call option buyers can earn more interest by holding cash in the bank until maturity, so they are willing to pay more for the call option; in contrast, the seller loses the opportunity to benefit from the increased interest rate, because the cash stays in the buyer's hands until maturity. Lastly, a longer time to maturity gives call option holders more freedom to trade the option in the market, which again increases the call option price. Also, in our linear regression test, the p-values for all variables are well under 0.05, meaning they are all important to include.
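The significance check mentioned at the end can be reproduced with a hand-rolled OLS t-test; the data below is synthetic, with coefficient signs chosen to match the feature discussion above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(size=(n, 4))             # stand-ins for K, S, tau, R
beta = np.array([-1.0, 1.0, 0.5, 0.3])   # K negative, the other three positive
y = X @ beta + rng.normal(scale=0.05, size=n)

# Ordinary least squares with an intercept, plus classic t-test p-values.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef
dof = n - A.shape[1]
sigma2 = resid @ resid / dof                       # residual variance estimate
cov = sigma2 * np.linalg.inv(A.T @ A)              # coefficient covariance
t_stats = coef / np.sqrt(np.diag(cov))
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)    # two-sided p-value per coefficient
```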
Machine learning models avoid the assumptions made by Black-Scholes, such as a constant risk-free rate and constant volatility over the option's life, and instead learn from historical data. These assumptions do not hold in reality: volatility fluctuates with supply and demand, and the other assumptions can likewise lead to prices that deviate from the real world. Machine learning models relax these assumptions and calculate prices from a large amount of historical data, and this pricing process can reproduce most of the empirical characteristics of option prices. In this sense, machine learning models outperform Black-Scholes.