Giorgio Alfredo Spedicato will give a presentation on machine learning and actuarial science. He will review machine learning theory, including unsupervised and supervised learning algorithms. He will provide examples using various datasets, including using unsupervised learning on an auto insurance dataset and supervised learning for credit scoring and claim severity prediction. Spedicato has experience as a data scientist and actuary and holds a PhD in Actuarial Science.
Machine Learning - Intro
1. Machine Learning and Actuarial Science
Introduction
GIORGIO ALFREDO SPEDICATO, PHD FCAS FSA CSPA
UNISACT 2018
2. Agenda
Introduction
ML theory review:
1. General Intro
2. Core concepts overview;
3. Unsupervised learning algorithms;
4. Supervised learning algorithms.
Examples:
1. Kaggle Allstate Vehicle data set – Unsupervised learning;
2. Credit Scoring dataset: binary classification example;
3. GLM-GAM vs XGBoost on Allstate claim severity data set.
Q&A session
3. Bio
Professional:
◦ Since 2014: Data Scientist @ Unipol Group
◦ Previously: Reserving & Economic Capital Actuary (@ Aviva Italia), MTPL Pricing Actuary (@ Axa Italia)
Academics:
◦ Fellow of the Casualty Actuarial Society (FCAS, 2016); Fellow of the Society of Actuaries (FSA, 2016)
◦ Certified Specialist in Predictive Analytics (2017)
◦ PhD in Actuarial Science (2011)
◦ Member of the Ordine Nazionale degli Attuari (Italian National Order of Actuaries, 2007)
◦ MSc Statistics Economics and Actuarial Science (2006)
Links:
◦ https://www.researchgate.net/profile/Giorgio_Spedicato
◦ https://www.linkedin.com/in/giorgio-alfredo-spedicato-044a8226/
5. Trending topics
▪«Big Data» and «Machine Learning» are gaining increasing importance in the actuarial world:
▪ Two half day sessions @ ICA 2018
▪ Big Data & Machine Learning working groups within the IAA, the Casualty Actuarial Society and other actuarial organizations.
▪ Actuarially focused analytics credentials:
▪ Certified Specialist in Predictive Analytics (CSPA) by the CAS
▪ Society of Actuaries (SOA) launched an “Advanced Analytics” credential in March 2018.
▪ Increasing number of “Data Science” MSc programs in many universities, e.g. (in Italy):
▪ Bocconi (LM)
▪ Bicocca (LM)
▪ Milan Polytechnic (within Mathematical Engineering)
6. Kaggle competitions
▪Kaggle is a well-known host of ML competitions. The sponsor provides data and a predictive modeling
problem. The best-performing solutions usually receive prize money (the largest prizes have been in the millions of dollars).
▪Some Insurance carriers have been competitions’ sponsors, among which:
▪ US Allstate (many competitions since 2011)
▪ US Prudential
▪ Brazilian Porto Seguro, 2017
▪The provided data are split into a train and a test set. Teams or single players submit their test-set
predictions in a predefined format that allows fast scoring. Teams are ranked according to an
accuracy metric, and the best-performing solution(s) receive(s) a monetary prize.
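The scoring workflow above can be sketched as follows; the metric (mean absolute error), team names and figures are illustrative assumptions, not actual Kaggle data:

```python
# Sketch of Kaggle-style leaderboard scoring: the host holds the true test-set
# labels and ranks each submitted prediction file by an accuracy metric.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hidden test-set labels (known only to the competition host).
y_true = [1200.0, 450.0, 3100.0, 800.0]

# Two hypothetical team submissions in the predefined format.
submissions = {
    "team_alpha": [1100.0, 500.0, 2900.0, 850.0],
    "team_beta":  [1500.0, 400.0, 2500.0, 700.0],
}

# Lower error is better: rank teams by their score.
leaderboard = sorted(submissions,
                     key=lambda team: mean_absolute_error(y_true, submissions[team]))
print(leaderboard[0])  # the best-performing team
```

Real competitions use many different metrics (log-loss, Gini, RMSE, …); only the predefined-format/fast-scoring idea is taken from the slide.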
7. Big Data
«Big Data» refers to the ready availability of data volumes too large to be handled with
standard approaches, so that specialized methods and tools are required.
The term appeared in 1998, becoming «viral» in 2011.
Examples of Big Data infrastructure are:
▪ Social network: Facebook, Twitter;
▪ Google archives: (Photo, Trends,…);
▪ NHIs’ health databases;
▪ Financial transaction archives;
▪ DNA microarrays banks.
9. Big Data in Italian Insurance Business
▪Anything that can’t fit in an «.xlsx» file:
▪ Monthly MTPL quotes on Web aggregators;
▪ Insurers’ bureau association (ANIA) MTPL archives (SIVI, SITA, SSCARD, ATR);
▪ Pricing database of a typical MTPL insurer…
▪ Telematics data
▪ Social Security data bases
10. Tools: core principles
Handling large data sets, often «out of (RAM) memory»
Parallelization and distributed computing
GPU computing
11. HPC: Hadoop & Spark
▪Hadoop is a framework for handling data and computation on distributed file systems (HDFS)
following a MapReduce approach.
▪Spark is a framework focused on distributed computing that can use Hadoop for data
storage. It is written in the Scala programming language, and many ML algorithms have been
efficiently rewritten for Scala/Spark.
▪Many traditional suites (e.g. SAS and Matlab) as well as open-source tools (e.g. R, Python)
provide libraries to interface with Hadoop/Spark.
12. HPC: parallelizing & MapReduce
«Parallelizing» means splitting the calculation into distinct tasks (“MAPping”) and then
collecting and integrating the results («Reduce»).
The computing units may be the distinct cores of a desktop CPU, a GPU, or the nodes of a
distributed computing cluster.
The algorithm must be entirely or partially «parallelizable».
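A minimal single-process sketch of the Map/Reduce idea, using a word count as the classic example (the text chunks are invented for illustration):

```python
# Single-process sketch of MapReduce: the MAP step processes each chunk
# independently (the part a cluster or multi-core CPU can run in parallel),
# and the REDUCE step collects and integrates the partial results.
from collections import Counter
from functools import reduce

chunks = [
    "big data big models",
    "big data small models",
]

# MAP: turn every chunk into a partial word count, independently of the others.
partials = map(lambda chunk: Counter(chunk.split()), chunks)

# REDUCE: merge the partial counts into the final result.
totals = reduce(lambda acc, part: acc + part, partials, Counter())
print(totals["big"])  # 3
```

In Hadoop or Spark the map tasks would run on different nodes of the cluster; here they simply run in sequence to show the structure of the computation.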
16. Software & Tools: «corporate tools»
SAS, Matlab, IBM SPSS: pros: technical support; cons: significant costs.
SAS:
▪ In the analytics market for several decades, probably the standard ETL tool in finance and insurance;
▪ Solid «classic» statistical analysis functions;
▪ Pros: well-known tool in corporate settings, with technical assistance;
▪ Cons: pricey; declining use in academia and university teaching.
Matlab:
▪ Standard tool in Finance and Engineering industries.
▪ Very efficient kernel for math computations.
▪ Recent versions broaden its ETL, statistics and ML capabilities.
IBM:
▪ SPSS Statistics & Modeler: mostly visual tools
▪ Widely used in social and health sciences, much less in finance and insurance.
▪ Watson: AI tools focused on unstructured data; the license includes the use of proprietary HPC hardware (a computing cluster).
17. Software & Tools: «corporate tools»
▪Towers Watson (TW) software suite: Emblem, Radar, Moses,… focused on actuarial analysis…
▪EQECAT, RMS, AIR cat modeling tools
▪R and Python implement most analyses used in Insurance and Finance well.
18. Software & Tools: open - source
▪Broadly speaking:
▪ Pros:
▪ free; wide developer communities
▪ Structured as a core programming language extended by packages and libraries. It is
also possible to call routines written in other programming languages (C++, Java, …)
▪ Cons:
▪ no «free» technical support; workstation setup is not always «user-friendly»…
▪ «official» documentation is often incomplete (googling and querying Quora and Stack Overflow
is often necessary).
▪ use at your own risk
19. Open source: R
▪ What is:
▪ Scripting language focused on ETL and statistics;
▪ First release in 1996;
▪ Current version 3.5.1
▪ Diffusion:
▪ The most widely used statistical software in academia.
▪ Efficient free IDE (Rstudio).
▪ Many actuarial libraries available, for example:
▪ «actuar», for loss distribution fitting and credibility theory;
▪ «lifecontingencies», for standard Actuarial Mathematics (life-contingent insurances…)
▪ Pros & Cons:
▪ Broad user community;
▪ The most recently introduced statistical algorithms are usually implemented in R first;
▪ Cons: «in-memory» processing
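As an illustration of the loss-distribution fitting that «actuar» offers, here is a minimal method-of-moments gamma fit, written in Python with the standard library only for portability; the claim-severity sample is invented:

```python
# Method-of-moments fit of a gamma distribution to claim severities --
# a stdlib-only sketch of the kind of loss-distribution fitting the
# R package "actuar" provides directly.
from statistics import mean, pvariance

severities = [1200.0, 800.0, 450.0, 3100.0, 950.0, 2200.0, 700.0, 1500.0]

m = mean(severities)        # sample mean
v = pvariance(severities)   # population variance

# For a gamma distribution: mean = shape * scale, variance = shape * scale^2,
# so matching the first two sample moments gives:
scale = v / m
shape = m * m / v

print(f"shape={shape:.3f}, scale={scale:.3f}")
```

Maximum-likelihood fitting (what `actuar::mde` / `fitdistrplus` would typically do in R) gives slightly different estimates; the moment matching above just shows the idea.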
20. Open source: Python
Python:
▪ A true multipurpose programming language
▪ Introduced in the early 1990s; the current stable release is 3.6.x.
◦ Usage:
▪ Many core data preprocessing and scientific libraries: Pandas, Numpy, scikit-learn, Scipy,…
▪ Core libraries for relevant Data Science tasks:
▪ Deep Learning: Tensorflow, Keras,…,
▪ Natural Language Processing: NLTK, …
▪ Scraping: BeautifulSoup
◦ Pros & Cons:
▪ The standard programming language for data scientists without an academic background in Statistics.
▪ Less known by Statistics and Actuarial Science graduates.
23. Collaborative tools: Docker
▪A software container system that allows one to create and distribute
working environments.
▪Docker lets you define, save and generate «on the fly» the
configurations (libraries, software dependencies) required
by a specific piece of software, obtaining a functioning working
environment in a «sandbox».
▪Pros & Cons:
▪ Steep learning curve;
▪ Avoids dependency conflicts;
▪ Eases the distribution of working environments.
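A hypothetical Dockerfile for such a sandboxed working environment, pinning an R version together with the actuarial packages mentioned earlier (the image tag and script path are illustrative assumptions):

```dockerfile
# Hypothetical Dockerfile: a reproducible R working environment with the
# required libraries baked into the image ("sandbox").
FROM rocker/r-ver:3.5.1

# Install the dependencies the analysis needs, once, at build time.
RUN R -e "install.packages(c('actuar', 'lifecontingencies'))"

# Copy the analysis script into the container and run it by default.
COPY analysis.R /home/analysis.R
CMD ["Rscript", "/home/analysis.R"]
```

Anyone who builds this image obtains the same environment, which is how Docker avoids dependency conflicts across machines.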
24. Collaborative tools: GitHub
▪It is the developers’ social network and the world’s biggest open
programming-code repository.
▪It contains the source code of the most relevant open-source
tools, providing instruments to:
▪ Check algorithms’ implementations;
▪ Encourage collaborative debugging and development;
▪ Ease the distribution of recent enhancements and
patches.
▪GitHub is based on Git, a distributed version control system.
25. GDPR Regulation
▪The «General Data Protection Regulation» (GDPR) has been in force since 25 May 2018 across all EU
states.
▪All EU and foreign entities that process EU citizens’ data must comply with the GDPR.
▪ The GDPR supersedes the previous national privacy laws of EU states.
26. GDPR Regulation
▪New rights and obligations have been introduced:
▪ Strengthened requirements for acquiring consent to data processing;
▪ Right to erasure («right to be forgotten»);
▪ Data Protection Officer.
▪Possible impact on Data Science profession:
▪ Right to obtain one’s own data in a portable and intelligible form;
▪ Right to know whether:
▪ algorithmic profiling has been used;
▪ in certain cases, the right to object to decisions based solely on algorithms.