Predictive Analysis in Health Insurance, by Julien Cabot

Health Insurance Predictive Analysis
with MapReduce and Machine Learning

Julien Cabot
Managing Director
OCTO
jcabot@octo.com
@julien_cabot

50, avenue des Champs-Elysées, 75008 Paris - FRANCE
Tel: +33 (0)1 58 56 10 00
Fax: +33 (0)1 58 56 10 01
© OCTO 2012 www.octo.com

Internet as a Data Source…

Internet as the voice of the crowd

… in Healthcare

71% about
• Illness
• Symptom
• Medicine
• Advice / opinion

Main sources are old-school
forums, not social networks


Benefits for Insurance Company?

Understand patients' subjects of interest
in order to design customer-centric products
and marketing actions

Anticipate the psycho-social effects of the
Internet to prevent excessive consultations
(and reimbursements)

Predict claims by monitoring search
requests about symptoms and drugs


How to run the predictive analysis?


The data problem

Understand the semantic field of
healthcare as it is used on the Internet

Find correlations between the evolution of
claims and many millions of unidentified
external variables

Find correlated variables that anticipate the
claims

We need some help from Machine Learning!

Correlation search in external datasets

Inputs
• Forum messages: automated tokenization of messages per posted date and semantic tagging → trends of medical keywords used in forums
• Google search: volume of symptom and drug keywords → trends of medical keywords searched in Google
• Socio-economic context from Open Data initiatives → trends of socio-economic factors
• Health claims by act typology

Processing
• Correlation Search Machine

Output
• Sorted matrix of determination coefficients (R²)
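The ranking step of the correlation search can be illustrated with a minimal sketch. It is not the deck's implementation: it swaps the SVR-based fit for a plain linear correlation (for a simple linear regression, R² equals the squared Pearson coefficient), and the series names, values and the `lag` parameter are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / sqrt(var_x * var_y)

def lagged_r2(variable, claims, lag):
    """R^2 of a linear fit between variable[t] and claims[t + lag]."""
    if lag > 0:
        variable, claims = variable[:-lag], claims[lag:]
    return pearson_r(variable, claims) ** 2

def rank_variables(series_by_name, claims, lag=1):
    """Return (name, R^2) pairs sorted from most to least correlated."""
    scores = [(name, lagged_r2(values, claims, lag))
              for name, values in series_by_name.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical monthly series, for illustration only.
claims = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0, 17.0]
series = {
    # Built so that it leads the claims by exactly one month.
    "keyword_a": [6.0, 7.5, 5.5, 9.0, 10.0, 8.5, 9.5],
    "keyword_b": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0],
}
ranking = rank_variables(series, claims, lag=1)
```

Shifting the external variable by `lag` months before scoring is what lets a variable "anticipate" the claims rather than merely move with them.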


Understand the semantic field of Healthcare

Message tokenization by date → word stemming, tagging and common-word filtering with NLTK → timelines of healthcare keywords

How to tag healthcare words?

1- Build a first list of keywords
2- Enrich the list with highly searched keywords
3- Learn automatically from Wikipedia medical categories

Result: healthcare semantic field keywords database
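As a rough illustration of how a keyword list turns messages into timelines, the sketch below tokenizes each message, keeps only tokens found in a keyword set, and counts them per date. The keyword set and sample messages are hypothetical, and stemming with NLTK is deliberately left out to keep the example dependency-free.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed list; the slides build theirs from an initial list,
# highly searched keywords and Wikipedia medical categories.
HEALTH_KEYWORDS = {"migraine", "fever", "aspirin"}

TOKEN_PATTERN = re.compile(r"\w+", re.UNICODE)

def keyword_timelines(messages, keywords=HEALTH_KEYWORDS):
    """Map each keyword to a Counter of {date: number of occurrences}.

    `messages` is an iterable of (date, text) pairs.
    """
    timelines = defaultdict(Counter)
    for date, text in messages:
        for token in TOKEN_PATTERN.findall(text.lower()):
            if token in keywords:
                timelines[token][date] += 1
    return timelines

messages = [
    ("2012-01-03", "Aspirin did nothing for my migraine"),
    ("2012-01-03", "Migraine again, third day in a row"),
    ("2012-01-04", "High fever since yesterday"),
]
timelines = keyword_timelines(messages)
```

Each resulting counter is one time series that can later be compared with the claims.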

How to find correlations between time series?
Compare the evolution of the variable and the claims over time
Fit a nonlinear regression and learn a polymorphic predictive function
f(x) from the dataset with Support Vector Regression (SVR)

[Figure: data points in the (x, y) plane, the fitted curve f(x), and the ε-tube bounded by f(x) + ε and f(x) - ε]

Problem to solve

$$\min_{w} \; \frac{1}{2} w^T w$$
subject to
$$y_i - (w^T \phi(x_i) + b) \le \varepsilon$$
$$(w^T \phi(x_i) + b) - y_i \le \varepsilon$$

Resolution
• Stochastic gradient descent
• Test the response through the coefficient of determination R²

Open source ML library helps!
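To make the ε-insensitive idea concrete, here is a minimal sketch trained by stochastic gradient descent and scored with R². It makes several simplifying assumptions that are not in the slides: the model is linear in a single input (ϕ is the identity, no kernel), the hard constraints are replaced by a penalty on points outside the ε-tube, and the function names, hyperparameters and toy data are hypothetical. A real project would more likely rely on a library implementation such as scikit-learn's `SVR`.

```python
import random

def fit_linear_svr(xs, ys, epsilon=0.1, reg=1e-3, lr=0.01, epochs=500, seed=0):
    """Fit y ~ w*x + b with the epsilon-insensitive loss by SGD (1-D input)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            residual = ys[i] - (w * xs[i] + b)
            # Subgradient of the loss: zero inside the epsilon-tube.
            if residual > epsilon:
                sign = 1.0
            elif residual < -epsilon:
                sign = -1.0
            else:
                sign = 0.0
            # The reg * w term comes from the (1/2) w^T w objective.
            w += lr * (sign * xs[i] - reg * w)
            b += lr * sign
    return w, b

def r_squared(ys, predictions):
    """Coefficient of determination of predictions against ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data on an exact line, for illustration only.
xs = [i / 10.0 for i in range(21)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_svr(xs, ys)
score = r_squared(ys, [w * x + b for x in xs])
```

Points that fall inside the tube contribute no update, which is why the fitted line only needs to stay within ε of the data rather than pass through it.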

Data Processing Profiles

The volume of external data collected so far is large but not huge (~10 GB)

Data aggregation
E.g. SELECT … GROUP BY date
[Chart: data volume]

Correlation search
E.g. SVR computing
~5 GB × 12³ = 8.64 TB
[Chart: data volume]
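The correlation search figure can be checked with a few lines of arithmetic, reading the slide's multiplier as 12 cubed (an assumption: the slide does not say what the factor represents) and using decimal units.

```python
BASE_VOLUME_GB = 5        # aggregated data volume quoted on the slide
MULTIPLIER = 12 ** 3      # factor read from the slide; its meaning is not stated

total_gb = BASE_VOLUME_GB * MULTIPLIER
total_tb = total_gb / 1000.0   # decimal units (1 TB = 1000 GB)
```

Under that reading the in-memory working set is 8 640 GB, which is what motivates spreading the RAM requirement over several machines.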

We need parallel computing to divide
the RAM requirement and the processing time!

How to build the platform?


IT drivers

Requirements → IT drivers
• Aggregate data from MB to GB files with sequential reads → data aggregation → IO elasticity
• SVR and NLP execution time is ~100 ms per task → large task execution → CPU elasticity
• Process many TB of in-memory data → large RAM execution → RAM elasticity
• Increase the ROI of the research project while decreasing the TCO → low CAPEX (commodity HW, OSS SW) and low OPEX → cost elasticity


Available solutions

[Comparison matrix of solutions against criteria]

Criteria: IO elasticity, CPU elasticity, RAM elasticity, cost elasticity, OSS software, commodity hardware

Solutions compared
• RDBMS
• In-memory analytics
• HPC
• Hadoop (three cells annotated "with repartitioning")
• AWS Elastic MapReduce (two cells annotated "through Task")


AWS Elastic MapReduce Architecture

Source: AWS


Hadoop components

Client layer
• Custom App: Java, C#, PHP, …
• Data mining tools: R, SAS
• BI tools: Tableau, Pentaho, …

Hadoop ecosystem
• Hue: Hadoop GUI
• Pig: flow processing
• Streaming: MR scripting
• Hive: SQL-like querying
• Oozie: MR workflow
• MapReduce: parallel processing framework
• Zookeeper: coordination service
• Mahout: machine learning
• Sqoop: RDBMS integration
• Hama: bulk synchronous processing
• Flume: data stream integration
• Solr: full text search
• HBase: NoSQL on HDFS
• HDFS: distributed file storage

Grid of commodity hardware – storage and processing


General architecture of the platform

DataViz Application

AWS S3
• Store raw data
• Store results files

Redis
• Store detailed results for drill down

Master Instance

Core Instances 1 and 2
2 x m2.4xlarge

Task Instances 1, 2, 3 and 4
4 x m2.4xlarge
• For SVR and NLP processing only

Data aggregation with Pig Job flow

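Num_of_messages_by_date.pig

-- Load the forum messages (PigStorage splits fields on tabs by default)
records = LOAD '/input/forums/messages.txt'
    AS (str_date:chararray, message:chararray,
        url:chararray);

-- Group the messages by posting date
date_grouped = GROUP records BY str_date;

-- Count the messages in each group
results = FOREACH date_grouped GENERATE
    group, COUNT(records);

DUMP results;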
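For readers less familiar with Pig, the same aggregation can be sketched in plain Python for local testing on a small sample. This is only an illustration: the function name and the sample records are hypothetical, and the tab delimiter mirrors Pig's default field separator.

```python
import io
from collections import Counter

def count_messages_by_date(lines, delimiter="\t"):
    """Count records per date, like GROUP ... BY str_date plus COUNT in Pig."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines
        str_date = line.split(delimiter)[0]   # first field is the date
        counts[str_date] += 1
    return counts

# Hypothetical records: date, message, url.
sample = io.StringIO(
    "2012-01-03\tfirst message\thttp://example.org/a\n"
    "2012-01-03\tsecond message\thttp://example.org/b\n"
    "2012-01-04\tthird message\thttp://example.org/c\n"
)
counts = count_messages_by_date(sample)
```

The Pig version expresses the same logic declaratively and lets Hadoop distribute it over the cluster.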


Hadoop streaming

Hadoop streaming runs map/reduce jobs with any
executable or script, communicating through standard input and
standard output

Conceptually, it behaves like this pipeline (distributed over a cluster):
cat input.txt | map.py | sort | reduce.py

Why Hadoop streaming?
Intensive use of NLTK for Natural Language Processing
Intensive use of NumPy and Sklearn for Machine Learning
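The map, sort, reduce contract of the pipeline above can be emulated in a single process to test job logic before running it on a cluster. The sketch below is an illustration only: `run_streaming`, `word_mapper` and `sum_reducer` are hypothetical helpers, and real Hadoop streaming exchanges text lines over standard input and output rather than Python tuples.

```python
from itertools import groupby
from operator import itemgetter

def run_streaming(lines, mapper, reducer):
    """Emulate `cat input | map | sort | reduce` in a single process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))            # stands in for the sort step
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output

def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1

def sum_reducer(key, values):
    """Emit one (key, total) pair per key."""
    yield key, sum(values)

result = run_streaming(["fever and cough", "fever again"],
                       word_mapper, sum_reducer)
```

Because the reducer only ever sees values grouped under one key, the same functions keep working when the sort step is performed by Hadoop across many machines.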


Stemmed word distribution with Hadoop streaming, mapper.py

Stem_distribution_by_date/mapper.py
import sys
from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# Build the stemmer once, not once per input line
stemmer = FrenchStemmer()

# Input comes from STDIN: one "date;message;url" record per line
for line in sys.stdin:
    line = line.strip()
    str_date, message, url = line.split(";")

    # Split the message into word tokens
    tokens = regexp_tokenize(message, pattern=r'\w+')
    for token in tokens:
        word = stemmer.stem(token)
        # Ignore very short stems
        if len(word) >= 3:
            print('%s;%s' % (word, str_date))


Stemmed word distribution with Hadoop streaming, reducer.py

Stem_distribution_by_date/reducer.py
import sys
import json
from itertools import groupby
from operator import itemgetter
from nltk.probability import FreqDist

def read(f):
    """Yield [stem, date] pairs from the sorted mapper output."""
    for line in f:
        line = line.strip()
        yield line.split(';')

data = read(sys.stdin)

# Hadoop sorts the mapper output by key, so equal stems arrive consecutively
for current_stem, group in groupby(data, itemgetter(0)):
    values = [item[1] for item in group]
    # Count how many times the stem occurs on each date
    freq_dist = FreqDist(values)

    print("%s;%s" % (current_stem, json.dumps(freq_dist)))
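Downstream, each reducer output line can be parsed back into a per-stem timeline. The sketch below is a hedged illustration: it assumes the `stem;{json}` line layout printed by the reducer and ISO-formatted dates (`YYYY-MM-DD`), which the slides do not specify, and the sample line is hypothetical.

```python
import json

def parse_reducer_line(line):
    """Split a 'stem;{json}' line into (stem, {date: count})."""
    stem, payload = line.rstrip("\n").split(";", 1)
    return stem, json.loads(payload)

def monthly_totals(date_counts):
    """Aggregate {YYYY-MM-DD: count} into {YYYY-MM: count}."""
    totals = {}
    for date, count in date_counts.items():
        month = date[:7]                  # keep the YYYY-MM prefix
        totals[month] = totals.get(month, 0) + count
    return totals

# Hypothetical reducer output line.
stem, counts = parse_reducer_line(
    'migrain;{"2012-01-03": 2, "2012-01-04": 1, "2012-02-01": 4}'
)
by_month = monthly_totals(counts)
```

Splitting on the first `;` only keeps the JSON payload intact even if it were to contain that character.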


Conclusions

• The correlation search currently identifies 462 variables correlated with R² ≥ 80%
and a lag ≥ 1 month

• Amazon Elastic MapReduce provides the elasticity required by the morphology of
the jobs, as well as cost elasticity
  • Monthly cost with zero activity: < 5 €
  • Monthly cost with intensive activity: < 1 000 €
  • The equivalent cost of the platform would be around 50 000 €

• The S3 transfer overhead is not a problem given the volume of stored data

• During correlation search processing, at most 80% of the virtual CPUs are
used, because job scheduling gives a parallelism factor of 36 instead of 48
compared with SMP


Future work

Data mining

• Increase the number of data sources
• Test the robustness of the predictive model over time
• Reduce the overfitting of the correlation
• Enhance the correlation search for words by testing combinations

IT
• Switch only the correlation search to a MapReduce engine for SMP
architecture and clusters of cores, inspired by the Stanford Phoenix and the
Nokia Disco engines
• Industrialize the data mining components as a platform for generalization to
property and casualty (IARD) insurance, banking, e-commerce, telecoms and retail


OCTO in a nutshell

Big Data Analytics Offer
• Business case and benchmark studies
• Business Proof of Concept
• Data feeds: Web Trends
• Big Data and Analytics architecture design
• Big Data project delivery
• Training, seminar: Big Data, Hadoop

IT Consulting firm
• Established in 1998
• 175 employees
• 19.5 million turnover worldwide (2011)
• Verticals-based organization
  • Banking – Financial Services
  • Insurance
  • Media – Internet – Leisure
  • Industry – Distribution
  • Telecom – Services

[Map: OCTO offices]

