Health Insurance Predictive Analysis
         with MapReduce and Machine Learning


                                                         Julien Cabot
                                                     Managing Director
                                                               OCTO
                                                                      jcabot@octo.com
                                                                         @julien_cabot


              50, avenue des Champs-Elysées   Tél : +33 (0)1 58 56 10 00
                       75008 Paris - FRANCE   Fax : +33 (0)1 58 56 10 01
© OCTO 2012                                   www.octo.com
Internet as a Data Source…




              Internet as the voice of the crowd
… in Healthcare




              71% about
              • Illness
              • Symptom
              • Medicine
              • Advice / opinion

              Main sources are old-school
              forums, not social networks




Benefits for Insurance Company?


Understand the subjects that interest
patients, to design customer-centric products
and marketing actions

Anticipate the psycho-social effects of the
Internet, to prevent excessive consultations
(and reimbursements)

Predict claims by monitoring
queries about symptoms and drugs
How to run the predictive analysis?




The data problem


Understand the semantic field of
healthcare as it is used on the Internet

Find correlations between the evolution of
claims and many millions of unidentified
external variables

Find the correlated variables that anticipate the
claims

We need some help from Machine Learning!
Correlation search in external datasets



Three external data sources are each turned into trend time series:

• Forum messages: automated tokenization of messages per posted date and semantic tagging
  → Trends of medical keywords used in forums
• Google search volume of symptom and drug keywords
  → Trends of medical keywords searched in Google
• Socio-economic context from Open Data initiatives
  → Trends of socio-economic factors

Health claims by act typology + the trends above
  → Correlation Search Machine
  → Sorted matrix of determination coefficients (R²)
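The correlation search step can be sketched as follows. This is a minimal illustration, not the deck's implementation: it swaps in a plain linear R² (squared Pearson correlation) where the deck uses SVR, and the function names, the series names and the sample values are all hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def best_lag(variable, claims, max_lag):
    """Return (r2, lag) for the lag at which the variable best anticipates claims."""
    scores = []
    for lag in range(1, max_lag + 1):
        # The variable at period t is compared with the claims at period t + lag.
        scores.append((r_squared(variable[:-lag], claims[lag:]), lag))
    return max(scores)


def correlation_search(variables, claims, max_lag=3):
    """Return rows (name, r2, lag) sorted by decreasing R²."""
    rows = [(name,) + best_lag(series, claims, max_lag)
            for name, series in variables.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical monthly series: "flu_searches" leads the claims by two periods.
claims = [10, 12, 15, 11, 9, 14, 18, 13]
variables = {
    "flu_searches": [15, 11, 9, 14, 18, 13, 7, 8],
    "noise": [3, 1, 4, 1, 5, 9, 2, 6],
}
matrix = correlation_search(variables, claims)
for name, r2, lag in matrix:
    print("%s  R2=%.2f  lag=%d" % (name, r2, lag))
```

Restricting the search to lags of at least one period keeps only variables that move before the claims do, which is what makes them usable for prediction.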
Understand the semantic field of Healthcare

Message tokenization by date
  → Word stemming, tagging and common word filtering with NLTK
  → Timelines of healthcare keywords

How to tag healthcare words?

1-Build a first list of keywords
2-Enrich the list with highly searched keywords
3-Learn automatically from Wikipedia Medical Categories

  → Healthcare semantic field keywords database
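A minimal sketch of how tagged tokens become keyword timelines, using only the Python standard library. It assumes a tiny hand-written keyword set in place of the keywords database, skips stemming, and uses hypothetical sample messages; `keyword_timelines` is an illustrative name, not part of the original platform.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed list standing in for the healthcare keywords database.
HEALTH_KEYWORDS = {"grippe", "fievre", "toux"}


def keyword_timelines(messages, keywords):
    """Map each keyword to a Counter of {date: number of occurrences}.

    messages is an iterable of (date, text) pairs.
    """
    timelines = defaultdict(Counter)
    for date, text in messages:
        for token in re.findall(r"\w+", text.lower()):
            # Tagging step: keep only tokens found in the keyword list.
            if token in keywords:
                timelines[token][date] += 1
    return timelines


messages = [
    ("2012-01-03", "J'ai la grippe et de la toux"),
    ("2012-01-03", "Grippe depuis hier"),
    ("2012-01-04", "Toujours cette toux"),
]
timelines = keyword_timelines(messages, HEALTH_KEYWORDS)
for keyword in sorted(timelines):
    print(keyword, dict(timelines[keyword]))
```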
How to find correlations between time series?
    Compare the evolution of each variable with that of the claims over time
    Fit a non-linear regression and learn a polymorphic predictive function
    f(x) from the dataset with Support Vector Regression (SVR)

[Figure: plot of y against x, showing f(x) inside the tube bounded by f(x) + ε and f(x) - ε]

Problem to solve

    \min_{w} \frac{1}{2} w^{T} \cdot w
    subject to
    y_i - (w^{T} \cdot \phi(x_i) + b) \le \varepsilon
    (w^{T} \cdot \phi(x_i) + b) - y_i \le \varepsilon

Resolution
    • Stochastic gradient descent
    • Test the response through the coefficient of determination R²

              Open source ML libraries help!
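The two ingredients of this slide, the ε tube and the R² score, can be illustrated in a few lines of plain Python. This is a sketch only: it does not train an SVR (the deck relies on an open source library for that), the function names are illustrative, and the sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-point loss: zero inside the tube of half-width epsilon, linear outside."""
    return [max(0.0, abs(y - f) - epsilon) for y, f in zip(y_true, y_pred)]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and predictions f(x).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.5, 4.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2)
score = r2_score(y_true, y_pred)
print(losses)
print(round(score, 3))
```

Only the third point falls outside the tube, so it is the only one that contributes to the loss, while every residual contributes to R².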
Data Processing Profiles



The current volume of external data collected is large but not huge (~10 GB)

Data aggregation
      E.g. SELECT … GROUP BY date
      [Figure: chart with a "Data volume" axis]

Correlation search         ~5 GB × 12³ = 8.64 TB
      E.g. SVR computing
      [Figure: chart with a "Data volume" axis]

          We need parallel computing to divide
          the RAM requirement and the processing time!
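The total on this slide can be checked if the multiplier is read as 12 cubed and decimal units are used. Both are assumptions: the slide does not say what the factor represents.

```python
base_gb = 5                      # approximate input volume, in GB
multiplier = 12 ** 3             # 1728, assuming the factor is 12 cubed
total_gb = base_gb * multiplier  # volume handled by the correlation search
total_tb = total_gb / 1000       # decimal terabytes

print(multiplier, total_gb, total_tb)  # 1728 8640 8.64
```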
How to build the platform?




IT drivers

Project need                              Requirement             IT driver
----------------------------------------  ----------------------  ----------------------
Aggregate data from MB to GB files        Data aggregation        IO Elasticity
with sequential reading
SVR, NLP: execution time is               Large task execution    CPU Elasticity
~100 ms per task
Process many TB of in-memory data         Large RAM execution     RAM Elasticity
Increase the ROI of the research          Low CAPEX               Commodity HW, OSS SW
project while decreasing the TCO          Low OPEX                Cost Elasticity
Available solutions




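[Matrix: solutions (rows) rated against criteria (columns); cell marks not reproduced]

Criteria: IO Elasticity, CPU Elasticity, RAM Elasticity, Commodity Hardware, OSS Software, Cost Elasticity

Solutions:
• RDBMS
• In Memory analytics
• HPC
• Hadoop (three criteria annotated "With repartitioning")
• AWS Elastic MapReduce (two criteria annotated "Through Task")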
AWS Elastic MapReduce Architecture




[Figure: AWS Elastic MapReduce architecture diagram. Source: AWS]
Hadoop components




Tools on top of Hadoop
• Custom App: Java, C#, PHP, …
• Data mining tools: R, SAS
• BI tools: Tableau, Pentaho, …

Hadoop components
• Hue: Hadoop GUI
• Pig: Flow processing
• Streaming: MR scripting
• Hive: SQL-like querying
• Oozie: MR workflow
• MapReduce: Parallel processing framework
• Zookeeper: Coordination service
• Mahout: Machine Learning
• Sqoop: RDBMS integration
• Hama: Bulk synchronous processing
• Flume: Data stream integration
• Solr: Full text search
• HBase: NoSQL on HDFS
• HDFS: Distributed file storage

Grid of commodity hardware – storage and processing
General architecture of the platform

DataViz Application

AWS S3
• Store raw data
• Store results files

Redis
• Store detailed results for drill down

Elastic MapReduce cluster
• Master Instance
• Core Instances 1 and 2: 2 x m2.4xlarge
• Task Instances 1 to 4: 4 x m2.4xlarge, for SVR and NLP processing only
Data aggregation with Pig Job flow

Num_of_messages_by_date.pig

-- Load the forum messages (one record per message)
records = LOAD '/input/forums/messages.txt'
    AS (str_date:chararray, message:chararray, url:chararray);

-- Group the messages by posting date
date_grouped = GROUP records BY str_date;

-- Count the messages for each date
results = FOREACH date_grouped GENERATE group, COUNT(records);

DUMP results;
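For readers unfamiliar with Pig, the same aggregation can be sketched in plain Python. This is an illustrative single-machine equivalent, not part of the platform: it assumes tab-separated fields (Pig's default when no loader is specified), and the function name and sample records are hypothetical.

```python
from collections import Counter


def count_messages_by_date(lines, delimiter="\t"):
    """Count records per date, like GROUP ... BY str_date followed by COUNT."""
    counts = Counter()
    for line in lines:
        # Each record is: date, message, url.
        str_date, _message, _url = line.rstrip("\n").split(delimiter, 2)
        counts[str_date] += 1
    return counts


sample = [
    "2012-01-03\tJ'ai la grippe\thttp://example.org/1\n",
    "2012-01-03\tToux depuis hier\thttp://example.org/2\n",
    "2012-01-04\tFievre ce matin\thttp://example.org/3\n",
]
counts = count_messages_by_date(sample)
for str_date in sorted(counts):
    print(str_date, counts[str_date])
```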
Hadoop streaming



Hadoop streaming runs map/reduce jobs with any
executables or scripts, which communicate through standard input and
standard output

Conceptually, a job behaves like this pipeline (distributed across a cluster):
   cat input.txt | map.py | sort | reduce.py

Why Hadoop streaming?
   Intensive use of NLTK for Natural Language Processing
   Intensive use of NumPy and scikit-learn for Machine Learning
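The map, sort and reduce stages of that pipeline can be emulated in a single Python process to test job logic before running it on a cluster. This is a simplified sketch under stated assumptions: it counts words rather than building date distributions, uses a standard-library regular expression instead of NLTK, applies no stemming, and all names and sample records are hypothetical.

```python
import re
from itertools import groupby


def mapper(lines):
    """Emit one "word;date" line per word of at least 3 characters."""
    for line in lines:
        str_date, message, _url = line.strip().split(";")
        for token in re.findall(r"\w+", message.lower()):
            if len(token) >= 3:
                yield "%s;%s" % (token, str_date)


def reducer(sorted_lines):
    """Emit "word;count"; relies on the input being sorted by word."""
    pairs = (line.split(";") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s;%d" % (word, sum(1 for _ in group))


def run_local(lines):
    """Emulate: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


sample = [
    "2012-01-03;grippe et toux;http://example.org/1",
    "2012-01-04;toux seche;http://example.org/2",
]
output = run_local(sample)
print(output)
```

The sort between the two stages matters: the reducer only sees each word as one contiguous group because the mapper output has been sorted first.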
Stemmed word distribution with Hadoop streaming, mapper.py

Stem_distribution_by_date/mapper.py
import sys

from nltk.tokenize import regexp_tokenize
from nltk.stem.snowball import FrenchStemmer

# Build the stemmer once, not once per input line
stemmer = FrenchStemmer()

# Input comes from STDIN: one "date;message;url" record per line
for line in sys.stdin:
    line = line.strip()
    str_date, message, url = line.split(";")

    # Split the message into word tokens
    tokens = regexp_tokenize(message, pattern=r'\w+')
    for token in tokens:
        word = stemmer.stem(token)
        # Skip very short stems
        if len(word) >= 3:
            # Emit "stem;date" for the reducer
            print('%s;%s' % (word, str_date))
Stemmed word distribution with Hadoop streaming, reducer.py

Stem_distribution_by_date/reducer.py
import sys
import json
from itertools import groupby
from operator import itemgetter
from nltk.probability import FreqDist

def read(f):
    # Yield [stem, date] pairs from "stem;date" lines
    for line in f:
        line = line.strip()
        yield line.split(';')

data = read(sys.stdin)

# Streaming input is sorted by key, so each stem arrives as one contiguous group
for current_stem, group in groupby(data, itemgetter(0)):
    values = [item[1] for item in group]
    # Count how many times the stem occurs on each date
    freq_dist = FreqDist(values)

    print("%s;%s" % (current_stem, json.dumps(freq_dist)))
Conclusions




Conclusions

 • The correlation search currently identifies 462 variables correlated with claims, with R² >= 80%
   and a lag >= 1 month

 • Amazon Elastic MapReduce provides the elasticity required by the morphology of
   the jobs, as well as cost elasticity
     • Monthly cost with zero activity: < 5 €
     • Monthly cost with intensive activity: < 1 000 €
     • The equivalent cost of the platform would be around 50 000 €

 • The S3 transfer overhead is not a problem, given the volume of stored data

 • During correlation search processing, at most 80% of the virtual CPUs are
   used, because job scheduling yields a parallelism factor of 36 instead of 48
   with respect to SMP
Future work

Data mining
 • Increase the number of data sources
 • Test the robustness of the predictive model over time
 • Reduce the overfitting of the correlation
 • Enhance the correlation search for words by testing combinations

IT
 • Switch only the correlation search to a MapReduce engine for SMP
   architectures and clusters of cores, inspired by the Stanford Phoenix and the
   Nokia Disco engines
 • Industrialize the data mining components as a platform, to generalize to
   property and casualty (IARD) insurance, banking, e-commerce, telecoms and retail
OCTO in a nutshell

Big Data Analytics Offer
 • Business case and benchmark studies
 • Business Proof of Concept
 • Data feeds: Web Trends
 • Big Data and Analytics architecture design
 • Big Data project delivery
 • Training, seminar: Big Data, Hadoop

IT Consulting firm
 • Established in 1998
 • 175 employees
 • 19.5 million turnover worldwide (2011)
 • Verticals-based organization
     • Banking – Financial Services
     • Insurance
     • Media – Internet – Leisure
     • Industry – Distribution
     • Telecom – Services

[Figure: OCTO offices]
Thank you!





Predictive analysis in health insurance, by Julien Cabot
