Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN CABOT at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid


  1. 1. Health Insurance Predictive Analysis with MapReduce and Machine Learning. Julien Cabot, Managing Director, OCTO (@julien_cabot). Madrid, 16th of November 2012. OCTO: 50, avenue des Champs-Elysées, 75008 Paris, FRANCE. Tel: +33 (0)1 58 56 10 00. Fax: +33 (0)1 58 56 10 01. © OCTO 2012
  2. 2. Internet as a Data Source… Internet as the voice of the crowd
  3. 3. … in Healthcare. 71% about: • Illness • Symptom • Medicine • Advice / opinion. Main sources are old-school forums, not social networks.
  4. 4. Benefits for Insurance Company? • Understand the patients' subjects of interest to design customer-centric products and marketing actions. • Anticipate the psycho-social effect of the Internet to prevent excessive consultations (and reimbursements). • Predict the claims by monitoring the requests about symptoms and drugs.
  5. 5. How to run the predictive analysis?
  6. 6. The data problem. • Understand the semantic field of Healthcare… as used on the Internet. • Find correlations between the evolution of claims and… many millions of unidentified external variables. • Find correlated variables… that anticipate the claims. We need some help from Machine Learning!
  7. 7. Correlation search in external datasets
       • Automated tokenization of messages per posted date and semantic tagging → Trends of medical keywords used in forums
       • Google search volume of symptom and drug keywords → Trends of medical keywords searched in Google
       • Socio-economical context from Open Data initiatives → Trends of socio-economical factors
       • Health claims by act typology
       • All of the above feed the Correlation Search Machine → Determination coeff. (R²) sorted matrix
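The slide does not say how the sorted R² matrix is built. As a rough sketch only, the search can be pictured as scoring every external variable against the claims at several lags and keeping the best rows. The squared Pearson correlation used here is a simplified stand-in for the SVR-based R² described later in the deck, and the series names, values, lags and threshold are all hypothetical.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return (cov * cov) / (var_x * var_y)


def lagged_r2(variable, claims, lag):
    """R² between the variable at month t and the claims at month t + lag."""
    if lag <= 0 or lag >= len(claims):
        raise ValueError("lag must be between 1 and len(claims) - 1")
    return pearson_r2(variable[:-lag], claims[lag:])


def correlation_search(variables, claims, lags=(1, 2, 3), threshold=0.8):
    """Return (name, lag, r2) rows at or above the threshold, best first."""
    rows = []
    for name, series in variables.items():
        for lag in lags:
            r2 = lagged_r2(series, claims, lag)
            if r2 >= threshold:
                rows.append((name, lag, r2))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical monthly series, invented for illustration only.
claims = [10, 12, 15, 14, 18, 21, 20, 24, 27, 26, 30, 33]
variables = {
    # Same shape as the claims, one month earlier
    "keyword_a": [12, 15, 14, 18, 21, 20, 24, 27, 26, 30, 33, 35],
    # Repeating pattern unrelated to the claims
    "keyword_b": [5, 1, 4, 2, 5, 1, 4, 2, 5, 1, 4, 2],
}
matrix = correlation_search(variables, claims)
```

Only variables that move ahead of the claims survive the filter, which is the property an anticipating indicator needs.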
  8. 8. Understand the semantic field of Healthcare
       Pipeline: Message tokenization by date → Word stemming, tagging and common word filtering with NLTK → Timelines of healthcare key words
       How to tag Healthcare words?
       1- Build a first list of keywords
       2- Enrich the list with highly searched keywords
       3- Learn automatically from Wikipedia Medical Categories
       Result: Healthcare semantic field keywords database
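A minimal sketch of the tokenize, filter and tag steps, using only the standard library. The word lists below are hypothetical stand-ins for NLTK's stop words and for the healthcare keyword database, and stemming is left out for brevity.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins: a real run would rely on NLTK's stop word list and
# on the healthcare semantic field keywords database.
COMMON_WORDS = {"the", "and", "for", "with", "have", "since", "is"}
HEALTHCARE_KEYWORDS = {"fever", "cough", "aspirin", "migraine"}


def keyword_timelines(messages):
    """Map each healthcare keyword to a Counter of occurrences per date.

    `messages` is an iterable of (date, text) pairs.
    """
    timelines = defaultdict(Counter)
    for date, text in messages:
        for token in re.findall(r"\w+", text.lower()):
            if token in COMMON_WORDS:
                continue  # common word filtering
            if token in HEALTHCARE_KEYWORDS:
                timelines[token][date] += 1  # tagging and timeline update
    return timelines


# Invented sample messages, for illustration only.
messages = [
    ("2012-10-01", "Fever and cough since Monday"),
    ("2012-10-01", "Aspirin for the fever?"),
    ("2012-10-02", "Cough is gone"),
]
timelines = keyword_timelines(messages)
```

Each keyword ends up with one count per date, which is the time series shape the correlation search consumes.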
  9. 9. How to find correlations between time series? Compare the evolution of the variable and the claims over time. Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR). Problem to solve: $\min_{w} \frac{1}{2} w^{T} \cdot w$ subject to $y_i - (w^{T} \cdot \phi(x_i) + b) \le \varepsilon$ and $(w^{T} \cdot \phi(x_i) + b) - y_i \le \varepsilon$. [Figure: f(x) with its tube bounds f(x) + ε and f(x) - ε, plotted as y against x.] Resolution: • Stochastic gradient descent • Test the response through the coefficient of determination R². Open source ML libraries help!
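To make the resolution step concrete, here is a small sketch of ε-insensitive regression solved by stochastic gradient descent and scored with R². It is a simplification, not the talk's implementation: the feature map ϕ is taken as the identity (a linear model on one variable), the hard constraints are replaced by the usual ε-insensitive penalty, and the data and hyperparameters are invented.

```python
def fit_linear_svr(xs, ys, epsilon=0.1, lr=0.01, reg=1e-4, epochs=2000):
    """Fit y ≈ w*x + b by SGD on the epsilon-insensitive loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            w -= lr * reg * w  # gradient step on the (reg / 2) * w² penalty
            if error > epsilon:       # prediction above the tube
                w -= lr * x
                b -= lr
            elif error < -epsilon:    # prediction below the tube
                w += lr * x
                b += lr
            # inside the tube the loss is zero: no data-driven update
    return w, b


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): claims depend linearly on one variable.
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_svr(xs, ys)
score = r_squared(ys, [w * x + b for x in xs])
```

Points that already fall inside the ε-tube trigger no correction, so the fit stops moving once every residual is within roughly ε of the target.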
  10. 10. Data Processing Profiles. The current volume of external data grabbed is large but not so huge (~10 GB). • Data aggregation, e.g. Select … Group By Date. • Correlation search, e.g. SVR computing: data volume ~5 GB × 12³ = 8.64 TB. We need parallel computing to divide the RAM requirement and the processing time!
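The volume estimate for the correlation search can be checked with a few lines of arithmetic. This reads the factor on the slide as 12 cubed, an interpretation supported by the stated total (5 × 1728 = 8640); the slide does not explain where the factor comes from, and decimal units are assumed.

```python
base_volume_gb = 5        # "~5 GB" from the slide
multiplier = 12 ** 3      # the slide's factor read as 12 cubed = 1728
total_gb = base_volume_gb * multiplier
total_tb = total_gb / 1000  # decimal units assumed: 1 TB = 1000 GB
```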
  11. 11. How to build the platform?
  12. 12. IT drivers (requirements → IT drivers)
       • Data aggregation: aggregate data from MB to GB files while reading sequentially → IO elasticity
       • Large task execution (SVR, NLP): execution time is ~100 ms per task → CPU elasticity
       • Large RAM execution: process many TB of in-memory data → RAM elasticity
       • Low CAPEX, low OPEX: increase the ROI of the research project while decreasing the TCO → commodity HW, OSS SW, cost elasticity
  13. 13. Available solutions [comparison table; most cell marks were lost in extraction]
       Criteria: RAM Elasticity, CPU Elasticity, IO Elasticity, Cost Elasticity, OSS Software, Commodity Hardware
       Solutions: RDBMS; In Memory analytics; HPC; Hadoop (three cells read "With repartitioning"); AWS Elastic MapReduce (two cells read "Through Task")
  14. 14. AWS Elastic MapReduce Architecture [architecture diagram; source: AWS]
  15. 15. Hadoop components
       • Clients: Custom App (Java, C#, PHP, …); Data mining tools (R, SAS); BI tools (Tableau, Pentaho, …)
       • Hue: Hadoop GUI; Pig: flow processing; Streaming: MR scripting; Hive: SQL-like querying
       • Oozie: MR workflow; MapReduce: parallel processing framework; Zookeeper: coordination service
       • Mahout: Machine Learning; Hama: bulk synchronous processing
       • Sqoop: RDBMS integration; Flume: data stream integration
       • Solr: full text search; HBase: NoSQL on HDFS
       • HDFS: distributed file storage
       • Grid of commodity hardware – storage and processing
  16. 16. General architecture of the platform
       • DataViz Application
       • AWS S3: store raw data; store results files
       • Redis: store detailed results for drill down
       • Master Instance
       • Core Instances 1 and 2: 2 x m2.4xlarge
       • Task Instances 1, 2, 3 and 4: 4 x m2.4xlarge, for SVR and NLP processing only
  17. 17. Data aggregation with Pig. Job flow: Num_of_messages_by_date.pig
       records = LOAD '/input/forums/messages.txt'
           AS (str_date:chararray, message:chararray, url:chararray);
       -- Group the messages by posting date
       date_grouped = GROUP records BY str_date;
       -- Count the messages in each group
       results = FOREACH date_grouped GENERATE group, COUNT(records);
       DUMP results;
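For checking such a job on a small sample before sending it to the cluster, the same count can be sketched in plain Python. This assumes tab-separated input, which is what Pig's default loader (PigStorage) expects when LOAD has no USING clause; the function name and sample records are hypothetical.

```python
from collections import Counter


def count_messages_by_date(lines):
    """Count records per date, like GROUP ... BY str_date with COUNT in Pig."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # The date is the first tab-separated field of each record
        str_date = line.split("\t", 1)[0]
        counts[str_date] += 1
    return counts


# Invented sample records, for illustration only.
sample = [
    "2012-10-01\tFirst message\thttp://example.org/a\n",
    "2012-10-01\tSecond message\thttp://example.org/b\n",
    "2012-10-02\tThird message\thttp://example.org/c\n",
]
results = count_messages_by_date(sample)
```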
  18. 18. Hadoop streaming. Hadoop streaming runs map/reduce jobs with any executables or scripts through standard input and standard output. It looks like this (on a cluster): cat input.txt | [mapper script] | sort | reduce.py. Why Hadoop streaming? • Intensive use of NLTK for Natural Language Processing • Intensive use of NumPy and Sklearn for Machine Learning
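The pipe analogy can be emulated in a single process to test a mapper and reducer pair without a cluster. This is only a sketch: `run_streaming`, `url_mapper` and `sum_reducer` are hypothetical helpers written for this illustration, the `key;value` line convention mirrors the one used in the talk's scripts, and the toy job simply counts messages per source.

```python
from itertools import groupby


def run_streaming(lines, mapper, reducer):
    """Emulate `cat input | mapper | sort | reducer` in one process."""
    mapped = [out for line in lines for out in mapper(line)]
    mapped.sort()  # stands in for the sort between the map and reduce phases
    keyed = (record.split(";", 1) for record in mapped)
    results = []
    for key, group in groupby(keyed, key=lambda pair: pair[0]):
        results.append(reducer(key, [value for _, value in group]))
    return results


def url_mapper(line):
    """Emit "url;1" for each "date;message;url" record (toy job)."""
    str_date, message, url = line.strip().split(";")
    yield "%s;1" % url


def sum_reducer(key, values):
    """Sum the counts emitted for one key."""
    return "%s;%d" % (key, sum(int(v) for v in values))


# Invented sample records, for illustration only.
sample = [
    "2012-10-01;fever again;forum-a",
    "2012-10-01;which aspirin;forum-b",
    "2012-10-02;cough syrup;forum-a",
]
output = run_streaming(sample, url_mapper, sum_reducer)
```

Sorting before grouping matters: `groupby` only merges adjacent equal keys, which is the same guarantee the reducer gets from the sort step on a cluster.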
  19. 19. Stemmed word distribution with Hadoop streaming, mapper.py. Stem_distribution_by_date/mapper.py:
       import sys
       from nltk.tokenize import regexp_tokenize
       from nltk.stem.snowball import FrenchStemmer

       # Build the stemmer once instead of once per input line
       stemmer = FrenchStemmer()

       # Input comes from STDIN (standard input), one "date;message;url" record per line
       for line in sys.stdin:
           line = line.strip()
           str_date, message, url = line.split(";")
           tokens = regexp_tokenize(message, pattern=r"\w+")
           for token in tokens:
               word = stemmer.stem(token)
               if len(word) >= 3:
                   # Emit "stem;date" for the reducer
                   print("%s;%s" % (word, str_date))
  20. 20. Stemmed word distribution with Hadoop streaming, reducer.py. Stem_distribution_by_date/reducer.py:
       import sys
       import json
       from itertools import groupby
       from operator import itemgetter
       from nltk.probability import FreqDist

       def read(f):
           # Yield [stem, date] pairs from the sorted mapper output
           for line in f:
               line = line.strip()
               yield line.split(";")

       data = read(sys.stdin)
       # The input is sorted by stem, so equal stems are adjacent
       for current_stem, group in groupby(data, itemgetter(0)):
           values = [item[1] for item in group]
           # Count the occurrences of the stem per date
           freq_dist = FreqDist(values)
           print("%s;%s" % (current_stem, json.dumps(freq_dist)))
  21. 21. Conclusions
  22. 22. Conclusions The correlation search identifies currently 462 variables correlated with a R² >= 80% and a lag >= 1 month Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs and the cost elasticity  Monthly cost with zero activity : < 5 €  Monthly cost with intensive activity : < 1 000 €  The equivalent cost of the platform would be around 50 000 € The S3 transfer overhead is not a problem due the volume of stored data While Correlation search processing, only 80% max of the virtual CPU are used due to job scheduling with a parallelism factor of 36 instead of 48 regarding SMP 22
  23. 23. Future work. Data mining: • Increase the number of data sources • Test the robustness of the predictive model over time • Reduce the overfitting of the correlation • Enhance the correlation search for words by testing combinations. IT: • Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by the Stanford Phoenix and the Nokia Disco engines • Industrialize the data mining components as a platform for generalization to IARD (property and casualty) insurance, banking, e-commerce, telecoms and retail
  24. 24. OCTO in a nutshell. Big Data Analytics Offer: • Business case and benchmark studies • Business Proof of Concept • Data feeds: Web Trends • Big Data and Analytics architecture design • Big Data project delivery • Training, seminar: Big Data, Hadoop. IT consulting firm: • Established in 1998 • 175 employees • 19.5 million turnover worldwide (2011) • Verticals-based organization: Banking – Financial Services; Insurance; Media – Internet – Leisure; Industry – Distribution; Telecom – Services. OCTO offices.
  25. 25. Thank you!