BigDataCloud Sept 8 2011 meetup - Big Data Analytics for Health by Charles Kaminski of LexisNexis


Big Data Analytics for Health - Insights from the Healthcare Industry.
- Charles Kaminski, LexisNexis

Published in: Technology

  1. Big Data Cloud Meet Up, September 8th, 2011. HPCC Platform: Big Data Analytics and Delivery. LexisNexis’ massively parallel-processing open-source computing platform.
  2. Who’s been using the HPCC Platform and why?
       • Very large businesses
       • Federal agencies
       • National research labs
       • It’s 4 to 10 times faster
       • Products and solutions are built much faster
       • Very complex problems can be modeled and solved
       • It’s proven
  3. What’s changed? We just open-sourced! The HPCC Platform is now available to you.
  4. Big Data…It’s our business.
     Big Data Open Source Components
     Industry solutions: Insurance, Financial Services, Cyber Security, Government, Health Care, Retail, Telecommunications, Transportation & Logistics, Weblog Analysis, Online Reservations
       • Customer Data Integration
       • Data Fusion
       • Fraud Detection and Prevention
       • Know Your Customer
       • Master Data Management
       • Weblog Analysis
  5. The Platform’s Major Parts
       • Thor – Data ingestion, hygiene, refining, transformation, linking, fusion
       • Roxie – Data Delivery Engine
           • Supports complex queries and distributed indexes
           • Low latency; latencies grow logarithmically
       • ECL – One language
           • Highly expressive and efficient declarative language
               • Solve complex problems
               • Encourage code reuse
  6. How we’re different
       • It’s not a group of disparate technologies or competing visions bolted together.
       • It’s one platform with a clear, proven vision.
       • This by itself is powerful.
  7. How we’re different
       • You can transcend MapReduce
           • Build transformative data graphs and applications using ECL
           • Solve very complex Big Data problems
           • Don’t struggle to fit your Big Data problem into groups of MapReduce jobs
  8. How we’re different
       • No need to munge the data before ingestion
       • No complex block file system
       • No need to tune the number of tasks for different jobs
       • Data Delivery Engine is included
       • Use a single language for data cleansing, transformation, linking, fusion, and delivery
       • ECL promotes language extension and code reuse
       • Data graphs are built and optimized by the system
       • The system-generated C++ is highly optimized
       • Code execution is optimized
       • Low and predictable latencies
       • Modeling data problems as data problems leads to richer solutions
  9. Challenges Facing Health Care Enterprises
     Challenges facing the health insurance industry
       • Disparate data spread across separate physical locations
       • Scale of data. BIG Data is getting BIGGER.
       • Adding relationships exponentially expands the size of the BIG Data analytics challenge.
       • LexisNexis has leveraged parallel-processing computing platforms and large-scale graph analytics for over a decade.
  10. Potential Fraud – a POC for the State of New York
       • We applied social network analytics to information provided by the State of New York and public data supplied by LexisNexis. The goal was to identify relationships among a group of New York Medicaid recipients living in high-end condominiums within the same complex, and any links those individuals might have to medical facilities or others providing care to New York Medicaid recipients.
  11. What’s entailed (high level): Addition of External Data
       • Mix first-party data with public and third-party data sources
       • Adds fidelity to existing entities
       • Adds new linkages into the analysis
       • Adds new entities into the analysis
       • Exposes ring leaders and brokers that don’t directly participate
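The effect of adding external data can be illustrated with a minimal Python sketch. It is not the POC's ECL code; the record layouts, identifiers, and the `fuse` helper are all hypothetical, and the matching rule (exact address equality) is a deliberate simplification. The point is only that external records contribute both new entities and new linkages.

```python
from collections import defaultdict

# Hypothetical first-party records: (recipient_id, address)
first_party = [
    ("r1", "10 Main St"),
    ("r2", "22 Oak Ave"),
]

# Hypothetical external records: (person_id, address, business or None)
external = [
    ("r1", "10 Main St", "Clinic A"),
    ("p9", "10 Main St", "Clinic A"),  # person absent from first-party data
    ("r2", "22 Oak Ave", None),
]


def fuse(first_party, external):
    """Return (entities, links) after mixing in the external records."""
    entities = {pid for pid, _ in first_party}
    links = set()
    by_address = defaultdict(set)

    for pid, addr in first_party:
        by_address[addr].add(pid)

    for pid, addr, business in external:
        entities.add(pid)               # may be a new entity
        by_address[addr].add(pid)
        if business is not None:
            entities.add(business)      # businesses are entities too
            links.add((pid, business, "contact_of"))

    # People recorded at the same address get a new linkage.
    for people in by_address.values():
        ordered = sorted(people)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                links.add((a, b, "shares_address"))

    return entities, links


entities, links = fuse(first_party, external)
```

In this toy data, `p9` never appears in the first-party records, yet after fusion it is linked both to recipient `r1` and to `Clinic A`, which is the kind of indirect participant the slide describes.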
  12. How we did it
       • Graph Network: 3 billion derived public data relationships between people, merged with risk indicators.
       • Graph Analytics: examine up to 20 billion data points to create variables that allow for predictive analysis incorporating relationship context and associated risk.
       • Targets fraud across all sectors, including Healthcare, Financial Services, and Government.
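As an illustration of a graph-derived variable, the sketch below counts how many risk-flagged people lie within a given number of hops of a person. The edge list, the `risky` set, and the function names are assumptions made up for this example; the slides do not say which variables were actually computed or how.

```python
from collections import deque

# Hypothetical relationship edges between people, and hypothetical risk flags.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]
risky = {"c", "e"}


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def risky_within(adj, start, risky, max_hops):
    """Count risk-flagged nodes reachable from start in at most max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    count = 0
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                if nxt in risky:
                    count += 1
                queue.append((nxt, dist + 1))
    return count


adj = build_adjacency(edges)
one_hop = risky_within(adj, "a", risky, 1)
two_hop = risky_within(adj, "a", risky, 2)
```

A count like this could serve as one input column to a predictive model, giving it relationship context in addition to a person's own attributes.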
  13. Cluster Visualization Introduction
       • How many of them live in expensive residences, own expensive property, or drive expensive cars?
       • How many recipients are contacts of medical businesses?
       • How many medical businesses are associated with any of the people in the cluster?
       • How many are currently receiving benefits?
     Legend: Medicaid Recipient; Expensive Residence; Owns Expensive Property; Owns Expensive Vehicles; Business Contact of Medical Business Entity
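The questions on this slide reduce to counts over per-person flags. The sketch below shows one way to express them, assuming a hypothetical `Member` record whose fields mirror the legend; the sample cluster is invented for illustration and is not data from the POC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical per-person flags mirroring the slide's legend."""
    name: str
    receives_benefits: bool
    expensive_residence: bool
    expensive_vehicle: bool
    medical_business_contact: bool


# Invented sample cluster.
cluster = [
    Member("p1", True, True, False, False),
    Member("p2", True, False, True, True),
    Member("p3", False, False, False, True),
    Member("p4", True, False, False, False),
]


def summarize(cluster):
    """Answer the slide's counting questions for one cluster."""
    return {
        "receiving_benefits": sum(m.receives_benefits for m in cluster),
        "expensive_assets": sum(
            m.expensive_residence or m.expensive_vehicle for m in cluster
        ),
        "recipients_with_medical_contact": sum(
            m.receives_benefits and m.medical_business_contact for m in cluster
        ),
    }


summary = summarize(cluster)
```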
  14. Cluster Visualization
  15. City Walk Sample: Vehicle Statistics
     What is the list of preferred expensive vehicles?

     | Make Description | # Owned | Make Description | # Owned |
     |------------------|---------|------------------|---------|
     | Mercedes-Benz    | 46      | Chevrolet        | 2       |
     | Lexus            | 41      | Hummer           | 2       |
     | BMW              | 27      | Jeep             | 2       |
     | Infiniti         | 13      | Nissan           | 2       |
     | Acura            | 9       | Toyota           | 2       |
     | Lincoln          | 8       | Aston Martin     | 1       |
     | Audi             | 7       | Bentley          | 1       |
     | Land Rover       | 7       | Cadillac         | 1       |
     | Porsche          | 6       | GMC              | 1       |
     | Jaguar           | 5       | Honda            | 1       |
     | Mercedes Benz    | 3       | Volkswagen       | 1       |
     | Saab             | 3       | Volvo            | 1       |
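The vehicle statistics list "Mercedes-Benz" (46) and "Mercedes Benz" (3) as separate makes, which is the sort of inconsistency a data hygiene step would normally reconcile. The sketch below is a minimal illustration using a few rows from the slide. The normalization rule is an assumption, and treating the two spellings as the same make is an editorial guess rather than something the slides state.

```python
from collections import Counter

# A few rows copied from the slide's vehicle statistics.
rows = [
    ("Mercedes-Benz", 46),
    ("Lexus", 41),
    ("BMW", 27),
    ("Mercedes Benz", 3),
    ("Saab", 3),
]


def normalize_make(make: str) -> str:
    """Hypothetical hygiene rule: lowercase, hyphens become spaces."""
    return " ".join(make.lower().replace("-", " ").split())


def total_by_make(rows):
    """Sum ownership counts per normalized make."""
    totals = Counter()
    for make, owned in rows:
        totals[normalize_make(make)] += owned
    return totals


totals = total_by_make(rows)
```

Under this rule the two spellings collapse into a single make whose total is 46 + 3 = 49.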
  16. Dominant buyers and sellers at City Walk
     Property deed reference counts

     | Name               | Deeds Held | Name          | Deeds Held |
     |--------------------|------------|---------------|------------|
     | Hudson Eight       | 78         | Mike Greem    | 21         |
     | Hudson Five        | 74         | Scott Hill    | 21         |
     | Hudson First       | 73         | Betty Donaway | 21         |
     | Hudson Nine        | 65         | Al Clark      | 19         |
     | Harry Anderson     | 45         | Dave Miller   | 17         |
     | Hudson Ten         | 41         | Mark Walker   | 16         |
     | Hudson Seven       | 39         | Mike Smith    | 16         |
     | Home Nationwide    | 33         | Val Edwards   | 15         |
     | Hudson Three       | 33         | Eric Garcia   | 14         |
     | Brian Smith        | 28         | Dane Young    | 14         |
     | Alan Stevens       | 25         | Bill Moore    | 14         |
     | Chris Doe          | 24         | Karen Carter  | 14         |
     | Sophie Davis       | 23         | Casey Baker   | 14         |
     | Washington Mutual  | 23         | Art Nelson    | 14         |
     | Fleet Mortgage Co. | 21         | Cathy Parker  | 13         |
  17. The engineering story
     One guy (Joe Prichard). Three weeks. Less than part time. The platform lets him focus on the data. Joe’s a lot of fun to work with.
  18. Do you build other POCs? Yes.
  19. What next?
       • Try us out!
       • Virtual Machine
       • Binaries
           • EC2 Data Script
           • Ensemble Recipe…Juan from Canonical
  20. Contact Information
     Charles Kaminski, Senior Architect, Academic Development Lead, HPCC Systems
     [email_address]
     402-619-9413