Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems

As part of the 2018 HPCC Systems Summit Community Day event:

Data profiling is a technique used to uncover information about a source of data. Information such as the shape or accuracy of the data is extremely useful during data discovery (when you're exploring a new dataset) or when verifying that updated data appears to be a valid replacement for old data. DataPatterns, an open-source ECL bundle for HPCC Systems, offers a native function macro for data profiling that is easy to use and supports a number of options for tuning the profile result. This talk briefly explores the bundle's profile feature and options.

Dan Camper has been with LexisNexis Risk for four years and is a Senior Architect in the Solutions Lab Group. He has worked for Apple and Dun & Bradstreet, and he ran his own custom programming shop for a decade. He's been writing software professionally for over 35 years and has worked on a myriad of systems, using a lot of different programming languages. He thinks ECL is pretty neat.

Published in: Data & Analytics

  1. Innovation and Reinvention Driving Transformation
     2018 HPCC Systems® Community Day, October 9, 2018
     Dan S. Camper – Sr. Architect, HPCC Solutions Lab
     Data Patterns: A Native Open Source Data Profiling Tool for HPCC Systems
  2. What Is Data Profiling?
     • A method of examining data to collect statistics and information about that data
     • Determines the “shape” of the data
       • Data types
       • Lengths
       • Cardinality
       • Prominent discrete values
       • Patterns
     • Also known as “Data Discovery”
  3. When Would You Profile Data?
     • Explore a new dataset
       • Determine the real data types
       • Determine field population
       • Spot garbage data
       • Find highly-correlated fields
     • Verify data updates
       • Ensure that structure has not changed
       • Check for expected cardinality
       • Check for expected fill rates
       • Check for unexpected garbage
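The "verify data updates" checks above can be sketched in a few lines of Python. This is only an illustration of the idea and is not part of DataPatterns: the function name `compare_profiles`, the dictionary layout, and both thresholds are assumptions made for the example.

```python
def compare_profiles(old, new, fill_tolerance=5.0):
    """Return warnings describing how a new profile differs from an old one.

    Each profile maps a field name to a dict with "type", "fill_rate"
    (a percentage) and "cardinality". This layout is hypothetical.
    """
    warnings = []

    # Structure check: fields that disappeared or appeared.
    for name in sorted(set(old) - set(new)):
        warnings.append(f"{name}: field missing from new data")
    for name in sorted(set(new) - set(old)):
        warnings.append(f"{name}: unexpected new field")

    for name in sorted(set(old) & set(new)):
        o, n = old[name], new[name]
        # Structure check: the data type should not change.
        if o["type"] != n["type"]:
            warnings.append(f"{name}: type changed from {o['type']} to {n['type']}")
        # Fill-rate check: allow a small drift (in percentage points).
        if abs(o["fill_rate"] - n["fill_rate"]) > fill_tolerance:
            warnings.append(
                f"{name}: fill rate moved from {o['fill_rate']:.1f}% to {n['fill_rate']:.1f}%"
            )
        # Cardinality check: flag a halving or doubling (arbitrary threshold).
        low, high = o["cardinality"] * 0.5, o["cardinality"] * 2.0
        if not low <= n["cardinality"] <= high:
            warnings.append(
                f"{name}: cardinality moved from {o['cardinality']} to {n['cardinality']}"
            )
    return warnings
```

An empty result means the new data has the same fields and types as the old data, with fill rates and cardinalities inside the chosen tolerances.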
  4. DataPatterns.Profile()
     • Written entirely in ECL
       • It is a single FUNCTIONMACRO
       • No library or module dependencies
     • Performs all profiling checks by default
     • Numerous parameters for controlling analysis and output
       • Analyze all rows in a dataset or just a sample
       • Analyze all fields or only certain fields
       • Enable only specified profiling checks
       • Specify returned pattern counts
     • Creates a single dataset as a result
       • One record for each field analyzed
  5. DataPatterns.Profile() – The Usual Analysis

     | Output | Description |
     |---|---|
     | attribute | The name of the field in the input dataset |
     | given_attribute_type | The ECL type of the attribute as it was defined in the RECORD definition |
     | best_attribute_type | An ECL data type that both allows all values in the input dataset and consumes the least amount of memory |
     | rec_count | The number of records analyzed |
     | fill_count | The number of rec_count records containing non-nil values |
     | fill_rate | The percentage of rec_count records containing non-nil values |
     | cardinality | The number of unique, non-nil values |
     | modes | The most common value(s) in the attribute, after coercing all values to STRING, along with the number of records in which the values were found |
     | min_length | The shortest length of a value when expressed as a string |
     | max_length | The longest length of a value when expressed as a string |
     | ave_length | The average length of a value when expressed as a string |
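To make the outputs above concrete, here is a minimal Python sketch that computes the same kinds of statistics for a single column. It is not the ECL implementation. The function name `profile_column` is hypothetical, and two details are assumptions: a value is treated as "nil" when it is `None` or an empty string, and lengths are measured over non-nil values only.

```python
from collections import Counter


def profile_column(values):
    """Compute basic profile statistics for one column of values."""
    rec_count = len(values)
    # Coerce to strings and drop nil values (None or empty string; an assumption).
    filled = [str(v) for v in values if v is not None and str(v) != ""]
    fill_count = len(filled)
    counts = Counter(filled)
    top = max(counts.values()) if counts else 0
    lengths = [len(s) for s in filled]
    return {
        "rec_count": rec_count,
        "fill_count": fill_count,
        # fill_rate is a percentage of the records analyzed.
        "fill_rate": 100.0 * fill_count / rec_count if rec_count else 0.0,
        "cardinality": len(counts),
        # Every value tied for the highest count, with that count.
        "modes": sorted((v, c) for v, c in counts.items() if c == top),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "ave_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

For example, `profile_column(["FL", "GA", "FL", "", None, "Texas"])` reports six records, four of them filled, three distinct values, and `"FL"` as the only mode.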
  6. DataPatterns.Profile() – Analysis For Numeric Fields

     | Output | Description |
     |---|---|
     | is_numeric | Boolean indicating if the original attribute was numeric and therefore whether or not the numeric_xxxx output fields will be populated with actual values |
     | numeric_min | The smallest non-nil value as a DECIMAL |
     | numeric_max | The largest non-nil value as a DECIMAL |
     | numeric_mean | The mean (average) non-nil value as a DECIMAL |
     | numeric_std_dev | The standard deviation of the non-nil values as a DECIMAL |
     | numeric_lower_quartile | The value separating the first (bottom) and second quarters of non-nil values as a DECIMAL |
     | numeric_median | The median non-nil value as a DECIMAL |
     | numeric_upper_quartile | The value separating the third and fourth (top) quarters of non-nil values as a DECIMAL |
     | numeric_correlations | A child dataset containing correlation values comparing the current numeric attribute with all other numeric attributes, listed in descending correlation value order |
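The numeric outputs can be illustrated with Python's standard `statistics` module. This is a sketch under stated assumptions, not the bundle's method: the slides do not say whether the standard deviation is a population or sample figure, which quartile interpolation is used, or which correlation measure is computed. The sketch uses the population standard deviation, the "inclusive" quartile method, and Pearson correlation. The function names are hypothetical.

```python
import math
import statistics


def numeric_profile(values):
    """Summarize the non-nil (non-None) numbers in a column."""
    nums = [v for v in values if v is not None]
    # Quartile method is an assumption; other methods give slightly different cut points.
    q1, median, q3 = statistics.quantiles(nums, n=4, method="inclusive")
    return {
        "numeric_min": min(nums),
        "numeric_max": max(nums),
        "numeric_mean": statistics.fmean(nums),
        "numeric_std_dev": statistics.pstdev(nums),  # population std dev (assumption)
        "numeric_lower_quartile": q1,
        "numeric_median": median,
        "numeric_upper_quartile": q3,
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Calling `numeric_profile([1, 2, 3, 4, 5, None])` ignores the `None` and summarizes the five remaining numbers. A correlation near 1 or -1 from `pearson` marks a pair of fields worth a closer look.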
  7. DataPatterns.Profile() – Text Patterns
     • Text patterns give you an idea of what your data looks like when it is expressed as a human-readable generalized string
     • Very useful for spotting data that doesn’t belong
     • Converts each character of the string into a fixed character palette to produce a new string pattern
       • Any uppercase letter => A
       • Any lowercase letter => a
       • Any numeric digit => 9
       • Any boolean value => B
       • All other characters remain as-is
     • By counting the unique patterns and ranking them, you can easily see what kind of data is very common or very rare
     • All string data types are supported
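The character palette above is simple enough to sketch directly. The Python function below is an illustration of the mapping rules, not the ECL code from the bundle, and the name `text_pattern` is hypothetical.

```python
def text_pattern(value):
    """Generalize a value into a pattern string using the A / a / 9 / B palette."""
    # A boolean value becomes the single character B.
    if isinstance(value, bool):
        return "B"
    out = []
    for ch in str(value):
        if ch.isupper():
            out.append("A")   # any uppercase letter
        elif ch.islower():
            out.append("a")   # any lowercase letter
        elif ch.isdigit():
            out.append("9")   # any numeric digit
        else:
            out.append(ch)    # all other characters remain as-is
    return "".join(out)
```

Applied to the three example values used in the talk, `text_pattern("45816.01")` gives `99999.99`, `text_pattern("Dan Camper")` gives `Aaa Aaaaaa`, and `text_pattern("For *only* $10!")` gives `Aaa *aaaa* $99!`.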
  8. DataPatterns.Profile() – Text Pattern Analysis

     | Output | Description |
     |---|---|
     | popular_patterns | The most common patterns of values; patterns are listed from most- to least-common and an example (pulled from the data) is shown for each |
     | rare_patterns | The least common patterns of values; patterns are listed from least- to most-common and an example (pulled from the data) is shown for each; patterns already shown in popular_patterns are not repeated here |

     Examples

     | Original Value | Pattern |
     |---|---|
     | 45816.01 | 99999.99 |
     | Dan Camper | Aaa Aaaaaa |
     | For *only* $10! | Aaa *aaaa* $99! |
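The popular and rare pattern lists amount to counting patterns and reading the ranking from both ends. The Python sketch below shows one way to do that. It is an assumption-laden illustration, not the bundle's code: the names `to_pattern` and `rank_patterns`, the `max_patterns` parameter, the choice of the first value seen as the example, and the alphabetical tie-break are all hypothetical.

```python
from collections import Counter


def to_pattern(value):
    """Map uppercase to A, lowercase to a, digits to 9; keep everything else."""
    return "".join(
        "A" if ch.isupper() else "a" if ch.islower() else "9" if ch.isdigit() else ch
        for ch in str(value)
    )


def rank_patterns(values, max_patterns=3):
    """Return (popular, rare) lists of (pattern, count, example) tuples."""
    counts = Counter()
    examples = {}
    for v in values:
        p = to_pattern(v)
        counts[p] += 1
        examples.setdefault(p, v)  # keep the first value seen as the example

    # Most common first; ties broken alphabetically so the result is deterministic.
    by_popularity = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    popular = by_popularity[:max_patterns]
    shown = {p for p, _ in popular}

    # Least common first, skipping patterns already listed as popular.
    by_rarity = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    rare = [kv for kv in by_rarity if kv[0] not in shown][:max_patterns]

    def with_examples(items):
        return [(p, c, examples[p]) for p, c in items]

    return with_examples(popular), with_examples(rare)
```

In a column of ZIP codes, the pattern `99999` dominates the popular list, while a stray `3310A` or `n/a` surfaces in the rare list, which is how garbage values stand out.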
  9. Some Data To Profile …
  10. … And How To Profile It
      • Import the DataPatterns module
      • Define a record structure
      • Declare the dataset
      • Call the profiler
      • Show result
  11. Profiling Results – The Usual Suspects
  12. Profiling Results – Numeric Fields
  13. Profiling Results – Data Pattern Analysis
  14. Final Thoughts
      • DataPatterns is an open-source ECL bundle
        • https://github.com/hpcc-systems/DataPatterns.git
      • Currently contains only two functions
        • Profile()
        • BestRecordStructure()
      • Future plans
        • Histograms for numeric fields
        • Additional information for low-cardinality fields
        • Expand correlations to non-numeric discrete-value fields
        • Easy comparison of profile results to detect changes
        • Visualization
        • Data Detectors
  15. Questions?
  16. Innovation and Reinvention Driving Transformation
      2018 HPCC Systems® Community Day, October 9, 2018
      Hicham Elhassani – VP Modeling Vertical Support
      Dan S. Camper – Sr. Architect, HPCC Solutions Lab
      Making IoT Data Actionable Using Predictive Analytics
  17. Making IoT Data Actionable Using Predictive Analytics
  18. If you think connected “things” are everywhere NOW . . .

      IoT Units Installed Base by Category (Millions of Units)

      | Category | 2016 | 2017 | 2018 | 2020 |
      |---|---|---|---|---|
      | Consumer | 3,963 | 5,244 | 7,036 | 12,863 |
      | Business: Cross-Industry | 1,102 | 1,501 | 2,133 | 4,381 |
      | Business: Vertical-Specific | 1,317 | 1,635 | 2,028 | 3,171 |
      | Grand Total | 6,382 | 8,381 | 11,197 | 20,415 |

      Source: Gartner (January 2017)
  19. BIG QUESTIONS FOR INSURANCE
      • Value proposition?
      • Cyber risk?
      • What does the data say?
      • Who is driving?
      • Incremental or revolutionary?
      • Cost vs. Benefit?
  20. Importance for insurers to collect IoT data today
      Chart: Importance of collecting IoT data to company’s insurance strategy (n=120)
      • Categories: Very / Somewhat important; Neither important nor unimportant; Not at all / not very important
      • Values shown: 8%, 70%, 22%
  21. Collection of Connected Home Data
      Chart: Collection and/or Purchase of Connected Home Data (n=120)
      • Categories: Collect/purchase, use in decision-making; Collect/purchase, plan to use; Collect/purchase, but not sure how to use; Don’t collect/purchase, but plan to; Don’t collect/purchase, don’t plan to
      • Values shown: 1%, 4%, 19%, 38%, 38%
      • Collect today = 24%
      • Don’t collect today = 76%
  22. Timeline to begin collecting Connected Home data
      Chart: Anticipated Timeline for Collecting and/or Using Connected Homes Data (among those not currently using, but planning to use connected homes, n=73)
      • Categories: In next year; In next 2-3 years; In next 4-5 years; In 6+ years; Not sure
      • Values shown: 4%, 52%, 34%, 7%, 3%
      • Next 3 years = 56%
      • 4+ years = 41%
  23. Home Loss Statistics and IoT Opportunities
      Chart: share of home losses by category
      • Loss categories: Wind, Hail, Fire, Water (non-weather), Water (weather), Theft, Liability, Other
      • Values shown: 11%, 25%, 21%, 22%, 21%
      Data sources listed for each loss category:
      • Internal data: Security, Freeze detection, Leak detection, Smoke/CO, Temp/Humidity, Motion sensor, Appliances, Audio/video
      • External data: Weather API, Social M events, Loss history, Property info, Geo information
  24. Today, let’s discuss some examples
      • Occupancy: Monitoring/Prevention
      • Water Leak: Monitoring/Alert
  25. Smart Thermostat Data: Primary Residence
      Chart: HVAC Mode Observations (series: Eco; axis range 0 to 350; annotation: July 4th Weekend)
      Source: Nest
  26. Smart Thermostat Data: Vacation Home
      Chart: HVAC Mode Observations (series: Eco; axis range 0 to 120; annotation: July 4th Weekend)
      Source: Nest
  27. Example: Water Leak Detection
      Chart: time series from 3/12/2018 0:00 to 3/26/2018 19:00, value axis 0.0 to 70.0
      Annotated events: Shower, Restroom, Laundry x3, Dishwasher x2, Dishwasher, Child’s bath (multiple)
      Source: Streamlabs
  28. Example: Water Leak & Assignment of Benefits
      • File it: Assignment of benefits (AOB) is a legal tool that allows homeowners to transfer their right to collect on an insurance claim to a third party.
      • Fix it: AOB is commonly used when a homeowner employs a contractor or water remediation company to fix water damage from pipe and appliance leaks.
      • Fake it: This arrangement has permitted some contractors to inflate claims, resulting in a dramatic increase in the frequency and severity of Florida non-weather water claims.
      Source: Office of Insurance Consumer Advocate, Florida Office of Insurance Regulation
  29. Assignment of Benefits – Florida vs USA (Excl. Florida)
      Chart: Accidental Water Discharge and Appliance Leakage Loss Cost ($), 2011 to 2016, axis range 0 to 30
      Series: USA (Excl. Florida), Florida
      Source: LexisNexis Internal Research
  30. Assignment of Benefits – Tri Counties
      Counties shown: Broward, Miami-Dade, Palm Beach
      Source: LexisNexis Internal Research
  31. Assignment of Benefits – Tri Counties
      Counties shown: Broward, Miami-Dade, Palm Beach
      Source: LexisNexis Internal Research
  32. Water Leak and Geo-located losses
      Chart: Accidental Water Discharge and Appliance Leakage Frequency, 2011 to 2016, axis range 0.00% to 0.50%
      Series: Broward County, Miami-Dade County, Palm Beach County, Florida (Excl. Tri Counties)
      Source: LexisNexis Internal Research
  33. Harvey: Tweets Containing “Flood”
  34. Weather Events Digital Trail
      • Elk City tornado by the NOAA: yesterday 17/05/2017
      • Flood
      • Hail
      • Lightning
      • Tornado
      • Wildfire
  35. Stream Analytics: Push and Pull data sources
      Loss categories shown: Wind, Hail, Fire, Water (non-weather), Water (weather), Theft, Liability, Other
  36. Data platforms will be key to unlocking the full potential of this opportunity
      • Insurance lifecycle: Marketing, Contact, Quote, Underwriting, Renewal, Compliance, Claim
      • Platforms: IoT Platform, Insurer
      • Connected sources: Automation, Mitigation, Utilities, Connected Home, Security, Connected Car, Connected Self, Connected Business
  37. How to start unlocking these insights now
      Technology/Analytics to develop and deploy a pilot program
  38. HPCC Systems Architecture
  39. HPCC Systems – Pull Architecture
      • Device users register at a web portal
      • Authentication and authorization via device manufacturer’s web site
      • Authorization response includes an access token
      • All registration information saved
      • Thor queries devices for all registered users in parallel
      • Ancillary data, such as weather conditions local to every device, is periodically gathered
      • Analytics are also run periodically, as often as needed
      • ROXIE is updated with the analytics results, which are made available to external services
  40. HPCC Systems – Push Architecture
      • Authorized devices whitelisted via master device management
      • Remote devices send their data to ROXIE
      • After validation and normalization, message stored in Kafka and Couchbase
      • Thor periodically pulls new messages from Kafka for processing
      • Ancillary data, such as weather conditions local to every device, is periodically gathered
      • Analytics are also run periodically, as often as needed
      • ROXIE is updated with the analytics results, which are made available to external services
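The push flow above (whitelist check, validation and normalization, queueing, periodic batch pull) can be sketched with in-memory stand-ins. Everything in this Python example is hypothetical: the class name `PushPipeline`, the message fields `device_id` and `reading`, and the use of a `deque` in place of Kafka. It shows the shape of the flow only and does not use any HPCC Systems, Kafka, or Couchbase API.

```python
from collections import deque


class PushPipeline:
    """Toy model of the push flow; a deque stands in for the message queue."""

    def __init__(self, whitelist):
        self.whitelist = set(whitelist)  # authorized device IDs
        self.queue = deque()             # stand-in for the Kafka topic

    def receive(self, message):
        """Validate and normalize one device message; return True if accepted."""
        device_id = message.get("device_id")
        if device_id not in self.whitelist:
            return False  # device is not authorized
        try:
            # Normalize the reading to a float, whatever form it arrived in.
            reading = float(message.get("reading"))
        except (TypeError, ValueError):
            return False  # missing or malformed reading
        self.queue.append({"device_id": device_id, "reading": reading})
        return True

    def pull_batch(self):
        """Drain all queued messages, as a periodic batch job would."""
        batch = list(self.queue)
        self.queue.clear()
        return batch
```

In this model the front end accepts or rejects each message as it arrives, while the batch side only ever sees clean, normalized records. This mirrors the split between the ROXIE and Thor roles described in the slide.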