2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
DataPatterns - Profiling in ECL Watch
Dan S. Camper
Thaumaturge
HPCC Systems Solutions Lab
Topics
• What is DataPatterns?
• Improvements since last year
• ECL Standard Library integration
• ECL Watch integration
• Differences between installations
What is DataPatterns?
DataPatterns – What is it?
• ECL bundle that provides some basic data profiling and research tools to an ECL programmer
• Today, it is primarily a data profiling tool
• Numerous parameters for controlling analysis and output
  • Analyze all rows in a dataset or just a sample
  • Analyze all fields or only certain fields
  • Enable only specified profiling checks
  • Specify returned pattern counts
• Creates a single dataset as a result
  • One record for each field analyzed
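The parameters listed above can be sketched as a single call. This is a minimal example, assuming the bundle is installed and that the parameter names (fieldListStr, maxPatterns, features, sampleSize) match the bundle's README; the file path and field names are hypothetical placeholders.

```ecl
IMPORT DataPatterns;

// Hypothetical logical file; substitute one that exists on your cluster
filePath := '~thor::my_sample_data';

// RECORDOF(..., LOOKUP) asks the compiler to read the layout from the file's metadata
ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

profileResults := DataPatterns.Profile
    (
        ds,
        fieldListStr := 'first_name,last_name',       // hypothetical field names; omit to analyze all fields
        maxPatterns := 10,                            // return at most 10 patterns per field
        features := 'fill_rate,cardinality,patterns', // enable only these profiling checks
        sampleSize := 25                              // examine roughly 25% of the rows
    );

// The result is one dataset with one record per analyzed field
OUTPUT(profileResults, ALL, NAMED('profileResults'));
```

Check the parameter names and defaults against the version of the bundle you have installed before relying on them.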
Improvements
Improvements Since Last Year
• Profile()
  • Cardinality Breakdown
  • Improved UTF-8 handling
  • Support additional data types
    • Embedded child records
    • Child datasets
    • SET OF
  • Pretty results
• BestRecordStructure()
  • Optional generated TRANSFORM()
• Lots of bug fixes
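A minimal sketch of BestRecordStructure() with the optional TRANSFORM() output, assuming the bundle's emitTransform parameter as described in its README; the file path is a hypothetical placeholder.

```ecl
IMPORT DataPatterns;

// Hypothetical logical file; substitute one that exists on your cluster
filePath := '~thor::my_sample_data';
ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

// emitTransform := TRUE (assumed parameter name) asks for a TRANSFORM()
// to be generated alongside the suggested record definition
recordDefinition := DataPatterns.BestRecordStructure(ds, emitTransform := TRUE);

OUTPUT(recordDefinition, NAMED('recordDefinition'), ALL);
```

The output is ECL source text, intended to be copied into your own code and reviewed rather than executed directly.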
ECL Standard Library Integration
DataPatterns Grows Up …
• Portions of bundle integrated with ECL Standard Library
  • Profile()
  • BestRecordStructure()
• As of HPCC Systems 7.4.0
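With the Standard Library version, no bundle installation should be needed. A minimal sketch, assuming HPCC Systems 7.4.0 or later and the Std.DataPatterns module name; the file path is a hypothetical placeholder.

```ecl
IMPORT Std;

// Hypothetical logical file; substitute one that exists on your cluster
filePath := '~thor::my_sample_data';
ds := DATASET(filePath, RECORDOF(filePath, LOOKUP), FLAT);

// Both integrated functions, called with their default parameters
profileResults := Std.DataPatterns.Profile(ds);
bestLayout := Std.DataPatterns.BestRecordStructure(ds);

OUTPUT(profileResults, ALL, NAMED('profileResults'));
OUTPUT(bestLayout, ALL, NAMED('bestLayout'));
```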
… And Gains A User Interface in ECL Watch
Logical File’s Record Structure
Executing DataPatterns.Profile()
DataPatterns.Profile() In Progress
DataPatterns.Profile() Workunit ECL
DataPatterns.Profile() Raw Results
DataPatterns.Profile() Report Results
Differences Between Installations
Differences Between Installations
• ECL Bundle
  • Contains additional functions
    • ProfileFromPath()
    • BestRecordStructureFromPath()
  • Contains support for pretty report
  • Available from https://github.com/hpcc-systems/DataPatterns
• ECL Standard Library
  • Does not support pretty report
  • Available with HPCC Systems 7.4.0 and later
• ECL Watch
  • Supports only data profiling
  • Available with HPCC Systems 7.4.0 and later
DataDetectors – What Is This Data?
Bloom Filter Models
• Person.FirstName
• Person.LastName
• Geo.USA.Address.StreetName
• Geo.USA.Address.CityName
• Geo.USA.Address.PostalCode
• Geo.USA.PhoneAreaCode
• Geo.CountryName
• Geo.CountryCode
• Identifier.USA.StockSymbol
Heuristic Models
• Calendar.Date
• Calendar.Month
• Calendar.Quarter
• Calendar.Year
• Calendar.YearMonth
• Currency
• Group.StockExchange
• Geo.USA.Address.State
• Geo.Longitude
• Geo.Latitude
• Geo.LatLon
• Identifier.USA.PhoneNumber
• Identifier.EmailAddress
• Identifier.RecordID
• Identifier.WebSiteURL
DataDetectors Test – Raw Data
DataDetectors Test – Data Examination Results
Cloud IDE – ECL Programming in a browser-based IDE
Cloud IDE
fini
View this presentation on YouTube:
https://www.youtube.com/watch?v=TtcrOcyf6gQ&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=6&t=0s (13:19)
