The New Trillium DQ:
Big Data Insights When and
Where You Need Them
Harald Smith
1
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus
on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blogs on Dataversity and InfoWorld
2
Only 35% of senior executives have a
high level of trust in the
accuracy of their Big Data
Analytics
KPMG 2016 Global CEO Outlook
92% of
executives are concerned
about the negative impact of
data and analytics on
corporate reputation
KPMG 2017 Global CEO Outlook
80% of AI/ML projects are stalling
due to poor data quality
Dimensional Research, 2019
ALL Data Needs
Data Quality
“Societal trust in business is
arguably at an all-time low
and, in a world increasingly
driven by data and technology,
reputations and brands are
ever harder to protect.”
EY “Trust in Data and Why it Matters”, 2017.
The importance of data quality
in the enterprise:
• Decision making
• Customer centricity
• Compliance
• Machine learning & AI
3
Key Outcomes
• Maximize the value of data quality across your organization
• Deploy and leverage data quality capabilities consistently when and
where needed
• Leverage the resources and skills your organization has invested in
whether on-premise or in the cloud
• Scale to address the data challenges you face and deliver high quality
results you can trust for critical business decisions
• Integrate best-in-class data quality into your data governance framework
to ensure visibility across your organization
• Ensure global data requirements are addressed
4
Trillium DQ version 16
• Single cross-platform scalable architecture
• Native Big Data connectivity
• Distributed execution for all functions
• Full, rich data quality capabilities and familiar interface
• Design-once, deploy-anywhere data quality projects
• Out-of-the-box data governance integration with Collibra
• Broad location and geoenrichment data options
Trillium DQ v16 Highlights
5
Ensures consistent use, processing, and outcomes for traditional or distributed platforms, on-premise or in the cloud
6
Trillium DQ – common scalable architecture
UI Server or Edge Node
• Trillium DQ metadata
• Source connectivity: ODBC, native RDBMS, delimited, fixed, and COBOL files
Distributed Cluster (Name Node)
• Distributed HDFS / distributed execution / distributed storage
• Delimited HDFS sources
2x faster data cleansing and
matching on a small
distributed cluster – more
nodes, faster time
3x faster data profiling on a
small distributed cluster
– more nodes, faster
time with linear scaling
2x faster data profiling even
on traditional platforms
Key Outcomes
• More sources of data
• Higher volumes of data
• Faster processing of data
• Fit limited time windows
• Utilize Big Data investments
• Reduced disk space usage
Scalable
Architecture
7
8
Trillium DQ for Big Data on Amazon EMR:
• Cleansed, standardized and matched over
130 million recs/hour on basic 10-node
test cluster
• Processing full transaction volume daily, and
business is growing
• Met the business SLAs with the ability to scale
Challenge Solution
Delivered higher levels of matching/data accuracy and satisfied contracts
Saved software costs – Replaced multiple solutions – Melissa Data, Oracle de-dupe, ...
Saved Amazon cluster costs and left room for company growth
Impact
Ensure accurate corporate credit ratings of 330M global
companies reach clients within contracted timeframes.
• Could not scale to deliver ratings to clients within SLAs –
impacting client fulfillment
• Need to process >800M records daily
• Lacked flexibility to address issues with similar company
names including volume and variety of data sources
“We can’t afford to miss or mix up information about businesses with similar names. Companies
count on our highly accurate predictive scoring to provide fast, accurate ratings for their potential
customers and vendors.”
Match to corporate credit data with Syncsort Trillium
Key Outcomes
• Reduce the time for business analysts to discover and understand
data on Hadoop platforms
• Allow business analysts who understand the data but have little
technical expertise to quickly find data and run data profiling in
three steps
• Let analysts explore results and drilldown to details within
seconds per view to review and then report on data issues to
business leaders
• Scale to large volumes of data sources & attributes so that
business analysts can understand the contents of any data source
needed for business decisions
9
Trillium Discovery
• Delivers enterprise-trusted Trillium Discovery on traditional and distributed
Hadoop platforms for high-volume, scalable data profiling
• Provides complete Trillium Discovery data profiling for analysis & review
• Attribute metadata, value & pattern frequencies, key & dependency analysis,
cross-source join analysis, drill down to any outlier or issue, and more…
• Provides easily configured native connectivity for Big Data sources
• Provides management and monitoring of task execution
• Integrates with the security frameworks (Kerberos, AD, LDAP) of
Big Data platforms
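The value and pattern frequency analysis listed above can be sketched in a few lines of Python. This is an illustrative simplification, not Trillium's implementation: each value is mapped to a pattern by substituting symbols for character classes, then both value and pattern frequencies are counted.

```python
from collections import Counter

def value_pattern(value: str) -> str:
    """Map each character to a pattern symbol: digit -> '9', letter -> 'A', else kept."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_column(column_values):
    """Return value and pattern frequency distributions for one attribute."""
    value_freq = Counter(column_values)
    pattern_freq = Counter(value_pattern(v) for v in column_values)
    return value_freq, pattern_freq

# Hypothetical phone-number column: two clean values, one unformatted, one junk
phones = ["617-555-0142", "617-555-0199", "6175550123", "n/a"]
values, patterns = profile_column(phones)
# patterns now shows '999-999-9999' as the dominant format, with outliers
# ('9999999999', 'A/A') that a profiling tool would flag for drilldown
```

Pattern frequencies like these are what make outliers visible at a glance: a dominant pattern with a long tail of rare patterns is a typical signal of a data quality issue.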
10
Trillium Discovery
[Diagram: Execute Profiling – profiling tasks 1 through n]
11
Trillium Discovery – Data Profiling at Scale
Select Source → Run Profiling → Explore Profiles
Stored Profiling Results
▪ Metadata & Statistics
▪ Frequency Distributions
▪ Drilldown Indices
Share &
Govern
Results
Integration
(APIs)
Notification
Collaboration
Native Connectors
▪ HDFS source directories
▪ …
Evaluate Business Rules – Drilldown to Issues
3 Steps to Run
Key Outcomes
• Match and link any data entity – customers, suppliers, products, etc. –
into a trusted single view to support a broad array of business-critical
use cases (e.g. Customer 360, fraud, AML)
• Parse and standardize complex multi-domain data, extended with
enrichment and verification of critical address and geolocation data –
all leveraging out-of-the-box templates
• Utilize “design once, deploy anywhere” approach to speed time-to-
value and focus on building data quality business logic while letting the
product handle the technical aspects of framework execution with no
coding or tuning required
• Leverage the high-performance compute power of distributed Hadoop
frameworks to process high volumes within targeted time windows to
meet critical Service Level Agreements (SLAs)
12
Trillium Quality
13
Trillium Quality
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Provide high-quality entity resolution through multi-domain deduplication
and matching with the most comprehensive set of match comparisons
available, including fuzzy matching, distance comparisons, and more.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Deploy data quality workflows as native MapReduce processes for optimal
efficiency.
• Process hundreds of millions of records of data.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
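The deduplication and fuzzy matching described above can be sketched with Python's standard library. The normalization rule, similarity threshold, and greedy clustering below are illustrative simplifications, not Trillium's matching engine:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after basic standardization (case, punctuation, spacing)."""
    def norm(s: str) -> str:
        return " ".join(s.upper().replace(".", "").replace(",", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def cluster_matches(names, threshold=0.7):
    """Greedy single-pass clustering: link each name to the first cluster it matches."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical company names with formatting and abbreviation variants
companies = ["Acme Corp", "ACME CORP.", "Acme Corporation", "Zenith Ltd"]
groups = cluster_matches(companies)
# the three Acme variants link into one cluster; Zenith stands alone
```

A production matcher would combine many comparison types (phonetic, distance, domain-specific) and transitive linking, but the core idea is the same: standardize, compare, and link records above a tuned threshold.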
Syncsort Trillium Delivers Data You Can Trust
• Trillium Discovery: Data Profiling; Business Rules & Data Quality Assessment
• Trillium Quality + Global Address Verification: Data Validation, Standardization, Enrichment & more; Matching, Entity Resolution & Verification
• Operational Integrations: Customer 360, AI/ML, Analytics & Reporting
• Data Governance: Collibra DGC, BI tools
All delivered within Trillium DQ / Trillium DQ for Big Data
14
15
Trillium Quality for Big Data to support next-generation
AML transaction monitoring and FCA compliance
• Cluster-native data verification, enrichment, and
demanding multi-field entity resolution executing
natively on Spark within the financial crimes database
• Unmodified mainframe “Golden Records” stored on
Hadoop
Global Bank
Challenge Solution
Ensure Anti-Money Laundering regulatory compliance is met through financial crimes data lake –
high performance results at massive scale.
Achieve fast time to value with flexible deployment and ease of use
Ensure the data lake is a trusted source of data feeding critical machine learning-based fraud detection
Expanding use to additional Customer Engagement solutions and applications.
Impact
Meet AML transaction monitoring and
Financial Conduct Authority (FCA) compliance
• Data volume too large and diversely scattered to
analyze
• Disparate data sources – Mainframe, RDBMS,
Cloud, etc.
• Maximize the value/ROI of the data lake
Trillium DQ + Collibra DGC
Trillium Discovery
• Market-leading, best-of-breed
data quality solution
• Profile and understand all the
critical data
• Leverage highly flexible business
rules for the right metrics
• Find ALL the DQ issues
Out-of-the-box integration of DQ
metrics with Collibra DGC
✓ Bi-directional solution
✓ Automated & synchronized
✓ Configurable to organizational
needs for all profiling results –
broad API support
Collibra DGC
• Market-leading, best-of-breed
data governance solution
• Establish a common
understanding of the business
• Automate governance and
stewardship tasks
• Interact with common workflows
Deploy Trillium’s bi-directional data
quality integration to ensure:
✓ All key business rules are
implemented and validated
✓ DQ metrics are automatically
delivered to those who need to
know when they need to know
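The business-rule metrics exchanged between the two products can be pictured with a small sketch: evaluate one rule over a set of records and produce the pass/fail summary that a governance dashboard would display. The rule, field names, and sample data below are hypothetical.

```python
def evaluate_rule(records, field, predicate):
    """Count rows passing/failing one data quality rule; return a summary metric."""
    passed = sum(1 for r in records if predicate(r.get(field)))
    return {"field": field, "passed": passed, "failed": len(records) - passed}

# Hypothetical customer records with a US postal code rule:
# the base ZIP (before any '-') must be exactly 5 characters
customers = [
    {"country": "US", "postal": "02139"},
    {"country": "US", "postal": "2139"},        # too short -> fails
    {"country": "US", "postal": "02139-4307"},  # ZIP+4 -> passes
]
metric = evaluate_rule(
    customers, "postal",
    lambda v: v is not None and len(v.split("-")[0]) == 5,
)
```

A metric like this, tracked over time against a threshold, is exactly the kind of signal that can trigger a stewardship workflow when quality degrades.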
16
Delivers fully integrated data quality with Collibra
Collibra Data Governance Center
✓ Enables non-technical users to define business
policies and data quality rules in plain
language
✓ Makes data quality metrics and performance
available to all users
Trillium Discovery
✓ Automatically receives business rules so a technical
user can convert them to executable data quality rules
✓ Continually runs data quality metrics on the desired
schedule and automatically delivers results back to
Collibra dashboards
Rulebooks to Rules – Quality Test Results
Bi-directional connectivity with constant sync
Metrics falling below thresholds can trigger a workflow in Collibra Issue Management
17
18
Connection to/from Collibra is straightforward
Packaged
Workflow
• Out-of-the-box packaged workflow with Trillium Discovery
✓ Easy to set up and run – no complex technical requirements
✓ Part of delivered product – use immediately; no add-on charges; fully supported
• Automatically connects to and delivers content via REST APIs
✓ Collibra provides a single self-service API which facilitates connecting integrations to Collibra DGC
✓ Trillium Discovery provides standard, documented REST APIs – easy to extend the application;
insulated from underlying product changes; the same APIs are used by the UI, so they are always tested
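Pushing a data quality metric over REST can be sketched as follows. The base URL, endpoint path, and payload fields are hypothetical placeholders for illustration only; the actual Trillium Discovery and Collibra DGC REST API documentation defines the real paths and schemas. The request is assembled here but not sent.

```python
import json
from urllib import request

# Hypothetical endpoint; consult the product API documentation for real paths
BASE_URL = "https://trillium.example.com/api/v1"

def build_metric_push(asset_id: str, rule: str, passed: int, failed: int):
    """Assemble an illustrative DQ-metric payload and a POST request (not sent)."""
    payload = {
        "assetId": asset_id,          # governance asset the metric attaches to
        "rule": rule,                 # business rule name
        "rowsPassed": passed,
        "rowsFailed": failed,
        "passRate": round(passed / (passed + failed), 4),
    }
    req = request.Request(
        f"{BASE_URL}/metrics",        # illustrative path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return payload, req

payload, req = build_metric_push("CUSTOMER.EMAIL", "valid_email_format",
                                 980143, 19857)
```

Because both sides expose documented REST APIs, an integration like this stays insulated from internal product changes: only the payload contract matters.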
19
Trillium DQ with Collibra DGC to:
• Profile, analyze, and measure
data quality concerns
• Integrate data quality rules and metrics between
the tools to ensure management has immediate
knowledge of improvements/issues
DNB
Challenge Solution
Pilot phase for 2 branches completed July 2019
• Able to provide proof that data wasn’t “missing”, but pinpointed a number of quality issues requiring improvements
• Able to report to regulators on the findings with proof rather than previous hearsay
Spun off requirements to provide similar work for all branches AND Head Office
Addressing Master Data Analysis on customer data and associated cleanup
Impact
Poor, inconsistent customer data, and aggressive
timelines to address regulatory compliance
requirements (BCBS239, GDPR, and AML)
• Focus on whether DNB can measure Data Quality
in an ongoing manner
• Concerns around Customer Sanctions Screening
and Transaction Monitoring
See: The Data Journey at DNB: Data Driven Customer Centricity
• Rich set of capabilities to discover, classify, profile, and evaluate data across
platforms including big data and cloud.
Don’t need to move data off the cluster and can provide drilldown to all issues
• High performance standardization and matching for entity resolution with
global coverage in batch & real time.
Meet challenging time windows for critical analytics and regulations
• Native connectivity, execution, and storage for optimized Big Data processing.
Take full advantage of the cluster to expand and scale
• Design once, deploy anywhere architecture that future-proofs existing
applications.
Leverage the skills you already have
• Easy to connect to & integrate with CRM, ERP, MDM, enrichment, and Data
Governance solutions.
Deliver consistent data quality processing and results throughout the organization
20
Trillium DQ
21
Available end of month
• Linux
• Cloudera
• CDH 5.8.3, 5.11, 5.15.2, 5.16.2
• HDP 2.6.4
• Google Cloud Platform
• Amazon EMR (Trillium Quality – now; Trillium Discovery - coming soon)
• Windows (coming soon)
Turn your data into a
trusted view of your
customers, products
and more
Power machine
learning and
advanced analytics
with reliable, fit-for-
purpose data
Gain actionable
business insights
from high-volume
disparate data sets
from across the
enterprise
Deploy industry-
leading data quality
processes at massive
scale, with no coding
or Big Data skills
required
Trillium DQ
evaluates &
transforms your
data for trusted
business insights
22
Next Steps
For more information on Trillium DQ and our other Syncsort
Trillium data quality solutions, please visit:
https://www.syncsort.com/en/solutions/data-quality
https://www.syncsort.com/en/products/trillium-dq
https://www.syncsort.com/en/products/trillium-dq-for-big-data
23
Questions?
24