Building trust in your data lake. A fintech case study on automated data discovery, control and monitoring, leveraging Apache Ranger and Apache Atlas
This talk walks through learnings from the HDP implementation at G-Research, a leading fintech company based in London.
The team at G-Research implemented the Hortonworks Data Platform to build a data lake and
enable business teams to build analytics and machine learning tools. The team faced challenges
in accurately controlling and managing sensitive data, and business teams were unable to search
through the data due to a lack of data classification.
G-Research implemented Privacera's auto-discovery solution to precisely discover and tag data
as it is ingested into the HDP environment. The tags are pushed to Apache Atlas and then to
Apache Ranger to enable tag-based policies. The G-Research team also built custom tools to push Spark lineage
information into Atlas. Finally, Privacera's monitoring tools continuously analyze access audit information to
alert if sensitive data is moved to folders that might not be protected.
Consequently, the security team gained real visibility into the sensitive data, and business
users could search for and find data with the appropriate data classification in place.
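Privacera's monitoring implementation is proprietary, but the final check described above can be sketched conceptually: given the paths tagged as sensitive in Atlas and the path prefixes covered by restrictive policies, flag any sensitive path that falls outside every protected zone. All paths and names below are invented for illustration:

```python
def unprotected_sensitive(sensitive_paths, protected_prefixes):
    """Return sensitive HDFS paths not covered by any protected prefix."""
    return [path for path in sensitive_paths
            if not any(path.startswith(prefix) for prefix in protected_prefixes)]

# Illustrative inputs: paths tagged sensitive in Atlas, and path prefixes
# covered by restrictive Ranger policies.
sensitive = ["/lake/curated/pii/clients", "/user/alice/scratch/clients_copy"]
protected = ["/lake/curated/"]

print(unprotected_sensitive(sensitive, protected))
# ['/user/alice/scratch/clients_copy']  -> candidate for an alert
```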
Speakers
Balaji Ganesan, Co-Founder and CEO, Privacera
Alberto Romero, Big Data Architect, G-Research
6. CHALLENGES WITH BIG DATA
▸High volume of data
▸Stringent security and privacy requirements
▸Multi-tenant environments
▸Multiple methods for accessing data
▸Business users using newer tools
▸Traditional tools for security and governance “retrofitted” for
the modern data architecture
9. ABOUT G-RESEARCH
▸Quantitative research technology company
▸Statistical analysis and Big Data pipelines to recognise
patterns and extract insights from very large market
datasets
▸Forecasting analytics to predict variances in financial
markets
▸Clients operate in capital markets globally
▸Undergoing very aggressive growth and adoption of cutting
edge technology
10. DATASETS
▸Market Data
▸Level 1 (top of book)
▸Basic market data (instrument, bid price, bid size, ask
price, ask size)
▸Level 2 (order book or market depth)
▸Richer data (highest bid prices, lowest ask prices)
11. DATASETS
▸Market Data
▸Level 1 (top of book)
▸Level 2 (order book or market depth)
▸Often represented as incremental updates at
nanosecond granularity (see the sketch after the table below)
▸HUGE dataset!
Time      Quantity  Bid    |  Ask    Quantity  Time
12:32:16  120       12.25  |  12.25  150       12:32:55
12:31:01  50        12.26  |  12.27  60        12:31:19
12:33:45  150       12.25  |  12.27  100       12:34:27
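As a hedged illustration of what one such incremental update might look like in code (the field names and the instrument "XYZ" are invented, not G-Research's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Side(Enum):
    BID = "bid"
    ASK = "ask"

@dataclass(frozen=True)
class BookUpdate:
    """One incremental Level 2 order-book update."""
    instrument: str   # e.g. an exchange symbol
    ts_ns: int        # event time in nanoseconds since the epoch
    side: Side
    price: float
    quantity: int     # resting quantity now at this price level; 0 removes it

# Two updates corresponding to the top row of the table above
updates = [
    BookUpdate("XYZ", 1_561_000_000_123_456_789, Side.BID, 12.25, 120),
    BookUpdate("XYZ", 1_561_000_000_987_654_321, Side.ASK, 12.25, 150),
]
```

At nanosecond granularity, a single liquid instrument can generate millions of such records per day, which is what makes the dataset "HUGE".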
12. DATASETS
▸Other datasets include
▸Datapacks for additional enrichment
▸Risk analysis on positions, portfolios and strategies
▸Reference data about markets structures and corporate
entities
▸News feeds
▸…
13. HADOOP DATA PLATFORM – REASONS
▸Volume of data
▸Fast processing
▸Multiple use cases and tenants
▸Flexibility
▸Development
▸Business
15. CORE REQUIREMENTS FOR DATA PLATFORM
▸Security
▸Protect intellectual property of the company
▸Datasets processed through the pipeline
▸...but also code
▸Integration with existing security systems and policies
▸Authentication
▸Encryption
▸Strict and flexible authorization
16. CORE REQUIREMENTS, CONTINUED…
▸Governance
▸Ability to find data easily enabling collaboration
▸Track data changes, impact, lineage and maintain
consistency
▸Multi-tenancy
▸Variety of development teams need to work on the
same platform and share data and resources
▸Scalability
▸Explosive data growth
17. IMPORTANCE OF GOVERNANCE
“Data is the new oil”
- Someone who’s been quoted about a billion times
“Metadata is the new Data”
18. GOVERNANCE AND SECURITY - SOLUTION DESIGN
▸Kerberos based authentication
▸Integrated with AD
▸HDFS encryption at rest
▸Hive
▸Atlas
▸Great framework... but limited as a finished product,
e.g. no Spark process lineage out of the box
▸Custom implementation of Spark lineage in Atlas (sketched below)
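The talk does not show the custom tool itself, but conceptually a Spark lineage hook registers each job run as a Process-derived entity in Atlas whose inputs and outputs reference existing dataset entities; those edges are what Atlas renders as lineage. A minimal sketch against Atlas's v2 REST API, where the endpoint, credentials, and qualified names are all assumptions:

```python
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"  # assumed endpoint
AUTH = ("svc_lineage", "secret")  # in production, Kerberos/SPNEGO instead

def ref(type_name, qualified_name):
    """Reference an existing Atlas entity by its unique qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}

# One Spark job run as a Process-derived entity linking inputs to outputs;
# the inputs/outputs edges are what Atlas draws as lineage.
entity = {"entity": {
    "typeName": "spark_job",  # custom type, see slide 22
    "attributes": {
        "qualifiedName": "spark_job.enrich_trades@prod",
        "name": "enrich_trades",
        "inputs":  [ref("hdfs_path", "hdfs://lake/raw/trades@prod")],
        "outputs": [ref("hdfs_path", "hdfs://lake/curated/trades@prod")],
    },
}}

requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH).raise_for_status()
```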
19. GOVERNANCE AND SECURITY - SOLUTION DESIGN
▸Automatic discovery and tagging of data using Privacera
▸Tags created at HDFS folder level
▸Privacera publishes "tags" to Atlas
▸Leverage Atlas-Ranger integration
▸Tag based policies in Ranger
▸Highly confidential data will have restricted access
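For example, a tag-based policy restricting a HIGHLY_CONFIDENTIAL tag to a single cleared group could be created through Ranger's public REST API. The endpoint, tag service name, tag value, and group below are invented for illustration:

```python
import requests

RANGER = "https://ranger.example.com:6182"  # assumed endpoint
AUTH = ("admin", "secret")

# Allow only one cleared AD group to read data carrying the tag; because the
# policy targets the tag rather than a path or table, it follows the data
# wherever Atlas applies the tag.
policy = {
    "service": "cl1_tag",  # the cluster's Ranger tag service (assumed name)
    "name": "highly-confidential-restricted",
    "resources": {"tag": {"values": ["HIGHLY_CONFIDENTIAL"]}},
    "policyItems": [{
        "groups": ["quant-research-cleared"],
        "accesses": [{"type": "hive:select", "isAllowed": True},
                     {"type": "hdfs:read", "isAllowed": True}],
    }],
}

requests.post(f"{RANGER}/service/public/v2/api/policy",
              json=policy, auth=AUTH).raise_for_status()
```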
22. CUSTOM METADATA IN ATLAS
▸Custom Atlas metamodel definitions for the team's Spark datasets
[Diagram: custom type hierarchy — spark_entity extends Referenceable; spark_dataset extends Dataset; spark_operation extends Process, with spark_job and spark_join_operation as subtypes; spark_job carries attributes id, operations, input_data and output_data, each with cardinality and indexable settings]
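A trimmed sketch of registering such typedefs through Atlas's v2 types API. The endpoint and credentials are assumed, and only the id and operations attributes of spark_job are shown, since input/output edges are already inherited from the built-in Process type:

```python
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"  # assumed endpoint
AUTH = ("svc_metadata", "secret")

typedefs = {"entityDefs": [
    {"name": "spark_dataset", "superTypes": ["DataSet"], "attributeDefs": []},
    {"name": "spark_operation", "superTypes": ["Process"], "attributeDefs": []},
    {"name": "spark_job", "superTypes": ["spark_operation"],
     "attributeDefs": [
         {"name": "id", "typeName": "string", "cardinality": "SINGLE",
          "isIndexable": True, "isOptional": False, "isUnique": True},
         {"name": "operations", "typeName": "array<string>",
          "cardinality": "LIST", "isIndexable": False,
          "isOptional": True, "isUnique": False},
     ]},
]}

requests.post(f"{ATLAS}/types/typedefs",
              json=typedefs, auth=AUTH).raise_for_status()
```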
23. OUR JOURNEY IN SECURITY MANAGEMENT
[Timeline diagram, t=0 to t=3: standard Ranger policies → tag-based Ranger policies (Atlas out-of-the-box) → comprehensive tag-based Ranger policies (custom Atlas metamodels) → Privacera advanced security policies]
▸Basic Ranger policies
▸Access management through data tags
▸Custom lineage applied on our own data types
▸Advanced data discovery and tagging
▸Datazone definition to capture data movement
24. KEY CHALLENGES AND LEARNINGS
▸Technical challenges
▸GraphDB does not scale as a metastore
▸Hundreds of thousands of entities tagged per week
▸Back-end rewritten to use only HBase + Solr
▸Open source flexibility can be a double-edged sword
▸Business challenges
▸Business process integration
25. SUMMARY
▸Understand your data before expanding your data lake
▸Invest in automated classification and centralized metadata
▸Manage user access through data classification
▸Anonymise data to reduce exposure
▸Monitor the use of data - “trust but verify”