G-Research - Privacera - Dataworks Summit 2018

Fintech case study on Automated
Discovery, Control and Monitoring of Data

AGENDA
▸Introduction
▸Challenges with big data
▸G-Research overview
▸Journey with HDP
▸Challenges and learnings
▸Recommendation

ABOUT PRIVACERA
SOLUTIONS FOR DISCOVERING AND
MANAGING SENSITIVE DATA
PARTNERS
GLOBAL

SOLUTIONS FOR MANAGING SENSITIVE DATA
▸Discover: what type of data is stored, and where?
▸Control: anonymize data / restrict access
▸Detect: malicious or accidental use
▸Report: security and compliance reporting

CHALLENGES WITH BIG DATA
▸High volume of data
▸Stringent security and privacy requirements
▸Multi-tenant environments
▸Multiple methods for accessing data
▸Business users using newer tools
▸Traditional tools for security and governance “retrofitted” for
the modern data architecture

4 STEPS FOR MANAGING SENSITIVE DATA
▸Data discovery
▸Access control
▸Anonymization
▸Monitoring

ABOUT G-RESEARCH
▸Quantitative research technology company
▸Statistical analysis and Big Data pipelines to recognise
patterns and extract insights from very large market
datasets
▸Forecasting analytics to predict variances in financial
markets
▸Clients operate in capital markets globally
▸Undergoing very aggressive growth and adoption of cutting
edge technology

DATASETS
▸Market Data
▸Level 1 (top of book)
▸Basic market data (instrument, bid price, bid size, ask
price, ask size)
▸Level 2 (order book or market depth)
▸Richer data (highest bid prices, lowest ask prices)

DATASETS
▸Market Data
▸Level 1 (top of book)
▸Level 2 (order book or market depth)
▸Often represented as incremental updates at
nanosecond granularity
▸HUGE dataset!
Bid side                    | Ask side
Time      Quantity  Bid     | Ask    Quantity  Time
12:32:16  120       12.25   | 12.25  150       12:32:55
12:31:01  50        12.26   | 12.27  60        12:31:19
12:33:45  150       12.25   | 12.27  100       12:34:27
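The relationship between the two levels can be sketched in a few lines of Python: Level 2 rows are aggregated per price level, and Level 1 (top of book) is the best bid and best ask that fall out of that aggregation. This is a minimal sketch, not G-Research's pipeline. The rows are copied as printed from the sample table, and the names `aggregate_depth` and `top_of_book` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Rows copied from the sample table above: (time, quantity, price).
BIDS = [
    ("12:32:16", 120, Decimal("12.25")),
    ("12:31:01", 50, Decimal("12.26")),
    ("12:33:45", 150, Decimal("12.25")),
]
ASKS = [
    ("12:32:55", 150, Decimal("12.25")),
    ("12:31:19", 60, Decimal("12.27")),
    ("12:34:27", 100, Decimal("12.27")),
]


def aggregate_depth(orders):
    """Sum the quantity resting at each price level (a Level 2 view)."""
    depth = defaultdict(int)
    for _time, quantity, price in orders:
        depth[price] += quantity
    return dict(depth)


def top_of_book(bids, asks):
    """Derive Level 1 (best bid and best ask) from Level 2 rows."""
    bid_depth = aggregate_depth(bids)
    ask_depth = aggregate_depth(asks)
    best_bid = max(bid_depth)  # highest price a buyer is offering
    best_ask = min(ask_depth)  # lowest price a seller is asking
    return {
        "bid_price": best_bid,
        "bid_size": bid_depth[best_bid],
        "ask_price": best_ask,
        "ask_size": ask_depth[best_ask],
    }


if __name__ == "__main__":
    print(top_of_book(BIDS, ASKS))
```

Prices are held as `Decimal` so that price levels compare exactly when used as dictionary keys.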

DATASETS
▸Other datasets include
▸Datapacks for additional enrichment
▸Risk analysis on positions, portfolios and strategies
▸Reference data about markets structures and corporate
entities
▸News feeds
▸…

HADOOP DATA PLATFORM – REASONS
▸High volume
▸Fast processing
▸Multi-tenant
▸Flexibility

CORE REQUIREMENTS FOR DATA
PLATFORM
Security
▸Protect intellectual property
of the company
▸Datasets processed
through the pipeline
▸...but also code
Integration
▸Integration with existing
security systems and
policies
▸Authentication
▸Encryption
▸Strict and flexible
authorization

CORE REQUIREMENTS.. CONTD..
Governance
▸Governance
▸Ability to find data easily
enabling collaboration
▸Track data changes,
impact, lineage and
maintain consistency
Multi Tenancy and Scalability
▸Multi-tenancy
▸A variety of development
teams need to work on
the same platform
and share data and
resources
▸Scalability
▸Explosive data growth

IMPORTANCE OF GOVERNANCE
“Metadata is the new Data”
- Someone who’s been quoted about a billion times

GOVERNANCE AND SECURITY - SOLUTION
DESIGN

GOVERNANCE AND SECURITY – SECURITY
FOUNDATION
▸Authentication: Kerberos + Knox
▸Authorization: Ranger + Knox
▸Auditing: Ranger
▸Data Protection: HDFS Encryption

GOVERNANCE AND SECURITY – MANAGING
RESTRICTED DATA
▸Data Discovery: Privacera + Atlas
▸Access Control: Ranger tag-based policies
▸Anonymization: Ranger Dynamic Masking
▸Monitoring: Privacera
▸Custom Spark Lineage
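How tags discovered on the data can drive both access control and anonymization might be sketched conceptually as follows. This is a toy model under stated assumptions, not Ranger's policy format or API: the policy table, the group names, the column names and the `SENSITIVE` tag are all hypothetical.

```python
import hashlib

# Hypothetical policy table: tag -> {group: action}. Not Ranger's format.
TAG_POLICIES = {
    "SENSITIVE": {"risk_team": "allow", "research_team": "mask"},
}

# Hypothetical column tags, as a discovery step might record them.
COLUMN_TAGS = {
    "trades.trader_id": {"SENSITIVE"},
    "trades.price": set(),
}


def decide(column, group):
    """Return 'allow', 'mask' or 'deny' for a group reading a column."""
    tags = COLUMN_TAGS.get(column, set())
    if not tags:
        return "allow"  # untagged data is unrestricted in this sketch
    # Groups not named in a tag's policy are denied by default.
    decisions = [TAG_POLICIES.get(tag, {}).get(group, "deny") for tag in tags]
    # The most restrictive decision across all tags wins.
    for action in ("deny", "mask", "allow"):
        if action in decisions:
            return action


def read_value(column, group, value):
    """Apply the decision: pass through, mask with a short hash, or refuse."""
    action = decide(column, group)
    if action == "allow":
        return value
    if action == "mask":
        return hashlib.sha256(str(value).encode()).hexdigest()[:8]
    raise PermissionError(f"{group} may not read {column}")
```

The point of the design is that policies are written once per tag rather than once per table or column, so newly tagged data is covered without new policies.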

CUSTOM METADATA IN ATLAS
▸Custom datasets → Atlas metamodel definitions
[Diagram: Atlas type hierarchy for custom Spark metadata]
▸Types shown: Referenceable, Entity, Process, Dataset, spark_entity,
spark_job, spark_operation, spark_join_operation, spark_dataset, string
▸spark_job attributes: id, operations, input_data, output_data
(each annotated with cardinality and indexable)
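A type such as spark_job could be described to Atlas as a JSON type definition. The sketch below builds one in Python. It is an illustration only: the field names (`entityDefs`, `superTypes`, `attributeDefs`, `typeName`, `cardinality`, `isIndexable`) are assumed to follow Atlas's v2 typedef JSON and should be checked against the Atlas version in use, and the super type, attribute types and cardinalities are guesses, since the slide shows only the attribute names.

```python
import json


def attribute(name, type_name, cardinality="SINGLE", indexable=False, optional=True):
    """Build one attribute definition (field names assumed from Atlas v2 typedefs)."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": cardinality,
        "isIndexable": indexable,
        "isOptional": optional,
        "isUnique": False,
    }


# Attribute names come from the slide; everything else is an assumption.
spark_job_def = {
    "name": "spark_job",
    "superTypes": ["Process"],  # assumption: a job is modelled as a Process
    "attributeDefs": [
        attribute("id", "string", indexable=True, optional=False),
        attribute("operations", "array<spark_operation>", cardinality="LIST"),
        attribute("input_data", "array<spark_dataset>", cardinality="SET"),
        attribute("output_data", "array<spark_dataset>", cardinality="SET"),
    ],
}

# Envelope that groups entity type definitions into one payload.
typedefs = {"entityDefs": [spark_job_def]}

if __name__ == "__main__":
    print(json.dumps(typedefs, indent=2))
```

The referenced types (spark_operation, spark_dataset) would need their own definitions before a payload like this could be registered.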

OUR JOURNEY IN DATA SECURITY
▸t=0: Standard Ranger Policies. Basic Ranger policies
▸t=1: Tag-based Ranger Policies (Atlas OoB). Access management
through data tags
▸t=2: Comprehensive Tag-based Ranger Policies (Custom Atlas
metamodels). Custom lineage applied on our own data types
▸t=3: Privacera Advanced Security Policies (Privacera). Datazone
definition to capture data movement; advanced data discovery
and tagging

OUR JOURNEY IN SECURITY
MANAGEMENT - MONITORING
▸PUBLIC DATAZONE: Table A, Table B, Folder A, Folder B.
“Sensitive tags not allowed”
▸RESTRICTED DATAZONE: Table C, Table D, Folder C, Folder D.
“Sensitive tags are OK”
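The datazone rule can be expressed as a simple monitoring check: any resource in the public datazone that carries a sensitive tag is a violation, while the same tag in the restricted datazone is fine. This is a minimal sketch of the idea, not Privacera's implementation. The `Resource` class, the `SENSITIVE` tag name and the sample tagging are hypothetical.

```python
from dataclasses import dataclass, field

SENSITIVE_TAGS = {"SENSITIVE"}  # hypothetical tag name


@dataclass
class Resource:
    name: str
    datazone: str  # "PUBLIC" or "RESTRICTED"
    tags: set = field(default_factory=set)


def find_violations(resources):
    """Return names of public-zone resources that carry a sensitive tag."""
    return [
        r.name
        for r in resources
        if r.datazone == "PUBLIC" and r.tags & SENSITIVE_TAGS
    ]


# Hypothetical state: sensitive data has been copied into FOLDER B.
RESOURCES = [
    Resource("TABLE A", "PUBLIC"),
    Resource("FOLDER B", "PUBLIC", {"SENSITIVE"}),
    Resource("TABLE C", "RESTRICTED", {"SENSITIVE"}),
    Resource("FOLDER D", "RESTRICTED"),
]

if __name__ == "__main__":
    for name in find_violations(RESOURCES):
        print(f"ALERT: sensitive tag found in public datazone: {name}")
```

Run periodically against freshly discovered tags, a check of this shape catches data movement that access policies alone would not flag.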

KEY CHALLENGES AND LEARNINGS
▸Technical challenges
▸GraphDB does not scale as metastore
▸~100k’s of entities tagged per week
▸Back-end rewritten to only use HBase + Solr
▸Open source flexibility can be a double-edged sword
▸Business challenges
▸Business process integration

SUMMARY
▸Understand your data before expanding your data lake
▸Invest in automated classification and centralized metadata
▸Manage user access through data classification
▸Anonymise data to reduce exposure
▸Monitor the use of data - “trust but verify”
