Building trust in your data lake. A fintech case study on automated data discovery, control and monitoring, leveraging Apache Ranger and Apache Atlas
This talk walks through learnings from the HDP implementation at G-Research, a leading fintech company based in London.
The team at G-Research implemented the Hortonworks Data Platform to build a data lake and
enable business teams to build analytics and machine learning tools. The team faced challenges
in accurately controlling and managing sensitive data, and business teams were unable to search
through the data due to a lack of data classification.
G-Research implemented Privacera's auto-discovery solution to precisely discover and tag data
as it is ingested into the HDP environment. The tags are pushed to Apache Atlas and then to
Apache Ranger to enable tag-based policies. The G-Research team also built custom tools to push Spark lineage
information into Atlas. Finally, Privacera's monitoring tools continuously analyze access audit information to
alert if sensitive data is moved to folders that might not be protected.
Consequently, the security team gained real visibility into the sensitive data, and business
users could search for and find data with the appropriate data classification in place.
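Privacera's monitoring implementation is proprietary, but the final check described above can be sketched conceptually: given the paths tagged as sensitive in Atlas and the path prefixes covered by restrictive policies, flag any sensitive path that falls outside every protected zone. All paths and names below are invented for illustration:

```python
def unprotected_sensitive(sensitive_paths, protected_prefixes):
    """Return sensitive HDFS paths not covered by any protected prefix."""
    return [path for path in sensitive_paths
            if not any(path.startswith(prefix) for prefix in protected_prefixes)]

# Illustrative inputs: paths tagged sensitive in Atlas, and path prefixes
# covered by restrictive Ranger policies.
sensitive = ["/lake/curated/pii/clients", "/user/alice/scratch/clients_copy"]
protected = ["/lake/curated/"]

print(unprotected_sensitive(sensitive, protected))
# ['/user/alice/scratch/clients_copy']  -> candidate for an alert
```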
Speakers
Balaji Ganesan, Co-Founder and CEO, Privacera
Alberto Romero, Big Data Architect, G-Research
6. CHALLENGES WITH BIG DATA
▸High volume of data
▸Stringent security and privacy requirements
▸Multi-tenant environments
▸Multiple methods for accessing data
▸Business users using newer tools
▸Traditional tools for security and governance “retrofitted” for
the modern data architecture
9. ABOUT G-RESEARCH
▸Quantitative research technology company
▸Statistical analysis and Big Data pipelines to recognise
patterns and extract insights from very large market
datasets
▸Forecasting analytics to predict variances in financial
markets
▸Clients operate in capital markets globally
▸Undergoing very aggressive growth and adoption of cutting
edge technology
10. DATASETS
▸Market Data
▸Level 1 (top of book)
▸Basic market data (instrument, bid price, bid size, ask
price, ask size)
▸Level 2 (order book or market depth)
▸Richer data (highest bid prices, lowest ask prices)
11. DATASETS
▸Market Data
▸Level 1 (top of book)
▸Level 2 (order book or market depth)
▸Often represented as incremental updates at
nanosecond granularity (see the sketch after the table below)
▸HUGE dataset!
Time      Quantity  Bid    |  Ask    Quantity  Time
12:32:16  120       12.25  |  12.25  150       12:32:55
12:31:01  50        12.26  |  12.27  60        12:31:19
12:33:45  150       12.25  |  12.27  100       12:34:27
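As a hedged illustration of what one such incremental update might look like in code (the field names and the instrument "XYZ" are invented, not G-Research's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Side(Enum):
    BID = "bid"
    ASK = "ask"

@dataclass(frozen=True)
class BookUpdate:
    """One incremental Level 2 order-book update."""
    instrument: str   # e.g. an exchange symbol
    ts_ns: int        # event time in nanoseconds since the epoch
    side: Side
    price: float
    quantity: int     # resting quantity now at this price level; 0 removes it

# Two updates corresponding to the top row of the table above
updates = [
    BookUpdate("XYZ", 1_561_000_000_123_456_789, Side.BID, 12.25, 120),
    BookUpdate("XYZ", 1_561_000_000_987_654_321, Side.ASK, 12.25, 150),
]
```

At nanosecond granularity, a single liquid instrument can generate millions of such records per day, which is what makes the dataset "HUGE".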
12. DATASETS
▸Other datasets include
▸Datapacks for additional enrichment
▸Risk analysis on positions, portfolios and strategies
▸Reference data about markets structures and corporate
entities
▸News feeds
▸…
13. HADOOP DATA PLATFORM – REASONS
▸Volume of data
▸Fast processing
▸Multiple use cases and tenants
▸Flexibility
▸Development
▸Business
15. CORE REQUIREMENTS FOR DATA PLATFORM
▸Security
▸Protect intellectual property of the company
▸Datasets processed through the pipeline
▸...but also code
▸Integration with existing security systems and policies
▸Authentication
▸Encryption
▸Strict and flexible authorization
16. CORE REQUIREMENTS, CONTINUED…
▸Governance
▸Ability to find data easily enabling collaboration
▸Track data changes, impact, lineage and maintain
consistency
▸Multi-tenancy
▸Variety of development teams need to work on the
same platform and share data and resources
▸Scalability
▸Explosive data growth
17. IMPORTANCE OF GOVERNANCE
“Data is the new oil”
- Someone who’s been quoted about a billion times
“Metadata is the new Data”
18. GOVERNANCE AND SECURITY - SOLUTION DESIGN
▸Kerberos based authentication
▸Integrated with AD
▸HDFS encryption at rest
▸Hive
▸Atlas
▸Great framework... but limited as a finished product,
e.g. no Spark process lineage out of the box
▸Custom implementation of Spark lineage in Atlas (sketched below)
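The talk does not show the custom tool itself, but conceptually a Spark lineage hook registers each job run as a Process-derived entity in Atlas whose inputs and outputs reference existing dataset entities; those edges are what Atlas renders as lineage. A minimal sketch against Atlas's v2 REST API, where the endpoint, credentials, and qualified names are all assumptions:

```python
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"  # assumed endpoint
AUTH = ("svc_lineage", "secret")  # in production, Kerberos/SPNEGO instead

def ref(type_name, qualified_name):
    """Reference an existing Atlas entity by its unique qualifiedName."""
    return {"typeName": type_name,
            "uniqueAttributes": {"qualifiedName": qualified_name}}

# One Spark job run as a Process-derived entity linking inputs to outputs;
# the inputs/outputs edges are what Atlas draws as lineage.
entity = {"entity": {
    "typeName": "spark_job",  # custom type, see slide 22
    "attributes": {
        "qualifiedName": "spark_job.enrich_trades@prod",
        "name": "enrich_trades",
        "inputs":  [ref("hdfs_path", "hdfs://lake/raw/trades@prod")],
        "outputs": [ref("hdfs_path", "hdfs://lake/curated/trades@prod")],
    },
}}

requests.post(f"{ATLAS}/entity", json=entity, auth=AUTH).raise_for_status()
```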
19. GOVERNANCE AND SECURITY - SOLUTION DESIGN
▸Automatic discovery and tagging of data using Privacera
▸Tags created at HDFS folder level
▸Privacera publishes "tags" to Atlas
▸Leverage Atlas-Ranger integration
▸Tag based policies in Ranger
▸Highly confidential data will have restricted access
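For example, a tag-based policy restricting a HIGHLY_CONFIDENTIAL tag to a single cleared group could be created through Ranger's public REST API. The endpoint, tag service name, tag value, and group below are invented for illustration:

```python
import requests

RANGER = "https://ranger.example.com:6182"  # assumed endpoint
AUTH = ("admin", "secret")

# Allow only one cleared AD group to read data carrying the tag; because the
# policy targets the tag rather than a path or table, it follows the data
# wherever Atlas applies the tag.
policy = {
    "service": "cl1_tag",  # the cluster's Ranger tag service (assumed name)
    "name": "highly-confidential-restricted",
    "resources": {"tag": {"values": ["HIGHLY_CONFIDENTIAL"]}},
    "policyItems": [{
        "groups": ["quant-research-cleared"],
        "accesses": [{"type": "hive:select", "isAllowed": True},
                     {"type": "hdfs:read", "isAllowed": True}],
    }],
}

requests.post(f"{RANGER}/service/public/v2/api/policy",
              json=policy, auth=AUTH).raise_for_status()
```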
22. CUSTOM METADATA IN ATLAS
▸Custom Atlas metamodel definitions for the team's Spark datasets
[Diagram: custom type hierarchy — spark_entity extends Referenceable; spark_dataset extends Dataset; spark_operation extends Process, with spark_job and spark_join_operation as subtypes; spark_job carries attributes id, operations, input_data and output_data, each with cardinality and indexable settings]
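A trimmed sketch of registering such typedefs through Atlas's v2 types API. The endpoint and credentials are assumed, and only the id and operations attributes of spark_job are shown, since input/output edges are already inherited from the built-in Process type:

```python
import requests

ATLAS = "https://atlas.example.com:21000/api/atlas/v2"  # assumed endpoint
AUTH = ("svc_metadata", "secret")

typedefs = {"entityDefs": [
    {"name": "spark_dataset", "superTypes": ["DataSet"], "attributeDefs": []},
    {"name": "spark_operation", "superTypes": ["Process"], "attributeDefs": []},
    {"name": "spark_job", "superTypes": ["spark_operation"],
     "attributeDefs": [
         {"name": "id", "typeName": "string", "cardinality": "SINGLE",
          "isIndexable": True, "isOptional": False, "isUnique": True},
         {"name": "operations", "typeName": "array<string>",
          "cardinality": "LIST", "isIndexable": False,
          "isOptional": True, "isUnique": False},
     ]},
]}

requests.post(f"{ATLAS}/types/typedefs",
              json=typedefs, auth=AUTH).raise_for_status()
```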
23. OUR JOURNEY IN SECURITY MANAGEMENT
[Timeline diagram, t=0 to t=3: standard Ranger policies → tag-based Ranger policies (Atlas out-of-the-box) → comprehensive tag-based Ranger policies (custom Atlas metamodels) → Privacera advanced security policies]
▸Basic Ranger policies
▸Access management through data tags
▸Custom lineage applied on our own data types
▸Advanced data discovery and tagging
▸Datazone definition to capture data movement
24. KEY CHALLENGES AND LEARNINGS
▸Technical challenges
▸GraphDB does not scale as a metastore
▸Hundreds of thousands of entities tagged per week
▸Back-end rewritten to use only HBase + Solr
▸Open source flexibility can be a double-edged sword
▸Business challenges
▸Business process integration
25. SUMMARY
▸Understand your data before expanding your data lake
▸Invest in automated classification and centralized metadata
▸Manage user access through data classification
▸Anonymise data to reduce exposure
▸Monitor the use of data - “trust but verify”