SlideShare a Scribd company logo
1 of 27
Fintech case study on Automated
Discovery, Control and Monitoring Data
SPEAKERS
ALBERTO BALAJI
AGENDA
▸Introduction
▸Challenges with big data
▸G-Research overview
▸Journey with HDP
▸Challenges and learnings
▸Recommendation
PARTNE
RSGLOBAL
SOLUTIONS FOR DISCOVERING AND
MANAGING SENSITIVE DATA
ABOUT PRIVACERA
DETECT
MALICIOUS OR
ACCIDENTAL
USE
CONTROL
ANONYMIZE
DATA/
RESTRICT
ACCESS
DISCOVER
WHAT TYPE OF
DATA STORED
AND WHERE?
REPORT
SECURITY AND
COMPLIANCE
REPORTING
SOLUTIONS FOR MANAGING SENSITIVE DATA
CHALLENGES WITH BIG DATA
▸High volume of data
▸Stringent security and privacy requirements
▸Multi-tenant environments
▸Multiple methods for accessing data
▸Business users using newer tools
▸Traditional tools for security and governance “retrofitted” for
the modern data architecture
DATA
DISCOVERY
ACCESS
CONTROL ANONYMIZATION MONITORING
4 STEPS FOR MANAGING SENSITIVE DATA
G-RESEARCH
CASE STUDY
ABOUT G-RESEARCH
▸Quantitative research technology company
▸Statistical analysis and Big Data pipelines to recognise
patterns and extract insights from very large market
datasets
▸Forecasting analytics to predict variances in financial
markets
▸Clients operate in capital markets globally
▸Undergoing very aggressive growth and adoption of cutting
edge technology
DATASETS
▸Market Data
▸Level 1 (top of book)
▸Basic market data (instrument, bid price, bid size, ask
price, ask size)
▸Level 2 (order book or market depth)
▸Richer data (highest bid prices, lowest ask prices)
DATASETS
▸Market Data
▸Level 1 (top of book)
▸Level 2 (order book or market depth)
▸Often represented as incremental updates at
nanosecond granularity
▸HUGE dataset!
Time Quantity Bid Ask Quantity Time
12:32:16 120 12.25 12.25 150 12:32:55
12:31:01 50 12.26 12.27 60 12:31:19
12:33:45 150 12.25 12.27 100 12:34:27
DATASETS
▸Other datasets include
▸Datapacks for additional enrichment
▸Risk analysis on positions, portfolios and strategies
▸Reference data about markets structures and corporate
entities
▸News feeds
▸…
HADOOP DATA PLATFORM – REASONS
▸Volume of data
▸Fast processing
▸Multiple use cases and tenants
▸Flexibility
▸Development
▸Business
TECHNOLOGY STACK
CORE REQUIREMENTS FOR DATA
PLATFORM▸Security
▸Protect intellectual property of the company
▸Datasets processed through the pipeline
▸...but also code
▸Integration with existing security systems and policies
▸Authentication
▸Encryption
▸Strict and flexible authorization
CORE REQUIREMENTS, CONTINUED…
▸Governance
▸Ability to find data easily enabling collaboration
▸Track data changes, impact, lineage and maintain
consistency
▸Multi-tenancy
▸Variety of development teams need to work on the the
same platform and share data and resources
▸Scalability
▸Explosive data growth
IMPORTANCE OF GOVERNANCE
“Data is the new oil”
- Someone who’s been quoted about a billion times
“Metadata is the new Data”
GOVERNANCE AND SECURITY - SOLUTION
DESIGN
▸Kerberos based authentication
▸Integrated with AD
▸HDFS encryption at rest
▸Hive
▸Atlas
▸Great framework...but limited as a finished product
 i.e. Spark process lineage
 Custom implementation Spark lineage in Atlas
GOVERNANCE AND SECURITY - SOLUTION
DESIGN
▸Automatic discovery and tagging of data using Privacera
▸Tags created at HDFS folder level
▸Privacera publishes ”tags” to Atlas
▸Leverage Atlas-Ranger integration
▸Tag based policies in Ranger
▸Highly confidential data will have restricted access
GOVERNANCE AND SECURITY - SOLUTION
DESIGN
ARCHITECTURE
CUSTOM METADATA IN ATLAS
▸Custom datasets  Atlas metamodel definitions
Type
spark_job
id
cardinality
indexable
operations
cardinality
indexable
input_data
cardinality
indexable
output_data
cardinality
indexable
Entity
Type
spark_join_
operation
Type
string
Type
spark_dataset
Type
Dataset
Type
Process
Type
spark_operation
Attributes
Type
spark_entity
Type
Referenceable
OUR JOURNEY IN SECURITY
MANAGEMENT ▸Datazone definition to
capture data movement
▸Advanced data discovery
and tagging
▸Custom lineage applied
on our own data types
Standard Ranger
Policies
Tag-based Ranger
Policies
Comprehensive
Tag-based Ranger
Policies
Privacera Advanced
Security Policies
Atlas OoB
Custom Atlas
metamodels
Privacera
▸Access management
through data tags
▸Basic Ranger policies
t=0
t=1
t=2
t=3
KEY CHALLENGES AND LEARNINGS
▸Technical challenges
▸GraphDB does not scale as metastore
▸~100k’s entities tagged per week
▸Back-end rewritten to only use HBase + Solr
▸Open source flexibility can be a 2-sided coin
▸Business challenges
▸Business process integration
SUMMARY
▸Understand your data before expanding your data lake
▸Invest in automated classification and centralized metadata
▸Manage user access through data classification
▸Anonymise data to reduce exposure
▸Monitor the use of data - “trust but verify”
WE’RE
QUESTIONS ?

More Related Content

What's hot

Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Denodo
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationDenodo
 
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data VirtualizationDenodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data VirtualizationDenodo
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your PortfolioDenodo
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataCloudera, Inc.
 
Take your Data Management Practice to the Next Level with Denodo 7
Take your Data Management Practice to the Next Level with Denodo 7Take your Data Management Practice to the Next Level with Denodo 7
Take your Data Management Practice to the Next Level with Denodo 7Denodo
 
The 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeThe 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeDataWorks Summit
 
Denodo DataFest 2017: Company Leadership from Data Leadership
Denodo DataFest 2017: Company Leadership from Data LeadershipDenodo DataFest 2017: Company Leadership from Data Leadership
Denodo DataFest 2017: Company Leadership from Data LeadershipDenodo
 
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)Denodo
 
Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Denodo
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Denodo
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin RobbinsData Con LA
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionDataStax
 
The Curse of the Data Lake Monster
The Curse of the Data Lake MonsterThe Curse of the Data Lake Monster
The Curse of the Data Lake MonsterThoughtworks
 
Denodo Platform 7.0: What's New?
Denodo Platform 7.0: What's New?Denodo Platform 7.0: What's New?
Denodo Platform 7.0: What's New?Denodo
 
Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)Denodo
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...SoftServe
 
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...Denodo
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldDataWorks Summit/Hadoop Summit
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessDataWorks Summit
 

What's hot (20)

Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)Secure Your Data with Virtual Data Fabric (ASEAN)
Secure Your Data with Virtual Data Fabric (ASEAN)
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data VirtualizationDenodo DataFest 2017: Conquering the Edge with Data Virtualization
Denodo DataFest 2017: Conquering the Edge with Data Virtualization
 
3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio3 Reasons Data Virtualization Matters in Your Portfolio
3 Reasons Data Virtualization Matters in Your Portfolio
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Take your Data Management Practice to the Next Level with Denodo 7
Take your Data Management Practice to the Next Level with Denodo 7Take your Data Management Practice to the Next Level with Denodo 7
Take your Data Management Practice to the Next Level with Denodo 7
 
The 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data LakeThe 5 Keys to a Killer Data Lake
The 5 Keys to a Killer Data Lake
 
Denodo DataFest 2017: Company Leadership from Data Leadership
Denodo DataFest 2017: Company Leadership from Data LeadershipDenodo DataFest 2017: Company Leadership from Data Leadership
Denodo DataFest 2017: Company Leadership from Data Leadership
 
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)
Data Virtualization enabled Data Fabric: Operationalize the Data Lake (APAC)
 
Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.Why Data Virtualization? An Introduction.
Why Data Virtualization? An Introduction.
 
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
Delivering Self-Service Analytics using Big Data and Data Virtualization on t...
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
The Curse of the Data Lake Monster
The Curse of the Data Lake MonsterThe Curse of the Data Lake Monster
The Curse of the Data Lake Monster
 
Denodo Platform 7.0: What's New?
Denodo Platform 7.0: What's New?Denodo Platform 7.0: What's New?
Denodo Platform 7.0: What's New?
 
Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)Data Virtualization: From Zero to Hero (Middle East)
Data Virtualization: From Zero to Hero (Middle East)
 
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
Big Data Analytics: Reference Architectures and Case Studies by Serhiy Haziye...
 
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...
Analyst Keynote: Forrester: Data Fabric Strategy is Vital for Business Innova...
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant AccessEntity Resolution Service - Bringing Petabytes of Data Online for Instant Access
Entity Resolution Service - Bringing Petabytes of Data Online for Instant Access
 

Similar to Building trust in your data lake. A fintech case study on automated data discovery, control and monitoring data leveraging Apache Ranger and Apache Atlas

G-Research - Privacera - Dataworks Summit 2018
G-Research - Privacera - Dataworks Summit 2018G-Research - Privacera - Dataworks Summit 2018
G-Research - Privacera - Dataworks Summit 2018Alberto Romero
 
Next generation security analytics
Next generation security analyticsNext generation security analytics
Next generation security analyticsChristian Have
 
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDScalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDHostedbyConfluent
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformMapR Technologies
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
 
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...Veritas Technologies LLC
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
How to streamline data governance and security across on-prem and cloud?
How to streamline data governance and security across on-prem and cloud?How to streamline data governance and security across on-prem and cloud?
How to streamline data governance and security across on-prem and cloud?Privacera
 
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...Webinar - Case Study: ProtectWise enhances network security with DataStax alw...
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...DataStax
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...DataStax
 
Article data-centric security key to cloud and digital business
Article   data-centric security key to cloud and digital businessArticle   data-centric security key to cloud and digital business
Article data-centric security key to cloud and digital businessUlf Mattsson
 
Data centric security key to cloud and digital business
Data centric security key to cloud and digital businessData centric security key to cloud and digital business
Data centric security key to cloud and digital businessUlf Mattsson
 
NETWORK SECURITY MONITORING WITH BIG DATA ANALYTICS - Nguyễn Minh Đức
NETWORK SECURITY  MONITORING WITH BIG  DATA ANALYTICS - Nguyễn Minh ĐứcNETWORK SECURITY  MONITORING WITH BIG  DATA ANALYTICS - Nguyễn Minh Đức
NETWORK SECURITY MONITORING WITH BIG DATA ANALYTICS - Nguyễn Minh ĐứcSecurity Bootcamp
 
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...North Texas Chapter of the ISSA
 
Managing Data from the Edge to HPC
Managing Data from the Edge to HPCManaging Data from the Edge to HPC
Managing Data from the Edge to HPCinside-BigData.com
 

Similar to Building trust in your data lake. A fintech case study on automated data discovery, control and monitoring data leveraging Apache Ranger and Apache Atlas (20)

G-Research - Privacera - Dataworks Summit 2018
G-Research - Privacera - Dataworks Summit 2018G-Research - Privacera - Dataworks Summit 2018
G-Research - Privacera - Dataworks Summit 2018
 
Next generation security analytics
Next generation security analyticsNext generation security analytics
Next generation security analytics
 
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigIDScalable Data Management for Kafka and Beyond | Dan Rice, BigID
Scalable Data Management for Kafka and Beyond | Dan Rice, BigID
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data Platform
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...
Mike Palmer of Veritas: Debunking the myths of multi-cloud to achieve 360 Dat...
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Data mining
Data miningData mining
Data mining
 
Distributed Tracing
Distributed TracingDistributed Tracing
Distributed Tracing
 
Security&Governance
Security&GovernanceSecurity&Governance
Security&Governance
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
How to streamline data governance and security across on-prem and cloud?
How to streamline data governance and security across on-prem and cloud?How to streamline data governance and security across on-prem and cloud?
How to streamline data governance and security across on-prem and cloud?
 
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...Webinar - Case Study: ProtectWise enhances network security with DataStax alw...
Webinar - Case Study: ProtectWise enhances network security with DataStax alw...
 
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
How Virtual Reality and Machine Learning Are Powering the New Age of Network ...
 
Article data-centric security key to cloud and digital business
Article   data-centric security key to cloud and digital businessArticle   data-centric security key to cloud and digital business
Article data-centric security key to cloud and digital business
 
Data centric security key to cloud and digital business
Data centric security key to cloud and digital businessData centric security key to cloud and digital business
Data centric security key to cloud and digital business
 
NETWORK SECURITY MONITORING WITH BIG DATA ANALYTICS - Nguyễn Minh Đức
NETWORK SECURITY  MONITORING WITH BIG  DATA ANALYTICS - Nguyễn Minh ĐứcNETWORK SECURITY  MONITORING WITH BIG  DATA ANALYTICS - Nguyễn Minh Đức
NETWORK SECURITY MONITORING WITH BIG DATA ANALYTICS - Nguyễn Minh Đức
 
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...
NTXISSACSC2 - Information Security Opportunity: Embracing Big Data with Peopl...
 
Managing Data from the Edge to HPC
Managing Data from the Edge to HPCManaging Data from the Edge to HPC
Managing Data from the Edge to HPC
 
Healthcare fraud detection
Healthcare fraud detectionHealthcare fraud detection
Healthcare fraud detection
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 

Recently uploaded (20)

costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 

Building trust in your data lake. A fintech case study on automated data discovery, control and monitoring data leveraging Apache Ranger and Apache Atlas

  • 1. Fintech case study on Automated Discovery, Control and Monitoring Data
  • 3. AGENDA ▸Introduction ▸Challenges with big data ▸G-Research overview ▸Journey with HDP ▸Challenges and learnings ▸Recommendation
  • 4. PARTNE RSGLOBAL SOLUTIONS FOR DISCOVERING AND MANAGING SENSITIVE DATA ABOUT PRIVACERA
  • 5. DETECT MALICIOUS OR ACCIDENTAL USE CONTROL ANONYMIZE DATA/ RESTRICT ACCESS DISCOVER WHAT TYPE OF DATA STORED AND WHERE? REPORT SECURITY AND COMPLIANCE REPORTING SOLUTIONS FOR MANAGING SENSITIVE DATA
  • 6. CHALLENGES WITH BIG DATA ▸High volume of data ▸Stringent security and privacy requirements ▸Multi-tenant environments ▸Multiple methods for accessing data ▸Business users using newer tools ▸Traditional tools for security and governance “retrofitted” for the modern data architecture
  • 9. ABOUT G-RESEARCH ▸Quantitative research technology company ▸Statistical analysis and Big Data pipelines to recognise patterns and extract insights from very large market datasets ▸Forecasting analytics to predict variances in financial markets ▸Clients operate in capital markets globally ▸Undergoing very aggressive growth and adoption of cutting edge technology
  • 10. DATASETS ▸Market Data ▸Level 1 (top of book) ▸Basic market data (instrument, bid price, bid size, ask price, ask size) ▸Level 2 (order book or market depth) ▸Richer data (highest bid prices, lowest ask prices)
  • 11. DATASETS ▸Market Data ▸Level 1 (top of book) ▸Level 2 (order book or market depth) ▸Often represented as incremental updates at nanosecond granularity ▸HUGE dataset! Time Quantity Bid Ask Quantity Time 12:32:16 120 12.25 12.25 150 12:32:55 12:31:01 50 12.26 12.27 60 12:31:19 12:33:45 150 12.25 12.27 100 12:34:27
  • 12. DATASETS ▸Other datasets include ▸Datapacks for additional enrichment ▸Risk analysis on positions, portfolios and strategies ▸Reference data about markets structures and corporate entities ▸News feeds ▸…
  • 13. HADOOP DATA PLATFORM – REASONS ▸Volume of data ▸Fast processing ▸Multiple use cases and tenants ▸Flexibility ▸Development ▸Business
  • 15. CORE REQUIREMENTS FOR DATA PLATFORM▸Security ▸Protect intellectual property of the company ▸Datasets processed through the pipeline ▸...but also code ▸Integration with existing security systems and policies ▸Authentication ▸Encryption ▸Strict and flexible authorization
  • 16. CORE REQUIREMENTS, CONTINUED… ▸Governance ▸Ability to find data easily enabling collaboration ▸Track data changes, impact, lineage and maintain consistency ▸Multi-tenancy ▸Variety of development teams need to work on the the same platform and share data and resources ▸Scalability ▸Explosive data growth
  • 17. IMPORTANCE OF GOVERNANCE “Data is the new oil” - Someone who’s been quoted about a billion times “Metadata is the new Data”
  • 18. GOVERNANCE AND SECURITY - SOLUTION DESIGN ▸Kerberos based authentication ▸Integrated with AD ▸HDFS encryption at rest ▸Hive ▸Atlas ▸Great framework...but limited as a finished product  i.e. Spark process lineage  Custom implementation Spark lineage in Atlas
  • 19. GOVERNANCE AND SECURITY - SOLUTION DESIGN ▸Automatic discovery and tagging of data using Privacera ▸Tags created at HDFS folder level ▸Privacera publishes ”tags” to Atlas ▸Leverage Atlas-Ranger integration ▸Tag based policies in Ranger ▸Highly confidential data will have restricted access
  • 20. GOVERNANCE AND SECURITY - SOLUTION DESIGN
  • 22. CUSTOM METADATA IN ATLAS ▸Custom datasets  Atlas metamodel definitions Type spark_job id cardinality indexable operations cardinality indexable input_data cardinality indexable output_data cardinality indexable Entity Type spark_join_ operation Type string Type spark_dataset Type Dataset Type Process Type spark_operation Attributes Type spark_entity Type Referenceable
  • 23. OUR JOURNEY IN SECURITY MANAGEMENT ▸Datazone definition to capture data movement ▸Advanced data discovery and tagging ▸Custom lineage applied on our own data types Standard Ranger Policies Tag-based Ranger Policies Comprehensive Tag-based Ranger Policies Privacera Advanced Security Policies Atlas OoB Custom Atlas metamodels Privacera ▸Access management through data tags ▸Basic Ranger policies t=0 t=1 t=2 t=3
  • 24. KEY CHALLENGES AND LEARNINGS ▸Technical challenges ▸GraphDB does not scale as metastore ▸~100k’s entities tagged per week ▸Back-end rewritten to only use HBase + Solr ▸Open source flexibility can be a 2-sided coin ▸Business challenges ▸Business process integration
  • 25. SUMMARY ▸Understand your data before expanding your data lake ▸Invest in automated classification and centralized metadata ▸Manage user access through data classification ▸Anonymise data to reduce exposure ▸Monitor the use of data - “trust but verify”