Practical experiences using Atlas and Ranger to implement GDPR - DataWorks Summit 2018

GDPR is the current focus of all organizations working with data in Europe. Svenska Spel uses Atlas and Ranger as an essential part of making our data warehouse conform to GDPR.
This presentation shows how Atlas and Ranger, together with Hive, are used in practice to control access to and masking of data in our data warehouse. We will demonstrate how, with a model-driven development process using Data Vault modeling, we have extended our model with metadata. From the metadata we generate our access control configuration. The configuration is automatically synchronized to Atlas and Ranger in our deployment process.
Previously, our data model was only used to generate most of our SQL code. By extending the data model with metadata for access control, it inevitably became the source of truth, guaranteeing the model's conformity with reality. This has increased the quality of our data model and decreased maintenance. It helps us maintain our data over its full life cycle and ensures that the correct access control rules are in place.
Speaker
Magnus Runesson, Data Engineer, Svenska Spel
More Related Content

Metadata Driven Access Control in Practice - BigData Tech Warsaw 2019 (Magnus Runesson)
The importance of data access governance is continuously growing due to new regulations, such as GDPR, and industry policies. Managing access policies for each individual dataset is a hassle. In this talk, we will show how Svenska Spel uses metadata about datasets to generate access policies. We use it to create policies for access, retention, and anonymization.
To create our policies from metadata we developed and open-sourced cobra-policytool. Thanks to the tool we can use Git and a metadata repository as the source of truth for our policies. Policies are deployed to Apache Atlas and Ranger via a CI/CD pipeline. With our work deriving policies from metadata we have:
privacy and security by design
low maintenance burden through automation
easy review of policies
traceability of policy changes
a single source of truth
We learned the importance of having company policies and technology walk in tandem. Data governance and metadata management are complex areas requiring high attention; they are nothing you can duct-tape on at the end or solve by buying a new power tool. We will show that, with the right approach, it can be rewarding and make new requirements easy to implement and audit.
Journey in Country of Data Access Governance - DataWorks Summit 2019 Barcelona (Magnus Runesson)
The importance of data access governance is continuously growing due to new regulations, such as GDPR, and industry policies. In this talk, we will share learnings at Svenska Spel from implementing and supporting this growing demand. Our journey started a year ago, implementing Atlas and Ranger in our Hortonworks Data Platform solution. From the start we wanted to implement a process and solution that had:
Minimal impact on ETL developers and analysts
Privacy and security by design
Low maintenance burden through automation
To be able to manage our policies and metadata we developed and open-sourced cobra-policytool. Cobra-policytool integrates into our deployment pipeline and simplifies the way we express our policies. Thanks to the tool we can use Git as our source of truth for our policies and metadata information, along with the source code. It is close to seamless for our ETL developers, and we can trust that the right policies are deployed.
Throughout the course of our journey, we learned the importance of having company policies and technology walk in tandem. A single source of truth, open platforms, and automation are key to success. Data governance and metadata management are complex areas requiring high attention; they are nothing you can duct-tape on at the end or solve by buying a new power tool. With the right approach, it can be rewarding and make new requirements easy to implement and audit.
GDPR Considerations for IBM Connections (LetsConnect)
In the EU there is a new data privacy regulation effective from May 2018. Organizations are required to comply with multiple requirements, which also affect IBM Connections. In the session we will check how IBM Connections (on premises) meets the requirements of GDPR and what tools you might need to use.
YugaByte DB - "Designing a Distributed Database Architecture for GDPR Complia..." (Jimmy Guerrero)
Join Karthik Ranganathan (YugaByte CTO) for an in-depth technical webinar to understand how developers and administrators alike can design systems that enable users to control the sharing and protection of their personal data so that it complies with GDPR. Topics covered include schema design, data partitioning, encryption and replication. Karthik will draw on his experience helping scale Facebook's Messenger and Inbox Search along with real-world implementations which make use of YugaByte DB.
Distributed Database Architecture for GDPR (Yugabyte)
The General Data Protection Regulation, often referred to as GDPR, came into effect on 25 May 2018 across the European Union. This regulation has implications for many global businesses, given the fines imposed if an organization is found to be non-compliant. Making sure that the app architecture continues to ensure regulatory compliance is an ongoing challenge for many businesses. This talk covers some of the key requirements of GDPR that impact database architecture, what the inherent challenges are with these requirements, and how YugaByte DB can be used to implement them.
To disrupt and innovate, you need access to data. All of your data. The challenge for many organisations is that the data they need is locked away in a variety of silos. And there's perhaps no bigger silo than one of the most widely deployed business applications: SAP. Bringing together all your data for analytics and machine learning unlocks new insights and business value. Together, Cloudera and Datavard hold the key to breaking SAP data out of its silo, providing access to unlimited and untapped opportunities that currently lie hidden.
Integrate ERP and CRM Metadata into ER/Studio (DATAVERSITY)
You might think that the metadata in your large, complex, and customized ERP and CRM applications is too difficult and time-consuming to find and use within your enterprise data models. If you are implementing a data warehouse, data governance, data migration, or other information management project which includes SAP, Oracle, or Salesforce packages, then having access to their data models is critical. You can integrate, manage, and govern your ERP and CRM metadata within your data models to complete the big picture of your data architecture and lineage.
This webinar will briefly introduce the challenges associated with accessing the metadata in these ERP and CRM packages and demonstrate how the combination of Safyr® and ER/Studio tools lets you find and use the key metadata as easily and quickly as if it were a standard database. Being able to use the package metadata in enterprise data models and data lineage will help to accelerate delivery and improve accuracy.
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Amazon Web Services
Come to this session to learn a new approach in reducing risk and costs while increasing productivity, organizational alacrity, and customer experience, resulting in a competitive advantage and assorted revenue growth. We share how a de-identified data lake on AWS can help you comply with General Data Protection Regulation (GDPR) and California Consumer Protection Act requirements by solving the issue at its causal element.
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz... (Amazon Web Services)
Andrew McIntyre, Director of Strategic ISV Alliances, Informatica
Modernizing your analytics capabilities to deliver rapid new insights is critical to successfully drive data-driven digital transformation. Many organizations find it challenging to connect, understand and deliver the right data to generate new insights. Learn about the latest patterns, solutions and benefits of Informatica's next-generation Enterprise Data Management platform to unleash the power of your data through the modern cloud data infrastructure of AWS. See how you can accelerate AI-driven next-generation analytics by cataloging and integrating structured and unstructured data from hundreds of data sources from multiple on-premises and cloud data sources.
SmartMDM is an easy-to-use and agile tool for entering and editing master data. We support both centralized and distributed master data solutions and improve the workday experience of the people involved in maintaining master data.
Sports Alliance - Case Study: Data Driven Marketing in European Football (BigDataExpo)
Sports Alliance works for 106 Football Clubs in 6 European countries. These clubs, with the help of Sports Alliance, have found a way to turn their Fan data into Data Driven Marketing. This Case Study will show you how this is done and what the results are.
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C... (Amazon Web Services)
Andrew McIntyre, Director of Strategic ISV Alliances, Informatica
Modernizing your analytics capabilities to deliver rapid new insights is critical to successfully drive data-driven digital transformation. Many organizations find it challenging to connect, understand and deliver the right data to generate new insights. Learn about the latest patterns, solutions and benefits of Informatica's next-generation Enterprise Data Management platform to unleash the power of your data through the modern cloud data infrastructure of AWS. See how you can accelerate AI-driven next-generation analytics by cataloging and integrating structured and unstructured data from hundreds of data sources from multiple on-premises and cloud data sources.
70% of employees have access to data they should not…and that’s going to be a problem when GDPR takes effect in May 2018.
A strong data governance program ensures that you have the policies, standards, and controls in place to protect data effectively and access it for decision making. Data governance may become one of the most important functions of your data integration architecture when it comes to data agility.
Watch this on-demand webinar describing practical steps to data governance:
- Map personal data elements to data fields across systems using metadata
- Create workflows for data stewardship and manage end user computing
- Establish a data lake with native data quality for consent processing
- Track and manage data with audit trails and data lineage
Discover the concept of 'on-the-fly' analysis with TIBCO Spotfire: combining different types of files with little or no coding, cutting the cost of a growing data warehouse, and enabling real-time analysis for the digital era.
Why an AI-Powered Data Catalog Tool is Critical to Business Success (Informatica)
Imagine a fast, more efficient business thriving on trusted data-driven decisions. An intelligent data catalog can help your organization discover, organize, and inventory all data assets across the org and democratize data with the right balance of governance and flexibility. Informatica's data catalog tools are powered by AI and can automate tedious data management tasks and offer immediate recommendations based on derived business intelligence. We offer data catalog workshops globally. Visit Informatica.com to attend one near you.
[Webinar] Getting Started with BigQuery: Basics, Its Applications & Use Cases (Tatvic Analytics)
This webinar aims to provide the BigQuery product walkthrough right from the basics. Our core focus will be on the use cases and applications that help to gain additional customer insights from the data integrated within BigQuery.
BigQuery is equipped with the ability to crunch TBs of data in seconds while ensuring scalability and speed. It also enables us to perform advanced statistical analysis by providing unsampled raw hit level analytics data.
Microsoft Dynamics strategy for small to medium size business: a new solution (DXC Eclipse)
Microsoft Dynamics 365: Continue Your Transformation Journey.
Microsoft Dynamics strategy for small to medium size businesses: a new solution.
Explore the latest release Dynamics 365 Business Central, suitable for existing Dynamics NAV and GP customers wanting to move to the Cloud.
Presented by Karen Blake - Solution Architect, Microsoft Dynamics and Carsten Pedersen - Senior Executive, Microsoft Dynamics NAV from DXC Eclipse
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out... (DataWorks Summit)
The time for enterprises to gain market advantage through Artificial Intelligence is now. Already many AI-enabled advances are transforming business processes and customer experiences, but the vast majority of AI-enhanced use cases are still to be discovered, developed, and deployed. In order to discover and capture the value available through deployed AI, new deep learning techniques are the focus of feverish research and development in academia and business. However, even successful AI experiments are often never deployed to business operations, resulting in wasted effort, time, and money, and leaving businesses dangerously exposed to competitors that have integrated AI into their ongoing operations.
Experimentation with AI is essential to realizing the promise of AI, but enterprises face substantial risks that their experiments with AI, even successful ones, will do nothing to improve their business outcomes. We present a framework, inspired by DevOps practices used by software engineers to continuously incorporate new ideas and improvements into applications, that de-risks investments in AI by providing a reliable channel for pipelining successful AI experiments and development into continuously deployed and monitored operational analytics.
Speaker
Nick Switanek, Marketing Director of Artificial Intelligence, Teradata
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
4. 2018-04-19
• This talk does not cover all we have done around GDPR
• This is NOT a way of saying that if you do this you are GDPR compliant
• Some details are left out or simplified
Disclaimer
5. 2018-04-19
• Why?
• Svenska Spel’s data warehouse
• Atlas & Ranger
• How did we implement it?
• Experiences and conclusions
Agenda
6. 2018-04-19
GDPR requires
• a clear purpose for PII data
• privacy by design
• clear consent or legal grounds
• not using or storing PII if not needed
• that people own their own data
• penalties if not followed
Why?
7. 2018-04-19
• Our customers' and partners' integrity is protected
• Users only have access to data intended for the current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Goals
9. 2018-04-19
• Moved from classic Cognos + Oracle
• HDP 2.6 using Hive
• Includes Personally Identifiable Information (PII)
• 300+ event streams in
• 150 published tables and views
Svenska Spel’s data warehouse
10. 2018-04-19
• Used data are
• Understood
• Documented
• Modelled
• Modelled with Data Vault
• Oracle SQL Developer Data Modeler
• SQL code generated from model
Model based development
11. 2018-04-19
• History tracking
• Uniquely linked
• Pattern based
• Easy to generate code
Data Vault
(Diagram: a Link connecting two Hubs, with Satellites attached to them)
14. 2018-04-19
• Metadata about resources
• Resource is
• Table
• Column
• Schema
• File on HDFS
• …
• Lineage
Apache Atlas
15. 2018-04-19
• Tags have no meaning themselves
• Your business vocabulary defines the meaning
• Example of tags:
• Business entity owning the data
• Indication of sensitive data
• The rules in Ranger enforce the policy
• Separate metadata from policy implementation
Atlas tags
16. 2018-04-19
• Is user U allowed to do operation O on resource R?
• Access
• Row based filtering
• Masking
• Audit logging
• Resources are referred to with tags
Apache Ranger
18. 2018-04-19
customer (tagged PII_table; the Customer_id and Name columns are tagged PII)

Customer_id  Name   Postal_code  Has_phone  Marketing
1            Steve  12345        False      False
2            Bill   54321        True       False
3            Paul   54672        False      True

Add PII tags on table and columns in Atlas.
No behaviour change.
19. 2018-04-19
customer - Analyst view (PII columns masked)

Customer_id  Name  Postal_code  Has_phone  Marketing
17           ABC   12345        False      False
42           DEF   54321        True       False
13           BDE   54672        False      True

We set a rule in Ranger to mask PII columns.
25. 2018-04-19
• Hand-coded rules per tag
• The policy tool applies the rule on all tables with the tag
• Can be different rules for different users
• Filter gets appended to the WHERE condition by Ranger
• Used for
• Row-based filtering (access)
• Masking (anonymization)
• Catch-all rule to deny access to tables not in our model
Ranger rules
28. 2018-04-19
• Makes it easy to manage
• Atlas tags
• Ranger policy rules
• Command-line tool
• Consumes tag and policy CSV files
• Calls the Atlas and Ranger APIs
• Less than 1000 lines of Python
Policytool
(Diagram: CSV files feeding the Policytool)
33. 2018-04-19
• Simple and easy model
• Limited performance penalty
• Tag on table with masking rule => all columns masked
• Hard-to-understand API documentation
• Restriction on Ranger row-based filtering (not available on tags)
• Row-based filtering and masking do not apply to direct file access
Experiences of Atlas and Ranger
34. 2018-04-19
• Our customers' and partners' integrity is protected
• Users only have access to data intended for the current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Reached Goals
35. 2018-04-19
• Goals reached
• No SQL changes
• Scales when new datasets are added
• Our data model is guaranteed to be in sync
• Better comments in Hive
• Minimal impact on ETL developers' workflow
Conclusions
36. 2018-04-19
• Make it as simple as possible
• Automate
• Know your tool
• Be clear on your authorization model
• Know your data
Takeaways
Two weeks ago we had one of our regular syncs with Hortonworks. As many others do, we discussed various aspects of GDPR. I gave a short ad hoc presentation about how we are using Apache Atlas and Ranger to support access control and anonymization in our GDPR work.
Our sales representative Johan got very excited and asked me to come here and do a presentation.
I will talk about how we managed to keep our development process for ETL developers intact and avoid making any changes to our SQL code, thanks to Apache Atlas, Ranger and some integrations. We can now present different views of our data to different users: CRM users see only customers opted in for marketing, and analysts see data that is anonymized.
Atlas and Ranger in combination can feel very abstract, and it can be unclear how to apply them. My goal is to be as concrete as possible about how we use the tools, so that you feel it is time to get your hands dirty.
Magnus Runesson
20+ years of experience in development, operations and demanding databases. Lately focused on big data and distributed, scalable systems.
Data Engineer @ Svenska Spel
Svenska Spel is a gaming company owned by the Swedish state. In direct translation, the name means Swedish Games.
We provide games online, in stores and in casinos in Sweden.
The state asks us to deliver attractive games in a responsible manner and, at the same time, to take great responsibility for counteracting gambling addiction.
Svenska Spel's vision is therefore that gaming should be for everyone's enjoyment.
Before I go into the subject I want to point out that:
This talk does not cover all we have done around GDPR
This is NOT a way of saying that if you do this you are GDPR compliant
Some details are left out or simplified
I will start by talking about why we are doing this.
To be able to understand what we have done, I'll give an introduction to our DW environment.
Then I'll move on to a short introduction to Atlas and Ranger. I assume most of you know about them, but I want to ensure we are on the same page.
With that information as a base, I'll talk about how we use Atlas and Ranger to control access to PII data at scale in a controlled and maintainable manner.
The laws in Europe about how to handle personal information will be harmonized within the EU on May 25th.
There are several aspects of handling PII that become stricter, and we face a penalty if we break them.
PAUSE
The goals we set up before starting this work were:
Our customers' and partners' integrity is protected
Users only have access to data granted for the purpose of interest
We can do our required processing
Adaptable for new requirements
Maintainable solution
Svenska Spel comes from a classical EDW. It started to become slow to load, and we did not have the event granularity that we wanted.
We decided to go towards Hadoop, with the final goal of doing all BI work within the Hadoop ecosystem.
We are heavy users of Hive. The advantage is that it is relatively simple for an ETL developer with SQL knowledge to be productive.
As with many EDWs, ours includes PII.
As input to the system we have more than 300 event streams. These are processed into more than 150 tables published in Tableau.
The number of tables makes us want to automate as much of our development and deployment as possible. To do that we have taken a model-based development approach.
Model-based development means for us that the data we consume is well understood. We document this understanding as a step to support analysts.
We are modelling input, output and intermediate steps using ER diagrams in Oracle SQL Developer Data Modeler.
Our integration layer is modeled using the Data Vault methodology.
The model in Data Modeler is used to generate all DDL code and some of our transformation logic. The generation works when we have simple business rules.
Complex business rules are hand-coded.
The benefits of using Data Vault are:
* History tracking – it is easy to see how relations between entities have changed
* Uniquely linked – all links are 1-N
* Pattern based – the former two give us patterns for how to formulate things
* Easy to generate code – these patterns make it easy to generate code to transform the Data Vault into a Kimball dimension model (see the sketch after this list)
The main drawback is the joins.
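As a rough illustration of why these patterns make code generation easy, here is a minimal Python sketch of template-based DDL generation for a hub and a satellite. The templates, schema and column names are illustrative and greatly simplified; this is not the actual HQLgenerator.

```python
# Minimal sketch of pattern-based Data Vault DDL generation.
# Table layout and naming conventions are illustrative only.

HUB_TEMPLATE = """CREATE TABLE IF NOT EXISTS {schema}.hub_{entity} (
  {entity}_hashkey STRING,
  {business_key} STRING,
  load_ts TIMESTAMP,
  record_source STRING
) STORED AS ORC;"""

SAT_TEMPLATE = """CREATE TABLE IF NOT EXISTS {schema}.sat_{entity}_{sat} (
  {entity}_hashkey STRING,
  load_ts TIMESTAMP,
  record_source STRING,
  {attributes}
) STORED AS ORC;"""


def hub_ddl(schema, entity, business_key):
    return HUB_TEMPLATE.format(schema=schema, entity=entity, business_key=business_key)


def sat_ddl(schema, entity, sat, attributes):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in attributes)
    return SAT_TEMPLATE.format(schema=schema, entity=entity, sat=sat, attributes=cols)


print(hub_ddl("integration", "customer", "customer_id"))
print(sat_ddl("integration", "customer", "contact",
              [("name", "STRING"), ("postal_code", "STRING")]))
```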
Physically we have a multi layer architecture.
All our incoming data from source systems goes into our data lake.
The data in the Lake can be structured in many ways, sometimes direct copies of internal data structures from the source systems.
We model this data using Data Vault in our integration layer. At the start this was largely views over the lake tables, but for performance reasons we have materialized most of those views today.
Last step in the processing is to put our data into the dimension mart where the data is dimension modelled according to Kimball. This makes it easy to consume by any analysis tool.
Earlier the dimension layer was directly copied over to our presentation layer, an in-memory database Exasol.
With the requirements of GDPR we wanted to separate the data into different marts based on purpose. Nowadays we have these purpose-based marts as an intermediate step.
Two example marts are our BI mart and CRM mart.
The BI mart contains anonymized data for all our customers. On this we can do business analysis to understand how our business performs, but we cannot identify the individual user.
In the CRM mart we only include customers that have opted in for marketing and recommendations.
We also have a handful of other marts, each with a different purpose. What data goes into which mart is controlled by our role-based access, handled by Ranger.
How we govern the data to ensure each purpose-based mart only gets the data it should is what I will talk about for the rest of the time. This is also where Apache Ranger and Atlas come into play.
Atlas handles metadata about our resources. It is not necessarily limited to resources in Hadoop.
Resources are Hive tables, columns and schemas, but also files on HDFS. We get metadata around creation, usage and deletion.
As a user, one can tag each resource.
These tags have no meaning in themselves.
Your business vocabulary defines the meaning.
Example of tags:
Business entity owning the data
Indication of sensitive data
The rules in Ranger enforce the policy.
This separates metadata from policy implementation.
It is easier to understand that a tag named PII means personally identifiable information.
How this is implemented is often not necessary to understand when modeling. Therefore the policies are implemented separately in Ranger.
The rules can often be complex, and it is not obvious why they are relevant for a table. The tags separate these concerns from each other.
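To make the tagging step concrete, here is a rough Python sketch of attaching a classification (tag) to a Hive table through the Atlas REST API. The host, credentials and qualified-name format are illustrative, and the endpoint paths and payloads should be verified against the Atlas version you run.

```python
# Rough sketch: tag a Hive table in Atlas with a classification such as PII_table.
import requests

ATLAS = "http://atlas.example.com:21000"      # illustrative host
AUTH = ("atlas_user", "secret")               # illustrative credentials


def tag_hive_table(qualified_name, tags):
    # Look up the hive_table entity by its unique qualifiedName attribute.
    r = requests.get(
        f"{ATLAS}/api/atlas/v2/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": qualified_name},
        auth=AUTH,
    )
    r.raise_for_status()
    guid = r.json()["entity"]["guid"]

    # Attach the classifications (tags) to the entity.
    r = requests.post(
        f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
        json=[{"typeName": tag} for tag in tags],
        auth=AUTH,
    )
    r.raise_for_status()


tag_hive_table("warehouse.customer@prodcluster", ["PII_table"])
```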
Let us take a look at an example of how this can be used.
PAUSE
Let us take a look at what Ranger and Atlas are in this perspective.
Is user U allowed to do operation O on resource R?
Access
Row based filtering
Masking
Audit logging
15 min
We have a plain customer table in Hive with some information about our customers.
We want to do two things with this table before we make it available to the user.
Anonymize data that can identify a person.
CRM is only allowed to see customers who have opted in for marketing (the Marketing column).
We add three tags in Atlas to this table and its columns.
Two tags go on the two columns we consider PII. We also add a tag to the table so that we know it includes PII, even if we will not use that right now.
Users of the table will still see no change.
We add a masking rule in Ranger using our own hashing algorithm with a good salt to anonymize our data.
Only data that is tagged PII will be masked.
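The masking idea can be illustrated with a small Python sketch of a salted (keyed) hash. This is not Svenska Spel's actual algorithm, only the general approach of mapping a PII value to a stable, non-reversible token.

```python
# Illustrative salted-hash anonymization: the same input always yields the
# same token, but the original value cannot be recovered without the salt.
import hashlib
import hmac

SALT = b"a-long-random-secret-salt"  # in practice stored securely, not in code


def mask(value: str) -> str:
    digest = hmac.new(SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened token for readability


print(mask("Steve"))
```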
When this is applied our user will see this table.
We have now fulfilled our first goal of anonymizing the data. This is the view seen by an analyst who analyzes how we perform in different areas.
To implement our second goal, that CRM shall only see customers that have opted in for marketing, we must add a second rule saying: CRM users are only allowed to see customers with Marketing=True.
This rule is an appended where condition to our queries.
Now our table is ready for CRM to use.
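Conceptually, the row-level filter behaves as if the expression were appended to the query's WHERE clause. A toy sketch of that effect (Ranger enforces this inside Hive; it does not rewrite queries in client code):

```python
# Toy illustration of the effect of a Ranger row-level filter for CRM users.
def with_row_filter(base_query: str, row_filter: str) -> str:
    return f"SELECT * FROM ({base_query}) t WHERE {row_filter}"


print(with_row_filter("SELECT * FROM mart.customer", "marketing = TRUE"))
```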
A challenge is that in Ranger you have to create one rule for every table. In our case, with very systematic naming, we know that all tables with a column Marketing must have this rule. This may not hold for this example, but it does in others.
We needed tooling support for this; otherwise it is impossible to ensure the 150 tables are correct.
We also wanted to take advantage of having all tables described in Oracle Data Modeler. Why not add PII and Marketing info there too?
We need tooling for this too.
18-20min
We have now, in general terms, looked at the problem statement and how Apache Atlas and Ranger can solve it.
We have also observed that it might be infeasible to scale to many tables and rules using Atlas and Ranger directly.
Tooling is needed to do this in a maintainable manner.
These insights about the needed tooling led us to a development process looking like this.
The ETL developer adds new tables, columns etc. in Oracle Data Modeler, including information about what is PII and some other metadata we want.
The developer stores the model in a Git repository. He or she runs a tool to generate the SQL code and the information about which tables and columns have what tags.
Some SQL may have to be written by hand, but no table creations. This is core to being able to tag our data.
The generated code is also checked into Git. That way our Git repository includes a full description of our ETL process.
The code is pushed to Jenkins and automatically deployed on Hive according to our release process.
Occasionally, new rules for who can access what data must be added. We went with the decision to store these in the repository too, since it is our developers creating the rules. Others might have other requirements.
This process is very similar for the developer to what they had before. The only new thing they normally need to do is tagging the data within Oracle Data Modeler.
PAUSE
To generate the code we have an in-house tool called HQLgenerator. It reads the files from Data Modeler and generates SQL.
The generation is based on templates and three files defining what to generate and how to generate transformations.
These files include the source and destination tables and whether any transformation is needed.
We have had this tool for about 2 years. This was the natural place to also generate the tagging information for our tables and columns.
It is easy to change SQL dialect.
If you think about it, describing what tags you want on a table is very much the same as describing the create statement of the table. Both are about data definition.
We added functionality to generate two CSV files. One for columns and one for tables. Here you see an example of a file for columns.
It includes schema, table, column (called attribute) and what tags it shall have. One line per column. It is very simplistic.
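As a hedged example, a column tag file could look roughly like this and be read with Python's csv module; the exact column names and separator in our tooling may differ.

```python
# Illustrative column-tag CSV and a reader for it.
import csv
import io

COLUMN_TAGS_CSV = """schema;table;attribute;tags
mart;customer;customer_id;PII
mart;customer;name;PII
mart;customer;postal_code;
"""


def read_column_tags(text):
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    for row in reader:
        tags = [t for t in row["tags"].split(",") if t]
        yield row["schema"], row["table"], row["attribute"], tags


for schema, table, column, tags in read_column_tags(COLUMN_TAGS_CSV):
    print(schema, table, column, tags)
```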
The file for tables looks almost the same.
It is independent of the end tool. We have considered using it for our in-memory database Exasol too.
As said earlier, it is not enough to set tags on the data. We also need rules. We did not find any good way to have them in our model, but we wanted to have them in our repository for auditing purposes.
The Ranger rules are hand-coded. One line per tag, user and rule.
The tags will be expanded by our tooling to all tables having that tag.
These rules are then added to Ranger.
We use it for both row-based filtering and masking.
A catch-all rule denies access to tables not in our model.
The Ranger policy file is almost as simple as the tag files.
One specifies the tag we want to have a rule for, the groups and/or users the rule applies to, and finally the rule itself.
Note that we have $table in the rule. When inserted into Ranger, this will be changed to the table name.
In Ranger we will get one rule per table having the tag PII_table.
Note that we can do it this way since we know that each table with PII information has a customer_id, thanks to our structured and pattern-based generation of SQL. In this example you can see how we trust that we have a customer_id in all our PII_tables.
We will change this in a second iteration; we made it too simple.
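A minimal sketch of the expansion step: one tag-level rule fans out into one Ranger rule per table carrying the tag, with $table substituted. The filter text, group name and file layout are illustrative, not the exact policy file format.

```python
# Expand tag-level rules into per-table row-filter rules.
POLICY_ROWS = [
    # (tag, groups, row filter -- "$table" is replaced with each table name)
    ("PII_table", "crm_users",
     "$table.customer_id IN (SELECT customer_id FROM opted_in_customers)"),
]

# Tables carrying each tag, as derived from the tag CSV / data model.
TABLES_BY_TAG = {"PII_table": ["customer", "customer_contact"]}


def expand_policies(policy_rows, tables_by_tag):
    """Yield one concrete rule per (tag, table) combination."""
    for tag, groups, rule in policy_rows:
        for table in tables_by_tag.get(tag, []):
            yield {"table": table,
                   "groups": groups,
                   "rowFilter": rule.replace("$table", table)}


for policy in expand_policies(POLICY_ROWS, TABLES_BY_TAG):
    print(policy)
```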
Okay, we have now covered all the files that are created or generated by our ETL developer. They are deployed and become part of a job in Airflow, our scheduler.
This job creates or modifies tables. It does three steps.
Create all tables from the SQL files.
Take the tag files and tag our tables in Atlas. This is done with a small command line tool called policytool.
Our policies are applied to Ranger using policytool too.
What little magic there is, is handled by the policytool we have developed.
The last two steps are repeated every day to verify that no manual changes have been made in either Ranger or Atlas.
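A sketch of what such an Airflow job could look like, using Airflow 2 style imports. The policytool command-line invocations and file paths are illustrative, not the exact cobra-policytool CLI.

```python
# Illustrative Airflow DAG: create tables, then apply tags and policies.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dw_schema_and_policies",
    start_date=datetime(2018, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_tables = BashOperator(
        task_id="create_tables",
        bash_command="hive -f /etl/generated/create_tables.sql",
    )
    apply_tags = BashOperator(
        task_id="apply_atlas_tags",
        bash_command="policytool tags --file /etl/generated/column_tags.csv",
    )
    apply_policies = BashOperator(
        task_id="apply_ranger_policies",
        bash_command="policytool policies --file /etl/policies/rules.csv",
    )

    create_tables >> apply_tags >> apply_policies
```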
The purpose of the policytool is to make it simple to add and manage tags and policies in Atlas and Ranger. Thanks to it, we can handle our tags and rules within Git.
It is easy to audit and find if any unintended changes have happened.
It is very simple. It consumes our CSV files and applies them to the Atlas and Ranger API. In the case of policy rules they are expanded to one rule per table instead of one per tag.
The tool warns us if there are unknown tables or columns. Thanks to this, normal users cannot access tables that do not exist in our data model.
All in all, the tool is right now less than 1000 lines of Python code.
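The "warn about unknown tables" check is essentially a set difference between what is tagged and what actually exists. A small sketch, with hard-coded lists standing in for the data model and the warehouse:

```python
# Warn when the tag files mention tables that do not exist.
def warn_unknown_tables(tagged_tables, existing_tables):
    unknown = sorted(set(tagged_tables) - set(existing_tables))
    for table in unknown:
        print(f"WARNING: {table} is tagged but does not exist in the warehouse")
    return unknown


warn_unknown_tables(
    tagged_tables=["mart.customer", "mart.legacy_orders"],
    existing_tables=["mart.customer", "mart.bets"],
)
```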
30 MIN
To wrap it up: we have a simple and clear development process.
All tables are modeled in a graphical tool, Oracle SQL Developer Data Modeler. Even metadata such as PII is added there.
The full description of our model, including policies and metadata, is stored in Git, much of it generated from the model.
It is automatically deployed when merged.
The content of the GIT repo ends up in Airflow that schedules our runs.
Our policies and tags are applied daily. Table changes are only run when needed.
The behavior of applying tags and rules can be shown like this.
The analyst's view of our data is now restricted according to our policies.
From our earlier example.
The analyst can still see all data, but PII is anonymized. The analyst has no need of names or other PII. She is interested in grouping on postal code or any other market segment.
The CRM user sees only customers opted in for marketing. This reduces the risk of sending marketing to customers who have not asked for it.
So what do we think about Atlas and Ranger?
Both tools have simple and easy models.
The performance penalty feels fair.
There are some pitfalls. For example: if you put a tag on a table and then add a masking policy to that tag, all fields in the table will be masked.
The most negative thing with the tools is that the API documentation is hard to understand. There are several versions of the APIs; one in Ranger is not documented at all, but turned out to work best. The easiest way to understand the API was to wiretap the GUI.
For us, the restriction that Ranger row-based filtering cannot select the tables to filter based on a tag was a drawback. We solved it with our tooling. In the end we would probably have wanted that tool anyway.
Row-based filtering and masking do not apply to direct file access. That is not a big surprise from a technical perspective, but it forced us to materialize our purpose-based marts so we could import them into the in-memory database Exasol.
Let's evaluate how we reached our goals. With the work I have described, using Atlas, Ranger and our data model, the extension of our HQLgenerator and the creation of our policytool, we reached our goals.
With this, and other measures, we feel that we protect our customers' and partners' integrity.
Unauthorized users cannot access data they should not.
We can still do our processing, without any changes in our jobs.
We are confident that we can adapt to new requirements.
The solution is easy to maintain.
In total the impact was that we reached our goal.
The solution scales when we add new data sets. The change in our existing workflow was minimal.
We also got some advantages we did not foresee when we started.
In our data model we have a lot of comments that are valuable documentation. Since this work ensures that our model is in sync and shortcuts cannot be taken, we now have better documentation of our data model within Hive. Being in sync between the model
and Hive also helps further development.
Finally, some takeaways I want to give you if you are going to do something similar:
Make it as simple as possible. It will be complex enough anyway.
Think automation. It eases the burden in production and helps you scale, but it also increases your quality and auditability.
The number of policy rules we create, review and audit is less than 1 in 100 of those we actually have in Ranger.
Be sure you understand the tool and how its limitations will impact you. This goes hand in hand with having a clear authorization model.
Avoid complex rules like: if this, but not that, or that, do this but only when, and so on.
Of course the last one is obvious. Know your data. Be systematic about it. Always call your customer id customer_id, not sometimes c_id or just id.
Thank you! Feel free to contact me if you have any questions.
I hope we have a few minutes left for questions.