Presented by Eric Ogren and Anthony Hsu at the LinkedIn Big Data Meetup on 2018-01-25: https://www.meetup.com/Big-Data-Meetup-LinkedIn/events/246858500/
This document discusses LinkedIn's approach to balancing data democracy and data protection. It describes LinkedIn's data ecosystem and challenges in providing data access while ensuring member privacy. Key tools discussed include WhereHows for metadata and data discovery, Dali as a data access layer, and Apache Gobblin for data lifecycle management including deletion. The document emphasizes privacy by design and cross-functional collaboration as needed to sustainably address increasingly complex data protection regulations.
The document discusses the importance of metadata for managing and organizing digital records and information within organizations. It defines metadata as "data that describes other data" and explains that metadata allows users to locate, evaluate, and discover information without having to rediscover it each time. The document outlines different types of metadata including descriptive, structural, and administrative metadata. It emphasizes that metadata needs to be carefully designed and applied according to standards to ensure information is properly maintained and accessible. Metadata plays a key role in organizing information, enabling search and discovery, facilitating data archiving and preservation, and allowing interoperability between different data sources.
Take control over GDPR compliance with ContentMap software! - Pär Eliasson
ContentMap helps your organization analyze, visualize, and structure GDPR-sensitive data, all in order to take control of your GDPR compliance.
ContentMap finds, analyzes, and categorizes sensitive GDPR data in ALL systems and files. It helps organizations easily identify, evaluate, and structure GDPR compliance in email, MS Office files, and cloud-connected sources, all in one simple view.
GDPR CCPA Automated Compliance - Spark Java Application Features and Functi... - Steven Meister
GDPR/CCPA automated technology: a 16-page PowerPoint with features, functions, architecture, and the reasons for choosing them. Be on your way to compliance with technology created with compliance as its goal; without technology built specifically for regulations such as GDPR, CCPA, and HIPAA, expect to add years of development.
After scrolling through this PowerPoint you will realize just what is required and be able to better estimate the effort it will take for your company to meet these regulatory requirements, both with technology and without it.
Spend just 5-10 minutes; it might save your company, and your customers, from the negative ramifications of the roughly two breaches a year a company can expect to suffer.
This PowerPoint covers the critical aspects and needs that are present in any project designed to meet regulatory requirements for GDPR, CCPA and many others.
Complete Channel of Videos on BigDataRevealed
https://www.youtube.com/watch?v=3rLcQF5Wsgc&list=UU3F-qrvOIOwDj4ZKBMmoTWA
847-440-4439
#CCPA #GDPR #BigData #DataCompliance #PII #Facebook #Hadoop #AWS #Spark #IoT #California
This is a must-see for every executive with a big data Hadoop cluster, and for their staff: getting your big data house in order.
Misalignment and clutter waste much of the precious time needed for critical decisions.
Building the Governance Ready Enterprise for GDPR Compliance - Index Engines Inc.
The EU General Data Protection Regulation (GDPR) fundamentally changes how organizations manage personal data, giving citizens the right to access, rectify, erase, restrict, and migrate their personal content held by any organization that does business in the European Union.
Index Engines' technology delivers extensive search and management solutions that empower you to find all personal data under management with considerable precision and meet or exceed the requirements of the regulation through implementation of powerful indexing technology. Index Engines supports all classes of data from primary storage to legacy backup data.
Hyperion Essbase Training from Hyperion Experts
Tech Thinkers Lab is a well-known Hyperion training provider; our unique approach to Hyperion training makes it easy for students to understand and practice.
Contact us for more details.
Supporting GDPR Compliance through Data Classification - Index Engines Inc.
This document discusses how Index Engines technology can help organizations comply with the General Data Protection Regulation (GDPR). It provides an overview of the GDPR requirements and highlights Index Engines' capabilities such as data classification, reporting, disposition, and automated monitoring that allow organizations to know, manage, and govern their enterprise data in accordance with the GDPR. Index Engines provides a single platform to classify petabytes of data, enable flexible search and disposition of personal data, and demonstrate ongoing compliance through automated policy monitoring and auditing.
There’s a lot of ‘software vendor hype’ in support of the GDPR, but most of their solutions are ineffective because of limited features that cannot support the comprehensive compliance the GDPR demands.
Index Engines delivers an enterprise class classification, search and management solution to find all personal data under management with considerable precision. As an added bonus, Index Engines is proven to reduce costs and deliver an ROI through clean-up of content that no longer has business value.
During this webinar you'll learn:
Actionable approaches to managing petabytes of data
Proven strategies on classifying and finding personal data
How support for the GDPR can deliver an ROI
No marketing fluff, just a concrete workflow that will get you started
The GDPR consists of 99 articles that mandate how personal data is to be handled, but how do you manage years of data on various platforms?
Dell EMC and Index Engines together can deliver an intelligent, actionable approach to managing complex data environments in support of GDPR compliance. Providing deep intelligence across primary and secondary storage, this combination enables advanced classification, search and management allowing you to find personal data under management with considerable precision.
During the 60-minute web event, you’ll learn how to:
Approach GDPR compliance with a data-focused process
Mitigate risks with an intelligent, easy-to-implement workflow
Map and classify data to focus and streamline searches for personal data
Locate personal data with advanced search techniques
Better manage petabytes of data to control access and streamline costs
Consider alternatives to cross-border transfers to reduce potential risk
Cleaning up Redundant, Obsolete and Trivial Data to Reclaim Capacity and Mana... - Index Engines Inc.
Data grows at a rate of 40-60% each year, and as capacity is expanded, redundant, obsolete, and trivial user data (ROT) clogs corporate networks, resulting in unnecessary risk and expense.
Depending on the industry, 40-70% of this data has no business value. Harnessing ROT growth will not only control expenses by reducing or eliminating storage upgrades but also minimize risk.
Index Engines data profiling software supports ROT analysis and data disposition that ranges from terabytes to petabytes of enterprise content. It provides search, reporting, disposition and defensible deletion of data.
http://www.indexengines.com/storage-management/solutions-for/rot-analysis
DBAs - Is Your Company’s Personal and Sensitive Data Safe? - DevOps.com
We have all seen the press coverage on corporate data breaches and compromises to personal data. You’ve probably heard about the new EU General Data Protection Regulation (GDPR) that came into effect in May last year, which affects any company that manages the personal data of EU residents. There are also some U.S. regulations that cover data privacy, such as HIPAA, HITECH, PCI and the CA Consumer Privacy Act.
Of these, GDPR is considered the most comprehensive when it comes to the needs of the individual and how their personal data should be protected and carries the harshest financial penalties for non-compliance.
The DBA is often the party primarily responsible for implementing compliance controls and technical measures for protecting data. But the GDPR first requires an assessment of where PII and sensitive data reside across multiple databases, and this will be one of the first challenges a DBA faces before applying protection measures.
With many DBAs having to manually trawl through their database tables to identify sensitive data, what is needed is a fast, effective way to automate the discovery process and report on where sensitive data is stored. This would save time and enable companies to determine the most appropriate way to apply protective safeguards in order to minimize data breaches in the future and protect the business.
If you are a DBA responsible for your company’s data and are concerned about how to identify and protect your data, you should attend this webinar to find out how you can simplify and automate this task.
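The automated discovery the webinar describes can be pictured as a scanner that samples table rows and flags columns whose values match known PII patterns. The sketch below is a deliberately simplified illustration, not any vendor's actual detection logic; the column names and regex patterns are invented for the example.

```python
import re

# Hypothetical, simplified PII patterns; real discovery tools use far richer
# rule sets, validation logic, and confidence scoring.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan_table(rows):
    """Return {column: set of PII types} for columns whose values match a pattern."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(pii_type)
    return findings

# Sample rows standing in for a table scan.
rows = [
    {"id": 1, "contact": "jane@example.com", "note": "call 555-867-5309"},
    {"id": 2, "contact": "tax id 078-05-1120", "note": "ok"},
]
print(scan_table(rows))
```

A report built from these findings tells the DBA which columns to prioritize for masking, encryption, or access controls, without trawling tables by hand.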
This document summarizes an update on the Research Data Alliance (RDA). It discusses the growth of RDA membership and activities. Key points include:
- RDA works to reduce barriers to data sharing and exchange by building social, organizational and technical infrastructure.
- RDA has grown significantly since its launch in 2013, with over 2,500 members from over 90 countries working in various working groups.
- Working groups focus on developing deliverables like standards, best practices and code to enable data sharing in various domains and for community needs, data stewardship, and base infrastructure.
- The first deliverables have been presented, with more to come, aimed at making data sharing and discovery more trustworthy
The document discusses the Research Data Alliance (RDA), an international organization focused on data sharing. It provides information on RDA's vision, mission, members, activities, and outputs. RDA has over 6,400 members from 133 countries working in groups to develop infrastructure and standards to facilitate open data sharing across disciplines. The document outlines the various domain-specific and cross-cutting working groups and interest groups within RDA addressing issues like metadata, data citation, and interoperability.
EU GDPR Technical Workflow and Productionalization Necessary w/ Privacy Ass... - Steven Meister
GDPR = General Data Protection Regulation, or GDPR = Get Demand Payment Ready when you're hacked or audited.
A realistic project plan for GDPR compliance. Another reality: 95% are not ready, and even the 5% that say they are will not like what they see in this plan on the path to becoming GDPR compliant.
There is just not enough time or people to get it done in the next 8 months, or even in 2 years. This is a harsh reality; without software technology and strict yet flexible, repeatable methodologies, it just won't happen. Look at this project plan of what needs to be done, do the math, see the complexity of the data movement, code, and programs needed, then give us a call.
The document discusses key concepts in file organization and database management including bits, bytes, fields, records, files, and databases. It outlines problems with traditional file environments like data redundancy and lack of flexibility. Finally, it introduces database management systems which create and maintain databases, eliminate data definition statements, act as an interface between programs and data files, and separate logical and physical views of data.
Tamingthecompliancebeast stratanyc-171001162525 - Charan Sai
1) LinkedIn aims to balance data democracy (access to data for product development) with data protection (member privacy).
2) Key challenges include the need to discover data, access it easily, and delete specific data upon member requests while restricting access.
3) LinkedIn's solutions include building a metadata layer (WhereHows), a data access layer (Dali) that applies access controls and consent, and data lifecycle management tools (Gobblin) to delete data at scale.
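The deletion workflow in point 3) can be pictured as a rewrite-style purge: on an append-only store like HDFS you cannot delete individual records in place, so the lifecycle tool rewrites the dataset minus the records belonging to members who requested deletion. The sketch below is an illustration under assumed record shapes, not Gobblin's actual API.

```python
# Minimal sketch of a rewrite-style purge: produce a new copy of the dataset
# that omits records owned by members who requested deletion, then swap it in
# for the old copy. Field names here are assumptions for illustration.

def purge(records, deleted_member_ids):
    """Return a new dataset copy without the deleted members' records."""
    return [r for r in records if r["member_id"] not in deleted_member_ids]

dataset = [
    {"member_id": 1, "event": "page_view"},
    {"member_id": 2, "event": "click"},
    {"member_id": 1, "event": "search"},
]
cleaned = purge(dataset, deleted_member_ids={1})
print(len(cleaned))  # one surviving record, for member 2
```

At scale, the same idea runs as a distributed job over dataset partitions, with the deleted-ID set joined against each partition rather than held in memory.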
Taming the ever-evolving Compliance Beast: Lessons learnt at LinkedIn [Strat... - Shirshanka Das
Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features for users to define custom filtering rules using SQL, assigning such customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects like Hadoop, Hive, Gobblin, and Wherehows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used for enforcing compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size-fits-all solution to guaranteeing user data privacy, this talk will provide a blueprint and a concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn, and ANTHONY HSU, Staff Software Engineer, LinkedIn
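The "custom filtering rules using SQL" mentioned in the abstract can be imagined as read-time views that wrap the raw dataset in a compliance predicate, so non-consenting or deleted members are never visible to readers. The sketch below generates such a view definition; the table, column, and rule names are invented for illustration and this is not LinkedIn's actual Dali syntax.

```python
def compliance_view_sql(view_name, table, filter_rules):
    """Build a read-time view that applies per-dataset compliance filters.

    filter_rules is a list of SQL predicates (e.g. consent checks) that are
    ANDed together; readers query the view, never the raw table.
    """
    where = " AND ".join(f"({rule})" for rule in filter_rules)
    return f"CREATE VIEW {view_name} AS SELECT * FROM {table} WHERE {where}"

# Hypothetical rules for one dataset: a consent flag plus a deletion-request join.
sql = compliance_view_sql(
    "events_compliant",
    "raw_events",
    ["consent_flag = TRUE", "member_id NOT IN (SELECT id FROM deletion_requests)"],
)
print(sql)
```

Assigning a rule set to a dataset (or group of datasets) then amounts to regenerating and re-registering its view, which is what makes the approach manageable across thousands of datasets.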
LinkedIn Infrastructure (analytics@webscale, at fb 2013) - Jun Rao
This is the presentation at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
Amundsen: From discovering data to securing data - markgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Enterprise Data Marketplace: A Centralized Portal for All Your Data Assets - Denodo
Watch full webinar here: https://bit.ly/3OLv0jY
Organizations continue to collect mounds of data spread over different locations and in different formats. The challenge is navigating the vastness and complexity of the modern data ecosystem to find the right data to suit your specific business purpose. Data is an important corporate asset, and it needs to be leveraged but also protected.
By adopting an alternate approach to data management and a logical data architecture, data can be democratized while providing centralized control within a distributed data landscape. The web-based Data Catalog tool provides a single access point for secure enterprise-wide data access and governance. This corporate data marketplace provides visibility into your data ecosystem and allows data to be shared without compromising data security policies.
Catch this on-demand session to understand how this approach can transform how you leverage data across the business:
- Empower the knowledge worker with data and increase productivity
- Promote data accuracy and trust to encourage re-use of important data assets
- Apply consistent security and governance policies across the enterprise data landscape
Denodo’s Data Catalog: Bridging the Gap between Data and Business (APAC) - Denodo
Watch full webinar here: https://bit.ly/3nxGFam
Self-service is a major goal of modern data strategists. Denodo’s data catalog is a key piece of Denodo’s portfolio, bridging the gap between the technical data infrastructure and business users. It provides documentation, search, governance and collaboration capabilities, and data exploration wizards. It is the perfect companion for a virtual layer, fully empowering self-service initiatives with minimal IT intervention, and it gives business users the tools to generate their own insights with proper security, governance, and guardrails.
In this session you will learn about:
- The role of a virtual semantic layer in self service initiatives
- The key capabilities of Denodo’s new Data Catalog
- Best practices and advanced tips for a successful deployment
- How customers are using Denodo’s Data Catalog to enable self-service initiatives
This document discusses characteristics of big data and the big data stack. It describes the evolution of data from the 1970s to today's large volumes of structured, unstructured and multimedia data. Big data is defined as data that is too large and complex for traditional data processing systems to handle. The document then outlines the challenges of big data and characteristics such as volume, velocity and variety. It also discusses the typical data warehouse environment and Hadoop environment. The five layers of the big data stack are then described including the redundant physical infrastructure, security infrastructure, operational databases, organizing data services and tools, and analytical data warehouses.
Data Integration Alternatives: When to use Data Virtualization, ETL, and ESB - Denodo
Data integration is paramount. This presentation covers three different paradigms: client-side tools, traditional data warehouses, and the data virtualization solution (the logical data warehouse), comparing them and positioning data virtualization as an integral part of any future-proof IT infrastructure.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.
The document summarizes LinkedIn's experience scaling their Hadoop infrastructure from 1 cluster with 20 nodes and 10 users in 2008 to over 10 clusters with over 10,000 nodes and 1,000 users now. It discusses the challenges of scaling hardware and human infrastructure. It then introduces Dali, LinkedIn's system for managing metadata and data access that aims to make analytics infrastructure invisible through concepts like datasets, views, and lineage tracking. Key aspects of Dali include separating logical and physical concerns, versioning APIs and views, and expressing producer/consumer contracts as constraints.
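The separation of logical and physical concerns described above can be sketched as a tiny catalog that resolves a versioned logical dataset name to its current physical location, so consumers never hard-code paths. This is a toy illustration of the concept, not Dali's actual interface; the dataset names, versions, and paths are invented.

```python
# Toy catalog illustrating the logical/physical split: readers ask for a
# versioned logical dataset name; the catalog resolves the physical path.
class DatasetCatalog:
    def __init__(self):
        self._locations = {}  # (name, version) -> physical path

    def register(self, name, version, path):
        self._locations[(name, version)] = path

    def resolve(self, name, version):
        return self._locations[(name, version)]

catalog = DatasetCatalog()
catalog.register("tracking.page_views", "1.0", "/data/tracking/page_views/v1")
catalog.register("tracking.page_views", "2.0", "/data/tracking/page_views/v2")

# A physical migration only updates the catalog entry; consumers keep using
# the same logical name and version.
print(catalog.resolve("tracking.page_views", "2.0"))
```

Versioning the logical name is what lets producer/consumer contracts evolve: old readers stay pinned to "1.0" while new readers adopt "2.0", and the physical layout can change underneath either one.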
Hyperion Essbase Training from Hyperion Experts,
Tech thinkers Lab is a well known Hyperion training provider in the World, our Uniquer way of Hyperion training will make student easy to understand and Practice.
contact us for more details
Supporting GDPR Compliance through Data ClassificationIndex Engines Inc.
This document discusses how Index Engines technology can help organizations comply with the General Data Protection Regulation (GDPR). It provides an overview of the GDPR requirements and highlights Index Engines' capabilities such as data classification, reporting, disposition, and automated monitoring that allow organizations to know, manage, and govern their enterprise data in accordance with the GDPR. Index Engines provides a single platform to classify petabytes of data, enable flexible search and disposition of personal data, and demonstrate ongoing compliance through automated policy monitoring and auditing.
There’s a lot of ‘software vendor hype’ in support of the GDPR, but most of their solutions are ineffective because of limited features that cannot support the comprehensive compliance the GDPR demands.
Index Engines delivers an enterprise class classification, search and management solution to find all personal data under management with considerable precision. As an added bonus, Index Engines is proven to reduce costs and deliver an ROI through clean-up of content that no longer has business value.
During this webinar you'll learn:
Actionable approaches to managing petabytes of data
Proven strategies on classifying and finding personal data
How support for the GDPR can deliver an ROI
No marketing fluff, just a concrete workflow that will get you started
The GDPR consists of 99 articles that mandate how personal data is to be handled, but how do you manage years of data on various platforms?
Dell EMC and Index Engines together can deliver an intelligent, actionable approach to managing complex data environments in support of GDPR compliance. Providing deep intelligence across primary and secondary storage, this combination enables advanced classification, search and management allowing you to find personal data under management with considerable precision.
During the 60-minute web event, you’ll learn how to:
Approach GDPR compliance with a data-focused process
Mitigate risks with an intelligent, easy-to-implement workflow
Map and classify data to focus and streamline searches for personal data
Locate personal data with advanced search techniques
Better manage petabytes of data to control access and streamline costs
Consider alternatives to cross-border transfers to reduce potential risk
Cleaning up Redundant, Obsolete and Trivial Data to Reclaim Capacity and Mana...Index Engines Inc.
Data grows at a rate of 40-60% each year, but as capacity is expanded, redundant, obsolete and trivial user data - ROT - is clogging corporate networks resulting in unnecessary risk and expense.
Depending on industry, 40-70% of this data has no business value. Harnessing ROT growth will not only control expenses by reducing or eliminate storage upgrades, but also minimize risk.
Index Engines data profiling software supports ROT analysis and data disposition that ranges from terabytes to petabytes of enterprise content. It provides search, reporting, disposition and defensible deletion of data.
http://www.indexengines.com/storage-management/solutions-for/rot-analysis
DBAs - Is Your Company’s Personal and Sensitive Data Safe?DevOps.com
We have all seen the press coverage on corporate data breaches and compromises to personal data. You’ve probably heard about the new EU General Data Protection Regulation (GDPR) that came into effect in May last year, which affects any company that manages the personal data of EU residents. There are also some U.S. regulations that cover data privacy, such as HIPAA, HITECH, PCI and the CA Consumer Privacy Act.
Of these, GDPR is considered the most comprehensive when it comes to the needs of the individual and how their personal data should be protected and carries the harshest financial penalties for non-compliance.
The DBA is often the primary responsible party for implementing compliance controls and technical measures for protecting data. But the GDPR first requires an assessment of where PII and sensitive data is across multiple databases and this will be one of the first challenges a DBA will face before applying protection measures.
With many DBAs having to manually trawl through their database tables to identify sensitive data, what is needed is a fast, effective way to automate the discovery process and report on where sensitive data is stored. This would save time and enable companies to determine the most appropriate way to apply protective safeguards in order to minimize data breaches in the future and protect the business.
If you are a DBA responsible for your company’s data and are concerned about how to identify and protect your data, you should attend this webinar to find out how you can simplify and automate this task.
This document summarizes an update on the Research Data Alliance (RDA). It discusses the growth of RDA membership and activities. Key points include:
- RDA works to reduce barriers to data sharing and exchange by building social, organizational and technical infrastructure.
- RDA has grown significantly since its launch in 2013, with over 2,500 members from over 90 countries working in various working groups.
- Working groups focus on developing deliverables like standards, best practices and code to enable data sharing in various domains and for community needs, data stewardship, and base infrastructure.
- The first deliverables have been presented, with more to come, aimed at making data sharing and discovery more trustworthy
The document discusses the Research Data Alliance (RDA), an international organization focused on data sharing. It provides information on RDA's vision, mission, members, activities, and outputs. RDA has over 6,400 members from 133 countries working in groups to develop infrastructure and standards to facilitate open data sharing across disciplines. The document outlines the various domain-specific and cross-cutting working groups and interest groups within RDA addressing issues like metadata, data citation, and interoperability.
Eu gdpr technical workflow and productionalization neccessary w privacy ass...Steven Meister
GDPR = General Data Protection Regulations or GDPR = Get Demand Payment Ready when your hacked or audited.
A Realistic project plan for GDPR Compliance. Another reality is the 95% not ready and even the 5% that say they are, will not like what they see in this plan in the hopes of becoming GDPR compliant.
There is just not enough time or people to get it done in the next 8 months and even if you had
2 years. This is a harsh reality and without the use of software technology and strict yet flexible, repeatable methodologies, it just won’t happen. Look at this Project plan of what needs to be done, do the math, see the complexity of data movement and code and programs needed then give us a call.
The document discusses key concepts in file organization and database management including bits, bytes, fields, records, files, and databases. It outlines problems with traditional file environments like data redundancy and lack of flexibility. Finally, it introduces database management systems which create and maintain databases, eliminate data definition statements, act as an interface between programs and data files, and separate logical and physical views of data.
Tamingthecompliancebeast stratanyc-171001162525Charan Sai
1) LinkedIn aims to balance data democracy (access to data for product development) with data protection (member privacy).
2) Key challenges include the need to discover data, access it easily, and delete specific data upon member requests while restricting access.
3) LinkedIn's solutions include building a metadata layer (WhereHows), a data access layer (Dali) that applies access controls and consent, and data lifecycle management tools (Gobblin) to delete data at scale.
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Shirshanka Das
Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges of enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features that let users define custom filtering rules in SQL and assign those customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects such as Hadoop, Hive, Gobblin, and WhereHows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used to enforce compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size-fits-all solution for guaranteeing user data privacy, this talk provides a blueprint and a concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. ISSAC BUENROSTRO, Staff Software Engineer, LinkedIn, and ANTHONY HSU, Staff Software Engineer, LinkedIn
LinkedIn Infrastructure (analytics@webscale, at fb 2013) by Jun Rao
This is the presentation at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
Amundsen: From discovering data to securing data by Mark Grover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Enterprise Data Marketplace: A Centralized Portal for All Your Data Assets by Denodo
Watch full webinar here: https://bit.ly/3OLv0jY
Organizations continue to collect mounds of data and it is spread over different locations and in different formats. The challenge is navigating the vastness and complexity of the modern data ecosystem to find the right data to suit your specific business purpose. Data is an important corporate asset and it needs to be leveraged but also protected.
By adopting an alternative approach to data management and a logical data architecture, data can be democratized while providing centralized control within a distributed data landscape. The web-based Data Catalog tool provides a single access point for secure enterprise-wide data access and governance. This corporate data marketplace gives visibility into your data ecosystem and allows data to be shared without compromising data security policies.
Catch this on-demand session to understand how this approach can transform how you leverage data across the business:
- Empower the knowledge worker with data and increase productivity
- Promote data accuracy and trust to encourage re-use of important data assets
- Apply consistent security and governance policies across the enterprise data landscape
Denodo's Data Catalog: Bridging the Gap between Data and Business (APAC) by Denodo
Watch full webinar here: https://bit.ly/3nxGFam
Self-service is a major goal of modern data strategists. Denodo's data catalog is a key piece of Denodo's portfolio, bridging the gap between the technical data infrastructure and business users. It provides documentation, search, governance, and collaboration capabilities, along with data exploration wizards. It is the perfect companion for a virtual layer, fully empowering self-service initiatives with minimal IT intervention, and it gives business users the tools to generate their own insights with proper security, governance, and guardrails.
In this session you will learn about:
- The role of a virtual semantic layer in self service initiatives
- The key capabilities of Denodo's new Data Catalog
- Best practices and advanced tips for a successful deployment
- How customers are using Denodo's Data Catalog to enable self-service initiatives
This document discusses characteristics of big data and the big data stack. It describes the evolution of data from the 1970s to today's large volumes of structured, unstructured and multimedia data. Big data is defined as data that is too large and complex for traditional data processing systems to handle. The document then outlines the challenges of big data and characteristics such as volume, velocity and variety. It also discusses the typical data warehouse environment and Hadoop environment. The five layers of the big data stack are then described including the redundant physical infrastructure, security infrastructure, operational databases, organizing data services and tools, and analytical data warehouses.
Data Integration Alternatives: When to Use Data Virtualization, ETL, and ESB by Denodo
Data integration is paramount. This presentation covers three different paradigms: client-side tools, traditional data warehouses, and the data virtualization solution, the logical data warehouse. It compares the three and positions data virtualization as an integral part of any future-proof IT infrastructure.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/1q94Ka.
The document summarizes LinkedIn's experience scaling their Hadoop infrastructure from 1 cluster with 20 nodes and 10 users in 2008 to over 10 clusters with over 10,000 nodes and 1,000 users now. It discusses the challenges of scaling hardware and human infrastructure. It then introduces Dali, LinkedIn's system for managing metadata and data access that aims to make analytics infrastructure invisible through concepts like datasets, views, and lineage tracking. Key aspects of Dali include separating logical and physical concerns, versioning APIs and views, and expressing producer/consumer contracts as constraints.
The document describes a business intelligence software called Qiagram that allows non-technical domain experts to easily explore and query complex datasets through a visual drag-and-drop interface without SQL or programming knowledge. It provides centralized data management, integration with various data sources, and self-service visual querying capabilities to help researchers gain insights from their data.
Presentation on an overview of LinkedIn data driven products and infrastructure given on 26 Oct 2012 in the big-data symposium given in honor of the retirement of my PhD advisor Dr Martin H. Schultz.
Watch Alberto's presentation from Fast Data Strategy on-demand here: https://goo.gl/CRjYuD
In this session, we will review Denodo Platform 7.0 key capabilities.
Watch this session to learn more about:
• The vision behind the Denodo Platform
• The new data catalog and self-service features of Denodo Platform 7.0
• The new connectivity, data transformation, and enterprise-wide deployment features
This document summarizes an introductory presentation on data science. It introduces the presenter and their background in data and analytics. The goals of the presentation are to define what a data scientist is, how the field has emerged, and how to become one. It discusses the growing demand and salaries for data scientists. Examples are given of how data science has been applied at companies like LinkedIn and Netflix. The presentation covers big data, Hadoop, data processing techniques, machine learning algorithms, and tools used in data science. Finally, attendees are encouraged to consider Thinkful's data science bootcamp program.
The Great Lakes: How to Approach a Big Data Implementation by Inside Analysis
- Rick Stellwagen from Think Big, A Teradata Company, discussed best practices for implementing a data lake including establishing standards for data ingestion and metadata capture, developing a security plan, and planning for data discovery and reporting.
- Analyst Robin Bloor asked questions about metadata management, data governance, and security for data lakes. Bloor noted that while data lakes are a new concept, best practices are needed as organizations move analytics and BI capabilities to this model.
- Upcoming Briefing Room topics in 2015 will focus on big data, cloud computing, and innovators in technology.
Empowering your Enterprise with a Self-Service Data Marketplace (EMEA) by Denodo
This document outlines an agenda for an EMEA webinar about empowering enterprises with a self-service data marketplace. The agenda includes discussions of the data challenges facing users, how a data marketplace can help address those challenges, what constitutes a data marketplace, a demo of Denodo's data catalog tool, and a customer case study. Key benefits of a data marketplace mentioned are enabling self-service access to trusted data while maintaining governance over sensitive data and reducing dependency on IT.
This document summarizes a talk on using big data driven solutions to combat COVID-19. It discusses how big data preparation involves ingesting, cleansing, and enriching data from various sources. It also describes common big data technologies used for storage, mining, analytics and visualization including Hadoop, Presto, Kafka and Tableau. Finally, it provides examples of research projects applying big data and AI to track COVID-19 cases, model disease spread, and optimize health resource utilization.
Data Governance, Compliance and Security in Hadoop with Cloudera by Caserta
The document discusses data governance, compliance and security in Hadoop. It provides an agenda for an event on this topic, including presentations from Joe Caserta of Caserta Concepts on data governance in big data, and Patrick Angeles of Cloudera on using Cloudera for data governance in Hadoop. The document also includes background information on Caserta Concepts and their expertise in data warehousing, business intelligence and big data analytics.
This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).
Decision Ready Data: Power Your Analytics with Great Data by DLT Solutions
Murthy Mathiprakasam, Principal Product Marketing Manager at Informatica, shares how to power your analytics with great data from the 2015 Informatica Government Summit.
Similar to Balancing Data Democracy with Data Privacy: The LinkedIn Story (20)
1. **Introduction to Jio Cinema**:
- Brief overview of Jio Cinema as a streaming platform.
- Its significance in the Indian market.
- Introduction to retention and engagement strategies in the streaming industry.
2. **Understanding Retention and Engagement**:
- Define retention and engagement in the context of streaming platforms.
- Importance of retaining users in a competitive market.
- Key metrics used to measure retention and engagement.
3. **Jio Cinema's Content Strategy**:
- Analysis of the content library offered by Jio Cinema.
- Focus on exclusive content, originals, and partnerships.
- Catering to diverse audience preferences (regional, genre-specific, etc.).
- User-generated content and interactive features.
4. **Personalization and Recommendation Algorithms**:
- How Jio Cinema leverages user data for personalized recommendations.
- Algorithmic strategies for suggesting content based on user preferences, viewing history, and behavior.
- Dynamic content curation to keep users engaged.
5. **User Experience and Interface Design**:
- Evaluation of Jio Cinema's user interface (UI) and user experience (UX).
- Accessibility features and device compatibility.
- Seamless navigation and search functionality.
- Integration with other Jio services.
6. **Community Building and Social Features**:
- Strategies for fostering a sense of community among users.
- User reviews, ratings, and comments.
- Social sharing and engagement features.
- Interactive events and campaigns.
7. **Retention through Loyalty Programs and Incentives**:
- Overview of loyalty programs and rewards offered by Jio Cinema.
- Subscription plans and benefits.
- Promotional offers, discounts, and partnerships.
- Gamification elements to encourage continued usage.
8. **Customer Support and Feedback Mechanisms**:
- Analysis of Jio Cinema's customer support infrastructure.
- Channels for user feedback and suggestions.
- Handling of user complaints and queries.
- Continuous improvement based on user feedback.
9. **Multichannel Engagement Strategies**:
- Utilization of multiple channels for user engagement (email, push notifications, SMS, etc.).
- Targeted marketing campaigns and promotions.
- Cross-promotion with other Jio services and partnerships.
- Integration with social media platforms.
10. **Data Analytics and Iterative Improvement**:
- Role of data analytics in understanding user behavior and preferences.
- A/B testing and experimentation to optimize engagement strategies.
- Iterative improvement based on data-driven insights.
Build applications with generative AI on Google Cloud by Márton Kodok
We will explore Vertex AI Model Garden-powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to:
- execute prompts in text and chat
- cover multimodal use cases with image prompts
- fine-tune and distill to improve knowledge domains
- run function calls with foundation models to optimize them for specific tasks
At the end of the session, developers will understand how to innovate with generative AI and develop apps following generative AI industry trends.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data by Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... by Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... by Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Balancing Data Democracy with Data Privacy: The LinkedIn Story
1. Balancing Data Democracy with Data
Privacy: The LinkedIn Story
Jan 25, 2018
Eric Ogren
Anthony Hsu
Big Data Meetup, LinkedIn SF
1
2. We needed data democracy to deliver member value
LinkedIn Data Science
I want to analyze as much data as possible so my models are accurate
Data Democracy
ALL THE DATA, ALL THE TIME
I want to discover data that's needed for my analysis as fast as possible
I want to access that data as quickly as possible for my analysis
2
3. I want my personal data to be stored only where needed and not propagated unnecessarily
Data Protection
Need to Ensure Member Privacy
LinkedIn Members
STORE, PROCESS, DELETE, ...
I want my personal data to be deleted when I close my account or request deletion
I want my personal data to only be processed if essential and only if I consent
3
4. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
4
6. Data Hubs at LinkedIn
In Motion: O(10) clusters, ~2.3 Trillion messages / day, ~450 TB written / day
At Rest: O(10) clusters, ~10K machines, ~XXX PB at rest
6
8. REQUIREMENTS
Less Data
Legal: Right to Erasure or Right to be Forgotten
"Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn"
Engineering: Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems
8
9. A lot of data, in many different formats
Challenges
Understand HDFS data: organization, formats, ...
Cycle asynchronously, within an SLA, deleting records without affecting running jobs
Quarantine exceptional records for manual triage
Must scale to processing hundreds of PB of data
Data Deletion
IMPLICATIONS FOR HADOOP
9
10. Gobblin: The Logical Pipeline
[Diagram: a Source is split into Work Units; each Work Unit runs as a Task executing Extract, Convert, Quality, and Write stages; completed task output is then published in a Data Publish step.]
10
11. Gobblin: Extending for Purge
[Diagram: a Work Unit reads from HDFS and runs Extract, Convert, Quality, and Write as a Task; the pipeline consults members' delete requests ("if needs purge then drop, else continue") before the Data Publish step writes the result back to HDFS.]
11
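A minimal sketch of the purge step described on this slide: records whose member ID appears in the set of delete requests are dropped, and everything else is rewritten unchanged. The names (`purge_records`, `member_id`) are illustrative assumptions, not Gobblin's actual API.

```python
def purge_records(records, delete_requests):
    """Rewrite a dataset, dropping records for members who requested deletion.

    records: iterable of dicts, each carrying a 'member_id' field
    delete_requests: set of member IDs to purge
    """
    for record in records:
        if record["member_id"] in delete_requests:
            continue  # "if needs purge then drop"
        yield record  # "else continue": record is republished as-is

records = [
    {"member_id": 1, "name": "Alice"},
    {"member_id": 2, "name": "Bob"},
    {"member_id": 3, "name": "Carol"},
]
purged = list(purge_records(records, delete_requests={2}))
print([r["member_id"] for r in purged])  # [1, 3]
```

Because the output is a full rewrite of the input, this matches the "immutable storage formats" challenge on the next slide: even one deleted record forces the surrounding files to be rewritten.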
12. STATUS AND CHALLENGES
Gobblin: Data Lifecycle Management at Scale
Status
Number of datasets: many thousands
Amount of data scanned for purge: hundreds of TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
12
13. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
13
16. Metadata based Search Experience for Data Scientists
Data Discovery
Where is dataset X?
How did it get created?
Usage: In production since 2014
Users: Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis
WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS
Open source @ github.com/linkedin/wherehows
16
19. More than just Discovery
Use Cases
Which datasets at LinkedIn contain PII or highly confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
Discovering Violations
ANSWERING HARDER QUESTIONS
19
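With per-dataset metadata of this kind, the "harder questions" above reduce to simple filters over a catalog. The sketch below is illustrative only; the field names (`contains_pii`, `teams`, `last_purge`) are assumptions, not WhereHows' real data model.

```python
from datetime import datetime, timedelta

# A toy metadata catalog: one entry per dataset.
catalog = [
    {"name": "member_profile", "contains_pii": True,
     "teams": {"growth", "trust"}, "last_purge": datetime(2018, 1, 24)},
    {"name": "page_views", "contains_pii": False,
     "teams": {"growth"}, "last_purge": datetime(2018, 1, 2)},
]

def pii_datasets_accessible_by(team):
    """Which PII-bearing datasets can a given team access?"""
    return [d["name"] for d in catalog
            if d["contains_pii"] and team in d["teams"]]

def purged_within_sla(now, sla=timedelta(days=7)):
    """Which datasets have been purged within the SLA window?"""
    return [d["name"] for d in catalog if now - d["last_purge"] <= sla]

print(pii_datasets_accessible_by("growth"))      # ['member_profile']
print(purged_within_sla(datetime(2018, 1, 25)))  # ['member_profile']
```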
20. Wide + Deep Metadata
Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, ...
Deeper understanding of each dataset
Schema is not enough
Need to understand semantics
Discovering Violations
REQUIREMENTS
20
21. A METADATA REFINERY APPROACH
WhereHows Architecture @ 10,000 ft
ML-driven refinements
21
22. METADATA SHOULD LOOK JUST LIKE DATA
WhereHows Architecture @ 10,000 ft
[Architecture diagram: data systems emit technical metadata (data catalogs, process definitions, code) and services + jobs emit operational metadata (data publish, data access, job executions), delivered as snapshots and streams; ML-driven refinements produce a Unified Metadata Dataset, which feeds a Metadata Serving Repository (key-value, search, graph) backing the WhereHows application and LinkedIn community annotations.]
22
23. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
23
24. METADATA
DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
24
25. Simple to Complex
Different Types
Basic Restrictions: access to a dataset based on business need
Privacy by Default: analysts shouldn't get access to raw PII (Personally Identifiable Information) by default
Consent-based Access: access to certain data elements is only available if the member has consented for that particular use-case
Access Restrictions
REQUIREMENTS
25
27. HARD TO CHANGE ANYTHING UNDERNEATH!
Challenge for Infrastructure Providers
[Diagram: Pig scripts read "My Raw Data" directly via native readers, with dependencies on paths and formats hard-coded, making it hard to move to better formats without breaking everyone or copying data twice.]
27
28. HARD TO CHANGE ANYTHING UPSTREAM!
Semantic Challenges
Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change data processing logic everywhere!
28
29. AN API TO MANAGE EVOLUTION
We need "microservices" for Data
[Diagram: "My Data API" sits in front of "My Raw Data".]
29
30. A DATA ACCESS LAYER FOR LINKEDIN
We built Dali to solve this
Dataset Readers
Dataset Tooling
Abstract away underlying physical details to allow users to focus solely on the logical concerns
30
31. Dali: Implementation Details in Context
[Diagram: dataset owners publish Dali Datasets (Tables + Views) and UDFs via Git + Artifactory and a Data Catalog; Dali Readers expose them to dataflow APIs (MR, Spark, Scalding) and query layers (Pig, Hive, Spark).]
31
32. STEP 1: DATA + METADATA
Solving for Compliant Access
MemberProfile
Schema = {
  int memberId
  String firstName
  String lastName
  Position[] positions
  EducationHistory[] educationHistory
  …
}
Metadata annotations:
NAME : is_pii
MEMBER_ID : is_pii
[Diagram: the Raw Dataset is paired with its Metadata.]
32
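The pairing on this slide can be sketched as a schema plus field-level annotations, so a downstream reader can decide per column whether a compliance transformation is needed. Field names follow the MemberProfile example; the shape of the metadata dictionary is an assumption.

```python
# Schema (field names) paired with per-field metadata annotations.
schema = ["memberId", "firstName", "lastName", "positions", "educationHistory"]
metadata = {
    "memberId": {"is_pii": True},
    "firstName": {"is_pii": True},
    "lastName": {"is_pii": True},
}

def columns_needing_transformation(schema, metadata):
    """Return the columns flagged is_pii, i.e. those a compliant reader
    must obfuscate or null out."""
    return [col for col in schema if metadata.get(col, {}).get("is_pii")]

print(columns_needing_transformation(schema, metadata))
# ['memberId', 'firstName', 'lastName']
```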
33. STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
33
35. A BITMAP DATASET: ONE PER MEMBER PER SETTING
Privacy Preferences
Member Privacy Preferences
35
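"One bitmap per setting, one bit per member" can be sketched compactly with a bytearray as the bitmap. This is purely illustrative; the actual storage format of LinkedIn's preferences dataset is not specified in the slides.

```python
class ConsentBitmap:
    """One bit per member for a single privacy setting."""

    def __init__(self, num_members):
        self.bits = bytearray((num_members + 7) // 8)

    def set_consent(self, member_id, consented):
        byte, bit = divmod(member_id, 8)
        if consented:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit)

    def has_consented(self, member_id):
        byte, bit = divmod(member_id, 8)
        return bool(self.bits[byte] >> bit & 1)

# One bitmap per setting; member 42 consents to this setting.
ads_consent = ConsentBitmap(num_members=1000)
ads_consent.set_consent(42, True)
print(ads_consent.has_consented(42))  # True
print(ads_consent.has_consented(7))   # False
```

A bitmap keeps the consent lookup O(1) and the whole dataset small enough to join against every read, which is what the Dali reader on the next slide relies on.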
36. Solving for Compliant Access With Dali
[Diagram: processing logic with Use Case = X reads the Raw Dataset through the Dali Reader Library, which combines the dataset's Metadata with the Member Privacy Preferences.]
Dali Reader responsibility:
Given: (Dataset, Metadata, UseCase)
Generate: dataset- and column-level transformations (obfuscate, null, ...)
Auto-join with Member Privacy Preferences (filter out data elements that are not consented to)
36
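The Dali Reader contract on this slide can be sketched as follows: given rows, field metadata, and a use case, obfuscate PII columns and drop rows for members who have not consented to that use case. All names here are illustrative, and the SHA-256 obfuscation is an assumed placeholder for whatever transformation the real reader applies.

```python
import hashlib

def obfuscate(value):
    # Assumed stand-in for a real obfuscation transformation.
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]

def compliant_read(rows, metadata, use_case, consent):
    """consent: dict mapping use_case -> set of consenting member IDs."""
    allowed = consent.get(use_case, set())
    for row in rows:
        if row["memberId"] not in allowed:
            continue  # auto-join with privacy preferences: drop non-consented rows
        out = {}
        for col, value in row.items():
            if metadata.get(col, {}).get("is_pii"):
                out[col] = obfuscate(value)  # column-level transformation
            else:
                out[col] = value
        yield out

rows = [{"memberId": 1, "firstName": "Alice", "country": "US"},
        {"memberId": 2, "firstName": "Bob", "country": "DE"}]
metadata = {"firstName": {"is_pii": True}}
result = list(compliant_read(rows, metadata, "analytics", {"analytics": {1}}))
print(len(result), result[0]["country"])  # 1 US
```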
37. Compliance Transformations: Under the Hood
[Diagram: the original logical plan (Table Scan Operator -> Filter Operator -> Select Operator) is rewritten with a GDPR Operator inserted, driven by the dataset's Metadata, the Query Context, and the member's Privacy Settings.]
37
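The plan rewrite on this slide can be illustrated in miniature: treat a plan as a list of row-transforming operators and splice a compliance operator in right after the scan. This is a toy model, not Hive's or Dali's actual plan representation.

```python
def table_scan(rows):
    # Leaf operator: yields rows from the underlying dataset.
    return iter(rows)

def make_gdpr_operator(privacy_settings):
    # privacy_settings: member_id -> has the member consented?
    def gdpr(rows):
        for row in rows:
            if privacy_settings.get(row["member_id"], False):
                yield row  # only consented rows flow downstream
    return gdpr

def run_plan(plan, rows):
    result = rows
    for operator in plan:
        result = operator(result)
    return list(result)

plan = [table_scan]
# Query rewrite: insert the GDPR operator immediately after the scan.
plan.insert(1, make_gdpr_operator({1: True, 2: False}))

rows = [{"member_id": 1}, {"member_id": 2}]
print(run_plan(plan, rows))  # [{'member_id': 1}]
```

Inserting the operator at plan-rewrite time means every query against the dataset picks up the compliance filtering, without any change to user code.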
38. Solving for Compliant Purging With Dali + Gobblin
[Diagram: the Gobblin Purger reads the Raw Dataset through the Dali Reader Library with Use Case = Purge, combining the dataset's Metadata, Member Privacy Preferences, and Member Delete Requests to produce the Purged Dataset.]
38
39. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
DATA ACCESS LAYER
39
40. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox: Solved!
METADATA
DATA ACCESS LAYER
DATA LIFECYCLE MANAGEMENT
40
41. DATA DEMOCRACY + DATA PROTECTION
The Technology Blueprint
METADATA: WhereHows*
DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
* Open Source: We can collaborate on these together!
41