The document discusses how LinkedIn handles personally identifiable information (PII) in its content ingestion system, Babylonia. It outlines several options for handling PII when a member's account is closed: having upstream systems notify Babylonia, actively refetching all first-party URLs, or eliminating member PII from Babylonia entirely. It describes how Babylonia uses tools like WhereHows, Dali, and Apache Gobblin to track PII in datasets as they move between systems and to purge PII when necessary. The document advocates a blended approach of notification, active refetching, and whitelisting first-party URLs to balance compliance needs with user experience.
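The blended approach described above can be sketched as simple routing logic. This is a hypothetical illustration only, not LinkedIn's actual implementation; the function name, return values, and whitelist entries are invented for clarity.

```python
from urllib.parse import urlparse

# Hypothetical whitelist of first-party domains whose content is
# refetched rather than purged (illustrative names only).
FIRST_PARTY_DOMAINS = {"linkedin.com", "slideshare.net"}

def handle_account_closure(url: str, notified_by_upstream: bool) -> str:
    """Decide how to scrub PII for one ingested URL after account closure.

    Blends the three options from the document: act on upstream
    notification, actively refetch whitelisted first-party URLs, and
    purge member PII from everything else.
    """
    host = urlparse(url).netloc.lower()
    is_first_party = any(host == d or host.endswith("." + d)
                         for d in FIRST_PARTY_DOMAINS)
    if notified_by_upstream:
        return "purge-record"   # upstream told us the member left
    if is_first_party:
        return "refetch"        # whitelisted first-party content: refetch it
    return "purge-record"       # default: eliminate member PII
```

The point of the blend is that refetching is reserved for the small, trusted first-party set, while everything else falls through to purging, keeping the default compliant.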
Web Information Network Extraction and Analysis (Tim Weninger)
Tim Weninger presents a tutorial on information network analysis and extraction from the semi-structured web. The tutorial covers preliminaries on information extraction and integration from web pages and social networks. It also discusses ranking, clustering, and analyzing the structure and content of information on the web.
Member privacy is of paramount importance to LinkedIn. The company must protect the sensitive data users provide. On the other hand, our members join LinkedIn to find each other, necessitating the sharing of certain data. This privacy paradox can only be addressed by giving users control over where and how their data is used. While this approach is extremely important, it also presents scaling challenges.
In this talk, we will discuss the challenges behind enforcing compliance at scale as well as LinkedIn's solution. Our comprehensive record-level offline compliance framework includes schema metadata tracking, alternate read-time views of the same dataset, physical purging of data on HDFS, and features for users to define custom filtering rules using SQL, assigning such customizations to specific datasets, groups of datasets, or use cases. We achieve this using many open-source projects like Hadoop, Hive, Gobblin, and Wherehows, as well as a homegrown data access layer called Dali. We also show how the same Hadoop-powered framework can be used for enforcing compliance on other stores like Pinot, Salesforce, and Espresso.
While there is no one-size-fits-all solution to guaranteeing user data privacy, this talk will provide a blueprint and concrete example of how to enforce compliance at scale, which we hope proves useful to organizations working to improve their privacy commitments. Speakers: Issac Buenrostro, Staff Software Engineer, LinkedIn, and Anthony Hsu, Staff Software Engineer, LinkedIn.
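The "alternate read-time views" idea from the abstract can be sketched with an in-memory database: raw records stay in place, and downstream consumers read through a view that applies the compliance filter. This is a hypothetical sketch; the table, column, and view names are invented, and the real framework operates on HDFS datasets via Dali rather than a relational database.

```python
import sqlite3

# Sketch of a read-time compliance view: raw records stay untouched,
# but consumers query a view that filters out closed accounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (member_id INTEGER, page TEXT);
    CREATE TABLE closed_accounts (member_id INTEGER PRIMARY KEY);
    INSERT INTO page_views VALUES (1, '/feed'), (2, '/jobs');
    INSERT INTO closed_accounts VALUES (2);

    -- The "compliant" view downstream jobs read instead of the raw table.
    CREATE VIEW page_views_compliant AS
    SELECT * FROM page_views
    WHERE member_id NOT IN (SELECT member_id FROM closed_accounts);
""");
rows = conn.execute("SELECT member_id, page FROM page_views_compliant").fetchall()
print(rows)  # only the active member's record survives the filter
```

Separating the raw data from the view is what lets filtering rules be customized per dataset or use case without rewriting the underlying records; physical purging can then run asynchronously.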
Oracle database threats - LAOUC Webinar (Osama Mustafa)
This document discusses database security and how databases can be hacked. It begins by introducing the presenter and their qualifications. It then discusses why database security is important for protecting financial, customer and organizational data. Common ways databases are hacked include gathering information through search engines or social media, scanning for vulnerabilities, gaining unauthorized access, and maintaining that access. Specific attacks on Oracle databases and the most common database security threats are outlined, such as weak authentication, denial of service attacks, and SQL injection. The document provides examples of how to test for and exploit SQL injection vulnerabilities. It emphasizes the importance of securing databases to prevent data theft and protect sensitive information.
Geek Sync | Handling HIPAA Compliance with Your Data Access (IDERA Software)
The document discusses how to ensure compliance with HIPAA regulations when handling electronic protected health information (ePHI) stored in SQL Server databases. It addresses five key questions around auditing access to ePHI, defining a secure SQL Server configuration baseline, implementing repeatable security processes, auditing permissions and changes in SQL Server, and maintaining ongoing compliance. The presenter provides recommendations for secure configurations, including role-based access control, encryption of data at rest and in transit, and auditing access through features like extended events and audit objects. Maintaining repeatable processes for security and change management is emphasized as important for compliance.
Preventing Security Leaks in SharePoint with Joel Oleson & Christian Buckley (Joel Oleson)
With recent news of one of the largest security breaches in US history, many organizations are looking to their SharePoint environments to better understand just how vulnerable their data is, and whether they have adequate governance policies and procedures in place to prevent a similar breach.
In this webinar, we'll discuss some of what happened in the case of Snowden and the NSA's SharePoint environment, and clarify the differences between willful intent versus poor governance planning. We'll help you to outline steps you can take within your own organization to improve security and lock down permissions, closing off any gaps within your governance strategy.
Create a Compliance Strategy for Office 365 (Erica Toelle)
SharePoint, OneDrive, Microsoft Teams, Exchange, Skype…there are a lot of collaboration tools for creating content in the Microsoft stack. Highly regulated and government organizations have advanced compliance and records management needs, some of which are tricky to meet with out-of-the-box Microsoft tools such as Cloud App Security, Azure Information Protection, and Advanced Data Governance in Office 365. How can you ensure that content is retained properly and that the right processes are in place for compliance?
In this session, you will learn about Microsoft’s out of the box compliance and records management features, as well as how to extend them to meet advanced requirements. Whether you are a decision maker, IT Pro tasked with implementation, or an information management professional tasked with compliance, this workshop is for you.
RMS, EFS, and BitLocker are Microsoft data protection technologies that can help prevent data leakage. RMS allows users to apply usage policies to files and encrypts files to control access. EFS transparently encrypts files stored locally on a computer. BitLocker encrypts fixed and removable drives to protect data at rest. The technologies provide different levels of protection and have varying capabilities for controlling access to data inside and outside an organization.
This presentation aims to guide security experts and developers in protecting PaaS deployments and eliminating security threats. It also introduces threat modeling.
The current Microsoft Power BI governance guidance and recommendations, including the changes following the November Power BI release and PASS conference announcements.
A datastore is a central system that offers data in a consistent place and provides important context and metadata about the data. It allows data owners to publish their data through various methods like email submissions, file dropboxes, data proxies, or data replication. The document also discusses challenges around building and maintaining datastores, as well as opportunities for engaging communities around open data.
Stop the Madness! A Practical Guide to Making Your Data Catalog Strategy Work (DATAVERSITY)
This document discusses the need for a business intelligence (BI) portal to help solve governance issues and improve data discoverability, access, and usage. It outlines problems with current approaches to data governance that lead to declining usage of data catalogs. The proposed solution is a BI portal that delivers metadata to users where they consume it, through BI tools, data catalogs, and Excel. The portal would make useful content discoverable and provide value to three key user personas: business users, content publishers, and data governance teams. It would be demoed to show how it addresses these needs.
The document discusses building a data warehouse in SQL Server. It provides an agenda that covers topics like an overview of data warehousing, data warehouse design, dimension and fact tables, and physical design. It also discusses components of a data warehousing solution like the data warehouse database, ETL processes, and security considerations.
SharePoint migrations rarely turn out as you plan them. They are sometimes risky and too often take longer than planned. Over the last 10 years of migrating from SharePoint 2003, 2007, 2010 to the latest versions of SharePoint/Office 365 we’ve seen a consistent theme -- organizations underestimate the complexity and level of effort required for a successful, smooth migration.
Whether you are planning to complete your own migration, or engaging a vendor to assist, this webinar will discuss precautions you can take to avoid the slippery slope experienced in SharePoint migrations.
5 Tips to Optimize SharePoint While Preparing for Hybrid (Adam Levithan)
For organizations planning to migrate to a hybrid deployment of their SharePoint and Office 365 infrastructure, optimizing their current SharePoint environment is a crucial step in reducing the work required for a successful migration, increasing end-user performance, and decreasing the risk of an unsuccessful migration.
Join Metalogix SharePoint expert Adam Levithan on March 17, 2016 for 5 Tips to Optimize SharePoint While Preparing for a Hybrid Deployment, a comprehensive live webinar where he unveils the top five optimizations that organizations need to consider before they plan to move to a SharePoint 2013 or SharePoint 2016 hybrid deployment.
Key takeaways
Such optimizations will help SharePoint Admins and IT professionals:
• Provide the best end-user experience
• Gain early warnings as performance issues are developing
• Obtain better insight into the interdependency between SharePoint infrastructure and applications
With the upcoming release of SharePoint 2016, hybrid deployments are quickly becoming the new standard for SharePoint deployments. Adam has helped several companies migrate successfully to hybrid deployments and will be happy to share his insights, experience, and solutions with attendees.
If you’re already using or thinking of moving to Microsoft Office 365, you’ll need to think about where to store your precious documents.
Microsoft SharePoint integrates with Office 365 and allows organisations to set up a centralised, password protected space to store and manage documents, create an intranet and collaborate on projects.
In this webinar with charity IT experts, Co-Operative Systems, we look at:
• What is SharePoint and why use it
• Key features explained
• Migrating to SharePoint and what it doesn't say on the tin
• Practical demonstration of how SharePoint works
• Question & Answer
About Co-Operative Systems:
Co-Operative Systems have helped over 2,000 users onto Microsoft's Office 365 platform, and have been providing specialist IT support services to the non-profit sector since 1987. Their annual Where ITs @ event for charities is hosted by Microsoft. Read more about Co-Operative Systems at: www.coopsys.net
Best practices for security and governance in SharePoint 2013, published (AntonioMaio2)
Microsoft SharePoint provides features and capabilities enabling you to secure access, control authentication, and authorize access to information. Choosing which capabilities to use, configuring them, and understanding their impact can be a complex task. In this session you will learn about the key security features available in Microsoft SharePoint 2013 and the best practices for using them. The session begins by discussing the business reasons organizations need to consider when securing their SharePoint content, and then reviews specific capabilities and options in detail with recommendations. We'll also review various governance best practices and how they relate to SharePoint security capabilities. Throughout the session, you'll hear examples from large commercial enterprises, government, and the military, and about the best practices they use to secure their content within SharePoint.
This document describes a cloud-based data integration platform called Conductor. It addresses common problems with data integration like manual work required, inability to integrate non-critical systems, and lack of data governance. Conductor automatically maps data between systems, works with any data source/format, and provides data profiling, analytics, encryption and metadata tools. It can be used for tasks like data warehousing, customer integration, database migration and cloud integration. The platform aims to simplify and automate data integration work that traditionally requires custom coding and separate point solutions.
LinkedIn Infrastructure (analytics@webscale, at fb 2013) (Jun Rao)
This is the presentation at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
Enterprise data serves both running business operations and managing the business. Building a successful data architecture is challenging due to data complexity, competing stakeholder interests, data proliferation, and inaccuracies. A robust data architecture must address key components like data repositories, capture and ingestion, definition and design, integration, access and distribution, and analysis.
Webcast slides for "Low Risk and High Reward in App Decomm with InfoArchive a... (Tom Rieger)
Platform 3 Solutions presented these slides on January 17, 2019 with Opentext to give everyone an opportunity to understand the value in removing systems from their operations
Webcast slides for "Low Risk and High Reward in App Decomm with InfoArchive a... (Tracy Blackburn)
This document discusses how organizations can use InfoArchive and Archon to retire outdated legacy applications and systems. It begins with an overview of challenges like rising costs and compliance issues associated with maintaining old systems. A demo is shown where Archon can connect to a 25-year old legacy application, extract data, map it to InfoArchive, and generate queries in under 7 minutes. Examples are given of clients who were able to reduce wasted budgets and drive data center consolidation by using these tools. The document encourages contacting Platform 3 Solutions or an OpenText seller to learn more about a proof of concept for analyzing one's own legacy systems.
A Guide To Single Sign-On for IBM Collaboration Solutions (Gabriella Davis)
Single sign-on, single identity and even password synchronization—in this session, we will take you through all the options available to minimize or eradicate logins across IBM's Collaboration Solutions (ICS); whether it is a Domino web server, IHS, Notes client, Traveler, Sametime, Connections or Verse, on-premises or cloud. The discussion will cover security certificates, password synchronization, IWA, SPNEGO and SAML Federation. We will explain what you can (and can't) do, and how to do it. Presented at Think 2018
Unit 2 - Chapter 7 (Database Security).pptx (SakshiGawde6)
This document discusses database security concepts. It explains that databases store sensitive organizational data so security is important. It describes database security layers including server, network, operating system, data encryption, and database levels. Database security involves balancing access for users' jobs with restricting sensitive data. Permissions at each level control what users can access.
This document provides an overview of database management systems and the entity relationship model. It discusses:
1) The components and purpose of a DBMS including data storage and management, data independence, and concurrent access.
2) Database users including administrators, designers, end users, and application developers.
3) The three schema architecture including the internal, conceptual, and external levels and mappings between them.
4) Entity relationship modeling concepts such as entities, attributes, relationships and constraints which allow conceptualization of data.
IWMW 2002: Web standards briefing (session C2) (IWMW)
Web Standards Briefing session at IWMW 2002 event by Brian Kelly.
See http://www.ukoln.ac.uk/web-focus/events/workshops/webmaster-2002/materials/kelly1/
Linked Open Data combines open data and linked data by making open data available on the web in a way that is machine-readable and semantically interlinked. It uses URIs and RDF to identify things and their properties and relationships, and links data from different sources to enable discovery of related data. Publishing and consuming Linked Open Data allows data sharing and integration to create new knowledge and applications. Key steps involve identifying, cleaning, and publishing data as RDF while linking it to other datasets, then consuming and combining it with other sources. Major Linked Open Data sources include data from governments, Wikipedia, and other organizations.
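The triple model at the heart of the description above can be illustrated without any RDF library: each fact is a subject-predicate-object triple where URIs name both things and relationships, and an `owl:sameAs` link connects equivalent resources across datasets. This is a toy sketch in plain Python; real deployments use RDF serializations such as Turtle and proper triple stores, and the population value here is illustrative.

```python
# Each fact is a (subject, predicate, object) triple. URIs identify both
# the things and their relationships; the owl:sameAs triple is how one
# dataset (DBpedia) links to the equivalent resource in another (Wikidata).
triples = [
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/ontology/City"),
    ("http://dbpedia.org/resource/Berlin",
     "http://dbpedia.org/ontology/populationTotal",
     "3769000"),
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://www.wikidata.org/entity/Q64"),
]

def objects_of(subject, predicate):
    """Tiny pattern query: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow the cross-dataset link to discover the related Wikidata resource.
links = objects_of("http://dbpedia.org/resource/Berlin",
                   "http://www.w3.org/2002/07/owl#sameAs")
print(links)
```

Because every identifier is a dereferenceable URI, a consumer that finds the `sameAs` link can fetch the Wikidata resource and merge its facts with the DBpedia ones, which is the "combining with other sources" step the paragraph describes.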
Multiple cloud storage platforms. Hybrid environments. File shares. Cloud storage. Are you struggling with migrating from one platform to another while maintaining several cloud platforms?
In this informative webinar with Daniel Cohen-Dumani, CEO at Portal Solutions, and Doak Williford of SkySync, learn how to easily migrate from DropBox to Office 365, while maintaining hybrid on-premise and cloud environments. Ease the pain of cutover migration.
Role of Data Cleaning in Data Warehouse (Ramakant Soni)
Data cleaning is an essential part of building a data warehouse as it improves data quality by detecting and removing errors and inconsistencies. Data warehouses integrate large amounts of data from various sources, so the probability of dirty data is high. Clean data is vital for decision making based on the data warehouse. The data cleaning process involves data analysis, defining transformation rules, verification of cleaning, applying transformations, and incorporating cleaned data. Tools can help support the different phases of data cleaning from data profiling to specialized cleaning of particular domains.
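The phases described above, analyzing the data, defining transformation rules, and applying them before loading, can be illustrated with a toy cleaning pass. The records, field names, and rules below are invented for the example; real warehouse cleaning runs inside ETL tooling over far messier sources.

```python
import re

# Toy records as they might arrive from two source systems, with the
# same person represented inconsistently across them.
raw = [
    {"name": " Alice Smith ", "phone": "(555) 010-2000", "country": "US"},
    {"name": "alice smith",   "phone": "555.010.2000",   "country": "usa"},
    {"name": "Bob Jones",     "phone": "555-010-3000",   "country": "US"},
]

def clean(record):
    """Apply simple transformation rules: normalize name whitespace and
    case, strip phone punctuation, and map country codes to one form."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "phone": re.sub(r"\D", "", record["phone"]),
        "country": {"usa": "US"}.get(record["country"].lower(),
                                     record["country"].upper()),
    }

cleaned = [clean(r) for r in raw]
# Deduplicate on the cleaned representation: the two Alice rows collapse
# into one because cleaning made them identical.
unique = list({tuple(sorted(r.items())): r for r in cleaned}.values())
print(len(unique))  # 2
```

Note that deduplication only works *after* the transformation rules run; on the raw records all three rows look distinct, which is why cleaning precedes integration in the warehouse pipeline.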
In this informative webinar with Daniel Cohen-Dumani, CEO at Portal Solutions, and Doak Williford of SkySync, learn how to easily migrate from DropBox to Office 365, while maintaining hybrid on-premise and cloud environments. Ease the pain of cutover migration.
Role of Data Cleaning in Data WarehouseRamakant Soni
Data cleaning is an essential part of building a data warehouse as it improves data quality by detecting and removing errors and inconsistencies. Data warehouses integrate large amounts of data from various sources, so the probability of dirty data is high. Clean data is vital for decision making based on the data warehouse. The data cleaning process involves data analysis, defining transformation rules, verification of cleaning, applying transformations, and incorporating cleaned data. Tools can help support the different phases of data cleaning from data profiling to specialized cleaning of particular domains.
David Max – SATURN 2018: Handling Personal Information in LinkedIn's Content Ingestion System
2. About Me
• Software Engineer at LinkedIn NYC since 2015
• Content Ingestion team
• Office Hours – Thursday 11:30-12:00
David Max
Senior Software Engineer
LinkedIn
www.linkedin.com/in/davidpmax/
3. About LinkedIn New York Engineering
• Located in the Empire State Building
• Approximately 100 engineers and 1,000 employees total
• Multiple teams: front end, back end, and data science
4. Disclaimers
• I’m not a lawyer
• Some details omitted
• I am not a spokesperson for official LinkedIn policy
5. Our Mission
Create economic opportunity for every member of the global workforce
6. LinkedIn
• World’s largest professional network
• >546M members
• >70% of members reside outside the U.S.
• More than 200 countries and territories worldwide
7. General Data Protection Regulation
• Applies to all companies worldwide that process personal data of EU citizens
• Widens the definition of personal data
• Introduces restrictive data handling principles
• Enforceable from May 25, 2018
8. Handling Personally Identifiable Information (PII)
• Data Minimization – Limit personal data collection, storage, and usage
• Consent – Cannot use collected data for a different purpose
• Retention – Do not hold data longer than necessary
• Deletion – Must delete data upon request
9. Handling PII in Content Ingestion
• Content Ingestion: Babylonia
• Data Protection: Data Minimization, Consent, Retention, Deletion
15. What is Content Ingestion?
Babylonia
• Extracts metadata from web pages
• Source of Truth for 3rd party content
• Also contains metadata for some public 1st party content
• Used by LinkedIn services for sharing, decorating, and embedding content
• Data also feeds into content understanding and relevance models
17. Ingesting 1st party pages containing publicly viewable member PII
• Profile pages
• Published posts
• SlideShare content
18. When a Member Account is Closed
What happens
• Babylonia (along with other systems) is notified that the member’s account is closed
• Other systems take down the member’s content (e.g. public profile page, published posts, etc.)
What Babylonia needs to do
• Remove scraped data relating to the member pages that have been taken down
• Notify downstream systems that might be holding a copy of the data
20. Downstream and Upstream Datasets
(Diagram: 1st party web pages – profile, job, article, publishing – feed an online service backed by an Espresso database; Brooklin data change events carry updates nearline; an ETL job snapshots the data to HDFS offline.)
21. Challenges of member PII in Babylonia
• Need to identify URLs that contain a member’s PII
• My post might contain your PII
• Connection between member and the URL resides in the upstream system
22. Option #1: Require Upstream Systems to Notify Babylonia
Pros
• Simple – Babylonia waits to be told specifically which URLs should be purged
• Babylonia only does extra work when a URL needs to be purged
• Puts responsibility where the knowledge is
Cons
• Requires additional work by every system that exposes PII in publicly accessible web pages
• If the notification is missed, how will Babylonia know?
• 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too
23. Option #2: Actively Refetch Every 1st Party URL
Pros
• Simple logic: page gone? Purge the page.
• Requires little additional work from upstream systems
• Works also for old 1st party URLs
Cons
• There are a lot of 1st party URLs in Babylonia
• Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
• Extra work to avoid false positives or false negatives
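The refetch-and-purge loop described in this option can be sketched in a few lines. This is a hedged illustration, not LinkedIn's implementation; the `fetch_status` callback and the URL store are hypothetical stand-ins.

```python
# Hypothetical sketch of Option #2: poll every known 1st party URL and
# purge the stored data for pages that are gone. Names are illustrative.

def refetch_and_purge(first_party_urls, fetch_status, purge):
    """fetch_status(url) -> HTTP status code; purge(url) removes stored data."""
    purged = []
    for url in first_party_urls:
        status = fetch_status(url)
        # Only a definitive 404 triggers a purge; transient errors (5xx)
        # are skipped to avoid false positives.
        if status == 404:
            purge(url)
            purged.append(url)
    return purged
```

Note how the cons above show up directly: every URL is fetched on every pass, regardless of how few actually need purging.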
24. Option #3: Eliminate Member PII in Babylonia
Pros
• The easiest data to delete is data that isn’t in your system to begin with
• Gets closer to a Single Source of Truth (SSOT) for all 1st party content – better for consistency, not only for compliance
Cons
• Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience
• No substitute currently available
• Difficult to achieve based on URL – can’t always tell by looking at a URL whether it resolves to 1st party content (e.g. shortlinks)
25. Blended Approach
• Option 1 – Having upstream systems notify is best, but might miss some pages
• Option 2 – Active refetch is thorough but expensive. Must be used to catch pages that don’t support notifications
• Option 3 – Some pages won’t work with active refetch, for example pages that still return an HTTP status code 200 even when the data has been removed. These must be blocked
26. Classification of Ingested URLs
• URL → 3rd Party or 1st Party
• 1st Party → Blocked or Whitelisted
• Whitelisted → Actively Refetched or Notified by Upstream
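The decision tree above can be expressed as a small classifier. A hedged sketch only: the host set and whitelist mapping are invented stand-ins, not LinkedIn's actual configuration.

```python
from urllib.parse import urlparse

# Illustrative classifier for the URL taxonomy on this slide.
# first_party_hosts and whitelist are hypothetical configuration.

def classify_url(url, first_party_hosts, whitelist):
    """whitelist maps a 1st party host to its handling mode:
    'refetch' (actively refetched) or 'notify' (upstream notification)."""
    host = urlparse(url).netloc
    if host not in first_party_hosts:
        return "3rd-party"
    mode = whitelist.get(host)
    if mode is None:
        return "1st-party:blocked"    # not whitelisted: refuse to ingest
    return "1st-party:" + mode
```

The "blocked by default" branch reflects the restriction-first stance described later in the deck.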
27. Option 1 – Upstream Notification
• Upstream system sends a Kafka message
• Babylonia consumes the message and purges the data
• Open source – kafka.apache.org
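The consume-and-purge side of this flow can be sketched as below. This is a hedged illustration: plain dicts stand in for Kafka messages, and the topic and field names are invented, not LinkedIn's schema.

```python
# Hypothetical sketch of Option 1's consumer: purge events (dicts standing
# in for Kafka messages on an invented topic) arrive from upstream systems,
# and the ingestion store drops the referenced URL's data.

def handle_purge_events(messages, store):
    """messages: iterable of {'url': ...} events; store: dict url -> metadata."""
    for msg in messages:
        # Idempotent: purging an already-absent URL is a no-op, which makes
        # redelivered messages harmless.
        store.pop(msg["url"], None)
    return store
```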
29. Option 3 – Whitelist
• Block all 1st party URLs that can’t meet minimal requirements
• Mainly, must return a 404 for an invalid or deleted URL
• Ensures new 1st party URLs are onboarded before being ingested
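The minimal technical bar described above can be probed automatically before a source is whitelisted. A sketch under stated assumptions: the probe mechanism and callback are hypothetical.

```python
# Hedged sketch of the whitelisting bar: before whitelisting a 1st party
# URL source, probe it with a known-invalid URL and require a definitive
# 404. A "soft 200" on deleted content fails the check.

def meets_whitelist_bar(fetch_status, probe_invalid_url):
    """fetch_status(url) -> HTTP status; True only if invalid URLs yield 404."""
    return fetch_status(probe_invalid_url) == 404
```

Sources failing this check must fall back to upstream notification or stay blocked, since active refetch cannot distinguish their deleted pages from live ones.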
31. Espresso Datasets
What is Espresso?
• LinkedIn’s distributed NoSQL database
• Data stored in Avro format (JSON)
• Indexed by specific primary key fields
Challenges
• Reference to PII not always in the key
• ETL snapshots of Espresso datasets become offline datasets
32. Offline (HDFS) Datasets
Challenges
• Files of Avro (JSON) records
• Need to read the whole record to see if it has PII
• Files not conducive to removing one record from the middle
• Dataset can be the source for downstream jobs that also need to be purged
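Because a record can't be deleted from the middle of an HDFS file, a purge effectively rewrites the dataset without the offending records. A minimal sketch, with plain dicts standing in for Avro records and a hypothetical field name:

```python
# Sketch of purging from an append-only file: rewrite the dataset keeping
# only records that don't reference a purged member. Every record must be
# read in full, since the PII reference lives in the value, not the key
# or the file layout.

def rewrite_without_purged(records, pii_field, purged_members):
    """records: list of dicts (standing in for one Avro file);
    pii_field: hypothetical tagged column name; purged_members: set of ids."""
    return [r for r in records if r.get(pii_field) not in purged_members]
```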
33. Which datasets contain member PII?
WhereHows – Data Discovery
• Data discovery and lineage tool
• Central location for all schemas
• Documents the meaning of each column
• Traces downstream/upstream lineage of datasets
• Tags every column that can contain a member reference or PII
• Open source – github.com/linkedin/wherehows
34. Dali (Data Access at LinkedIn)
• Interface for accessing datasets
• Combines dataset schema with WhereHows metadata
• Defines an output virtual dataset while preserving data tags
• Supports defining virtual datasets where PII is excluded or obfuscated
(Diagram: Raw Dataset + WhereHows Metadata → Dali Reader)
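The PII-excluding virtual dataset idea can be sketched as a read-time projection. This is an illustration in the spirit of Dali, not its API; the tag vocabulary and schema are invented.

```python
# Hedged sketch of a read-time view: project each record onto the columns
# whose WhereHows-style tag is not 'PII', so readers of the virtual dataset
# never see the tagged fields. Tag names here are hypothetical.

def pii_excluding_view(records, column_tags):
    """column_tags: column name -> tag (e.g. 'PII' or 'public')."""
    allowed = {c for c, tag in column_tags.items() if tag != "PII"}
    return [{k: v for k, v in r.items() if k in allowed} for r in records]
```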
35. Access Control List (ACL)
Only systems that handle PII properly are allowed access
• Controls access to PII data via a known list of authorized systems
• We only approve access to systems that can handle PII properly
• Ensures that member PII can’t leak into untracked systems/datasets
• Acts as a list of downstream services
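The dual role of the ACL – gating reads and enumerating downstreams – can be sketched as follows. All names are hypothetical; this is not LinkedIn's access-control code.

```python
# Illustrative ACL gate: a dataset's PII is readable only by systems on its
# approved list, and the same list doubles as the set of downstream systems
# to notify when data is purged.

def read_pii_dataset(system, dataset, acl, datasets):
    """acl: dataset -> set of authorized system names."""
    if system not in acl.get(dataset, set()):
        raise PermissionError(system + " is not authorized for " + dataset)
    return datasets[dataset]

def downstreams_to_notify(dataset, acl):
    """The ACL itself serves as the purge-notification list."""
    return sorted(acl.get(dataset, set()))
```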
36. Keeping Track of Personal Information in Babylonia
WhereHows
• Field tagging for fields containing PII
• Know where the PII is
Dali
• Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
• Keeps tags with the data as it moves from one dataset to another
ACL
• Controls the spread of PII data to authorized readers only
• Serves as a list of current downstream systems to notify when data is purged
37. Apache Gobblin
• Framework for transforming large datasets
• Data lifecycle management
• Uses WhereHows tags to identify data in our Espresso or offline datasets that needs to be purged
• Open source – gobblin.apache.org
38. WhereHows and Gobblin
Tagging in WhereHows
• Created tags representing ingested content URLs in WhereHows
• Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
39. Compliance Comes First
• Choose an implementation where restriction is the default until proven safe
• Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion
• Simplicity of active refetching helps keep the bar low enough to include most content safely
40. Bigger Picture – Constraints
• Added constraints to the system
• Developer restrictions
• Made certain kinds of things harder to do
41. “Constraints can act as guide rails that point a system where you want it to go.”
– George Fairbanks
42. Bigger Picture – Constraints / Guide Rails
• A constrained system is easier to predict and control
• Make the wrong things harder to do
• Give guidance to all developers on how things are supposed to be done
43. Bigger Picture – Manifest Guide Rails in the Code
• Constraints should manifest in some explicit way
• Counter-example: “No backwards-incompatible schema changes” – hard to tell what developers refrained from doing
• WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate
44. Bigger Picture – Architecture Hoisting
A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.
45. Bigger Picture – Architecture Hoisting
• Make use of the framework to manage PII
• Requires developers to think about PII concerns up front to access the data
• Once set up, developers can focus less on managing PII because the architecture is handling it
• Users of the framework can automatically benefit from future enhancements