Successfully reported this slideshow.
Your SlideShare is downloading. ×

Scaling Privacy in a Spark Ecosystem

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 20 Ad

Scaling Privacy in a Spark Ecosystem

Download to read offline

Privacy has become one of the most important critical topics in data today. It is more than how do we ingest and consume data but the important factors about how you protect your customer’s rights while balancing the business need. In our session, we will bring CTO, Privacera, Don Bosco Durai together with Northwestern Mutual to detail an important use case in privacy and then show how to scale Privacy with a focus on the business needs. We will make the ability to scale effortless.

Privacy has become one of the most important critical topics in data today. It is more than how do we ingest and consume data but the important factors about how you protect your customer’s rights while balancing the business need. In our session, we will bring CTO, Privacera, Don Bosco Durai together with Northwestern Mutual to detail an important use case in privacy and then show how to scale Privacy with a focus on the business needs. We will make the ability to scale effortless.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to Scaling Privacy in a Spark Ecosystem (20)

Advertisement

More from Databricks (20)

Advertisement

Scaling Privacy in a Spark Ecosystem

  1. 1. Scaling Privacy with Apache Spark Aaron Colcord Sr. Director Engineering, Northwestern Mutual Don Durai Bosco CTO and Co-Founder, Privacera
  2. 2. Agenda ▪ Our background ▪ Why privacy, security, compliance? ▪ Approaches ▪ Ideal problem solve ▪ Real life meets ideal life
  3. 3. Backgrounds ▪ Building an Enterprise Scale Unified Framework ▪ Very Long, Respected History ~ 160 Years ▪ Compliance is extremely important to us ▪ Agile Data vs Compliant Data ▪ Founded in 2016 by the creators of Apache Ranger & Apache Atlas ▪ Extends Ranger's capabilities beyond traditional Big Data environments to cloud (Databricks, AWS, Azure, GCP, and more) ▪ Specializes in democratizing data for analytics, while ensuring compliance with privacy regulations (GDPR, CCPA, LGPD, HIPAA, & more) • Privacera • Northwestern Mutual
  4. 4. Why do we suddenly care about privacy? • You care if you are regulated in any form • Simple you need to show you can pass an audit • You care if you store any information about your users • Simple because governments have woken up with GDPR and CCPA • You care if you want to democratize your data • Simple because the use of your data can be scrutinized We always did, but technology got ahead of privacy. Privacy is often this assumed competency, and technology really showed how important it was.
  5. 5. Have you ever... • Collecting information about your customers can • Improve the experience • Allow the company to understand their business better • At the core, privacy is a policy and legal obligation • You have the data, it used to be your business to just secure it. • Do you want your information monetized? Sold? Traded? • Most companies don’t do this. But the privacy policy is there for you. • Clicked ‘accept all’ on website, used a digital assistant.. Gone to a website and read their privacy policy, clicked accept cookies, accepted terms of service, or EULA?
  6. 6. And it’s only going to pick up speed. • More Regulations are arriving around privacy • Increasing your ability to execute against data means respecting your user’s rights • A part of maturity is being able to manage governance
  7. 7. More importantly, why do we care so much? • Technology like Apache Spark opens the capability to democratize your data. • Most every company wants the marketplace to enrich and share their data. • Who inside that company can view it? Do we have the controls to protect your information? Can we verify that the information is used for the right purposes?
  8. 8. What is the difference between these? ▪ Preventing unauthorized usage of systems ▪ Ensuring users don’t see the incorrect information ▪ Creating boundaries to enforce right action of the system • The process of making sure your company and employees follow all laws, regulations, standards, and ethical practices that apply to your organization • Compliance • Security • “Data privacy may be defined as the authorized, fair, and legitimate processing of personal information” • Consent rights • Do not share • Slippery space • Privacy
  9. 9. Examine strategies to scale agile data w/privacy • Build a metadata layer that defines PII in its schema • Users and developers can and will change where PII is stored • You can literally chase people to do the ‘right thing’ forever • You could build views with permissions to certain users • Not very scalable • Plus you need to always show who accessed and why • Are these security scenario?
  10. 10. Challenges to that strategy • Is the metadata layer flexible enough or should we think in policies? • Privacy is inherently your organization’s position which may evolve based on regulation • Can your development keep up with views? • When you discover the extra 10,000 fields, can you keep up? • Implement a framework that scales • Security is not Privacy. • Security has a different domain and set of principles. • Remember we are protecting the usage of your data.
  11. 11. How can we solve it?
  12. 12. Ideal scalable system ▪ Revocation of Consent ▪ Portability ▪ Erasure ▪ Rectification ▪ How is data used? ▪ Rights follow Data Reuse ▪ Flexible to change ▪ Should align with a Data Governance program ▪ Should adapt to changing data ▪ Proactive. ▪ Reclassification • Classification • User Rights ▪ How was it used? ▪ How was it accessed? ▪ How was it protected? ▪ Did it cross borders? • Audit/Governance ▪ Authorization of User may change ▪ Supports Agile Access ▪ Business Use is preserved ▪ Automated Systems obey Privacy • Access
  13. 13. User Rights at Scale ▪Revocation of Consent/ Right To Be Forgotten ▪Portability ▪Erasure ▪Rectification ▪How is data used? ▪Rights follow Data Reuse ▪Flexible to change
  14. 14. S3 ADLS Redshift Snowflake Synapse Privacy Challenges in Open Data Ecosystem Athena Databricks HDInsight EMR Dremio Trino PrestoDB PowerBI Tableau Storage SQL Engines Data Virtualization BI Tools Marketing Data Analyst Data Scientist/A rchitect
  15. 15. Governance blind spot
  16. 16. Tools & Technology AUTOMATED DATA DISCOVERY CENTRALIZED ACCESS CONTROL AUDIT COLLECTION AND REPORTING
  17. 17. Automated Data Discovery ● Automatically detect and catalog sensitive data ● Detailed classification, e.g. EMAIL, SSN, GENDER, CC, PHONE_NUMBER, etc. ● Eliminate manual processes ● Catalog data as it is ingested ● Track data movement and propagate tag ● Catalog data across multiple cloud services
  18. 18. Centralized Access Control ● Global Tag/Classification-based policies ● Purpose and Persona based policies ● Dynamic row filters v/s Views ● Dynamic masking or decryption ● Approval workflows with time and purpose constraints
  19. 19. Centralized Auditing and Reporting ● Centralize auditing ● Monitoring data access by classification ● Track usage by Purpose ● Generate attestation reports
  20. 20. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.

×