
Multi Tenanted Streams @Workday - Enrico Agnoli & Leire Fernandez de Retana Roitegui, Workday

At WORKDAY Inc., the #1 company on Fortune's 2018 Future 50 list (link: https://fortune.com/future-50/2018/workday/), we process data for a community of more than 39 million workers, including 40 percent of Fortune 500 organizations. Our success is driven by the trust our customers place in us, and we earn that confidence with strict security regulations. These demand that we always encrypt customer data at rest and in transit: each piece of data must always be stored encrypted with the customer's key.

This is a challenge in a Data Streaming platform like Flink, where data may be persisted in multiple phases:

• Storage of states in checkpoints or savepoints
• Temporary filesystem storage for time-window aggregation
• Common spilling to disk when the heap is full

On top of that, we need to consider that in a Flink dataflow the data might get manipulated, and we need to maintain the context required to correctly encrypt it.

Come join us to see how we solved these challenges to provide a secure platform supporting our Machine Learning organization, how we extended the AVRO libraries to enable encryption at serialization, and how we support data traceability for GDPR.


1. Tenant-based encryption in Flink: an introduction to the challenges of tenanted data-streaming in WORKDAY
2. Enrico Agnoli, Sr. Software Engineer, Machine Learning Foundation. Leire Fernandez de Retana, Sr. Software Engineer, Machine Learning Foundation. Workday Inc.
3. Abstract: At WORKDAY Inc. we process data for thousands of customers, and our strict security regulations demand we always encrypt customer data at rest and in transit. That means each piece of data must always be stored encrypted with the customer's key. This is a challenge in a data-streaming platform like Flink, where data may be persisted in multiple phases:
• Storage of states in checkpoints or savepoints
• Temporary fs storage for time-window aggregation
• Common spilling to disk when the heap is full
On top of that, we need to consider that in a Flink dataflow the data might get manipulated; after the manipulation we need to maintain the context needed to correctly encrypt it. We solved this challenge by extending the serialization libraries (AVRO) to enable encryption at serialization. In this talk we walk through the complexity of runtime encryption in a multi-tenant data-streaming world, how we solved it, and how we support data traceability for GDPR.
4. Agenda
- Introduction to the business case
- Flink @ Workday
- Our special requirements
- Data streaming in a multi-tenancy world
- Solution overview
- Summary
- Other challenges
5. Introduction to the business case: what we do
6. About Workday: a leading provider of enterprise cloud applications, Workday delivers financial management, human capital management, planning, and analytics applications designed for the world's largest organizations.
• Founded in 2005
• Over 11k employees; a NASDAQ-listed company
• Development headquarters in California and Ireland
• Over 100 billion transactions processed with Workday in FY19, a 50% increase YoY
• A community of 40 million workers, including 40 percent of Fortune 500 organizations
• Awards:
 ‒ #1 2018 Future 50, Fortune: best prospects for long-term growth
 ‒ #4 2019 100 Best Companies to Work For, Fortune
 ‒ #2 2019 50 Best Workplaces in Technology, Fortune
 ‒ …
7. The leading enterprise cloud for finance and HR: 40 million+ workers; 100 billion+ transactions per year; 96.1% of transactions < 1 second; 99.9% actual availability; ~200 companies; #1 Future 50, Fortune; #2 40 Best Workplaces in Technology, Fortune; 10 thousand+ certified resources in the ecosystem
8. Flink @Workday: how we do data streaming
9. Why does Workday need Flink?
- Workday's success is connected to a monolithic service
- Expansion brought more microservices and external platforms; data processing can happen outside TS
- Machine Learning is expensive and can't be done individually for each customer; it needs a unique, separate system
- Flink allows us to quickly develop and deploy logic to correlate/aggregate data from and to multiple services
10. Infrastructure - how we deploy
- OpenStack + K8s, 1 job per cluster
- Tooling to deploy jobs on Workday's OpenStack platform
- Plugs into Workday's metrics, logs and stats infrastructures
- Accelerates job development by providing libraries to integrate the internal stack
- Jobs for ML and data transfer
- Currently supporting 6 different Flink jobs across 6 datacenters, ingesting hundreds of millions of messages per week
11. Jobs
- DataPipelines
 - To thousands of S3 buckets
 - To thousands of HDFS buckets
 - Enriching data with async calls
 - Mainly to do offline/online model training
- Anomalies Detection
- Generic Inference Services
 - Do inference on data using the models created offline
 - E.g. help financial transaction bookkeeping
12. Our special requirements: introducing the main requirements at the base of this work
13. Multi-tenancy. From the Gartner glossary definition of multitenancy: 1 tenant ≃ 1 customer
14. Security requirement: at WORKDAY Inc. we process data for thousands of customers, and our strict security regulations demand we always encrypt customer data at rest and in transit.
15. Data streaming in a multi-tenancy world: given the requirements above, how the architecture is impacted
16. Common Flink architecture [diagram: Event production → Bus → Data Streaming Platform → DataLake]
17. This translates to these internal (de)serializations
18. Flink architecture in WORKDAY [diagram: Event production → Bus → Data Streaming Platform → DataLake]
19. Where we need to look: possible issues
20. Solution overview: a high-level diagram/explanation of the solution
21. Initial attempt - wrapping the message. We started by wrapping the data in a "container" where all interaction with the real data is controlled by logic (get/setPayload) that encrypts/decrypts it as needed. However, this is not enough: key to streaming is the ability to filter, key, and map data, and a simple wrapper doesn't allow that (see the sketch below).
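The slides don't show the wrapper's code; as a rough sketch of the idea (all names here are hypothetical, not Workday's actual classes):

    // Hypothetical sketch of the "container" approach described above.
    // Every interaction with the real data goes through getPayload()/setPayload(),
    // which decrypt/encrypt on the fly; only the ciphertext is ever persisted.
    public abstract class EncryptedContainer<T> {
        private final String tenantId;
        private byte[] ciphertext; // the only state that ever reaches disk

        protected EncryptedContainer(String tenantId) {
            this.tenantId = tenantId;
        }

        public void setPayload(T payload) {
            ciphertext = encrypt(tenantId, serialize(payload));
        }

        public T getPayload() {
            return deserialize(decrypt(tenantId, ciphertext));
        }

        // Tenant-keyed crypto and (de)serialization are left abstract in this
        // sketch; a real version would plug in a key-management service here.
        protected abstract byte[] encrypt(String tenantId, byte[] plaintext);
        protected abstract byte[] decrypt(String tenantId, byte[] ciphertext);
        protected abstract byte[] serialize(T payload);
        protected abstract T deserialize(byte[] bytes);
    }

This makes the shortcoming visible: keyBy(), filter() and map() need the plain fields, so every operator would first have to unwrap the container, which defeats the point of a transparent wrapper.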
22. Solution: encryption at serialization. The solution is to handle encryption at serialization, so that any time an object is serialized by the platform (Flink), encryption is applied transparently. Two options: extend Flink, or extend AVRO.
23. Encryption via AVRO
24. Overview of the implementation. Three main parts:
- A serialization library that delegates the encryption work to the object and is executed transparently by Flink
- A common library where we define the encryption logic (a hedged sketch of what it could look like follows below)
- A process that allows us to mark objects as encryptable (sensitive), so that the new serialization kicks in
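The talk doesn't show the common library's internals; a plausible minimal version, assuming AES-GCM and a pluggable per-tenant key lookup (all names hypothetical), might look like:

    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import java.nio.ByteBuffer;
    import java.security.SecureRandom;

    // Hypothetical sketch of the shared encryption logic.
    public final class TenantCrypto {

        /** Resolves the key for a tenant, e.g. from a key-management service. */
        public interface TenantKeyProvider {
            SecretKey keyFor(String tenantId) throws Exception;
        }

        private static final int IV_BYTES = 12;  // recommended IV size for GCM
        private static final int TAG_BITS = 128; // GCM authentication tag length

        private final TenantKeyProvider keys;
        private final SecureRandom random = new SecureRandom();

        public TenantCrypto(TenantKeyProvider keys) {
            this.keys = keys;
        }

        /** Encrypts record bytes with the tenant's key; output = IV || ciphertext. */
        public byte[] encrypt(String tenantId, byte[] plaintext) throws Exception {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, keys.keyFor(tenantId),
                    new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = cipher.doFinal(plaintext);
            return ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array();
        }

        /** Reverses encrypt(): splits off the IV, then decrypts with the tenant's key. */
        public byte[] decrypt(String tenantId, byte[] ivAndCiphertext) throws Exception {
            ByteBuffer buf = ByteBuffer.wrap(ivAndCiphertext);
            byte[] iv = new byte[IV_BYTES];
            buf.get(iv);
            byte[] ct = new byte[buf.remaining()];
            buf.get(ct);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, keys.keyFor(tenantId),
                    new GCMParameterSpec(TAG_BITS, iv));
            return cipher.doFinal(ct);
        }
    }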
25. Overview of the implementation (diagram)
26. Serialization library: how we delegate some work to the object itself...
27. AVRO - new interface: SerializeFinalizationDelegate.java

    package org.apache.avro;

    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.Encoder;

    import java.io.ByteArrayOutputStream;

    public interface SerializeFinalizationDelegate {

        // Called after the standard AVRO serialization: the object receives its
        // own serialized bytes and is responsible for writing them (e.g.
        // encrypted) through the final encoder.
        void afterSerialization(ByteArrayOutputStream serializedData, Encoder finalEncoder);

        // Called before the standard AVRO deserialization: the object can wrap
        // or replace the decoder (e.g. to decrypt the bytes first).
        Decoder beforeDeserialization(Decoder dataToDecode);
    }
28. AVRO - two hooks in the AVRO writer/reader

GenericDatumWriter.java:

    public void write(D datum, Encoder out) throws IOException {
        // Check if we should delegate the after-serialization step
        if (datum instanceof SerializeFinalizationDelegate) {
            // Create a new encoder so the standard serialization writes to a side
            // buffer instead of directly to the output stream behind 'out'
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Encoder newEncoder = EncoderFactory.get().binaryEncoder(baos, null);
            // Run the standard serialization
            write(root, datum, newEncoder);
            newEncoder.flush();
            // Now delegate the finalization (e.g. encryption) to the object itself
            SerializeFinalizationDelegate delegate = (SerializeFinalizationDelegate) datum;
            delegate.afterSerialization(baos, out);
        } else {
            write(root, datum, out);
        }
    }

GenericDatumReader.java:

    public D read(D reuse, Decoder in) throws IOException {
        try {
            // Instantiate the target class so we can ask it whether it wants to
            // pre-process the incoming bytes (e.g. decrypt them)
            Class<?> clazz = Class.forName(actual.getFullName());
            Constructor<?> constructor = clazz.getConstructor();
            if (reuse == null) {
                reuse = (D) constructor.newInstance();
            }
            if (reuse instanceof SerializeFinalizationDelegate) {
                SerializeFinalizationDelegate delegate = (SerializeFinalizationDelegate) reuse;
                in = delegate.beforeDeserialization(in);
            }
        } catch (InstantiationException | InvocationTargetException
                | NoSuchMethodException | IllegalAccessException e) {
            LOG.debug("Not possible to instantiate an object of the class.");
        } catch (ClassNotFoundException e) {
            LOG.debug("The class can't be found in the classLoader, skipping...");
        }
        // Standard AVRO read path over the (possibly decrypted) decoder
        ResolvingDecoder resolver = getResolver(actual, expected);
        resolver.configure(in);
        D result = (D) read(reuse, expected, resolver);
        resolver.drain();
        return result;
    }
29. Logic to encrypt data: how the delegation can be used to encrypt the object
30. AVRO - details - implementation example: SerializeWithTenantEncryption.java (shown as an image on the slide)
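Since the class itself is not in the transcript, here is a hedged reconstruction of what it could look like, reusing the SerializeFinalizationDelegate hooks from slide 27 and the hypothetical TenantCrypto sketch above. The plaintext tenant-id prefix is our assumption, needed so the reader can pick the right key before decrypting:

    import org.apache.avro.SerializeFinalizationDelegate;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.Encoder;

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;

    // Hypothetical reconstruction, not the actual Workday source. It wires the
    // two AVRO hooks to tenant-keyed encryption and carries the tenant context
    // (__getTenant/__setTenant) that the modified templates propagate.
    public abstract class SerializeWithTenantEncryption implements SerializeFinalizationDelegate {

        // Shared encryption logic (see the TenantCrypto sketch above); how it
        // gets wired in at job start is elided here.
        protected static TenantCrypto crypto;

        private String tenant; // tenant context, pushed down on field access

        public String __getTenant() { return tenant; }

        public void __setTenant(String tenant) { this.tenant = tenant; }

        @Override
        public void afterSerialization(ByteArrayOutputStream serializedData, Encoder finalEncoder) {
            try {
                // Assumption: the tenant id travels in the clear so the reader
                // can look up the key; only the record bytes are encrypted.
                finalEncoder.writeString(tenant);
                finalEncoder.writeBytes(crypto.encrypt(tenant, serializedData.toByteArray()));
                finalEncoder.flush();
            } catch (Exception e) {
                throw new RuntimeException("Tenant encryption failed", e);
            }
        }

        @Override
        public Decoder beforeDeserialization(Decoder dataToDecode) {
            try {
                // Read the tenant id and ciphertext, decrypt, and hand AVRO a
                // decoder over the plaintext so the standard read path continues.
                tenant = dataToDecode.readString();
                ByteBuffer buf = dataToDecode.readBytes(null);
                byte[] encrypted = new byte[buf.remaining()];
                buf.get(encrypted);
                return DecoderFactory.get().binaryDecoder(crypto.decrypt(tenant, encrypted), null);
            } catch (Exception e) {
                throw new RuntimeException("Tenant decryption failed", e);
            }
        }
    }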
31. Definition of tenanted data: how we generate the Java classes used on the platform
32. Use the schemas. A pipeline with a Gradle project builds our POJOs out of .avsc files (https://github.com/davidmc24/gradle-avro-plugin). Then in the Flink project we use these classes as a dependency. DONE! Let's look at the template...

    task generateAvroTenanted(type: com.commercehub.gradle.plugin.avro.GenerateAvroJavaTask) {
        // tenanted schemas
        source("${rootDir}/basicTypes", generatedAvroTenantedSchemas)
        outputDir = generatedJavaTenantedSources
        templateDirectory = "$rootDir/avro_compiler_tenanted_templates/"
    }
33. Modify the AVRO templates. We modified the standard template at avro/compiler/specific/templates/java/classic/record.vm so that:
- the generated class is "tenanted"
- if a piece of the class is extracted, the tenant context is passed along

    ...
    public class ${this.mangle($schema.getName())}#if ($schema.isError()) extends org.apache.avro.workday.specific.SpecificExceptionBase#else extends SerializeWithTenantEncryption#end implements org.apache.avro.workday.specific.SpecificRecord {
    ...
        public ${this.javaType($field.schema())} ${this.generateGetMethod($schema, $field)}() {
            ${this.javaType($field.schema())} local_value = ${this.mangle($field.name(), $schema.isError())};
            // If the extracted field is itself tenanted, push the tenant context down
            if (SerializeWithTenantEncryption.class.isInstance(local_value)) {
                SerializeWithTenantEncryption.class.cast(local_value).__setTenant(this.__getTenant());
            }
            return local_value;
        }
34. Summary
35. How the flow works
• A schema is defined and a Java class is generated from it
• Data is produced using the schema class
 ‒ AVRO serializes the info
 ‒ Then finds it is a tenanted message
 ‒ Delegates the encryption to the message
 ‒ The bytes are sent
• Flink receives the message; AVRO sees it is tenanted info
 ‒ Delegates the decryption to the object
 ‒ Then deserializes
 ‒ If a piece of data is extracted, the context is pushed down
• All done transparently at every serialization (an illustrative round trip follows below)
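To make the transparency concrete, here is an illustrative round trip with the patched AVRO. WorkerEvent is a placeholder for any class generated from a tenanted schema (it extends SerializeWithTenantEncryption); it is not a real class from the talk:

    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;
    import org.apache.avro.specific.SpecificDatumWriter;

    import java.io.ByteArrayOutputStream;

    // Note there is no explicit crypto call anywhere below: the patched
    // writer/reader hooks encrypt on write and decrypt on read.
    public class TenantedRoundTrip {
        public static void main(String[] args) throws Exception {
            WorkerEvent event = new WorkerEvent();
            event.__setTenant("acme-corp"); // tenant context set once at the source

            // Standard AVRO serialization: ciphertext comes out of the hook
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            DatumWriter<WorkerEvent> writer = new SpecificDatumWriter<>(WorkerEvent.class);
            writer.write(event, encoder);
            encoder.flush();

            // Standard AVRO deserialization: the patched read() instantiates the
            // class, lets it decrypt the bytes, then deserializes the plaintext
            DatumReader<WorkerEvent> reader = new SpecificDatumReader<>(WorkerEvent.class);
            WorkerEvent back = reader.read(null,
                    DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
            System.out.println("round-tripped tenant: " + back.__getTenant());
        }
    }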
36. Other challenges
- OpenStack cluster management
- Need to interact with the Flink API: changes are painful
- The Flink UI can't be used, as we can't access PROD endpoints
- Anomalies Detection job with 120GB+ of state
- S3 writing to thousands of folders blocks the sink
- Tests/performance runs against AWS services can be expensive
- End-to-end tests are complex because of:
 - Production-like data distributions
 - Encryption logic
- Flink ParameterTool flexibility: we read some properties from the local FS, but these differ between TM and JM
37. Thank you! Any questions? enrico.agnoli@workday.com, leire.retana@workday.com
