
Kubernetes as data platform


Bonnier News is the largest news organisation in Sweden, publishing Dagens Nyheter and Expressen, two of the country’s largest newspapers. When we needed to build a new data processing platform that could accommodate the needs of many different, competing brands, we turned to Openshift and Kubernetes. In this presentation, we will describe the architectural tradeoffs and choices we made, and how we have been able to deploy data flows at a high rate by focusing on technical simplicity.


Kubernetes as data platform

  1. 1. Kubernetes as Data Platform Riga DevOpsDays 2018-09-28 Eric Skoglund, Bonnier News Lars Albertsson, Mimeria 1
  5. 5. Scoping the platform 5 Brand Scope; Data Scope: ➔ Behavioral Data ➔ Technical Data, No Content Data
  6. 6. Cloud Selection 6
  7. 7. Cloud Selection 7 The Pragmatic Choice ➔ Known to people in the dev teams ➔ New base platform for all other applications within Bonnier News
  8. 8. Use Case Driven Development ➔ Use cases drive the development of the platform ➔ Focus on value and quality, not on slurping in all the data in the company ➔ Start with simple use cases! 8
  9. 9. Use Case Driven Development 9 FIND A USE CASE THAT PROVIDES VALUE ➔ BRING NEW DATA INTO THE PLATFORM ➔ EVOLVE THE PLATFORM BASED ON REQUIREMENTS
  10. 10. Data-centric innovation 10 ● Need data from teams ○ willing? ○ backlog? ○ collected? ○ useful? ○ extraction? ○ data governance? ○ history?
  11. 11. A collaboration paradigm 11 Stream storage Data lake Data democratised
  12. 12. Onboard driven by use case 12 Data lake
  13. 13. Data platform == collaboration platform 13 Data lake
  14. 14. Data platform overview 14 Data lake Cold store Service Service Online services Offline data platform Batch processing
  15. 15. Data platform overview 15 Data lake Cold store Dataset Job Service Service Online services Offline data platform Batch processing
  16. 16. Data platform overview 16 Data lake Cold store Dataset Pipeline Service Service Online services Offline data platform Job Batch processing Workflow orchestration
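To make the dataset / job / pipeline vocabulary above concrete: in the Luigi-based setup shown on the later deployment slides, a dataset is a Target, a job is a Task that produces it, and a pipeline is the dependency graph declared with requires(). A minimal sketch under assumed names (RawEvents, DailyPageviews and the /lake/... path layout are illustrative, not from the talk):

```python
import datetime
import json

import luigi


class RawEvents(luigi.ExternalTask):
    """Dataset ingested by an upstream team; we only declare where it lives."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"/lake/raw/events/{self.date}/part-0.json")


class DailyPageviews(luigi.Task):
    """Job: reads one dataset, writes a derived dataset. Luigi skips the job
    if its output already exists, which is what makes reruns and backfills cheap."""
    date = luigi.DateParameter()

    def requires(self):
        return RawEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"/lake/derived/pageviews/{self.date}/counts.json")

    def run(self):
        counts = {}
        with self.input().open("r") as raw:
            for line in raw:
                event = json.loads(line)
                counts[event["page"]] = counts.get(event["page"], 0) + 1
        # Writing through the target is atomic: a temp file is renamed on success.
        with self.output().open("w") as out:
            json.dump(counts, out)


if __name__ == "__main__":
    # The pipeline is the dependency graph rooted at the task we ask for.
    luigi.build([DailyPageviews(date=datetime.date.today())], local_scheduler=True)
```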
  17. 17. Data platform overview 17 Data lake Batch processing Online services Cold store Service Data feature Dataset Pipeline Service Service Online services Offline data platform Internal services Job
  18. 18. Life of a change, batch pipelines 18 ● My pipeline, version 2! ○ Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Easy to test end-to-end ○ Upstream team can do the change ∆?
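One way to get the dual datasets and parallel downstream pipelines described above is to make the pipeline version part of the dataset address. A hedged sketch along those lines (a variation of the hypothetical DailyPageviews job from the earlier sketch); rollback is just pointing the downstream default back at the old version:

```python
import luigi


class DailyPageviews(luigi.Task):
    """Same job as in the earlier sketch, but the logic version is part of the
    dataset path, so version 1 and version 2 outputs coexist during a transition."""
    date = luigi.DateParameter()
    version = luigi.IntParameter(default=2)

    def output(self):
        return luigi.LocalTarget(
            f"/lake/derived/pageviews/v{self.version}/{self.date}/counts.json")

    def run(self):
        with self.output().open("w") as out:
            out.write("{}")  # the version 2 aggregation logic would go here


class DownstreamReport(luigi.Task):
    """Downstream pipeline pinned to an input version; it can run against v1 and v2
    in parallel, and rolling back is a one-line change of the default."""
    date = luigi.DateParameter()
    input_version = luigi.IntParameter(default=2)

    def requires(self):
        return DailyPageviews(date=self.date, version=self.input_version)

    def output(self):
        return luigi.LocalTarget(
            f"/lake/reports/daily/v{self.input_version}/{self.date}.txt")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as out:
            out.write(src.read())
```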
  19. 19. Egress target change 19 ● Need output in different storage! ○ Adding egress target is easy ○ Egress target backfill is easy ● Facilitates cost limitation ○ Partially aggregate → BigQuery / Redshift ○ Limited retention in egress storage
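A sketch of what such an egress job could look like, assuming BigQuery as the target (the slide names BigQuery / Redshift) and the google-cloud-bigquery client; the table name, the GCS path layout, and the use of LocalTarget as a done-marker are stand-ins rather than details from the talk:

```python
import luigi
from google.cloud import bigquery  # assumed client library; Redshift would be analogous


class DailyAdAggregate(luigi.ExternalTask):
    """Partially aggregated dataset produced by the upstream pipeline (hypothetical layout)."""
    date = luigi.DateParameter()

    def output(self):
        # A LocalTarget stands in for a GCS/S3 target to keep the sketch dependency-light.
        return luigi.LocalTarget(f"/lake/aggregates/daily_ad_views/{self.date}/_SUCCESS")


class EgressToBigQuery(luigi.Task):
    """Egress job: load the aggregate into BigQuery. Another egress target is just
    another small task like this; backfilling it means running it for past dates."""
    date = luigi.DateParameter()
    table = luigi.Parameter(default="ads.daily_views")  # hypothetical dataset.table

    def requires(self):
        return DailyAdAggregate(date=self.date)

    def output(self):
        # Marker in the lake recording that egress for this date has been done.
        return luigi.LocalTarget(f"/lake/egress/bigquery/{self.date}.done")

    def run(self):
        client = bigquery.Client()
        config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
        uri = f"gs://lake/aggregates/daily_ad_views/{self.date}/*.parquet"
        client.load_table_from_uri(uri, self.table, job_config=config).result()
        with self.output().open("w") as done:
            done.write("loaded\n")
```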
  20. 20. Life of an error, batch pipelines 20 ● My dataset, bad version! 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
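The "backfill is automatic" step works because Luigi only runs a task whose output dataset is missing. A sketch of a nightly wrapper over a window of days, reusing the hypothetical DailyPageviews job from the overview sketch (the module name below is made up); once the faulty datasets are removed, only those dates are recomputed on the next run:

```python
import datetime

import luigi

from pipelines.pageviews import DailyPageviews  # hypothetical module holding the earlier job


class BackfillPageviews(luigi.WrapperTask):
    """Scheduled every night. Luigi checks which daily outputs already exist and
    only schedules the missing ones, so deleting a faulty dataset is enough to
    trigger a backfill of exactly that date."""
    stop = luigi.DateParameter(default=datetime.date.today())
    days_back = luigi.IntParameter(default=30)

    def requires(self):
        return [DailyPageviews(date=self.stop - datetime.timedelta(days=d))
                for d in range(1, self.days_back + 1)]
```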
  21. 21. Deployment example, on-premise 21 source repo Luigi DSL, jars, config my-pipe-7.tar.gz Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency All that a pipeline needs, installed atomically 10 * * * * luigi --module mymodule MyDaily Standard deployment artifact Standard artifact store
  22. 22. Deployment example, cloud native 22 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS Dataproc / EMR
  23. 23. Deployment, one cluster less 23 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Worker spark-submit --master=local Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS
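Removing the Spark cluster means each batch job runs Spark in local mode inside its own container, sized by the pod's resource requests and limits. A hedged PySpark sketch of such a job (the pageviews logic and /lake paths are hypothetical); this is roughly the kind of program that sits behind a spark-submit --master=local invocation:

```python
import sys

from pyspark.sql import SparkSession


def run_daily_job(date: str) -> None:
    """Run the whole Spark job in local mode inside the pipeline's container,
    so no separate Spark cluster needs to be provisioned, tweaked, or debugged."""
    spark = (SparkSession.builder
             .master("local[*]")              # use the pod's cores; no cluster manager
             .appName(f"daily-pageviews-{date}")
             .getOrCreate())
    try:
        events = spark.read.json(f"/lake/raw/events/{date}/")
        counts = events.groupBy("page").count()
        counts.write.mode("overwrite").parquet(f"/lake/derived/pageviews/{date}/")
    finally:
        spark.stop()


if __name__ == "__main__":
    run_daily_job(sys.argv[1])  # e.g. 2018-09-28
```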
  24. 24. Continuous deployment 24 mono-repo PR build, affected CI tests mymodule/mypipe:revtag Luigi daemon Worker Worker Worker Worker Worker Worker Worker spark-submit --master=local kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Openshift registry S3 master branch pipeline tests doc build
  25. 25. Some pipelines are straightforward 25
  26. 26. Some are twisted 26
  27. 27. Autoscaling 27
  28. 28. GDPR Article 17. “The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:” ➔ the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed - Data Retention ➔ the data subject withdraws consent on which the processing is based - Data Deletion Requests 28
  29. 29. GDPR 29 { id: …. pii: [...] } CREATE KEY FOR ID ENCRYPT PERSONAL DATA WITH KEY
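A minimal sketch of the two steps on this slide (create a key per id, encrypt the personal data with it), using an in-memory dict as a stand-in for a real key store and the cryptography library's Fernet; the talk does not name a specific library or key service:

```python
import json

from cryptography.fernet import Fernet  # assumed crypto primitive, not named in the talk


class PiiEncryptor:
    """Crypto-shredding sketch: one key per subject id; stored events keep the id
    in the clear but carry their personal data only in encrypted form."""

    def __init__(self, key_store: dict):
        self.key_store = key_store  # stand-in for a key management service / database

    def key_for(self, subject_id: str) -> bytes:
        if subject_id not in self.key_store:          # CREATE KEY FOR ID
            self.key_store[subject_id] = Fernet.generate_key()
        return self.key_store[subject_id]

    def encrypt_event(self, event: dict, pii_fields: list) -> dict:
        fernet = Fernet(self.key_for(event["id"]))
        pii = {field: event.pop(field) for field in pii_fields if field in event}
        event["pii"] = fernet.encrypt(json.dumps(pii).encode()).decode()  # ENCRYPT WITH KEY
        return event


# Usage: the stored record ends up shaped like the { id: ..., pii: [...] } example above.
keys = {}
record = PiiEncryptor(keys).encrypt_event(
    {"id": "u123", "email": "someone@example.com", "page": "/sport"}, pii_fields=["email"])
```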
  30. 30. GDPR - Retention 30 { id: …. pii: [...] } CREATE KEY FOR ID ENCRYPT PERSONAL DATA WITH KEY ➔ Each dataset has a retention time from the owners of the data ➔ Create new keys every 30 days ➔ Destroy keys older than the retention time
  31. 31. GDPR - Right to be forgotten 31 List of users that have requested deletion Find keys for those users Destroy keys
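With per-subject keys in place, both retention and the right to be forgotten reduce to destroying keys, after which the encrypted personal data left in the lake is unreadable. A much-simplified sketch against the same in-memory key store as above; a real setup would use a key management service and, as the retention slide describes, keys rotated per 30-day window:

```python
import datetime


def destroy_expired_keys(key_store: dict, key_created_at: dict, retention_days: int) -> None:
    """Retention: destroy keys older than the retention time agreed with the data owners."""
    cutoff = datetime.date.today() - datetime.timedelta(days=retention_days)
    for subject_id, created in list(key_created_at.items()):
        if created < cutoff:
            key_store.pop(subject_id, None)
            del key_created_at[subject_id]


def forget_users(key_store: dict, deletion_requests: list) -> None:
    """Right to be forgotten: find the keys of users who requested deletion, destroy them."""
    for subject_id in deletion_requests:
        key_store.pop(subject_id, None)
```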
  32. 32. Use Cases in Use ➔ Machine Learning ◆ Built a system that tries to predict whether a visitor will watch an ad in a video ➔ Creating Reports ◆ Daily reporting data for ad team ◆ Weekly report of ad viewing data for site team ➔ GDPR Registry Extract ◆ Collect data from multiple different sources ◆ Merge the data ◆ Send data to be viewed by the user 32
  33. 33. Lessons Learned Cloud selection is influenced by data location. Most data for the use cases we started with was in Google Cloud Storage / BigQuery, which incurred extra development time and cost to exfiltrate that data. Kubernetes? Same platform as other teams + great support from the infrastructure platform team. No Spark cluster maintenance, tweaking, or debugging. Autoscaling works, but with some challenges for batch jobs. 33
  34. 34. Summary Use case driven development == Short Time to Production: first pipeline in 3 weeks. Small team: 2-4 people. Keep it simple: 10-15 pipelines. 34
