
DataOps in practice - Swedish style

DataOps requires a cultural shift that brings the principles of lean manufacturing and DevOps to data analytics. It breaks down silos between developers, data scientists, and operators, resulting in rapid cycle times and low error rates.

At Spotify in 2013, the concept of DataOps did not exist, but the Swedish company needed a way to align the people, processes, and technologies of its data organization to accelerate the development of high-quality analytics. The result was a Swedish-style DataOps, influenced by Scandinavian culture and agile principles, that enabled the company to become a true data-driven leader.



  1. DataOps in practice - Swedish style. Lars Albertsson (@lalleal), Scling. www.scling.com
  2. Who’s talking? ... Google - video conference, engineering productivity ... Spotify - data engineering ... Independent data engineering consultant: banks, media, startups, heavy industry, telco. Founder @ Scling - data-value-as-a-service
  3. Contents: Journey to DataOps; Experiences that shaped my data engineering; IMHO principles of successful DataOps; Toolbox
     ● Spotify information is old history
     ● Previously published
     ● Today is very different
  4. Spotify data 2007-2013
     ● Hadoop installed 2007
     ● Use cases: reporting, insights, recommendations
     ● Cultural aspects:
       ○ Autonomous teams
       ○ Eliminate waste
       ○ Learn and adapt
  5. Traditional systems [diagram label: Mutation]
     Early Hadoop:
     ● Weak indexing
     ● No transactions
     ● Weak security
     ● Batch transformations
  6. [Diagram labels: Data lake; Transformation; Cold store; Mutation; Immutable, shareable; Data factories]
     Early Hadoop:
     ● Weak indexing
     ● No transactions
     ● Weak security
     ● Batch transformations
     DataOps workflows:
     ● Immutable, shared data
     ● Resilient to failure
     ● Quick error recovery
     ● Low-risk experiments
  7. What conclusion from this graph? [Graph: COVID-19 fatalities / day in Sweden]
  8. Wrong conclusion, every day
     ● Downward trend every day!
  9. Normalise data collection to compare [Graph by Adam Altmejd, @adamaltmejd]
  10. Normalise data collection to compare [Graph by Adam Altmejd, @adamaltmejd]
  11. Forecast for analytics with fresh data [Graph by Adam Altmejd, @adamaltmejd]
  12. From craft to process
  13. From craft to process: Multiple time windows
  14. From craft to process: Multiple time windows; Assess ingress data quality
  15. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality
  16. From craft to process: Multiple time windows; Assess ingress data quality; Assess outcome data quality; Repair broken data; Intermediate datasets, reusable between pipelines
  17. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Assess outcome data quality
  18. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history; Assess outcome data quality
  19. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality
  20. From craft to process: Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality; Assess forecast success, adapt parameters
  21. Towards sustainable production ML: Multiple models, parameters, features; Assess ingress data quality; Repair broken data from complementary source; Choose model and parameters based on performance and input data; Benchmark models; Try multiple models, measure, A/B test
  22. Risky operations
      "How do I test the pipeline?"
      "You temporarily change the output path and run manually."
      "What if I forget to change path?"
      "Don’t do that."
  23. 2013
      ● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
      ● Folklore development cycle & operations
      ● Unsatisfied needs in other teams
  24. On-prem Hadoop production
      Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
      Crontab on each worker:
        10 * * * * luigi --module mymodule MyDaily
        23 * * * * luigi --module other OtherDaily
      [Diagram labels: luigid; Worker; Worker; Submit job; Master; Executor; HDFS metadata; Data; Control (+data)]
  25. Ghost in the cluster
      ● Jobs were deployed with Debian packages + Puppet on pet machines.
        ○ Multiple pets for redundancy. Race to run job.
      ● "This monitor daemon is at 100%. Since 6 months. I'll kill it."
      ● "Data is wrong. But we fixed this bug 6 months ago?!?"
  26. Start of a DataOps journey
      ● Stateful → Stateless
      ● Pets → Cattle
      ● Folklore → Golden path
      ● Test in prod → Local test, CI/CD
      ● Weeks to learn → New pipeline < 1 day
      ● Days to mend → Bug fix < 1 hour
  27. On-prem pipeline deployment pipeline
      ● source repo: Luigi DSL, jars, config
      ● Standard deployment artifact: my-pipe-7.tar.gz
      ● Standard artifact store
      ● All that a pipeline needs, installed atomically:
          > pip install my-pipe-7.tar.gz
      ● Redundant cron schedule, higher frequency:
          10 * * * * luigi --module mymodule MyDaily
      [Diagram labels: Luigi daemon; Worker (x8)]
  28. Principle: Functional pipelines
      ● Raw source of truth + data refinement factory
      ● Immutable datasets & artifacts
      ● Deterministic, idempotent, reproducible deployment & processing
      ● Key success factor: workflow orchestration
        ○ Oozie, Rambo, Builder, Builder2, Luigi
        ○ Key properties:
          1. Pure Python
          2. Simplicity
          3. All the features it lacks
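The "deterministic, idempotent" property on slide 28 can be illustrated with a minimal sketch. It uses only the Python standard library and is not Luigi's API; the dataset layout and the names `output_path` and `run_daily` are assumptions made for illustration. A job computes its output location from its date parameter, skips work if the dataset already exists, and publishes the result with an atomic rename.

```python
import os
import tempfile
from datetime import date
from pathlib import Path
from typing import Sequence


def output_path(root: Path, day: date) -> Path:
    """Deterministic location: one immutable dataset per date (hypothetical layout)."""
    return root / "daily_report" / f"date={day.isoformat()}" / "part-0.txt"


def run_daily(root: Path, day: date, records: Sequence[str]) -> bool:
    """Produce the dataset for `day`. Return False if it already exists."""
    target = output_path(root, day)
    if target.exists():
        return False  # Idempotent: a completed dataset is never rewritten.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so readers never see
    # a partially written dataset.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(sorted(records)))  # Sorted for deterministic output.
    os.replace(tmp_name, target)
    return True
```

With this shape, triggering the same job again from a redundant cron entry is harmless: the second invocation finds the output and returns without doing anything.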
  29. Big data - a collaboration paradigm [diagram labels: Stream storage; Data lake; Data democratised]
  30. Enabling teams
      ● Technically
        ○ Data available
        ○ Reusable QA
      ● Operationally
        ○ Continuous deployment
        ○ Hands off operations
        ○ Monitoring, debugging
      ● Bottom-up innovation
      "The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
      https://youtu.be/A259Yo8hBRs
      https://youtu.be/ZcmJxli8WS8
  31. Principle: Small scope components
      ● Do one thing well. Less is more.
      ● Complex systems from replaceable bricks
        ○ Cloud/OSS over enterprise vendors
        ○ Simplicity over features
      [Diagram labels: Solvable challenge; ~2000 lines of code; Perpetual complexity]
  32. Cloud native deployment
      ● source repo: Luigi DSL, jars, config
      ● Docker image: my-pipe:7
      ● Docker registry
      ● Redundant cron schedule, higher frequency
          kind: CronJob
          spec:
            schedule: "10 * * * *"
            command: "luigi --module mymodule MyDaily"
      [Diagram labels: Luigi daemon; Worker (x8); S3 / GCS; Dataproc / EMR]
  33. Data platform gravitation
      ● Hadoop all the things.
      ● Data is there. Simple test, simple deploy, simple ops.
      ● Autonomous teams - no mandate. Natural gravity.
  34. Data integration timescales
      Nearline
      ● Stream storage
      ● Asynchronous event processing
      ● 10 ms - 1 hour
      Offline
      ● File storage
      ● Asynchronous batch processing
      ● 1 minute -
      Online
      ● SOA / microservices
      ● Synchronous RPC
      ● 1-100 ms
      [Diagram labels: Job; Stream]
  35. Operational manoeuvres - online
      Upgrade
      ● Careful rollout
      ● Risk of user impact
      ● Proactive QA
  36. Operational manoeuvres - online
      Upgrade
      ● Careful rollout
      ● Risk of user impact
      ● Proactive QA
      Service failure
      ● User impact
      ● Data loss
      ● Cascading outage
  37. Operational manoeuvres - online
      Upgrade
      ● Careful rollout
      ● Risk of user impact
      ● Proactive QA
      Service failure
      ● User impact
      ● Data loss
      ● Cascading outage
      Bug
      ● User impact
      ● Data corruption
      ● Cascading corruption
  38. Operational manoeuvres - offline
      Upgrade
      ● Instant rollout
      ● No user impact
      ● Reactive QA
      Service failure
      ● Pipeline delay
      ● No data loss
      ● No downstream impact
      Bug
      ● Temporary data corruption
      ● Downstream impact
  39. Life of an error, batch pipelines
      ● Faulty job, emits bad data
        1. Revert serving datasets to old
        2. Fix bug
        3. Remove faulty datasets
        4. Backfill is automatic (Luigi)
        Done!
      ● Low cost of error
        ○ Reactive QA
        ○ Production environment sufficient
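Steps 3 and 4 on slide 39 rely on the scheduler treating a missing dataset as work to be done. The following is a rough standard-library sketch of that idea, not Luigi's implementation; the `date=YYYY-MM-DD` directory layout, the `_SUCCESS` marker file, and all function names are assumptions made for illustration.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path
from typing import Callable, Iterable, List


def dataset_dir(root: Path, day: date) -> Path:
    """Hypothetical layout: one directory per daily dataset."""
    return root / f"date={day.isoformat()}"


def missing_days(root: Path, first: date, last: date) -> List[date]:
    """Days in [first, last] that lack a completed dataset."""
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    return [d for d in days if not (dataset_dir(root, d) / "_SUCCESS").exists()]


def backfill(root: Path, first: date, last: date, job: Callable[[date], str]) -> List[date]:
    """Run `job` for every missing day and return the days that were rebuilt."""
    rebuilt = missing_days(root, first, last)
    for day in rebuilt:
        out = dataset_dir(root, day)
        out.mkdir(parents=True, exist_ok=True)
        (out / "data.txt").write_text(job(day))
        (out / "_SUCCESS").touch()  # Marker written last: dataset is complete.
    return rebuilt


def remove_faulty(root: Path, days: Iterable[date]) -> None:
    """Step 3: delete the datasets produced by the buggy job."""
    for day in days:
        shutil.rmtree(dataset_dir(root, day), ignore_errors=True)
```

After the bug fix is deployed, calling `remove_faulty` for the affected days is the only manual action; the next scheduled `backfill` run regenerates exactly those days and leaves the others untouched.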
  40. Production critical upgrade
      ● Dual datasets during transition
      ● Run downstream parallel pipelines
        ○ Cheap
        ○ Low risk
        ○ Easy rollback
      ● Testable end-to-end
      No dev & staging environment needed!
      [Diagram label: ∆?]
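One way to read the "∆?" on slide 40 is as a comparison between the dataset produced by the old pipeline version and the one produced by the new version running in parallel. A small sketch under that assumption follows; the keyed numeric datasets and the name `diff_datasets` are hypothetical.

```python
from typing import Dict, Mapping, Optional, Tuple


def diff_datasets(
    old: Mapping[str, float],
    new: Mapping[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return the keys whose values differ between two dataset versions."""
    delta = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        # A key missing on either side always counts as a difference.
        if a is None or b is None or abs(a - b) > tolerance:
            delta[key] = (a, b)
    return delta
```

An empty result suggests the upgraded pipeline is a drop-in replacement; a non-empty one shows which records to inspect before switching downstream consumers over.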
  41. Operational manoeuvres - nearline
      Upgrade
      ● Swift rollout
      ● Parallel pipelines
      ● User impact, QA?
      Service failure
      ● Pipeline delay
      ● No data loss
      ● Downstream impact?
      Bug
      ● Data corruption
      ● Downstream impact
      [Diagram labels: Job; Stream]
  42. Life of an error, streaming
      Reprocessing in Kafka Streams
      ● Works for a single job, not pipeline. :-(
      [Diagram labels: Job; Stream]
  43. Data processing tradeoff [diagram labels: Data speed; Innovation speed; Online; Nearline; Offline; Job; Stream]
  44. Separating online & offline
      ● Daily user DB dump. Cassandra can handle the load.
        ○ Load spike became 25 h long…
      ● New recommendation model! Cassandra can replicate to all regions.
        ○ Who saturated the Atlantic link?
      ● Batch jobs saturate one resource.
        ○ Bad neighbours.
  45. Batch offline vs online
      [Diagram labels: Orders; Replication / Backup; Raw; Fraud model; Fraud service; Orders; Standard procedures; Lightweight procedures; Standard procedures; Careful handover; Careful handover]
      ● QA driven by internal efficiency
      ● Continuous deployment
      ● New pipeline < 1 day
      ● Upgrade < 1 hour
      ● Bug recovery < 1 hour
  46. Data quality dimensions
      ● Timeliness
        ○ E.g. the customer engagement report was produced at the expected time
      ● Correctness
        ○ The numbers in the reports were calculated correctly
      ● Completeness
        ○ The report includes information on all customers, using all information from the whole time period
      ● Consistency
        ○ The customer summaries are all based on the same time period
  47. Testing single batch job
      Standard Scalatest harness: file://test_input/, Job, file://test_output/
      1. Generate input
      2. Run in local mode
      3. Verify output
      Runs well in CI / from IDE
      [Diagram labels: f(), p()]
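The three steps on slide 47 translate to any language. The slide names a Scalatest harness; the sketch below swaps in plain Python with temporary directories to show the same shape. The job `count_orders_per_user`, its JSON-lines input format, and the file names are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_orders_per_user(input_dir: Path, output_dir: Path) -> None:
    """Hypothetical job under test: JSON-lines orders in, order counts per user out."""
    counts = Counter()
    for path in sorted(input_dir.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            counts[json.loads(line)["user_id"]] += 1
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "part-0.json").write_text(json.dumps(dict(counts), sort_keys=True))


def test_count_orders_per_user() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        test_input = Path(tmp) / "test_input"
        test_output = Path(tmp) / "test_output"
        # 1. Generate input
        test_input.mkdir()
        orders = [
            {"user_id": "u1", "item": "a"},
            {"user_id": "u1", "item": "b"},
            {"user_id": "u2", "item": "a"},
        ]
        (test_input / "orders.jsonl").write_text(
            "\n".join(json.dumps(order) for order in orders)
        )
        # 2. Run the job locally against the file system
        count_orders_per_user(test_input, test_output)
        # 3. Verify output
        result = json.loads((test_output / "part-0.json").read_text())
        assert result == {"u1": 2, "u2": 1}
```

Because the job only sees an input path and an output path, the same code runs unchanged from an IDE, in CI, or against production storage.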
  48. Testing batch pipelines - two options
      ● Standard Scalatest harness: file://test_input/, file://test_output/
        1. Generate input
        2. Run custom multi-job test job with sequence of jobs
        3. Verify output
      ● Customised workflow manager setup
      [Diagram labels: A:; B:; f(); p()]
  49. Monitoring timeliness, examples
      ● Datamon - Spotify internal
      ● Twitter Ambrose (dead?)
      ● Airflow
  50. Measuring correctness: counters
      ● Processing tool (Spark/Hadoop) counters
        ○ Odd code path => bump counter
        ○ System metrics
      [Diagram labels: Hadoop / Spark counters; DB; Standard graphing tools; Standard alerting service]
  51. Measuring correctness: counters
      ● User-defined
      ● Technical from framework
        ○ Execution time
        ○ Memory consumption
        ○ Data volumes
        ○ ...
      Example (C, read and longAccumulator stand for the processing framework's collection type and helpers):
        case class Order(item: ItemId, userId: UserId)
        case class User(id: UserId, country: String)

        val orders = read(orderPath)
        val users = read(userPath)
        val orderNoUserCounter = longAccumulator("order-no-user")

        // Left join: every order is kept, with its user if one exists
        val joined: C[(Order, Option[User])] = orders
          .groupBy(_.userId)
          .leftJoin(users.groupBy(_.id))
          .values

        val orderWithUser: C[(Order, User)] = joined
          .flatMap {
            case (order, Some(user)) => Some((order, user))
            case (order, None) =>
              // Odd code path: count orders without a matching user
              orderNoUserCounter.add(1)
              None
          }
      SQL: Nope
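The counter pattern on slide 51 does not depend on Spark or Scala. As a rough single-process illustration in Python, with a `collections.Counter` standing in for a framework accumulator and the names `Order`, `User`, and `join_orders_with_users` chosen for this sketch:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Order:
    item: str
    user_id: str


@dataclass(frozen=True)
class User:
    id: str
    country: str


def join_orders_with_users(
    orders: Iterable[Order],
    users: Iterable[User],
    counters: Counter,
) -> List[Tuple[Order, User]]:
    """Join orders to users and count the orders that have no matching user."""
    users_by_id = {user.id: user for user in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order.user_id)
        if user is None:
            counters["order-no-user"] += 1  # Odd code path => bump counter.
            continue
        joined.append((order, user))
    return joined
```

The counter value can then be exported to a metrics store after the job finishes and alerted on when it deviates from its usual level.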
  52. Data quality - high code vs low code
      ● 2013: Python MapReduce outdated
      ● Hive/SQL?
        ○ Not expressive enough
        ○ Data quality challenging
      ● Technical platform + multi-skilled teams!
        ○ Strong development processes
      [Diagram labels: Low code / no code platform?; Technical platform?]
  53. Measuring consistency: pipelines
      ● Processing tool (Spark/Hadoop) counters
        ○ Odd code path => bump counter
        ○ System metrics
      ● Dedicated quality assessment pipelines
      [Diagram labels: DB; Quality assessment job; Quality metadataset (tiny); Standard graphing tools; Standard alerting service]
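A dedicated quality assessment job, as on slide 53, reads a dataset and emits a tiny metadataset that graphing and alerting tools can consume. The sketch below is one possible shape, assuming records with `user_id` and `date` fields; the metric definitions loosely follow the completeness and consistency dimensions from slide 46, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Mapping, Set


def assess_quality(
    records: Iterable[Mapping[str, str]],
    expected_users: Set[str],
    expected_date: str,
) -> Dict[str, float]:
    """Summarise one dataset as a handful of quality metrics."""
    rows = list(records)
    seen_users = {row["user_id"] for row in rows}
    rows_on_date = sum(1 for row in rows if row["date"] == expected_date)
    return {
        "record_count": float(len(rows)),
        # Completeness: share of expected users that appear in the dataset.
        "completeness": (
            len(seen_users & expected_users) / len(expected_users)
            if expected_users else 1.0
        ),
        # Consistency: share of records that belong to the expected period.
        "consistency": rows_on_date / len(rows) if rows else 1.0,
    }
```

Storing one such row per dataset per day keeps the quality history small enough to load into any standard dashboard.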
  54. Machine learning operations, simplified
      ● Multiple trained models
        ○ Select at run time
      ● Measure user behaviour
        ○ E.g. session length, engagement, funnel
      ● Ready to revert to
        ○ old models
        ○ simpler models
      "The required surrounding infrastructure is vast and complex." - Google
      [Diagram labels: Measure interactions; Rendezvous; DB; Standard alerting service; Stream; Job]
  55. Not all things went well
      ● Autonomy → excessive heterogeneity
        ○ 25 ways to store a timestamp?
      ● Pipeline end-to-end tests
        ○ Culturally challenging
        ○ → difficult to change & retire pipelines
      ● Trial and error to learn
  56. Data engineering in Scandinavia
      ● Stockholm region ranks 2nd in unicorns / capita
        ○ Media, games, fintech
      ● Critical mass of world class data engineering
        ○ Limited to a few companies
  57. Mission: Spread data & AI superpowers
      ● There are companies to help
      ● Data & AI capabilities require culture & process change
        ○ Slow, very slow
  58. Scandinavian minimalist design
      ● Lean, simple technology - focus on flow and business value
      ● Bonnier News data platform, 4-5 persons:
        ○ Zero to happy customer in 3 weeks.
        ○ Dozens of ROI pipelines in 8 months.
      ● Scling retail client, 1-3 persons, after 1 year:
        ○ 40 sources, 70 pipelines, 200 egress points
        ○ 3,400 datasets / day
      ● Typical enterprise numbers
        ○ Big data project: 6-24 months
        ○ Analytics department: 100-1000 datasets / day
        ○ Spotify: 100,000+ datasets / day
        ○ Google: 1.6B datasets / day (2016)
  59. Scling - data-value-as-a-service
      Data value through collaboration
      [Diagram labels: Customer; Data factory; Data platform & lake; data domain expertise; Value from data!]
      www.scling.com/reading-list
      www.scling.com/presentations
      www.scling.com/courses
