Data democratised

New times, new hype. Buzzwords like big data and Hadoop have given way to AI and machine learning. But neither technology, old or new, nor machine learning is what separates companies that get value from data from those that struggle.

When big data was at its peak, several young, technology-intensive companies adopted it successfully. They acquired large Hadoop clusters, learned to master data, and created valuable products with machine learning. At traditional companies, however, big data has had limited impact, and many of them have run long and expensive data lake and Hadoop projects.

The key to implementing successful projects that transform data into business value is to democratise data - making it accessible and easy to use within an organisation.



  1. Data democratised. Next data analytics & protection, 2019-12-11. Lars Albertsson (@lalleal), Scling. www.scling.com
  2. Big data adoption
     ● 2003-2007: Only Google
     ● 2007-2014: Hadoop era (Europe). Highly technical companies succeed and disrupt.
     ● 2015-2019: Enterprise adoption (Europe). Big data gone from Gartner hype cycle. “New normal”
     ● 2019: Many enterprises in production, but big data and machine learning ROI still confined to high-tech.
  3. Data value efficiency gap, aka disrupted or disruptor. Figure captions: Early Spotify recommendations; Creator of Luigi, Annoy
  4. Efficiency gap, latency
     ● We just took a machine learning pipeline in production after 8 months. Great success! Scandinavian retail (pycon.se, 2019)
     ● Document similarity pipeline finally in production. Estimated 3 months, took 8 months. Scandinavian telecom (NDSML Summit 2019)
     ● 2016: Data platform approval. 2018: Pipeline in production. Dutch bank (Dataworks Summit 2018)
     ● Bonnier News (Riga DevOpsDays 2018)
     ● Platform + 1st pipeline in production. Seven weeks, 1 person.
     ● Scandinavian retail 2018
     ● New pipeline: < 1 day. Mend pipeline: < 1 hour.
     ● Spotify DataOps transform, 2013
     ● Platform + 1st pipeline in production. Three weeks, 4 persons. 20 pipelines in 8 months.
  5. Efficiency gap, data cost & value
     ● Data processing produces datasets
     ● Each dataset has business value
       ○ Financial, sales, forecasting reports
       ○ A/B test, auto completion, insights
       ○ Recommendations, fraud
     ● Proxy metric: datasets / day
       ○ S-M traditional: < 10
       ○ Bank, telecom, media: 10-1000
       ○ Spotify: 20 000 datasets / day (2016); 100B events collected / day (2017)
       ○ Google: 1 600 000 000 datasets / day (2016)
  6. Data efficiency key factors
     Data democratisation
     ● Making data available, usable, accessible
     DataOps
     ● Short path from idea to production
     ● Cross-functional teams
       ○ Data engineering, domain experts, product, (data science)
       ○ Aligned with value, not function
     ● Low cost of failure
       ○ Machine and human failure
       ○ Risks ok → move fast
     ● Engineered operations
  7. Service-oriented organisations
     ● Teams own services
     ● Teams own data
  8. Data-centric innovation
     ● Need data from teams
       ○ willing? ○ backlog? ○ collected? ○ useful? ○ quality? ○ extraction? ○ data governance? ○ history?
  9. Data-centric innovation
     ● Need data from teams
       ○ willing? ○ backlog? ○ collected? ○ useful? ○ quality? ○ extraction? ○ data governance? ○ history?
     ● Innovation friction (figure legend: Value adding, Waste)
  10. Centralising data (diagram label: Data lake)
  11. More data - decreased friction (diagram labels: Data lake, Stream storage)
  12. Hadoop is dead?
  13. Traditional systems (diagram label: Mutation)
  14. Data pipelines at a glance. Diagram labels: Data lake; Transformation; Cold store; Mutation; Immutable, shareable
  15. Data pipelines at a glance. Diagram labels: Data lake; Transformation; Cold store; Mutation; Immutable, shareable
     Early Hadoop:
     ● Weak indexing
     ● No transactions
     ● Weak security
     ● Batch transformations
     DataOps workflows:
     ● Immutable, shared data
     ● Resilient to failure
     ● Quick error recovery
     ● Low-risk experiments
  16. Late Hadoop adoption (diagram label: Mutation)
     ● “Can you please implement mutability, transactions, SQL, etc? We would like to keep our workflows.”
     ● “Anything, as long as you are buying.”
     DataOps workflows:
     ● Immutable, shared data
     ● Resilient to failure
     ● Quick error recovery
     ● Low-risk experiments
  17. Complex business logic - MDM @ Spotify ~2014
     ● 10 pipelines like this
     ● Pipeline dev environment
     ● Pipeline continuous deployment infrastructure
     One team of five engineers
  18. Data value = data + domain expertise + data practices
     ● Disrupt? https://xkcd.com/1831/ + 1000s of failures...
  19. Data value = data + domain expertise + data practices
     ● Disrupt? https://xkcd.com/1831/ + 1000s of failures...
     ● Adapt?
  20. Data value = data + domain expertise + data practices
     ● Disrupt? https://xkcd.com/1831/ + 1000s of failures...
     ● Adapt?
     ● Collaborate? Data-value-as-a-service
     Diagram labels: Data lake; Stream storage; Client data + domain expertise; Practices from data leaders
  21. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
  22. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
  23. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
     ● Closed code ownership versus code read+write access
  24. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
     ● Closed code ownership versus code read+write access
     ● Local rituals versus coordinated data governance
  25. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
     ● Closed code ownership versus code read+write access
     ● Local rituals versus coordinated data governance
     ● Tribal knowledge versus common glossary, semantics
  26. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
     ● Closed code ownership versus code read+write access
     ● Local rituals versus coordinated data governance
     ● Tribal knowledge versus common glossary, semantics
     ● Unclear data origin versus common data provenance
  27. Factors of democratisation. Axes: siloed to shared, organic to coordinated.
     ● Distributed storage versus homogeneous storage
     ● Need-to-know basis versus documentation read+write access
     ● Closed code ownership versus code read+write access
     ● Local rituals versus coordinated data governance
     ● Tribal knowledge versus common glossary, semantics
     ● Lay-on-hands deployment versus common DataOps procedures
     ● Unclear data origin versus common data provenance
  28. An e-shopping tale
     What happened:
     1. Log in, search for product X ○ X + 100s of accessories, random order
     2. Find X in product catalog ○ No link to web shop
     3. Put in cart, delivery? ○ Ask for address, customer club number
     4. …
     What should have happened:
     1. Log in, search for product X ○ Popular items first
     2. Find X in product catalog ○ Take me to shop
     3. Put in cart, delivery? ○ I am logged in
     4. ...
     Full story: “Avoid artificial stupidity” blog post
  29. Document a clean architecture
     ● Include minimal governance, security, privacy
     Diagram labels: Data lake; Transformation; Cold store; Mutation; Immutable, shareable
  30. A lean start
     ● Align team with use case
       ○ Zero budget
     ● Ingest only necessary data
     ● Key technical component: Workflow orchestrator (Luigi / Airflow)
  31. An MVP is minimal
     ● In scope: Minimal privacy (limiting access); Security; One DB source; One use case
     ● Out of scope: Data scalability; High availability; Durability; Most privacy; Self service; Data quality; Automation; Clusters; Auditability; Scalable BI; Fill lake; Real-time; Lineage
  32. Journey towards data value
     ● Remove complexity wherever possible
       ○ Unfamiliar tools may be less complex
     ● Pay attention to human and social factors
     “Five dysfunctions of a data engineering team” - Jesse Anderson
     ● Only database admins
     ● Set up for failure
     ● No one understands schema
     ● No veterans
     ● Too ambitious
     “Avoiding big data antipatterns” - Alex Holmes
     ● Big data tech for small data
     ● Point-to-point data integration
     ● Single tool for the job
     ● Excess volume or precision
     ● Lack of security
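
The "immutable, shareable" datasets and the "quick error recovery" item on slides 14 and 15 can be illustrated with a small sketch. It is not taken from the talk: the directory layout (`orders/date=.../part.jsonl`), the function names, and the use of JSON lines are all assumptions chosen for illustration. The idea is that a job never overwrites a published dataset, publishes its output with a rename so readers never see a half-written file, and can be recovered by deleting a bad output and running the job again.

```python
import json
import os
import tempfile
from pathlib import Path


def write_dataset(path: Path, records: list) -> bool:
    """Publish records at path once; return False if the dataset already exists."""
    if path.exists():
        return False  # immutable: a published dataset is never overwritten
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place, so a crash mid-write does not leave a partial dataset at `path`.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for record in records:
            tmp.write(json.dumps(record) + "\n")
    os.replace(tmp_name, path)
    return True


def read_dataset(path: Path) -> list:
    """Read a JSON-lines dataset back into a list of dicts."""
    with path.open() as f:
        return [json.loads(line) for line in f]


def daily_order_totals(lake: Path, date: str) -> Path:
    """Hypothetical job: raw orders for one day -> total amount per customer."""
    source = lake / "orders" / f"date={date}" / "part.jsonl"
    target = lake / "order_totals" / f"date={date}" / "part.jsonl"
    if target.exists():
        return target  # rerunning a finished job is a no-op
    totals = {}
    for order in read_dataset(source):
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0) + order["amount"]
    rows = [{"customer": c, "total": t} for c, t in sorted(totals.items())]
    write_dataset(target, rows)
    return target
```

Because the job only reads immutable input and writes a fresh output, an experiment or a bug fix amounts to removing the derived dataset and rerunning, which is one way to read the "low-risk experiments" item on the same slides.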

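Slide 30 names a workflow orchestrator (Luigi / Airflow) as the key technical component. The sketch below shows the core idea in plain Python so that it runs without either library; it deliberately does not use the Luigi or Airflow APIs, and the `Task`, `build`, and `demo` names are hypothetical. A task counts as complete when its output file exists, and building a task first builds whatever upstream outputs are missing.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class Task:
    """A unit of work that is complete once its output file exists."""
    name: str
    output: Path
    run: Callable[[], None]
    requires: List["Task"] = field(default_factory=list)

    def complete(self) -> bool:
        return self.output.exists()


def build(task: Task, log: List[str]) -> None:
    """Build missing upstream tasks depth-first, then the task itself.

    No cycle detection or parallelism: this is a sketch, not a scheduler.
    """
    if task.complete():
        return
    for upstream in task.requires:
        build(upstream, log)
    task.run()
    if not task.complete():
        raise RuntimeError(f"{task.name} did not produce {task.output}")
    log.append(task.name)


def demo(workdir: Path) -> List[str]:
    """Two-step pipeline: ingest raw numbers, then write their sum."""
    raw = workdir / "raw.txt"
    report = workdir / "report.txt"
    ingest = Task("ingest", raw, lambda: raw.write_text("3\n4\n"))
    summarise = Task(
        "summarise",
        report,
        lambda: report.write_text(str(sum(int(x) for x in raw.read_text().split()))),
        requires=[ingest],
    )
    log: List[str] = []
    build(summarise, log)
    return log  # names of the tasks that actually ran, in order
```

Calling `demo` twice on the same directory runs both tasks the first time and nothing the second time, since both outputs already exist. Real orchestrators add scheduling, retries, and monitoring on top of this completeness check.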