
10 ways to stumble with big data


Many companies have data with great potential. There are many ways to go wrong with Big Data projects, however; the difference between a successful and a failed project can be huge, both in cost and return on investment. In this talk, we will describe the most common pitfalls and how to avoid them. You will learn to:

- Be aware of the existing risk factors in your organisation that may cause a data project to fail.
- Learn how to recognise the most common and costly causes of project failure.
- Learn how to avoid or mitigate project problems in order to ensure return on investment in a lean manner.



  1. 10 ways to stumble with big data, 2017-09-14, Lars Albertsson, www.mapflat.com
  2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant)
  3. Data-centric systems, 1st generation ● The monolith ○ All data in one place ○ Analytics + online serving from a single database (Diagram: presentation, logic, and storage layers on top of a single DB)
  4. Data-centric systems, 2nd generation ● Collect aggregated data from multiple online systems into a data warehouse ● Aggregate to OLAP cubes ● Analytics focused (Diagram: services and a web application feeding daily aggregates into a data warehouse)
  5. 3rd generation - event oriented (Diagram: cluster storage holding a data lake; ETL jobs form pipelines that produce datasets, feeding AI features, data-driven product development, and analytics)
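To make the job / dataset / pipeline vocabulary concrete, here is a minimal sketch assuming Luigi as the workflow orchestration tool (the deck does not name one); the dataset names and lake paths are hypothetical:

```python
import datetime

import luigi


class RawEvents(luigi.ExternalTask):
    """A dataset that already exists in the lake (hypothetical path)."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"lake/raw_events/{self.date}.jsonl")


class CleanEvents(luigi.Task):
    """A job: reads one dataset, writes a new one, and never mutates its input."""
    date = luigi.DateParameter()

    def requires(self):
        return RawEvents(self.date)

    def output(self):
        return luigi.LocalTarget(f"lake/clean_events/{self.date}.jsonl")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                if line.strip():  # trivial "cleaning" for the sketch
                    dst.write(line)


if __name__ == "__main__":
    # The pipeline is the dependency graph of such jobs; the scheduler only runs
    # what is missing or failed, which is what makes batch pipelines forgiving.
    luigi.build([CleanEvents(date=datetime.date(2017, 9, 14))], local_scheduler=True)
```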
  6. Why bother? ● Development iteration speed ● Data-driven development ● Machine learning features ● Democratised data access
  7. 1 - Spending-driven development ● Large spending before value delivery ● Vendors want you to make this mistake ● Warning signs: no workflow orchestration tool, driven by infrastructure department, project named “data lake” or “data platform”, high trust in vendor
  8. 2 - Premature scaling ● You don’t have big data! ● Max cloud instance memory: 2 TB ● Does your data ○ fit? ○ grow faster than Moore’s law? ● Scale out only when needed ● Big data → lean data ○ Time-efficient data handling ○ Democratised data ○ Complex business logic ○ Human fault tolerance ○ Data agility ● Warning signs: funky databases, in-memory technology, daily work requires a cluster
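As a back-of-the-envelope version of the “does your data fit?” question, a tiny sketch; the volume figures below are made-up assumptions, not numbers from the talk:

```python
# Rough "do we actually have big data?" estimate (all figures are assumptions).
events_per_day = 5_000_000        # hypothetical event volume
bytes_per_event = 200             # hypothetical average record size
retention_days = 3 * 365

dataset_bytes = events_per_day * bytes_per_event * retention_days
instance_ram_bytes = 2 * 1024**4  # ~2 TB, the figure quoted on the slide

print(f"dataset ≈ {dataset_bytes / 1024**4:.2f} TiB")
print("fits on one large instance" if dataset_bytes <= instance_ram_bytes
      else "consider scaling out")
```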
  9. 3 - The data waterfall ● Handovers add latency ● Low product agility ● Warning signs: high time to delivery, unclear use cases, many teams from source to end, no workflow orchestration tool, mono-functional teams
  10. Right turn: Feature-driven teams & infrastructure ● Cross-functional teams own a specific feature ● Path from source data to end-user service ● Start out with workflow orchestration ● Self-service infrastructure added lazily ● Postpone clusters & investments ● End-to-end proof of concepts
  11. 4 - Lake of trash ● The team that owns data exports it to the lake; the team needing data imports it from the lake ● Warning signs: excessive time spent cleaning, data feature teams access production data, data quality & semantics issues
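One way to avoid a lake of trash is for the owning team to validate records against an agreed contract before exporting them to the lake, rather than letting consumers clean up afterwards. A minimal sketch; the field list below is a hypothetical contract, not one from the deck:

```python
# Hypothetical export contract: required fields and their types.
REQUIRED_FIELDS = {"event_id": str, "user_id": str, "timestamp": str}


def contract_violations(event: dict) -> list:
    """Return a list of violations; an empty list means the record may be exported."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"bad type for {field}: {type(event[field]).__name__}")
    return problems


# The owning team runs this check in its export job, so consumers never need to
# reverse-engineer semantics from raw production data.
assert contract_violations(
    {"event_id": "e1", "user_id": "u1", "timestamp": "2017-09-14T12:00:00Z"}) == []
```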
  12. 5 - Random walk ● Many iterative steps without a target vision ● Works fine for months; pain then increases gradually ● Difficult to be GDPR compliant ● Warning signs: autonomous / microservice culture, little technology governance, no plan for schemas, deployment, or privacy, wide changes difficult
  13. 6 - Distinct crawl ● Batch data pipelines are forgiving ○ Workflow orchestration tool for recovery ● Many practices are cargo rituals ○ Release management ○ In situ testing ○ Performance testing ● Start minimal & quick ○ Developer integration tests ○ Continuous deployment pipeline ● Add process only if there is pain ● Warning signs: enterprise culture, heavy practice governance, standard rituals applied, late first delivery
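In the spirit of starting with developer integration tests rather than heavier rituals, a minimal sketch of a test that runs a small batch step end to end on local files; pytest and the `count_events_per_user` function are assumptions for illustration:

```python
import json
from pathlib import Path


def count_events_per_user(in_path: Path, out_path: Path) -> None:
    """Tiny batch job: read newline-delimited JSON events, write per-user counts."""
    counts = {}
    for line in in_path.read_text().splitlines():
        event = json.loads(line)
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    out_path.write_text(json.dumps(counts))


def test_count_events_per_user(tmp_path):
    src = tmp_path / "events.jsonl"
    dst = tmp_path / "counts.json"
    src.write_text('{"user": "a"}\n{"user": "a"}\n{"user": "b"}\n')

    count_events_per_user(src, dst)  # runs the whole step locally, no mocks

    assert json.loads(dst.read_text()) == {"a": 2, "b": 1}
```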
  14. 7 - Data loss by design ● Warning signs: processing during data ingestion, unclear source of truth, mutable master data ● Instead: store every event, immutable data, reproducible execution, large recovery buffers ● Benefits: human error tolerance, component error tolerance, rapid iteration speed, eliminate manual precautions
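One reading of “store every event, immutable data” is to append incoming events to date-partitioned files that are never updated in place, so derived datasets can always be recomputed after a human or component error. A minimal sketch; paths and field names are hypothetical:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def store_event(lake_root: Path, event: dict) -> None:
    """Append the raw event to a date/hour-partitioned file; never overwrite or edit."""
    now = datetime.now(timezone.utc)
    partition = lake_root / "events" / now.strftime("%Y-%m-%d") / f"{now:%H}.jsonl"
    partition.parent.mkdir(parents=True, exist_ok=True)
    with partition.open("a") as f:
        f.write(json.dumps(event) + "\n")
    # Derived views (e.g. a "current balance" table) are produced by batch jobs
    # reading these files, so errors can be repaired by rerunning the pipeline.
```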
  15. 8 - AI first ● You can climb, not jump ● PoCs are possible ● Credits: “The data science hierarchy of needs”, Monica Rogati (Diagram: pyramid from data collection, instrumentation, pipelines, and data infrastructure at the base, through anomaly detection, curation, segments, analytics, machine learning, and A/B testing, to deep learning and AI at the top, with axes labelled value and effort)
  16. 9 - Technical bankruptcy ● Data pipeline == software product ● Apply common best practices ○ Quality tools & processes ○ Automated (integration) testing ○ CI/CD ○ Refactoring ● Avoid tools that steer you away ○ Local execution? ○ Difficult testing? ○ Mocks required? ● Strong software engineers needed ○ Rotate if necessary ● Warning signs: heterogeneous environment, weak release process, few code quality tools, excessive time on operations
  17. 10 - Team trinity unbalance ● Team sport: data engineer, data scientist, product owner ● Mutual respect & learning ● Be driven by user value ● Balance with innovation and engineering ● Warning signs: increasing tech debt, little innovation, low business value
  18. 11 - Miss the train ● Big data + AI is not optional; cf. the Internet, smartphones, … ● Product development speed impact is significant ● Data-driven evaluation ● Forgiving environment: move fast without breaking things ● Democratised access to data
