
Introduction to Data Engineering (with Scala)

An introduction to Data Engineering for Data Scientists.

Published in: Software


  1. 1. Introduction to Data Engineering (with Scala) John Nestor 47 Degrees www.47deg.com June 27, 2016 Galvanize
  2. 2. 47deg.com © Copyright 2015 47 Degrees Outline • Introduction • Data Engineering Requirements • Data Engineering Design Patterns • Recommended Data Engineering Tools and Systems • Final Thoughts 2
  3. 3. Introduction
  4. 4. Typical Data Engineering Systems • Low latency response to HTTP or REST requests • Database reads and writes • Run ML models • Produce event streams for later processing • Near real time event processing • Simple analytics and alerts • Analysis of server information • Logs and metrics • Produce data for later analysis by data scientists
  5. 5. Big Data • (Much) too big to fit on a single machine • Must have both • distributed computation • distributed data (bases) • A distributed system has no single main memory • Must pass data across servers • A large number of distributed components means failure is common • Dealing with failure must be part of the fundamental architecture
  6. 6. Fallacies of Distributed Computing • https://blogs.oracle.com/jag/resource/Fallacies.html Peter Deutsch • The network is reliable • Latency is zero • Bandwidth is infinite • The network is secure • Topology doesn’t change • There is one administrator • Transport cost is zero • The network is homogeneous
  7. 7. Reactive Manifesto • http://www.reactivemanifesto.org/ • Responsive - predictable latency • Resilient - fault tolerant • Elastic - (auto) scalability • Message driven - basis of a distributed implementation
  8. 8. Data Engineering Requirements
  9. 9. Scalability • New systems are getting bigger all the time • Hardware is getting cheaper • Business requirements to stay competitive are increasing • Cloud computing permits easy expansion based on instantaneous need • No single server is ever big enough • Scalability goal: performance increases (close to) linearly with the number of servers
  10. 10. Availability • Systems are increasingly expected to be available 24/7 with no downtime • Any server can fail, others must be able to take over • No downtime for maintenance. Software upgrades occur without shutting the system down. • Must avoid availability-killing features such as 2-phase commit • SLAs: number of nines • The best most achieve is 3 nines (8.8 hours per year) • Most strive for 6 nines (about 32 seconds per year) • AWS S3 claims 9 nines (32 msec per year)
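The "number of nines" figures on the Availability slide follow from simple arithmetic: an availability of n nines leaves a fraction 10^-n of the year as allowed downtime. The sketch below is illustrative only; the object name `Nines` and the 365.25-day year are assumptions, not part of the original deck.

```scala
object Nines {
  // Assumption: one year is 365.25 days.
  val SecondsPerYear: Double = 365.25 * 24 * 60 * 60

  /** Allowed downtime per year, in seconds, for an availability of `nines` nines
    * (for example, 3 nines means 99.9% availability). */
  def downtimeSecondsPerYear(nines: Int): Double =
    SecondsPerYear * math.pow(10.0, -nines.toDouble)
}
```

Under these assumptions, `Nines.downtimeSecondsPerYear(3) / 3600` is roughly 8.8 hours, 6 nines is roughly 32 seconds, and 9 nines is roughly 32 msec.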
  11. 11. Durability • Losing data is never acceptable • Since any single point can fail, we must replicate data • Replication to • main memory • different server • server in different zone • across geo-distributed data centers • AWS S3 will lose at most one object out of 32K objects every 10 million years
  12. 12. Latency and Bandwidth • Latency - msec to process a single request • More hops can increase latency • Very fast network hardware can reduce latency • Speed of light still sets a lower bound on latency • Bandwidth - number of requests processed per sec • More servers can increase bandwidth • Latency Numbers Every Programmer Should Know • main memory (0.0001 msec) • different server (0.5 msec) • across geo-distributed data centers (150 msec)
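The speed-of-light limit mentioned on the Latency and Bandwidth slide can be made concrete with a small calculation. This is a rough sketch: `LatencyBound` is a hypothetical name, and it uses the vacuum speed of light, so real networks (fiber, routing, queuing) will be slower than this bound.

```scala
object LatencyBound {
  // Speed of light in vacuum, in km per second.
  val LightKmPerSec: Double = 299792.458

  /** Smallest possible round-trip time, in msec, between two points
    * `distanceKm` apart. No network can beat this figure. */
  def minRoundTripMsec(distanceKm: Double): Double =
    2.0 * distanceKm / LightKmPerSec * 1000.0
}
```

For two data centers 10,000 km apart, `LatencyBound.minRoundTripMsec(10000.0)` is about 67 msec, which is the same order of magnitude as the 150 msec quoted for geo-distributed data centers.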
  13. 13. Data Engineering Design Patterns
  14. 14. Immutable Data • Concurrent access to mutable data requires synchronization. Immutable data does not. • Data passed between servers will be immutable • Immutable data plus functional programming results in code that is easier to understand and test
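As a minimal illustration of the Immutable Data slide, a Scala case class has immutable fields by default, and "changing" a value means creating a new one. The names `Reading` and `ImmutableDemo` are invented for this sketch.

```scala
// An immutable value: all fields are vals, so instances can be shared between
// threads or sent to another server without any locking.
final case class Reading(sensorId: String, celsius: Double)

object ImmutableDemo {
  // A pure function: the result depends only on the argument.
  def toFahrenheit(r: Reading): Double = r.celsius * 9.0 / 5.0 + 32.0

  // "Updating" returns a new Reading; the original is left untouched.
  def recalibrate(r: Reading, offset: Double): Reading =
    r.copy(celsius = r.celsius + offset)
}
```

Because `recalibrate` never modifies its argument, a test can compare inputs and outputs directly with no setup or teardown of shared state.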
  15. 15. Messaging (1 of 2) • Message sent from A to B • A gets ack from B • A gets no ack from B • message never got to B • ack from B never got to A • What kind? • at most once (never resend) • at least once (resend if no ack) • exactly once (resend idempotently if no ack)
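The at-least-once case on this slide can be sketched as a sender that resends until it sees an ack. This is a simplified, single-threaded illustration: `AtLeastOnce`, the `Send` type, and the retry limit are assumptions, and a real transport would add timeouts and backoff.

```scala
object AtLeastOnce {
  /** Hypothetical transport: returns true if an ack came back, false otherwise.
    * A false result does not tell the sender whether the message or the ack was lost. */
  type Send = String => Boolean

  /** Resend until acked or until attempts run out.
    * Returns the number of sends it took, or None if no ack was ever received. */
  @annotation.tailrec
  def sendAtLeastOnce(msg: String, send: Send, maxAttempts: Int, attempt: Int = 1): Option[Int] =
    if (attempt > maxAttempts) None
    else if (send(msg)) Some(attempt)
    else sendAtLeastOnce(msg, send, maxAttempts, attempt + 1)
}
```

If the message arrived but the ack was lost, each resend delivers a duplicate to B, which is why at-least-once delivery needs an idempotent receiver to behave like exactly-once.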
  16. 16. Messaging (2 of 2) • Idempotence • Multiple sends have the same effect • set X to 3, NOT add 2 to X • Attach GUID, destination must handle • In-order delivery • Waiting for an ack before sending the next message increases latency • Attach sequence number, destination must handle • Batching multiple messages together can help • Design so order does not matter
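One way a destination might "handle" the GUID and sequence number from this slide is sketched below. The `Envelope` and `Receiver` names are hypothetical, sequence numbers are assumed to start at 0, and the class is not thread-safe; it is only meant to show the bookkeeping.

```scala
import java.util.UUID

final case class Envelope(id: UUID, seq: Long, payload: String)

/** Receiver that ignores duplicate ids and releases payloads in sequence order. */
final class Receiver {
  private var seen = Set.empty[UUID]            // ids already accepted
  private var pending = Map.empty[Long, String] // arrived early, waiting for a gap to fill
  private var nextSeq = 0L                      // next sequence number to release
  private var delivered = Vector.empty[String]

  def receive(e: Envelope): Unit =
    if (!seen.contains(e.id)) { // a resend of a known id has no further effect
      seen = seen + e.id
      pending = pending.updated(e.seq, e.payload)
      // Release every payload that is now contiguous with what was delivered.
      while (pending.contains(nextSeq)) {
        delivered = delivered :+ pending(nextSeq)
        pending = pending - nextSeq
        nextSeq += 1
      }
    }

  def output: Vector[String] = delivered
}
```

With this receiver the sender does not have to wait for each ack before sending the next message: duplicates are dropped and out-of-order arrivals are buffered until the gap is filled.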
  17. 17. Persistent Data (1 of 3) • CAP theorem (pick 2) • Consistency (ACID) • Availability • Partition tolerance (closely tied to fault tolerance) • Distributed consistency solutions: 2-phase commit is “the anti-availability protocol” (Helland) • For very large highly available systems, AP is the only possible choice
  18. 18. Persistent Data (2 of 3) • Detecting conflicts with vector clocks • Each server has its own time • Vector has one element for each server • Forms a partial order • Resolving conflicts (for example: 2 different phone numbers) • Select the latest • Ask someone • Keep both • CRDTs (generalization of keep both) • conflict-free replicated data types • merge must be commutative, associative, idempotent
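The partial order formed by vector clocks, and a merge with the three properties the slide lists, can be sketched in a few lines. This is an illustrative sketch rather than a production design: `VectorClocks` and its member names are invented here, and a missing server entry is assumed to mean a counter of 0.

```scala
object VectorClocks {
  type Clock = Map[String, Long] // server id -> that server's counter

  sealed trait Order
  case object Before extends Order
  case object After extends Order
  case object Equal extends Order
  case object Concurrent extends Order // neither dominates: a conflict to resolve

  /** Compare two clocks element by element; the result is only a partial order. */
  def compare(a: Clock, b: Clock): Order = {
    val keys = a.keySet ++ b.keySet
    val aLeB = keys.forall(k => a.getOrElse(k, 0L) <= b.getOrElse(k, 0L))
    val bLeA = keys.forall(k => b.getOrElse(k, 0L) <= a.getOrElse(k, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }

  /** Element-wise max. It is commutative, associative and idempotent, so replicas
    * converge regardless of the order or number of merges. */
  def merge(a: Clock, b: Clock): Clock =
    (a.keySet ++ b.keySet)
      .map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .toMap
}
```

A `Concurrent` result is the case where the slide's resolution strategies apply (select the latest, ask someone, keep both); the element-wise max is the same merge a grow-only counter CRDT would use.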
  19. 19. Persistent Data (3 of 3) • Log based stores • Sequence of transformational steps • Each step is immutable • Log is append only (fast sequential write to disk) • Database is a cache of some point in the log • Log is primary • Database can be deleted and recreated from log
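The idea that the database is a cache of some point in the log can be sketched as a fold over immutable events. The account events and the `LogStore` name below are invented for illustration; an in-memory `Vector` stands in for the on-disk append-only log.

```scala
object LogStore {
  // Each log entry is an immutable description of one step.
  sealed trait Event
  final case class Deposited(account: String, amount: Long) extends Event
  final case class Withdrew(account: String, amount: Long) extends Event

  // The "database": current balance per account, derived entirely from the log.
  type Balances = Map[String, Long]

  def applyEvent(state: Balances, e: Event): Balances = e match {
    case Deposited(acct, amt) => state.updated(acct, state.getOrElse(acct, 0L) + amt)
    case Withdrew(acct, amt)  => state.updated(acct, state.getOrElse(acct, 0L) - amt)
  }

  /** Rebuild the view by replaying the first `upTo` entries of the log. */
  def replay(log: Vector[Event], upTo: Int): Balances =
    log.take(upTo).foldLeft(Map.empty[String, Long])(applyEvent)
}
```

Deleting the derived `Balances` loses nothing: calling `replay` on the full log recreates it, and replaying a prefix gives the state at an earlier point in time.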
  20. 20. Concurrency and Distribution • Individual servers are getting ever more cores. • Utilization is key • Large data applications require multiple servers • Connections between servers are frequent points of failure • Parallel data operations help: parallel collections, Spark • Traditional synchronization (locks, monitors) is error prone and very hard to get right. • Message-based systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers.
  21. 21. Logging and Monitoring • As systems involve more and more servers • Detecting and locating failure is getting harder • Understanding system performance and performance tuning is getting harder • We now produce massive amounts of logs and monitoring data • Making sense of this huge volume of data is hard • For failures we need near real-time analysis • Increasing need for data science solutions
  22. 22. Continuous Deployment (1 of 2) • High availability means we can no longer shut down for upgrades to • Application code • Operating system upgrades and patches • Hardware maintenance • Automatic server failover • Rolling upgrades • Backward compatibility • Messages • Database schemas
  23. 23. Continuous Deployment (2 of 2) • Deployment of lots of small changes reduces the chance of errors in any single deployment • Requires comprehensive automation for testing and deployment • But errors still do occur • Although we have good methods for testing individual components, integration testing is still hard and error prone. • Some approaches • Roll back • A/B testing • Database checkpoints
  24. 24. Recommended Data Engineering Tools and Systems
  25. 25. Choices • Open source preferred • Personal favorites • Widely used (best practices in leading companies)
  26. 26. Prefer Open Source • “Free” • Full source is available • Community participation • Can move very fast • More responsive • A plus if a commercial company provides support
  27. 27. Programming Languages (1 of 3) • Compiled versus interpreted • Compiled: C, C++, Go • Semi-compiled: Java, C#, Scala • Interpreted: Python, Ruby, R • Static versus dynamic type checking • Static typing catches more errors at compile time • Statically typed code is easier to understand and maintain • Static typing requires more work when writing code • Garbage collection: safety versus performance
  28. 28. Programming Languages (2 of 3) • Choice of language does not matter • I can write any algorithm in any language • Let’s avoid pointless “language religion” wars • Choice of language matters a lot • Language can have a big impact on performance, productivity and reliability • Programming languages shape the way we think
  29. 29. Programming Languages (3 of 3) • Scala • Semi-compiled. Compiled to bytecode, then by a JIT compiler. • Statically typed, but with the concise syntax of a dynamically typed language • Garbage collected • Runs on the JVM. Full ecosystem of libraries and tools available. • Key features • Functional plus immutable data (major advance in program quality) • Scala Futures and Akka Actors (major advance in easy to understand, easy to get correct, and fault-tolerant distributed computation) • Main language for Spark • Suitable for both data engineers and data scientists (better cooperation)
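To give a feel for the Scala Futures mentioned on this slide, here is a small sketch using only the standard library. `FutureDemo` and `fetchCount` are hypothetical stand-ins for independent remote calls, and the global execution context and 5 second timeout are assumptions made for the example. Akka Actors are not shown because they need an external dependency.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Hypothetical stand-in for a remote call; the body runs on another thread.
  def fetchCount(shard: Int): Future[Int] = Future(shard * 10)

  /** Start both calls first so they can run concurrently, then combine the results.
    * If either call fails, the combined Future fails too. */
  def total(): Future[Int] = {
    val a = fetchCount(1) // starts running immediately
    val b = fetchCount(2)
    for (x <- a; y <- b) yield x + y
  }

  /** Replace a failed call with a fallback value instead of propagating the error. */
  def safe(f: Future[Int]): Future[Int] = f.recover { case _: Exception => 0 }

  /** Blocking is used here only to read the result in a demo or test. */
  def totalBlocking(): Int = Await.result(total(), 5.seconds)
}
```

No locks appear anywhere in this code: the values passed between the Futures are immutable, and failure is an ordinary value that can be handled with `recover`.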
  30. 30. Messaging • Kafka (written in Scala) • Reliable buffer between producer and consumer • Can replay • Multiple producers and consumers • Multiple topics • Linearly scalable • Kafka Streams • Other • Reactive streams • Spark streaming
  31. 31. Databases • Relational: Postgres (scaling can be a problem) • Embedded: LevelDB, MapDB • NoSQL: Cassandra, Couchbase • Graph: Neo4j, Titan, DataStax Enterprise Graph
  32. 32. Analytics • Hadoop (let it die!) • Spark (Written in Scala, Scala API is best) • Trend toward SQL • Improved performance via query optimizer • Widely understood (but poor?) programming model • Somewhat abandoned functional programming (RDDs) • dataset transforms: experiment to combine functional programming with support for query optimization
  33. 33. Data Center Infrastructure and Continuous Deployment • GitHub, SBT, Artifactory, Jenkins • Docker/Rkt, Etcd, CoreOS • Mesos, Kubernetes • Cloud: AWS, Google, Microsoft
  34. 34. Final Thoughts
  35. 35. Final Thoughts • Scala is the best choice for both data engineers and data scientists • Spark is the best choice for data analysis • Data will continue to grow in size and importance • The number of servers we use will continue to grow, requiring better fault tolerance and better automation • When data engineers and data scientists work closely together, both benefit and better results are achieved • We need to break down traditional silos • We need shared tools and technologies that work well for both groups
  36. 36. Questions
