
Demystifying Data Engineering

Talk given in NYC on 7/20/2015

Published in: Technology

  1. Demystifying Data Engineering
  2. Data engineering • Software engineering with an emphasis on dealing with large amounts of data • A “specialty” of software engineering
  3. Why now? • Always value in scale, but it was previously too difficult / expensive • Economics and technology advances make these scales accessible
  4. Enable others to answer questions on a dataset within latency constraints
  5. Data engineering • Distributed systems – consensus, consistency, availability, etc. • Parallel processing • Databases • Queuing
  6. Data engineering • Human-fault tolerance • Metrics and monitoring • Multi-tenancy
  7. BackType • When I joined: • Comment search by keyword • Comment search by user • Basic stats on commenters • Link search on Twitter
  8. BackType architecture: Kyoto Cabinet + custom workers + custom crawlers
  9. BackType • Inflexible • Prone to corruption • Heavy operational burden • Not scalable • Not fault-tolerant
  10. BackType • Enables asking any question (with high latency) • Allows exploration and experimentation • Establishes human-fault tolerance
  11. (Diagram: four parallel Collector processes)
  12. ElephantDB • Export results of MapReduce pipelines for querying • Low-latency querying, but out of date by many hours • Incredibly simple
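ElephantDB's simplicity comes from serving batch-written, read-only key/value indexes. A minimal sketch of that serving model, with a plain dict standing in for ElephantDB's on-disk index (all names here are illustrative, not ElephantDB's actual API):

```python
class ReadOnlyIndex:
    """Toy stand-in for an ElephantDB shard: a batch-written,
    read-only key/value store with no update path at all."""

    def __init__(self, kv_pairs):
        # The index is built once from a batch job's output and never
        # mutated, which is what keeps the serving layer so simple.
        self._data = dict(kv_pairs)

    def get(self, key, default=None):
        return self._data.get(key, default)


# A batch MapReduce job would emit pairs like (url, count); the server
# simply swaps in the freshly built index when the next batch finishes,
# which is why query results can be hours out of date.
index = ReadOnlyIndex([("example.com/a", 42), ("example.com/b", 7)])
print(index.get("example.com/a"))  # 42
```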
  13. Data engineering • Infrastructure • Data pipelines • Abstractions
  14. Data pipeline example (batch): Tweets (S3) → Normalize URLs → Compute hour bucket → Sum by hour/url → Emit ElephantDB indexes
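The batch pipeline on this slide can be sketched as plain map/reduce steps. Below is a toy in-memory version; the tweet format and the normalization rule are assumptions for illustration, not BackType's actual code:

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    # Toy normalization: lowercase the host, drop query string and fragment.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


def hour_bucket(epoch_seconds):
    # Truncate a timestamp to the start of its hour.
    return epoch_seconds - (epoch_seconds % 3600)


def sum_by_hour_url(tweets):
    # Map each (timestamp, url) tweet to a ((hour, normalized_url), 1) pair
    # and reduce by summing -- the shape of the MapReduce job on the slide.
    counts = Counter()
    for ts, url in tweets:
        counts[(hour_bucket(ts), normalize_url(url))] += 1
    return counts  # in the real pipeline, emitted as ElephantDB indexes


tweets = [(7200, "http://Example.com/a?x=1"), (7300, "http://example.com/a")]
print(sum_by_hour_url(tweets))
```

Both sample tweets normalize to the same URL and fall in the same hour bucket, so they reduce to a single count of 2.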
  15. Data pipeline example (streaming): Tweets (Kafka) → Normalize URLs → Compute hour bucket → Update hour/url bucket → Cassandra
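The streaming variant runs the same normalize/bucket logic, but updates each bucket incrementally per tweet instead of summing in a batch. A minimal sketch, with a dict standing in for the Cassandra table and a list standing in for the Kafka feed (names are illustrative):

```python
from collections import defaultdict

# Dict standing in for a Cassandra table keyed by (hour, url).
hour_url_counts = defaultdict(int)


def handle_tweet(ts, url):
    # Same normalize / bucket steps as the batch job, applied to one
    # tweet at a time as it arrives from the stream.
    hour = ts - (ts % 3600)
    norm = url.lower().split("?")[0]    # toy URL normalization
    hour_url_counts[(hour, norm)] += 1  # incremental update, not a batch sum


# A list standing in for the Kafka tweet stream.
for ts, url in [(7200, "http://example.com/a?x=1"), (7300, "http://example.com/a")]:
    handle_tweet(ts, url)
print(dict(hour_url_counts))
```

The trade-off versus the batch pipeline: results are fresh within seconds, but there is no cheap way to recompute everything if the logic changes, which is why both pipelines coexist.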
  16. Abstraction example: MapReduce → Cascading → Cascalog
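The ladder on this slide (raw MapReduce at the bottom, Cascading and then Cascalog layered on top) can be illustrated in miniature: the same word count written against explicit map/reduce phases versus a single declarative expression. This is a Python analogy for the idea, not the actual Cascading or Cascalog APIs:

```python
from collections import Counter
from itertools import chain

docs = ["data engineering", "data pipelines"]

# Low level: explicit map and reduce phases, like hand-written MapReduce.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)
low_level = {}
for word, one in mapped:
    low_level[word] = low_level.get(word, 0) + one

# High level: the same computation as one declarative expression --
# the role Cascading/Cascalog play on top of raw MapReduce.
high_level = Counter(chain.from_iterable(d.split() for d in docs))

print(low_level == dict(high_level))  # both compute identical word counts
```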
  17. Infrastructure • HDFS • MapReduce • Kafka • Storm • Spark • Cassandra • HBase • ElephantDB • Zookeeper
  18. Streaming compute team at Twitter • Started the streaming compute team at Twitter • One shared Storm cluster for the entire company
  19. Multi-tenancy • Independent applications on the same cluster • Topologies should not affect one another
  20. Resource allocation • Topologies should be given an appropriate amount of resources
  21. Initial approach • Use Mesos to provide resource guarantees • Users include resources needed as part of topology submission
  22. Solution • Implement a new scheduler which gives production topologies dedicated hardware • Only the Storm team can configure production topologies • Left-over machines are used for failover capacity or for in-development topologies
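The scheduling policy on this slide can be sketched in a few lines: production topologies claim dedicated machines first, and whatever remains serves as failover capacity or runs in-development topologies. This is a toy model of the policy only (Storm's real scheduler interface is Java and far more involved):

```python
def schedule(machines, production, development):
    """Assign dedicated machines to production topologies first;
    leftovers serve as failover capacity or run dev topologies."""
    assignment, pool = {}, list(machines)
    for topo, needed in production:   # only the Storm team may mark these
        assignment[topo] = pool[:needed]
        pool = pool[needed:]
    for topo, needed in development:  # dev topologies share the leftovers
        assignment[topo] = pool[:needed]
        pool = pool[needed:]
    return assignment, pool           # pool = remaining failover machines


assignment, failover = schedule(
    machines=["m1", "m2", "m3", "m4", "m5"],
    production=[("prod-counts", 2), ("prod-trends", 1)],
    development=[("dev-experiment", 1)],
)
print(assignment, failover)
```

Because production claims hardware first, a misbehaving dev topology can never starve a production one, which is the multi-tenancy guarantee the earlier slides ask for.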
  23. Data Engineering vs Data Science • Well-defined problems • No special statistics skills required • Larger scope • Not just analytics
  24. Open source • Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase) • Many have commercial support
  25. Open source • Very important for recruiting data engineers • Strong developers want to work at places where they can be involved with open source
  26. Open source • Develop a technology brand for the company (in conjunction with a tech blog) • Creating a popular open source project can give you access to lots of strong engineers
  27. Open source • Identify strong engineers in the community you may want to recruit • Learn best practices and get help from the people who know the tools best • *Do not* expect to get “free work” on your projects
  28. Ideal data engineer • Strong software engineering skills • Abstraction • Testing • Version control • Refactoring
  29. Ideal data engineer • Strong software engineering skills • Strong algorithm skills
  30. Ideal data engineer • Strong software engineering skills • Strong algorithm skills • Good at digging into open source code
  31. Ideal data engineer • Strong software engineering skills • Strong algorithm skills • Good at digging into open source code • Good at stress testing
  32. Finding strong data engineers • Standard “coding on the whiteboard” interviews are near useless • Use take-home projects to gauge general programming ability • Best of all: look at candidates’ past projects that required data engineering
  33. Questions?
