Big data and containers

Description

Thinking through how containers should change our thinking in big data.

Transcript

  1. Big Data and Containers / Charles Smith (@charles_s_smith)
  2. Who am I? Netflix / lead of the big data platform architecture team. Spend my time / thinking about how to make it easy and efficient to work with big data. University of Florida / PhD in Computer Science.
  3. “It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong.” Terry Pratchett
  4. Database → Distributed Database → Distributed Storage → Distributed Processing → ???
  5. Why do we care about containers?
  6. Containers ~= Virtual Machines; Virtual Machines ~= Servers
  7. Lightweight (fast to start, low memory use). Secure (process isolation, data isolation). Portable. Composable. Reproducible. Everything old is new.
  8. Microservices and large architectures
  9. Data storage (Cassandra, MySQL, MongoDB, etc.)
  10. Operational (Mesos, Kubernetes, etc.)
  11. Discovery/Routing
  12. What's different about big data?
  13. Data at rest / Data in motion
  14. Customer facing: minimize latency, maximize reliability
  15. Data analytics: minimize I/O, maximize processing
  16. Ship computation to data
  17. The questions you can answer aren't predefined
  18. Hive/Pig/MR, Presto, Metacat, Hive Metastore
  19. That doesn't look very container-y (or microservice-y, for that matter)
  20. Data storage - HDFS (or, in our case, S3)
  21. Operational - YARN
  22. Containers - JVM
  23. So what happens when you want to do something else?
  24. But is that really the way we want to approach containers?
  25. What's different about big data?
  26. Running many different short-lived processes
  27. Running many different short-lived processes: efficient container construction, allocation, and movement
  28. Groups of processes having meaning
  29. Groups of processes having meaning: how we observe processes needs to be holistic
  30. Processes need to be scheduled by data locality (and not just data locality for data at rest)
  31. Processes need to be scheduled by data locality (and not just for data at rest). A special case of affinity (although possibly over time), but...
  32. We do need a data discovery service. (kind of… maybe… a namenode?)
  33. SELECT t.title_id, t.title_desc, SUM(v.view_secs) FROM view_history AS v JOIN title_d AS t ON v.title_id = t.title_id WHERE v.view_dateint > 20150101 GROUP BY 1, 2; → LOAD, LOAD, JOIN, GROUP
  34. Data Discovery, Query Compiler, Query Planner, Metadata, DAG Watcher
  35. Bottom line: Containers provide process-level security. The goal should be to minimize monoliths. This isn't different from what we are doing already. Our languages are abstractions of composable, distributed processing. Different big data projects should share services. No matter what we do, joining is going to be a big problem.
  36. Questions?
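Slide 27 argues that running many different short-lived processes demands efficient container construction and allocation. One common way to make allocation cheap is to keep a pool of pre-built, warm containers and hand them out on demand. A minimal Python sketch of that idea; `Container` and `WarmPool` are hypothetical names invented for illustration (the talk does not name an implementation):

```python
from collections import deque

class Container:
    """Hypothetical stand-in for a pre-built, warm container."""
    def __init__(self, image):
        self.image = image
        self.in_use = False

class WarmPool:
    """Pre-creates containers so short-lived tasks pay no startup cost."""
    def __init__(self, image, size):
        self.image = image
        self.idle = deque(Container(image) for _ in range(size))

    def acquire(self):
        # O(1) handout from the warm pool; fall back to a cold start.
        c = self.idle.popleft() if self.idle else Container(self.image)
        c.in_use = True
        return c

    def release(self, c):
        # Return the container for reuse instead of tearing it down.
        c.in_use = False
        self.idle.append(c)

pool = WarmPool("bigdata-task:latest", size=4)
c = pool.acquire()
pool.release(c)
```

The point is only the shape of the approach: amortize construction cost across many short-lived tasks by reusing containers, rather than building one per task.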
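Slides 30 and 31 call for scheduling by data locality. A minimal sketch of the idea, with hypothetical node and block names: a task is placed on the node that already hosts the most of its input blocks, falling back to any node otherwise:

```python
# Sketch of data-locality scheduling: place a task on the node that
# already holds the most of its input blocks (all names illustrative).

def schedule(task_blocks, block_locations, nodes):
    """Return the node hosting the most of the task's input blocks."""
    def local_blocks(node):
        return sum(1 for b in task_blocks if node in block_locations.get(b, ()))
    return max(nodes, key=local_blocks)

block_locations = {
    "view_history/part-0": {"node-a", "node-b"},
    "view_history/part-1": {"node-b"},
    "title_d/part-0": {"node-c"},
}
nodes = ["node-a", "node-b", "node-c"]
best = schedule(["view_history/part-0", "view_history/part-1"],
                block_locations, nodes)
# node-b holds both input blocks, so it wins
```

Slide 31's caveat applies: this is just a special case of affinity, and for data in motion the "location" of the data may itself change over time.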
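The query on slide 33 decomposes into a small DAG of stages, two LOADs feeding a JOIN feeding a GROUP, and that DAG, rather than a monolithic engine, is a natural unit to map onto containers. A sketch of the decomposition and its run order (the stage labels come from the slide; the code structure is illustrative):

```python
# DAG of the query's stages: each stage could run as its own container.
dag = {
    "LOAD view_history": [],
    "LOAD title_d": [],
    "JOIN on title_id": ["LOAD view_history", "LOAD title_d"],
    "GROUP BY title": ["JOIN on title_id"],
}

def topo_order(dag):
    """Run order: a stage is scheduled once all its inputs are done."""
    done, order = set(), []
    while len(done) < len(dag):
        for stage, deps in dag.items():
            if stage not in done and all(d in done for d in deps):
                done.add(stage)
                order.append(stage)
    return order

order = topo_order(dag)
# Both LOADs precede the JOIN; GROUP comes last
```

Something has to watch this graph and launch each stage when its inputs are ready, which is presumably the role of the "DAG Watcher" on slide 34.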

Editor's Notes

  • This is a good thing!
  • Something that is ingrained at Netflix
  • Decentralized
  • Basically: how do I deploy and get resources?
  • Think of it this way:

    Our content is data at rest: a bunch of encodings sitting on an Open Connect server somewhere.
    When someone wants to view something, the data is streamed to them, which is data in motion (and a huge chunk of downstream bandwidth).
    And the actual viewing of the content is the visualization of the data.

    You can extend this pattern to other services. Don’t go overboard, but it is a useful way to think about data. Especially when the data starts to get big.
  • But that isn’t really what we do.
  • As a result the allocations need to be fast and scalable.