AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)

Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us with the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores/engines like Teradata, Redshift, and Druid, as well as for exporting data to reporting tools like MicroStrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we'll dive into these major uses of S3, as well as many smaller cases where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.


 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Eva Tse, Director, Big Data Services, Netflix
    Kurt Brown, Director, Data Platform, Netflix
    November 29, 2016
    Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem (BDM306)
 2. What to Expect from the Session
    How we use Amazon S3 as our centralized data hub
    Our big data ecosystem on AWS
    - Big data processing engines
    - Architecture
    - Tools and services
 3. Why Amazon S3?
 4. S3
 5. ‘The only valuable thing is intuition.’ – Albert Einstein
 6. Why is it Intuitive?
    It is a cloud native service! (free engineering)
    ‘Practically infinitely’ scalable
    99.999999999% durable
    99.99% available
    Decouple compute and storage
 7. Why is it Counterintuitive?
    Eventual consistency?
    Performance?
 8. Our Data Hub Scale
    60+ PB
    1.5+ billion objects
 9. Data Velocity
    [Diagram: the data hub holds 60+ PB]
    - Ingest: 100+ TB daily
    - Expiration: 400+ TB daily
    - ETL processing: read ~3.5 PB daily, write 500+ TB daily
10. Ingest
11. Event Data Pipeline
    Business events
    ~500 billion events/day
    5 min SLA from source to data hub
12. [Diagram: Event data pipeline. In each of Regions 1, 2, and 3, cloud apps publish events to Kafka, which lands them on AWS S3; Ursula, fed by AWS SQS, moves the event data into the data hub.]
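    The regional Kafka-to-S3 leg can be sketched as a generic consumer that batches events into reasonably large S3 objects. This shows only the general shape of such a sink; Ursula and its SQS coordination are Netflix-internal, and the topic, broker, bucket, and batch size below are illustrative.

    import boto3
    from kafka import KafkaConsumer   # kafka-python

    s3 = boto3.client("s3")
    consumer = KafkaConsumer("business-events",                  # illustrative topic
                             bootstrap_servers=["broker:9092"],
                             group_id="s3-sink")

    batch, batch_id = [], 0
    for msg in consumer:
        batch.append(msg.value)                                  # raw event bytes
        if len(batch) >= 10000:                                  # flush big objects, not tiny ones
            key = "events/region-1/batch-%08d.json" % batch_id
            s3.put_object(Bucket="my-data-hub", Key=key, Body=b"\n".join(batch))
            batch, batch_id = [], batch_id + 1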
13. Dimension Data Pipeline
    Stateful data in Cassandra clusters
    Extract from tens of Cassandra clusters
    Daily or more granular extracts
14. [Diagram: Dimension data pipeline. Cassandra clusters in Regions 1, 2, and 3 back up SSTables to AWS S3; Aegisthus processes the SSTables into the data hub.]
15. Transform
16. Data
17. Data Processing
18. Our Data Processing Engines
    [Diagram: the data hub feeds Hadoop YARN clusters of ~250-400 r3.4xl and ~3,500 d2.4xl instances]
19. Look at It from a Scalability Angle
    1 d2.4xl has 24 TB
    60 PB / 24 TB = 2,560 machines
    To achieve 3-way replication for redundancy in one zone, we would need 7,680 machines!
    The data size we have is beyond what we could fit into our clusters!
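    A quick back-of-the-envelope check of those numbers, assuming the 1 PB = 1,024 TB convention (which is what makes the 2,560 figure come out exactly):

    # Back-of-the-envelope check of the slide's arithmetic (1 PB = 1,024 TB assumed).
    total_pb = 60
    tb_per_machine = 24            # a d2.4xlarge has 12 x 2 TB local HDDs = 24 TB
    single_copy = total_pb * 1024 / tb_per_machine
    print(single_copy)             # 2560.0 machines to hold one copy of the data
    print(single_copy * 3)         # 7680.0 machines with 3-way HDFS replication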
20. Tradeoffs
21. What are the tradeoffs?
    Eventual consistency
    Performance
22. Eventual Consistency
    Updates (overwrite PUTs)
    - Always put new files with new keys when updating data; then delete the old files
    List
    - We need to know if we missed something
    - Keep a prefix manifest in S3mper (or EMRFS)
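    A minimal sketch of the "new keys on update, then delete the old files" pattern with boto3. The bucket, prefix, and file names are purely illustrative, not Netflix's actual layout.

    import uuid
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-data-hub"                                   # illustrative name
    prefix = "warehouse/my_table/dateint=20161129/"

    # 1. Write the updated data under a brand-new key; never overwrite an
    #    existing key, so readers are never exposed to a stale overwrite PUT.
    new_key = prefix + "part-" + uuid.uuid4().hex + ".parquet"
    s3.upload_file("local_part.parquet", bucket, new_key)

    # 2. Only after the new files are committed (e.g. partition metadata is
    #    swapped), delete the files they replace.
    old_keys = [prefix + "part-00000-old.parquet"]
    s3.delete_objects(Bucket=bucket,
                      Delete={"Objects": [{"Key": k} for k in old_keys]})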
23. Parquet
    Majority of our data is in the Parquet file format
    Supported across Hive, Pig, Presto, Spark
    Performance benefits in read
    - Column projections
    - Predicate pushdown
    - Vectorized read
24. Performance Impact
    Read
    - Penalty: throughput and latency
    - Impact depends on amount of data read
    - Improvement: I/O manager in Parquet
    Write
    - Penalty: writing to local disk before upload to S3
    - Improvement: direct write via multipart uploads
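    The direct-write improvement can be approximated with boto3's transfer manager, which streams a file-like object to S3 as a multipart upload so a complete local copy never has to land on disk first. This is only a sketch of the idea, not the Parquet write path Netflix built; names and sizes are illustrative.

    import io
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Split anything above 64 MB into 64 MB parts uploaded concurrently.
    config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                            multipart_chunksize=64 * 1024 * 1024,
                            max_concurrency=8)

    buf = io.BytesIO(b"\x00" * (200 * 1024 * 1024))   # stand-in for a job's in-memory output
    s3.upload_fileobj(buf, "my-data-hub",
                      "warehouse/my_table/part-00000.parquet",
                      Config=config)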
25. Performance Impact
    List
    - Penalty: list thousands of partitions for split calculation
    - Each partition is an S3 prefix
    - Improvement: track files instead of prefixes
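    A sketch of the difference, assuming a hypothetical per-partition manifest object written when the partition is committed (the manifest format, bucket, and keys are invented for illustration):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-hub"

    def files_by_prefix_listing(prefix):
        """Slow path: one paginated LIST call per partition prefix."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"], obj["Size"]

    def files_by_manifest(manifest_key):
        """Fast path: read one manifest object that already names every file
        (and its size) in the partition, so no LIST calls are needed."""
        body = s3.get_object(Bucket=BUCKET, Key=manifest_key)["Body"].read()
        for line in body.decode().splitlines():
            key, size = line.rsplit(",", 1)
            yield key, int(size)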
26. Performance Impact – Some Good News
    ETL jobs:
    - Mostly CPU bound, not network bound
    - Performance converges with volume and complexity
    Interactive queries:
    - Percentage impact is higher... but they run fast
    Benefits still outweigh the cost!
27. Job and Cluster Mgmt Service
28. For Users
    Should I run my job on my laptop?
    Where can I find the right version of the tools?
    Which cluster should I run my high-priority ETL job on?
    Where can I see all the jobs I ran yesterday?
29. For Admins
    How do I manage different versions of tools in different clusters?
    How can I upgrade or swap clusters with no downtime for users?
30. Genie – Job and Cluster Mgmt Service
    Users:
    - Discovery: find the right cluster to run their jobs
    - Gateway: run different kinds of jobs
    - Orchestration: the one place to find all jobs!
    Admins:
    - Config mgmt: multiple versions of multiple tools
    - Deployment: cluster swaps/updates with no downtime
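    The discovery idea boils down to matching a job's required tags against the tags each registered cluster carries. A toy sketch of that matching, with invented cluster names and tags (Genie's real REST API and data model live in the Netflix OSS project):

    # Toy illustration of tag-based cluster discovery; names and tags are invented.
    CLUSTERS = [
        {"name": "sla-etl-cluster", "tags": {"prod", "sla", "hadoop-2.7"}},
        {"name": "adhoc-cluster",   "tags": {"prod", "adhoc", "hadoop-2.7", "presto"}},
    ]

    def pick_cluster(required_tags):
        """Return the first registered cluster whose tags cover the request."""
        for cluster in CLUSTERS:
            if required_tags <= cluster["tags"]:
                return cluster["name"]
        raise LookupError("no cluster matches %r" % (required_tags,))

    print(pick_cluster({"prod", "sla"}))     # -> sla-etl-cluster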
31. [Diagram: archived job output]
32. [Diagram: job scripts & jars; tools & cluster configs]
33. Genie on Netflix OSS
34. Data Mgmt Services
35. [Diagram: metastore, Metacat, data hub]
36. Metacat
    Federated metadata service; a proxy across data sources
    [Diagram: Metacat in front of the metastore, Amazon RDS, and Amazon Redshift]
37. Metacat
    Common APIs for our applications and tools; Thrift APIs for interoperability
    Metadata discovery across data sources
    Additional business context:
    - Lifecycle policy (TTL) per table
    - Table owner, description, tags
    - User-defined custom metrics
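    Conceptually, the proxy resolves a qualified table name to whichever catalog owns it and layers the extra business context on top. A toy sketch of that routing; the name format and fields are illustrative, not Metacat's actual API.

    # Toy routing table from catalog name to backing store (illustrative only).
    CATALOGS = {
        "prodhive": "hive-metastore",
        "redshift": "redshift-catalog",
        "rds":      "rds-catalog",
    }

    def lookup_table(qualified_name):
        """Resolve 'catalog/database/table' to its backing catalog and return
        a stub record carrying the extra business context."""
        catalog, database, table = qualified_name.split("/")
        return {
            "backend": CATALOGS[catalog],
            "database": database,
            "table": table,
            "ttl_days": None,       # lifecycle policy (TTL) per table
            "owner": None,          # table owner, description, tags
            "tags": [],
        }

    print(lookup_table("prodhive/default/my_table"))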
38. Data Lifecycle Management
    Janitor tools
    - Delete ‘dangling’ data after 60 days
    - Delete data obsoleted by ‘data updates’ after 3 days
    - Delete partitions based on table TTL
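    A minimal sketch of a TTL janitor in boto3, assuming partitions live under dateint= prefixes and the TTL comes from table metadata (bucket, layout, and TTL value are illustrative):

    import datetime
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-hub"
    TABLE_PREFIX = "warehouse/my_table/"
    TTL_DAYS = 90                                  # would come from the table's metadata

    cutoff = datetime.date.today() - datetime.timedelta(days=TTL_DAYS)

    # Partitions are assumed to be first-level prefixes like 'dateint=20161129/'.
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=TABLE_PREFIX, Delimiter="/")
    for cp in resp.get("CommonPrefixes", []):
        dateint = cp["Prefix"].rstrip("/").split("=")[-1]
        if datetime.datetime.strptime(dateint, "%Y%m%d").date() < cutoff:
            # In practice the keys under this prefix would be handed to the
            # centralized deletion service rather than deleted inline.
            print("expired partition:", cp["Prefix"])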
39. Scaling Deletes from S3
40. Deletion Service
    Centralized service to handle errors, retries, and backoffs of S3 deletes
    Cool-down period: deletes take effect only after a few days
    Stores history and statistics
    Allows easy recovery based on time and tags
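    A sketch of the retry-with-backoff and cool-down ideas; the pending-delete bookkeeping, cool-down length, and error handling are all simplified for illustration.

    import time
    import datetime
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    COOL_DOWN = datetime.timedelta(days=3)

    def delete_when_due(bucket, keys, requested_at, max_attempts=5):
        """Delete a batch of keys only after the cool-down has elapsed,
        retrying failed calls with exponential backoff."""
        if datetime.datetime.utcnow() - requested_at < COOL_DOWN:
            return "still cooling down; caller should retry later"
        objects = [{"Key": k} for k in keys]
        for attempt in range(max_attempts):
            try:
                s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})
                return "deleted"
            except ClientError:
                time.sleep(2 ** attempt)          # exponential backoff
        raise RuntimeError("giving up after %d attempts" % max_attempts)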
41. Backup Strategy
42. Core S3
    Versioned buckets, 20 days
    Scale
    Simplicity
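    Versioned buckets with a bounded retention window can be configured roughly like this with boto3; the bucket name and rule ID are illustrative, and the 20 days matches the slide.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-data-hub"

    # Turn on versioning so overwrites and deletes keep the previous version around.
    s3.put_bucket_versioning(Bucket=BUCKET,
                             VersioningConfiguration={"Status": "Enabled"})

    # Expire noncurrent versions after 20 days so the backup window stays bounded.
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-old-versions",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 20},
            }]
        },
    )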
43. Above and Beyond
    Other data stores
    Heterogeneous cloud platform
    CRR (cross-region replication)
44. Data Accessibility
45. Data Tracking
46. Approach
    Tell us who you are
    User agent
    S3 access logs
    Metrics pipeline
    Charlotte
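    Tagging requests so that S3 access logs can attribute them back to a job or user can be done in boto3 by appending to the client's user-agent string; the tag format below is invented for illustration.

    import boto3
    from botocore.config import Config

    # Every request made by this client carries the extra user-agent suffix,
    # which then shows up in the S3 server access logs for attribution.
    tagged = Config(user_agent_extra="genie-job-id/1234 user/alice")   # illustrative tag
    s3 = boto3.client("s3", config=tagged)

    s3.list_objects_v2(Bucket="my-data-hub", Prefix="warehouse/my_table/", MaxKeys=1)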
47. Data Cost
48. Approach
    The calculation
    Tableau reports
    Data Doctor
49. Approach
    The calculation
    Tableau reports
    Data Doctor
    TTLs
    Future: tie to job cost and leverage SIA (S3 Standard-Infrequent Access) / Amazon Glacier?
50. Best Supporting Actor
51. Amazon Redshift
    Faster, interactive subset of data
    Some use, some don't
    Auto-sync (tag-based)
    Fast loading!
    Backups, restore, & expansion
52. Druid
    Interactive at scale
    S3 (source of truth)
    S3 for Druid deep storage
53. Tableau
    S3 (source of truth)
    Mostly extracts (vs. Direct Connect)
    Backups (multi-region)
54. Big Data Portal
55. Big Data API (aka Kragle)

    import kragle as kg

    trans_info = (kg.transport.Transporter()
                  .source('metacat://prodhive/default/my_table')
                  .target('metacat://redshift/test/demo_table')
                  .execute())
56. S3
57. Next Steps
    Add caching?
    Storage efficiency
    Partner with the S3 team to improve S3 for big data
58. Take-aways
    Amazon S3 = data hub
    Extend and improve as you go
    It takes an ecosystem
59. Thank you!
60. Remember to complete your evaluations!
