Successfully reported this slideshow.
Your SlideShare is downloading. ×

Netflix running Presto in the AWS Cloud

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 15 Ad

More Related Content

Slideshows for you (20)

Viewers also liked (16)

Advertisement

Similar to Netflix running Presto in the AWS Cloud (20)

Advertisement

Recently uploaded (20)

Netflix running Presto in the AWS Cloud

  1. 1. Netflix running Presto in the AWS Cloud Zhenxiao Luo Senior Software Engineer @ Netflix
  2. 2. Outline ● BigDataPlatform@Netflix ● Use cases & requirements ● What we did ○ Reading/Writing from/to Amazon S3 ○ Operations ○ Deployment ○ Performance ● What’s next?
  3. 3. BigDataPlatform @ Netflix
  4. 4. Use Cases ● Big Batch Jobs ○ high throughput, fault tolerant, ETL ○ data spills to disk ○ Hive on Tez, Pig on Tez ● Adhoc Queries ○ low latency, interactive, data exploration ○ in-memory, but limited data size ○ Impala, Redshift, Spark, Presto
  5. 5. Netflix Requirement ● SQL like Language ● Low latency for adhoc queries ● Work well on AWS cloud ● Good integration with Hadoop stack ● Scale to 1000+ node cluster ● Open source with community support
  6. 6. What did Netflix do?
  7. 7. Reading/Writing to/from S3 ● Option 1: Apache Hadoop NativeS3FileSysyem ● Option 2: PrestoS3FileSystem ○ retry logic for read timeout ○ write directly to final S3 path ● Option 3: emrFileSystem ○ disable hadoop logging ○ disable hadoop FileSystem cache
  8. 8. Bug Fixes ● https://github. com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d4 6e ● https://github. com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795ef b86 ● https://github.com/facebook/presto/pull/1147 ● https://github.com/facebook/presto/pull/1300 ● https://github.com/facebook/presto/issues/1285 ● https://github.com/facebook/presto/issues/1264
  9. 9. Our Operations Environment ● Launch script on top of EMR ● Ganglia integration ● Usage graphs - concurrent queries & tasks
  10. 10. Current Deployment ● Presto in Production @ Netflix ● 100+ nodes Presto Cluster ● 1000+ queries running per day ● Presto query against the same Petabyte Scale S3 Data Warehouse as Hive and Pig
  11. 11. Observed Performance @ Netflix ● Data in Sequence File Format ● One MapReduce Job SmallTableScan ○ MapReduce overhead dominates the query execution time ○ Presto is always ~10X faster than Hive ● One MapReduce Job BigTableScan ○ MapReduce overhead is marginal compared with big table scan time ○ Presto performs similar to Hive ● Multiple MapReduce Aggregation ○ Presto is always > 10X faster than Hive ● Joins ○ Presto is always > 2X faster than Hive
  12. 12. What we are working on ● Support Parquet File Format ○ https://github.com/facebook/presto/pull/1147 ○ Parquet performs similar to Sequence, but not as fast as RCFile ● ODBC/JDBC driver for Presto ○ Support Microstrategy running on Presto
  13. 13. Some inconveniences ... ● Support Server Side “Use Schema” ○ Workaround: Client Side “Use Schema” Or “Schema.Table” ● Recurse the partition directory ○ Different behavior with Hive ● Metadata caching ○ have to rerun the query a number of times to see the metadata change ● Extend JSON extract functions to allow . notation ○ json_extract_scalar(mapColumn, '$.namePart1.namePart2') ○ Workaround: regexp_extract ● WebUI running slow ○ load query task info on demand
  14. 14. Features we would like ● Big table join ● User Defined Functions ● Break down one column value into several tuples ○ In Hive: lateral view explode json_tuple ● Decimal type ● Scheduler ● Writes ○ Insert overwrite ○ Alter table add partition ○ Parallel writes from workers (not client only)
  15. 15. Q & A Thank you!

×