Powering Interactive Data Analysis at Pinterest by Amazon Redshift

In the last six months, we have set up Amazon Redshift to power our interactive data analysis at Pinterest. It has tremendously improved the speed of analyzing our data.

Published in: Technology

    1. 1. Powering interactive data analysis by Amazon Redshift
        Jie Li, Data Infra at Pinterest
    2. 2. Pinterest: a place to get inspired and plan for the future
    3. 3. Data Infra at Pinterest
        [Diagram] Production data pipeline on Amazon Web Services: Kafka, Pinball (*), Hive, S3, MySQL, HBase, Redis, Cascading, Hadoop. Outputs: ad-hoc data analysis, MySQL, analytics dashboard.
        * Pinball is our own workflow manager that we plan to open source.
    4. 4. We need a low latency data warehouse!
        [Diagram] The same pipeline (Kafka, Pinball, Hive, S3, MySQL, HBase, Redis, Cascading, Hadoop, on Amazon Web Services) with two callouts:
        • High latency! (ad-hoc data analysis)
        • MySQL is not a viable data warehouse. (analytics dashboard)
    5. 5. Low-latency data warehouse
        • SQL on Hadoop
            – Shark, Impala, Drill, Tez, Presto, …
            – Open source and free
            – Immature?
        • Massively Parallel Processing (MPP)
            – Asterdata, Vertica, ParAccel, …
            – Built on mature technologies like Postgres
            – Expensive and only available on-premise
        • Amazon Redshift
            – ParAccel on AWS
            – Mature but also cost-effective
    6. 6. Highlights of Redshift
        High cost efficiency
        • On-demand: $0.85 per hour
        • 3-year reserved instances: $999/TB/year
        • Free snapshots on S3
    7. 7. Highlights of Redshift
        Low maintenance overhead
        • Fully self-managed
        • Automated maintenance & upgrade
        • Built-in admin dashboard
    8. 8. Highlights of Redshift
        Superior performance: 25-100x over Hive
        • Columnar layout
        • Index
        • Advanced optimizer
        • Efficient execution
        [Chart] Query time in seconds (axis from 0 to 6000) for queries Q1 to Q4, Hive vs. Redshift.
        Note: based on our own dataset and queries.
    9. 9. Cool, but how do we integrate Redshift with Hive/Hadoop?
    10. 10. First, get data from Hive into Redshift
        [Diagram] Unstructured, unclean data → extract & transform (Hive) → structured, clean data (S3) → load → columnar, compact, compressed data (Redshift)
        Hadoop/Hive is perfect for heavy-lifting ETL workloads.
    11. 11. Building ETL from Hive to Redshift
        • Schematizing Hive tables
            – What worked: writing column-mapping scripts to generate ETL queries
            – What didn’t work: N/A
        • Cleaning data
            – What worked: filtering out non-ASCII characters
            – What didn’t work: loading all characters
        • Loading big tables with sortkey
            – What worked: sorting externally in Hadoop/Hive and loading in chunks
            – What didn’t work: loading unordered data directly
        • Loading time-series tables
            – What worked: appending to the table in the order of time (sortkey)
            – What didn’t work: a table per day connected with a view (performed poorly)
        • Table retention
            – What worked: insert into a new table
            – What didn’t work: delete and vacuum (poor performance)
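
The slide mentions column-mapping scripts and non-ASCII filtering without showing them. The following is only a minimal sketch of what such a script could look like, not the scripts used at Pinterest: the type mapping, the `VARCHAR(256)` width, the function names, and the `pins` example table are all assumptions made for illustration.

```python
# Hypothetical Hive-to-Redshift type mapping; extend it for your own schemas.
HIVE_TO_REDSHIFT = {
    "string": "VARCHAR(256)",  # assumed width, not from the slides
    "bigint": "BIGINT",
    "int": "INTEGER",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
}


def redshift_ddl(table, hive_columns, sortkey):
    """Build a CREATE TABLE statement from (name, hive_type) pairs."""
    cols = ", ".join(
        f"{name} {HIVE_TO_REDSHIFT[hive_type]}" for name, hive_type in hive_columns
    )
    return f"CREATE TABLE {table} ({cols}) SORTKEY ({sortkey});"


def strip_non_ascii(text):
    """Drop every character outside the ASCII range before loading."""
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    # Hypothetical table with a time-based sortkey, as the slide recommends.
    print(redshift_ddl("pins", [("pin_id", "bigint"), ("dt", "string")], "dt"))
    print(strip_non_ascii("caf\u00e9"))
```

Generating the DDL from one mapping keeps the Hive and Redshift schemas in step; dropping non-ASCII characters loses information, which is the trade-off the slide accepts.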
    12. 12. But it’s just the beginning. Make sure you audit the ETL from Day 1
    13. 13. Audit ETL for Data Consistency
        Everything was good until one day we noticed that one table was only half of its expected size.
        S3 is only eventually consistent (EC)!
        Solutions:
        • Audit both steps of Hive → S3 → Redshift (audits ① and ②).
        • Also reduce the number of files on S3 to alleviate EC.
        • Also, a recent feature lets you specify a manifest for the files on S3.
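
An audit of this kind can be as simple as comparing row counts on both sides of a step. The sketch below assumes the counts have already been fetched (for example with `SELECT COUNT(*)` in Hive and in Redshift); the function names and the `tolerance` parameter are hypothetical. It also builds a manifest in the JSON layout that Redshift's `COPY` accepts, where `"mandatory": true` makes the load fail if a listed file is not yet visible.

```python
import json


def audit_row_counts(source_count, target_count, tolerance=0.0):
    """Return True when the target row count matches the source count.

    tolerance is the allowed relative difference (0.0 means an exact match).
    """
    if source_count == 0:
        return target_count == 0
    return abs(source_count - target_count) / source_count <= tolerance


def build_manifest(s3_urls):
    """Build a COPY manifest that marks every listed file as mandatory."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


if __name__ == "__main__":
    # A table that arrived at half its size fails the audit.
    print(audit_row_counts(1_000_000, 500_000))
    # Hypothetical bucket and file names.
    print(build_manifest(["s3://example-bucket/pins/part-0000.gz"]))
```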
    14. 14. Now we got the data. Is it ready for superior performance?
    15. 15. Understand the performance
        [Diagram] A leader node holding the system stats, above three compute nodes.
        ① Understand the query execution plan (via “explain”). Always update system stats after data loading by running “analyze”.
        ② Optimize the data layout by choosing consistent distkeys across tables, and always choose a sortkey. Watch out for a bad distkey with skew (e.g. a distkey with null values).
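
To make the skew warning concrete, here is a small sketch that measures how uneven a distribution is, given the number of rows that landed on each slice. How those counts are obtained is left out, and `skew_ratio` and `post_load_statements` are hypothetical helper names. The second helper only formats the `ANALYZE` and `EXPLAIN` statements the slide recommends running.

```python
def skew_ratio(rows_per_slice):
    """Ratio of the fullest slice to the average slice (1.0 means even)."""
    average = sum(rows_per_slice) / len(rows_per_slice)
    if average == 0:
        return 1.0  # an empty table has no skew
    return max(rows_per_slice) / average


def post_load_statements(table, query):
    """Statements to run after a load: refresh stats, then inspect the plan."""
    return [f"ANALYZE {table};", f"EXPLAIN {query}"]


if __name__ == "__main__":
    # If rows with a NULL distkey all land on one slice, that slice
    # ends up far larger than the others.
    print(skew_ratio([1000, 1000, 1000, 5000]))
    print(post_load_statements("pins", "SELECT pin_id FROM pins WHERE dt = '2014-01-01';"))
```

A ratio well above 1.0 suggests one compute node is doing most of the work for that table, which is the situation the slide warns about.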
    16. 16. What if a query took long?
        It’s worth doing your own homework. Filing tickets doesn’t work well for perf issues:
        • Requires a lot of information exchange
        • May be caused by minor issues
        Case: we optimized a query from 3 hours to 7 seconds after studying the query plan and fixing the system stats (the broadcast join regarded the larger table as the smaller one).
    17. 17. Educate users with best practices
        • Select only the columns you need
            – Redshift is a columnar database and scans only the columns you need, which speeds things up. “SELECT *” is usually bad.
        • Use the sortkey (dt or created_at)
            – Using the sortkey can skip unnecessary data. Most of our tables use dt or created_at as the sortkey.
        • Avoid slow data transfers
            – Transferring a large query result from Redshift to the local client may be slow. Try saving the result as a Redshift table or using the command line client on EC2.
        • Apply selective filters before join
            – A join can be significantly faster if we filter out as much irrelevant data as possible first.
        • Run one query at a time
            – The performance gets diluted with more queries. So be patient.
        • Understand the query plan with EXPLAIN
            – EXPLAIN gives you an idea of why a query may be slow. For advanced users only.
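
Two of these practices are easy to check mechanically. The sketch below is a deliberately naive, regex-based checker, not a SQL parser and not a tool described in the talk. The function name `lint_query` is hypothetical; the default sortkey names come from the slide.

```python
import re


def lint_query(sql, sortkeys=("dt", "created_at")):
    """Return best-practice warnings for a query (naive text matching)."""
    warnings = []
    if re.search(r"select\s+\*", sql, flags=re.IGNORECASE):
        warnings.append("Avoid SELECT *; list only the columns you need.")
    # Look for a sortkey column anywhere after the first WHERE keyword.
    where = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    where_clause = where.group(1).lower() if where else ""
    if not any(re.search(rf"\b{re.escape(key)}\b", where_clause) for key in sortkeys):
        warnings.append("Filter on the sortkey to skip unnecessary data.")
    return warnings


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(lint_query("SELECT * FROM pins"))
    print(lint_query("SELECT pin_id FROM pins WHERE dt = '2014-01-01'"))
```

Text matching will misjudge subqueries and comments, so a checker like this is better suited to nudging users than to blocking queries.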
    18. 18. Hopefully users will follow the best practices. But Redshift is a shared service: one query may slow down the whole cluster.
    19. 19. Proactive monitoring
        System tables (e.g. stl_query) make it easy to write scripts for:
        • Real-time monitoring of slow queries
            – Ping users with best practices
            – Send alerts to admins
        • Analyzing patterns
            – Who needs help
            – Who was “abusing”
        Hint: manually back up these system tables, as they are cleaned up weekly.
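
A monitoring script of this kind might look roughly like the sketch below. It assumes rows have already been fetched from `stl_query` into dictionaries keyed by that table's `userid`, `query`, `starttime`, and `endtime` columns. The database connection is omitted, and the 30-minute default threshold and the function names are illustrative choices rather than details from the talk.

```python
from collections import Counter
from datetime import datetime, timedelta

# Query text for reference; running it needs a Redshift connection (not shown).
SLOW_QUERY_SQL = "SELECT userid, query, starttime, endtime FROM stl_query;"


def find_slow_queries(rows, threshold=timedelta(minutes=30)):
    """Return the rows whose run time is at least the threshold."""
    return [row for row in rows if row["endtime"] - row["starttime"] >= threshold]


def slow_queries_per_user(rows, threshold=timedelta(minutes=30)):
    """Count slow queries per user id to see who may need help."""
    return Counter(row["userid"] for row in find_slow_queries(rows, threshold))


if __name__ == "__main__":
    # Hypothetical sample rows.
    rows = [
        {"userid": 101, "query": 1,
         "starttime": datetime(2014, 1, 1, 9, 0), "endtime": datetime(2014, 1, 1, 9, 45)},
        {"userid": 102, "query": 2,
         "starttime": datetime(2014, 1, 1, 9, 0), "endtime": datetime(2014, 1, 1, 9, 1)},
    ]
    for row in find_slow_queries(rows):
        print("slow query", row["query"], "from user", row["userid"])
```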
    20. 20. Optimizing workload management
        • Run heavy ETL at night
            – ETL is resource intensive
            – No easy way to limit the resource usage (IO/CPU)
        • Time out user queries during peak hours
            – Long queries (>= 30 mins) likely have mistakes
            – Sacrifice a few users for the majority
        • Unlike Hadoop, there is no preemption in Redshift
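
One possible way to apply a peak-hour timeout is Redshift's `statement_timeout` setting, which takes a value in milliseconds (0 disables the limit). The talk does not say which mechanism was used, so treat the sketch below as an assumption: the peak window of 09:00 to 18:00 and the helper name are hypothetical, and only the 30-minute limit comes from the slide.

```python
def statement_timeout_sql(hour, peak_hours=range(9, 18), limit_minutes=30):
    """Return a SET statement that caps query time during peak hours only."""
    if hour in peak_hours:
        return f"SET statement_timeout TO {limit_minutes * 60 * 1000};"
    return "SET statement_timeout TO 0;"  # 0 means no timeout


if __name__ == "__main__":
    print(statement_timeout_sql(10))  # inside the assumed peak window
    print(statement_timeout_sql(2))   # night time, when heavy ETL runs
```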
    21. 21. Current status of Redshift at Pinterest
        • 16-node 256TB cluster with 100TB+ core data
        • Ingesting 1.5TB of data per day with retention
        • 30+ daily users
        • 500+ ad-hoc queries per day
            – 75% <= 35 seconds, 90% <= 2 minutes
        • Operational effort <= 5 hours/week
    22. 22. Redshift integrated at Pinterest
        [Diagram] Production data pipeline on Amazon Web Services (Pinball, Hive, Kafka, Cascading, Hadoop, S3, MySQL, HBase, Redis), now with Redshift added next to ad-hoc data analysis, MySQL, and the analytics dashboard.
    23. 23. Next step
        • Next generation of analytics dashboards
            – Replace offline MySQL with Redshift
            – Replace custom dashboards with Tableau
        [Diagram] The same pipeline on Amazon Web Services, with Redshift and Tableau in place of the offline MySQL and the custom analytics dashboard.
    24. 24. Remaining risks
        • SLA for low latency queries
            – Due to the lack of preemption, Redshift cannot guarantee that mission-critical queries finish fast
        • High availability
            – Takes hours to restore clusters from snapshots
            – May need a standby cluster in the future
    25. 25. Questions?
        • Quora: http://qr.ae/TwRJf
        • Twitter: @jay23jack