Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Jan. 22, 2014
Over the last six months, we have set up Amazon Redshift to power interactive data analysis at Pinterest. It has tremendously improved the speed of our data analysis.
3. Data Infra at Pinterest
[Architecture diagram: on Amazon Web Services, the production data pipeline and ad-hoc data analysis run through Kafka, Pinball (*), Hive, Cascading, and Hadoop over S3, MySQL, HBase, and Redis; an offline MySQL database backs the Analytics Dashboard.]
* Pinball is our own workflow manager that we plan to open source.
4. We need a low-latency data warehouse!
[Same architecture diagram: Kafka, Pinball, Hive, Cascading, and Hadoop over S3, MySQL, HBase, and Redis on AWS. Ad-hoc data analysis against Hadoop/Hive is high latency, and MySQL is not a viable data warehouse for the Analytics Dashboard.]
5. Low-latency data warehouse
• SQL on Hadoop
– Shark, Impala, Drill, Tez, Presto, …
– Open source and free
– Immature?
• Massively Parallel Processing (MPP)
– Aster Data, Vertica, ParAccel, …
– Built on mature technologies like Postgres
– Expensive and available only on-premises
• Amazon Redshift
– ParAccel on AWS
– Mature but also cost-effective
6. Highlights of Redshift
High cost efficiency
• On-demand: $0.85 per hour
• 3-year reserved instances: $999/TB/year
• Free snapshots on S3
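As a rough comparison (assuming 2TB of storage per node, as on the HS1 node type, which is an assumption here): $0.85/hour comes to about $7,400 per node per year on-demand, i.e. roughly $3,700/TB/year, so the 3-year reservation at $999/TB/year is close to a 4x saving.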
8. Highlights of Redshift
Superior performance: 25-100x over Hive
• Columnar layout
• Index
• Advanced optimizer
• Efficient execution
[Bar chart: latency in seconds (0-6000) of queries Q1-Q4, Hive vs. Redshift.]
Note: based on our own dataset and queries.
10. First, get data from Hive into Redshift
[Flow: unstructured, unclean data → Extract & Transform in Hive → structured, clean files on S3 → Load → Redshift (columnar, compact, compressed).]
Hadoop/Hive is perfect for heavy-lifting ETL workloads.
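A minimal sketch of this hop, assuming a hypothetical raw_events table in Hive, an events table in Redshift, and illustrative S3 paths and credentials:

```sql
-- In Hive: extract, transform, and write clean delimited files to S3
-- (table, columns, and S3 path are assumptions)
INSERT OVERWRITE DIRECTORY 's3://my-bucket/etl/events/dt=2014-01-21/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT user_id, pin_id, dt
FROM raw_events
WHERE dt = '2014-01-21';

-- In Redshift: bulk-load the files with COPY
COPY events
FROM 's3://my-bucket/etl/events/dt=2014-01-21/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '\t';
```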
11. Building ETL from Hive to Redshift
Task | What worked | What didn't work
Schematizing Hive tables | Writing column-mapping scripts to generate ETL queries | N/A
Cleaning data | Filtering out non-ASCII characters | Loading all characters
Loading big tables with a sortkey | Sorting externally in Hadoop/Hive and loading in chunks | Loading unordered data directly
Loading time-series tables | Appending to the table in the order of time (sortkey) | A table per day stitched together with a view (performed poorly)
Table retention | Inserting into a new table (see the sketch below) | Delete and vacuum (poor performance)
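A minimal sketch of the "insert into a new table" retention pattern from the last row, with hypothetical table and column names and an assumed retention window:

```sql
-- Retention by rebuilding the table, instead of DELETE + VACUUM
CREATE TABLE events_new (LIKE events);    -- inherits distkey and sortkey
INSERT INTO events_new
SELECT * FROM events
WHERE dt >= '2013-10-22';                 -- retention window is an assumption
DROP TABLE events;
ALTER TABLE events_new RENAME TO events;
```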
12. But it’s just the beginning.
Make sure you audit the ETL from Day 1
13. Audit ETL for Data Consistency
Everything was good until one day we noticed one table was only half its size.
S3 is only eventually consistent (EC)!
[Diagram: Hive → S3 → Redshift, with an audit step at each hop.]
Solutions:
① Audit the data at every step of the ETL.
② Reduce the number of files on S3 to alleviate EC.
Also, there is a recent feature to specify a manifest for the files on S3.
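A minimal sketch of both solutions, with hypothetical table names and paths:

```sql
-- Audit: compare the row count loaded into Redshift against the count
-- reported by the Hive job (table and column names are assumptions)
SELECT COUNT(*) FROM events WHERE dt = '2014-01-21';

-- Manifest: make COPY read an explicit file list rather than whatever
-- an S3 prefix listing happens to return under eventual consistency
COPY events
FROM 's3://my-bucket/manifests/events-2014-01-21.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;
```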
14. Now we've got the data.
Is it ready for superior performance?
15. Understand the performance
[Diagram: a leader node holding system stats dispatches queries to the compute nodes.]
① Understand the query execution plan (via EXPLAIN). Always update system stats after data loading by running ANALYZE.
② Optimize the data layout by choosing consistent distkeys across tables, and always choose a sortkey. Watch out for a bad distkey with skew (e.g. a distkey with null values).
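A minimal sketch of points ① and ②, with a hypothetical pins table and columns:

```sql
-- A consistent distkey (user_id) across joined tables avoids redistribution;
-- a date sortkey (dt) lets range scans skip blocks
CREATE TABLE pins (
  user_id BIGINT,
  pin_id  BIGINT,
  dt      DATE
)
DISTKEY (user_id)
SORTKEY (dt);

ANALYZE pins;   -- refresh system stats after every load
EXPLAIN         -- inspect the plan before running an expensive query
SELECT COUNT(*) FROM pins WHERE dt = '2014-01-21';
```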
16. What if a query takes too long?
It's worth doing your own homework.
Filing tickets doesn't work well for perf issues:
• It requires a lot of information exchange
• It may be caused by minor issues
Case: we optimized a query from 3 hours to 7 seconds after studying the query plan and fixing the system stats (a broadcast join had treated the larger table as the smaller one).
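A hypothetical sketch of that kind of diagnosis (table and column names are assumptions, not the actual query from the case):

```sql
-- The plan showed a DS_BCAST_INNER step broadcasting the larger table;
-- refreshing the stats lets the planner broadcast the smaller side instead
EXPLAIN
SELECT u.user_id, COUNT(*)
FROM pins p
JOIN users u ON p.user_id = u.user_id
GROUP BY u.user_id;

ANALYZE pins;
ANALYZE users;
```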
17. Educate users with best practices
Best Practice | Details
Select only the columns you need | Redshift is a columnar database and scans only the columns you ask for. "SELECT *" is usually bad.
Use the sortkey (dt or created_at) | Filtering on the sortkey can skip unnecessary data. Most of our tables use dt or created_at as the sortkey.
Avoid slow data transfers | Transferring a large query result from Redshift to a local client may be slow. Try saving the result as a Redshift table or using a command-line client on EC2.
Apply selective filters before joins | A join can be significantly faster if we filter out irrelevant data as much as possible.
Run one query at a time | Performance is diluted by concurrent queries, so be patient.
Understand the query plan via EXPLAIN | EXPLAIN gives you an idea of why a query may be slow. For advanced users only.
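A minimal before/after sketch of the first two practices, reusing the hypothetical pins table:

```sql
-- Bad: reads every column of every block
SELECT * FROM pins;

-- Better: column pruning plus a selective sortkey (dt) filter
SELECT user_id, COUNT(*)
FROM pins
WHERE dt BETWEEN '2014-01-01' AND '2014-01-21'
GROUP BY user_id;
```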
18. Hopefully users will follow the best practices.
But Redshift is a shared service:
one query may slow down the whole cluster.
19. Proactive monitoring
It's easy to write scripts against the system tables (e.g. stl_query) for:
• Real-time monitoring of slow queries
– Ping users with best practices
– Send alerts to the admins
• Analyzing query patterns
– Who needs help
– Who was "abusing" the cluster
Hint: manually back up these system tables, as they are cleaned up weekly.
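A minimal sketch of such a script's core query against stl_query (the 60-second threshold is an assumption):

```sql
-- Find yesterday's slowest queries and who ran them
SELECT userid,
       query,
       TRIM(querytxt)                        AS sql_text,
       DATEDIFF(seconds, starttime, endtime) AS duration_s
FROM stl_query
WHERE starttime >= DATEADD(day, -1, GETDATE())
  AND DATEDIFF(seconds, starttime, endtime) > 60
ORDER BY duration_s DESC
LIMIT 20;
```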
20. Optimizing workload management
• Run heavy ETL at night
– ETL is resource intensive
– There is no easy way to limit its resource usage (IO/CPU)
• Time out user queries during peak hours
– Long queries (>= 30 mins) likely contain mistakes
– Sacrifice a few users for the majority
• Unlike Hadoop, Redshift has no preemption
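One way a 30-minute cap could be enforced is per session via statement_timeout; applying it this way is an assumption about the implementation:

```sql
-- Cap this session's queries at 30 minutes (statement_timeout is in ms)
SET statement_timeout TO 1800000;
```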
21. Current status of Redshift at Pinterest
• 16-node, 256TB cluster with 100TB+ of core data
• Ingesting 1.5TB of data per day, with retention
• 30+ daily users
• 500+ ad-hoc queries per day
– 75% <= 35 seconds, 90% <= 2 minutes
• Operational effort <= 5 hours/week
22. Redshift integrated at Pinterest
[Architecture diagram: Kafka, Pinball, Hive, Cascading, and Hadoop over S3, MySQL, HBase, and Redis on AWS; the production data pipeline now loads Redshift, which serves ad-hoc data analysis, while an offline MySQL database still backs the Analytics Dashboard.]
23. Next step
• Next generation of analytics dashboards
– Replace offline MySQL with Redshift
– Replace custom dashboards with Tableau
[Architecture diagram: the same pipeline (Kafka, Pinball, Hive, Cascading, and Hadoop over S3, MySQL, HBase, and Redis on AWS), with Redshift feeding both ad-hoc analysis and Tableau dashboards.]
24. Remaining risks
• SLA for low-latency queries
– Due to the lack of preemption, we cannot guarantee that mission-critical queries finish quickly
• High availability
– It takes hours to restore a cluster from snapshots
– We may need a standby cluster in the future