This talk goes through Metail's experiences with Redshift and highlights some of the design decisions we made and why. Particularly where we tripped ourselves up by applying to the logic that Redshift is a PosgreSQL fork too strongly.
2. 2
Metail lets you try on clothes online
Discover clothes on
your body shape
Create, save outfits
and share
Shop with confidence
of size and fit
3. 3
Proven impact as validated by
American business schools and A/B tests
‘‘
…customers who had access to the fitting tool are more likely to come
back to the site, and this effect is statistically significant… ‘‘
…shows approximately a 5.1 percent reduction in returns
compared to the control group…In other words, providing fit information
reduces average fulfilment costs”
…sales for users with access to the tool were substantially higher overall - 22.32 percent larger
‘‘
Source: “The Value of Fit Information in Online Retail: Evidence from a
Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College -
Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21,
2015
DATA
1000+ GARMENTS
POINTS3M
4. 4
Architecture
Comparing with a more modern flow:
http://tech.metail.com/elastic-mapreduce-metail-aws-loft-london/
User DB
DynamoDB
6. 6
Creating the Cluster
• Compute capacity vs storage capacity
– Tight coupling of compute and storage
– We load everything into Redshift so far >3TB of data
– At 1GB per day compute cluster last 10 sprints, at 30GB per day not so long :S
– Six node dc1.8xlarge cluster costs $957.60 per week on-demand pricing
8. 8
It’s Postgres Like – Connecting to the Server
• Postgres like system meant from day one there were
mature tools and stack overflow help
• Redshift ecosystem now more mature and
optimised tooling and help exists
• Redshift JDBC/ODBC is now recommended over
PostgresSQL driver
9. 9
It’s Postgres Like – My First Query
WITH order_events AS (
SELECT collector_tstamp, event_id, ue_properties
FROM events
WHERE collector_tstamp >= '2015-09-20' AND collector_tstamp < '2015-10-02‘ AND event = 'unstruct'
AND JSON_EXTRACT_PATH_TEXT(ue_properties,'data','data','name') = 'Order'),
in_orders AS (
SELECT DATE(collector_tstamp) AS order_date,
COUNT(event_id) AS orders,
COUNT(DISTINCT event_id) AS orders_distinct
FROM order_events
WHERE ue_properties ILIKE '%"bin":"in",%'
GROUP BY DATE(collector_tstamp) ORDER BY DATE (collector_tstamp)),
out_orders AS (
SELECT DATE(collector_tstamp) AS order_date,
COUNT(event_id) AS orders,
COUNT(DISTINCT event_id) AS orders_distinct
FROM order_events
WHERE ue_properties ILIKE '%"bin":"out",%'
GROUP BY DATE(collector_tstamp)
ORDER BY DATE(collector_tstamp))
SELECT bin_in.order_date,
bin_in.orders AS bin_in_orders,
bin_out.orders AS bin_out_orders
FROM in_orders AS bin_in
INNER JOIN out_orders AS bin_out ON bin_in.order_date = bin_out.order_date
ORDER BY bin_in.order_date;
10. 10
Not so Postgres Like – Schema Design
• For day-to-day querying even power users
won’t notice the difference
• For the schema designers the differences
matter and will bite you from the start
• Redshift = columnar; Postgres = row; Very
different optimisation considerations
11. 11
Summary
• Redshift gives you all the usual AWS goodies
• Day-to-day you don’t care that Redshift is
Postgres like
• When designing the schema forget about row
databases, experiment with columnar stores