Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

93 views

Published on

This talk goes through Metail's experiences with Redshift and highlights some of the design decisions we made and why. Particularly where we tripped ourselves up by applying to the logic that Redshift is a PosgreSQL fork too strongly.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

  1. 1. 1 Metail's Redshift Experience and Why the 'Like' in 'Postgres Like' Is Important Gareth Rogers, Data Engineer
  2. 2. 2 Metail lets you try on clothes online Discover clothes on your body shape Create, save outfits and share Shop with confidence of size and fit
  3. 3. 3 Proven impact as validated by American business schools and A/B tests ‘‘ …customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant… ‘‘ …shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs” …sales for users with access to the tool were substantially higher overall - 22.32 percent larger ‘‘ Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015 DATA 1000+ GARMENTS POINTS3M
  4. 4. 4 Architecture Comparing with a more modern flow: http://tech.metail.com/elastic-mapreduce-metail-aws-loft-london/ User DB DynamoDB
  5. 5. 5 Creating the Cluster
  6. 6. 6 Creating the Cluster • Compute capacity vs storage capacity – Tight coupling of compute and storage – We load everything into Redshift so far >3TB of data – At 1GB per day compute cluster last 10 sprints, at 30GB per day not so long :S – Six node dc1.8xlarge cluster costs $957.60 per week on-demand pricing
  7. 7. 7 Creating the Cluster
  8. 8. 8 It’s Postgres Like – Connecting to the Server • Postgres like system meant from day one there were mature tools and stack overflow help • Redshift ecosystem now more mature and optimised tooling and help exists • Redshift JDBC/ODBC is now recommended over PostgresSQL driver
  9. 9. 9 It’s Postgres Like – My First Query WITH order_events AS ( SELECT collector_tstamp, event_id, ue_properties FROM events WHERE collector_tstamp >= '2015-09-20' AND collector_tstamp < '2015-10-02‘ AND event = 'unstruct' AND JSON_EXTRACT_PATH_TEXT(ue_properties,'data','data','name') = 'Order'), in_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%"bin":"in",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE (collector_tstamp)), out_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%"bin":"out",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE(collector_tstamp)) SELECT bin_in.order_date, bin_in.orders AS bin_in_orders, bin_out.orders AS bin_out_orders FROM in_orders AS bin_in INNER JOIN out_orders AS bin_out ON bin_in.order_date = bin_out.order_date ORDER BY bin_in.order_date;
  10. 10. 10 Not so Postgres Like – Schema Design • For day-to-day querying even power users won’t notice the difference • For the schema designers the differences matter and will bite you from the start • Redshift = columnar; Postgres = row; Very different optimisation considerations
  11. 11. 11 Summary • Redshift gives you all the usual AWS goodies • Day-to-day you don’t care that Redshift is Postgres like • When designing the schema forget about row databases, experiment with columnar stores

×