@metail #VoxxedBristol
Putting the Spark into
Functional Fashion Tech
Analytics
Gareth Rogers
Metail
2
● Who are Metail and what do we do
● Our data pipeline
● What is Apache Spark and our experiences
● Live coding demonstration of Spark?
Introduction
3
Metail
Making clothing fit for all
4
● Create your MeModel with just a few clicks
● See how the clothes look on you
● Primarily clickstream analysis
● Understanding the user journey
http://trymetail.com
The Metail Experience
5
Composed Photography With Metail
[Workflow diagram: shoot model once · choose poses · style & restyle · compositing · shoot clothes at source]
● Understanding process flow
● Discovering bottlenecks
● Optimising the workflow
● Understanding costs
6
• Batch pipeline modelled on a lambda architecture
• Transformed by Clojure, Spark in Clojure or SQL
• Managed by applications written in Clojure
• Built on and using Amazon Web Services (AWS)
Metail’s Data Pipeline
7
Metail’s Data Pipeline
8
Metail’s Data Pipeline
To analytics dashboards
https://looker.com/
9
● Our pipeline is heavily influenced by functional programming paradigms
● Declarative
● Immutable data structures
● Pure functions: the result depends only on the input state
● Minimise side effects
Functional Pipeline
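To make those properties concrete, here is a minimal sketch of a pure function over immutable data. It is illustrative only (the pipeline itself is written in Clojure), and the names are hypothetical:

```python
# A pure function: its result depends only on its inputs,
# and it never mutates shared state.
def sessionise(events):
    """Return a new, sorted tuple of events; the input is untouched."""
    return tuple(sorted(events, key=lambda e: e["timestamp"]))

raw = [{"timestamp": 2, "page": "b"}, {"timestamp": 1, "page": "a"}]
ordered = sessionise(raw)

# The original list is unchanged (immutability by convention),
# and calling the function again gives the same answer.
assert raw[0]["page"] == "b"
assert sessionise(raw) == ordered
```

Because the function has no side effects, it can be re-run, cached, or distributed freely, which is exactly what a batch pipeline relies on.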
10
• Clojure is predominantly a functional programming language
• Aims to be approachable
• Allows interactive development
• A Lisp dialect
• Dynamically typed
Clojure
11
• Spark is a general purpose distributed data processing engine
– Can scale from processing a single line to terabytes of data
• Functional paradigm
– Declarative
– Functions are first class
– Immutable datasets
• Written in Scala, a JVM-based language
– Just like Clojure
– Has a Java API
– Clojure has great Java interop
Apache Spark
12
• Operates on a Directed Acyclic Graph (DAG)
– Links together data processing steps
– A step may depend on multiple parent steps
– Transforms cannot return to an older partition
Apache Spark
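The DAG is built lazily: transformations only record a step, and nothing runs until a terminal action consumes the chain. A rough Python analogy, with generators standing in for Spark transformations (a sketch, not the Spark API):

```python
# Lazy pipeline sketch: like Spark transformations, each step only
# records work; nothing executes until an "action" consumes it.
events = range(10)                               # source "dataset"
doubled = (x * 2 for x in events)                # transformation 1 (lazy)
over_five = (x for x in doubled if x > 5)        # transformation 2 (lazy)

# The "action": forcing the chain runs the whole graph in one pass.
result = list(over_five)
# result == [6, 8, 10, 12, 14, 16, 18]
```

Deferring execution this way lets an engine see the whole graph before running it, which is what allows Spark to plan and optimise the job.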
13
Apache Spark
https://databricks.com/spark/about
14
• Consists of a driver and multiple workers
• Declare a Spark session and build a job graph
• Driver coordinates with a cluster to execute the graph on the workers
Apache Spark
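The shape of that driver/worker split can be sketched with a thread pool standing in for a cluster — a loose analogy, not Spark itself:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = list(range(8))

# The "driver" (this script) hands units of work to "workers"
# (threads here; separate machines in a real cluster) and
# collects the results back in order.
with ThreadPoolExecutor(max_workers=3) as workers:
    result = list(workers.map(square, data))
```

The key point carries over: the driver never touches the data elements itself; it only describes the work and gathers the answers.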
15
• Operates on in-memory datasets
– Where possible; spills to disk when data does not fit
• Based on Resilient Distributed Datasets (RDDs)
• Each RDD represents one dataset
• Split into multiple partitions distributed across the cluster
Apache Spark
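A toy sketch of the partitioning idea: one logical dataset is split into chunks that are each processed independently, which is the shape of an RDD map across a cluster. The helper names here are hypothetical, not Spark's API:

```python
def partition(data, n):
    """Round-robin split of a sequence into n partitions."""
    parts = [[] for _ in range(n)]
    for i, item in enumerate(data):
        parts[i % n].append(item)
    return parts

def map_partitions(parts, fn):
    """Apply fn to every element of every partition independently."""
    return [[fn(x) for x in part] for part in parts]

data = list(range(8))
parts = partition(data, 3)            # [[0, 3, 6], [1, 4, 7], [2, 5]]
squared = map_partitions(parts, lambda x: x * x)

# Collecting: flatten the partitions back into one result.
flat = sorted(x for part in squared for x in part)
```

Because the per-partition work is independent, each partition can live on a different worker and fail or be recomputed on its own — the resilience in "Resilient Distributed Dataset".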
16
Apache Spark
[Architecture diagram: the driver, running live or compiled code on the local machine, talks to a cluster manager (YARN/Mesos/Kubernetes); workers 0…n each hold an RDD partition and execute the code.]
17
• Using the Sparkling Clojure library
– This wraps the Spark Java API
– Does the heavy lifting of Clojure data structure serialisation
• Mostly using the core API
• RDDs holding rows of Clojure maps
• Runs on AWS Elastic MapReduce
– Tune your cluster
• Uses DynamoDB for inter-batch state
Metail and Spark
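The "RDDs holding rows of Clojure maps" pattern translates directly to rows as dictionaries. A hedged Python sketch of a pure per-row transform (the real code uses Sparkling and Clojure maps; the field names here are made up):

```python
# Each record is a plain map/dict; transformations are pure
# functions that return a new row rather than mutating the input.
rows = [
    {"user": "a", "event": "view", "ms": 120},
    {"user": "b", "event": "click", "ms": 45},
]

def tag_slow(row, threshold_ms=100):
    """Return a new row with a derived field; the input is untouched."""
    return {**row, "slow": row["ms"] > threshold_ms}

tagged = [tag_slow(r) for r in rows]
```

Keeping rows as plain maps means the same pure functions can be tested locally on a list and then mapped over an RDD unchanged.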
18
• Metail is making clothing fit for all
• We’re incorporating metrics derived from our collected data
• We have several pipelines collecting, transforming and visualising our data
• When dealing with datasets, functional programming offers many advantages
• Give them a go!
– https://github.com/gareth625/lhcb-opendata
Summary
19
• Learning Clojure: https://www.braveclojure.com/
• A web-based Clojure REPL: https://repl.it/repls/
• Basis of the Metail Experience pipeline: https://snowplowanalytics.com
• Dashboarding and SQL warehouse management: https://looker.com
• Tuning your cluster (required on EMR):
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
Resources
20
Questions?
