SlideShare a Scribd company logo
@metail#VoxxedBristol
Putting the Spark into
Functional Fashion Tech
Analytics
Gareth Rogers
Metail
2
● Who are Metail and what do we do
● Our data pipeline
● What is Apache Spark and our experiences
● Live coding demonstration of Spark?
Introduction
3
Metail
Making clothing fit for all
4
● Create your MeModel with just a
few clicks
● See how the clothes look on you
● Primarily clickstream analysis
● Understanding the user journey
http://trymetail.com
The Metail Experience
5
Composed Photography
With Metail
Shoot model once
Choose poses
Style & restyle
Compositing
Shoot clothes
at source
● Understanding process flow
● Discovering bottlenecks
● Optimising the workflow
● Understanding costs
6
• Batch pipeline modelled on a lambda architecture
• Transformed by Clojure, Spark in Clojure or SQL
• Managed by applications written in Clojure
• Built on and using Amazon Web Services (AWS)
Metail’s Data Pipeline
7
Metail’s Data Pipeline
8
Metail’s Data Pipeline
To analytics
dashboards
https://looker.com/
9
● Our pipeline is heavily influenced by functional programmings
paradigms
● Declarative
● Immutable data structures
● Pure functions -- effects only dependent on input state
● Minimise side effects
Functional Pipeline
10
• Clojure is predominately a functional programming language
• Aims to be approachable
• Allow interactive development
• A lisp programming language
• Everything is dynamically typed
Clojure
11
• Spark is a general purpose distributed data processing engine
– Can scale from processing a single line to terabytes of data
• Functional paradigm
– Declarative
– Functions are first class
– Immutable datasets
• Written in Scala a JVM based language
– Just like Clojure
– Has a Java API
– Clojure has great Java interop
Apache Spark
12
• Operates on a Directed
Acyclic Graph (DAG)
– Link together data
processing steps
– May be dependent on
multiple parent steps
– Transforms cannot
return to an older
partition
Apache Spark
13
Apache Spark
https://databricks.com/spark/about
14
• Consists of a driver and multiple workers
• Declare a Spark session and build a job graph
• Driver coordinates with a cluster to execute the graph on the
workers
Apache Spark
15
• Operates on in-memory datasets
– where possible, it will spill to disk
• Based on Resilient Distributed Datasets (RDDs)
• Each RDD represents on dataset
• Split into multiple partitions distributed through the cluster
Apache Spark
16
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Cluster: YARN/Meso/Kubernetes
17
• Using the Sparkling Clojure library
– This wraps the Spark Java API
– Does the heavy lifting of Clojure data structure serialisation
• Mostly using the core API
• RDDs holding rows of Clojure maps
• Runs on AWS Elastic MapReduce
– Tune your cluster
• Uses DynamoDB to for inter-batch state
Metail and Spark
18
• Metail is making clothing fit for all
• We’re incorporating metrics derived from our collected data
• We have several pipelines collecting, transforming and
visualising our data
• When dealing with datasets functional programming offers
many advantages
• Give them a go!
– https://github.com/gareth625/lhcb-opendata
Summary
19
• Learning Clojure: https://www.braveclojure.com/
• A random web based Clojure REPL: https://repl.it/repls/
• Bases of Metail Experience pipeline https://snowplowanalytics.com
• Dashboarding and SQL warehouse management: https://looker.com
• Tuning you cluster
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
– Required on EMR
Resources
20
Questions?

More Related Content

What's hot

Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
 
Biomatters and Amazon Web Services
Biomatters and Amazon Web Services Biomatters and Amazon Web Services
Biomatters and Amazon Web Services
Biomatters
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
Lviv Startup Club
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
Li Gao
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
datamantra
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
Vaclav Kosar
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Apache Cassandra Lunch #72: Databricks and Cassandra
Apache Cassandra Lunch #72: Databricks and CassandraApache Cassandra Lunch #72: Databricks and Cassandra
Apache Cassandra Lunch #72: Databricks and Cassandra
Anant Corporation
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
Piotr Findeisen
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
Spark Summit
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Apache Apex
 
Efficient HTTP applications on the JVM with Ratpack - JDD 2015
Efficient HTTP applications on the JVM with Ratpack - JDD 2015Efficient HTTP applications on the JVM with Ratpack - JDD 2015
Efficient HTTP applications on the JVM with Ratpack - JDD 2015
Alvaro Sanchez-Mariscal
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
Valerii Moisieienko
 
Operationalize Apache Spark Analytics
Operationalize Apache Spark AnalyticsOperationalize Apache Spark Analytics
Operationalize Apache Spark Analytics
Databricks
 
Apache airflow
Apache airflowApache airflow
Apache airflow
Purna Chander
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
kbajda
 
James Turner (Caplin) - Enterprise HTML5 Patterns
James Turner (Caplin) - Enterprise HTML5 PatternsJames Turner (Caplin) - Enterprise HTML5 Patterns
James Turner (Caplin) - Enterprise HTML5 Patterns
akqaanoraks
 

What's hot (20)

Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Biomatters and Amazon Web Services
Biomatters and Amazon Web Services Biomatters and Amazon Web Services
Biomatters and Amazon Web Services
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Apache Cassandra Lunch #72: Databricks and Cassandra
Apache Cassandra Lunch #72: Databricks and CassandraApache Cassandra Lunch #72: Databricks and Cassandra
Apache Cassandra Lunch #72: Databricks and Cassandra
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationBuilding Your First Apache Apex (Next Gen Big Data/Hadoop) Application
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application
 
Efficient HTTP applications on the JVM with Ratpack - JDD 2015
Efficient HTTP applications on the JVM with Ratpack - JDD 2015Efficient HTTP applications on the JVM with Ratpack - JDD 2015
Efficient HTTP applications on the JVM with Ratpack - JDD 2015
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
Operationalize Apache Spark Analytics
Operationalize Apache Spark AnalyticsOperationalize Apache Spark Analytics
Operationalize Apache Spark Analytics
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
James Turner (Caplin) - Enterprise HTML5 Patterns
James Turner (Caplin) - Enterprise HTML5 PatternsJames Turner (Caplin) - Enterprise HTML5 Patterns
James Turner (Caplin) - Enterprise HTML5 Patterns
 

Similar to Putting the Spark into Functional Fashion Tech Analystics

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
Sahan Bulathwela
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Dona Mary Philip
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Spark
SparkSpark
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 

Similar to Putting the Spark into Functional Fashion Tech Analystics (20)

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark
SparkSpark
Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
exukyp
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
bmucuha
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 

Recently uploaded (20)

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 

Putting the Spark into Functional Fashion Tech Analystics

  • 1. @metail#VoxxedBristol Putting the Spark into Functional Fashion Tech Analytics Gareth Rogers Metail
  • 2. 2 ● Who are Metail and what do we do ● Our data pipeline ● What is Apache Spark and our experiences ● Live coding demonstration of Spark? Introduction
  • 4. 4 ● Create your MeModel with just a few clicks ● See how the clothes look on you ● Primarily clickstream analysis ● Understanding the user journey http://trymetail.com The Metail Experience
  • 5. 5 Composed Photography With Metail Shoot model once Choose poses Style & restyle Compositing Shoot clothes at source ● Understanding process flow ● Discovering bottlenecks ● Optimising the workflow ● Understanding costs
  • 6. 6 • Batch pipeline modelled on a lambda architecture • Transformed by Clojure, Spark in Clojure or SQL • Managed by applications written in Clojure • Built on and using Amazon Web Services (AWS) Metail’s Data Pipeline
  • 8. 8 Metail’s Data Pipeline To analytics dashboards https://looker.com/
  • 9. 9 ● Our pipeline is heavily influenced by functional programmings paradigms ● Declarative ● Immutable data structures ● Pure functions -- effects only dependent on input state ● Minimise side effects Functional Pipeline
  • 10. 10 • Clojure is predominately a functional programming language • Aims to be approachable • Allow interactive development • A lisp programming language • Everything is dynamically typed Clojure
  • 11. 11 • Spark is a general purpose distributed data processing engine – Can scale from processing a single line to terabytes of data • Functional paradigm – Declarative – Functions are first class – Immutable datasets • Written in Scala a JVM based language – Just like Clojure – Has a Java API – Clojure has great Java interop Apache Spark
  • 12. 12 • Operates on a Directed Acyclic Graph (DAG) – Link together data processing steps – May be dependent on multiple parent steps – Transforms cannot return to an older partition Apache Spark
  • 14. 14 • Consists of a driver and multiple workers • Declare a Spark session and build a job graph • Driver coordinates with a cluster to execute the graph on the workers Apache Spark
  • 15. 15 • Operates on in-memory datasets – where possible, it will spill to disk • Based on Resilient Distributed Datasets (RDDs) • Each RDD represents on dataset • Split into multiple partitions distributed through the cluster Apache Spark
  • 16. 16 Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Cluster: YARN/Meso/Kubernetes
  • 17. 17 • Using the Sparkling Clojure library – This wraps the Spark Java API – Does the heavy lifting of Clojure data structure serialisation • Mostly using the core API • RDDs holding rows of Clojure maps • Runs on AWS Elastic MapReduce – Tune your cluster • Uses DynamoDB to for inter-batch state Metail and Spark
  • 18. 18 • Metail is making clothing fit for all • We’re incorporating metrics derived from our collected data • We have several pipelines collecting, transforming and visualising our data • When dealing with datasets functional programming offers many advantages • Give them a go! – https://github.com/gareth625/lhcb-opendata Summary
  • 19. 19 • Learning Clojure: https://www.braveclojure.com/ • A random web based Clojure REPL: https://repl.it/repls/ • Bases of Metail Experience pipeline https://snowplowanalytics.com • Dashboarding and SQL warehouse management: https://looker.com • Tuning you cluster http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/ – Required on EMR Resources