SlideShare a Scribd company logo
1 of 30
Download to read offline
1
OCTOBER 2015
Putting the Spark into Functional
Fashion Tech Analytics
Gareth Rogers Data Engineer November 2018
2
● Who are Metail and what we do
● Our data pipeline
● What is Apache Spark and our experiences
Introduction
3
Metail
Making clothing fit for all
4
● Create your MeModel with just a
few clicks
● See how the clothes look on you
● Primarily clickstream analysis
● Understanding the user journey
http://trymetail.com
The Metail Experience
5
Composed Photography
With Metail
Shoot model once
Choose poses
Style & restyle
Compositing
Shoot clothes
at source
● Understanding process flow
● Discovering bottlenecks
● Optimising the workflow
● Understanding costs
6
● Our pipeline is heavily influenced by functional programmings
paradigms
● Immutable data structures
● Declarative
● Pure functions -- effects only dependent on input state
● Minimise side effects
Functional Pipeline
7
Metail’s Data Pipeline
• Batch pipeline modelled on a lambda architecture
8
• Batch pipeline modelled on a lambda architecture
● Immutable datasets
● Batch layer append only
● Rebuild views rather than edit
● Serving layer for visualization
● Speed layer samples input data
○ Kappa architecture
Metail’s Data Pipeline - The lambda architecture
9
• Managed by applications written in Clojure
Metail’s Data Pipeline - Driving the Pipeline
10
• Clojure is predominately a functional programming language
• Aims to be approachable
• Allow interactive development
• A lisp programming language
• Everything is dynamically typed
• Runs on the JVM or compiled to JavaScript
Clojure
11
• Running on the JVM
– access to all the Java ecosystem and learnings
– Java interop well supported
– not Java though
Metail’s Data Pipeline - Driving the Pipeline
12
Metail’s Data Pipeline - Driving the Pipeline
Snowplow Analytics (shameful plug, we
use their platform and I like the founders)
13
Metail’s Data Pipeline - Driving the Pipeline
14
• Transformed by Clojure, Spark in Clojure or SQL
• Clojure used for datasets with well defined size
– These easily run on a single JVM
– Dataset size always within an order of magnitude
• Spark
– Dataset sizes can vary over a few orders of magnitude
• SQL is typically used in the serving layer and for dashboarding
and BI tools
Metail’s Data Pipeline - Transforming the Data
15
Metail’s Data Pipeline
16
Metail’s Data Pipeline
To analytics
dashboards
https://looker.com/
17
• Spark is a general purpose distributed data processing engine
– Can scale from processing a single line to terabytes of data
• Functional paradigm
– Declarative
– Functions are first class
– Immutable datasets
• Written in Scala a JVM based language
– Just like Clojure
– Has a Java API
– Clojure has great Java interop
Apache Spark
18
Apache Spark
https://databricks.com/spark/about
19
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
20
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
• Consists of a driver and multiple workers
• Declare a Spark session and build a job graph
• Driver coordinates with a cluster to execute the graph on the workers
21
Cluster: YARN/Meso/Kubernetes
Apache Spark
Driver
●
●
●
Worker 0
RDD
Partition
Code
Worker 1
RDD
Partition
Code
Worker n
RDD
Partition
Code
Live code or
compiled code
Cluster
Manager
Local machine
Master
• Operates on in-memory datasets
– where possible, it will spill to disk
• Based on Resilient Distributed Datasets (RDDs)
• Each RDD represents on dataset
• Split into multiple partitions distributed through
the cluster
22
• Operates on a Directed
Acyclic Graph (DAG)
– Link together data
processing steps
– May be dependent on
multiple parent steps
– Transforms cannot
return to an older
partition
Apache Spark
23
• Using the Sparkling Clojure library
– This wraps the Spark Java API
– Handles Clojure data structure and function serialisation
• For example counting user-sessions
– This is simple but doesn’t scale, returns everything to driver
memory
Metail and Spark
24
• Two variants that return PairRRDs
• combineByKey is more general than aggregateByKey
• Note I did the Java interop as Sparkling doesn’t cover this API
– Easy when you can use their serialization library
Metail and Spark
25
• Mostly using the core API
• RDDs holding rows of Clojure maps
• Would like to migrate to the Dataset API
Metail and Spark
26
• Runs on AWS Elastic MapReduce
– Tune your cluster
• This is an example for my limited VM, a cluster would use
bigger values!
Metail and Spark
27
• Very scalable
– but the distributed environment adds overhead
• Sparkling does a lot of the hard work
– Clojure not a supported language
• Good documentation
– Sometimes it’s hard to figure out which way to do
something
– Lots of deprecated methods
• Declarative language + Clojure interop makes stacktraces hard
to interpret
• Dataset API is heavily optimised
– Would remove a lot of the Clojure interop
Metail and Spark - Pros and Cons
28
• Metail is making clothing fit for all
• We’re incorporating metrics derived from our collected data
• We have several pipelines collecting, transforming and
visualising our data
• When dealing with datasets functional programming offers
many advantages
• Give them a go!
– https://github.com/gareth625/lhcb-opendata
Summary
29
• Learning Clojure: https://www.braveclojure.com/
• A random web based Clojure REPL: https://repl.it/repls/
• Basis of Metail Experience pipeline https://snowplowanalytics.com
• Dashboarding and SQL warehouse management: https://looker.com
• Tuning your Spark cluster
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
Resources
30
Questions?

More Related Content

What's hot

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIdatamantra
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanBuilding a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanDataWorks Summit
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewVaclav Kosar
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
Ruby to Scala in 9 weeks
Ruby to Scala in 9 weeksRuby to Scala in 9 weeks
Ruby to Scala in 9 weeksjutley
 
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsAlpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsKnoldus Inc.
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介LINE Corporation
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 
Metrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamMetrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamLINE Corporation
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Lviv Startup Club
 
Stream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaStream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaDiego Pacheco
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowDatabricks
 

What's hot (20)

Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with AzkabanBuilding a Self-Service Hadoop Platform at Linkedin with Azkaban
Building a Self-Service Hadoop Platform at Linkedin with Azkaban
 
Spline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture OverviewSpline 2 - Vision and Architecture Overview
Spline 2 - Vision and Architecture Overview
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Ruby to Scala in 9 weeks
Ruby to Scala in 9 weeksRuby to Scala in 9 weeks
Ruby to Scala in 9 weeks
 
Apache Flink
Apache FlinkApache Flink
Apache Flink
 
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka StreamsAlpakka - Connecting Kafka and ElasticSearch to Akka Streams
Alpakka - Connecting Kafka and ElasticSearch to Akka Streams
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介Clovaを支える技術 機械学習配信基盤のご紹介
Clovaを支える技術 機械学習配信基盤のご紹介
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
Metrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability TeamMetrics driven development with dedicated Observability Team
Metrics driven development with dedicated Observability Team
 
Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"Micheal Pershyn "Coljure 4 Big Data"
Micheal Pershyn "Coljure 4 Big Data"
 
Stream Processing with Kafka and Samza
Stream Processing with Kafka and SamzaStream Processing with Kafka and Samza
Stream Processing with Kafka and Samza
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
A Collaborative Data Science Development Workflow
A Collaborative Data Science Development WorkflowA Collaborative Data Science Development Workflow
A Collaborative Data Science Development Workflow
 

Similar to Putting the Spark into Functional Fashion Tech Analytics

Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduceGareth Rogers
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauSam Palani
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0datamantra
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1Ivan Ma
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentItai Yaffe
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudMark Rittman
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersTobias Koprowski
 

Similar to Putting the Spark into Functional Fashion Tech Analytics (20)

Metail and Elastic MapReduce
Metail and Elastic MapReduceMetail and Elastic MapReduce
Metail and Elastic MapReduce
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Spark
SparkSpark
Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle Cloud
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#4_Caerdydd_MaintenancePlansForBeginners
 

Recently uploaded

Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

Putting the Spark into Functional Fashion Tech Analytics

  • 1. 1 OCTOBER 2015 Putting the Spark into Functional Fashion Tech Analytics Gareth Rogers Data Engineer November 2018
  • 2. 2 ● Who are Metail and what we do ● Our data pipeline ● What is Apache Spark and our experiences Introduction
  • 4. 4 ● Create your MeModel with just a few clicks ● See how the clothes look on you ● Primarily clickstream analysis ● Understanding the user journey http://trymetail.com The Metail Experience
  • 5. 5 Composed Photography With Metail Shoot model once Choose poses Style & restyle Compositing Shoot clothes at source ● Understanding process flow ● Discovering bottlenecks ● Optimising the workflow ● Understanding costs
  • 6. 6 ● Our pipeline is heavily influenced by functional programmings paradigms ● Immutable data structures ● Declarative ● Pure functions -- effects only dependent on input state ● Minimise side effects Functional Pipeline
  • 7. 7 Metail’s Data Pipeline • Batch pipeline modelled on a lambda architecture
  • 8. 8 • Batch pipeline modelled on a lambda architecture ● Immutable datasets ● Batch layer append only ● Rebuild views rather than edit ● Serving layer for visualization ● Speed layer samples input data ○ Kappa architecture Metail’s Data Pipeline - The lambda architecture
  • 9. 9 • Managed by applications written in Clojure Metail’s Data Pipeline - Driving the Pipeline
  • 10. 10 • Clojure is predominately a functional programming language • Aims to be approachable • Allow interactive development • A lisp programming language • Everything is dynamically typed • Runs on the JVM or compiled to JavaScript Clojure
  • 11. 11 • Running on the JVM – access to all the Java ecosystem and learnings – Java interop well supported – not Java though Metail’s Data Pipeline - Driving the Pipeline
  • 12. 12 Metail’s Data Pipeline - Driving the Pipeline Snowplow Analytics (shameful plug, we use their platform and I like the founders)
  • 13. 13 Metail’s Data Pipeline - Driving the Pipeline
  • 14. 14 • Transformed by Clojure, Spark in Clojure or SQL • Clojure used for datasets with well defined size – These easily run on a single JVM – Dataset size always within an order of magnitude • Spark – Dataset sizes can vary over a few orders of magnitude • SQL is typically used in the serving layer and for dashboarding and BI tools Metail’s Data Pipeline - Transforming the Data
  • 16. 16 Metail’s Data Pipeline To analytics dashboards https://looker.com/
  • 17. 17 • Spark is a general purpose distributed data processing engine – Can scale from processing a single line to terabytes of data • Functional paradigm – Declarative – Functions are first class – Immutable datasets • Written in Scala a JVM based language – Just like Clojure – Has a Java API – Clojure has great Java interop Apache Spark
  • 19. 19 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master
  • 20. 20 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master • Consists of a driver and multiple workers • Declare a Spark session and build a job graph • Driver coordinates with a cluster to execute the graph on the workers
  • 21. 21 Cluster: YARN/Meso/Kubernetes Apache Spark Driver ● ● ● Worker 0 RDD Partition Code Worker 1 RDD Partition Code Worker n RDD Partition Code Live code or compiled code Cluster Manager Local machine Master • Operates on in-memory datasets – where possible, it will spill to disk • Based on Resilient Distributed Datasets (RDDs) • Each RDD represents on dataset • Split into multiple partitions distributed through the cluster
  • 22. 22 • Operates on a Directed Acyclic Graph (DAG) – Link together data processing steps – May be dependent on multiple parent steps – Transforms cannot return to an older partition Apache Spark
  • 23. 23 • Using the Sparkling Clojure library – This wraps the Spark Java API – Handles Clojure data structure and function serialisation • For example counting user-sessions – This is simple but doesn’t scale, returns everything to driver memory Metail and Spark
  • 24. 24 • Two variants that return PairRRDs • combineByKey is more general than aggregateByKey • Note I did the Java interop as Sparkling doesn’t cover this API – Easy when you can use their serialization library Metail and Spark
  • 25. 25 • Mostly using the core API • RDDs holding rows of Clojure maps • Would like to migrate to the Dataset API Metail and Spark
  • 26. 26 • Runs on AWS Elastic MapReduce – Tune your cluster • This is an example for my limited VM, a cluster would use bigger values! Metail and Spark
  • 27. 27 • Very scalable – but the distributed environment adds overhead • Sparkling does a lot of the hard work – Clojure not a supported language • Good documentation – Sometimes it’s hard to figure out which way to do something – Lots of deprecated methods • Declarative language + Clojure interop makes stacktraces hard to interpret • Dataset API is heavily optimised – Would remove a lot of the Clojure interop Metail and Spark - Pros and Cons
  • 28. 28 • Metail is making clothing fit for all • We’re incorporating metrics derived from our collected data • We have several pipelines collecting, transforming and visualising our data • When dealing with datasets functional programming offers many advantages • Give them a go! – https://github.com/gareth625/lhcb-opendata Summary
  • 29. 29 • Learning Clojure: https://www.braveclojure.com/ • A random web based Clojure REPL: https://repl.it/repls/ • Basis of Metail Experience pipeline https://snowplowanalytics.com • Dashboarding and SQL warehouse management: https://looker.com • Tuning your Spark cluster http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/ Resources