Building data pipelines: from simple to more advanced - hands-on experience / CrunchConf - Oct 29, 2015

Data is becoming one of the main decision-makers in an organisation. The more data we have, the more challenges we face every day, and every decision we make has long-term implications. This talk goes through different approaches to data pipelines: from a simple in-house build, through open source solutions based on the Apache stack (Apache Kafka, Apache Samza, Spark), to hosted auto-scaling solutions based on Amazon (S3, Kinesis, Lambda, EMR) or Google (Pub/Sub, Dataflow, BigQuery). It covers the main aspects of data collection together with the implications for data processing, highlighting appropriate solutions and architectures for particular use cases.

Published in: Data & Analytics

  1. Building data pipelines: from simple to more advanced - hands-on. Sergii Khomenko, Data Scientist, sergii.khomenko@stylight.com, @lc0d3r. CrunchConf, October 29, 2015
  2. Sergii Khomenko: Data scientist at one of the biggest fashion communities, Stylight. Data analysis and visualisation hobbyist, working on problems not only during working hours but also in free time, for fun and for personal data visualisations. Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour, Budapest BI Forum 2015
  3. Stylight – Make Style Happen • Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel. • Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it. • Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online. • Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops. • Core Target Group: Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.
  4. Stylight – acting on a global scale
  5. Experienced & Ambitious Team: Innovative cross-functional organisation with flat hierarchy builds a unique team spirit. • 200+ employees • 40 PhDs/Engineers • 28 years average age • 63% female • 23 nationalities • 0 suits
  6. Agenda • The Good, The Bad And The Legacy • Open Source stack • Amazon AWS • Google Cloud • Tips, tricks and best practices
  7. In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
  8. The Good, The Bad And The Legacy
  9. Sources of data: • Web tracking • Metrics tracking • Behaviour tracking • Business intelligence ETL • Internal Services • ML tagging service
  10. Access patterns • Real-time • Nearly real-time • Daily batches
  11. (no text on slide)
  12. (no text on slide)
  13. Properties • Data consistency • Doesn’t scale • Hard to add new sources • Complex system • Many interfaces • As lean and legacy as possible • No need for special services
  14. (no text on slide)
  15. Streaming
  16. Open Source Stack
  17. http://lambda-architecture.net/
  18. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
  19. (no text on slide)
  20. (no text on slide)
  21. http://www.ipponusa.com/wp-content/uploads/2014/10/spark-architecture.jpg
  22. Results • Scalable • Flexible • High maintenance costs • Not so easy to set up
  23. A programming language is low level when its programs require attention to the irrelevant. Alan Jay Perlis / Epigrams on Programming
  24. Amazon AWS
  25. Kinesis Streams
  26. (no text on slide)
  27. (no text on slide)
  28. business development & finance website events enrichment Business Intelligence
  29. Kinesis Firehose, Kinesis Analytics
  30. (no text on slide)
  31. custom unification pipeline Product Processing Business Intelligence ML/Tagging Product events variety of event types and structures
  32. (no text on slide)
  33. AWS Data Pipeline
  34. Google Cloud
  35. (no text on slide)
  36. (no text on slide)
  37. (no text on slide)
  38. (no text on slide)
  39. (no text on slide)
  40. Tips, tricks and best practices
  41. Cross-Functional Team • Department: mission-oriented team with all resources and the fewest dependencies • Product Team: builds the software the department or its customers use • Squad: team that executes the product development (org chart labels: Department Product Team Squad PO Engineer Engineer Designer Data Scientist Head of Business Role Business Role)
  42. (no text on slide)
  43. Cross-Functional Team • You build it - you run it • You check your numbers (domain knowledge) • You provide your data as an interface layer • Data report comes after data tracking (org chart labels: Department Product Team Squad PO Engineer Engineer Designer Data Scientist Head of Business Role Business Role)
  44. (no text on slide)
  45. (no text on slide)
  46. (no text on slide)
  47. I think that it's extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun. Alan Jay Perlis / The Structure and Interpretation of Computer Programs
  48. www.stylight.com sergii.khomenko@stylight.com @lc0d3r
  49. Related talks • Helping Data Teams with Puppet / Puppet Camp London • Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift / Tableau Conference on Tour - Berlin • Google Cloud Dataflow Two Worlds Become a Much Better One
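
The pipeline definition quoted on slide 7 (processing elements connected in series, where the output of one element is the input of the next) can be illustrated with a small sketch. This is not code from the talk: the stage names (`parse`, `keep_clicks`, `enrich`), the `user,event` record format and the `"source": "web"` field are all hypothetical, chosen only to loosely echo the "web tracking" source and "enrichment" step named on the slides.

```python
from typing import Callable, Iterable, Iterator, List

# A stage takes an iterable of records and lazily yields transformed records.
Stage = Callable[[Iterable], Iterator]


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Hypothetical first stage: turn 'user,event' strings into dicts."""
    for line in lines:
        user, event = line.strip().split(",")
        yield {"user": user, "event": event}


def keep_clicks(events: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical filter stage: pass through click events only."""
    for e in events:
        if e["event"] == "click":
            yield e


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical enrichment stage: attach an assumed source label."""
    for e in events:
        yield {**e, "source": "web"}


def run_pipeline(source: Iterable, stages: List[Stage]) -> Iterable:
    """Connect stages in series: each stage consumes the previous stage's output."""
    data = source
    for stage in stages:
        data = stage(data)
    return data


raw = ["u1,click", "u2,view", "u3,click"]
result = list(run_pipeline(raw, [parse, keep_clicks, enrich]))
print(result)
```

Because every stage is a generator, records flow through one at a time, so the same structure works for a small daily batch or an unbounded iterator. In a real deployment each stage boundary would typically be a queue or stream (for example Kafka or Kinesis, as covered in the talk) rather than an in-process function call.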
