Data is becoming one of the main decision-makers in an organisation. The more data we have, the more challenges we face every day, and every decision we make has long-term implications. In this talk we go through different approaches to data pipelines: from a simple in-house build, through open-source solutions based on the Apache stack (Apache Kafka, Apache Samza, Spark), to hosted auto-scaling solutions based on Amazon (S3, Kinesis, Lambda, EMR) or Google (Pub/Sub, Dataflow, BigQuery). The talk covers the main aspects of data collection together with their implications for data processing, highlighting appropriate solutions and architectures for particular use cases.
Building data pipelines: from simple to more advanced - hands-on experience / CrunchConf - Oct 29, 2015
1. Building data pipelines
from simple to more advanced - hands-on
Sergii Khomenko, Data Scientist
sergii.khomenko@stylight.com, @lc0d3r
CrunchConf - October 29, 2015
2. Sergii Khomenko
Data scientist at Stylight, one of the biggest fashion communities.
Data analysis and visualisation hobbyist, working on data problems not only at work but also in free time, for fun and personal data visualisations.
Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour, Budapest BI Forum 2015
3. Profitable Leads
Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.
Inspiration
Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.
Branding & Reach
Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.
Shopping
Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.
Stylight – Make Style Happen
Core Target Group
Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.
5. Experienced & Ambitious Team
Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.
• +200 employees
• 40 PhDs/Engineers
• 28 years average age
• 63% female
• 23 nationalities
• 0 suits
6. Agenda
The Good, The Bad And The Legacy
Open Source stack
Amazon AWS
Google Cloud
Tips, tricks and best practices
7.
In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
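As a minimal sketch of that definition (not taken from the talk), a pipeline can be modelled in Python as a list of stages, each consuming the previous stage's output. The stage names `parse`, `keep_positive`, `total_per_user`, the helper `run_pipeline`, and the sample records are all hypothetical and chosen only for illustration.

```python
from functools import reduce


def parse(lines):
    """Stage 1: turn raw "user, amount" lines into event dicts."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user.strip(), "amount": float(amount)}


def keep_positive(events):
    """Stage 2: drop events with a non-positive amount."""
    return (event for event in events if event["amount"] > 0)


def total_per_user(events):
    """Stage 3: aggregate amounts per user, emitted in sorted order."""
    totals = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    yield from sorted(totals.items())


def run_pipeline(source, stages):
    """Connect stages in series: each stage's output feeds the next one."""
    return reduce(lambda data, stage: stage(data), stages, source)


# Hypothetical sample input, for illustration only.
raw = ["anna, 10.0", "ben, -3.5", "anna, 2.5", "ben, 4.0"]
result = list(run_pipeline(raw, [parse, keep_positive, total_per_user]))
print(result)
```

Because every stage only depends on an iterable coming in and produces an iterable going out, stages can be added, removed, or reordered without touching the others, which is the property the larger systems discussed in the talk try to preserve at scale.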
13. Properties
• Data consistency
• Doesn’t scale
• Hard to add new sources
• Complex system
• Many interfaces
• As lean and legacy as possible
• No need for special services
23.
A programming language is low level when its programs require attention to the irrelevant.
Alan Jay Perlis / Epigrams on Programming
47. Cross-Functional Team
Department: mission-oriented team with all resources and the fewest dependencies
Product Team: builds the software the department or its customers use
Squad: team that executes the product development
Department
Product Team
Squad
PO
Engineer
Engineer
Designer
Data Scientist
Head of
Business Role
Business Role
49. Cross-Functional Team
• You build it - you run it
• You check your numbers (domain knowledge)
• You provide your data as interface layer
• Data report comes after data tracking
54.
I think that it's extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun.
Alan Jay Perlis / The Structure and Interpretation of Computer Programs
56. Related talks
• Helping Data Teams with Puppet / Puppet Camp London
• Secure Data Scalability at Stylight with Tableau Online and Amazon Redshift / Tableau Conference on Tour - Berlin
• Google Cloud Dataflow Two Worlds Become a Much Better One