data science
and enterprise
engineering
how data scientists and engineers work in tandem to achieve
real-time personalization at overstock.com
overstock
Š Overstock2
pushing boundaries of retail since 1999
Midvale, Utah
5 million unique products for sale
billions of visits and page views
160+ countries
team in the beginning.....
Š Overstock
4
• pigs
• data science team – 3 data scientists
• engineering team - 2 developers, 1 QA, and a
product owner
• chickens
• plenty of business owners
• lots of channel managers
remember animal farm?
overstock
marketing
dev
the problems we were up against
Š Overstock
5
• data scientists were not working with the
business to solve business problems
• engineers were not working with data
scientists
• engineering was in a relational data mindset
• not regularly delivering business value
• solving yesterday's problems - today
overstock
marketing
dev
days
minutes
seconds
Š Overstock6
1st problem we came together to solve
Š Overstock
7
• real-time bidding (RTB) is a means by
which advertising inventory is bought and sold on a
per-impression basis via real-time bidding, similar to
financial markets
• low-latency operation – need to bid >10ms
• pre-compute scores nightly
• push scored to cache in AWS
• partnered with Bees Wax
• replace existing 3rd party partners
This is where I add text. If
I want it to keep going, it
looks like this.
business
problem
propensity to purchase
Š Overstock
8
• identifying unique patterns and tendencies that indicate
a user is ready to purchase
• user score: represents a customer’s likelihood of
purchase
• collected at the visit/user level
• basic steps
• turn raw user interactions into ‘features’
• train classifiers on months of data with label =
purchase vs. no purchase
• predict on new users/visits
This is where I add text. If
I want it to keep going, it
looks like this.
problem to
solve
class imbalance
Š Overstock
9
• many new customers
• billions of unique page views in a calendar year
• many users we are seeing for the first time
(potentially)
• low conversion
• a small percentage of sessions end in a purchase
• sparse web logs mean we have to digest an
enormous amount of data to generate useful features
• we are interested in accuracy on the positive label
• recall over precision
data science
challenges
what the team was up against
Š Overstock
10
• Constraints with current infrastructure and
processes
• used to pulling data instead of streaming it
• data scientists are a scarce resource
• working on many non-relative tasks
• not a lot of experience with the care and feeding of
enterprise software
• a bit set in the academia mindset
challenges
Š Overstock
11
Faster Scotty
Š Overstock
12
what we accomplished in 6 months
• we were solving today's problems... by the end of
the day
• engineering was thinking about streaming data
• had the pieces in place to start scoring users in
minutes
recap
days
minutes
seconds
Š Overstock
13
needed to move faster
Š Overstock
14
• score users faster
• moved from daily to hourly
• picked up a fraud project
• assign fraud score within 5 mins
data science +
engineering
business
challenges
moving from daily batches to more regular micro-batches
Š Overstock
15
• hourly jobs
• much more responsive scoring = larger server footprint
• can still train offline in batch, but must have pipelines better
tuned.
• every 5 minutes
• ETL must be tightly honed
• at this point you hit the edge of what is possible in the batch
setting
• feedback loops become critical for success
data science +
engineering
challenges
balancing business goals with enterprise engineering
Š Overstock
16
• as adoption increases visibility increases
• need standardized data representations
• have to make sure architecture is hardened and
robust
• need fail-safe mechanisms for critical processes
• MOST IMPORTANT
• you can never, ever slow down overstock.com
business vs engineering
challenges
Š Overstock
17
Asset Data Platform
User ID
Content
UserID
AttributesAttributes
Content
User Training Data
Asset Training Data
Experience
Training Data
Trained
Models
User ML
Asset ML
needed to move faster
Š Overstock
18
• what was accomplished in 6 months
• fraud scores in a minute or less
• users were being scored in a minute
• what was left
• still not fast enough for real time personalization on
overstock.com
data science +
engineering
recap
days
minutes
seconds
Š Overstock
19
near real-time personalization on the shopping site
Š Overstock
20
• near real-time personalization on overstock
• putting custom recs in front of the user.
• can't slow the site down
This is where I add text. If
I want it to keep going, it
looks like this.
business
problem
from micro-batches to near real-time
Š Overstock
21
• near real-time can limit the size of models and
complexity of calculations
• we aren’t trading stocks, we sell home goods
• we may not know much about a user we are
trying to personalize for
• what can you personalize for a new user as they
initially interact with your site
• heuristics moving towards full models
data science +
engineering +
business
challenges
from micro-batches to near real-time
Š Overstock
22
• real-time data flow requires rea-time analytics
and strategy
• introspection into processes becomes critical
• must have automated process control to prevent
algorithms from running wild
• empower business owners to operate strategically
without suffering from information overload
data science +
engineering +
business
challenges
from micro-batches to near real-time
Š Overstock
23
• every piece of the process needs to function as
a cohesive flow
• do we need to wait for the data to get there, then act
(fraud?)
• or act on the data we have (recommendations for a
shopping site)
data science +
engineering +
business
challenges
Š Overstock
24
asset
data platform
live rex
pcm
asset exhaust
user exhaust
audiences &
user attributes
targeted
assets
userattributesassetattributes
tracking
deep links
tracking
deep links
id
bids
bids
auction, bid, & win logs
campaign logs
trking
deep
links
userattributes
assets
campaign
logs
assetevents
userevents
raw
data
ml
features
ml
features
raw
data
what was done and undone
Š Overstock
25
• still not quite to near-real-time/low-latency recs for
shopping site
• fraud score and email recommendations are fast
enough
data science +
engineering +
business
recap
roadmap
Š Overstock
26
• page-less personalization
• deep learning for fraud
• balancing real-time needs with efficient gains
We all need to be daring and collaborative to
accelerate innovation!
data science +
engineering +
business
where we
are going
lessons learned
Š Overstock
27
• the cloud presents a whole new set of problems
• tech is hard, politics are harder
• we can’t operate in silos
• we must be aligned with the business
• the process must be iterative
data science +
engineering +
business
take
always
questions?
and of course, we are hiring!
Data Science and Enterprise Engineering with Michael Finger and Chris Robison

Data Science and Enterprise Engineering with Michael Finger and Chris Robison

  • 1.
    data science and enterprise engineering howdata scientists and engineers work in tandem to achieve real-time personalization at overstock.com
  • 2.
    overstock Š Overstock2 pushing boundariesof retail since 1999 Midvale, Utah 5 million unique products for sale billions of visits and page views 160+ countries
  • 3.
    team in thebeginning..... © Overstock 4 • pigs • data science team – 3 data scientists • engineering team - 2 developers, 1 QA, and a product owner • chickens • plenty of business owners • lots of channel managers remember animal farm? overstock marketing dev
  • 4.
    the problems wewere up against © Overstock 5 • data scientists were not working with the business to solve business problems • engineers were not working with data scientists • engineering was in a relational data mindset • not regularly delivering business value • solving yesterday's problems - today overstock marketing dev
  • 5.
  • 6.
    1st problem wecame together to solve © Overstock 7 • real-time bidding (RTB) is a means by which advertising inventory is bought and sold on a per-impression basis via real-time bidding, similar to financial markets • low-latency operation – need to bid >10ms • pre-compute scores nightly • push scored to cache in AWS • partnered with Bees Wax • replace existing 3rd party partners This is where I add text. If I want it to keep going, it looks like this. business problem
  • 7.
    propensity to purchase ©Overstock 8 • identifying unique patterns and tendencies that indicate a user is ready to purchase • user score: represents a customer’s likelihood of purchase • collected at the visit/user level • basic steps • turn raw user interactions into ‘features’ • train classifiers on months of data with label = purchase vs. no purchase • predict on new users/visits This is where I add text. If I want it to keep going, it looks like this. problem to solve
  • 8.
    class imbalance © Overstock 9 •many new customers • billions of unique page views in a calendar year • many users we are seeing for the first time (potentially) • low conversion • a small percentage of sessions end in a purchase • sparse web logs mean we have to digest an enormous amount of data to generate useful features • we are interested in accuracy on the positive label • recall over precision data science challenges
  • 9.
    what the teamwas up against © Overstock 10 • Constraints with current infrastructure and processes • used to pulling data instead of streaming it • data scientists are a scarce resource • working on many non-relative tasks • not a lot of experience with the care and feeding of enterprise software • a bit set in the academia mindset challenges
  • 10.
  • 11.
    Faster Scotty © Overstock 12 whatwe accomplished in 6 months • we were solving today's problems... by the end of the day • engineering was thinking about streaming data • had the pieces in place to start scoring users in minutes recap
  • 12.
  • 13.
    needed to movefaster © Overstock 14 • score users faster • moved from daily to hourly • picked up a fraud project • assign fraud score within 5 mins data science + engineering business challenges
  • 14.
    moving from dailybatches to more regular micro-batches © Overstock 15 • hourly jobs • much more responsive scoring = larger server footprint • can still train offline in batch, but must have pipelines better tuned. • every 5 minutes • ETL must be tightly honed • at this point you hit the edge of what is possible in the batch setting • feedback loops become critical for success data science + engineering challenges
  • 15.
    balancing business goalswith enterprise engineering © Overstock 16 • as adoption increases visibility increases • need standardized data representations • have to make sure architecture is hardened and robust • need fail-safe mechanisms for critical processes • MOST IMPORTANT • you can never, ever slow down overstock.com business vs engineering challenges
  • 16.
    Š Overstock 17 Asset DataPlatform User ID Content UserID AttributesAttributes Content User Training Data Asset Training Data Experience Training Data Trained Models User ML Asset ML
  • 17.
    needed to movefaster © Overstock 18 • what was accomplished in 6 months • fraud scores in a minute or less • users were being scored in a minute • what was left • still not fast enough for real time personalization on overstock.com data science + engineering recap
  • 18.
  • 19.
    near real-time personalizationon the shopping site © Overstock 20 • near real-time personalization on overstock • putting custom recs in front of the user. • can't slow the site down This is where I add text. If I want it to keep going, it looks like this. business problem
  • 20.
    from micro-batches tonear real-time © Overstock 21 • near real-time can limit the size of models and complexity of calculations • we aren’t trading stocks, we sell home goods • we may not know much about a user we are trying to personalize for • what can you personalize for a new user as they initially interact with your site • heuristics moving towards full models data science + engineering + business challenges
  • 21.
    from micro-batches tonear real-time © Overstock 22 • real-time data flow requires rea-time analytics and strategy • introspection into processes becomes critical • must have automated process control to prevent algorithms from running wild • empower business owners to operate strategically without suffering from information overload data science + engineering + business challenges
  • 22.
    from micro-batches tonear real-time © Overstock 23 • every piece of the process needs to function as a cohesive flow • do we need to wait for the data to get there, then act (fraud?) • or act on the data we have (recommendations for a shopping site) data science + engineering + business challenges
  • 23.
    Š Overstock 24 asset data platform liverex pcm asset exhaust user exhaust audiences & user attributes targeted assets userattributesassetattributes tracking deep links tracking deep links id bids bids auction, bid, & win logs campaign logs trking deep links userattributes assets campaign logs assetevents userevents raw data ml features ml features raw data
  • 24.
    what was doneand undone © Overstock 25 • still not quite to near-real-time/low-latency recs for shopping site • fraud score and email recommendations are fast enough data science + engineering + business recap
  • 25.
    roadmap © Overstock 26 • page-lesspersonalization • deep learning for fraud • balancing real-time needs with efficient gains We all need to be daring and collaborative to accelerate innovation! data science + engineering + business where we are going
  • 26.
    lessons learned © Overstock 27 •the cloud presents a whole new set of problems • tech is hard, politics are harder • we can’t operate in silos • we must be aligned with the business • the process must be iterative data science + engineering + business take always
  • 27.