Scalding at Etsy
So hey everybody, my name is Dan McKinley
I’m visiting from LA
I worked for Etsy for 6.5 years, mostly from Brooklyn. In an office considerably less sparse
than this one, I assure you. Mea culpa, that’s “worked” in the past tense. I quit to join a startup
last month. After signing up to give this talk. But I left on very good terms so I’m still doing it.
This talk’s about Scalding, and how we wound up using it at Etsy.
When I was writing this talk this passage from Douglas Adams kept popping into my brain. I
do feel like we had scalding thrust upon us at Etsy, rather than choosing it intentionally. Which
is not the same as saying that I was personally unhappy with it, exactly. I was not. This is the
character that went on to try to insult every being in the cosmos in alphabetical order. So I’m
not sure if it was meant as an allegory about the scala community.
The first thing I wanted to do was give an overview of how Etsy uses scalding now.
This is hopefully the only strata-esque slide in the talk. Don’t run for the exits or anything.
What I want to communicate with it is that in abstract, we aggregate logs from the live site, put
them on hdfs. Then from there we crunch them to build internal tooling and features. For live
features we’re putting job outputs into mysql shards; for backend tools we typically use a BI
database (vertica) to fill the same need.
Scalding gets used at all points on the hadoop side. Parsing logs, generating
recommendations and ranking datasets, and business intelligence are all either done in
Scalding or will be ported to Scalding very shortly.
There are a bunch of ways that people use analytics at Etsy. The way you get your answers
depends on the kind of question you’re asking.
I’ll go through some examples. This is a simple one. Let’s say you just want to know how
many shops open up a day.
That’s a pretty common question. And so somebody’s thought of it way before you, and
they’ve put it on a dashboard. So you can just go look at the dashboard.
Another kind of question is one about how an A/B test you’re running is doing.
We do a lot of A/B testing at Etsy, so much so that we’ve built our own A/B analyzer frontend
called Catapult. So for most questions relating to variants in A/B tests you can go to that.
Then there are slightly more complicated questions. Like, how many of the top sellers sell
vintage goods? Maybe you’re the first person to ever ask such a question.
But, people have thought of questions that are kind of similar to it before. And in most of those
cases you can go ask the BI database.
And then there are questions that are even farther out there. Cases where you’re probably the
first person to ask not just this specifically, but you’re also probably the first person to ask any
question even similar to it. Like this one. Etsy gets traffic to items that are sold. How often
could we redirect that traffic to items that have close tags and titles?
That’s the kind of thing you’d use scalding to answer today. We have the data in theory, but
we haven’t normalized it and put it in BI. Or maybe it’s too big to fit in BI.
A very common kind of novel question relates to debugging A/B tests.
We do a ton of that with scalding too.
I conceptualize our data universe as having three domains.
There are questions we’ve anticipated, questions we didn’t anticipate, and then there are
permanent systems.
Like I said, we have tooling support for the first domain. And we use scalding for the other
two.
That’s questions where the data needed to get an answer is in a relatively raw form, which I’ll
wave my hands and call analysis. And then we also build features and systems with scalding,
which is more like what I’d call “engineering.” We do work for ranking, for recommendations,
and so on in scalding.
Let me give you some idea of how big of a thing this is.
It’s pretty big, I guess. When I quit we had about 800 scalding jobs in source control. And if
everyone is like me, there are probably twice as many in working directories, not committed.
Only about 90 of those, though, run as part of our nightly batch process.
58 people had written scalding jobs.
And 14 of them figured out how to use Algebird. Etsy’s engineering team, by the way, is like
150 programmers.
This histogram showing how many jobs people have written is about what you’d expect.
There’s a small group of people like me who have written a ton of jobs. And most people have
written one or two jobs.
And the way it breaks down across the domains is like this. Most of the people using scalding
are using it to answer analytics questions. The experts tend to be the people building systems
with scalding.
So why would we pick scalding?
Well, we didn’t really pick it on purpose. It was an accident.
To explain how that accident happened I guess I first have to explain how we got started with
analytics.
And that was kind of an accident too. We didn’t necessarily set out to build something to
replace Google Analytics.
What we did do was buy an advertising startup called Adtuitive back in 2009.
And those guys brought something with them called cascading.jruby. For our purposes you
can consider this to be pretty close to Pig, but using JRuby.
This is a really simple example of a job written in cascading.jruby. Hopefully you’ll just believe
me that the Java equivalent would be Byzantine.
The thing we wanted to get out of that acquisition was this feature. Paid promoted listings that you
see when you search on Etsy. In the beginning we pretty much just wanted to build whatever
we needed to have this.
But to do that we needed things like impression logging and frontend feedback. So we started
collecting event beacons from our frontend.
And shipped those beacon logs to hdfs and turned them into event logs.
And we sessionized the event logs and made visit logs out of them.
That decision to make a table for visits, with a row per user session, turned out to be
important. Our data is stored as serialized sequences of events inside cascading tuples.
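To make the sessionizing step concrete, here is a minimal sketch in plain Python. It is not Etsy's pipeline and not Scalding code. The `Event` fields and the 30-minute inactivity cutoff are assumptions made up for illustration; the talk does not say how visits were delimited. Plain lists stand in for the serialized event sequences.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical event shape; the real beacon schema is not given in the talk.
    user_id: str
    timestamp: int  # seconds since epoch
    event_type: str


# Assumption: a visit ends after 30 minutes without activity.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events):
    """Turn a flat event log into visits: one ordered list of events per session."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event.user_id].append(event)

    visits = []
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e.timestamp)
        current = [user_events[0]]
        for event in user_events[1:]:
            if event.timestamp - current[-1].timestamp > SESSION_GAP_SECONDS:
                # Gap too long: close the current visit and start a new one.
                visits.append(current)
                current = [event]
            else:
                current.append(event)
        visits.append(current)
    return visits


events = [
    Event("u1", 0, "search"),
    Event("u1", 60, "view_listing"),
    Event("u1", 4000, "search"),
    Event("u2", 10, "view_listing"),
]
visits = sessionize(events)
```

With this sample data, user `u1` ends up with two visits because more than 30 minutes pass between the second and third events.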
So even though we just wanted this feature, well, what the hell did we just do. We just started
building an analytics system I guess.
The next thing we knew we had a proprietary tool for analyzing A/B tests. Go figure.
By 2013 we definitely had our own giant analytics stack. It was built, racked, and debugged.
And it was right about then that scalding blew the whole thing to smithereens.
The thing that caused this was that we had hired Avi Bryant, who some of you may know as
one of the authors of scalding. And something of a group theory crank. And just an all-around
amazing smart guy.
And as an amazing smart guy, when Avi joined Etsy he had some cover to get a little rogue
with things.
And what he did with that cover was that he added scalding to the build. And then he started
trying to make things with it. Etsy’s not bureaucratic in any way I understand the word. But in
theory there’s supposed to be at least some discussion before you start using a new
framework. That didn’t happen at all with Scalding.
And immediately after this, he up and quit. So the force of his intellect and personality doesn’t
explain scalding’s runaway success. If that’s all it was about everyone would have stopped
using it the minute he left. But the opposite of that happened.
About a year ago we had this giant cascading.jruby system, which was starting to get mature.
But by last October the official policy was to rewrite the few pieces that were left in Scalding.
There’s a technical reason this happened, which I think is interesting, but at the same time it’s
pretty simple.
I think it’s simple enough that I can show it to you in a couple of examples. Let’s say that we
want to count how many visits searched for any given search term.
In other words we want to find every search and every visit, and produce a table like this.
Search terms to the number of visits that entered them.
The cascading.jruby job is really simple and straightforward. It looks like this. Don’t worry
about understanding it or anything, the point is that it’s short and easy.
And the equivalent scalding job is also really short and simple.
Conceptually they’re both just doing this.
You unroll the search events, then you grab the search terms out of them, then you just group
and count.
And both scalding and cascading.jruby manage to factor that into one mapreduce step. And in
this case they both perform identically.
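The slide code is not in this transcript, so here is a rough sketch of the same unroll, extract, group-and-count shape in plain Python. This is neither the Scalding nor the cascading.jruby job. It assumes a visit is a list of `(event_type, payload)` pairs, which is a made-up stand-in for the real visit records, and it counts each visit once per distinct term.

```python
from collections import Counter


def search_terms(visit):
    """Unroll a visit's events and keep the distinct search terms it entered."""
    return {payload for event_type, payload in visit if event_type == "search"}


def visits_per_term(visits):
    """Group by search term and count the visits that searched for it."""
    counts = Counter()
    for visit in visits:
        # Updating with a set means a visit counts once per term,
        # even if the same term was searched several times.
        counts.update(search_terms(visit))
    return counts


visits = [
    [("search", "scarf"), ("view", "listing-1"), ("search", "scarf")],
    [("search", "scarf"), ("search", "mug")],
    [("view", "listing-2")],
]
term_counts = visits_per_term(visits)
```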
But you can start to see the difference if you add just one more layer of complexity. Let’s say
that we wanted to count up the search terms again, but this time relate them to purchases that
happen after them in visits.
Like this. We want a table showing how many visits searched for a thing, and another column
giving how many of those visits bought something.
In this case the scalding job is not that much more complicated. It’s still just about this long.
And scalding manages to get this done in one mapreduce step again. It’s just unrolling the
searches out of the visits like it was before, and grouping with a sum.
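As a sketch of why this fits in a single pass, the plain Python below walks each visit once and emits, per search term, whether a purchase came after it. Those pairs are then grouped and summed. It is an illustration under assumptions, not the job from the slide: the `(event_type, payload)` visit shape and the function names are hypothetical.

```python
from collections import defaultdict


def terms_with_conversion(visit):
    """For one visit, map each search term to whether a purchase followed it."""
    converted = {}
    for event_type, payload in visit:
        if event_type == "search":
            # Keep an earlier True if the term was already followed by a purchase.
            converted.setdefault(payload, False)
        elif event_type == "purchase":
            # Every term searched so far in this visit now counts as converted.
            converted = dict.fromkeys(converted, True)
    return converted


def search_conversions(visits):
    """Return term -> (visits that searched it, visits that then purchased)."""
    totals = defaultdict(lambda: [0, 0])
    for visit in visits:
        for term, bought in terms_with_conversion(visit).items():
            totals[term][0] += 1
            totals[term][1] += int(bought)
    return {term: tuple(pair) for term, pair in totals.items()}


visits = [
    [("search", "scarf"), ("purchase", "listing-1")],
    [("purchase", "listing-2"), ("search", "scarf")],
    [("search", "mug"), ("search", "scarf"), ("purchase", "listing-3")],
]
result = search_conversions(visits)
```

The per-visit function plays the role of the inline user-defined function: ordinary code in the job's own language, run over each record before the grouping.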
The jruby job, on the other hand, no longer fits on the slide. It’s in this gist if anyone wants to
look at it.
I can show you what it does schematically. You make two branches, one for the searches and
one for the purchases. Then you cross join them and filter that shit down. And then you wind
up with a branch for conversions per search term and a branch for visits per term, and you
join those back together to get your answer.
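The gist itself is not reproduced here, so the following is only an in-memory imitation of that branch-and-join shape in plain Python. The visit layout and all names are assumptions. Each list or set below stands in for what would be a separate pipe branch, and nothing here models the number of mapreduce steps.

```python
from collections import Counter


def search_conversions_via_joins(visits):
    # Branch 1: (visit id, position, term) for every search event.
    searches = [
        (vid, pos, payload)
        for vid, visit in enumerate(visits)
        for pos, (event_type, payload) in enumerate(visit)
        if event_type == "search"
    ]
    # Branch 2: (visit id, position) for every purchase event.
    purchases = [
        (vid, pos)
        for vid, visit in enumerate(visits)
        for pos, (event_type, _) in enumerate(visit)
        if event_type == "purchase"
    ]

    # Cross join the two branches, then filter down to purchases
    # that happen after a search in the same visit.
    converted = {
        (s_vid, term)
        for s_vid, s_pos, term in searches
        for p_vid, p_pos in purchases
        if s_vid == p_vid and p_pos > s_pos
    }

    # One branch for visits per term, one for conversions per term.
    searched = {(vid, term) for vid, _, term in searches}
    visits_per_term = Counter(term for _, term in searched)
    conversions_per_term = Counter(term for _, term in converted)

    # Join the two aggregate branches back together on the search term.
    return {
        term: (count, conversions_per_term.get(term, 0))
        for term, count in visits_per_term.items()
    }


visits = [
    [("search", "scarf"), ("purchase", "listing-1")],
    [("purchase", "listing-2"), ("search", "scarf")],
    [("search", "mug"), ("search", "scarf"), ("purchase", "listing-3")],
]
joined = search_conversions_via_joins(visits)
```

Even in this toy form, the cross join touches every search and purchase pair before filtering, which hints at where the extra work comes from.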
So the pure cascading.jruby solution is more complicated. And it also turns out to be a lot
slower, too. Cascading doesn’t have a query optimizer, and this might be a lot closer if it did.
But it doesn’t, so jruby winds up being done in many more mapreduce steps and takes like
eight times longer.
If we go back to the scalding code for a second:
This here is the feature that killed cascading.jruby. We just wrote a cascading user-defined
function without even having to realize that that’s what we were doing.
Now it’s not impossible to fix this in cascading.jruby, or in other frameworks that don’t give you
easy access to UDFs.
You can indeed go write a cascading operation to do the same thing and use it from those.
But in reality, even though it comes up constantly, nobody wants to do that. You have to
change files, and you have to change programming languages. Those hurdles are enough to
make people write slower jobs.
For example we had one job that was a major resource problem in JRuby, which was taking
seven hours to run every night. Someone rewrote it in scalding in a day or two and got it down
to 20 minutes. The problem wasn’t that anything was impossible in cascading.jruby. The point
is merely that scalding makes doing it the right way feel natural.
So easy user defined functions swept all before them.
But I don’t think scalding is all peaches and cream.
You could say we only have two complaints.
This is a talk about scalding. So I’m going to spare you my list of cascading gripes. You
probably have your own. I will say that if you do, using a DSL on top of cascading doesn’t help
with any of them.
Very flippantly, this is basically the problem. Scala is too far from what most of our engineers
are using on a daily basis. It’s too weird. I assure you Kellan’s not this crotchety in reality. And
he’s probably mad at me for paraphrasing this from memory.
I firmly believe that analytics is for everyone. I don’t mean statistical modeling, or machine
learning, or things like that. But I do think that asking straightforward questions about the thing
you’re tasked with building should be for everyone.
What I mean by that is, let’s say we have a project to do.
Etsy’s a relatively enlightened place, by software industry standards anyway. So everyone
gets some time at the beginning and the end of that project to do quote-unquote “analysis.” It's
"thinking time." And the stuff called "work" gets done in the middle.
But I think this more accurately describes reality. We’re all still carrying the baggage of 20th
century software around with us. So analysis up front, which you’d do to see if you can make
a case for doing the feature at all, feels like you’re not working. And the stuff in the middle
feels like you’re really making progress. Even if it’s progress on something that could never
actually work.
That’s how it is everywhere, more or less. This is the social framework everybody’s working
inside of. So as somebody who really believes that analytics up front is powerful, I want to
give everyone the best chance possible.
And scala is just too different from what other Etsy programmers are using day to day. Don’t
mistake this as me saying they’re not smart enough, because they are. And it's not that
learning FP wouldn't be good for everyone, because I think it would be. And it's not that functional
programming is fundamentally too hard, or anything like that. It’s just a statement of fact. Most
programmers I know are not experienced with functional programming, and scala shares
many functional idioms.
So the analysis process winds up looking like this. Between asking the question and getting
an answer there’s this weird period in the middle where you have to learn a bunch of category
theory. Sure it’s good for them, or something. But it’s also going to stop them from getting
their answer.
Ideally things would look more like this.
So someone should go build that.
Scalding at Etsy

  • 2. So hey everybody, my name is Dan McKinley
  • 4. I worked for Etsy for 6.5 years, mostly from Brooklyn. In an office considerably less sparse than this one, I assure you. Mea culpa, that’s “worked” in the past tense. I quit to join a startup last month. After signing up to give this talk. But I left on very good terms so I’m still doing it.
  • 5. This talk’s about Scalding, and how we wound up using it at Etsy.
  • 6. When I was writing this talk this passage from Douglas Adams kept popping into my brain. I do feel like we had scalding thrust upon us at Etsy, rather than choosing it intentionally. Which is not the same as saying that I was personally unhappy with it, exactly. I was not. This is the character that went on to try to insult every being in the cosmos in alphabetical order. So I’m not sure if it was intended as intentional allegory about the scala community.
  • 7. The first thing I wanted to do was give an overview of how Etsy uses scalding now.
  • 8. This is hopefully the only strata-esque slide in the talk. Don’t run for the exits or anything. What I want to communicate with it is that in abstract, we aggregate logs from the live site, put them on hdfs. Then from there we crunch them to build internal tooling and features. For live features we’re putting job outputs into mysql shards; for backend tools we typically use a BI database (vertica) to fill the same need.
  • 9. Scalding gets used at all points on the hadoop side. Parsing logs, generating recommendations and ranking datasets, and business intelligence is all either done in Scalding or will be ported to Scalding very shortly.
  • 10. There are a bunch of ways that people use analytics at Etsy. The way you get your answers depends on the kind of question you’re asking.
  • 11. I’ll go through some examples. This is a simple one. Let’s say you just want to know how many shops open up a day.
  • 12. That’s a pretty common question. And so somebody’s thought of it way before you, and they’ve put it on a dashboard. So you can just go look at the dashboard.
  • 13. Another kind of question is one about how an A/B test you’re running is doing.
  • 14. We do a lot of A/B testing at Etsy, so much so that we’ve built our own A/B analyzer frontend called Catapult. So for most questions relating to variants in A/B tests you can go to that.
  • 15. Then there are slightly more complicated questions. Like, how many of the top sellers sell vintage goods? Maybe you’re the first person to ever ask such a question.
  • 16. But, people have thought of questions that are kind of similar to it before. And in most of those cases you can go ask the BI database.
  • 17. And then there are questions that are even farther out there. Cases where you’re probably the first person to ask not just this specifically, but you’re also probably the first person to ask any question even similar to it. Like this one. Etsy gets traffic to items that are sold. How often could we redirect that traffic to items that have close tags and titles?
  • 18. That’s the kind of thing you’d use scalding to answer today. We have the data in theory, but we haven’t normalized it and put it in BI. Or maybe it’s too big to fit in BI.
  • 19. A very common kind of novel question relates to debugging A/B tests.
  • 20. We do a ton of that with scalding too.
  • 21. I conceptualize our data universe as having three domains.
  • 22. There are questions we’ve anticipated, questions we didn’t anticipate, and then there are permanent systems.
  • 23. Like I said, we have tooling support for the first domain. And we use scalding for the second two.
  • 24. That’s questions where the data needed to get an answer is in a relatively raw form, which I’ll wave my hands and call analysis. And then we also build features and systems with scalding, which is more like what I’d call “engineering.” We do work for ranking, for recommendations, and so on in scalding.
  • 25. Let me give you some idea for how big of a thing this is.
  • 26. It’s pretty big, I guess. When I quit we had about 800 scalding jobs in source control. And if everyone is like me, there are probably twice as many in working directories, not committed. Only about 90 of those, though, run as part of our nightly batch process.
  • 27. 58 people had written scalding jobs.
  • 28. And 14 of them figured out how to use Algebird. Etsy’s engineering team, by the way, is like 150 programmers.
  • 29. This histogram showing how many jobs people have written is about what you’d expect. There’s a small group of people like me who have written a ton of jobs. And most people have written one or two jobs.
  • 30. And the way it breaks down across the domains is like this. Most of the people using scalding are using it to answer analytics questions. The experts tend to be the people building systems with scalding.
  • 31. So why would we pick scalding?
  • 32. Well, we didn’t really pick it on purpose. It was an accident.
  • 33. To explain how that accident happened I guess I first have to explain how we got started with analytics.
  • 34. And that was kind of an accident too. We didn’t necessarily set out to build something to replace Google Analytics.
  • 35. What we did do was buy an advertising startup called Adtuitive back in 2009.
  • 36. And those guys brought something with them called cascading.jruby. For our purposes you can consider this to be pretty close to Pig, but using JRuby.
  • 37. This is a really simple example of a job written in cascading.jruby. Hopefully you’ll just believe me that the Java equivalent would be Byzantine.
  • 38. The thing we wanted to get out of that acquisition was this feature. Paid promoted listings that you see when you search on Etsy. In the beginning we pretty much just wanted to build whatever we needed to have this.
  • 39. But to do that we needed things like impression logging and frontend feedback. So we started collecting event beacons from our frontend.
  • 40. And shipped those beacon logs to hdfs and turned them into event logs.
  • 41. And we sessionized the event logs and made visit logs out of them.
  • 42. That decision to make a table for visits, with a row per user session, turned out to be important. Our data is stored as serialized sequences of events inside cascading tuples.
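To make the beacon-to-visit step concrete, here is a minimal sketch of sessionizing in plain Python rather than Cascading. The event shape `(user_id, timestamp, name)`, the `sessionize` helper, and the 30-minute inactivity cutoff are all assumptions for illustration; the talk does not say how Etsy actually drew visit boundaries.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity cutoff; not a figure from the talk.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Turn (user_id, timestamp, name) events into one row per visit.

    Each returned row is (user_id, [(timestamp, name), ...]), which mirrors
    the idea of a visit holding its own ordered sequence of events.
    """
    visits = []
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        current, last_ts = [], None
        for _, ts, name in user_events:
            # A long enough silence closes the current visit.
            if last_ts is not None and ts - last_ts > gap:
                visits.append((user_id, current))
                current = []
            current.append((ts, name))
            last_ts = ts
        visits.append((user_id, current))
    return visits


# Hypothetical event log, timestamps in seconds.
events = [
    ("u1", 0, "search"),
    ("u2", 10, "view"),
    ("u1", 60, "view"),
    ("u1", 4000, "purchase"),
]
print(sessionize(events))
```

Keeping the whole event sequence inside the visit row is what lets later jobs ask ordering questions (did a purchase come after a search?) without joining anything.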
  • 43. So even though we just wanted this feature, well, what the hell did we just do. We just started building an analytics system I guess.
  • 44. The next thing we knew we had a proprietary tool for analyzing A/B tests. Go figure.
  • 45. By 2013 we definitely had our own giant analytics stack. It was built, racked, and debugged. And it was right about then that scalding blew the whole thing to smithereens.
  • 46. The thing that caused this was that we had hired Avi Bryant, who some of you may know as one of the authors of scalding. And something of a group theory crank. And just an all-around amazing smart guy.
  • 47. And as an amazing smart guy, when Avi joined Etsy he had some cover to get a little rogue with things.
  • 48. And what he did with that cover was that he added scalding to the build. And then he started trying to make things with it. Etsy’s not bureaucratic in any way I understand the word. But in theory there’s supposed to be at least some discussion before you start using a new framework. That didn’t happen at all with Scalding.
  • 49. And immediately after this, he up and quit. So the force of his intellect and personality doesn’t explain scalding’s runaway success. If that’s all it was about everyone would have stopped using it the minute he left. But the opposite of that happened.
  • 50. About a year ago we had this giant cascading.jruby system, which was starting to get mature.
  • 51. But by last October the official policy was to rewrite the few pieces that were left in Scalding.
  • 52. There’s a technical reason this happened, which I think is interesting, but at the same time it’s pretty simple.
  • 53. I think it’s simple enough that I can show it to you in a couple of examples. Let’s say that we want to count how many visits searched for any given search term.
  • 54. In other words we want to find every search and every visit, and produce a table like this. Search terms to the number of visits that entered them.
  • 55. The cascading.jruby job is really simple and straightforward. It looks like this. Don’t worry about understanding it or anything, the point is that it’s short and easy.
  • 56. And the equivalent scalding job is also really short and simple.
  • 57. Conceptually they’re both just doing this.
  • 58. You unroll the search events, then you grab the search terms out of them, then you just group and count.
  • 59. And both scalding and cascading.jruby manage to factor that into one mapreduce step. And in this case they both perform identically.
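The unroll, extract, group, and count steps above can be sketched in plain Python. This is not the Scalding or cascading.jruby job from the slides; the visit layout (a dict with an `events` list of `(kind, payload)` pairs) and the function names are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical visit log: one row per visit, each holding its ordered events.
visits = [
    {"visit_id": 1, "events": [("search", "scarf"), ("view", "item-9"), ("search", "hat")]},
    {"visit_id": 2, "events": [("search", "scarf"), ("search", "scarf")]},
    {"visit_id": 3, "events": [("view", "item-4")]},
]


def search_terms(visit):
    """Unroll a visit into the distinct search terms it entered."""
    return {payload for kind, payload in visit["events"] if kind == "search"}


def visits_per_term(visits):
    """Count how many visits searched for each term (group and count)."""
    counts = Counter()
    for visit in visits:
        # A term counts once per visit, however many times it was searched.
        counts.update(search_terms(visit))
    return dict(counts)


print(visits_per_term(visits))
```

The "map" side is `search_terms` and the "reduce" side is the counter, which is why a job like this fits in a single mapreduce step.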
  • 60. But you can start to see the difference if you add just one more layer of complexity. Let’s say that we wanted to count up the search terms again, but this time relate them to purchases that happen after them in visits.
  • 61. Like this. We want a table showing how many visits searched for a thing, and another column giving how many of those visits bought something.
  • 62. In this case the scalding job is not that much more complicated. It’s still just about this long.
  • 63. And scalding manages to get this done in one mapreduce step again. It’s just unrolling the searches out of the visits like it was before, and grouping with a sum.
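Here is a rough plain-Python sketch of why this can stay in one pass: because each visit row carries its own ordered events, a small function can walk the events once and decide, per search term, whether a purchase came later. The event tuples and function names are hypothetical, and this is an illustration of the idea rather than the Scalding job on the slide.

```python
from collections import defaultdict


def terms_with_conversion(events):
    """Map each search term in one visit to whether a purchase followed it."""
    converted = {}
    for kind, payload in events:
        if kind == "search":
            # Keep an earlier True if the term was already converted.
            converted.setdefault(payload, False)
        elif kind == "purchase":
            # Every term searched so far in this visit now counts as converted.
            for term in converted:
                converted[term] = True
    return converted


def search_conversions(visits):
    """Return term -> (visits that searched, visits that bought afterwards)."""
    totals = defaultdict(lambda: [0, 0])
    for events in visits:
        for term, bought in terms_with_conversion(events).items():
            totals[term][0] += 1
            totals[term][1] += int(bought)
    return {term: tuple(pair) for term, pair in totals.items()}


# Hypothetical visits, each an ordered list of (kind, payload) events.
visits = [
    [("search", "scarf"), ("purchase", "item-1")],
    [("purchase", "item-2"), ("search", "scarf")],
    [("search", "scarf"), ("search", "hat"), ("purchase", "item-3")],
]
print(search_conversions(visits))
```

The per-visit function is the part that plays the role of an inline user-defined function; the grouping with a sum is unchanged from the simpler job.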
  • 64. The jruby job, on the other hand, no longer fits on the slide. It’s in this gist if anyone wants to look at it.
  • 65. I can show you what it does schematically. You make two branches, one for the searches and one for the purchases. Then you cross join them and filter that shit down. And then you wind up with a branch for conversions per search term and a branch for visits per term, and you join those back together to get your answer.
  • 66. So the pure cascading.jruby solution is more complicated. And it also turns out to be a lot slower, too. Cascading doesn’t have a query optimizer, and this might be a lot closer if it did. But it doesn’t, so the jruby job winds up running in many more mapreduce steps and takes like eight times longer.
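For contrast, the branch-and-join shape described above can be mimicked in plain Python. This is only my reading of the schematic, not the code in the gist: the row layouts and the function name are assumptions, and a real Cascading flow would run each join and aggregate as its own step rather than as in-memory comprehensions.

```python
from collections import Counter
from itertools import product


def search_conversions_by_join(visits):
    """Return term -> (visits that searched, visits that bought afterwards)."""
    # Branch 1: one row per search event, (visit id, position, term).
    searches = [
        (vid, pos, payload)
        for vid, events in enumerate(visits)
        for pos, (kind, payload) in enumerate(events)
        if kind == "search"
    ]
    # Branch 2: one row per purchase event, (visit id, position).
    purchases = [
        (vid, pos)
        for vid, events in enumerate(visits)
        for pos, (kind, _) in enumerate(events)
        if kind == "purchase"
    ]
    # Cross join the branches, then filter down to purchases that come
    # after a search within the same visit.
    converted_pairs = {
        (s_vid, term)
        for (s_vid, s_pos, term), (p_vid, p_pos) in product(searches, purchases)
        if s_vid == p_vid and p_pos > s_pos
    }
    # Two aggregate branches: conversions per term and visits per term.
    conversions_per_term = Counter(term for _, term in converted_pairs)
    searched_pairs = {(vid, term) for vid, _, term in searches}
    visits_per_term = Counter(term for _, term in searched_pairs)
    # Join the aggregates back together on the search term.
    return {
        term: (n, conversions_per_term.get(term, 0))
        for term, n in visits_per_term.items()
    }


# Hypothetical visits, each an ordered list of (kind, payload) events.
visits = [
    [("search", "scarf"), ("purchase", "item-1")],
    [("purchase", "item-2"), ("search", "scarf")],
    [("search", "scarf"), ("search", "hat"), ("purchase", "item-3")],
]
print(search_conversions_by_join(visits))
```

The answer is the same table, but it takes two scans, a cross join, two aggregations, and a final join to get there, which is the kind of structure that turns into extra mapreduce steps.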
  • 67. If we go back to the scalding code for a second
  • 68. This here is the feature that killed cascading jruby. We just wrote a cascading user-defined function without even having to realize that that’s what we were doing.
  • 69. Now it’s not impossible to fix this in cascading.jruby, or in other frameworks that don’t give you easy access to UDFs.
  • 70. You can indeed go write a cascading operation to do the same thing and use it from those.
  • 71. But in reality, even though it comes up constantly, nobody wants to do that. You have to change files, and you have to change programming languages. Those hurdles are enough to make people write slower jobs.
  • 72. For example we had one job that was a major resource problem in JRuby, which was taking seven hours to run every night. Someone rewrote it in scalding in a day or two and got it down to 20 minutes. The problem wasn’t that anything was impossible in cascading.jruby. The point is merely that scalding makes doing it the right way feel natural.
  • 73. So easy user defined functions swept all before them.
  • 74. But I don’t think scalding is all peaches and cream.
  • 75. You could say we only have two complaints.
  • 76. This is a talk about scalding. So I’m going to spare you my list of cascading gripes. You probably have your own. I will say that if you do, using a DSL on top of cascading doesn’t help with any of them.
  • 77. Very flippantly, this is basically the problem. Scala is too far from what most of our engineers are using on a daily basis. It’s too weird. I assure you Kellan’s not this crotchety in reality. And he’s probably mad at me for paraphrasing this from memory.
  • 78. I firmly believe that analytics is for everyone. I don’t mean statistical modeling, or machine learning, or things like that. But I do think that asking straightforward questions about the thing you’re tasked with building should be for everyone.
  • 79. What I mean by that is, let’s say we have a project to do.
  • 80. Etsy’s a relatively enlightened place, by software industry standards anyway. So everyone gets some time at the beginning and the end of that project to do quote-unquote “analysis.” It's "thinking time." And the stuff called "work" gets done in the middle.
  • 81. But I think this more accurately describes reality. We’re all still carrying the baggage of 20th century software around with us. So analysis up front, which you’d do to see if you can make a case for doing the feature at all, feels like you’re not working. And the stuff in the middle feels like you’re really making progress. Even if it’s progress on something that could never actually work.
  • 82. That’s how it is everywhere, more or less. This is the social framework everybody’s working inside of. So as somebody who really believes that analytics up front is powerful, I want to give everyone the best chance possible.
  • 83. And scala is just too different from what other Etsy programmers are using day to day. Don’t mistake this as me saying they’re not smart enough, because they are. And it's not that learning FP wouldn't be good for everyone, because I think it would be. And it's not that functional programming is fundamentally too hard, or anything like that. It’s just a statement of fact. Most programmers I know are not experienced with functional programming, and scala leans on many functional idioms.
  • 84. So the analysis process winds up looking like this. Between asking the question and getting an answer there’s this weird period in the middle where you have to learn a bunch of category theory. Sure it’s good for them, or something. But it’s also going to stop them from getting their answer.
  • 85. Ideally things would look more like this.
  • 86. So someone should go build that.