1. Data Science at Scale:
Using Apache Spark for Data Science
at Bitly
Sarah Guido
Data Day Seattle 2015
2. Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it doesn’t…
3. About me
• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
4. About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be using Spark
5. A bit of background
• Need for big data analysis tools
• MapReduce is a poor fit for exploratory data analysis
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
6. Bitly data!
• Legit big data
• 1 hour of decodes (clicks on Bitly links) is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
17. Exploratory data analysis
• Problem: what’s going on with my decodes?
• Solution: DataFrames!
– Similar to Pandas: describe, drop, fill, aggregate functions
– You can actually convert to a Pandas DataFrame (see the sketch below)!
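
A minimal PySpark sketch of the kind of DataFrame exploration described on this slide; the S3 path and column names (country, user_agent) are hypothetical, and sc is an existing SparkContext (e.g. from the pyspark shell).

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc: existing SparkContext

# Hypothetical input: one JSON record per decode
decodes = sqlContext.read.json("s3n://example-bucket/decodes/2015/07/01/*")

# Pandas-like helpers: summary stats, dropping/filling nulls, aggregation
decodes.describe().show()
cleaned = decodes.na.drop(subset=["country"]).na.fill({"user_agent": "unknown"})
by_country = cleaned.groupBy("country").count()

# Pull a small aggregated result down as an actual Pandas DataFrame
pdf = by_country.toPandas()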
18. Exploratory data analysis
• Get a sense of what’s going on in the data
• Look at distributions and frequencies (see the sketch below)
• Mostly categorical data here
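
Continuing the hypothetical decodes DataFrame from above, a sketch of simple frequency and distribution checks over categorical columns (column names are assumptions):

from pyspark.sql import functions as F

# Value frequencies for a categorical column, most common first
decodes.groupBy("referrer_domain").count().orderBy("count", ascending=False).show(20)

# Contingency table of two categorical columns
decodes.crosstab("country", "device_type").show()

# Cheap approximate cardinality on big data
decodes.agg(F.approxCountDistinct("user_id")).show()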
19. Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of; still needs comparison against other solutions
20. Topic modeling
• Oh, the JVM…
– LDA only in Scala
• Scala jar file
• Store script in S3
21. Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term frequency vectors as input (see the sketch below)
• “Note: LDA is a new feature with some missing functionality...”
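
At the time of this talk the LDA API was Scala-only (hence the jar above); purely to illustrate the term-frequency-vector input, here is a minimal sketch using the Python binding that later Spark releases added (pyspark.mllib.clustering.LDA). The tiny corpus and vocabulary are made up.

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# Each document is [doc_id, term-frequency vector over a shared vocabulary]
corpus = sc.parallelize([
    [0, Vectors.dense([2.0, 1.0, 0.0, 3.0])],
    [1, Vectors.dense([0.0, 4.0, 1.0, 0.0])],
    [2, Vectors.dense([1.0, 0.0, 5.0, 1.0])],
])

# Fit a 2-topic model; topicsMatrix() is a (vocab size x k) matrix of term weights
model = LDA.train(corpus, k=2)
print(model.topicsMatrix())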
28. Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app (see the sketch below)
• Python or Scala?
– Scala by force (LDA, GraphX)
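
For the streaming piece, a bare-bones Spark Streaming sketch; the socket source and tab-separated format are stand-ins for the real decode feed.

from pyspark.streaming import StreamingContext

# 60-second micro-batches on an existing SparkContext
ssc = StreamingContext(sc, batchDuration=60)

# Stand-in source; in practice this would be the real decode stream
lines = ssc.socketTextStream("localhost", 9999)

# Count decodes per country (assumes country is the first tab-separated field)
counts = (lines.map(lambda line: (line.split("\t")[0], 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()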
29. Some issues
• Hadoop servers
• JVM
• gzip (not a splittable format)
• Spark 1.4
• Resource allocation (see the sketch below)
• Really only got it to this stage very recently
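
On resource allocation, a sketch of the kind of knobs involved when running against a shared Hadoop cluster; the values are placeholders, not recommendations.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("decode-eda")
        .set("spark.executor.memory", "8g")               # per-executor heap
        .set("spark.executor.cores", "4")                 # cores per executor
        .set("spark.dynamicAllocation.enabled", "true"))  # let YARN scale executors
sc = SparkContext(conf=conf)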
30. Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with
the potential to expand into the product