Back to Square One: Building a Data Science Team from Scratch

Generally speaking, big data and data science originated in the West and are coming to Europe with a bit of a delay. There is at least one exception though: the London-based music discovery website Last.fm is a data company at heart and has been doing large-scale data processing and analysis for years. It started using Hadoop in early 2006, for instance, making it one of the earliest adopters worldwide. When I left Last.fm to join Massive Media, the social media company behind Netlog.com and Twoo.com, I basically moved from a data science forerunner to a newcomer. Massive Media had at least as much data to play with and tremendous potential, but they were not doing much with it yet. The data science team had to be built from the ground up, and every step had to be argued for and justified along the way. Having re-evaluated everything I learned at Last.fm and started over with a clean slate at Massive Media, I developed a pretty clear perspective on how to find good data scientists, what they should be doing, what tools they should be using, and how to organize them to work together efficiently as a team. That is precisely what I would like to share in this talk.

  1. 1. BUILDING DATA SCIENCE TEAMS FROM SCRATCH
        Klaas Bosteels, @klbostee
  2. 2. MY CAREER PATH SO FAR
        2007: Began working with big data as PhD student
        2009: Embarked on a data science career at Last.fm
          • Data company at heart; one of the earliest Hadoop adopters worldwide; inventors of Ketama; organised first “NoSQL” meetup in SF
        2011: Joined Massive Media as Lead Data Scientist
          • Huge audience and tremendous potential, but data science newcomer at the time
  3. 3. MY TEAM AT MASSIVE MEDIA
        + interns!
        Currently 4 permanent people, so not huge just yet
        Relatively big and growing faster than anticipated though
  4. 4. OUR MISSION IS HELPING THE COMPANY...
          • MEASURE: metrics dashboards
          • EVALUATE: data-driven testing
          • DECIDE: ad hoc data insights
          • IMPROVE: e.g. abuse detection
          • EXTEND: new product features
          • PROMOTE: PR via data porn
  5. 5. OUR MISSION IS HELPING THE COMPANY...
          • MEASURE: metrics dashboards
          • EVALUATE: data-driven testing
          • DECIDE: ad hoc data insights
          • IMPROVE: e.g. abuse detection
          • EXTEND: new product features
          • PROMOTE: PR via data porn
        Annotation: higher risk but bigger returns
  6. 6. OUR MISSION IS HELPING THE COMPANY...
          • MEASURE: metrics dashboards
          • EVALUATE: data-driven testing
          • DECIDE: ad hoc data insights
          • IMPROVE: e.g. abuse detection
          • EXTEND: new product features
          • PROMOTE: PR via data porn
        Annotations: higher risk but bigger returns; very wide range of tasks
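The EVALUATE item on the mission slide (data-driven testing) does not say how tests were scored. As one possible illustration only, here is a minimal sketch of a pooled two-proportion z-test for comparing conversion rates between two variants. The function name and the sample numbers are hypothetical and do not come from the talk.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants converted never or always: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 200 of 10,000 users convert on variant A,
# 260 of 10,000 on variant B.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but the threshold to act on is a business decision that the sketch does not make for you.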
  7. 7. STEP 1: FOLLOW THE MONEY (photo by Chris Isherwood)
  8. 8. BOOTSTRAP BY SAVING OR GAINING MONEY
        You need to get some capital to get started
        Saving money tends to be easier in practice
        Real-world example:
          • Analyzing CDN logs unveiled abuse
          • Stopping the abuse greatly reduced the bills
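The slide does not say what kind of abuse the CDN logs revealed or how the logs were analyzed. The sketch below assumes, purely for illustration, that the abuse was hotlinking from outside sites and that each log line holds a client IP, a referrer host, and a byte count. The log format, host names, and the 20% threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical, simplified CDN log lines: "<client_ip> <referrer_host> <bytes_sent>"
SAMPLE_LOG = """\
10.0.0.1 netlog.com 1200
10.0.0.2 netlog.com 900
10.0.0.3 hotlinker.example 50000
10.0.0.4 hotlinker.example 70000
10.0.0.5 twoo.com 1500
"""

def bytes_by_referrer(lines):
    """Sum bytes served per referrer host, skipping malformed lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        _ip, referrer, size = parts
        totals[referrer] += int(size)
    return totals

def suspicious_referrers(totals, known_hosts, min_share=0.2):
    """Return unknown referrers that account for at least min_share of all bytes."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        host
        for host, size in totals.items()
        if host not in known_hosts and size / grand_total >= min_share
    )

totals = bytes_by_referrer(SAMPLE_LOG.splitlines())
flagged = suspicious_referrers(totals, known_hosts={"netlog.com", "twoo.com"})
print(flagged)
```

At real log volumes the same aggregation would more plausibly run as a Hadoop job than in a single process; the in-memory version only shows the shape of the analysis.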
  9. 9. STEP 2: EMBRACE HADOOP (photo by Doug Kukurudza)
  10. 10. HADOOP
        Not the holy grail, but deserves a central role
        It has a vibrant community and is proven to be:
          • ECONOMICAL: runs on commodity hardware
          • SCALABLE: smart distributed processing
          • MAINTAINABLE: very robust and fault-tolerant
          • FLEXIBLE: predefined schemas not required
  11. 11. STEP 3: BUILD DASHBOARDS (photo by Dawn Hopkins)
  12. 12. STATS PIPELINE BASED ON HADOOP
        [Diagram] Components: Log collector, HDFS, MapReduce, HBase, Dashboards
        Flow labels: in batches, continuous
  13. 13. STATS PIPELINE BASED ON HADOOP
        [Diagram] Components: Log collector, HDFS, MapReduce, Realtime processing, HBase, Dashboards
        Flow labels: in batches, continuous
        Cf. “lambda architecture”, coined by @nathanmarz
  14. 14. STATS PIPELINE BASED ON HADOOP
        [Diagram] Components: Log collector, HDFS, MapReduce, Realtime processing, Ad-hoc results, HBase, Dashboards
        Flow labels: in batches, continuous
        Cf. “lambda architecture”, coined by @nathanmarz
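The pipeline slides point to the lambda architecture, where a batch layer periodically recomputes views from the full log while a realtime layer covers events that arrived since the last batch run. The toy sketch below illustrates that idea in a single process. The class and method names are hypothetical, and plain Python containers stand in for HDFS, MapReduce, and HBase; the slides do not show how the actual pipeline was wired.

```python
from collections import Counter

class StatsStore:
    """Toy lambda-architecture store: a batch view plus a realtime view."""

    def __init__(self):
        self.master_log = []            # stand-in for raw logs kept on HDFS
        self.batch_view = Counter()     # stand-in for batch job output stored in HBase
        self.realtime_view = Counter()  # counts for events since the last batch run

    def record(self, metric):
        """Continuous path: append to the log and update the realtime view."""
        self.master_log.append(metric)
        self.realtime_view[metric] += 1

    def run_batch(self):
        """Batch path: recompute the view from the full log, then reset the realtime view."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, metric):
        """Dashboards read the sum of both views."""
        return self.batch_view[metric] + self.realtime_view[metric]

store = StatsStore()
store.record("signup")
store.record("signup")
store.run_batch()        # the batch view now holds 2 signups
store.record("signup")   # arrives after the batch run, held in the realtime view
store.record("login")
print(store.query("signup"), store.query("login"))
```

Recomputing the batch view from the complete log, rather than updating it incrementally, means a bug in the realtime path is corrected by the next batch run.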
  15. 15. PYTHON IS AN AWESOME JACK OF ALL TRADES
        It is great for building dashboards:
          • Hadoop support: Dumbo, Python UDFs for Pig, ...
          • Several amazing web frameworks, e.g. Flask
          • Likewise for drawing graphs, e.g. PyCairo
        And it covers many other data science needs as well:
          • Scripting, prototyping and full-blown programming
          • NumPy, SciPy, PyLab, Scikit-learn, Pandas, ...
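To give a feel for what Python on Hadoop looks like, the sketch below writes a mapper and reducer as generator functions, which is the general shape Dumbo programs take. To stay runnable without a cluster or the Dumbo package, it uses a small local driver that simulates the shuffle step. The driver `run_locally`, the log format, and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    """Emit (event_type, 1) per log line; value is assumed to be '<user_id> <event_type>'."""
    fields = value.split()
    if len(fields) == 2:
        yield fields[1], 1

def reducer(key, values):
    """Sum the counts emitted for one event type."""
    yield key, sum(values)

def run_locally(records, mapper, reducer):
    """Simulate map, shuffle/sort, and reduce in a single process (not part of Dumbo)."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))  # the shuffle phase groups pairs by key
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (count for _, count in group)))
    return output

# Hypothetical activity log; the malformed line is skipped by the mapper.
LOG_LINES = ["u1 login", "u2 login", "u1 message", "bad", "u3 login"]
counts = dict(run_locally(enumerate(LOG_LINES), mapper, reducer))
print(counts)
```

Counts like these are the kind of aggregate a dashboard would chart; serving them through a web framework such as Flask is a separate step not shown here.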
  16. 16. STEP 4: ASSEMBLE A TEAM (photo by Jean-François Schmitz)
  17. 17. THE SECRET IS IN THE MIX
        Hadoop’s tricks also apply to data science teams
          • Avoid specialisation to allow easy distribution and scaling
          • Exploit data locality by hiring people with wide skill set
        Great Data Scientists have the right mix of skills
          • Hackers with solid technical background
          • Analytical mind that knows statistics and machine learning
          • Clever and creative in everything they do
  18. 18. STEP 5: EXPLORE & INNOVATE (photo by NASAr)
  19. 19. SOME TIPS AND TRICKS
        Dare to fail and/or start from estimates
        Introduce data exploration/innovation days
          • Basically 20% time devoted to playing with data
          • Incorporate brainstorming
          • Encourage collaboration
        Communicate findings to the rest of the company
          • Fun and silliness are allowed
          • Prototype early and often
  20. 20. FIVE SIMPLE STEPS IS ALL IT TAKES
        1 FOLLOW THE MONEY
        2 EMBRACE HADOOP
        3 BUILD DASHBOARDS
        4 ASSEMBLE A TEAM
        5 EXPLORE & INNOVATE
  21. 21. FIVE SIMPLE STEPS IS ALL IT TAKES
        1 FOLLOW THE MONEY
        2 EMBRACE HADOOP
        3 BUILD DASHBOARDS
        4 ASSEMBLE A TEAM
        5 EXPLORE & INNOVATE
        Thanks! Questions?
