Confessions of a Data Scientist


Data mining challenges that real data scientists have confessed, including missing data and data analysis.

Published in: Technology


  1. 8 challenges that data scientists have confessed (Salford Systems)
  2. #1 Not knowing when to STOP. This can be challenging because there is always the hope that your model or results can be improved a bit more, and a bit more, and just a little bit more. The point of diminishing returns is difficult to identify, and much more time may be spent for a very marginal benefit.
  3. #2 Guilty of data torture. "If you torture data long enough, it will confess." Any effect can be 'detected' by looking at the data in a certain, very specific way (even if there is no effect at all).
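The point about data torture is really the multiple-comparisons problem, and it is easy to demonstrate. The sketch below (all names and numbers are illustrative, not from the original deck) correlates a purely random target against many purely random "features"; with enough looks, at least one will appear impressively correlated by chance alone.

```python
import random

random.seed(0)
n, n_features = 30, 200  # 30 observations, 200 candidate predictors

def corr(x, y):
    """Pearson correlation, computed from scratch for self-containment."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Target and features are all independent noise: there is no real effect.
target = [random.gauss(0, 1) for _ in range(n)]
features = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

# Yet the best-looking feature will show a "strong" correlation anyway.
best = max(abs(corr(f, target)) for f in features)
print(f"best |r| among {n_features} random features: {best:.2f}")
```

With only 30 observations, the winner of this search typically shows |r| well above 0.4, which would look publishable if you forgot to mention the other 199 features you tried.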
  4. #3 Pretending there is a signal. A big challenge is what to do when the signal is not there, but the client expects it, especially when big $$$ is at stake. At this point your choices are rather grim: tell the truth and lose the contract, keep stalling in the hope that the client will keep paying, or massage your data to the point of seeing something that can be remotely presented as a success.
  5. #4 Being 'bossed' around. When your boss gives you an assignment to prove that he is right by doing some kind of data torture, it's time to move on.
  6. #5 Client communication (or lack thereof). How do you communicate to the client that the petabyte of data assembled over the years lacks a key variable needed to answer his business question? This is especially difficult when the client is the person who has historically been in charge of all data collection decisions.
  7. #6 Modeling method dilemmas. The challenge is choosing between a super-fast linear regression solution available on a Hadoop cluster and an ultra-slow neural net solution available on your desktop. The former has access to all of the data but takes no real advantage of it; the latter could be extremely useful, but you will have a heck of a time educating the IT person in charge on the merits of sampling and how it culminates in the famous Central Limit Theorem in statistics.
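The sampling argument you would make to that IT person can be sketched in a few lines (the population and sample sizes here are illustrative assumptions, not anything from the deck): the mean of a modest random sample already tracks the full-data mean, and the spread of sample means shrinks roughly like 1/sqrt(n), which is the Central Limit Theorem doing its job.

```python
import random
import statistics

random.seed(42)

# A skewed, "big data" stand-in: one million exponential draws (true mean = 1.0).
population = [random.expovariate(1.0) for _ in range(1_000_000)]
pop_mean = statistics.fmean(population)

# For each sample size, take 200 random samples and measure how much
# their means scatter around the population mean.
spreads = {}
for n in (100, 1_000, 10_000):
    means = [statistics.fmean(random.sample(population, n)) for _ in range(200)]
    spreads[n] = statistics.stdev(means)
    print(f"n={n:>6}: mean of sample means={statistics.fmean(means):.3f}, "
          f"spread={spreads[n]:.4f}")
```

A 10,000-row sample estimates the mean of a million rows with a spread around 0.01, so the desktop model fit on a sample may lose very little compared to the cluster run on everything.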
  8. #7 Being term-savvy. It can be difficult to stay up to date on all of the terminology people use these days to give new life to frequency tables and descriptive statistics. This is where the ultimate utility of Wikipedia comes to the rescue, or even Google Scholar for the more intrepid among us. If all else fails, you can always invent your own term or claim that in your domain the term mentioned has a different meaning.
  9. #8 Open source. 'Nuff said. A big challenge is using open-source software as much as possible and hoping that it actually works. Even worse is spending hours learning how to use it, only to discover that it can't do what you want because of some obscure memory limitation, a bizarre bug that occurs only on your workstation, or a run that takes forever to complete. Well, at least you did not have to pay for it, literally...
  10. Like what you've read? Subscribe to the blog: