How to develop a data scientist – What business has requested v02

Brendan Moran, Data Scientist @Greenplum EMC presentation @ds_ldn March 21st, 2012

Published in: Technology, Business

Transcript

  • 1. How to develop a data scientist: What business has requested. Big Data Meet Up, 21 March 2012. Brendan Moran, EMC Data Scientist. Data Computing Division. © Copyright 2010 EMC Corporation. All rights reserved.
  • 2. Context
      •  McKinsey report
          –  Technology and techniques
          –  Mind the gap
              •  140-160k deep analytic talent
              •  1.5m data savvy managers
          –  UK Top 6 in producing talent
      •  EMC Global Survey
      •  Kaggle.com
      •  Our clients
  • 3. Does it matter? Where’s the next generation coming from?
      Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv
  • 4. What about the UK?
      Big data: The next frontier for innovation, competition, and productivity, http://tinyurl.com/74tdfdv
  • 5. What’s been trending, courtesy of Datasift http://tinyurl.com/6vek5ge
      •  # views
          –  1747072  What is a data scientist? | Datablog http://t.co/tFfVvstm
          –  1537330  I love Oxford-style debates. This one at #strataconf: the data science debate: domain expertise or machine learning? http://t.co/jKGhx8AY
          –  1536264  #strataconf is amazing. Data science is the new black.
      •  Most popular links:
          –  281  2012-03-02 14:01  What is a data scientist? | Datablog | News | guardian.co.uk
          –  208  2012-03-08 11:25  bitly's Hilary Mason on "What is A Data Scientist?" - Forbes
          –  175  2012-03-02 15:24  A Data Scientist You’ve Never Heard of Is Now the Master of Your Domain
  • 6. Who do we have in tonight? Show of hands…
      •  Are you a data scientist?
          –  Beginner
          –  Proficient
          –  Expert?
  • 7. Snapshot of you: Tools in your toolbox
      •  Warehousing/Analytics (single response)
          –  SQL -> IBM -> Oracle
      •  How do you manipulate your data (multi-response)
          –  Excel -> SQL -> Python
      •  How do you analyse your data (multi-response)
          –  SAS -> STATA -> SPSS (R was last)
      •  How do you visualise your data (multi-response)
          –  MS BI tools -> Oracle -> IBM -> SAP -> Microstrategy
  • 8. Snapshot of you: Your traits
      •  You want a full set of data (53%)
      •  Only 13% were comfortable working with complete data
      •  “I explore the data | report what it says” – evenly distributed
      •  “My findings drive decisions, or report what has happened” – evenly distributed
  • 9. Why EMC?
      •  Award winner for enterprise development (TSIA 2011)
      •  Commitment to open source initiatives (Chorus)
      •  Relationships with 700 universities around the world – many already take our course content
      •  Carnegie Mellon
      •  Berkeley
  • 10. How did we form the content?
      •  Experts
      •  Kaggle
      •  Universities
      •  Our enterprise clients
  • 11. Guiding principles
      •  Open source – R, RStudio, SQL, Python
      •  Vendor neutral
      •  No licensing implications (important for universities)
          –  Community editions of MPP DB, Hadoop
      •  Applied learning – lots of labs (~40% time)
      •  Foundation course
  • 12. Where’s the bar?
      •  Solid understanding of statistics
      •  Experience with a scripting language (Java, Perl, Python, R)
      •  Experience with SQL (or PSQL)
  • 13. What’s on the course?
  • 14. Is this for you? Practice, practice, practice
      •  “Tell me and I forget. Show me and I remember. Involve me and I understand”
      •  40% is hands on lab
      •  Take “dirty” data, tidy it up, start exploring data, basic statistics, simple plots, complex stats, beautiful graphs, build models, test models, present your findings
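The first steps of the lab workflow on slide 14 (take "dirty" data, tidy it up, compute basic statistics) can be sketched in a few lines of Python. This is only an illustration: the rows in `RAW_ROWS` are invented for the example, and the helper names `tidy` and `summarise` are hypothetical, not taken from the course material.

```python
import statistics

# Hypothetical "dirty" rows: stray whitespace, inconsistent case,
# and missing or non-numeric values in the hours column.
RAW_ROWS = [
    {"education": " Bachelors ", "hours_per_week": "40"},
    {"education": "bachelors", "hours_per_week": "45"},
    {"education": "HS-grad", "hours_per_week": ""},
    {"education": "hs-grad", "hours_per_week": "38"},
    {"education": "Masters", "hours_per_week": "n/a"},
    {"education": "masters", "hours_per_week": "50"},
]


def tidy(rows):
    """Normalise the education label and drop rows whose hours cannot be parsed."""
    clean = []
    for row in rows:
        label = row["education"].strip().lower()
        try:
            hours = float(row["hours_per_week"])
        except ValueError:
            continue  # skip unparseable values such as "" or "n/a"
        clean.append({"education": label, "hours_per_week": hours})
    return clean


def summarise(rows):
    """Return {education: (row count, mean hours per week)}."""
    groups = {}
    for row in rows:
        groups.setdefault(row["education"], []).append(row["hours_per_week"])
    return {label: (len(hours), statistics.mean(hours)) for label, hours in groups.items()}


clean_rows = tidy(RAW_ROWS)
for label, (count, mean_hours) in sorted(summarise(clean_rows).items()):
    print(label, count, mean_hours)
```

Dropping unparseable rows is the simplest possible policy; a real analysis would first ask why the values are missing before discarding them.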
  • 15. Show me: hypothesis testing. Null & Alternative hypotheses
      •  Is there a difference?
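One way to make the "is there a difference?" question on slide 15 concrete is a permutation test. The slide does not say which test the course uses, so this is a swapped-in technique chosen because it needs only the standard library. The null hypothesis is that both groups come from the same distribution; the alternative is that their means differ. The function name `permutation_test` and the sample data are assumptions made for the sketch.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns an estimated p-value: the share of random relabellings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Hypothetical data: hours per week for two groups.
group_a = [38, 40, 41, 39, 42, 40]
group_b = [45, 47, 44, 48, 46, 49]
print("p-value:", permutation_test(group_a, group_b))
```

A small p-value means the observed difference would be rare if the null hypothesis were true; it does not by itself say how large or important the difference is.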
  • 16. Show me: hypothesis testing. How good is my model?
      •  Receiver Operating Characteristics (ROC)
          –  False positives
          –  True positives
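The ROC idea on slide 16 can be sketched as follows: sort the model's scores from high to low, lower the decision threshold step by step, and record the false positive rate and true positive rate at each step. The function names `roc_curve` and `auc` and the labels and scores are illustrative assumptions, not the course's code.

```python
def roc_curve(labels, scores):
    """Return a list of (false positive rate, true positive rate) points.

    labels are 1 (positive) or 0 (negative); a higher score means
    the model considers the item more likely to be positive.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(roc_curve(labels, scores)))  # 0.75
```

An area of 1.0 means every positive is ranked above every negative, while 0.5 is what random scoring would give on average.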
  • 17. Show me: visualising your data. Study hours by education
      # Code
      # http://tinyurl.com/7rbx7qs
      library(arules)
      data("AdultUCI")
      dframe = AdultUCI[, c("education", "hours-per-week")]
      colnames(dframe) = c("education", "hours_per_week")
      library(ggplot2)
      ggplot(dframe, aes(x=education, y=hours_per_week)) +
        geom_point(colour="lightblue", alpha=0.1, position="jitter") +
        geom_boxplot(outlier.size=0, alpha=0.2) +
        coord_flip()
  • 18. Show me: what problem do I have? How do I solve it?
      Problem to solve (Technique column left blank on this slide):
      •  Need to group items by similarity
      •  Need to discover relationships between actions or items
      •  Want to determine relationship between outcome and input variables
      •  Want to assign (known) labels to items
      •  Want to analyse my text
  • 19. Show me: what problem do I have? How do I solve it?
      Problem to solve: Technique (e.g.)
      •  Need to group items by similarity: Clustering (k-means)
      •  Need to discover relationships between actions or items: Association rules (Apriori)
      •  Want to determine relationship between outcome and input variables: Regression (linear/logistic)
      •  Want to assign (known) labels to items: Classification (Naïve Bayes, decision trees)
      •  Want to analyse my text: Regular expressions, Bag of words
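As an illustration of the first row of the table ("group items by similarity" with k-means), here is a minimal sketch of Lloyd's algorithm on 2-D points. To keep the run deterministic, the starting centroids are passed in by the caller; real implementations usually choose them at random or with a seeding scheme. The function name `kmeans` and the points are assumptions made for the example.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, centroids, max_iter=100):
    """Cluster 2-D points, starting from caller-supplied centroids.

    Returns (final centroids, cluster index assigned to each point).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, centroids=[(1, 1), (8, 8)])
print(assignment)
print(centroids)
```

The result depends on the starting centroids, so in practice it is common to run the algorithm several times from different starts and keep the best clustering.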
  • 20. So what now? What about you?
      •  If you know all of this already – congratulations!
      •  If you’d like to know more – our course goes live 26 March (register at http://education.emc.com)
      •  If you couldn’t care less – you’re probably in the wrong room