Open Source Software for Data Scientists
Charlie Greenbacker, Director of Data Science
28 Mar 2014
Altamira Technologies Corporation 2014
Open Source Software for Data Scientists -- BigConf 2014

As presented at BigConf on 28 March 2014 in Silver Spring, MD
http://www.bigconf.io/schedule/index#charlie_greenbacker
=========================

Harvard Business Review called data scientist "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag?

This talk briefly discusses what data science is, argues why open source software is usually the right choice for data scientists, and examines some of the leading OSS tools for data science available today. Topics include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials are provided on the presentation's companion website: oss4ds.com

Published in: Data & Analytics

  1. Open Source Software for Data Scientists
     Charlie Greenbacker, Director of Data Science
     28 Mar 2014
  2. Altamira Technologies Corporation 2014
     Agenda
     ■  What is a Data Scientist?
     ■  Why use Open Source Software?
     ■  Survey of Open Source Software Tools:
        ¤  Statistical Analysis
        ¤  Data Mining
        ¤  Machine Learning
        ¤  Natural Language Processing
        ¤  Social Network Analysis
        ¤  Data Visualization
  3. About me: @greenbacker
     Theories: popular tripe
     Methods: sloppy
     Conclusions: highly questionable
     photo: Columbia Pictures
  4. Best reason for not finishing PhD
  5. @ExploreAltamira
  6. What is a Data Scientist?
  7. credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
  8. “A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
     Paul Cooper, ITProPortal.com
     http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
  9. Diagram labels: Computer Programming; Mathematics & Analytic Methodology; Distributed Computing & Big Data; Data Science; Statistical Analysis; Data Mining; Machine Learning; Natural Language Processing; Social Network Analysis; Data Visualization; Domain Knowledge & Communication Skills; etc.
  10. Why use Open Source Software?
  11. THERE ARE NO SILVER BULLETS.
      photo: Karen (https://flic.kr/p/5njby2)
  12. IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.
      photo: Paul Inkles (https://flic.kr/p/e2QMS5)
  13. BUDGETS DON’T SCALE.
      photo: Valugi (http://bit.ly/1jrvVBC)
  14. Survey of OSS Tools
  15. Statistical Analysis
      ■  Name: R
      ■  Creator: Gentleman, Ihaka, et al.
      ■  License: GPL Version 2
      ■  Website: r-project.org
      ■  Source: cran.us.r-project.org/src/base/
      ■  Features:
         ¤  Language & environment for statistical computing & viz
         ¤  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
         ¤  5000+ packages available in CRAN repository
  16. Data Mining
      ■  Name: Pandas
      ■  Creator: Wes McKinney, et al.
      ■  License: BSD 3-Clause License
      ■  Website: pandas.pydata.org
      ■  Source: github.com/pydata/pandas
      ■  Features:
         ¤  Data analysis workflow in Python
         ¤  DataFrame object for fast manipulation & indexing
         ¤  Tools for reading & writing data between formats
         ¤  Label-based slicing, indexing, and subsetting of data
  17. Data Mining
      ■  Name: Impala
      ■  Creator: Cloudera
      ■  License: Apache License 2.0
      ■  Website: impala.io
      ■  Source: github.com/cloudera/impala
      ■  Features:
         ¤  MPP query engine implemented on Hadoop
         ¤  Low latency, high concurrency SQL & BI queries
         ¤  Same interfaces as Apache Hive, but ~24x faster
         ¤  Written in C++; does not use MapReduce
  18. Machine Learning
      ■  Name: Mahout
      ■  Creator: ASF
      ■  License: Apache License 2.0
      ■  Website: mahout.apache.org
      ■  Source: svn.apache.org/viewvc/mahout
      ■  Features:
         ¤  Distributed/scalable ML library for Hadoop
         ¤  Classification, Clustering, Collaborative filtering
         ¤  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
  19. Machine Learning
      ■  Name: Scikit-learn
      ■  Creator: Cournapeau, et al.
      ■  License: BSD 3-Clause License
      ■  Website: scikit-learn.org
      ■  Source: github.com/scikit-learn/scikit-learn
      ■  Features:
         ¤  ML library for Python built on NumPy, SciPy, matplotlib
         ¤  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
         ¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
  20. Machine Learning + NLP
      ■  Name: Mallet
      ■  Creator: UMass (McCallum, et al.)
      ■  License: Common Public License 1.0
      ■  Website: mallet.cs.umass.edu
      ■  Source: hg-iesl.cs.umass.edu/hg/mallet
      ■  Features:
         ¤  Java-based “Machine Learning for Language Toolkit”
         ¤  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
         ¤  Efficient implementation of LDA for topic modeling
  21. Natural Language Processing
      ■  Name: NLTK
      ■  Creator: Bird, Loper, et al.
      ■  License: Apache License 2.0
      ■  Website: nltk.org
      ■  Source: github.com/nltk/nltk
      ■  Features:
         ¤  Natural Language Toolkit for Python
         ¤  Built-in support for dozens of corpora & trained models
         ¤  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
  22. Natural Language Processing
      ■  Name: Stanford CoreNLP
      ■  Creator: Stanford NLP Group
      ■  License: GPL Version 2
      ■  Website: nlp.stanford.edu/software/corenlp.shtml
      ■  Source: github.com/stanfordnlp/CoreNLP
      ■  Features:
         ¤  Suite of high-quality, Java-based NLP tools
         ¤  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
         ¤  Includes models for English, Chinese, Arabic, German
  23. NLP + Geospatial Analysis
      ■  Name: CLAVIN
      ■  Creator: Berico Technologies
      ■  License: Apache License 2.0
      ■  Website: clavin.io
      ■  Source: github.com/Berico-Technologies/CLAVIN
      ■  Features:
         ¤  Extracts location names from text, resolves to gazetteer
         ¤  Employs context-based geospatial entity resolution
         ¤  ~75% accuracy, processes 1M documents per hour
         ¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
  24. Social Network Analysis
      ■  Name: Gephi
      ■  Creator: UTC France
      ■  License: GPL Version 3
      ■  Website: gephi.org
      ■  Source: github.com/gephi/gephi
      ■  Features:
         ¤  Network analysis and visualization package for Java
         ¤  Dynamic network analysis with temporal filtering
         ¤  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
  25. Data Visualization
      ■  Name: D3.js
      ■  Creator: Mike Bostock
      ■  License: BSD 3-Clause License
      ■  Website: d3js.org
      ■  Source: github.com/mbostock/d3
      ■  Features:
         ¤  JavaScript library based on HTML, SVG, and CSS
         ¤  Binds data to DOM & enables transformations
         ¤  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
  26. Fusion, Analysis, and Visualization
      ■  Name: Lumify
      ■  Creator: Altamira
      ■  License: Apache License 2.0
      ■  Website: lumify.io
      ■  Source: github.com/altamiracorp/lumify
      ■  Features:
         ¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
         ¤  Integrates structured data, text, images, video
         ¤  Cell-level security & access controls
         ¤  Live, shared collaborative workspaces
  27. Final Thought…
      Save your $$$ for:
      ¨  People
         ¤  salaries, training, etc.
      ¨  Resources
         ¤  hardware, AWS, etc.
      ¨  Proprietary software
         ¤  if no viable OSS alternative exists
      photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
      (image text: FINAL THOUGHT Springer’s)
  28. open source software for data scientists
      oss4ds.com
  29. Charlie Greenbacker | @greenbacker
      www.oss4ds.com
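
Slide 16 lists the pandas DataFrame object and "label-based slicing, indexing, and subsetting of data". A minimal sketch of what that can look like, assuming pandas is installed. The table contents and the column names (`tool`, `language`, `license`) are illustrative choices borrowed from the tool slides, not material from the talk:

```python
import pandas as pd

# Illustrative table built from the tool slides (assumed layout, not from the talk).
tools = pd.DataFrame(
    {
        "tool": ["pandas", "scikit-learn", "NLTK", "Mallet"],
        "language": ["Python", "Python", "Python", "Java"],
        "license": ["BSD 3-Clause", "BSD 3-Clause", "Apache 2.0", "CPL 1.0"],
    }
).set_index("tool")  # use the tool name as the row label

# Label-based indexing: pick one cell by row label and column label.
nltk_license = tools.loc["NLTK", "license"]

# Label-based slicing: .loc slices include both endpoints.
first_three = tools.loc["pandas":"NLTK"]

# Subsetting with a boolean mask built from a column.
python_tools = tools[tools["language"] == "Python"]

print(nltk_license)
print(list(first_three.index))
print(list(python_tools.index))
```

Unlike ordinary Python slices, a `.loc` label slice keeps the end label, which is why `"NLTK"` appears in `first_three`.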
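
Slide 19 names k-NN among the methods scikit-learn supports. To show the idea behind the name, here is a plain-Python sketch of k-nearest-neighbors classification. It uses only the standard library and is not scikit-learn's API or implementation. The function name `knn_predict` and the toy data are hypothetical, made up for illustration:

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy 2-D data: two well-separated groups (invented for this sketch).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.9, 5.0)))
```

A library implementation would typically add indexing structures for fast neighbor search and tooling for choosing `k`. The sketch sorts every training point per query, which is only reasonable for tiny datasets.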
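
Slide 24 lists PageRank among the network metrics Gephi computes. A rough standard-library sketch of PageRank by power iteration follows. It is not Gephi's implementation, and the damping factor 0.85, the iteration count, and the four-node example graph are assumptions chosen for illustration:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `graph` maps each node to a list of its out-neighbors; every neighbor
    must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical directed graph: "c" receives links from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

The ranks sum to 1 after every iteration, so they can be read as the share of time a random surfer spends on each node.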