This presentation surveys open source software tools for data scientists. It begins by defining a data scientist and explaining why open source software is useful. It then summarizes popular open source tools for statistical analysis, data mining, machine learning, natural language processing, social network analysis, data visualization, and data fusion/analysis. The tools covered include R, Pandas, Impala, Mahout, Scikit-learn, Mallet, NLTK, Stanford CoreNLP, CLAVIN, NetworkX, Gephi, D3.js, and Lumify. It concludes by recommending that companies save their budgets for people, resources, and proprietary software only where no viable open source alternative exists.
Open Source Software for Data Scientists -- Great Wide Open 2014
1. Open Source Software
for Data Scientists
Charlie Greenbacker, Director of Data Science
02 Apr 2014
2. Altamira Technologies Corporation 2014
Agenda
■ What is a Data Scientist?
■ Why use Open Source Software?
■ Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Statistical Analysis
■ Name: R
■ Creator: Gentleman, Ihaka, et al.
■ License: GPL Version 2
■ Website: r-project.org
■ Source: cran.us.r-project.org/src/base/
■ Features:
¤ Language & environment for statistical computing & viz
¤ Linear and nonlinear modeling, classical statistical tests,
time-series analysis, graphical techniques, and more…
¤ 5000+ packages available in CRAN repository
Data Mining
■ Name: Pandas
■ Creator: Wes McKinney, et al.
■ License: BSD 3-Clause License
■ Website: pandas.pydata.org
■ Source: github.com/pydata/pandas
■ Features:
¤ Data analysis workflow in Python
¤ DataFrame object for fast manipulation & indexing
¤ Tools for reading & writing data between formats
¤ Label-based slicing, indexing, and subsetting of data
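The Pandas features listed above can be illustrated with a minimal sketch. The sample data, column names, and variable names below are hypothetical and invented for illustration only; they are not from the talk. The sketch assumes a reasonably recent pandas installation.

```python
import io

import pandas as pd

# Hypothetical toy data, made up purely for this illustration.
csv_text = """name,team,score
ann,red,3
bob,blue,5
cat,red,8
"""

# Reading: parse CSV text into a DataFrame, using "name" as the row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Label-based indexing: pick one cell by row label and column label.
bob_score = df.loc["bob", "score"]

# Label-based slicing: .loc slices include both endpoints.
first_two = df.loc["ann":"bob", ["score"]]

# Subsetting: keep only the rows whose "team" column equals "red".
red_team = df[df["team"] == "red"]

# Writing: serialize the subset to another format (JSON keyed by row label).
red_json = red_team.to_json(orient="index")
print(red_json)
```

The same DataFrame could be written with other `to_*` methods such as `to_csv`; JSON is used here only to show a conversion between formats.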
Data Mining
■ Name: Impala
■ Creator: Cloudera
■ License: Apache License 2.0
■ Website: impala.io
■ Source: github.com/cloudera/impala
■ Features:
¤ MPP query engine implemented on Hadoop
¤ Low latency, high concurrency SQL & BI queries
¤ Same interfaces as Apache Hive, but ~24x faster
¤ Written in C++; does not use MapReduce
Final Thought…
Save your $$$ for:
¨ People
¤ salaries, training, etc.
¨ Resources
¤ hardware, AWS, etc.
¨ Proprietary software
¤ if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)