Open Source Software
for Data Scientists
Charlie Greenbacker, Director of Data Science
02 Apr 2014
Altamira Technologies Corporation 2014
Agenda
■  What is a Data Scientist?
■  Why use Open Source Software?
■  Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for
not finishing PhD
@ExploreAltamira
What is a Data Scientist?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Paul Cooper, ITProPortal.com
“A data scientist is someone who
understands the domains of
programming, machine learning,
data mining, statistics, and
hacking”
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization
Domain Knowledge & Communication Skills
etc.
Why use Open Source Software?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING,
YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
Survey of OSS Tools
Statistical Analysis
■  Name: R
■  Creator: Gentleman, Ihaka, et al.
■  License: GPL Version 2
■  Website: r-project.org
■  Source: cran.us.r-project.org/src/base/
■  Features:
¤  Language & environment for statistical computing & viz
¤  Linear and nonlinear modeling, classical statistical tests,
time-series analysis, graphical techniques, and more…
¤  5000+ packages available in CRAN repository
Data Mining
■  Name: Pandas
■  Creator: Wes McKinney, et al.
■  License: BSD 3-Clause License
■  Website: pandas.pydata.org
■  Source: github.com/pydata/pandas
■  Features:
¤  Data analysis workflow in Python
¤  DataFrame object for fast manipulation & indexing
¤  Tools for reading & writing data between formats
¤  Label-based slicing, indexing, and subsetting of data
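A minimal sketch of the Pandas features listed above, assuming a current pandas install. The toy data and the column names (`name`, `team`, `score`) are made up for illustration and are not from the talk.

```python
import io

import pandas as pd

# Made-up toy data; the column names are illustrative assumptions.
csv_text = "name,team,score\nana,red,3\nbo,blue,5\ncy,red,4\n"

# Read CSV text into a DataFrame, using the "name" column as row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Label-based indexing: one cell, addressed by row label and column label.
bo_score = df.loc["bo", "score"]

# Label-based subsetting: keep only the rows whose team is "red".
red = df.loc[df["team"] == "red"]

# Write the subset out in a different format (JSON records).
red_json = red.reset_index().to_json(orient="records")
```

The same DataFrame can be written to other formats with the matching `to_*` methods, such as `to_csv`.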
Data Mining
■  Name: Impala
■  Creator: Cloudera
■  License: Apache License 2.0
■  Website: impala.io
■  Source: github.com/cloudera/impala
■  Features:
¤  MPP query engine implemented on Hadoop
¤  Low latency, high concurrency SQL & BI queries
¤  Same interfaces as Apache Hive, but ~24x faster
¤  Written in C++; does not use MapReduce
Machine Learning
■  Name: Mahout
■  Creator: ASF
■  License: Apache License 2.0
■  Website: mahout.apache.org
■  Source: svn.apache.org/viewvc/mahout
■  Features:
¤  Distributed/scalable ML library for Hadoop
¤  Classification, Clustering, Collaborative filtering
¤  Logistic regression, naïve Bayes, random forest, neural
networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
■  Name: Scikit-learn
■  Creator: Cournapeau, et al.
■  License: BSD 3-Clause License
■  Website: scikit-learn.org
■  Source: github.com/scikit-learn/scikit-learn
■  Features:
¤  ML library for Python built on NumPy, SciPy, matplotlib
¤  Support for classification, clustering, dimensionality
reduction, regression, model selection, preprocessing
¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
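A short sketch of the scikit-learn workflow named above (preprocessing, k-NN classification, cross-validation). It assumes a recent scikit-learn release, where cross-validation helpers live in `sklearn.model_selection`; releases from around the time of this talk used a different module path. The bundled iris dataset and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Preprocessing + classifier chained into one estimator, so the scaler
# is re-fit on the training folds only during cross-validation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation; each entry is the accuracy on one held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Swapping `KNeighborsClassifier` for another estimator (for example `sklearn.svm.SVC`) leaves the rest of the code unchanged, which is the main appeal of the shared fit/predict interface.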
Machine Learning + NLP
■  Name: Mallet
■  Creator: UMass (McCallum, et al.)
■  License: Common Public License 1.0
■  Website: mallet.cs.umass.edu
■  Source: hg-iesl.cs.umass.edu/hg/mallet
■  Features:
¤  Java-based “Machine Learning for Language Toolkit”
¤  Document classification, clustering, topic modeling,
information extraction & sequence tagging, etc.
¤  Efficient implementation of LDA for topic modeling
Natural Language Processing
■  Name: NLTK
■  Creator: Bird, Loper, et al.
■  License: Apache License 2.0
■  Website: nltk.org
■  Source: github.com/nltk/nltk
■  Features:
¤  Natural Language Toolkit for Python
¤  Built-in support for dozens of corpora & trained models
¤  Libraries for classification, tokenization, stemming,
tagging, parsing, and semantic reasoning
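A small sketch of the tokenization and stemming pieces mentioned above, assuming NLTK is installed. It deliberately uses `RegexpTokenizer` and `PorterStemmer`, which need no downloaded corpora or trained models; the sample sentence is made up.

```python
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Made-up sample sentence for illustration.
text = "Data scientists are mining data and training models."

# Tokenize on runs of word characters, then lowercase each token.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [token.lower() for token in tokenizer.tokenize(text)]

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Count how often each stem occurs.
freq = FreqDist(stems)
```

Tagging and parsing follow the same pattern but typically require fetching models first with `nltk.download`, which is omitted here.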
Natural Language Processing
■  Name: Stanford CoreNLP
■  Creator: Stanford NLP Group
■  License: GPL Version 2
■  Website: nlp.stanford.edu/software/corenlp.shtml
■  Source: github.com/stanfordnlp/CoreNLP
■  Features:
¤  Suite of high-quality, Java-based NLP tools
¤  Includes POS tagger, named entity recognizer, parser,
coreference resolution, sentiment analysis, SUTime, etc.
¤  Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
■  Name: CLAVIN
■  Creator: Berico Technologies
■  License: Apache License 2.0
■  Website: clavin.io
■  Source: github.com/Berico-Technologies/CLAVIN
■  Features:
¤  Extracts location names from text, resolves to gazetteer
¤  Employs context-based geospatial entity resolution
¤  ~75% accuracy, processes 1M documents per hour
¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
■  Name: NetworkX
■  Creator: Los Alamos National Lab
■  License: BSD 3-Clause License
■  Website: networkx.github.io
■  Source: github.com/networkx/networkx
■  Features:
¤  Python structures for graphs, digraphs, & multigraphs
¤  Support for creating, manipulating, & analyzing the
structure, dynamics, & functions of complex networks
¤  Provides standard graph algorithms & analysis metrics
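A brief sketch of the NetworkX features listed above: building a graph, running a standard algorithm, and computing analysis metrics. It assumes NetworkX is installed; the five-person network is a made-up toy example.

```python
import networkx as nx

# Undirected toy social network; node names are illustrative only.
G = nx.Graph()
G.add_edges_from([
    ("ana", "bo"),
    ("bo", "cy"),
    ("cy", "dee"),
    ("bo", "dee"),
    ("dee", "eli"),
])

# Standard graph algorithm: shortest path between two people.
path = nx.shortest_path(G, "ana", "eli")

# Analysis metrics: who sits on the most shortest paths, and how
# tightly knit each person's neighborhood is.
betweenness = nx.betweenness_centrality(G)
clustering = nx.clustering(G)

# Directed graphs use the same API; here "bo" is followed by two people.
D = nx.DiGraph([("ana", "bo"), ("cy", "bo")])
bo_followers = D.in_degree("bo")
```

In this toy graph "bo" and "dee" tie for the highest betweenness because every route between the two ends of the network passes through them.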
Social Network Analysis
■  Name: Gephi
■  Creator: UTC France
■  License: GPL Version 3
■  Website: gephi.org
■  Source: github.com/gephi/gephi
■  Features:
¤  Network analysis and visualization package for Java
¤  Dynamic network analysis with temporal filtering
¤  Metrics include: community detection, betweenness,
closeness, clustering coefficient, PageRank, etc.
Data Visualization
■  Name: D3.js
■  Creator: Mike Bostock
■  License: BSD 3-Clause License
■  Website: d3js.org
■  Source: github.com/mbostock/d3
■  Features:
¤  JavaScript library based on HTML, SVG, and CSS
¤  Binds data to DOM & enables transformations
¤  ~200 examples, including: force-directed graphs,
choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
■  Name: Lumify
■  Creator: Altamira
■  License: Apache License 2.0
■  Website: lumify.io
■  Source: github.com/altamiracorp/lumify
■  Features:
¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
¤  Integrates structured data, text, images, video
¤  Cell-level security & access controls
¤  Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
■  People
¤  salaries, training, etc.
■  Resources
¤  hardware, AWS, etc.
■  Proprietary software
¤  if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
FINAL
THOUGHT
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker | @greenbacker

Open Source Software for Data Scientists -- Great Wide Open 2014

  • 1. Open Source Software for Data Scientists Charlie Greenbacker, Director of Data Science02 Apr 2014
  • 2. Altamira Technologies Corporation 2014 Agenda ■  What is a Data Scientist? ■  Why use Open Source Software? ■  Survey of Open Source Software Tools: ¤ Statistical Analysis ¤ Data Mining ¤ Machine Learning ¤ Natural Language Processing ¤ Social Network Analysis ¤ Data Visualization
  • 3. Altamira Technologies Corporation 2014 About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable photo: Columbia Pictures
  • 4. Altamira Technologies Corporation 2014 Best reason for not finishing PhD
  • 5. Altamira Technologies Corporation 2014 @ExploreAltamira
  • 6. What is a Data Scientist?
  • 7.
  • 8.
  • 9.
  • 10. credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
  • 11. http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/ Paul Cooper, ITProPortal.com “A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
  • 12. Computer Programming Mathematics & Analytic Methodology Distributed Computing & Big Data Data Science StatisticalAnalysis DataMining MachineLearning NaturalLanguageProcessing SocialNetworkAnalysis DataVisualization Domain Knowledge & Communication Skills etc.Altamira Technologies Corporation 2014
  • 13. Why use Open Source Software?
  • 15. photo: Paul Inkles (https://flic.kr/p/e2QMS5) IF YOUR BOSS BUYS SOMETHING," YOU DAMN WELL BETTER USE IT."
  • 17. Survey of OSS Tools
  • 18. Altamira Technologies Corporation 2014 Statistical Analysis ■  Name: R ■  Creator: Gentleman, Ihaka, et al. ■  License: GPL Version 2 ■  Website: r-project.org ■  Source: cran.us.r-project.org/src/base/ ■  Features: ¤  Language & environment for statistical computing & viz ¤  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more… ¤  5000+ packages available in CRAN repository
  • 19. Altamira Technologies Corporation 2014 Data Mining ■  Name: Pandas ■  Creator: Wes McKinney, et al. ■  License: BSD 3-Clause License ■  Website: pandas.pydata.org ■  Source: github.com/pydata/pandas ■  Features: ¤  Data analysis workflow in Python ¤  DataFrame object for fast manipulation & indexing ¤  Tools for reading & writing data between formats ¤  Label-based slicing, indexing, and subsetting of data
  • 20. Data Mining
    ■  Name: Impala
    ■  Creator: Cloudera
    ■  License: Apache License 2.0
    ■  Website: impala.io
    ■  Source: github.com/cloudera/impala
    ■  Features:
      ¤  MPP query engine implemented on Hadoop
      ¤  Low latency, high concurrency SQL & BI queries
      ¤  Same interfaces as Apache Hive, but ~24x faster
      ¤  Written in C++; does not use MapReduce
  • 21. Machine Learning
    ■  Name: Mahout
    ■  Creator: ASF
    ■  License: Apache License 2.0
    ■  Website: mahout.apache.org
    ■  Source: svn.apache.org/viewvc/mahout
    ■  Features:
      ¤  Distributed/scalable ML library for Hadoop
      ¤  Classification, Clustering, Collaborative filtering
      ¤  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
  • 22. Machine Learning
    ■  Name: Scikit-learn
    ■  Creator: Cournapeau, et al.
    ■  License: BSD 3-Clause License
    ■  Website: scikit-learn.org
    ■  Source: github.com/scikit-learn/scikit-learn
    ■  Features:
      ¤  ML library for Python built on NumPy, SciPy, matplotlib
      ¤  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
      ¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
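A short sketch of how the Scikit-learn pieces named on this slide (preprocessing, an SVM classifier, and cross-validation) can fit together. It assumes a recent scikit-learn release, where cross-validation helpers live in `sklearn.model_selection`; the iris dataset and the parameter values are illustrative choices, not taken from the deck.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small example dataset bundled with scikit-learn (illustrative choice).
X, y = load_iris(return_X_y=True)

# Preprocessing and classification chained into a single estimator,
# so scaling is re-fit on the training part of every fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping the scaler and classifier in one pipeline keeps test-fold data out of the scaling step, which is the usual reason to prefer it over scaling the full dataset up front.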
  • 23. Machine Learning + NLP
    ■  Name: Mallet
    ■  Creator: UMass (McCallum, et al.)
    ■  License: Common Public License 1.0
    ■  Website: mallet.cs.umass.edu
    ■  Source: hg-iesl.cs.umass.edu/hg/mallet
    ■  Features:
      ¤  Java-based “Machine Learning for Language Toolkit”
      ¤  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
      ¤  Efficient implementation of LDA for topic modeling
  • 24. Natural Language Processing
    ■  Name: NLTK
    ■  Creator: Bird, Loper, et al.
    ■  License: Apache License 2.0
    ■  Website: nltk.org
    ■  Source: github.com/nltk/nltk
    ■  Features:
      ¤  Natural Language Toolkit for Python
      ¤  Built-in support for dozens of corpora & trained models
      ¤  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
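A small sketch of two of the NLTK capabilities listed on this slide, tokenization and stemming. It deliberately uses only components that need no downloaded corpora or trained models (a regular-expression tokenizer and the Porter stemmer); the sample sentence is made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical input text, invented for this illustration.
text = "Data scientists keep running models and tagging documents."

# Tokenization: split on runs of word characters, dropping punctuation.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Stemming: reduce each token to a crude root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Pair each token with its stem for inspection.
token_to_stem = dict(zip(tokens, stems))
```

Taggers, parsers, and the bundled corpora generally require a separate data download step, which is why they are left out of this sketch.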
  • 25. Natural Language Processing
    ■  Name: Stanford CoreNLP
    ■  Creator: Stanford NLP Group
    ■  License: GPL Version 2
    ■  Website: nlp.stanford.edu/software/corenlp.shtml
    ■  Source: github.com/stanfordnlp/CoreNLP
    ■  Features:
      ¤  Suite of high-quality, Java-based NLP tools
      ¤  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
      ¤  Includes models for English, Chinese, Arabic, German
  • 26. NLP + Geospatial Analysis
    ■  Name: CLAVIN
    ■  Creator: Berico Technologies
    ■  License: Apache License 2.0
    ■  Website: clavin.io
    ■  Source: github.com/Berico-Technologies/CLAVIN
    ■  Features:
      ¤  Extracts location names from text, resolves to gazetteer
      ¤  Employs context-based geospatial entity resolution
      ¤  ~75% accuracy, processes 1M documents per hour
      ¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
  • 27. Social Network Analysis
    ■  Name: NetworkX
    ■  Creator: Los Alamos National Lab
    ■  License: BSD 3-Clause License
    ■  Website: networkx.github.io
    ■  Source: github.com/networkx/networkx
    ■  Features:
      ¤  Python structures for graphs, digraphs, & multigraphs
      ¤  Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
      ¤  Provides standard graph algorithms & analysis metrics
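A minimal sketch of the NetworkX workflow this slide describes: build a graph, then apply a standard algorithm and an analysis metric. The five-person network is hypothetical and exists only to make the example concrete.

```python
import networkx as nx

# Hypothetical friendship network, invented for this illustration.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("bob", "dave"),
    ("dave", "erin"),
])

# Basic structure: number of connections per person.
degree = dict(G.degree())

# Analysis metric: betweenness centrality (normalized by default),
# i.e. how often a node lies on shortest paths between other nodes.
betweenness = nx.betweenness_centrality(G)

# Standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "alice", "erin")
```

Swapping `nx.Graph()` for `nx.DiGraph()` or `nx.MultiGraph()` gives the directed and multi-edge structures mentioned on the slide, with largely the same calls.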
  • 28. Social Network Analysis
    ■  Name: Gephi
    ■  Creator: UTC France
    ■  License: GPL Version 3
    ■  Website: gephi.org
    ■  Source: github.com/gephi/gephi
    ■  Features:
      ¤  Network analysis and visualization package for Java
      ¤  Dynamic network analysis with temporal filtering
      ¤  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
  • 29. Data Visualization
    ■  Name: D3.js
    ■  Creator: Mike Bostock
    ■  License: BSD 3-Clause License
    ■  Website: d3js.org
    ■  Source: github.com/mbostock/d3
    ■  Features:
      ¤  JavaScript library based on HTML, SVG, and CSS
      ¤  Binds data to DOM & enables transformations
      ¤  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
  • 30. Fusion, Analysis, and Visualization
    ■  Name: Lumify
    ■  Creator: Altamira
    ■  License: Apache License 2.0
    ■  Website: lumify.io
    ■  Source: github.com/altamiracorp/lumify
    ■  Features:
      ¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
      ¤  Integrates structured data, text, images, video
      ¤  Cell-level security & access controls
      ¤  Live, shared collaborative workspaces
  • 32. Final Thought… Save your $$$ for:
    ¨  People
      ¤  salaries, training, etc.
    ¨  Resources
      ¤  hardware, AWS, etc.
    ¨  Proprietary software
      ¤  if no viable OSS alternative exists
    photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
  • 33. open source software for data scientists oss4ds.com
  • 34. Charlie Greenbacker | @greenbacker oss4ds.com