Open Source Software
for Data Scientists
Charlie Greenbacker, Director of Data Science
02 Apr 2014
Altamira Technologies Corporation 2014
Agenda
■  What is a Data Scientist?
■  Why use Open Source Software?
■  Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for
not finishing PhD
@ExploreAltamira
What is a Data Scientist?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Paul Cooper, ITProPortal.com
“A data scientist is someone who
understands the domains of
programming, machine learning,
data mining, statistics, and
hacking”
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization
Domain Knowledge & Communication Skills
etc.
Why use Open Source Software?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING,
YOU DAMN WELL BETTER USE IT.
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE.
Survey of OSS Tools
Statistical Analysis
■  Name: R
■  Creator: Gentleman, Ihaka, et al.
■  License: GPL Version 2
■  Website: r-project.org
■  Source: cran.us.r-project.org/src/base/
■  Features:
¤  Language & environment for statistical computing & viz
¤  Linear and nonlinear modeling, classical statistical tests,
time-series analysis, graphical techniques, and more…
¤  5000+ packages available in CRAN repository
Data Mining
■  Name: Pandas
■  Creator: Wes McKinney, et al.
■  License: BSD 3-Clause License
■  Website: pandas.pydata.org
■  Source: github.com/pydata/pandas
■  Features:
¤  Data analysis workflow in Python
¤  DataFrame object for fast manipulation & indexing
¤  Tools for reading & writing data between formats
¤  Label-based slicing, indexing, and subsetting of data
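The Pandas features listed above can be sketched in a few lines. This is a minimal illustration, assuming a current pandas release; the sample data and the column names (city, year, visits) are invented for the example and do not come from the talk.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
csv_text = """city,year,visits
Atlanta,2013,120
Atlanta,2014,150
Boston,2013,90
Boston,2014,110
"""

# Read CSV text into a DataFrame (reading data from one format).
df = pd.read_csv(io.StringIO(csv_text))

# Label-based indexing: use the city and year columns as the row index,
# then look up a single cell by its labels.
indexed = df.set_index(["city", "year"])
atlanta_2014 = indexed.loc[("Atlanta", 2014), "visits"]

# Label-based subsetting: keep only the 2014 rows and two named columns.
recent = df.loc[df["year"] == 2014, ["city", "visits"]]

# Write the subset out in another format (JSON records).
json_text = recent.to_json(orient="records")
print(atlanta_2014)
print(json_text)
```

The same DataFrame could be written with to_csv or other writers; JSON is used here only to show a round trip between two formats.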
Data Mining
■  Name: Impala
■  Creator: Cloudera
■  License: Apache License 2.0
■  Website: impala.io
■  Source: github.com/cloudera/impala
■  Features:
¤  MPP query engine implemented on Hadoop
¤  Low latency, high concurrency SQL & BI queries
¤  Same interfaces as Apache Hive, but ~24x faster
¤  Written in C++; does not use MapReduce
Machine Learning
■  Name: Mahout
■  Creator: ASF
■  License: Apache License 2.0
■  Website: mahout.apache.org
■  Source: svn.apache.org/viewvc/mahout
■  Features:
¤  Distributed/scalable ML library for Hadoop
¤  Classification, Clustering, Collaborative filtering
¤  Logistic regression, naïve Bayes, random forest, neural
networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
■  Name: Scikit-learn
■  Creator: Cournapeau, et al.
■  License: BSD 3-Clause License
■  Website: scikit-learn.org
■  Source: github.com/scikit-learn/scikit-learn
■  Features:
¤  ML library for Python built on NumPy, SciPy, matplotlib
¤  Support for classification, clustering, dimensionality
reduction, regression, model selection, preprocessing
¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
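As a rough sketch of the scikit-learn workflow named above (preprocessing, an SVM classifier, and cross-validation), the example below uses the iris dataset bundled with the library. It assumes a recent scikit-learn release, where the splitting and cross-validation helpers live in sklearn.model_selection; the 2014 releases placed them in a different module. The choice of dataset, kernel, and split size is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small dataset bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for testing; fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (feature scaling) followed by an SVM classifier.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Model selection aid: 5-fold cross-validation on the full dataset.
cv_scores = cross_val_score(model, X, y, cv=5)
print(test_accuracy, cv_scores.mean())
```

Wrapping the scaler and classifier in one pipeline keeps the scaling step inside each cross-validation fold, so the held-out fold does not influence the scaling parameters.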
Machine Learning + NLP
■  Name: Mallet
■  Creator: UMass (McCallum, et al.)
■  License: Common Public License 1.0
■  Website: mallet.cs.umass.edu
■  Source: hg-iesl.cs.umass.edu/hg/mallet
■  Features:
¤  Java-based “Machine Learning for Language Toolkit”
¤  Document classification, clustering, topic modeling,
information extraction & sequence tagging, etc.
¤  Efficient implementation of LDA for topic modeling
Natural Language Processing
■  Name: NLTK
■  Creator: Bird, Loper, et al.
■  License: Apache License 2.0
■  Website: nltk.org
■  Source: github.com/nltk/nltk
■  Features:
¤  Natural Language Toolkit for Python
¤  Built-in support for dozens of corpora & trained models
¤  Libraries for classification, tokenization, stemming,
tagging, parsing, and semantic reasoning
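Two of the NLTK features listed above, tokenization and stemming, can be tried without downloading any corpora or trained models. The sketch below assumes a current NLTK release and deliberately uses the regex-based wordpunct_tokenize, which needs no extra data; the sample sentence is invented for the example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample sentence, invented for illustration only.
text = "Data scientists are analyzing networks and building models."

# Tokenization: split the text into word and punctuation tokens.
tokens = wordpunct_tokenize(text)

# Stemming: reduce each alphabetic token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

print(tokens)
print(stems)
```

Tagging, parsing, and the bundled corpora generally require a one-time data download through nltk.download, which is why they are left out of this sketch.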
Natural Language Processing
■  Name: Stanford CoreNLP
■  Creator: Stanford NLP Group
■  License: GPL Version 2
■  Website: nlp.stanford.edu/software/corenlp.shtml
■  Source: github.com/stanfordnlp/CoreNLP
■  Features:
¤  Suite of high-quality, Java-based NLP tools
¤  Includes POS tagger, named entity recognizer, parser,
coreference resolution, sentiment analysis, SUTime, etc.
¤  Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
■  Name: CLAVIN
■  Creator: Berico Technologies
■  License: Apache License 2.0
■  Website: clavin.io
■  Source: github.com/Berico-Technologies/CLAVIN
■  Features:
¤  Extracts location names from text, resolves to gazetteer
¤  Employs context-based geospatial entity resolution
¤  ~75% accuracy, processes 1M documents per hour
¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
■  Name: NetworkX
■  Creator: Los Alamos National Lab
■  License: BSD 3-Clause License
■  Website: networkx.github.io
■  Source: github.com/networkx/networkx
■  Features:
¤  Python structures for graphs, digraphs, & multigraphs
¤  Support for creating, manipulating, & analyzing the
structure, dynamics, & functions of complex networks
¤  Provides standard graph algorithms & analysis metrics
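A small sketch of the NetworkX features above: building a graph, then applying a standard algorithm (shortest path) and two analysis metrics (degree and betweenness centrality). It assumes a current NetworkX release; the five-person network and its names are invented for the example.

```python
import networkx as nx

# Hypothetical undirected social network, invented for illustration only.
G = nx.Graph()
G.add_edges_from(
    [
        ("ann", "bob"),
        ("ann", "cat"),
        ("bob", "cat"),
        ("cat", "dan"),
        ("dan", "eve"),
    ]
)

# Analysis metrics: number of neighbors and betweenness centrality per node.
degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

# Standard graph algorithm: shortest path between two people.
path = nx.shortest_path(G, "ann", "eve")

# The node that lies on the most shortest paths between other nodes.
broker = max(betweenness, key=betweenness.get)
print(degree, path, broker)
```

Swapping nx.Graph for nx.DiGraph or nx.MultiGraph gives the directed and multi-edge structures mentioned in the feature list.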
Social Network Analysis
■  Name: Gephi
■  Creator: UTC France
■  License: GPL Version 3
■  Website: gephi.org
■  Source: github.com/gephi/gephi
■  Features:
¤  Network analysis and visualization package for Java
¤  Dynamic network analysis with temporal filtering
¤  Metrics include: community detection, betweenness,
closeness, clustering coefficient, PageRank, etc.
Data Visualization
■  Name: D3.js
■  Creator: Mike Bostock
■  License: BSD 3-Clause License
■  Website: d3js.org
■  Source: github.com/mbostock/d3
■  Features:
¤  JavaScript library based on HTML, SVG, and CSS
¤  Binds data to DOM & enables transformations
¤  ~200 examples, including: force-directed graphs,
choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
■  Name: Lumify
■  Creator: Altamira
■  License: Apache License 2.0
■  Website: lumify.io
■  Source: github.com/altamiracorp/lumify
■  Features:
¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
¤  Integrates structured data, text, images, video
¤  Cell-level security & access controls
¤  Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
¨  People
¤  salaries, training, etc.
¨  Resources
¤  hardware, AWS, etc.
¨  Proprietary software
¤  if no viable OSS
alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
FINAL
THOUGHT
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker | @greenbacker

Open Source Software for Data Scientists -- Great Wide Open 2014

  • 1. Open Source Software for Data Scientists Charlie Greenbacker, Director of Data Science02 Apr 2014
  • 2. Altamira Technologies Corporation 2014 Agenda ■  What is a Data Scientist? ■  Why use Open Source Software? ■  Survey of Open Source Software Tools: ¤ Statistical Analysis ¤ Data Mining ¤ Machine Learning ¤ Natural Language Processing ¤ Social Network Analysis ¤ Data Visualization
  • 3. Altamira Technologies Corporation 2014 About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable photo: Columbia Pictures
  • 4. Altamira Technologies Corporation 2014 Best reason for not finishing PhD
  • 5. Altamira Technologies Corporation 2014 @ExploreAltamira
  • 6. What is a Data Scientist?
  • 7.
  • 8.
  • 9.
  • 10. credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
  • 11. http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/ Paul Cooper, ITProPortal.com “A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
  • 12. Computer Programming Mathematics & Analytic Methodology Distributed Computing & Big Data Data Science StatisticalAnalysis DataMining MachineLearning NaturalLanguageProcessing SocialNetworkAnalysis DataVisualization Domain Knowledge & Communication Skills etc.Altamira Technologies Corporation 2014
  • 13. Why use Open Source Software?
  • 15. photo: Paul Inkles (https://flic.kr/p/e2QMS5) IF YOUR BOSS BUYS SOMETHING," YOU DAMN WELL BETTER USE IT."
  • 17. Survey of OSS Tools
  • 18. Statistical Analysis ■ Name: R ■ Creator: Gentleman, Ihaka, et al. ■ License: GPL Version 2 ■ Website: r-project.org ■ Source: cran.us.r-project.org/src/base/ ■ Features: ¤ Language & environment for statistical computing & viz ¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more… ¤ 5000+ packages available in CRAN repository
  • 19. Data Mining ■ Name: Pandas ■ Creator: Wes McKinney, et al. ■ License: BSD 3-Clause License ■ Website: pandas.pydata.org ■ Source: github.com/pydata/pandas ■ Features: ¤ Data analysis workflow in Python ¤ DataFrame object for fast manipulation & indexing ¤ Tools for reading & writing data between formats ¤ Label-based slicing, indexing, and subsetting of data
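The Pandas features listed on this slide (the DataFrame object, reading and writing between formats, and label-based slicing and subsetting) can be sketched in a few lines. This is a minimal illustration, not part of the original deck: the table reuses names, categories, and licenses from the survey slides, and the variable names are assumptions chosen for the example.

```python
import io

import pandas as pd

# Small illustrative table; values are taken from the survey slides.
csv_text = """name,category,license
R,Statistical Analysis,GPL Version 2
Pandas,Data Mining,BSD 3-Clause License
Impala,Data Mining,Apache License 2.0
Mahout,Machine Learning,Apache License 2.0
Scikit-learn,Machine Learning,BSD 3-Clause License
"""

# Read CSV text into a DataFrame, using the tool name as the row label.
tools = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Label-based indexing: a single cell, then a row slice.
# Unlike positional slicing, a label slice includes both endpoints.
pandas_license = tools.loc["Pandas", "license"]
pandas_to_mahout = tools.loc["Pandas":"Mahout", ["category"]]

# Subsetting with a boolean mask.
apache_tools = tools[tools["license"] == "Apache License 2.0"]

# Write the subset back out in a different format (JSON keyed by row label).
apache_json = apache_tools.to_json(orient="index")
```

Setting the tool name as the index is what makes the `.loc` lookups read like the slide's "label-based slicing"; with the default integer index the same selections would need positional or boolean indexing instead.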
  • 20. Data Mining ■ Name: Impala ■ Creator: Cloudera ■ License: Apache License 2.0 ■ Website: impala.io ■ Source: github.com/cloudera/impala ■ Features: ¤ MPP query engine implemented on Hadoop ¤ Low latency, high concurrency SQL & BI queries ¤ Same interfaces as Apache Hive, but ~24x faster ¤ Written in C++; does not use MapReduce
  • 21. Machine Learning ■ Name: Mahout ■ Creator: ASF ■ License: Apache License 2.0 ■ Website: mahout.apache.org ■ Source: svn.apache.org/viewvc/mahout ■ Features: ¤ Distributed/scalable ML library for Hadoop ¤ Classification, Clustering, Collaborative filtering ¤ Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
  • 22. Machine Learning ■ Name: Scikit-learn ■ Creator: Cournapeau, et al. ■ License: BSD 3-Clause License ■ Website: scikit-learn.org ■ Source: github.com/scikit-learn/scikit-learn ■ Features: ¤ ML library for Python built on NumPy, SciPy, matplotlib ¤ Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing ¤ SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
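To make the Scikit-learn feature list concrete, here is a minimal sketch that combines three of the items named on the slide: preprocessing, an SVM classifier, and cross-validation. It is not from the original deck. It assumes a recent scikit-learn release (where cross-validation lives in `sklearn.model_selection`) and uses the small iris dataset bundled with the library purely as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled toy dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Preprocessing and classifier chained into one estimator, so that
# scaling is re-fit on the training portion of every fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping the scaler and the SVM in a pipeline is a deliberate choice: fitting the scaler on the full dataset before cross-validating would leak information from the held-out folds into training.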
  • 23. Machine Learning + NLP ■ Name: Mallet ■ Creator: UMass (McCallum, et al.) ■ License: Common Public License 1.0 ■ Website: mallet.cs.umass.edu ■ Source: hg-iesl.cs.umass.edu/hg/mallet ■ Features: ¤ Java-based “Machine Learning for Language Toolkit” ¤ Document classification, clustering, topic modeling, information extraction & sequence tagging, etc. ¤ Efficient implementation of LDA for topic modeling
  • 24. Natural Language Processing ■ Name: NLTK ■ Creator: Bird, Loper, et al. ■ License: Apache License 2.0 ■ Website: nltk.org ■ Source: github.com/nltk/nltk ■ Features: ¤ Natural Language Toolkit for Python ¤ Built-in support for dozens of corpora & trained models ¤ Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
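Two of the NLTK capabilities named on this slide, tokenization and stemming, can be tried without downloading any corpora or trained models. The sketch below is an illustration added here, not part of the deck; it uses the regex-based `wordpunct_tokenize` and the Porter stemmer, and the sample sentence is simply adapted from the slide text.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Sample sentence adapted from the slide's feature list.
text = "Libraries for classification, tokenization, stemming, tagging, and parsing"

# Split into alphabetic and punctuation tokens (no model download required).
tokens = wordpunct_tokenize(text)

# Reduce each token to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Pair each original token with its stem for inspection.
token_stem_pairs = list(zip(tokens, stems))
```

Tagging and parsing, also listed on the slide, generally need additional NLTK data packages, which is why they are left out of this sketch.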
  • 25. Natural Language Processing ■ Name: Stanford CoreNLP ■ Creator: Stanford NLP Group ■ License: GPL Version 2 ■ Website: nlp.stanford.edu/software/corenlp.shtml ■ Source: github.com/stanfordnlp/CoreNLP ■ Features: ¤ Suite of high-quality, Java-based NLP tools ¤ Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc. ¤ Includes models for English, Chinese, Arabic, German
  • 26. NLP + Geospatial Analysis ■ Name: CLAVIN ■ Creator: Berico Technologies ■ License: Apache License 2.0 ■ Website: clavin.io ■ Source: github.com/Berico-Technologies/CLAVIN ■ Features: ¤ Extracts location names from text, resolves to gazetteer ¤ Employs context-based geospatial entity resolution ¤ ~75% accuracy, processes 1M documents per hour ¤ Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
  • 27. Social Network Analysis ■ Name: NetworkX ■ Creator: Los Alamos National Lab ■ License: BSD 3-Clause License ■ Website: networkx.github.io ■ Source: github.com/networkx/networkx ■ Features: ¤ Python structures for graphs, digraphs, & multigraphs ¤ Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks ¤ Provides standard graph algorithms & analysis metrics
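As a rough illustration of what "creating, manipulating, & analyzing" a network looks like in NetworkX, the sketch below builds a tiny undirected graph and runs one standard algorithm and one analysis metric. It is not from the original deck; the node names are hypothetical and stand in for people in a small social network.

```python
import networkx as nx

# Hypothetical five-person network: an undirected graph built from an edge list.
G = nx.Graph()
G.add_edges_from([
    ("ana", "ben"),
    ("ben", "cy"),
    ("cy", "dee"),
    ("cy", "eve"),
])

# Standard graph algorithm: shortest path between two nodes.
path = nx.shortest_path(G, "ana", "eve")

# Analysis metric: normalized betweenness centrality, i.e. the share of
# shortest paths between other node pairs that pass through each node.
betweenness = nx.betweenness_centrality(G)
most_central = max(betweenness, key=betweenness.get)

# Simple structural summary: number of neighbors per node.
degrees = dict(G.degree())
```

Swapping `nx.Graph()` for `nx.DiGraph()` or `nx.MultiGraph()` gives the directed and multi-edge structures mentioned on the slide, with the same edge-list construction.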
  • 28. Social Network Analysis ■ Name: Gephi ■ Creator: UTC France ■ License: GPL Version 3 ■ Website: gephi.org ■ Source: github.com/gephi/gephi ■ Features: ¤ Network analysis and visualization package for Java ¤ Dynamic network analysis with temporal filtering ¤ Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
  • 29. Data Visualization ■ Name: D3.js ■ Creator: Mike Bostock ■ License: BSD 3-Clause License ■ Website: d3js.org ■ Source: github.com/mbostock/d3 ■ Features: ¤ JavaScript library based on HTML, SVG, and CSS ¤ Binds data to DOM & enables transformations ¤ ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
  • 30. Fusion, Analysis, and Visualization ■ Name: Lumify ■ Creator: Altamira ■ License: Apache License 2.0 ■ Website: lumify.io ■ Source: github.com/altamiracorp/lumify ■ Features: ¤ Built on Hadoop, Storm, Accumulo, Elasticsearch, etc. ¤ Integrates structured data, text, images, video ¤ Cell-level security & access controls ¤ Live, shared collaborative workspaces
  • 32. Final Thought… Save your $$$ for: ¨ People ¤ salaries, training, etc. ¨ Resources ¤ hardware, AWS, etc. ¨ Proprietary software ¤ if no viable OSS alternative exists photo: Brett Weinstein (http://bit.ly/1dHXvqJ) Springer’s FINAL THOUGHT
  • 33. open source software for data scientists oss4ds.com
  • 34. Charlie Greenbacker | @greenbacker oss4ds.com