SlideShare a Scribd company logo
1 of 33
Download to read offline
Open Source Software
for Data Scientists
Charlie Greenbacker, Director of Data Science28 Mar 2014
Altamira Technologies Corporation 2014
Agenda
■  What is a Data Scientist?
■  Why use Open Source Software?
■  Survey of Open Source Software Tools:
¤ Statistical Analysis
¤ Data Mining
¤ Machine Learning
¤ Natural Language Processing
¤ Social Network Analysis
¤ Data Visualization
Altamira Technologies Corporation 2014
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable photo: Columbia Pictures
Altamira Technologies Corporation 2014
Best reason for
not finishing PhD
Altamira Technologies Corporation 2014
@ExploreAltamira
What is a Data Scientist?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
Paul Cooper, ITProPortal.com
“A data scientist is someone who
understands the domains of
programming, machine learning,
data mining, statistics, and
hacking”
Computer Programming
Mathematics & Analytic Methodology
Distributed Computing & Big Data
Data Science
StatisticalAnalysis
DataMining
MachineLearning
NaturalLanguageProcessing
SocialNetworkAnalysis
DataVisualization
Domain Knowledge & Communication Skills
etc.Altamira Technologies Corporation 2014
Why use Open Source Software?
photo: Karen (https://flic.kr/p/5njby2)
THERE ARE NO SILVER BULLETS."
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
IF YOUR BOSS BUYS SOMETHING,"
YOU DAMN WELL BETTER USE IT."
photo: Valugi (http://bit.ly/1jrvVBC)
BUDGETS DON’T SCALE."
Survey of OSS Tools
Altamira Technologies Corporation 2014
Statistical Analysis
■  Name: R
■  Creator: Gentleman, Ihaka, et al.
■  License: GPL Version 2
■  Website: r-project.org
■  Source: cran.us.r-project.org/src/base/
■  Features:
¤  Language & environment for statistical computing & viz
¤  Linear and nonlinear modeling, classical statistical tests,
time-series analysis, graphical techniques, and more…
¤  5000+ packages available in CRAN repository
Altamira Technologies Corporation 2014
Data Mining
■  Name: Pandas
■  Creator: Wes McKinney, et al.
■  License: BSD 3-Clause License
■  Website: pandas.pydata.org
■  Source: github.com/pydata/pandas
■  Features:
¤  Data analysis workflow in Python
¤  DataFrame object for fast manipulation & indexing
¤  Tools for reading & writing data between formats
¤  Label-based slicing, indexing, and subsetting of data
Altamira Technologies Corporation 2014
Data Mining
■  Name: Impala
■  Creator: Cloudera
■  License: Apache License 2.0
■  Website: impala.io
■  Source: github.com/cloudera/impala
■  Features:
¤  MPP query engine implemented on Hadoop
¤  Low latency, high concurrency SQL & BI queries
¤  Same interfaces as Apache Hive, but ~24x faster
¤  Written in C++; does not use MapReduce
Altamira Technologies Corporation 2014
Machine Learning
■  Name: Mahout
■  Creator: ASF
■  License: Apache License 2.0
■  Website: mahout.apache.org
■  Source: svn.apache.org/viewvc/mahout
■  Features:
¤  Distributed/scalable ML library for Hadoop
¤  Classification, Clustering, Collaborative filtering
¤  Logistic regression, naïve Bayes, random forest, neural
networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Altamira Technologies Corporation 2014
Machine Learning
■  Name: Scikit-learn
■  Creator: Cournapeau, et al.
■  License: BSD 3-Clause License
■  Website: scikit-learn.org
■  Source: github.com/scikit-learn/scikit-learn
■  Features:
¤  ML library for Python built on NumPy, SciPy, matplotlib
¤  Support for classification, clustering, dimensionality
reduction, regression, model selection, preprocessing
¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
Altamira Technologies Corporation 2014
Machine Learning + NLP
■  Name: Mallet
■  Creator: UMass (McCallum, et al.)
■  License: Common Public License 1.0
■  Website: mallet.cs.umass.edu
■  Source: hg-iesl.cs.umass.edu/hg/mallet
■  Features:
¤  Java-based “Machine Learning for Language Toolkit”
¤  Document classification, clustering, topic modeling,
information extraction & sequence tagging, etc.
¤  Efficient implementation of LDA for topic modeling
Altamira Technologies Corporation 2014
Natural Language Processing
■  Name: NLTK
■  Creator: Bird, Loper, et al.
■  License: Apache License 2.0
■  Website: nltk.org
■  Source: github.com/nltk/nltk
■  Features:
¤  Natural Language Toolkit for Python
¤  Built-in support for dozens of corpora & trained models
¤  Libraries for classification, tokenization, stemming,
tagging, parsing, and semantic reasoning
Altamira Technologies Corporation 2014
Natural Language Processing
■  Name: Stanford CoreNLP
■  Creator: Stanford NLP Group
■  License: GPL Version 2
■  Website: nlp.stanford.edu/software/corenlp.shtml
■  Source: github.com/stanfordnlp/CoreNLP
■  Features:
¤  Suite of high-quality, Java-based NLP tools
¤  Includes POS tagger, named entity recognizer, parser,
coreference resolution, sentiment analysis, SUTime, etc.
¤  Includes models for English, Chinese, Arabic, German
Altamira Technologies Corporation 2014
NLP + Geospatial Analysis
■  Name: CLAVIN
■  Creator: Berico Technologies
■  License: Apache License 2.0
■  Website: clavin.io
■  Source: github.com/Berico-Technologies/CLAVIN
■  Features:
¤  Extracts location names from text, resolves to gazetteer
¤  Employs context-based geospatial entity resolution
¤  ~75% accuracy, processes 1M documents per hour
¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Altamira Technologies Corporation 2014
Social Network Analysis
■  Name: Gephi
■  Creator: UTC France
■  License: GPL Version 3
■  Website: gephi.org
■  Source: github.com/gephi/gephi
■  Features:
¤  Network analysis and visualization package for Java
¤  Dynamic network analysis with temporal filtering
¤  Metrics include: community detection, betweenness,
closeness, clustering coefficient, PageRank, etc.
Altamira Technologies Corporation 2014
Data Visualization
■  Name: D3.js
■  Creator: Mike Bostock
■  License: BSD 3-Clause License
■  Website: d3js.org
■  Source: github.com/mbostock/d3
■  Features:
¤  JavaScript library based on HTML, SVG, and CSS
¤  Binds data to DOM & enables transformations
¤  ~200 examples, including: force-directed graphs,
choropleths, treemaps, dendrograms, animations, etc.
Altamira Technologies Corporation 2014
Fusion, Analysis, and Visualization
■  Name: Lumify
■  Creator: Altamira
■  License: Apache License 2.0
■  Website: lumify.io
■  Source: github.com/altamiracorp/lumify
■  Features:
¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
¤  Integrates structured data, text, images, video
¤  Cell-level security & access controls
¤  Live, shared collaborative workspaces
Altamira Technologies Corporation 2014
Final Thought…
Save your $$$ for:
¨  People
¤  salaries, training, etc.
¨  Resources
¤  hardware, AWS, etc.
¨  Proprietary software
¤  if no viable OSS
alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
FINAL
THOUGHT
Springer’s
open source software for data scientists
oss4ds.com
Charlie Greenbacker | @greenbacker
www.oss4ds.com

More Related Content

Viewers also liked

Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Tao Cheng
 
Data Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
Data Science and Data Visualization (All about Data Analysis) by Pooja AjmeraData Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
Data Science and Data Visualization (All about Data Analysis) by Pooja AjmeraPooja Ajmera
 
Choosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalChoosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalHeather Choi
 
Microsoft NERD Talk - R and Tableau - 2-4-2013
Microsoft NERD Talk - R and Tableau - 2-4-2013Microsoft NERD Talk - R and Tableau - 2-4-2013
Microsoft NERD Talk - R and Tableau - 2-4-2013Tanya Cashorali
 
Using Salesforce, ERP, Tableau & R in Sales Forecasting
Using Salesforce, ERP, Tableau & R in Sales ForecastingUsing Salesforce, ERP, Tableau & R in Sales Forecasting
Using Salesforce, ERP, Tableau & R in Sales ForecastingSenturus
 
Performance data visualization with r and tableau
Performance data visualization with r and tableauPerformance data visualization with r and tableau
Performance data visualization with r and tableauEnkitec
 
R Markdown Tutorial For Beginners
R Markdown Tutorial For BeginnersR Markdown Tutorial For Beginners
R Markdown Tutorial For BeginnersRsquared Academy
 
Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Amazon Web Services
 
RMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRsquared Academy
 
Mit Romney 1040 tax return 2011
Mit Romney 1040 tax return 2011Mit Romney 1040 tax return 2011
Mit Romney 1040 tax return 2011Kit Seeborg
 
Revista C&S 21 junho/julho 2012
Revista C&S 21 junho/julho 2012Revista C&S 21 junho/julho 2012
Revista C&S 21 junho/julho 2012Ciclomídia
 
MeHI Privacy & Security Webinar 3.18.15
MeHI Privacy & Security Webinar 3.18.15MeHI Privacy & Security Webinar 3.18.15
MeHI Privacy & Security Webinar 3.18.15MassEHealth
 
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서잡코리아 글로벌 프런티어
 
World Academic Journal of Business & Applied Sciences (WAJBAS)
World Academic Journal of Business & Applied Sciences (WAJBAS) World Academic Journal of Business & Applied Sciences (WAJBAS)
World Academic Journal of Business & Applied Sciences (WAJBAS) World-Academic Journal
 

Viewers also liked (19)

Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...Building and deploying large scale real time news system with my sql and dist...
Building and deploying large scale real time news system with my sql and dist...
 
Data Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
Data Science and Data Visualization (All about Data Analysis) by Pooja AjmeraData Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
Data Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
 
Choosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_FinalChoosing a Data Visualization Tool for Data Scientists_Final
Choosing a Data Visualization Tool for Data Scientists_Final
 
Microsoft NERD Talk - R and Tableau - 2-4-2013
Microsoft NERD Talk - R and Tableau - 2-4-2013Microsoft NERD Talk - R and Tableau - 2-4-2013
Microsoft NERD Talk - R and Tableau - 2-4-2013
 
Using Salesforce, ERP, Tableau & R in Sales Forecasting
Using Salesforce, ERP, Tableau & R in Sales ForecastingUsing Salesforce, ERP, Tableau & R in Sales Forecasting
Using Salesforce, ERP, Tableau & R in Sales Forecasting
 
Performance data visualization with r and tableau
Performance data visualization with r and tableauPerformance data visualization with r and tableau
Performance data visualization with r and tableau
 
R Markdown Tutorial For Beginners
R Markdown Tutorial For BeginnersR Markdown Tutorial For Beginners
R Markdown Tutorial For Beginners
 
Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101Big Data & Analytics: End to End on AWS - Technical 101
Big Data & Analytics: End to End on AWS - Technical 101
 
RMySQL Tutorial For Beginners
RMySQL Tutorial For BeginnersRMySQL Tutorial For Beginners
RMySQL Tutorial For Beginners
 
Mit Romney 1040 tax return 2011
Mit Romney 1040 tax return 2011Mit Romney 1040 tax return 2011
Mit Romney 1040 tax return 2011
 
东吴-费森尤斯
东吴-费森尤斯东吴-费森尤斯
东吴-费森尤斯
 
Revista C&S 21 junho/julho 2012
Revista C&S 21 junho/julho 2012Revista C&S 21 junho/julho 2012
Revista C&S 21 junho/julho 2012
 
CDXC Corporate presentation
CDXC Corporate presentationCDXC Corporate presentation
CDXC Corporate presentation
 
MeHI Privacy & Security Webinar 3.18.15
MeHI Privacy & Security Webinar 3.18.15MeHI Privacy & Security Webinar 3.18.15
MeHI Privacy & Security Webinar 3.18.15
 
11 cdxc
11 cdxc11 cdxc
11 cdxc
 
Abn Amro
Abn AmroAbn Amro
Abn Amro
 
Introduction to Exponentials Insights 2016
Introduction to Exponentials Insights 2016Introduction to Exponentials Insights 2016
Introduction to Exponentials Insights 2016
 
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서
잡코리아 글로벌 프런티어 1기_노점순_탐방 계획서
 
World Academic Journal of Business & Applied Sciences (WAJBAS)
World Academic Journal of Business & Applied Sciences (WAJBAS) World Academic Journal of Business & Applied Sciences (WAJBAS)
World Academic Journal of Business & Applied Sciences (WAJBAS)
 

Similar to Open Source Software for Data Scientists -- BigConf 2014

Open Source Software for Data Scientists -- Great Wide Open 2014
Open Source Software for Data Scientists -- Great Wide Open 2014Open Source Software for Data Scientists -- Great Wide Open 2014
Open Source Software for Data Scientists -- Great Wide Open 2014Charlie Greenbacker
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production Paolo Platter
 
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming..."The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...Edge AI and Vision Alliance
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
 
Intro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversityIntro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversitySri Ambati
 
SAP & Open Souce - Give & Take
SAP & Open Souce - Give & TakeSAP & Open Souce - Give & Take
SAP & Open Souce - Give & TakeJan Penninkhof
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...Timothy Spann
 
IWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache AiravataIWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache Airavatamarpierc
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Intro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize SeattleIntro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize SeattleSri Ambati
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LASri Ambati
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Thamme Gowda
 
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...Timothy Spann
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsTimothy Spann
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014Matthew Vaughn
 
Intro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - DenverIntro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - DenverSri Ambati
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 

Similar to Open Source Software for Data Scientists -- BigConf 2014 (20)

Open Source Software for Data Scientists -- Great Wide Open 2014
Open Source Software for Data Scientists -- Great Wide Open 2014Open Source Software for Data Scientists -- Great Wide Open 2014
Open Source Software for Data Scientists -- Great Wide Open 2014
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
 
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming..."The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
"The OpenCV Open Source Computer Vision Library: What’s New and What’s Coming...
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
 
Intro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara UniversityIntro to H2O Machine Learning in R at Santa Clara University
Intro to H2O Machine Learning in R at Santa Clara University
 
SAP & Open Souce - Give & Take
SAP & Open Souce - Give & TakeSAP & Open Souce - Give & Take
SAP & Open Souce - Give & Take
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
 
IWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache AiravataIWSG2014: Developing Science Gateways Using Apache Airavata
IWSG2014: Developing Science Gateways Using Apache Airavata
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Intro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize SeattleIntro to H2O Machine Learning in Python - Galvanize Seattle
Intro to H2O Machine Learning in Python - Galvanize Seattle
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LA
 
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
 
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...Devfest uk & ireland  using apache nifi with apache pulsar for fast data on-r...
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014
 
Intro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - DenverIntro to Machine Learning with H2O and Python - Denver
Intro to Machine Learning with H2O and Python - Denver
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 

Recently uploaded

Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

Open Source Software for Data Scientists -- BigConf 2014

  • 1. Open Source Software for Data Scientists Charlie Greenbacker, Director of Data Science28 Mar 2014
  • 2. Altamira Technologies Corporation 2014 Agenda ■  What is a Data Scientist? ■  Why use Open Source Software? ■  Survey of Open Source Software Tools: ¤ Statistical Analysis ¤ Data Mining ¤ Machine Learning ¤ Natural Language Processing ¤ Social Network Analysis ¤ Data Visualization
  • 3. Altamira Technologies Corporation 2014 About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable photo: Columbia Pictures
  • 4. Altamira Technologies Corporation 2014 Best reason for not finishing PhD
  • 5. Altamira Technologies Corporation 2014 @ExploreAltamira
  • 6. What is a Data Scientist?
  • 7.
  • 8.
  • 9.
  • 10. credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
  • 11. http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/ Paul Cooper, ITProPortal.com “A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
  • 12. Computer Programming Mathematics & Analytic Methodology Distributed Computing & Big Data Data Science StatisticalAnalysis DataMining MachineLearning NaturalLanguageProcessing SocialNetworkAnalysis DataVisualization Domain Knowledge & Communication Skills etc.Altamira Technologies Corporation 2014
  • 13. Why use Open Source Software?
  • 15. photo: Paul Inkles (https://flic.kr/p/e2QMS5) IF YOUR BOSS BUYS SOMETHING," YOU DAMN WELL BETTER USE IT."
  • 17. Survey of OSS Tools
  • 18. Altamira Technologies Corporation 2014 Statistical Analysis ■  Name: R ■  Creator: Gentleman, Ihaka, et al. ■  License: GPL Version 2 ■  Website: r-project.org ■  Source: cran.us.r-project.org/src/base/ ■  Features: ¤  Language & environment for statistical computing & viz ¤  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more… ¤  5000+ packages available in CRAN repository
  • 19. Altamira Technologies Corporation 2014 Data Mining ■  Name: Pandas ■  Creator: Wes McKinney, et al. ■  License: BSD 3-Clause License ■  Website: pandas.pydata.org ■  Source: github.com/pydata/pandas ■  Features: ¤  Data analysis workflow in Python ¤  DataFrame object for fast manipulation & indexing ¤  Tools for reading & writing data between formats ¤  Label-based slicing, indexing, and subsetting of data
  • 20. Altamira Technologies Corporation 2014 Data Mining ■  Name: Impala ■  Creator: Cloudera ■  License: Apache License 2.0 ■  Website: impala.io ■  Source: github.com/cloudera/impala ■  Features: ¤  MPP query engine implemented on Hadoop ¤  Low latency, high concurrency SQL & BI queries ¤  Same interfaces as Apache Hive, but ~24x faster ¤  Written in C++; does not use MapReduce
  • 21. Altamira Technologies Corporation 2014 Machine Learning ■  Name: Mahout ■  Creator: ASF ■  License: Apache License 2.0 ■  Website: mahout.apache.org ■  Source: svn.apache.org/viewvc/mahout ■  Features: ¤  Distributed/scalable ML library for Hadoop ¤  Classification, Clustering, Collaborative filtering ¤  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
  • 22. Altamira Technologies Corporation 2014 Machine Learning ■  Name: Scikit-learn ■  Creator: Cournapeau, et al. ■  License: BSD 3-Clause License ■  Website: scikit-learn.org ■  Source: github.com/scikit-learn/scikit-learn ■  Features: ¤  ML library for Python built on NumPy, SciPy, matplotlib ¤  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing ¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
  • 23. Altamira Technologies Corporation 2014 Machine Learning + NLP ■  Name: Mallet ■  Creator: UMass (McCallum, et al.) ■  License: Common Public License 1.0 ■  Website: mallet.cs.umass.edu ■  Source: hg-iesl.cs.umass.edu/hg/mallet ■  Features: ¤  Java-based “Machine Learning for Language Toolkit” ¤  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc. ¤  Efficient implementation of LDA for topic modeling
  • 24. Altamira Technologies Corporation 2014 Natural Language Processing ■  Name: NLTK ■  Creator: Bird, Loper, et al. ■  License: Apache License 2.0 ■  Website: nltk.org ■  Source: github.com/nltk/nltk ■  Features: ¤  Natural Language Toolkit for Python ¤  Built-in support for dozens of corpora & trained models ¤  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
  • 25. Altamira Technologies Corporation 2014 Natural Language Processing ■  Name: Stanford CoreNLP ■  Creator: Stanford NLP Group ■  License: GPL Version 2 ■  Website: nlp.stanford.edu/software/corenlp.shtml ■  Source: github.com/stanfordnlp/CoreNLP ■  Features: ¤  Suite of high-quality, Java-based NLP tools ¤  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc. ¤  Includes models for English, Chinese, Arabic, German
  • 26. Altamira Technologies Corporation 2014 NLP + Geospatial Analysis ■  Name: CLAVIN ■  Creator: Berico Technologies ■  License: Apache License 2.0 ■  Website: clavin.io ■  Source: github.com/Berico-Technologies/CLAVIN ■  Features: ¤  Extracts location names from text, resolves to gazetteer ¤  Employs context-based geospatial entity resolution ¤  ~75% accuracy, processes 1M documents per hour ¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
  • 27. Altamira Technologies Corporation 2014 Social Network Analysis ■  Name: Gephi ■  Creator: UTC France ■  License: GPL Version 3 ■  Website: gephi.org ■  Source: github.com/gephi/gephi ■  Features: ¤  Network analysis and visualization package for Java ¤  Dynamic network analysis with temporal filtering ¤  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
  • 28. Altamira Technologies Corporation 2014 Data Visualization ■  Name: D3.js ■  Creator: Mike Bostock ■  License: BSD 3-Clause License ■  Website: d3js.org ■  Source: github.com/mbostock/d3 ■  Features: ¤  JavaScript library based on HTML, SVG, and CSS ¤  Binds data to DOM & enables transformations ¤  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
  • 29. Altamira Technologies Corporation 2014 Fusion, Analysis, and Visualization ■  Name: Lumify ■  Creator: Altamira ■  License: Apache License 2.0 ■  Website: lumify.io ■  Source: github.com/altamiracorp/lumify ■  Features: ¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc. ¤  Integrates structured data, text, images, video ¤  Cell-level security & access controls ¤  Live, shared collaborative workspaces
  • 30.
  • 31. Altamira Technologies Corporation 2014 Final Thought… Save your $$$ for: ¨  People ¤  salaries, training, etc. ¨  Resources ¤  hardware, AWS, etc. ¨  Proprietary software ¤  if no viable OSS alternative exists photo: Brett Weinstein (http://bit.ly/1dHXvqJ) FINAL THOUGHT Springer’s
  • 32. open source software for data scientists oss4ds.com
  • 33. Charlie Greenbacker | @greenbacker www.oss4ds.com