SlideShare a Scribd company logo
Text Mining with Node.js
Philipp Burckhardt
Carnegie Mellon University
Who am I?
Why re-invent the
wheel?
Reasons for using Node.js
• JavaScript - language of the Web
• Platform-agnostic (all operating systems, browser, CLIs and
desktop applications)
• V8 engine is fast enough to handle text mining tasks (faster than
Python or R)
• Core streams can handle real-time data & large amounts of
text
Drawback: Besides few popular packages like natural,
no eco-system of good text mining modules yet.
Use Case: deidentify
Use Case: deidentify
• Software for de-identification of protected health
information in free-text medical record data
• Developed as part of research project at CMU
The Challenge
Unstructured data
might account for
more than 80%
percent of data
collected.
Text Mining Overview
Typical Test Mining Tasks
• Sentiment
analysis
•Cluster analysis and
topic modeling: find
hidden patterns or
grouping in data
Getting practical
Sentiment Analysis of „State of the Union“ addresses
by President Obama
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' );
const words = require( '@stdlib/datasets/afinn-111' );
const tm = require( 'text-miner' );
// Convert to a dictionary...
const len = words.length;
const dict = {};
for ( let i = 0; i < len; i++ ) {
dict[ words[i][0] ] = words[i][1];
}
const obamaSpeeches = getSpeeches({
'president': [ 'Barack Obama' ]
});
let obamaCorpus = new tm.Corpus(
obamaSpeeches.map( x => x.text )
)
.trim()
.toLower()
.removeInterpunctuation();
// Calculate sentiments...
const docs = obamaCorpus.getTexts();
const sentiments = [];
for ( let i = 0; i < docs; i++ ) {
const words = docs[ i ].split( ' ' );
let score = 0;
for ( let j = 0; j < words.length; j++ ) {
const val = dict[ words[ j ] ];
if ( val ) { score += val; }
}
sentiments.push( score );
}
Pre-Processing
sentiments = [ 69, 47, 266, 75, 234, 234, 163, 157 ]
Topic Modeling
Goal: find documents which share the same themes
(e.g. politics, business, sports)
Latent Dirichlet Allocation
• Probabilistic model for text documents by Blei et
al.
• Documents are assumed to have a distribution
over topics
• Very popular because of its expandability
const getSpeeches = require( '@stdlib/datasets/sotu-addresses' )
const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' );
const tm = require( 'text-miner' );
let speeches = getSpeeches({ range: [ 1930, 2010 ] })
.map( ( e ) => e.text );
let corpus = new tm.Corpus( speeches );
corpus = corpus
.toLower()
.removeInterpunctuation()
.removeWords( tm.STOPWORDS.EN );
let docs = corpus.getTexts();
let model = lda( docs, 3 );
model.fit( 1000, 100, 10 );
lda( <Documents Array>, <Number of Topics> )
model.fit( <Iterations>, <Burnin>, <Thinning> )
Results for SOTU addresses
from 1930 to 2010
Topic Words
1 world, peace, war, nations, free,
people, great, nation, united,
freedom, power, military,
american, men, defense, time,
forces, strength
2 america, people, american, years,
americans, year, work, make,
children, congress, tonight, time,
tax, country, government, health,
budget, care
3 government, year, federal,
program, congress, economic,
states, national, administration,
million, policy, public, dollars,
legislation, programs, billion,
system, years, united, fiscal
Text Analysis using the
Command Line
Rationale
• Data pipelines using UNIX shell commands
• Processing of shell commands is done in parallel
• Memory usage
• V8 engine has default limit of 1.76 GB on 64 bit machine
(changeable via --max_old_space_size=<size>)
• Use stream processing instead of batch processing to avoid
high memory usage
LIVE DEMONSTRATION

More Related Content

What's hot

Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
Ernesto Reig
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
Prakhar Dhama
 
Frequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigDataFrequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigData
Raju Gupta
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
Fabio Fumarola
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal
Deborah McGuinness
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
bodaceacat
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
Gérard Dupont
 
Nuix Presentation
Nuix PresentationNuix Presentation
Nuix Presentation
tbonk_dti
 
tech 3camp presentation
tech 3camp presentationtech 3camp presentation
tech 3camp presentation
jwacxiom
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
Zbigniew Jerzak
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
Data science in ruby is it possible? is it fast? should we use it?
Data science in ruby is it possible? is it fast? should we use it?Data science in ruby is it possible? is it fast? should we use it?
Data science in ruby is it possible? is it fast? should we use it?
Rodrigo Urubatan
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API
akvalex
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Evert Lammerts
 
NBITSearch. Features.
NBITSearch. Features.NBITSearch. Features.
NBITSearch. Features.
Novosib-BIT LLC
 
Toulouse Data Science meetup - Apache zeppelin
Toulouse Data Science meetup - Apache zeppelinToulouse Data Science meetup - Apache zeppelin
Toulouse Data Science meetup - Apache zeppelin
Gérard Dupont
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
Dataconomy Media
 
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
PyData
 

What's hot (20)

Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
 
Frequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigDataFrequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigData
 
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Nuix Presentation
Nuix PresentationNuix Presentation
Nuix Presentation
 
tech 3camp presentation
tech 3camp presentationtech 3camp presentation
tech 3camp presentation
 
Visualization-Driven Data Aggregation
Visualization-Driven Data AggregationVisualization-Driven Data Aggregation
Visualization-Driven Data Aggregation
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Data science in ruby is it possible? is it fast? should we use it?
Data science in ruby is it possible? is it fast? should we use it?Data science in ruby is it possible? is it fast? should we use it?
Data science in ruby is it possible? is it fast? should we use it?
 
Object multifunctional indexing with an open API
Object multifunctional indexing with an open API Object multifunctional indexing with an open API
Object multifunctional indexing with an open API
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
 
NBITSearch. Features.
NBITSearch. Features.NBITSearch. Features.
NBITSearch. Features.
 
Toulouse Data Science meetup - Apache zeppelin
Toulouse Data Science meetup - Apache zeppelinToulouse Data Science meetup - Apache zeppelin
Toulouse Data Science meetup - Apache zeppelin
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
3Camp2015_prod
3Camp2015_prod3Camp2015_prod
3Camp2015_prod
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
Xray: extended arrays for scientific datasets by Stephan Hoyer PyData SV 2014
 

Viewers also liked

Text mining mengmeng & jack_lsu
Text mining mengmeng & jack_lsuText mining mengmeng & jack_lsu
Text mining mengmeng & jack_lsu
jjdai
 
Abe Curso Estudos De Caso Ii
Abe Curso Estudos De Caso IiAbe Curso Estudos De Caso Ii
Abe Curso Estudos De Caso Ii
ufrj
 
Cmc liberando a agenda-horários
Cmc   liberando a agenda-horáriosCmc   liberando a agenda-horários
Cmc liberando a agenda-horáriosLeonardo Alves
 
L) médico pesquisando medicações
L) médico   pesquisando medicaçõesL) médico   pesquisando medicações
L) médico pesquisando medicaçõesLeonardo Alves
 
Cmc encaminhamento- solicitando e agendando
Cmc  encaminhamento- solicitando e agendandoCmc  encaminhamento- solicitando e agendando
Cmc encaminhamento- solicitando e agendandoLeonardo Alves
 
K) médico histórico de consultas, retornos, consultas antigas
K) médico   histórico de consultas, retornos, consultas antigasK) médico   histórico de consultas, retornos, consultas antigas
K) médico histórico de consultas, retornos, consultas antigasLeonardo Alves
 
Prontuário Eletrônico - Prefeituras
Prontuário Eletrônico - PrefeiturasProntuário Eletrônico - Prefeituras
Prontuário Eletrônico - Prefeituras
Leonardo Alves
 
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, BustleHitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
NodejsFoundation
 
Take Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorksTake Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorks
NodejsFoundation
 
Node.js Core State of the Union- James Snell
Node.js Core State of the Union- James SnellNode.js Core State of the Union- James Snell
Node.js Core State of the Union- James Snell
NodejsFoundation
 
State of the CLI- Kat Marchan
State of the CLI- Kat MarchanState of the CLI- Kat Marchan
State of the CLI- Kat Marchan
NodejsFoundation
 
Aplicação de técnicas de mineração de textos para classificação automática de...
Aplicação de técnicas de mineração de textos para classificação automática de...Aplicação de técnicas de mineração de textos para classificação automática de...
Aplicação de técnicas de mineração de textos para classificação automática de...
Rommel Carvalho
 
Developing Nirvana - Corey A. Butler, Author.io
Developing Nirvana - Corey A. Butler, Author.ioDeveloping Nirvana - Corey A. Butler, Author.io
Developing Nirvana - Corey A. Butler, Author.io
NodejsFoundation
 
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
NodejsFoundation
 
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
NodejsFoundation
 
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
NodejsFoundation
 
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
NodejsFoundation
 
Node's Event Loop From the Inside Out - Sam Roberts, IBM
Node's Event Loop From the Inside Out - Sam Roberts, IBMNode's Event Loop From the Inside Out - Sam Roberts, IBM
Node's Event Loop From the Inside Out - Sam Roberts, IBM
NodejsFoundation
 
Math in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
Math in V8 is Broken and How We Can Fix It - Athan Reines, FourierMath in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
Math in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
NodejsFoundation
 
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
NodejsFoundation
 

Viewers also liked (20)

Text mining mengmeng & jack_lsu
Text mining mengmeng & jack_lsuText mining mengmeng & jack_lsu
Text mining mengmeng & jack_lsu
 
Abe Curso Estudos De Caso Ii
Abe Curso Estudos De Caso IiAbe Curso Estudos De Caso Ii
Abe Curso Estudos De Caso Ii
 
Cmc liberando a agenda-horários
Cmc   liberando a agenda-horáriosCmc   liberando a agenda-horários
Cmc liberando a agenda-horários
 
L) médico pesquisando medicações
L) médico   pesquisando medicaçõesL) médico   pesquisando medicações
L) médico pesquisando medicações
 
Cmc encaminhamento- solicitando e agendando
Cmc  encaminhamento- solicitando e agendandoCmc  encaminhamento- solicitando e agendando
Cmc encaminhamento- solicitando e agendando
 
K) médico histórico de consultas, retornos, consultas antigas
K) médico   histórico de consultas, retornos, consultas antigasK) médico   histórico de consultas, retornos, consultas antigas
K) médico histórico de consultas, retornos, consultas antigas
 
Prontuário Eletrônico - Prefeituras
Prontuário Eletrônico - PrefeiturasProntuário Eletrônico - Prefeituras
Prontuário Eletrônico - Prefeituras
 
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, BustleHitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
Hitchhiker's Guide to"'Serverless" Javascript - Steven Faulkner, Bustle
 
Take Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorksTake Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorks
 
Node.js Core State of the Union- James Snell
Node.js Core State of the Union- James SnellNode.js Core State of the Union- James Snell
Node.js Core State of the Union- James Snell
 
State of the CLI- Kat Marchan
State of the CLI- Kat MarchanState of the CLI- Kat Marchan
State of the CLI- Kat Marchan
 
Aplicação de técnicas de mineração de textos para classificação automática de...
Aplicação de técnicas de mineração de textos para classificação automática de...Aplicação de técnicas de mineração de textos para classificação automática de...
Aplicação de técnicas de mineração de textos para classificação automática de...
 
Developing Nirvana - Corey A. Butler, Author.io
Developing Nirvana - Corey A. Butler, Author.ioDeveloping Nirvana - Corey A. Butler, Author.io
Developing Nirvana - Corey A. Butler, Author.io
 
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
Real-Life Node.js Troubleshooting - Damian Schenkelman, Auth0
 
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
 
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
Multimodal Interactions & JS: The What, The Why and The How - Diego Paez, Des...
 
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
Are your v8 garbage collection logs speaking to you?Joyee Cheung -Alibaba Clo...
 
Node's Event Loop From the Inside Out - Sam Roberts, IBM
Node's Event Loop From the Inside Out - Sam Roberts, IBMNode's Event Loop From the Inside Out - Sam Roberts, IBM
Node's Event Loop From the Inside Out - Sam Roberts, IBM
 
Math in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
Math in V8 is Broken and How We Can Fix It - Athan Reines, FourierMath in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
Math in V8 is Broken and How We Can Fix It - Athan Reines, Fourier
 
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
Real-Time Machine Learning with Node.js - Philipp Burckhardt, Carnegie Mellon...
 

Similar to Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University

An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
sandinmyjoints
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
Noemi Derzsy
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
Geohedrick
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
David De Roure
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Noemi Derzsy
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
PyData
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to pythonActiveState
 
Data_Science.ppt
Data_Science.pptData_Science.ppt
Data_Science.ppt
ANGADPRAJAPATI3
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
Sai Srinivas Kotni
 
Reproducible Research and the Cloud
Reproducible Research and the CloudReproducible Research and the Cloud
Reproducible Research and the Cloud
Microsoft Azure for Research
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
PPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini RatrePPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini Ratre
RaginiRatre
 
Data management
Data management Data management
Data management
Graça Gabriel
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
Michael Rys
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 

Similar to Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University (20)

An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
04 standard class library c#
04 standard class library c#04 standard class library c#
04 standard class library c#
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
Data_Science.ppt
Data_Science.pptData_Science.ppt
Data_Science.ppt
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Meow Hagedorn
Meow HagedornMeow Hagedorn
Meow Hagedorn
 
Reproducible Research and the Cloud
Reproducible Research and the CloudReproducible Research and the Cloud
Reproducible Research and the Cloud
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
PPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini RatrePPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini Ratre
 
Data management
Data management Data management
Data management
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 

More from NodejsFoundation

The Morality of Code - Glen Goodwin, SAS Institute, inc.
The Morality of Code - Glen Goodwin, SAS Institute, inc.The Morality of Code - Glen Goodwin, SAS Institute, inc.
The Morality of Code - Glen Goodwin, SAS Institute, inc.
NodejsFoundation
 
Nodifying the Enterprise - Prince Soni, TO THE NEW
Nodifying the Enterprise - Prince Soni, TO THE NEWNodifying the Enterprise - Prince Soni, TO THE NEW
Nodifying the Enterprise - Prince Soni, TO THE NEW
NodejsFoundation
 
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
NodejsFoundation
 
Express State of the Union at Nodejs Interactive EU- Doug Wilson
Express State of the Union at Nodejs Interactive EU- Doug WilsonExpress State of the Union at Nodejs Interactive EU- Doug Wilson
Express State of the Union at Nodejs Interactive EU- Doug Wilson
NodejsFoundation
 
Building Scalable Web Applications Using Microservices Architecture and NodeJ...
Building Scalable Web Applications Using Microservices Architecture and NodeJ...Building Scalable Web Applications Using Microservices Architecture and NodeJ...
Building Scalable Web Applications Using Microservices Architecture and NodeJ...
NodejsFoundation
 
Take Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorksTake Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorks
NodejsFoundation
 
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
NodejsFoundation
 
Breaking Down the Monolith - Peter Marton, RisingStack
Breaking Down the Monolith - Peter Marton, RisingStackBreaking Down the Monolith - Peter Marton, RisingStack
Breaking Down the Monolith - Peter Marton, RisingStack
NodejsFoundation
 
The Enterprise Case for Node.js
The Enterprise Case for Node.jsThe Enterprise Case for Node.js
The Enterprise Case for Node.js
NodejsFoundation
 
Node Foundation Membership Overview 20160907
Node Foundation Membership Overview 20160907Node Foundation Membership Overview 20160907
Node Foundation Membership Overview 20160907
NodejsFoundation
 

More from NodejsFoundation (10)

The Morality of Code - Glen Goodwin, SAS Institute, inc.
The Morality of Code - Glen Goodwin, SAS Institute, inc.The Morality of Code - Glen Goodwin, SAS Institute, inc.
The Morality of Code - Glen Goodwin, SAS Institute, inc.
 
Nodifying the Enterprise - Prince Soni, TO THE NEW
Nodifying the Enterprise - Prince Soni, TO THE NEWNodifying the Enterprise - Prince Soni, TO THE NEW
Nodifying the Enterprise - Prince Soni, TO THE NEW
 
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
Workshop: Science Meets Industry: Online Behavioral Experiments with nodeGame...
 
Express State of the Union at Nodejs Interactive EU- Doug Wilson
Express State of the Union at Nodejs Interactive EU- Doug WilsonExpress State of the Union at Nodejs Interactive EU- Doug Wilson
Express State of the Union at Nodejs Interactive EU- Doug Wilson
 
Building Scalable Web Applications Using Microservices Architecture and NodeJ...
Building Scalable Web Applications Using Microservices Architecture and NodeJ...Building Scalable Web Applications Using Microservices Architecture and NodeJ...
Building Scalable Web Applications Using Microservices Architecture and NodeJ...
 
Take Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorksTake Data Validation Seriously - Paul Milham, WildWorks
Take Data Validation Seriously - Paul Milham, WildWorks
 
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
From Pterodactyls and Cactus to Artificial Intelligence - Ivan Seidel Gomes, ...
 
Breaking Down the Monolith - Peter Marton, RisingStack
Breaking Down the Monolith - Peter Marton, RisingStackBreaking Down the Monolith - Peter Marton, RisingStack
Breaking Down the Monolith - Peter Marton, RisingStack
 
The Enterprise Case for Node.js
The Enterprise Case for Node.jsThe Enterprise Case for Node.js
The Enterprise Case for Node.js
 
Node Foundation Membership Overview 20160907
Node Foundation Membership Overview 20160907Node Foundation Membership Overview 20160907
Node Foundation Membership Overview 20160907
 

Recently uploaded

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 

Recently uploaded (20)

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 

Text Mining with Node.js - Philipp Burckhardt, Carnegie Mellon University

  • 1. Text Mining with Node.js Philipp Burckhardt Carnegie Mellon University
  • 4. Reasons for using Node.js • JavaScript - language of the Web • Platform-agnostic (all operating systems, browser, CLIs and desktop applications) • V8 engine is fast enough to handle text mining tasks (faster than Python or R) • Core streams can handle real-time data & large amounts of text Drawback: Besides few popular packages like natural, no eco-system of good text mining modules yet.
  • 6. Use Case: deidentify • Software for de-identification of protected health information in free-text medical record data • Developed as part of research project at CMU
  • 8. Unstructured data might account for more than 80% percent of data collected.
  • 10. Typical Test Mining Tasks • Sentiment analysis
  • 11. •Cluster analysis and topic modeling: find hidden patterns or grouping in data
  • 12. Getting practical Sentiment Analysis of „State of the Union“ addresses by President Obama
  • 13. const getSpeeches = require( '@stdlib/datasets/sotu-addresses' ); const words = require( '@stdlib/datasets/afinn-111' ); const tm = require( 'text-miner' ); // Convert to a dictionary... const len = words.length; const dict = {}; for ( let i = 0; i < len; i++ ) { dict[ words[i][0] ] = words[i][1]; } const obamaSpeeches = getSpeeches({ 'president': [ 'Barack Obama' ] }); let obamaCorpus = new tm.Corpus( obamaSpeeches.map( x => x.text ) ) .trim() .toLower() .removeInterpunctuation(); // Calculate sentiments... const docs = obamaCorpus.getTexts(); const sentiments = []; for ( let i = 0; i < docs; i++ ) { const words = docs[ i ].split( ' ' ); let score = 0; for ( let j = 0; j < words.length; j++ ) { const val = dict[ words[ j ] ]; if ( val ) { score += val; } } sentiments.push( score ); } Pre-Processing sentiments = [ 69, 47, 266, 75, 234, 234, 163, 157 ]
  • 14.
  • 15. Topic Modeling Goal: find documents which share the same themes (e.g. politics, business, sports)
  • 16. Latent Dirichlet Allocation • Probabilistic model for text documents by Blei et al. • Documents are assumed to have a distribution over topics • Very popular because of its expandability
  • 17. const getSpeeches = require( '@stdlib/datasets/sotu-addresses' ) const lda = require( '@stdlib/nlp/latent-dirichlet-allocation' ); const tm = require( 'text-miner' ); let speeches = getSpeeches({ range: [ 1930, 2010 ] }) .map( ( e ) => e.text ); let corpus = new tm.Corpus( speeches ); corpus = corpus .toLower() .removeInterpunctuation() .removeWords( tm.STOPWORDS.EN ); let docs = corpus.getTexts(); let model = lda( docs, 3 ); model.fit( 1000, 100, 10 ); lda( <Documents Array>, <Number of Topics> ) model.fit( <Iterations>, <Burnin>, <Thinning> )
  • 18. Results for SOTU addresses from 1930 to 2010 Topic Words 1 world, peace, war, nations, free, people, great, nation, united, freedom, power, military, american, men, defense, time, forces, strength 2 america, people, american, years, americans, year, work, make, children, congress, tonight, time, tax, country, government, health, budget, care 3 government, year, federal, program, congress, economic, states, national, administration, million, policy, public, dollars, legislation, programs, billion, system, years, united, fiscal
  • 19. Text Analysis using the Command Line
  • 20. Rationale • Data pipelines using UNIX shell commands • Processing of shell commands is done in parallel • Memory usage • V8 engine has default limit of 1.76 GB on 64 bit machine (changeable via --max_old_space_size=<size>) • Use stream processing instead of batch processing to avoid high memory usage