NLP Structured Data Investigation on Non-Text
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) NLP Structured Data Investigation on Non-Text 2016
Table of Contents
Preliminaries
Borrowing from NLP
Demo
Questions
Introduction
• I’m a Principal Architect at Hortonworks
• I work primarily doing Data Science in the Hadoop Ecosystem
• Prior to this, I’ve spent my time and had a lot of fun
◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem
◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce
◦ Being a graduate student in the Math department at Texas A&M, working in
algorithmic complexity theory
Domain Challenges in Data Science
A data scientist has to merge analytical skills with domain expertise.
• Often we’re thrown into places where we have insufficient domain experience.
• Gaining this expertise can be challenging and time-consuming.
• Unsupervised machine learning techniques can be very useful for understanding
complex data relationships.
We’ll use an unsupervised structure learning algorithm borrowed from NLP to look at
medical data.
Word2Vec
Word2Vec is a vectorization model created by Google [1] that attempts to learn
relationships between words automatically given a large corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in the vector space
with cosine similarity.
• Uses a neural network to learn vector representations.
• Work by Pennington, Socher, and Manning [2] shows that the word2vec model is
equivalent to weighting a word co-occurrence matrix by window distance and
lowering its dimension by matrix factorization.
Takeaway: The technique boils down, intuitively, to a riff on word co-occurrence. See
here¹ for more.
¹ http://radimrehurek.com/2014/12/making-sense-of-word2vec/
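To make the co-occurrence intuition concrete, here is a minimal sketch in plain Python. It is not word2vec itself: there is no neural network and no matrix factorization step. It only counts co-occurrences inside a window, weights each count by 1/distance (an illustrative choice), and finds near neighbors with cosine similarity. The function names and the toy corpus are assumptions made up for this example.

```python
from collections import defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Return one sparse, distance-weighted co-occurrence vector per word."""
    vectors = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Nearer context words count more: weight = 1 / distance.
                    vectors[word][sentence[j]] += 1.0 / abs(i - j)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors.get(word)
    if target is None:
        return []
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy corpus (hypothetical): "cat" and "dog" never co-occur,
# but they appear in the same contexts.
corpus = [
    ["cat", "drinks", "milk"],
    ["dog", "drinks", "water"],
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat"],
]
vectors = cooccurrence_vectors(corpus, window=2)
print(nearest("cat", vectors))
```

In this toy corpus "dog" ranks first for "cat" even though the two words never share a sentence, because they share context words. That is the kind of relationship the real model picks up at scale.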
Clinical Data as Sentences
Clinical encounters form a sort of sentence over time. For a given encounter:
• Vitals are measured (e.g. height, weight, BMI).
• Labs are performed and results are recorded (e.g. blood tests).
• Procedures are performed.
• Diagnoses are made (e.g. Diabetes).
• Drugs are prescribed.
Each of these can be considered clinical “words” and the encounter forms a clinical
“sentence”.
Idea: We can use word2vec to investigate connections between these clinical concepts.
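One way to picture the encounter-as-sentence idea is the sketch below. It assumes a hypothetical flat event layout of (encounter_id, kind, value) tuples and a hypothetical "kind:value" token format. Neither comes from the actual dataset or preprocessing scripts, and the sample values are invented for illustration.

```python
from collections import OrderedDict


def encounters_to_sentences(events):
    """Group (encounter_id, kind, value) events into one token list per encounter."""
    sentences = OrderedDict()
    for encounter_id, kind, value in events:
        # Normalize each clinical concept into a single whitespace-free "word".
        token = "{}:{}".format(kind, value.strip().lower().replace(" ", "_"))
        sentences.setdefault(encounter_id, []).append(token)
    return list(sentences.values())


# Hypothetical events, in the order they were recorded.
events = [
    ("e1", "vital", "BMI high"),
    ("e1", "lab", "HbA1c elevated"),
    ("e1", "dx", "Diabetes"),
    ("e1", "rx", "Metformin"),
    ("e2", "vital", "BMI normal"),
    ("e2", "dx", "Hypertension"),
]

for sentence in encounters_to_sentences(events):
    print(" ".join(sentence))
```

Prefixing each token with its kind keeps, say, a diagnosis and a drug with similar names from collapsing into the same "word". Continuous measurements would need to be binned into categories first, which this sketch assumes has already happened.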
Demo
As part of a Kaggle competition², Practice Fusion, a digital electronic medical records
provider, released depersonalized clinical records of 10,000 patients. I ingested and
preprocessed these records into 197,340 clinical “sentences” using Pig and Hive.
MLlib from Spark now contains an implementation of word2vec, so let’s use PySpark
and IPython Notebook to explore this dataset on Hadoop.
² https://www.kaggle.com/c/pf2012-diabetes
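For readers who want a starting point, the following is a rough sketch of training and querying the MLlib word2vec model from PySpark. It is not the notebook from the talk. It assumes the sentences are stored one per line with whitespace-separated tokens, and the helper names, the path, and the example token in the usage comment are all hypothetical. The Spark calls used (`Word2Vec`, `setVectorSize`, `setSeed`, `fit`, `findSynonyms`) are from `pyspark.mllib.feature`.

```python
def tokenize(line):
    """Split one whitespace-delimited clinical sentence into tokens."""
    return line.strip().split()


def train_word2vec(sc, path, vector_size=100, seed=42):
    """Fit an MLlib Word2Vec model on a text file of clinical sentences."""
    # Imported here so this module can be loaded on a machine without Spark.
    from pyspark.mllib.feature import Word2Vec

    sentences = sc.textFile(path).map(tokenize)
    return Word2Vec().setVectorSize(vector_size).setSeed(seed).fit(sentences)


def synonyms(model, token, num=10):
    """Return the `num` nearest tokens as (token, cosine similarity) pairs."""
    return [(word, float(score)) for word, score in model.findSynonyms(token, num)]


# Usage inside a PySpark session (path and token are placeholders):
#   model = train_word2vec(sc, "hdfs:///path/to/clinical_sentences.txt")
#   for word, score in synonyms(model, "dx:diabetes"):
#       print(word, score)
```

Querying a diagnosis token this way returns the clinical concepts that tend to appear in similar encounters, which is the exploration the demo is after.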
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk are available on my GitHub presentations page.³
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
³ http://github.com/cestella/presentations/
Bibliography
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global
vectors for word representation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association
for Computational Linguistics, 2014.
Casey Stella@casey_stella (Hortonworks) NLP Structured Data Investigation on Non-Text 2016

More Related Content

What's hot

Opportunistic Persistent Data Storage
Opportunistic Persistent Data StorageOpportunistic Persistent Data Storage
Opportunistic Persistent Data Storage
Luke Weerasooriya
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
Ali Murfi
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
Paul Groth
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domains
Paul Groth
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
Susanna-Assunta Sansone
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
Paul Groth
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
Bertram Ludäscher
 
Workshop on Systematic Searching (Oslo)
Workshop on Systematic Searching (Oslo)Workshop on Systematic Searching (Oslo)
Workshop on Systematic Searching (Oslo)
jstaaks
 
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked DataISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
Evangelia Daskalaki
 
Introduction to Systematic Reviews (Oslo)
Introduction to Systematic Reviews (Oslo)Introduction to Systematic Reviews (Oslo)
Introduction to Systematic Reviews (Oslo)
jstaaks
 
On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...
Susanna-Assunta Sansone
 
Swapnil soni Thesis_Presentation
Swapnil soni Thesis_PresentationSwapnil soni Thesis_Presentation
Swapnil soni Thesis_Presentation
Swapnil Soni
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
Paul Hofmann
 
Public PhD Defense - Ben De Meester
Public PhD Defense - Ben De MeesterPublic PhD Defense - Ben De Meester
Public PhD Defense - Ben De Meester
Ben De Meester
 
Differential privacy (개인정보 차등보호)
Differential privacy (개인정보 차등보호)Differential privacy (개인정보 차등보호)
Differential privacy (개인정보 차등보호)
Young-Geun Choi
 
On community-standards, data curation and scholarly communication" Stanford M...
On community-standards, data curation and scholarly communication" Stanford M...On community-standards, data curation and scholarly communication" Stanford M...
On community-standards, data curation and scholarly communication" Stanford M...
Susanna-Assunta Sansone
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Susanna-Assunta Sansone
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
Lars Marius Garshol
 
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
PyData
 

What's hot (19)

Opportunistic Persistent Data Storage
Opportunistic Persistent Data StorageOpportunistic Persistent Data Storage
Opportunistic Persistent Data Storage
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domains
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Workshop on Systematic Searching (Oslo)
Workshop on Systematic Searching (Oslo)Workshop on Systematic Searching (Oslo)
Workshop on Systematic Searching (Oslo)
 
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked DataISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
ISWC 2014 Tutorial - Instance Matching Benchmarks for Linked Data
 
Introduction to Systematic Reviews (Oslo)
Introduction to Systematic Reviews (Oslo)Introduction to Systematic Reviews (Oslo)
Introduction to Systematic Reviews (Oslo)
 
On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...
 
Swapnil soni Thesis_Presentation
Swapnil soni Thesis_PresentationSwapnil soni Thesis_Presentation
Swapnil soni Thesis_Presentation
 
Dynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & StatisticsDynamic Search Using Semantics & Statistics
Dynamic Search Using Semantics & Statistics
 
Public PhD Defense - Ben De Meester
Public PhD Defense - Ben De MeesterPublic PhD Defense - Ben De Meester
Public PhD Defense - Ben De Meester
 
Differential privacy (개인정보 차등보호)
Differential privacy (개인정보 차등보호)Differential privacy (개인정보 차등보호)
Differential privacy (개인정보 차등보호)
 
On community-standards, data curation and scholarly communication" Stanford M...
On community-standards, data curation and scholarly communication" Stanford M...On community-standards, data curation and scholarly communication" Stanford M...
On community-standards, data curation and scholarly communication" Stanford M...
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
Understanding your data with Bayesian networks (in Python) by Bartek Wilczyns...
 

Viewers also liked

Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
DataWorks Summit/Hadoop Summit
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
Using a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance businessUsing a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance business
DataWorks Summit/Hadoop Summit
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
DataWorks Summit/Hadoop Summit
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
Hellmar Becker
 
Open Data Fueling Innovation - Kristen Honey
Open Data Fueling Innovation - Kristen HoneyOpen Data Fueling Innovation - Kristen Honey
Open Data Fueling Innovation - Kristen Honey
scoopnewsgroup
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
Cloudera, Inc.
 
HHS: Opening Data, Influencing Innovation - Damon Davis
HHS: Opening Data, Influencing Innovation - Damon DavisHHS: Opening Data, Influencing Innovation - Damon Davis
HHS: Opening Data, Influencing Innovation - Damon Davis
scoopnewsgroup
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
LinkedIn
LinkedInLinkedIn
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
Gruter
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop
MapR Technologies
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
DataWorks Summit/Hadoop Summit
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
DataWorks Summit/Hadoop Summit
 
Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
DataWorks Summit/Hadoop Summit
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
DataWorks Summit/Hadoop Summit
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
DataWorks Summit/Hadoop Summit
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
DataWorks Summit/Hadoop Summit
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
 
Using a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance businessUsing a Data Lake at the core of a Life Assurance business
Using a Data Lake at the core of a Life Assurance business
 
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
Implementing the Business Catalog in the Modern Enterprise: Bridging Traditio...
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
Open Data Fueling Innovation - Kristen Honey
Open Data Fueling Innovation - Kristen HoneyOpen Data Fueling Innovation - Kristen Honey
Open Data Fueling Innovation - Kristen Honey
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
 
HHS: Opening Data, Influencing Innovation - Damon Davis
HHS: Opening Data, Influencing Innovation - Damon DavisHHS: Opening Data, Influencing Innovation - Damon Davis
HHS: Opening Data, Influencing Innovation - Damon Davis
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
LinkedIn
LinkedInLinkedIn
LinkedIn
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
GRUTER가 들려주는 Big Data Platform 구축 전략과 적용 사례: 인터넷 쇼핑몰의 실시간 분석 플랫폼 구축 사례
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!The key to unlocking the Value in the IoT? Managing the Data!
The key to unlocking the Value in the IoT? Managing the Data!
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 

Similar to NLP Structured Data Investigation on Non-Text

NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
Spark Summit
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
Kai Li
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
Artificial Intelligence Institute at UofSC
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Anubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
Words, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic AnalysisWords, Documents and Distance: Deep Learning and Semantic Analysis
Words, Documents and Distance: Deep Learning and Semantic Analysis
Ray Poynter
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
Carole Goble
 
Natural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine LearningNatural Language Processing Through Different Classes of Machine Learning
Natural Language Processing Through Different Classes of Machine Learning
csandit
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
Bhaskar Mitra
 
Data Preparation for Data Science
Data Preparation for Data ScienceData Preparation for Data Science
Data Preparation for Data Science
DataWorks Summit/Hadoop Summit
 
Data Preparation of Data Science
Data Preparation of Data ScienceData Preparation of Data Science
Data Preparation of Data Science
DataWorks Summit/Hadoop Summit
 
Open IE tutorial 2018
Open IE tutorial 2018Open IE tutorial 2018
Open IE tutorial 2018
Andre Freitas
 
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
Low Resource Domain Subjective Context Feature Extraction via Thematic Meta-l...
AI Publications
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Towards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data ServicesTowards Incidental Collaboratories; Research Data Services
Towards Incidental Collaboratories; Research Data Services
Anita de Waard
 
The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014
Right to Research
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research Data
Ross Mounce
 
Idcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckleIdcc kansa-kansa-arbuckle
Idcc kansa-kansa-arbuckle
Eric Kansa
 

Similar to NLP Structured Data Investigation on Non-Text (20)

NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
 
Using Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive ComputingUsing Knowledge Graph for Promoting Cognitive Computing
Using Knowledge Graph for Promoting Cognitive Computing
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
NLP Structured Data Investigation on Non-Text

  • 1. NLP Structured Data Investigation on Non-Text
      Casey Stella (@casey_stella), Hortonworks, 2016
  • 2. Table of Contents
      • Preliminaries
      • Borrowing from NLP
      • Demo
      • Questions
  • 3. Introduction
      • I’m a Principal Architect at Hortonworks
      • I work primarily doing Data Science in the Hadoop Ecosystem
      • Prior to this, I’ve spent my time and had a lot of fun
          ◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem
          ◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce
          ◦ Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory
  • 5. Domain Challenges in Data Science
      A data scientist has to merge analytical skills with domain expertise.
      • Often we’re thrown into places where we have insufficient domain experience.
      • Gaining this expertise can be challenging and time-consuming.
      • Unsupervised machine learning techniques can be very useful to understand complex data relationships.
      We’ll use an unsupervised structure learning algorithm borrowed from NLP to look at medical data.
  • 9. Word2Vec
      Word2Vec is a vectorization model created by Google [1] that attempts to learn relationships between words automatically given a large corpus of sentences.
      • Gives us a way to find similar words by finding near neighbors in the vector space with cosine similarity.
      • Uses a neural network to learn vector representations.
      • Work by Pennington, Socher, and Manning [2] suggests that the word2vec model is roughly equivalent to weighting a word co-occurrence matrix by window distance and lowering the dimension by matrix factorization.
      Takeaway: The technique boils down, intuitively, to a riff on word co-occurrence. See http://radimrehurek.com/2014/12/making-sense-of-word2vec/ for more.
  • 11. Clinical Data as Sentences
      Clinical encounters form a sort of sentence over time. For a given encounter:
      • Vitals are measured (e.g. height, weight, BMI).
      • Labs are performed and results are recorded (e.g. blood tests).
      • Procedures are performed.
      • Diagnoses are made (e.g. Diabetes).
      • Drugs are prescribed.
      Each of these can be considered clinical “words” and the encounter forms a clinical “sentence”.
      Idea: We can use word2vec to investigate connections between these clinical concepts.
  • 13. Demo
      As part of a Kaggle competition (https://www.kaggle.com/c/pf2012-diabetes), Practice Fusion, a digital electronic medical records provider, released depersonalized clinical records of 10,000 patients. I ingested and preprocessed these records into 197,340 clinical “sentences” using Pig and Hive.
      MLlib from Spark now contains an implementation of word2vec, so let’s use pyspark and IPython Notebook to explore this dataset on Hadoop.
  • 14. Questions
      Thanks for your attention! Questions?
      • Code & scripts for this talk are available on my GitHub presentation page: http://github.com/cestella/presentations/
      • Find me at http://caseystella.com
      • Twitter handle: @casey_stella
      • Email address: cstella@hortonworks.com
  • 15. Bibliography
      [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
      [2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, 2014.
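
The “Clinical Data as Sentences” idea and the co-occurrence takeaway from the Word2Vec slide can be illustrated with a small, pure-Python sketch. This is not the code from the talk: the event records, the `dx:` / `lab:` / `rx:` / `vital:` token prefixes, and all function names are hypothetical, and plain window-weighted co-occurrence counts stand in for the learned word2vec vectors. The last function shows roughly how the same sentences might be handed to Spark MLlib's `Word2Vec`; it assumes a running `SparkContext` and is not exercised here.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical flattened event records: (encounter_id, clinical "word").
EVENTS = [
    ("e1", "dx:diabetes"), ("e1", "lab:hba1c_high"), ("e1", "rx:metformin"),
    ("e2", "dx:diabetes"), ("e2", "lab:hba1c_high"), ("e2", "rx:insulin"),
    ("e3", "dx:hypertension"), ("e3", "vital:bp_high"), ("e3", "rx:lisinopril"),
    ("e4", "dx:hypertension"), ("e4", "vital:bp_high"), ("e4", "rx:lisinopril"),
]


def build_sentences(events):
    """Group clinical words by encounter, preserving input order."""
    sentences = defaultdict(list)
    for encounter_id, word in events:
        sentences[encounter_id].append(word)
    return list(sentences.values())


def cooccurrence(sentences, window=2):
    """Count co-occurring words, weighting each pair by 1 / distance."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1.0 / abs(i - j)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(counts, word, top_n=3):
    """Rank the other words by cosine similarity of their co-occurrence rows."""
    scores = [(other, cosine(counts[word], counts[other]))
              for other in list(counts) if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


def train_with_spark(sc, sentences, vector_size=100):
    """Optional: fit MLlib's Word2Vec on the same sentences (needs pyspark)."""
    # Imported lazily so the functions above run without Spark installed.
    from pyspark.mllib.feature import Word2Vec
    model = Word2Vec().setVectorSize(vector_size).setMinCount(1)
    return model.fit(sc.parallelize(sentences))


sentences = build_sentences(EVENTS)
counts = cooccurrence(sentences)
neighbors = most_similar(counts, "rx:metformin")
```

In this toy data the two diabetes drugs appear in identical contexts, so `rx:insulin` ranks as the nearest neighbor of `rx:metformin`, while the hypertension words share no context with it. With a fitted MLlib model, `model.findSynonyms(word, num)` plays the role of `most_similar`.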