Discover the world at Leiden University
Data Science Workshop
Dr. Peter Verhaar Maastricht, 2 April 2019
◻ Unprecedented growth in
volume of digital data
◻ Combined with growing
sophistication of algorithms
and tools
Background
□ Data mining or data science is
the process of applying
computational and algorithmic
methods to large datasets.
□ Text mining is a collection of
methods used to extract
information not from “formalised
database records” but from
“unstructured textual data”
Data Science
Feldman, Ronen. The Text Mining Handbook:
Advanced Approaches in Analyzing
Unstructured Data. Cambridge: Cambridge
University Press, 2007, p. 1
Érik Desmazière, illustration for
cover of La biblioteca de Babel,
1941
Centre for Digital Scholarship
◻ Located within Leiden University
Libraries
◻ Staffed by subject librarians and software
developers
◻ Builds on existing services and existing
expertise
◻ Focus on Open Access, Research Data
management, Digital Preservation and
Text and Data Mining
Support for TDM within library?
◻ Central knowledge base
◻ Fostering interdisciplinary
collaboration
◻ Clarifying terms and
conditions of licences and
negotiations with publishers
◻ Digital preservation
◻ Continuation of traditional
role of libraries: providing
access to texts
The old Public library of
Cincinnati, now demolished
Building expertise on TDM
◻ Literature review; Courses
on Data Science and on
Machine Learning; Online
Tutorials (R Package, Mallet,
OpenNLP, Packages in
Python: nltk, textmining,
matPlotLib, gensim)
◻ Involvement in MA course
on Text and Data Mining
◻ Interviews with scholars who
have expertise
◻ Internal research projects
and pilots with researchers
Biographic research on Van Gogh
◻ Signs of mental decline in the correspondence of Vincent
van Gogh
◻ Average length of sentences and type-token ratios
TDM Workshops
◻ Full day workshop with
explanation of the basics of
Python
◻ Explanation of a range of
algorithms which can be used to
analyse texts
◻ Experiments based on research
questions of participants
◻ Educational programme
aimed at librarians
◻ Aim is to ensure that
librarians can talk about
technology on a basic level
◻ Courses are developed in
collaboration with National
Library (KB) and VU
Amsterdam, under the name
“DH Clinics”
◻ Python and Jupyter Notebook
◻ Data acquisition: Web Scraping, APIs, Linked
Open Data
◻ Data analysis and enrichment: Pandas, CSV,
TDM, tokenization, POS tagging,
lemmatization
◻ Data visualisation: Matplotlib
Workshop outline
◻Python is a widely used
programming language
◻Developed by Guido van
Rossum
◻Advocates code
readability and simplicity
◻Programming style ought
to be ‘pythonic’
http://www.rapidtables.com/
Algorithm: Word2Vec, Topic Modelling (LDA)
Programming language: Python, Java, Perl
Tool: Voyant, Tapor
Variables
□ Variables have a name: any combination of
alphanumerical characters and underscores (not
starting with a digit), e.g. keyword
□ Variables can be assigned a value with a specific data
type
keyword = "Elzevier"
number = 10
□ Examples of variable types include string (a sequence
of characters), integer (whole numbers) and floating
point numbers
Strings
□ Can be created with single quotes and with double
quotes
author = 'Douglas Adams'
title = "The Hitchhiker's Guide to the Galaxy"
□ You can use “escape characters” in your string to
add basic formatting:
"\n" new line
"\t" tab
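A minimal sketch of the two escape sequences listed above. The variable name `record` is only an illustration and does not come from the slides.

```python
# Strings can be written with single or with double quotes.
author = 'Douglas Adams'
title = "The Hitchhiker's Guide to the Galaxy"

# Inside a string, "\t" inserts a tab and "\n" starts a new line.
record = author + "\t" + title + "\n"
print(record)
```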
Mathematical operators
□ The following mathematical operators can be used:
+ addition
- subtraction
/ division
* multiplication
□ For example:
sum = 5 + 6
product = 5 * 6
Boolean operators
□ Boolean operators compare values:
> greater than
< less than
== equal to
□ Expressions result in a ‘Boolean value’: True or False
a = 5
b = 8
print( a > b )
Selection
if <condition>:
    <first block of code>
elif <condition>:
    <second block of code>
else:
    <last block of code>
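The selection template above could be filled in as follows. This is only a sketch that reuses the values `a` and `b` from the Boolean operators slide; the variable `message` is an illustrative name.

```python
a = 5
b = 8

# Only the block belonging to the first true condition is executed.
if a > b:
    message = "a is greater than b"
elif a == b:
    message = "a is equal to b"
else:
    message = "a is less than b"

print(message)
```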
Jupyter Notebook
□ Open source application
which can be used to create
documents containing both
code and documentation
□ Such documents can be
opened in a browser
□ It offers support for a
variety of programming
languages, including Python,
Julia and R
□ It includes “kernels” or
computational engines
which can run the code
directly
Opening Jupyter Notebook
□ Open Anaconda Navigator and select Jupyter
Notebook > Launch
□ OR navigate to the directory that contains
your files in the Command Prompt and type
in:
jupyter notebook
Jupyter can then be opened in a web
browser (e.g. Google Chrome) via the
address localhost:8888
□ Jupyter initially opens the dashboard: a
directory displaying all your files
Opening Jupyter Notebook
□ Jupyter notebooks can also be opened in
Microsoft Azure:
https://notebooks.azure.com/
□ Create a new project and import a GitHub
repository
□ The notebooks for this workshop can be
downloaded from:
https://github.com/peterverhaar/
MaastrichtDataScience
Algorithm
Define number to be guessed
Ask user to type in number
WHILE given number IS NOT number to be guessed:
    IF given number is HIGHER than number to be guessed (Y):
        Print: LOWER
    ELSE (N):
        Print: HIGHER
    Ask user to type in number
Print: Number is correct
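The guessing-game algorithm above can be sketched in Python roughly as follows. The function name `guessing_game` and the `ask` parameter are assumptions made for this illustration; passing the question in as a function makes it possible to run the loop with scripted guesses as well as with keyboard input.

```python
def guessing_game(secret, ask):
    """Keep asking for a number until it equals the secret number."""
    hints = []
    guess = ask()
    while guess != secret:
        # Tell the player in which direction to adjust the next guess.
        if guess > secret:
            hints.append("LOWER")
        else:
            hints.append("HIGHER")
        print(hints[-1])
        guess = ask()
    print("Number is correct")
    return hints

# Scripted run with three guesses instead of keyboard input.
scripted = iter([50, 25, 42])
hints = guessing_game(42, lambda: next(scripted))

# Interactive variant:
# guessing_game(42, lambda: int(input("Type in a number: ")))
```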
Data Acquisition
◻ Direct downloads of data objects
(e.g. full text in UTF-8 from
Delpher or Project Gutenberg)
◻ Downloading data
◻ Downloads of data via
Application Programming
Interfaces (APIs)
◻ Webscraping (via
BeautifulSoup)
◻ Download csv files from data
repositories such as Kaggle,
figShare, DANS EASY
□ An Application Programming Interface is a
technology which can be used to make specific
functions of an application or specific data sets
available for external services
API
[Diagram: the user sends a request plus a key to the API of a service; the service returns XML / JSON]
□ Some APIs are open; in
other cases an API key is
needed
□ Data may be delivered in
different formats: JSON,
XML
□ Actions such as create,
read, update and delete
are technically possible,
but options are usually
limited to reading data
□ Texts and images
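A rough sketch of the request and response cycle described above, using only the standard library. The endpoint `https://api.example.org/search`, the parameter names `q`, `key` and `format`, and the shape of the JSON response are all hypothetical; a real API documents its own URL, parameters and response structure.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/search"  # hypothetical endpoint

def build_request_url(query, api_key):
    """Combine the search terms and the API key into one request URL."""
    params = {"q": query, "key": api_key, "format": "json"}
    return BASE_URL + "?" + urlencode(params)

def parse_titles(json_text):
    """Read the titles from a JSON response (assumed structure)."""
    payload = json.loads(json_text)
    return [item["title"] for item in payload["results"]]

url = build_request_url("Van Gogh", "MY-KEY")
print(url)

# Stand-in for the text a service might send back.
sample_response = '{"results": [{"title": "Letter 1"}, {"title": "Letter 2"}]}'
titles = parse_titles(sample_response)
print(titles)
```

In practice the URL would be fetched with, for example, `urllib.request.urlopen(url)`, and the text of the response passed to `parse_titles`.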
□ A process in which texts are divided into smaller units (e.g.
paragraphs, sentences, words)
□ Token counts reflect the total number of words; Types are
the unique words in a text
“It was the best of
times, it was the worst
of times, it was the
age of wisdom, it was
the age of foolishness,
it was the epoch of
belief, it was the
epoch of incredulity”
Tokenisation
Tokens: 36
Types: 13
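The counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch which assumes that tokens are runs of letters and that upper and lower case are treated as the same word; other tokenisers may count differently.

```python
import re

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

# Lower-case the text and treat every run of letters as one token.
tokens = re.findall(r"[a-z]+", text.lower())
# The types are the distinct tokens.
types = set(tokens)

print("Tokens:", len(tokens))
print("Types:", len(types))
```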
□ Segmentation or
tokenisation
□ Often based on the fact
that there are spaces in
between words (at least
since scriptura continua
was abandoned in late
9th C.)
□ “soft mark up”
Research based on vocabulary
□ ‘Bag of words’ model: original
word order is ignored
Frequency lists
“It was the best of
times, it was the worst
of times, it was the
age of wisdom, it was
the age of foolishness,
it was the epoch of
belief, it was the
epoch of incredulity”
the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
belief 1
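A frequency list such as the one above can be sketched with `collections.Counter` from the standard library. The tokenisation (lower-cased runs of letters) is an assumption of this example.

```python
import re
from collections import Counter

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

# Bag of words: only the counts are kept, the word order is ignored.
tokens = re.findall(r"[a-z]+", text.lower())
frequencies = Counter(tokens)

# Print the ten most frequent words with their counts.
for word, count in frequencies.most_common(10):
    print(word, count)
```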
Word cloud
Dispersion graph
Authorship attribution
John Burrows, Never
Say Always Again:
Reflections on the
Numbers Game
□ Suggesting an author for texts whose authorship is
disputed
Type-token ratio
□ Peter Garrard, Textual Pathology
□ Total number of
types divided by the
number of tokens
□ Gives an indication
of the lexical
diversity of a text
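The ratio described above is a single division. A minimal sketch, in which the function name `type_token_ratio` and the short sample sentence are illustrative choices:

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# 12 tokens, 7 of which are distinct.
tokens = "it was the best of times it was the worst of times".split()
ttr = type_token_ratio(tokens)
print(round(ttr, 2))
```

Because longer texts repeat more words, ratios are best compared between samples of similar length.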
Concordance (or KWIC)
□ NLTK modules contain text corpora, lexical
resources, and “a suite of text processing libraries
for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning”
Python NLTK modules
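import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Read the full text of the novel
novel = open( "ARoomWithAView.txt" , encoding = 'utf-8' )
fullText = novel.read()
novel.close()

# Split the text into sentences, and each sentence into words
sentences = sent_tokenize(fullText)
for sent in sentences:
    words = word_tokenize(sent)
    # pos_tag() returns a list of (word, tag) pairs
    tags = nltk.pos_tag(words)
    for t in tags:
        print( t[0] + " => " + t[1] + "\n")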
The => DT
Signora => NNP
had => VBD
no => DT
business => NN
to => TO
do => VB
it => PRP
said => VBD
Miss => NNP
Bartlett => NNP
□ Stemming: converting an inflected verb
form into its stem.
□ Algorithms based on removal of
suffixes
□ Lemmatisation: relating an inflected
verb form to its lemma (dictionary
form)
□ Tags are commonly based on the
Penn Treebank Tag Set
from nltk.stem import PorterStemmer, WordNetLemmatizer
st = PorterStemmer()
lm = WordNetLemmatizer()
print( st.stem("studying") ) # studi
print( lm.lemmatize("studying", pos="v") ) # study
print( st.stem("went") ) # went
print( lm.lemmatize("went", pos="v") ) # go
print( st.stem("are") ) # are
print( lm.lemmatize("are", pos="v") ) # be
Regular expressions
□ A pattern which represents a specific sequence of
characters
□ To work with regular expressions in Python, you
need to import the ‘re’ module:
import re
□ Regex can be used in search() method:
if re.search( r"Florence" , line ):
    print( line )
□ Simplest regular expression: Simple sequence of
characters
Example:
Regular expressions
'sun'
Also matches: disunited, sunk, asunder (and Sunday, if case is ignored)
' sun '
Does NOT match:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.
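The behaviour of the two patterns can be checked with a short script. This is a sketch; the variable names are illustrative, and the `\b` word boundary used in the last line is the anchor introduced later in the slides.

```python
import re

words = ["disunited", "sunk", "Sunday", "asunder"]

# Case-sensitive search: "Sunday" is not found because of its capital S.
case_sensitive = [w for w in words if re.search(r"sun", w)]
# With re.IGNORECASE all four words contain the pattern.
case_insensitive = [w for w in words if re.search(r"sun", w, re.IGNORECASE)]
print(case_sensitive)
print(case_insensitive)

line = "gloom beneath the noonday sun."
# ' sun ' requires a space on both sides, so the full stop blocks the match.
with_spaces = re.search(r" sun ", line)
# Word boundaries accept punctuation next to the word.
with_boundaries = re.search(r"\bsun\b", line)
print(with_spaces is None, with_boundaries is not None)
```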
. Any character (except a newline)
\w Any alphanumerical character:
alphabetical characters, numbers and
underscore
\d Any digit
\s White space: space, tab, newline
[..] Any of the characters supplied within
square brackets, e.g. [A-Za-z]
Character classes
'\d{4}'
Matches: 1234, 2013, 1066
‘[a-zA-Z]+’
Matches any word that consists of
alphabetical characters only
Does not FULLY match:
e-mail, catch22, can’t
Examples
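The two example patterns can be tried out with `re.fullmatch()`, which only succeeds when the whole string fits the pattern. A small sketch; the list of candidate strings is made up for the illustration.

```python
import re

candidates = ["1234", "2013", "1066", "e-mail", "catch22", "can't", "word"]

# Exactly four digits.
four_digits = [c for c in candidates if re.fullmatch(r"\d{4}", c)]
# One or more alphabetical characters and nothing else.
letters_only = [c for c in candidates if re.fullmatch(r"[a-zA-Z]+", c)]

print(four_digits)
print(letters_only)
```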
{n,m} Pattern must occur at least n times,
at most m times
{n,} At least n times
{n} Exactly n times
? is the same as {0,1}
+ is the same as {1,}
* Is the same as {0,}
Quantifiers
'b[aeiou]{1,2}t\w*'
bit
blister
boathouse
beauty
boyhood
but
beast
beat
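Which of the listed words fit the pattern can be checked as follows. The sketch assumes that the pattern on the slide is `b[aeiou]{1,2}t\w*`, that is, a literal b, one or two vowels, a t and any number of word characters.

```python
import re

pattern = re.compile(r"b[aeiou]{1,2}t\w*")
words = ["bit", "blister", "boathouse", "beauty",
         "boyhood", "but", "beast", "beat"]

# Keep the words in which the pattern is found.
matching = [w for w in words if pattern.search(w)]
not_matching = [w for w in words if not pattern.search(w)]

print(matching)
print(not_matching)
```

Words such as beauty fail because three vowels follow the b, while {1,2} allows at most two before the t.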
Do not match characters, but locations
within strings.
\b Word boundaries
^ Start of a line
$ End of a line
Anchors
Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK
31-7/7-8-1751]; w. 1719-36)
Example
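import re

# Split the biography into its semicolon-separated parts
parts = re.split( '[;]' , data["biography"] )
for p in parts:
    p = p.strip()
    if re.search( r'^[*]' , p ):
        # Birth: drop the leading asterisk, then capture the year
        p = re.sub( r'^\*' , '' , p )
        if re.search( r'\d{4}' , p ):
            match = re.search( r'(\d{2,4})' , p )
            data['dob'] = match.group(1)
    elif re.search( r'^[†]' , p ):
        # Death: keep the full date string
        data['dod'] = p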
<person>
<firstName>Pieter Jansz van der</firstName>
<lastName>Aa</lastName>
<dob>1697</dob><dod>1751</dod>
<pob>Leiden</pob>
<professional-start>1719</professional-start>
<professional-end>1736</professional-end>
<profession>boekverkoper</profession>
…
</person>
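As a self-contained illustration of how one entry might be turned into the fields shown above, the following sketch parses the example line with regular expressions. The function name `parse_entry`, the dictionary keys and the rule that a two-digit end year falls in the same century as the start year are assumptions of this example, not the method used in the workshop; the profession is not derived because it does not appear in the example line.

```python
import re

entry = ("Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 "
         "[begr. PK 31-7/7-8-1751]; w. 1719-36)")

def parse_entry(entry):
    """Split one entry into name, birth, death and working-period fields."""
    record = {}
    # General shape: "Last name, first names (details)"
    head = re.match(r"^([^,]+),\s*([^(]+)\((.*)\)\s*$", entry)
    record["lastName"] = head.group(1).strip()
    record["firstName"] = head.group(2).strip()
    for part in head.group(3).split(";"):
        part = part.strip()
        if part.startswith("*"):
            # Birth: place name followed by a four-digit year
            birth = re.search(r"\*\s*(\D*?)\s*(\d{4})", part)
            record["pob"], record["dob"] = birth.group(1), birth.group(2)
        elif part.startswith("†"):
            # Death: the first four-digit number is the year
            record["dod"] = re.search(r"\d{4}", part).group(0)
        elif part.startswith("w."):
            start, end = re.search(r"(\d{4})-(\d{2,4})", part).groups()
            # Assumption: a two-digit end year is in the same century
            record["professional-start"] = start
            record["professional-end"] = end if len(end) == 4 else start[:2] + end
    return record

record = parse_entry(entry)
print(record)
```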
□ Indication of the readability of the text, often
based on the average number of words per sentence,
or the average number of syllables per word
□ Examples include Flesch-Kincaid test, Gunning-
Fog index, Coleman-Liau index
□ Flesch-Kincaid is often used in US educational
system and roughly indicates number of years of
formal education
Readability metrics
Flesch-Kincaid formula
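The formula itself did not survive on this slide. The sketch below uses the commonly cited Flesch-Kincaid grade level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, on the assumption that this is the formula the slide showed. The function name and the sample counts are illustrative; counting syllables automatically needs a separate heuristic that is not shown here.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from three counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Made-up counts: 100 words in 5 sentences, 150 syllables in total.
grade = flesch_kincaid_grade(total_words=100, total_sentences=5,
                             total_syllables=150)
print(round(grade, 2))
```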
Pandas
□ A Python module
developed for data
science
□ Available for Python
2.7 and higher
□ Many methods for
reading the contents
of data sets in a wide
range of formats such
as csv, tsv or MS
Excel
□ The data in CSV files can be made available via
the read_csv() method
□ This method converts the CSV file into a so-
called data frame.
□ A data frame consists of rows and columns
□ The data type of the columns is Series
Data frames
title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
Ivanhoe,210928,6245,12663,8360,29230
MobyDick,252594,9982,18578,14207,32773
PrideandPrejudice,143598,5852,7777,9171,23724
SonsandLovers,204126,16218,9630,10853,33534
ThroughtheLookingGlass,36680,2061,1639,2096,6104
TreasureIsland,82769,3734,4054,4361,12302
VanityFair,355446,13224,22002,14988,50865
10 Rows and 6 Series
import pandas as pd
df = pd.read_csv( 'data.csv' )
print( df.shape )
print( df.columns )
Basic information
import pandas as pd
df = pd.read_csv( 'data.csv' )
print( df.mean( numeric_only=True ) )
## Mean values of the numeric columns
print( df.corr( numeric_only=True ) )
## Correlations between the numeric columns
Statistics
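The steps on the pandas slides can be tried without a separate file by reading the CSV text from a string. This sketch assumes a recent pandas version (1.5 or later) in which `mean()` and `corr()` accept `numeric_only`, so that the text column `title` is skipped; the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
Ivanhoe,210928,6245,12663,8360,29230
MobyDick,252594,9982,18578,14207,32773
PrideandPrejudice,143598,5852,7777,9171,23724
SonsandLovers,204126,16218,9630,10853,33534
ThroughtheLookingGlass,36680,2061,1639,2096,6104
TreasureIsland,82769,3734,4054,4361,12302
VanityFair,355446,13224,22002,14988,50865
"""

# read_csv() accepts a file name or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))

# Statistics over the numeric columns only.
means = df.mean(numeric_only=True)
correlations = df.corr(numeric_only=True)
print(means)
print(correlations.loc["tokens", "verbs"])
```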
Biographic research on Van Gogh
Number of types
Number of tokens
Lexical variety
Type-token ratio
□ The higher the number, the higher the
vocabulary diversity.
□ If the number is (relatively) low, there is a
high level of repetition
□ The length of the text has an impact on the
type-token ratio
Correlation
□ A statistical formula that measures the degree
in which variables are related
□ Expressed as a numerical value ranging from -1
to +1
□ A negative correlation means that values for
one variable go down when the values for the
other go up
□Source: http://www.stat.yale.edu/
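To make the range from -1 to +1 concrete, the following sketch computes the Pearson correlation coefficient by hand. It assumes that Pearson's coefficient is the measure meant on this slide; the function name `pearson` and the sample lists are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the two means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Values rise together: positive correlation.
positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])
# One goes up while the other goes down: negative correlation.
negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(positive, negative)
```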
Questions?

Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 

Recently uploaded (20)

一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 

Data Science Workshop

  • 1. Discover the world at Leiden University Data Science Workshop Dr. Peter Verhaar Maastricht, 2 April 2019
  • 2. Background ◻ Unprecedented growth in the volume of digital data ◻ Combined with the growing sophistication of algorithms and tools
  • 3. Data Science □ Data mining or data science is the process of applying computational and algorithmic methods to large datasets. □ Text mining is a collection of methods used to extract information not from “formalised database records” but from “unstructured textual data” (Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1). Image: Érik Desmazières, illustration for cover of La biblioteca de Babel, 1941
  • 4. Centre for Digital Scholarship ◻ Located within Leiden University Libraries ◻ Staffed by subject librarians and software developers ◻ Builds on existing services and existing expertise ◻ Focus on Open Access, Research Data Management, Digital Preservation and Text and Data Mining
  • 5. Support for TDM within the library? ◻ Central knowledge base ◻ Fostering interdisciplinary collaboration ◻ Clarifying terms and conditions of licences and negotiations with publishers ◻ Digital preservation ◻ Continuation of the traditional role of libraries: providing access to texts. Image: the old Public Library of Cincinnati, now demolished
  • 6. Building expertise on TDM ◻ Literature review; courses on Data Science and on Machine Learning; online tutorials (R packages, Mallet, OpenNLP, Python packages: nltk, textmining, Matplotlib, gensim) ◻ Involvement in MA course on Text and Data Mining ◻ Interviews with scholars who have expertise ◻ Internal research projects and pilots with researchers
  • 7. Biographical research on Van Gogh ◻ Signs of mental decline in the correspondence of Vincent van Gogh ◻ Average length of sentences and type-token ratios
  • 8. TDM Workshops ◻ Full-day workshop with an explanation of the basics of Python ◻ Explanation of a range of algorithms which can be used to analyse texts ◻ Experiments based on research questions of participants
  • 9. ◻ Educational programme aimed at librarians ◻ The aim is to ensure that librarians can talk about technology on a basic level ◻ Courses are developed in collaboration with the National Library (KB) and VU Amsterdam, under the name “DH Clinics”
  • 10. Workshop outline ◻ Python and Jupyter Notebook ◻ Data acquisition: web scraping, APIs, Linked Open Data ◻ Data analysis and enrichment: Pandas, CSV, TDM, tokenization, POS tagging, lemmatization ◻ Data visualisation: Matplotlib
  • 11. ◻ Python is a widely used programming language ◻ Developed by Guido van Rossum ◻ Advocates code readability and simplicity ◻ Programming style ought to be ‘pythonic’
  • 12. http://www.rapidtables.com/ Algorithm: Word2Vec, Topic Modelling (LDA); Programming language: Python, Java, Perl; Tool: Voyant, Tapor
  • 13. Variables
    □ Variables have a name: any combination of alphanumerical characters and underscores that does not start with a digit, e.g. keyword
    □ Variables can be assigned a value with a specific data type:
        keyword = "Elzevier"
        number = 10
    □ Examples of variable types include string (a sequence of characters), integer (whole numbers) and floating point numbers
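A minimal sketch of the variable slide, reusing its two names. The `price` variable and its value are additions for illustration only.

```python
# Assign values of different types to named variables
keyword = "Elzevier"   # string: a sequence of characters
number = 10            # integer: a whole number
price = 12.5           # floating point number (hypothetical example value)

# type() reports the data type that Python inferred from each value
for value in (keyword, number, price):
    print(value, type(value).__name__)
```

Python needs no type declaration: the type follows from the assigned value.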
  • 14. Strings
    □ Can be created with single quotes and with double quotes:
        author = 'Douglas Adams'
        title = "The Hitchhiker's Guide to the Galaxy"
    □ You can use escape sequences in your string to add basic formatting: "\n" new line, "\t" tab
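The two quoting styles and the escape sequences can be tried out as follows. This is a small sketch; the `entry` variable is an illustrative name that does not appear on the slide.

```python
# Single and double quotes create the same kind of string
author = 'Douglas Adams'
title = "The Hitchhiker's Guide to the Galaxy"

# \n starts a new line, \t inserts a tab
entry = author + "\n\t" + title
print(entry)
```

Using double quotes for the title avoids having to escape the apostrophe inside it.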
  • 15. Mathematical operators
    □ The following mathematical operators can be used: + addition, - subtraction, / division, * multiplication
    □ For example:
        sum = 5 + 6
        product = 5 * 6
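A short sketch of all four operators. It uses the name `total` rather than `sum`, because `sum` is also the name of a built-in Python function; the other variable names are illustrative.

```python
total = 5 + 6        # addition
difference = 5 - 6   # subtraction
product = 5 * 6      # multiplication
quotient = 5 / 6     # division; always yields a float in Python 3

print(total, difference, product, quotient)
```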
  • 16. Boolean operators
    □ Boolean (comparison) operators compare values: > greater than, < less than, == equal to
    □ Expressions result in a ‘Boolean value’: True or False
        a = 5
        b = 8
        print( a > b )   # False
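The three comparisons from the slide, applied to the same two values. The variable `a_is_smaller` is an added name, used to show that the outcome of a comparison is an ordinary value.

```python
a = 5
b = 8

print(a > b)    # False
print(a < b)    # True
print(a == b)   # False

# The result of a comparison can be stored like any other value
a_is_smaller = a < b
print(type(a_is_smaller).__name__)
```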
  • 17. Selection
        if <condition>:
            <first block of code>
        elif <condition>:
            <second block of code>
        else:
            <last block of code>
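The template can be filled in as follows. This is a sketch only: the function name `describe` and the period cut-off years are illustrative assumptions, not part of the workshop material.

```python
def describe(year):
    """Return a rough period label for a publication year (illustrative cut-offs)."""
    if year < 1501:
        return "incunable"
    elif year < 1801:
        return "hand-press period"
    else:
        return "machine-press period"

print(describe(1697))
```

Only the first branch whose condition holds is executed; `else` catches everything that is left.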
  • 18. Jupyter Notebook □ Open source application which can be used to create documents containing both code and documentation □ Such documents can be opened in a browser □ It offers support for a variety of programming languages, including Python, Julia and R □ It includes “kernels” or computational engines which can run the code directly
  • 19. Opening Jupyter Notebook □ Open Anaconda Navigator and select Jupyter Notebook > Launch □ OR navigate to the directory that contains your files in the Command Prompt and type in: jupyter notebook □ Jupyter can then be opened in a web browser (e.g. Google Chrome) via the address localhost:8888 □ Jupyter initially opens the dashboard: a directory displaying all your files
  • 20. Opening Jupyter Notebook □ Jupyter notebooks can also be opened in Microsoft Azure: https://notebooks.azure.com/ □ Create a new project and import a GitHub repository □ The notebooks for this workshop can be downloaded from: https://github.com/peterverhaar/MaastrichtDataScience
  • 21. Algorithm
        Define number to be guessed
        Ask user to type in number
        WHILE given number IS NOT number to be guessed:
            Given number HIGHER?
                Y: Print: LOWER
                N: Print: HIGHER
            Ask user to type in number
        Print: Number is correct
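The guessing algorithm could be written in Python roughly as below. To keep the sketch reproducible, the guesses come from a prepared list instead of `input()`, and the list is assumed to contain the secret number. The function name `play` is hypothetical.

```python
def play(secret, guesses):
    """Run the guessing loop and return the hints that were given."""
    guesses = iter(guesses)
    hints = []
    guess = next(guesses)              # ask for the first number
    while guess != secret:             # WHILE given number IS NOT number to be guessed
        if guess > secret:             # given number HIGHER?
            hints.append("LOWER")
        else:
            hints.append("HIGHER")
        guess = next(guesses)          # ask again
    hints.append("Number is correct")
    return hints

print(play(42, [50, 25, 42]))
```

In a notebook, `next(guesses)` would be replaced by `int(input("Your guess: "))`.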
  • 24. Data Acquisition ◻ Direct downloads of data objects (e.g. full text in UTF-8 from Delpher or Project Gutenberg) ◻ Downloading data ◻ Downloads of data via Application Programming Interfaces (APIs) ◻ Web scraping (via BeautifulSoup) ◻ Downloads of CSV files from data repositories such as Kaggle, Figshare, DANS EASY
  • 25. □ An Application Programming Interface (API) is a technology which can be used to make specific functions of an application or specific data sets available to external services □ Diagram: the user sends a request (plus key) to the API of a service and receives XML / JSON in return
  • 26. □ Some APIs are open; in other cases an API key is needed □ Data may be delivered in different formats: JSON, XML □ Actions such as create, read, update and delete are technically possible, but options are usually limited to reading data □ Texts and images
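The request and response cycle can be sketched without a network connection. The endpoint, the parameter names, the key and the response body below are all made up for illustration; a real service documents its own.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and key
BASE_URL = "https://api.example.org/search"
params = {"query": "Van Gogh", "format": "json", "key": "YOUR_API_KEY"}
request_url = BASE_URL + "?" + urlencode(params)
print(request_url)

# Stand-in for the body a service might send back (made-up records)
response_body = '{"total": 2, "records": [{"title": "Letter 1"}, {"title": "Letter 2"}]}'

# json.loads turns the JSON text into Python dictionaries and lists
data = json.loads(response_body)
titles = [record["title"] for record in data["records"]]
print(titles)
```

With a live service, the response body would be fetched from `request_url`, for instance with `urllib.request.urlopen`.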
  • 28. Tokenisation □ A process in which texts are divided into smaller units (e.g. paragraphs, sentences, words) □ Token counts reflect the total number of words; types are the unique words in a text □ “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity” Tokens: 36, Types: 13
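The token and type counts for the Dickens sentence can be reproduced with a simple regular-expression tokeniser. This is a sketch that treats every run of letters as a token and ignores case; NLTK's tokeniser, used later in the workshop, follows more refined rules.

```python
import re

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

# Tokens: every run of letters, lower-cased so that "It" and "it" count as one type
tokens = re.findall(r"[a-z]+", text.lower())
types = set(tokens)

print(len(tokens), len(types))   # 36 13
```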
  • 29. Research based on vocabulary □ Segmentation or tokenisation □ Often based on the fact that there are spaces in between words (at least since scriptura continua was abandoned in the late 9th century) □ “Soft markup”
  • 30. Frequency lists □ ‘Bag of words’ model: original word order is ignored □ “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity”
        the          6
        it           6
        of           6
        was          6
        epoch        2
        age          2
        times        2
        foolishness  1
        wisdom       1
        belief       1
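A frequency list like the one above can be built with `collections.Counter` from the standard library. The sketch below removes the commas and splits on white space, which is enough for this sentence but too crude for real texts.

```python
from collections import Counter

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

# Bag of words: only the counts are kept, the word order is discarded
tokens = text.lower().replace(",", "").split()
frequencies = Counter(tokens)

for word, count in frequencies.most_common(7):
    print(word, count)
```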
  • 33. Authorship attribution John Burrows, Never Say Always Again: Reflections on the Numbers Game □ Suggesting an author for texts whose authorship is disputed
  • 35. Type-token ratio □ Peter Garrard, Textual Pathology □ Total number of types divided by the number of tokens □ Gives an indication of the lexical diversity of a text
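The ratio itself takes only a few lines. In this sketch the function name `type_token_ratio` and both sample phrases are illustrative, and the tokens are obtained by splitting on white space.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

varied = "the quick brown fox jumps over a lazy dog".split()
repetitive = "it was the age of wisdom it was the age of foolishness".split()

print(round(type_token_ratio(varied), 2))       # 1.0
print(round(type_token_ratio(repetitive), 2))   # 0.58
```

Because longer texts repeat more words, ratios are best compared between samples of similar length.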
  • 37. Python NLTK modules □ NLTK modules contain text corpora, lexical resources, and “a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning”
        import nltk
        from nltk.tokenize import sent_tokenize, word_tokenize
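  • 38.
        # Requires the imports from the previous slide and the NLTK tokenizer and tagger data (see nltk.download())
        with open( "ARoomWithAView.txt" , encoding = 'utf-8' ) as novel:
            fullText = novel.read()
        sentences = sent_tokenize(fullText)
        for sent in sentences:
            words = word_tokenize(sent)
            tags = nltk.pos_tag(words)
            for t in tags:
                # each tag is a (word, part-of-speech) pair
                print( t[0] + " => " + t[1] + "\n" )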
  • 39.
        The => DT
        Signora => NNP
        had => VBD
        no => DT
        business => NN
        to => TO
        do => VB
        it => PRP
        said => VBD
        Miss => NNP
        Bartlett => NNP
  • 41. □ Stemming: converting an inflected verb form into its stem □ Algorithms based on removal of suffixes □ Lemmatisation: relating an inflected verb form to its lemma (dictionary form) □ Tags are commonly based on the Penn Treebank Tag Set
  • 42.
        from nltk.stem import PorterStemmer, WordNetLemmatizer

        st = PorterStemmer()
        lm = WordNetLemmatizer()   # needs the WordNet data (see nltk.download())

        print( st.stem("studying") )                    # studi
        print( lm.lemmatize("studying" , pos = "v") )   # study
        print( st.stem("went") )                        # went
        print( lm.lemmatize("went", pos="v") )          # go
        print( st.stem("are") )                         # are
        print( lm.lemmatize("are", pos="v") )           # be
  • 43. Regular expressions
    □ A pattern which represents a specific sequence of characters
    □ To work with regular expressions in Python, you need to import the ‘re’ module:
        import re
    □ A regex can be used in the search() function:
        if re.search( r"Florence" , line ):
            print( line )
  • 44. Regular expressions □ Simplest regular expression: a simple sequence of characters □ Example: 'sun' also matches: disunited, sunk, asunder (and Sunday, if case is ignored with re.IGNORECASE) □ ' sun ' (with a space on either side) does NOT match: “[…] the gate of the eastern sun,” or “[…] gloom beneath the noonday sun.”
  • 45. Character classes
        .      Any character (except a newline)
        \w     Any alphanumerical character: alphabetical characters, numbers and underscore
        \d     Any digit
        \s     White space: space, tab, newline
        [..]   Any of the characters supplied within square brackets, e.g. [A-Za-z]
  • 46. Examples
        '\d{4}'       Matches: 1234, 2013, 1066
        '[a-zA-Z]+'   Matches any word that consists of alphabetical characters only. Does not FULLY match: e-mail, catch22, can’t
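Both example patterns can be checked directly in Python. The sample sentence below is made up; `re.fullmatch` is used for the second pattern because it succeeds only when the whole string fits.

```python
import re

line = "The battle of 1066 and catch22 were discussed by e-mail in 2013"

# \d{4} finds runs of exactly four digits
years = re.findall(r"\d{4}", line)
print(years)   # ['1066', '2013']

# fullmatch succeeds only when the entire string fits the pattern
for word in ["battle", "e-mail", "catch22", "can't"]:
    print(word, bool(re.fullmatch(r"[a-zA-Z]+", word)))
```

Writing patterns as raw strings (with the `r` prefix) keeps Python from interpreting the backslashes itself.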
  • 47. Quantifiers
        {n,m}   Pattern must occur at least n times, at most m times
        {n,}    At least n times
        {n}     Exactly n times
        ?       is the same as {0,1}
        +       is the same as {1,}
        *       is the same as {0,}
  • 49. Anchors: do not match characters, but locations within strings
        \b   Word boundaries
        ^    Start of a line
        $    End of a line
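Anchors and quantifiers are easiest to understand by trying them on a few strings. The sample lines in this sketch are made up, apart from the working-years fragment taken from the example record.

```python
import re

lines = ["sunk in the bay", "the noonday sun.", "Sunday morning"]

# \b marks a word boundary, so "sunk" no longer counts as a hit for "sun"
whole_word = [line for line in lines if re.search(r"\bsun\b", line)]
print(whole_word)

# ^ anchors the pattern to the start of the string
starts_with_the = [line for line in lines if re.search(r"^the", line)]
print(starts_with_the)

# $ anchors to the end; {2,4} allows two to four digits
ends_with_digits = bool(re.search(r"\d{2,4}$", "w. 1719-36"))
print(ends_with_digits)
```

By default `^` and `$` refer to the start and end of the whole string; with the `re.MULTILINE` flag they apply to every line in it.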
  • 50. Example: Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK 31-7/7-8-1751]; w. 1719-36)
  • 51. Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK 31-7/7-8-1751]; w. 1719-36)
        # data["biography"] is assumed to hold the text between the parentheses
        parts = re.split( '[;]' , data["biography"] )
        for p in parts:
            p = p.strip()
            if re.search( r'^[*]' , p ):
                # birth: drop the asterisk, then look for the year
                p = re.sub( r'^[*]' , '' , p )
                if re.search( r'\d{4}' , p ):
                    match = re.search( r'(\d{2,4})' , p )
                    data['dob'] = match.group(1)
            elif re.search( r'^[†]' , p ):
                # death: keep the whole stripped segment
                data['dod'] = p
  • 52.
        <person>
          <firstName>Pieter Jansz van der</firstName>
          <lastName>Aa</lastName>
          <dob>1697</dob>
          <dod>1751</dod>
          <pob>Leiden</pob>
          <professional-start>1719</professional-start>
          <professional-end>1736</professional-end>
          <profession>boekverkoper</profession>
          …
        </person>
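One possible way to get from the raw record to the fields of the structured version is sketched below. It is not necessarily the approach used in the workshop notebooks: the patterns are written for this single record, the dictionary keys simply mirror the XML element names, and the rule that "36" borrows the century of "1719" is an assumption.

```python
import re

record = "Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK 31-7/7-8-1751]; w. 1719-36)"

person = {}
# Name before the opening parenthesis, biography between the parentheses
match = re.search(r"^([^,]+),\s*(.+?)\s*\((.*)\)$", record)
person["lastName"], person["firstName"], biography = match.groups()

for part in re.split(r";", biography):
    part = part.strip()
    if part.startswith("*"):
        # Birth: optional place name followed by a four-digit year
        birth = re.search(r"^\*\s*(\D*?)\s*(\d{4})", part)
        person["pob"], person["dob"] = birth.group(1), birth.group(2)
    elif part.startswith("†"):
        # Death: take the first four-digit year after the dagger
        person["dod"] = re.search(r"\d{4}", part).group(0)
    elif part.startswith("w."):
        # Working years such as 1719-36: a short end year borrows the century of the start year
        start, end = re.search(r"(\d{4})-(\d{2,4})", part).groups()
        person["professional-start"] = start
        person["professional-end"] = end if len(end) == 4 else start[:2] + end

print(person)
```

Records that deviate from this layout (no birth place, no working years) would need extra checks before calling `.group()`.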
  • 53. Readability metrics □ An indication of the readability of a text, often based on the average number of words per sentence or the average number of syllables per word □ Examples include the Flesch-Kincaid test, the Gunning fog index and the Coleman-Liau index □ Flesch-Kincaid is often used in the US educational system and roughly indicates the number of years of formal education needed to understand the text
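The Flesch-Kincaid grade level combines the two averages mentioned above: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below estimates syllables by counting vowel groups, which is a crude assumption; the function names and sample texts are illustrative.

```python
import re

def count_syllables(word):
    """Rough estimate: count groups of consecutive vowels (not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat on the mat. It was warm."
dense = "Computational methodologies facilitate sophisticated investigations of voluminous correspondence."

print(round(flesch_kincaid_grade(simple), 1))
print(round(flesch_kincaid_grade(dense), 1))
```

Short sentences with short words score low, and can even score below zero; long sentences with polysyllabic words score high.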
  • 55. Pandas □ A Python module developed for data science □ At the time of this workshop available for Python 2.7 and higher; current releases require Python 3 □ Many methods for reading the contents of data sets in a wide range of formats such as CSV, TSV or MS Excel
  • 56. Data frames □ The data in CSV files can be made available via the read_csv() function □ This function converts the CSV file into a so-called data frame □ A data frame consists of rows and columns □ The data type of the columns is Series
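The points above can be tried without a file on disk, since `read_csv` also accepts a file-like object. The CSV content below is made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV content; read_csv accepts a file path or any file-like object
csv_text = "title,year,pages\nBook A,1697,120\nBook B,1719,340\nBook C,1736,95\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                     # (rows, columns)
print(list(df.columns))             # column labels
print(type(df["year"]).__name__)    # each column is a pandas Series
print(df["pages"].mean())           # mean of one numeric column
```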
  • 58. Basic information
        import pandas as pd

        df = pd.read_csv( 'data.csv' )
        print( df.shape )     # number of rows and columns
        print( df.columns )   # column labels
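  • 59. Statistics
        import pandas as pd

        df = pd.read_csv( 'data.csv' )
        # numeric_only=True skips text columns, which recent pandas versions no longer ignore silently
        print( df.mean( numeric_only=True ) )   ## Mean values
        print( df.corr( numeric_only=True ) )   ## Correlations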
  • 60. Biographical research on Van Gogh
  • 61. Lexical variety = number of types / number of tokens
  • 62. Type-token ratio □ The higher the number, the higher the vocabulary diversity. □ If the number is (relatively) low, there is a high level of repetition □ The length of the text has an impact on the type-token ratio
  • 63. Correlation □ A statistical measure of the degree to which variables are related □ Expressed as a numerical value ranging from -1 to +1 □ A negative correlation means that values for one variable go down when the values for the other go up □ Source: http://www.stat.yale.edu/
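The slide does not name a specific coefficient; the sketch below assumes the widely used Pearson correlation and computes it with the standard library only. The two value lists are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up values: sentence length rises while the type-token ratio falls
sentence_length = [12, 15, 19, 22, 30]
ttr = [0.62, 0.58, 0.55, 0.51, 0.44]

print(round(pearson(sentence_length, ttr), 2))
print(round(pearson([1, 2, 3], [2, 4, 6]), 2))
```

A value near -1 reflects the negative relationship described above; two lists that rise together give a value near +1.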
  • 65. Questions?