Data Science Workshop

Discover the world at Leiden University
Dr. Peter Verhaar, Maastricht, 2 April 2019

Background
◻ Unprecedented growth in
volume of digital data
◻ Combined with growing
sophistication of algorithms
and tools

Data Science
□ Data mining or data science is
the process of applying
computational and algorithmic
methods to large datasets.
□ Text mining is a collection of
methods used to extract
information not from “formalised
database records” but from
“unstructured textual data”
Feldman, Ronen. The Text Mining Handbook:
Advanced Approaches in Analyzing
Unstructured Data. Cambridge: Cambridge
University Press, 2007, p. 1
Érik Desmazière, illustration for
cover of La biblioteca de Babel,
1941

Centre for Digital Scholarship
◻ Located within Leiden University
Libraries
◻ Staffed by subject librarians and software
developers
◻ Builds on existing services and existing
expertise
◻ Focus on Open Access, Research Data
management, Digital Preservation and
Text and Data Mining

Support for TDM within library?
◻ Central knowledge base
◻ Fostering interdisciplinary
collaboration
◻ Clarifying terms and
conditions of licences and
negotiations with publishers
◻ Digital preservation
◻ Continuation of traditional
role of libraries: providing
access to texts
The old Public library of
Cincinnati, now demolished

Building expertise on TDM
◻ Literature review; Courses
on Data Science and on
Machine Learning; Online
Tutorials (R Package, Mallet,
OpenNLP, Packages in
Python: nltk, textmining,
matplotlib, gensim)
◻ Involvement in MA course
on Text and Data Mining
◻ Interviews with scholars who
have expertise
◻ Internal research projects
and pilots with researchers

Biographic research on Van Gogh
◻ Signs of mental decline in the correspondence of Vincent
van Gogh
◻ Average length of sentences and type-token ratios

TDM Workshops
◻ Full day workshop with
explanation of the basics of
Python
◻ Explanation of a range of
algorithms which can be used to
analyse texts
◻ Experiments based on research
questions of participants

◻ Educational programme
aimed at librarians
◻ Aim is to ensure that
librarians can talk about
technology on a basic level
◻ Courses are developed in
collaboration with National
Library (KB) and VU
Amsterdam, under the name
“DH Clinics”

Workshop outline
◻ Python and Jupyter Notebook
◻ Data acquisition: Web Scraping, APIs, Linked
Open Data
◻ Data analysis and enrichment: Pandas, CSV,
TDM, tokenization, POS tagging,
lemmatization
◻ Data visualisation: Matplotlib

◻ Python is a widely used
programming language
◻ Developed by Guido van
Rossum
◻ Advocates code
readability and simplicity
◻ Programming style ought
to be ‘pythonic’

http://www.rapidtables.com/
Algorithm: Word2Vec, Topic Modelling (LDA)
Programming language: Python, Java, Perl
Tool: Voyant, TAPoR

Variables
□ Variables have a name: any combination of
alphanumerical characters and underscores that
does not start with a digit, e.g.
keyword
□ Variables can be assigned a value with a specific data
type
keyword = "Elzevier"
number = 10
□ Examples of variable types include string (a sequence
of characters), integer (whole numbers) and floating
point numbers
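A minimal sketch of the three variable types named above. The variable `price` is a hypothetical extra example, added only to show a floating point number.

```python
keyword = "Elzevier"   # string: a sequence of characters
number = 10            # integer: a whole number
price = 12.5           # floating point number (hypothetical example value)

# type() reports the data type that Python has assigned to each value
for value in (keyword, number, price):
    print(value, type(value).__name__)
```

Python decides on the data type at the moment a value is assigned, so no separate type declaration is needed.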

Strings
□ Can be created with single quotes or with double
quotes
author = 'Douglas Adams'
title = "The Hitchhiker's guide to the galaxy"
□ You can use “escape sequences” in your string to
add basic formatting:
"\n" new line
"\t" tab
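As a small illustration of the escape sequences above, the sketch below joins the two strings with a tab and ends the line with a newline. The variable name `record` is made up for this example.

```python
author = 'Douglas Adams'
title = "The Hitchhiker's guide to the galaxy"

# "\t" inserts a tab between the two fields, "\n" ends the line
record = author + "\t" + title + "\n"
print(record)

# An escape sequence counts as a single character
print(len("\n"))
```

Double quotes are convenient for the title here because the text itself contains an apostrophe.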

Mathematical operators
□ The following mathematical operators can be used:
+ addition
- subtraction
/ division
* multiplication
□ For example:
sum = 5 + 6
product = 5 * 6

Boolean operators
□ Boolean operators compare values:
> greater than
< less than
== equal to
□ Expressions result in a ‘Boolean value’: True or False
a = 5
b = 8
print( a > b )

Selection
if <condition>:
    <first block of code>
elif <condition>:
    <second block of code>
else:
    <last block of code>
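The selection template above could be filled in as follows. This is only a sketch: the function name `describe_length` and the two thresholds are hypothetical and chosen purely for illustration.

```python
def describe_length(n_words):
    """Classify a text by its number of words (thresholds are arbitrary)."""
    if n_words < 1000:
        return "short"
    elif n_words < 100000:
        return "medium"
    else:
        return "long"

print(describe_length(83147))
```

Only the first block whose condition is True is executed; the `else` block runs when none of the conditions hold.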

Jupyter Notebook
□ Open source application
which can be used to create
documents containing both
code and documentation
□ Such documents can be
opened in a browser
□ It offers support for a
variety of programming
languages, including Python,
Julia and R
□ It includes “kernels” or
computational engines
which can run the code
directly

Opening Jupyter Notebook
□ Open Anaconda Navigator and select Jupyter
Notebook > Launch
□ OR navigate to the directory that contains
your files in the Command Prompt and type
in:
jupyter notebook
Jupyter can then be opened in a web-
browser (e.g. Google Chrome) via the
address localhost:8888
□ Jupyter initially opens the dashboard: a
directory displaying all your files

Opening Jupyter Notebook
□ Jupyter notebooks can also be opened in
Microsoft Azure:
https://notebooks.azure.com/
□ Create a new project and import a GitHub
repository
□ The notebooks for this workshop can be
downloaded from:
https://github.com/peterverhaar/
MaastrichtDataScience

Algorithm
Define number to be guessed
Ask user to type in number
WHILE given number IS NOT number to be guessed:
    IF given number is HIGHER than number to be guessed:
        Print: LOWER
    ELSE:
        Print: HIGHER
    Ask user to type in number
Print: Number is correct
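The guessing algorithm above can be sketched in Python as shown below. The function name `guess_number` and its parameters `read_guess` and `show` are assumptions made for this example; they default to the built-in `input` and `print`, and exist so that the loop can also be tried out without typing.

```python
def guess_number(secret, read_guess=input, show=print):
    """Ask for numbers until `secret` is guessed; return the number of attempts."""
    attempts = 1
    guess = int(read_guess("Type in a number: "))
    # WHILE given number IS NOT number to be guessed
    while guess != secret:
        if guess > secret:
            show("LOWER")
        else:
            show("HIGHER")
        guess = int(read_guess("Type in a number: "))
        attempts += 1
    show("Number is correct")
    return attempts

# Interactive use (uncomment to play):
# guess_number(42)
```

Calling `guess_number(42)` in a notebook cell asks for input until the number 42 is typed in.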

Data Acquisition
◻ Direct downloads of data objects
(e.g. full text in UTF-8 from
Delpher or Project Gutenberg)
◻ Downloading data
◻ Downloads of data via
Application Programming
Interfaces (APIs)
◻ Web scraping (via
BeautifulSoup)
◻ Download CSV files from data
repositories such as Kaggle,
Figshare, DANS EASY

API
□ An Application Programming Interface is a
technology which can be used to make specific
functions of an application or specific data sets
available for external services
[Diagram: the user sends a request plus key via the API to the service; the service returns XML / JSON]

□ Some APIs are open; in
other cases an API key is
needed
□ Data may be delivered in
different formats: JSON,
XML
□ Actions such as create,
read, update and delete
are technically possible,
but options are usually
limited to reading data
□ Texts and images
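The request-and-response pattern described above might look roughly like this in Python. Everything specific here is an assumption for illustration: the endpoint `https://api.example.org/search`, the parameter names `q`, `format` and `key`, and the sample JSON response are all made up and do not describe any real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and key, used only for illustration
BASE_URL = "https://api.example.org/search"
API_KEY = "YOUR-API-KEY"

def build_request_url(query, fmt="json"):
    """Combine the base URL with the query, the desired format and the key."""
    params = {"q": query, "format": fmt, "key": API_KEY}
    return BASE_URL + "?" + urlencode(params)

# A made-up JSON response, standing in for what a service might return
sample_response = '{"total": 2, "records": [{"title": "Letter 1"}, {"title": "Letter 2"}]}'

def titles_from_response(body):
    """Parse a JSON response and collect the title of every record."""
    data = json.loads(body)
    return [record["title"] for record in data["records"]]

print(build_request_url("van gogh"))
print(titles_from_response(sample_response))
```

In a real script the URL would be fetched first (for instance with `urllib.request.urlopen`), and the documentation of the API in question determines the actual parameter names and response structure.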

Tokenisation
□ A process in which texts are divided into smaller units (e.g.
paragraphs, sentences, words)
□ Token counts reflect the total number of words; types are
the unique words in a text
“It was the best of
times, it was the worst
of times, it was the
age of wisdom, it was
the age of foolishness,
it was the epoch of
belief, it was the
epoch of incredulity”
Tokens: 36
Types: 13
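The token and type counts for the quotation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a very simple tokeniser: the text is lower-cased and every run of letters counts as one word.

```python
import re

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

# Lower-case the text and treat every run of letters as one token
tokens = re.findall(r"[a-z]+", text.lower())
# A set keeps each distinct word only once
types = set(tokens)

print("Tokens:", len(tokens))   # 36
print("Types:", len(types))     # 13
```

Lower-casing matters here: without it, "It" and "it" would be counted as two different types.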

Research based on vocabulary
□ Segmentation or
tokenisation
□ Often based on the fact
that there are spaces in
between words (at least
since scriptura continua
was abandoned in late
9th C.)
□ “soft mark up”

Frequency lists
□ ‘Bag of words’ model: original
word order is ignored
“It was the best of
times, it was the worst
of times, it was the
age of wisdom, it was
the age of foolishness,
it was the epoch of
belief, it was the
epoch of incredulity”
the 6
it 6
of 6
was 6
epoch 2
age 2
times 2
foolishness 1
wisdom 1
belief 1
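A frequency list like the one above can be built with `collections.Counter` from the standard library. The sketch assumes the same simple tokeniser as before (lower-cased runs of letters).

```python
import re
from collections import Counter

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness, "
        "it was the epoch of belief, it was the epoch of incredulity")

tokens = re.findall(r"[a-z]+", text.lower())

# Counter ignores word order and only keeps a count per word
frequencies = Counter(tokens)

# Print the seven most frequent words with their counts
for word, count in frequencies.most_common(7):
    print(word, count)
```

Because only the counts are stored, the original word order cannot be recovered from `frequencies`, which is exactly the ‘bag of words’ idea.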

Authorship attribution
John Burrows, Never
Say Always Again:
Reflections on the
Numbers Game
□ Suggesting an author for texts whose authorship is
disputed

Type-token ratio
□ Peter Garrard, Textual Pathology
□ Total number of
types divided by the
number of tokens
□ Gives an indication
of the lexical
diversity of a text
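The type-token ratio described above can be computed as follows. The function name `type_token_ratio` and the two sample sentences are made up for this sketch, and the tokeniser is again a simple assumption (lower-cased runs of letters).

```python
import re

def type_token_ratio(text):
    """Number of types divided by number of tokens (0.0 for an empty text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Two made-up sentences: one repetitive, one without repeated words
repetitive = "the cat and the dog and the bird and the fish"
varied = "a quick brown fox jumps over one lazy sleeping dog"

print(round(type_token_ratio(repetitive), 3))
print(round(type_token_ratio(varied), 3))
```

The repetitive sentence has 6 types among 11 tokens, while every word in the varied sentence is unique, so its ratio is 1.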

Python NLTK modules
□ NLTK modules contain text corpora, lexical
resources, and “a suite of text processing libraries
for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning”
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

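# Read the full text of the novel
novel = open( "ARoomWithAView.txt" , encoding = 'utf-8' )
fullText = novel.read()
# Split the text into sentences, and each sentence into words
sentences = sent_tokenize(fullText)
for sent in sentences:
    words = word_tokenize(sent)
    # Assign a part-of-speech tag to each word
    tags = nltk.pos_tag(words)
    for t in tags:
        print( t[0] + " => " + t[1] + "\n")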

The => DT
Signora => NNP
had => VBD
no => DT
business => NN
to => TO
do => VB
it => PRP
said => VBD
Miss => NNP
Bartlett => NNP

□ Stemming: converting an inflected verb
form into its stem.
□ Algorithms based on removal of
suffixes
□ Lemmatisation: relating an inflected
verb form to its lemma (dictionary
form)
□ Tags are commonly based on the
Penn Treebank Tag Set

from nltk.stem import PorterStemmer, WordNetLemmatizer
st = PorterStemmer()
lm = WordNetLemmatizer()
print( st.stem("studying") ) # studi
print( lm.lemmatize("studying", pos="v") ) # study
print( st.stem("went") ) # went
print( lm.lemmatize("went", pos="v") ) # go
print( st.stem("are") ) # are
print( lm.lemmatize("are", pos="v") ) # be

Regular expressions
□ A pattern which represents a specific sequence of
characters
□ To work with regular expressions in Python, you
need to import the ‘re’ module:
import re
□ A regex can be used in the re.search() function:
if re.search( r"Florence" , line ):
    print( line )
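A self-contained sketch of the `re.search()` test above. The three sample lines are invented for this example; in the workshop setting, `line` would come from a text file that is read line by line.

```python
import re

# Invented sample lines, standing in for the lines of a text file
lines = [
    "Lucy arrived in Florence with her cousin.",
    "The Arno flowed quietly under the bridge.",
    "Nothing in Florence was as she had expected.",
]

# re.search() returns a match object if the pattern occurs anywhere in the line,
# and None otherwise, so it can be used directly as a condition
matching = [line for line in lines if re.search(r"Florence", line)]

for line in matching:
    print(line)
```

The `r` prefix marks a raw string, which keeps backslashes in later patterns such as `\d` from being interpreted by Python itself.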

Regular expressions
□ Simplest regular expression: simple sequence of
characters
Example:
'sun'
Also matches: disunited, sunk, asunder
(and Sunday, if case is ignored)
' sun '
Does NOT match:
[…] the gate of the eastern sun,
[…] gloom beneath the noonday sun.
(here "sun" is followed by punctuation, not by a space)

Character classes
. Any character
\w Any alphanumerical character:
alphabetical characters, numbers and
underscore
\d Any digit
\s White space: space, tab, newline
[..] Any of the characters supplied within
square brackets, e.g. [A-Za-z]

Examples
'\d{4}'
Matches: 1234, 2013, 1066
'[a-zA-Z]+'
Matches any word that consists of
alphabetical characters only
Does not FULLY match:
e-mail, catch22, can’t
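The two example patterns can be checked with `re.fullmatch()`, which only succeeds when the whole string fits the pattern. The candidate strings `123` and `email` are additions made for this sketch; the others come from the examples above.

```python
import re

year = re.compile(r"\d{4}")       # exactly four digits
word = re.compile(r"[a-zA-Z]+")   # one or more alphabetical characters

for candidate in ["1234", "2013", "1066", "123"]:
    print(candidate, bool(year.fullmatch(candidate)))

for candidate in ["email", "e-mail", "catch22", "can't"]:
    print(candidate, bool(word.fullmatch(candidate)))

# search() still finds the alphabetical part of a partial match
print(word.search("catch22").group(0))
```

This shows the difference between a full and a partial match: `catch22` fails `fullmatch()`, yet `search()` finds `catch` inside it.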

Quantifiers
{n,m} Pattern must occur at least n times,
at most m times
{n,} At least n times
{n} Exactly n times
? is the same as {0,1}
+ is the same as {1,}
* is the same as {0,}

'b[aeiou]{1,2}t\w*'
bit
blister
boathouse
beauty
boyhood
but
beast
beat
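One way to check which of the words above fit the pattern is to test each of them with `re.search()`. This sketch assumes the pattern reads `b[aeiou]{1,2}t\w*`: a literal b, one or two vowels, a t, and then any number of word characters.

```python
import re

pattern = re.compile(r"b[aeiou]{1,2}t\w*")

words = ["bit", "blister", "boathouse", "beauty",
         "boyhood", "but", "beast", "beat"]

# Keep only the words in which the pattern is found
matches = [w for w in words if pattern.search(w)]
print(matches)   # ['bit', 'boathouse', 'but', 'beat']
```

"beauty" fails because three vowels separate the b from the t, and "beast" fails because an s comes between the vowels and the t.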

Anchors
Do not match characters, but locations
within strings.
\b Word boundaries
^ Start of a line
$ End of a line

Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK
31-7/7-8-1751]; w. 1719-36)
Example

Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 [begr. PK
31-7/7-8-1751]; w. 1719-36)
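# data["biography"] is assumed to hold the text between the parentheses
parts = re.split( '[;]' , data["biography"] )
for p in parts:
    p = p.strip()
    if re.search( r'^[*]' , p ):
        # Birth details: remove the asterisk, then look for a year
        p = re.sub( r'^\*' , '' , p )
        if re.search( r'\d{4}' , p ):
            match = re.search( r'(\d{2,4})' , p )
            data['dob'] = match.group(1)
    elif re.search( r'^[†]' , p ):
        # Death details are stored as they are
        data['dod'] = p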

<person>
<firstName>Pieter Jansz van der</firstName>
<lastName>Aa</lastName>
<dob>1697</dob><dod>1751</dod>
<pob>Leiden</pob>
<professional-start>1719</professional-start>
<professional-end>1736</professional-end>
<profession>boekverkoper</profession>
…
</person>
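As a rough sketch of how such a record could be derived from the example entry, the function below extracts the fields with a handful of regular expressions. The function name `parse_entry` and the parsing rules are assumptions based on this single entry, not the script that was actually used; the dictionary keys follow the element names in the XML above.

```python
import re

entry = ("Aa, Pieter Jansz van der (* Leiden 1697; † 2-8-1751 "
         "[begr. PK 31-7/7-8-1751]; w. 1719-36)")

def parse_entry(text):
    """Split one entry into name parts, places and years (illustrative rules)."""
    person = {}
    # The name precedes the opening parenthesis, in the form "last, first"
    name, _, details = text.partition("(")
    last, _, first = name.partition(",")
    person["lastName"] = last.strip()
    person["firstName"] = first.strip()
    # The details are separated by semicolons
    for part in details.rstrip(") ").split(";"):
        part = part.strip()
        if part.startswith("*"):
            # Birth: place name followed by a four-digit year
            m = re.search(r"\*\s*(\D*?)\s*(\d{4})", part)
            if m:
                person["pob"] = m.group(1)
                person["dob"] = m.group(2)
        elif part.startswith("†"):
            # Death: take the first four-digit year
            m = re.search(r"\d{4}", part)
            if m:
                person["dod"] = m.group(0)
        elif part.startswith("w."):
            # Working years, where the end year may be abbreviated (1719-36)
            m = re.search(r"(\d{4})-(\d{2,4})", part)
            if m:
                start, end = m.group(1), m.group(2)
                if len(end) < 4:
                    end = start[:4 - len(end)] + end
                person["professional-start"] = start
                person["professional-end"] = end
    return person

print(parse_entry(entry))
```

The profession is not part of this entry, so it is left out here; real entries are likely to contain variations that need additional rules.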

Readability metrics
□ Indication of the readability of a text, often
based on the average number of words per sentence
or the average number of syllables per word
□ Examples include the Flesch-Kincaid test, the
Gunning Fog index and the Coleman-Liau index
□ Flesch-Kincaid is often used in the US educational
system and roughly indicates the number of years of
formal education needed to understand the text
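The Flesch-Kincaid grade level combines the two averages mentioned above: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula; the syllable counter is a crude assumption (it counts groups of vowels) and the sentence splitter only looks at full stops, question marks and exclamation marks, so the results are approximate.

```python
import re

def count_syllables(word):
    """Rough heuristic: every group of consecutive vowels counts as one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

def grade_for_text(text):
    """Estimate the grade level of a text with the simple counters above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)

# 100 words in 5 sentences with 150 syllables
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Longer sentences and longer words both push the score up; very simple texts can even produce a negative value.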

Pandas
□ A Python module
developed for data
science
□ Available for Python
2.7 and higher
□ Many methods for
reading the contents
of data sets in a wide
range of formats such
as csv, tsv or MS
Excel

Data frames
□ The data in CSV files can be made available via
the read_csv() method
□ This method converts the CSV file into a
so-called data frame.
□ A data frame consists of rows and columns
□ The data type of the columns is Series

title,tokens,sentences,adjectives,adverbs,verbs
ARoomWithaView,83147,5863,4058,4455,13917
ATaleofTwoCities,165042,7802,9231,7715,24343
HeartofDarkness,44542,2430,2938,2342,6916
Ivanhoe,210928,6245,12663,8360,29230
MobyDick,252594,9982,18578,14207,32773
PrideandPrejudice,143598,5852,7777,9171,23724
SonsandLovers,204126,16218,9630,10853,33534
ThroughtheLookingGlass,36680,2061,1639,2096,6104
TreasureIsland,82769,3734,4054,4361,12302
VanityFair,355446,13224,22002,14988,50865
10 Rows and 6 Series

Basic information
import pandas as pd
df = pd.read_csv( 'data.csv' )
print( df.shape )
print( df.columns )

Statistics
import pandas as pd
df = pd.read_csv( 'data.csv' )
## Mean values (numeric columns only; the title column contains text)
print( df.mean( numeric_only=True ) )
## Correlations
print( df.corr( numeric_only=True ) )

Biographical research on Van Gogh

Number of types
Number of tokens
Lexical variety

Type-token ratio
□ The higher the number, the higher the
vocabulary diversity.
□ If the number is (relatively) low, there is a
high level of repetition
□ The length of the text has an impact on the
type-token ratio

Correlation
□ A statistical formula that measures the degree
to which variables are related
□ Expressed as a numerical value ranging from -1
to +1
□ A negative correlation means that values for
one variable go down when the values for the
other go up
□ Source: http://www.stat.yale.edu/
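To make the idea concrete, the sketch below computes Pearson's correlation coefficient by hand for two columns of the CSV data shown earlier (tokens and verbs). Choosing Pearson's r is an assumption, since the text above does not name a specific formula; it is, however, the default method of the pandas `corr()` function. The function name `pearson` is made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The tokens and verbs columns from the CSV data
tokens = [83147, 165042, 44542, 210928, 252594,
          143598, 204126, 36680, 82769, 355446]
verbs = [13917, 24343, 6916, 29230, 32773,
         23724, 33534, 6104, 12302, 50865]

print(round(pearson(tokens, verbs), 3))

# Values that rise together give +1, values that move in opposite directions give -1
print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [6, 4, 2]))
```

Longer novels contain more verbs, so the coefficient for these two columns is strongly positive.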

Questions?
