Quick Start Tutorial of KH Coder:
Quantitative Content Analysis or Text Mining
of English Language Data
Koichi Higuchi
Preface
• This presentation is part of a series of tutorials on using KH Coder.
• KH Coder is free software for quantitative content
analysis or text mining. It is also used in
computational linguistics.
• Details and downloads:
http://khc.sourceforge.net/en/
Table of Contents
Configure KH Coder for English-speaking users / English data
• 1. Change the interface language to English
• 2. Settings for analyzing English text
• Notes on the stopwords
Create a new project and prepare for analysis
• 3. Create a new project
• 4. Run pre-processing
Frequently appearing words and co-occurrences
• 5. Word frequency list
• 6. KWIC and collocation stats
• 7. Co-occurrence network of words
• Methods for exploring co-occurrences of words
Characteristics of each chapter
• 8. Distinctive words of each chapter
• 9. Correspondence analysis of words and chapters
Coding Rules
• Use coding rules to count concepts
• 10. Search documents with coding rules
• 11. Cross tabulation of codes
1. Change the Interface Language to English
Go to [Project] [Settings] in the menubar.
Choose “English” as the interface language and restart KH Coder.
If you prefer the Japanese interface, you may skip this step.
You may also change the interface font.
2. Settings for Analyzing English Text
(1) Go to [Project] [Settings] in the menubar.
(2) Select “Lemmatization.”
(3) Click “config.”
(4) Open the “tutorial_en”
folder, then drag the file
“stopwords_sample_en.txt”
and drop it here. (Or just paste
the contents of the file here.)
(5) Click “OK.”
(6) Click “OK.”
Notes on the Stopwords
You can specify any words as stopwords in KH Coder.
The stopwords will be given the special POS tag “OTHER.”
Words with the “OTHER” tag are excluded from analyses by default.
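The effect of the stopword setting can be sketched in a few lines of Python. This is only an illustration of the idea, not KH Coder's own code: the stopword set, the tokens, and every POS tag name other than “OTHER” are hypothetical.

```python
from collections import Counter

# Hypothetical stopword list; the real one comes from stopwords_sample_en.txt.
STOPWORDS = {"the", "be", "a", "of"}

def apply_stopwords(tagged_tokens, stopwords):
    """Re-tag every stopword as OTHER and leave all other tokens unchanged."""
    return [(lemma, "OTHER" if lemma in stopwords else pos)
            for lemma, pos in tagged_tokens]

def count_words(tagged_tokens, excluded_tags=("OTHER",)):
    """Count lemmas, skipping tokens whose POS tag is in excluded_tags."""
    return Counter(lemma for lemma, pos in tagged_tokens
                   if pos not in excluded_tags)

# Hypothetical (lemma, POS tag) pairs standing in for pre-processed text.
tokens = [("the", "DT"), ("teacher", "Noun"), ("be", "Verb"),
          ("a", "DT"), ("teacher", "Noun"), ("of", "IN"), ("school", "Noun")]

retagged = apply_stopwords(tokens, STOPWORDS)
counts = count_words(retagged)
print(counts)
```

Passing an empty tuple as excluded_tags would include the stopwords again, which mirrors the "by default" wording above.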
3. Create a New Project
(1) Go to [Project] [New] in the menubar.
(2) Click “Browse” and open the file
“tutorial_en/botchan_en.txt”
(3) Fill in whatever
memo you like.
(4) Click “OK.”
In this tutorial we analyze
the novel “Botchan” by Natsume Soseki.
“botchan_en.txt” contains all 11
chapters of the novel.
Chapter headings are marked
with <h1> tags.
Next time you start KH Coder,
go to [Project] [Open] in the
menubar and open the project
you have created here.
4. Run Pre-Processing
Go to [Pre-Processing] [Run Pre-Processing]
in the menubar. Then click “OK.”
Sentence splitting, tokenization, POS tagging,
and lemmatization are performed.
The results are stored in a MySQL database
for searching and statistical analysis.
While processing data, KH Coder
“concentrates” on the job, so it sometimes
looks frozen. This is normal while the CPU or disk
is busy.
5. Word Frequency List
Go to [Tools] [Words] [Frequency List] in the menubar.
The list shows counts of base forms (lemmas).
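Counting base forms rather than surface forms can be sketched as follows. The small lemma table is a hypothetical stand-in for the lemmatization that KH Coder performs during pre-processing, and the sample sentence is invented.

```python
from collections import Counter

# Hypothetical surface-form -> base-form table (assumption for illustration).
LEMMAS = {"went": "go", "goes": "go", "going": "go", "teachers": "teacher"}

def frequency_list(tokens, lemmas):
    """Map each token to its base form, then return (lemma, count) pairs
    sorted by descending count."""
    counts = Counter(lemmas.get(t.lower(), t.lower()) for t in tokens)
    return counts.most_common()

tokens = "He went home while the teachers were going and she goes".split()
freq = frequency_list(tokens, LEMMAS)
print(freq[0])
```

Here “went,” “going,” and “goes” are all counted under the single base form “go.”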
6. KWIC and Collocation Stats 1/2
(1) Go to [Tools] [Words] [KWIC Concordance] in the menubar.
(2) Enter the base form of a word
and press “Enter” on the keyboard.
When you change the sort options,
click the “Search” button again.
Double-click any line to view
a wider context. You can
change the viewing unit below.
(3) Click “Stats” to open
the collocation stats.
6. KWIC and Collocation Stats 2/2
(1) Follow the steps in the previous slide to open the collocation stats.
(2) You can filter words
by POS tags.
“L1” stands for “Left 1.” Numbers in this column
indicate how many times each word appeared
just before the node word (left side, distance 1).
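The meaning of a column such as “L1” can be illustrated with a short Python sketch. The function name, the signed-offset convention (negative for left, positive for right), and the sample word sequence are assumptions made for this example only.

```python
from collections import Counter

def collocation_counts(lemmas, node, offset):
    """Count the words found at a signed distance from each occurrence
    of the node word: offset=-1 corresponds to L1, offset=+1 to R1."""
    counts = Counter()
    for i, word in enumerate(lemmas):
        j = i + offset
        if word == node and 0 <= j < len(lemmas):
            counts[lemmas[j]] += 1
    return counts

# Hypothetical sequence of base forms.
lemmas = "the red shirt and the clown meet the red shirt again".split()

l1 = collocation_counts(lemmas, "shirt", -1)   # words just before "shirt"
r1 = collocation_counts(lemmas, "shirt", +1)   # words just after "shirt"
print(l1, r1)
```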
7. Co-Occurrence Network of Words
(1) Go to [Tools] [Words] [Co-Occurrence Network] in the menubar.
(2) Select “Paragraphs” as Unit, then click “OK.”
Now you can see a co-occurrence network of high-frequency words in the text.
The color changes from blue (low) to pink (high) and indicates the centrality index.
(3) Click “Config” and check “Larger nodes
for higher frequency words,” then click “OK.”
(4) Click “Config” and increase “edges”
(co-occurrences) to “top 100,” then click “OK.”
(5) Select “Community: modularity” as “color.”
Which version did you like?
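The idea behind the network edges can be sketched roughly in Python: count how many paragraphs contain both words of a pair, then keep only the strongest pairs. Using the raw paragraph count as edge strength is an assumption of this sketch; the slide does not state which measure KH Coder uses, and the sample paragraphs are invented.

```python
from collections import Counter
from itertools import combinations

def top_edges(paragraphs, n_edges):
    """paragraphs: list of lists of base forms.
    Returns the n_edges word pairs that share the most paragraphs."""
    pair_counts = Counter()
    for words in paragraphs:
        # Each distinct pair is counted once per paragraph.
        for a, b in combinations(sorted(set(words)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts.most_common(n_edges)

# Hypothetical paragraphs, already reduced to base forms.
paragraphs = [
    ["school", "teacher", "student"],
    ["school", "teacher"],
    ["school", "teacher", "room"],
]
edges = top_edges(paragraphs, 1)
print(edges)
```

Raising n_edges plays the same role as increasing “edges” to “top 100” in step (4): more, weaker co-occurrences are drawn.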
Methods for Exploring Co-Occurrences of Words
To explore co-occurrences of words, you can also use:
• hierarchical cluster analysis
• multidimensional scaling (MDS)
[Figures: co-occurrence network, cluster analysis, MDS]
By interpreting these results, you may find major themes of the text
from groups of words that tend to appear together.
KH Coder uses R as its back end to execute these multivariate methods.
8. Distinctive Words of Each Chapter
(1) Go to [Tools] [Variables & Headings] [List] in the menubar.
(2) Click “Heading 1.”
(3) Select “Sentences.”
(4) Select “catalogue: Excel.”
The top 10 distinctive words of each chapter
are tabulated. The “distinctiveness” is
calculated using the Jaccard index.
Basically, if a word has a higher
probability of appearing in a specific
chapter, it is considered distinctive.
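As a rough illustration of the Jaccard index in this setting, the sketch below treats each sentence as a unit and compares two sets: the sentences that contain a word and the sentences that belong to a chapter. This set-based formulation is an assumption of the sketch, and the sample sentences are invented.

```python
def jaccard(word, chapter, sentences):
    """sentences: list of (chapter_id, set_of_base_forms).
    Returns |A and B| / |A or B|, where A is the set of sentences
    containing the word and B is the set of sentences in the chapter."""
    has_word = {i for i, (_, words) in enumerate(sentences) if word in words}
    in_chapter = {i for i, (ch, _) in enumerate(sentences) if ch == chapter}
    union = has_word | in_chapter
    return len(has_word & in_chapter) / len(union) if union else 0.0

# Hypothetical data: two sentences in each of two chapters.
sentences = [
    (1, {"school", "teacher"}),
    (1, {"teacher", "room"}),
    (2, {"train", "teacher"}),
    (2, {"train", "station"}),
]

score_train_ch2 = jaccard("train", 2, sentences)
score_teacher_ch1 = jaccard("teacher", 1, sentences)
print(score_train_ch2, score_teacher_ch1)
```

In this toy data “train” appears in every sentence of chapter 2 and nowhere else, so it scores higher for chapter 2 than “teacher” does for chapter 1, where the word also leaks into another chapter.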
9. Correspondence Analysis of Words and Chapters
(1) Go to [Tools] [Words] [Correspondence Analysis] in the menubar.
(2) Click “OK.”
(3) Click “Config,” then reduce words
to “Top 30,” check “Bubble plot,”
uncheck “Size of variables...,” and
click “OK.” (This step is optional.)
Using correspondence analysis,
you can visually interpret the
characteristics of each chapter.
Use Coding Rules to Count Concepts
In some cases, we have to count concepts, not words.
To count concepts, you can compose “coding rules” like this:
*shopping
store or shop or ( merchandise and not develop )
The first line (*shopping) indicates the name of this code.
The second line gives the conditions for attaching this code. Cases that contain words
like “store” or “shop” are given the code “shopping.” The
parenthetical notation means that cases should contain the word
“merchandise” but should not contain the word “develop.”
If a case satisfies multiple coding rules, it is given
multiple codes.
We use “tutorial_en/themes.txt”
as example coding rules in this
tutorial. Please open this file and
check the content.
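The logic of the “shopping” rule can be mirrored in plain Python to see which cases would receive the code. This is only a sketch of the boolean condition, not KH Coder's rule parser; the second rule (“travel”) and the sample cases are hypothetical and are included only to show a case receiving more than one code.

```python
def shopping(words):
    """store or shop or ( merchandise and not develop )"""
    return ("store" in words or "shop" in words
            or ("merchandise" in words and "develop" not in words))

def travel(words):
    """Hypothetical second rule: train or station."""
    return "train" in words or "station" in words

RULES = {"shopping": shopping, "travel": travel}

def assign_codes(words, rules):
    """Return the names of all rules whose condition the case satisfies."""
    return [name for name, rule in rules.items() if rule(words)]

# Hypothetical cases, each given as a set of base forms.
print(assign_codes({"merchandise", "sell"}, RULES))
print(assign_codes({"merchandise", "develop"}, RULES))
print(assign_codes({"shop", "station"}, RULES))
```

The second case contains “merchandise” but also “develop,” so the parenthetical condition fails and no code is attached.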
10. Search Documents with Coding Rules
(1) Go to [Tools] [Documents] [Search Documents] in the menubar.
(2) Click “Browse” and select
“tutorial_en/themes.txt”
(3) Select “Paragraphs”
(4) Double-click a code.
(5) Double-click a result to
view the whole paragraph.
When you compose a coding
rule, it is important to search for and
check the actual documents
that satisfy the
rule.
11. Cross Tabulation of Codes
(1) Go to [Tools] [Coding] [Crosstab] in the menubar.
(2) Click “Browse” and select
“tutorial_en/themes.txt”
(3) Select “Sentences”
(4) Click “Run.”
(5) Click “all” to
make a graph.
In the latter half of the novel,
it looks like “aggression”
overwhelms “positive affect”
and forms the climax of the
story at chapter X.
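A cross tabulation of codes by chapter can be sketched as follows, assuming each sentence has already been assigned a set of codes. Reporting the percentage of sentences per chapter that carry each code is an assumption of this sketch, and the coded sentences are invented, not results from “Botchan.”

```python
from collections import defaultdict

def crosstab(coded_sentences, codes):
    """coded_sentences: list of (chapter, set_of_codes).
    Returns {chapter: {code: percentage of sentences with that code}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for chapter, sentence_codes in coded_sentences:
        totals[chapter] += 1
        for code in sentence_codes:
            hits[chapter][code] += 1
    return {ch: {code: 100.0 * hits[ch][code] / totals[ch] for code in codes}
            for ch in totals}

CODES = ["aggression", "positive affect"]

# Hypothetical coded sentences for two chapters.
coded = [
    (1, {"positive affect"}), (1, {"positive affect"}),
    (1, {"aggression"}), (1, set()),
    (10, {"aggression"}), (10, {"aggression"}),
    (10, {"aggression", "positive affect"}), (10, set()),
]
table = crosstab(coded, CODES)
print(table)
```

Plotting one line per code across chapters from such a table gives the kind of graph produced in step (5).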
Acknowledgement
I am grateful to students who attended the 2011
“text mining” class at Doshisha University (Faculty
of Culture and Information Science) for giving me
some hints on composing coding rules for
“Botchan.”
Questions or Comments?
Please feel free to post questions or comments on
the web forum:
https://sourceforge.net/p/khc/discussion/
