SlideShare a Scribd company logo
1 of 58
Download to read offline
Inteligência Artificial
Aula 12
Python para processamento de Texto
text wrangling?
>>>inputstring = ' This is an example sent. The sentence
splitter will split on sent markers. Ohh really !!'
>>>from nltk.tokenize import sent_tokenize
>>>all_sent = sent_tokenize(inputstring)
>>>print all_sent
[' This is an example sent', 'The sentence splitter will split
on markers.','Ohh really !!']
>>>import nltk.tokenize.punkt
>>>tokenizer =
nltk.tokenize.punkt.PunktSentenceTokenizer()
Stemming
Snowball stemmers that can be used for Dutch, English,
French, German, Italian, Portuguese, Romanian, Russian,
and so on
For Snowball Stemmer, which is based on Snowball
Stemming Algorithm, can be used in NLTK like this:
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer(“english”)
>>> snowball_stemmer.stem(‘maximum’)
u’maximum’
>>> snowball_stemmer.stem(‘presumably’)
u’presum’
>>> snowball_stemmer.stem(‘multiply’)
u’multipli’
>>> snowball_stemmer.stem(‘provision’)
u’provis’
>>> snowball_stemmer.stem(‘owed’)
u’owe’
>>> snowball_stemmer.stem(‘ear’)
u’ear’
Lemmatization
>>>from nltk.corpus import stopwords
>>>stoplist = stopwords.words('english') # config the
language name
# NLTK supports 22 languages for removing the stop
words
>>>text = "This is just a test"
>>>cleanwordlist = [word for word in text.split() if word not
in stoplist]
# apart from just and test others are stopwords
['test']
StopWords
>>># tokens is a list of all tokens in corpus
>>>freq_dist = nltk.FreqDist(token)
>>>rarewords = freq_dist.keys()[-50:]
>>>after_rare_words = [ word for word in token not in
rarewords]
Rarewords
>>>import nltk
>>>from nltk import word_tokenize
>>>s = "I was watching TV"
>>>print nltk.pos_tag(word_tokenize(s))
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
What is Part of speech tagging
>>>from nltk.corpus import brown
>>>import nltk
>>>tags = [tag for (word, tag) in
brown.tagged_words(categories='news')]
>>>print nltk.FreqDist(tags)
<FreqDist: 'NN': 13162, 'IN': 10616, 'AT': 8893, 'NP': 6866, ',':
5133, 'NNS': 5066, '.': 4452, 'JJ': 4392 >
clusters <- hclust(dist(iris[, 3:4]))
plot(clusters)
clusterCut <- cutree(clusters, 3)
clusters <- hclust(dist(iris[, 3:4]),
method = 'average')
plot(clusters)
K-Means
ggplot(iris, aes(Petal.Length, Petal.Width, color = iris
$Species)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col =
clusterCut) +
scale_color_manual(values = c('black', 'red', 'green'))

More Related Content

What's hot

My programming final proj. (1)
My programming final proj. (1)My programming final proj. (1)
My programming final proj. (1)
aeden_brines
 

What's hot (6)

Zeppelin Helium: Spell
Zeppelin Helium: SpellZeppelin Helium: Spell
Zeppelin Helium: Spell
 
Reanalyzing the Notepad++ project
Reanalyzing the Notepad++ projectReanalyzing the Notepad++ project
Reanalyzing the Notepad++ project
 
My programming final proj. (1)
My programming final proj. (1)My programming final proj. (1)
My programming final proj. (1)
 
Demystifying the Go Scheduler
Demystifying the Go SchedulerDemystifying the Go Scheduler
Demystifying the Go Scheduler
 
Sine Wave Generator with controllable frequency displayed on a seven segment ...
Sine Wave Generator with controllable frequency displayed on a seven segment ...Sine Wave Generator with controllable frequency displayed on a seven segment ...
Sine Wave Generator with controllable frequency displayed on a seven segment ...
 
Exploit Research and Development Megaprimer: Unicode Based Exploit Development
Exploit Research and Development Megaprimer: Unicode Based Exploit DevelopmentExploit Research and Development Megaprimer: Unicode Based Exploit Development
Exploit Research and Development Megaprimer: Unicode Based Exploit Development
 

Similar to Inteligencia artificial 12

Falcon初印象
Falcon初印象Falcon初印象
Falcon初印象
勇浩 赖
 
Java căn bản - Chapter6
Java căn bản - Chapter6Java căn bản - Chapter6
Java căn bản - Chapter6
Vince Vo
 
#include iostream#includectimeusing namespace std;void.docx
#include iostream#includectimeusing namespace std;void.docx#include iostream#includectimeusing namespace std;void.docx
#include iostream#includectimeusing namespace std;void.docx
mayank272369
 
ZCA: A component architecture for Python
ZCA: A component architecture for PythonZCA: A component architecture for Python
ZCA: A component architecture for Python
Timo Stollenwerk
 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
shreeaadithyaacellso
 
Eff Plsql
Eff PlsqlEff Plsql
Eff Plsql
afa reg
 

Similar to Inteligencia artificial 12 (20)

Software Transactional Memory (STM) in Frege
Software Transactional Memory (STM) in Frege Software Transactional Memory (STM) in Frege
Software Transactional Memory (STM) in Frege
 
C++ L03-Control Structure
C++ L03-Control StructureC++ L03-Control Structure
C++ L03-Control Structure
 
Quick into to Software Transactional Memory in Frege
Quick into to Software Transactional Memory in FregeQuick into to Software Transactional Memory in Frege
Quick into to Software Transactional Memory in Frege
 
Falcon初印象
Falcon初印象Falcon初印象
Falcon初印象
 
Lập trình C
Lập trình CLập trình C
Lập trình C
 
Java căn bản - Chapter6
Java căn bản - Chapter6Java căn bản - Chapter6
Java căn bản - Chapter6
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processing
 
JavaScript 101
JavaScript 101JavaScript 101
JavaScript 101
 
MUST CS101 Lab11
MUST CS101 Lab11 MUST CS101 Lab11
MUST CS101 Lab11
 
#include iostream#includectimeusing namespace std;void.docx
#include iostream#includectimeusing namespace std;void.docx#include iostream#includectimeusing namespace std;void.docx
#include iostream#includectimeusing namespace std;void.docx
 
130707833146508191
130707833146508191130707833146508191
130707833146508191
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
 
TCLSH and Macro Ping Test on Cisco Routers and Switches
TCLSH and Macro Ping Test on Cisco Routers and SwitchesTCLSH and Macro Ping Test on Cisco Routers and Switches
TCLSH and Macro Ping Test on Cisco Routers and Switches
 
ch05-program-logic-indefinite-loops.ppt
ch05-program-logic-indefinite-loops.pptch05-program-logic-indefinite-loops.ppt
ch05-program-logic-indefinite-loops.ppt
 
ZCA: A component architecture for Python
ZCA: A component architecture for PythonZCA: A component architecture for Python
ZCA: A component architecture for Python
 
Project in programming
Project in programmingProject in programming
Project in programming
 
Un monde où 1 ms vaut 100 M€ - Devoxx France 2015
Un monde où 1 ms vaut 100 M€ - Devoxx France 2015Un monde où 1 ms vaut 100 M€ - Devoxx France 2015
Un monde où 1 ms vaut 100 M€ - Devoxx France 2015
 
I have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdfI have written the code but cannot complete the assignment please help.pdf
I have written the code but cannot complete the assignment please help.pdf
 
Eff Plsql
Eff PlsqlEff Plsql
Eff Plsql
 
C++ L06-Pointers
C++ L06-PointersC++ L06-Pointers
C++ L06-Pointers
 

More from Nauber Gois

More from Nauber Gois (20)

Ai health
Ai health Ai health
Ai health
 
Inteligencia artificial 13
Inteligencia artificial 13Inteligencia artificial 13
Inteligencia artificial 13
 
Sistemas operacionais 14
Sistemas operacionais 14Sistemas operacionais 14
Sistemas operacionais 14
 
Sistemas operacionais 13
Sistemas operacionais 13Sistemas operacionais 13
Sistemas operacionais 13
 
Sistemas operacionais 12
Sistemas operacionais 12Sistemas operacionais 12
Sistemas operacionais 12
 
Sistemas operacionais 11
Sistemas operacionais 11Sistemas operacionais 11
Sistemas operacionais 11
 
Sistemas operacionais 10
Sistemas operacionais 10Sistemas operacionais 10
Sistemas operacionais 10
 
Inteligencia artificial 11
Inteligencia artificial 11Inteligencia artificial 11
Inteligencia artificial 11
 
Sistemas operacional 9
Sistemas operacional 9Sistemas operacional 9
Sistemas operacional 9
 
Inteligencia artificial 10
Inteligencia artificial 10Inteligencia artificial 10
Inteligencia artificial 10
 
Sistemas operacionais 8
Sistemas operacionais 8Sistemas operacionais 8
Sistemas operacionais 8
 
Inteligencia artificial 9
Inteligencia artificial 9Inteligencia artificial 9
Inteligencia artificial 9
 
Inteligencia artificial 8
Inteligencia artificial 8Inteligencia artificial 8
Inteligencia artificial 8
 
Sist infgerenciais 8
Sist infgerenciais 8Sist infgerenciais 8
Sist infgerenciais 8
 
Sist operacionais 7
Sist operacionais 7Sist operacionais 7
Sist operacionais 7
 
Inteligencia artifical 7
Inteligencia artifical 7Inteligencia artifical 7
Inteligencia artifical 7
 
Beefataque
BeefataqueBeefataque
Beefataque
 
Ssit informacoesgerenciais 5
Ssit informacoesgerenciais 5Ssit informacoesgerenciais 5
Ssit informacoesgerenciais 5
 
Sistemas operacionais 6
Sistemas operacionais 6Sistemas operacionais 6
Sistemas operacionais 6
 
Inteligencia artifical 6
Inteligencia artifical 6Inteligencia artifical 6
Inteligencia artifical 6
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Inteligencia artificial 12

  • 3.
  • 5. >>>inputstring = ' This is an example sent. The sentence splitter will split on sent markers. Ohh really !!' >>>from nltk.tokenize import sent_tokenize >>>all_sent = sent_tokenize(inputstring) >>>print all_sent [' This is an example sent', 'The sentence splitter will split on markers.','Ohh really !!']
  • 7.
  • 8.
  • 10.
  • 11.
  • 12.
  • 13. Snowball stemmers that can be used for Dutch, English, French, German, Italian, Portuguese, Romanian, Russian, and so on
  • 14. For Snowball Stemmer, which is based on Snowball Stemming Algorithm, can be used in NLTK like this: >>> from nltk.stem import SnowballStemmer >>> snowball_stemmer = SnowballStemmer(“english”) >>> snowball_stemmer.stem(‘maximum’) u’maximum’ >>> snowball_stemmer.stem(‘presumably’) u’presum’ >>> snowball_stemmer.stem(‘multiply’) u’multipli’ >>> snowball_stemmer.stem(‘provision’) u’provis’ >>> snowball_stemmer.stem(‘owed’) u’owe’ >>> snowball_stemmer.stem(‘ear’) u’ear’
  • 16.
  • 17.
  • 18. >>>from nltk.corpus import stopwords >>>stoplist = stopwords.words('english') # config the language name # NLTK supports 22 languages for removing the stop words >>>text = "This is just a test" >>>cleanwordlist = [word for word in text.split() if word not in stoplist] # apart from just and test others are stopwords ['test'] StopWords
  • 19. >>># tokens is a list of all tokens in corpus >>>freq_dist = nltk.FreqDist(token) >>>rarewords = freq_dist.keys()[-50:] >>>after_rare_words = [ word for word in token not in rarewords] Rarewords
  • 20.
  • 21. >>>import nltk >>>from nltk import word_tokenize >>>s = "I was watching TV" >>>print nltk.pos_tag(word_tokenize(s)) [('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')] What is Part of speech tagging
  • 22.
  • 23. >>>from nltk.corpus import brown >>>import nltk >>>tags = [tag for (word, tag) in brown.tagged_words(categories='news')] >>>print nltk.FreqDist(tags) <FreqDist: 'NN': 13162, 'IN': 10616, 'AT': 8893, 'NP': 6866, ',': 5133, 'NNS': 5066, '.': 4452, 'JJ': 4392 >
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50. clusters <- hclust(dist(iris[, 3:4])) plot(clusters)
  • 52. clusters <- hclust(dist(iris[, 3:4]), method = 'average') plot(clusters)
  • 54.
  • 55.
  • 56.
  • 57.
  • 58. ggplot(iris, aes(Petal.Length, Petal.Width, color = iris $Species)) + geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clusterCut) + scale_color_manual(values = c('black', 'red', 'green'))