pa-pe-pi-po-pure Python Text Processing

Experiments with text processing, from basic string manipulation to an NLP example, by way of compilers.

Published in: Technology, Education


  1. pa-pe-pi-po- Pure Python Text Processing
     Rodrigo Senra, rsenra@acm.org
     PythonBrasil[7] - São Paulo
  2. Anatomy of the Blah
     • Me, you, and Python
     • a retrospective: PythonBrasil[7] years!
     • pa-pe-pi-po-pure Python text processing
     • references
     • a word from the sponsors
  3. Who is out there?
     ✓ IT professionals
     ✓ Developers
     ✓ Students
     ✓ Teachers
     ✓ First time at PyConBrasil
     ✓ APyBr members
     • None of the above!
  4. Scenes from previous episodes...
     [1] 2005 - BigKahuna
     [2] 2006 - Show Pyrotécnico: iterators, generators, hooks, decorators
     [3] 2007 - Show Pyrotécnico II: routing, RTSP, Twisted, GIS
     [4] 2008 - ISIS-NBP: digital libraries
     [5] 2009 - REST, Gtw and compilers: SFC (Petri net) + ST (Pascal) > Ladder
     [6] 2010 - Potter vs Voldemort: Parseltongue lessons from Pythonic practice
  5. >>> type("bla")
     <type 'str'>
     >>> "".join(['pa',"pe",'''pi''',"""po"""])
     'papepipo'
     >>> str(2**1024)[100:120]
     '21120113879871393357'
     >>> 2**1024
     179769313486231590772930519078902473361797697894230657273430081157732675805500963132708477322407536021120113879871393357658789768814416622492847430639474124377767893424865485276302219601246094119453082952085005768838150682342462881473913110540827237163350510684586298239947245938479716304835356329624224137216L
     >>> 'ariediod'[::-1]
     'doideira'
  6. >>> " deu branco no prefixo e no sufixo, limpa com strip ".strip()
     'deu branco no prefixo e no sufixo, limpa com strip'
     >>> _.startswith("deu")
     True
     >>> "o rato roeu a roupa do rei de roma".partition("r")
     ('o ', 'r', 'ato roeu a roupa do rei de roma')
     >>> "o rato roeu a roupa do rei de roma".split("r")
     ['o ', 'ato ', 'oeu a ', 'oupa do ', 'ei de ', 'oma']
     >>> "o rato roeu a roupa do rei de roma".split()
     ['o', 'rato', 'roeu', 'a', 'roupa', 'do', 'rei', 'de', 'roma']
  7. >>> r"W:\nao\precisa\de\escape"
     'W:\\nao\\precisa\\de\\escape'
     >>> type(r"W:\nao\precisa\de\escape")
     <type 'str'>
     >>> type(u"Unicode")
     <type 'unicode'>
     >>> print(u"\xc3\xa2")
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
     >>> print(unicode('\xc3\xa1','iso-8859-1').encode('iso-8859-1'))
     á
     >>> import codecs, sys
     >>> sys.stdout = codecs.lookup('iso-8859-1')[-1](sys.stdout)
     >>> print(u"\xc3\xa1")
     á
  8. >>> b"String de 8-bit chars"
     'String de 8-bit chars'

     Python 2.6.1                Python 3.1.4
     >>> b"Bla"                  >>> b"Bla"
     'Bla'                       b'Bla'
     >>> b"Bla"=="Bla"           >>> type(b"Bla")
     True                        <class 'bytes'>
     >>> type(b"Bla")            >>> type("Bla")
     <type 'str'>                <class 'str'>
                                 >>> "Bla"==b"Bla"
                                 False
  9. >>> [ord(i) for i in "nulalexsedlex"]
     [110, 117, 108, 97, 108, 101, 120, 115, 101, 100, 108, 101, 120]
     >>> "".join([chr(i) for i in _])
     'nulalexsedlex'
     >>> 'lex' in _
     True
     >>> import string
     >>> dir(string)
     ['Formatter', 'Template', '_TemplateMetaclass', '__builtins__',
     '__doc__', '__file__', '__name__', '__package__', '_float', '_idmap',
     '_idmapL', '_int', '_long', '_multimap', '_re', 'ascii_letters',
     'ascii_lowercase', 'ascii_uppercase', 'atof', 'atof_error', 'atoi',
     'atoi_error', 'atol', 'atol_error', 'capitalize', 'capwords', 'center', 'count',
     'digits', 'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join',
     'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans',
     'octdigits', 'printable', 'punctuation', 'replace', 'rfind', 'rindex', 'rjust',
     'rsplit', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate', 'upper',
     'uppercase', 'whitespace', 'zfill']
  10. >>> string.hexdigits
      '0123456789abcdefABCDEF'
      >>> string.punctuation
      '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
      >>> string.maketrans('','')
      '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
  11. >>> def t(x,y): return string.translate(x,string.maketrans('',''),y)
      ...
      >>> t("O rato roeu. O que? A roupa! De quem? Do rei, de roma;",string.punctuation)
      'O rato roeu O que A roupa De quem Do rei de roma'
      >>> class Bla(object):
      ...     def __str__(self):
      ...         return "Belex"
      ...     def __repr__(self):
      ...         return "Bla()"
      ...
      >>> b = Bla()
      >>> from __future__ import print_function  # required on Python 2 for end=
      >>> for i in [b, eval(repr(b))]:
      ...     print(i, end='\t')
      ...
      Belex   Belex   >>>
  12. >>> class istr(str):
      ...     pass
      ...
      >>> for name in 'eq lt le gt ge ne cmp contains'.split():
      ...     meth = getattr(str, '__%s__' % name)
      ...     # meth=meth binds the current method; a plain closure would see only the last one
      ...     def new_meth(self, param, meth=meth):
      ...         return meth(self.lower(), param.lower())
      ...     setattr(istr, '__%s__' % name, new_meth)
      ...
      >>> istr("SomeCamelCase") == istr("sOmeCaMeLcase")
      True
      >>> 'Ec' in istr("SomeCamel")
      True
      Adapted from Python Cookbook
  13. >>> import re
      >>> pat = re.compile(re.escape("<strong>"))
      >>> re.escape("<strong>")
      '\\<strong\\>'
      >>> pat.sub("_","<strong>Hasta la vista<strong> baby")
      '_Hasta la vista_ baby'
      >>> date = re.compile(r"(\d\d\d\d-\d\d-\d\d)\s(\w+)")
      >>> date.findall("Em 2011-09-29 PythonBrasil na parada. Em 2010-10-21 curitiba hospedou")
      [('2011-09-29', 'PythonBrasil'), ('2010-10-21', 'curitiba')]
  14. $ python -mtimeit -s "import re; n=re.compile(r'abra')" "n.search('abracadabra')"
      1000000 loops, best of 3: 0.306 usec per loop
      $ python -mtimeit -s "import re; n=r'abra'" "n in 'abracadabra'"
      10000000 loops, best of 3: 0.0591 usec per loop
      $ python -mtimeit -s "import re; n=re.compile(r'\d+$')" "n.match('0123456789')"
      1000000 loops, best of 3: 0.511 usec per loop
      $ python -mtimeit -s "import re" "'0123456789'.isdigit()"
      10000000 loops, best of 3: 0.0945 usec per loop
      Extracted from PyMag Jan 2008
  15. $ python -mtimeit -s "import re;r=re.compile('pa|pe|pi|po|pu');h='patapetapitapotapuxa'" "r.search(h)"
      1000000 loops, best of 3: 0.383 usec per loop
      $ python -mtimeit -s "import re;n=['pa','pe','pi','po','pu'];h='patapetapitapotapuxa'" "any(x in h for x in n)"
      1000000 loops, best of 3: 0.914 usec per loop
      Extracted from PyMag Jan 2008
  16. from pyparsing import Word, Literal, Combine
      import string
      def doSum(s,l,tokens):
          return int(tokens[0]) + int(tokens[2])
      integer = Word(string.digits)
      addition = Combine(integer) + Literal('+') + Combine(integer)
      addition.setParseAction(doSum)
      >>> addition.parseString("5+7")
      ([12], {})
  17. import ply.lex as lex
      tokens = 'NUMBER', 'PLUS'
      t_PLUS = r'\+'
      def t_NUMBER(t):
          r'\d+'
          t.value = int(t.value)
          return t
      t_ignore = ' \t\n\w'
      def t_error(t):
          t.lexer.skip(1)
      lexer = lex.lex()
      Adapted from http://www.dabeaz.com
  18. import ply.yacc as yacc
      def p_expression_plus(p):
          'expression : expression PLUS expression'
          p[0] = p[1] + p[3]
      def p_factor_num(p):
          'expression : NUMBER'
          p[0] = p[1]
      def p_error(p):
          print("Syntax error in input!")
      parser = yacc.yacc()
      Adapted from http://www.dabeaz.com
  19. >>> parser.parse("1+2 + 45 \n + 10")
      58
      >>> parser.parse("Quanto vale 2 + 7")
      9
      >>> parser.parse("A soma 2 + 7 resulta em 9")
      Syntax error in input!
      >>> parser.parse("2 + 7 9")
      Syntax error in input!
      Adapted from http://www.dabeaz.com
  21. from nltk.tokenize import sent_tokenize, word_tokenize
      msg = "Congratulations to Erico and his team. PythonBrasil gets better every year. You are now the BiggestKahuna."
      >>> sent_tokenize(msg)
      ['Congratulations to Erico and his team.', 'PythonBrasil gets better every year.', 'You are now the BiggestKahuna.']
      >>> word_tokenize(msg)
      ['Congratulations', 'to', 'Erico', 'and', 'his', 'team.', 'PythonBrasil', 'gets', 'better', 'every', 'year.', 'You', 'are', 'now', 'the', 'BiggestKahuna', '.']
      Extracted from NLP with Python
  22. >>> def gender_features(word):
      ...     return {"last_letter": word[-1]}
      ...
      >>> import nltk
      >>> from nltk.corpus import names
      >>> len(names.words("male.txt"))
      2943
      >>> names = ([(name,'male') for name in names.words('male.txt')] +
      ...          [(name,'female') for name in names.words('female.txt')])
      >>> import random
      >>> random.shuffle(names)
      >>> featuresets = [(gender_features(n),g) for n,g in names]
      >>> train_set, test_set = featuresets[500:], featuresets[:500]
      >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
      >>> classifier.classify(gender_features("Dorneles"))
      'male'
      >>> classifier.classify(gender_features("Magali"))
      'female'
      Extracted from NLP with Python
  23. References
  24. A word from the sponsors...
  25. Thank you all for your attention.
      Rodrigo Dias Arruda Senra, http://rodrigo.senra.nom.br, rsenra@acm.org
      The opinions and conclusions expressed in this presentation are the sole responsibility of Rodrigo Senra.
      You do not need to request the author's permission to use part or all of this presentation, provided that the reused content is not altered and that this notice appears in full in the resulting material.
      Images and references to other works in this presentation remain the property of their respective copyright holders.
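
Slides 10 and 11 strip punctuation with `string.maketrans` and `string.translate`, which are module-level functions in Python 2 only. The following is a minimal sketch of the same idea for Python 3, assuming the three-argument form of `str.maketrans`, whose third argument lists characters to delete. The helper name `strip_punctuation` is illustrative and does not come from the talk.

```python
import string

# With three arguments, str.maketrans maps every character in the
# third argument to None, which str.translate treats as "delete".
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(_DELETE_PUNCTUATION)


print(strip_punctuation("O rato roeu. O que? A roupa! De quem? Do rei, de roma;"))
# prints: O rato roeu O que A roupa De quem Do rei de roma
```

Building the table once at import time keeps the per-call cost to a single `translate`, which is the same trade-off the slide's `t` helper could make by caching its `maketrans` result.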

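
Slides 7 and 8 contrast byte strings and Unicode strings across Python 2 and Python 3. As a rough Python 3 illustration of why the same two bytes printed differently on slide 7, the sketch below decodes them under two encodings. The variable names are illustrative only.

```python
# Two bytes: the UTF-8 encoding of the single character U+00E1 (a with acute).
raw = b"\xc3\xa1"

as_utf8 = raw.decode("utf-8")          # one character
as_latin1 = raw.decode("iso-8859-1")   # two characters, one per byte

print(len(as_utf8), len(as_latin1))    # prints: 1 2

# ISO-8859-1 maps each byte to the code point of the same value,
# so decoding and re-encoding returns the original bytes unchanged.
print(as_latin1.encode("iso-8859-1") == raw)   # prints: True

# In Python 3, bytes and str never compare equal (slide 8, right column).
print(b"Bla" == "Bla")                 # prints: False
```

The round trip through ISO-8859-1 is what lets slide 7's `unicode(..., 'iso-8859-1').encode('iso-8859-1')` hand the original UTF-8 bytes to a UTF-8 terminal.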
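
Slides 14 and 15 compare regular expressions against plain string operations from the command line with `python -mtimeit`. The same kind of measurement can be sketched from inside a script with the standard `timeit` module. The function names and the iteration count below are assumptions chosen for illustration, and the timings will differ by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

haystack = "patapetapitapotapuxa"
needles = ["pa", "pe", "pi", "po", "pu"]
pattern = re.compile("|".join(needles))  # same alternation as slide 15


def with_regex():
    """Search for any needle using one compiled alternation."""
    return pattern.search(haystack) is not None


def with_any():
    """Search for any needle using substring tests in a generator."""
    return any(n in haystack for n in needles)


# Both approaches must agree before their timings are worth comparing.
assert with_regex() == with_any()

regex_seconds = timeit.timeit(with_regex, number=100000)
any_seconds = timeit.timeit(with_any, number=100000)
print("regex: %.4f s, any(): %.4f s" % (regex_seconds, any_seconds))
```

Wrapping each candidate in a function adds call overhead to both measurements equally, so the comparison is relative rather than a reproduction of the per-loop figures on the slides.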