Mining legal texts with Python

Presented at EuroScipy 2011


  1. Mining Legal Text (SIL.fd): Information Mining and Visualization of a Large Volume of Legal Texts
      Flávio Codeço Coelho, Renato Rocha Souza and Pablo de Camargo Cerdeira
      Applied Mathematics School – Getulio Vargas Foundation
      August 22, 2011
  2. Outline I
      1 Introduction
      2 Web-Scraping: HTML Parsing
      3 Pattern Matching: Regular expressions
      4 Database Interaction: MySQLdb, SQLAlchemy, MongoDB
      5 Natural Language Processing: NLTK
      6 Visualization: Matplotlib, Ubigraph, Gource
  3. Outline II
      6 Visualization (continued): Visual Python
      7 Results
      8 Future Directions
  4. Introduction: Conquering text
      • Scraping and indexing the world's web pages has changed the world...
      • Should PageRank be our main measure of information relevance?
      • What is possible if we go a little further?
  5. Introduction: It's documents all the way down...
      • Luckily, we didn't have to scan them...
      • We have to conquer an information mountain...
  6. Introduction: We had generous help...
  7. Web-Scraping: Obtaining the Data
      • There was no API for access, so some heuristics were necessary.
      • Scraping took more than 3 months.
      • 1.3 million cases.
  8. Web-Scraping. Example: Photos
      Navigating with Mechanize [1]:

          import mechanize
          from mechanize import LinkNotFoundError

          br = mechanize.Browser()
          # The URL is cut off on the slide after "periodo=st".
          br.open("http://www.stf.jus.br/portal/ministro/ministro.asp?periodo=st…")
          i = 0
          link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
          while 1:
              br.follow_link(link)
              il = br.find_link(url_regex='imagem.asp')
              url = "http://www.stf.jus.br/portal" + il.url.strip('..')
              nome = il.text
              # Python 2: the link text is a Latin-1 byte string.
              # download_photo is a helper not shown on the slide.
              download_photo(url, nome.decode('latin1').split('[')[0])
              br.back()
              try:
                  i += 1  # advance first, so each link is visited only once
                  link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
              except LinkNotFoundError:
                  break

      [1] http://wwwsearch.sourceforge.net/mechanize/
  9. Web-Scraping, HTML Parsing: Parsing scraped HTML
      • Beautiful Soup [2] to the rescue!
      • Firebug helped analyze page structure.
      • Parsing was done during the scraping, to clean data for insertion into MySQL.
      • Some parts of the page were stored as HTML for later parsing.

          sopa = BeautifulSoup(d['decisao'].strip('[]'), fromEncoding='ISO8859-1')
          rs = sopa.findAll('strong', text=re.compile('^Legisla'))

      [2] http://www.crummy.com/software/BeautifulSoup/
  10. Pattern Matching: Extracting Even More Information
      With the data in a local database, we started mining it:
      • We tried to use the best that SQL and Python had to offer.
      • Pattern matching, aggregation, string matching [3], etc.
      • Read from Db → Process → Write to Db (SQL → Python → SQL)

      [3] difflib
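The string matching referred to here uses the standard library's difflib. A minimal sketch of how variant spellings of a name might be reconciled against a canonical list follows; the list entries other than the name that appears in the deck's sample output, the misspelled input and the 0.8 cutoff are all invented for illustration and are not the project's actual data or settings.

```python
import difflib

# Hypothetical canonical spellings (illustrative only).
CANONICAL = ["SYDNEY SANCHES", "JOAO DA SILVA", "MARIA DE SOUZA"]


def normalize_name(raw, choices=CANONICAL, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is similar enough."""
    candidate = raw.upper().strip()
    matches = difflib.get_close_matches(candidate, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()


print(normalize_name("Sidney Sanches"))  # closest canonical spelling, if any
print(normalize_name("XYZ"))             # nothing similar enough
```

A cutoff close to 1.0 only accepts near-identical spellings; lowering it catches more variants at the risk of merging distinct names.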
  11. Pattern Matching, Regular expressions: Regular Expressions
      • re module: great, but tricky for different encodings.
      • Kodos [a]: visual debugging indispensable!

          rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
          compile_obj = re.compile(rawstr, re.LOCALE)

      [a] http://kodos.sourceforge.net/
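How a pattern of this kind might pull legislation references out of a decision's text can be sketched as follows. The pattern is a simplified, hypothetical one and the sample string is invented; neither is the project's actual expression or data.

```python
import re

# Simplified, hypothetical pattern: an uppercase abbreviation of 2-3 letters,
# a hyphen and an alphanumeric identifier (e.g. "LEI-008038"),
# or the bare token "CF".
REFERENCE = re.compile(r"\b([A-Z]{2,3})\s*-\s*([A-Z0-9]+)\b|\b(CF)\b")


def find_references(text):
    """Return the matched references in order of appearance."""
    refs = []
    for m in REFERENCE.finditer(text):
        if m.group(3):
            refs.append("CF")
        else:
            # Re-join the two halves so "LEI - 008038" and "LEI-008038" agree.
            refs.append("%s-%s" % (m.group(1), m.group(2)))
    return refs


# Invented sample string, for illustration only.
sample = "LEG-FED CF ANO-1988 ART-00102 LEI - 008038"
print(find_references(sample))
```

Decoding the input to Unicode text before matching sidesteps most of the encoding trouble the slide mentions.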
  12. Database Interaction: Structuring the Data
      Goals:
      • Reflect the original structure of the data
      • Store additional info coming from raw text
      • Design data model with future analytical needs in mind
  13. Database Interaction, MySQLdb: Databases and Drivers
      • MySQL (MariaDB [4]) was the relational database of choice.
      • MySQLdb's cursor.execute('select * from ...')
      • Server-side cursors were essential.
      • MongoDB + PyMongo

      [4] http://mariadb.org
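Server-side cursors matter because they stream rows to the client instead of loading the whole result set into memory. The batch loop below sketches that read, process, write pattern using the standard library's sqlite3 as a stand-in for MySQL, so that it runs anywhere; with MySQLdb the analogous step is to request a streaming cursor class such as SSCursor or SSDictCursor. The table names, column names and rows are hypothetical.

```python
import sqlite3


def process_in_batches(conn, batch_size=1000):
    """Stream rows from `cases`, derive a word count, write it to `case_stats`."""
    read_cur = conn.cursor()
    write_cur = conn.cursor()
    read_cur.execute("SELECT id, text FROM cases ORDER BY id")
    total = 0
    while True:
        rows = read_cur.fetchmany(batch_size)  # one batch in memory at a time
        if not rows:
            break
        stats = [(case_id, len(text.split())) for case_id, text in rows]
        write_cur.executemany(
            "INSERT INTO case_stats (case_id, n_words) VALUES (?, ?)", stats
        )
        total += len(rows)
    conn.commit()
    return total


# Tiny in-memory demonstration with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE case_stats (case_id INTEGER, n_words INTEGER)")
conn.executemany(
    "INSERT INTO cases (id, text) VALUES (?, ?)",
    [(1, "appeal denied"), (2, "appeal granted by majority vote"), (3, "")],
)
processed = process_in_batches(conn, batch_size=2)
print(processed)
```

With a real MySQL streaming cursor the writes would normally go through a second connection, since the first one stays busy until the streamed result has been read completely.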
  14. Database Interaction, SQLAlchemy: What about ORMs?
      • Object-relational mappers are great, but...
      • SQLAlchemy [5] was used mostly for table creation and data insertion.
      • For analytical purposes, server-side raw SQL, stored procedures and views can't be beaten.
      • We mostly used Elixir to design the tables.

      [5] http://www.sqlalchemy.org
  15. Database Interaction, MongoDB: Escaping from 2D data
      Benefits:
      • Exploring MongoDB [a] as an alternative for analytics
      • Auto-sharding + Map/reduce!
      • Escape costly joins in MySQL
      Tips:
      • db.cursor(cursorclass=SSDictCursor)
      • Convert every string to UTF-8
      • PyMongo's transparent conversion of dictionaries to BSON

      [a] www.mongodb.org
  16. Natural Language Processing: Understanding Text
      • The biggest challenge is extracting meaning from decisions.
      • Is a given decision for or against the defendant?
      • What is the vote count on non-unanimous decisions?
  17. Natural Language Processing, NLTK: Natural Language Toolkit
      • Lots of batteries included
  18. Visualization: Visualizing the Data
      • You can't ask questions about what you don't know...
      • Data-driven research
  19. Visualization, Matplotlib: Standard Charting and Plotting
      • Great for plotting summary statistics
      • Together with NetworkX, it can help visualize some small graphs.
  20. Visualization, Ubigraph: Large Graph Visualization
      • Ubigraph rocks! [a]
      • Navigating huge graphs gave powerful insights.
      • Takes advantage of multiple cores and GPU

      [a] http://ubietylab.net/ubigraph/
  21. Visualization, Gource: Untangling Temporal Patterns
      A bit of Python to create logs compatible with Gource [6]. Three lines are cut off at the slide's right edge and are left truncated (marked with …).
      This:

          Q = dbdec.execute("SELECT relator, processo, tipo, proc_classe, duracao, U…
          decs = Q.fetchall()
          durations = [d[4] for d in decs]
          cmap = cm.jet
          norm = normalize(min(durations), max(durations))  # normalizing duration
          with open('decisoes_%s.log' % ano, 'w') as f:
              for d in decs:
                  c = rgb2hex(cmap(norm(d[4]))[:3]).strip('#')
                  path = "/%s/%s/%s/%s" % (d[5], d[2], d[3], d[1])  # /State/tipo/proc…
                  l = "%s|%s|%s|%s|%s\n" % (int(time.mktime(d[6].timetuple())), d[…
                  f.write(l)

      Generates this:

          885967200|MIN. SYDNEY SANCHES|A|/MG/Monocrática/INQUÉRITO/1606809|0000…
          885967200|MIN. SYDNEY SANCHES|A|/MG/Presidência/INQUÉRITO/1606809|0000…

      [6] http://code.google.com/p/gource/
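Gource's custom log format is one pipe-delimited line per event: Unix timestamp, user name, event type (A, M or D), file path and an optional hex colour, with entries in chronological order. A minimal, self-contained sketch of producing such lines follows. It assumes invented records with hypothetical field names, and swaps in a simple blue-to-red ramp for the Matplotlib colormap used on the slide so that only the standard library is needed.

```python
import time
from datetime import datetime


def duration_to_hex(value, vmin, vmax):
    """Map a duration onto a blue (short) to red (long) ramp, as RRGGBB."""
    span = (vmax - vmin) or 1  # avoid dividing by zero if all durations match
    t = (value - vmin) / float(span)
    red = int(round(255 * t))
    return "%02X00%02X" % (red, 255 - red)


def gource_lines(records):
    """Build Gource custom-log lines in chronological order."""
    durations = [r["duration"] for r in records]
    vmin, vmax = min(durations), max(durations)
    lines = []
    for r in sorted(records, key=lambda r: r["date"]):
        ts = int(time.mktime(r["date"].timetuple()))
        path = "/%s/%s/%s/%s" % (r["state"], r["kind"], r["case_class"], r["case_id"])
        colour = duration_to_hex(r["duration"], vmin, vmax)
        lines.append("%d|%s|A|%s|%s" % (ts, r["judge"], path, colour))
    return lines


# Invented records; the field names are hypothetical, not the project's schema.
example = [
    {"date": datetime(1998, 1, 28), "judge": "JUDGE B", "state": "MG",
     "kind": "Plenary", "case_class": "APPEAL", "case_id": 2, "duration": 30},
    {"date": datetime(1998, 1, 27), "judge": "JUDGE A", "state": "MG",
     "kind": "Single-judge", "case_class": "INQUIRY", "case_id": 1, "duration": 10},
]
for line in gource_lines(example):
    print(line)
```

Here each judge plays the role Gource normally gives to a committer, and the path hierarchy (state, decision type, case class, case number) takes the place of a source tree.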
  22. Visualization, Gource: A snapshot of the Supreme Court activities: 1998
  23. Visualization, Gource: The Dynamics (video)
  24. Visualization, Visual Python: It's a Jungle Out There...
      Division of labor in the Supreme Court:
      • VPython [a] is great for quickly creating complex animations.
      • Here judges are trees, branches are subjects and leaves are legal decisions.

      [a] vpython.org
  25. Results
      • Detailed X-ray of the inner workings of the Supreme Court
      • 92% of the cases are appeals of a non-constitutional nature.
      • These results led to the proposal of an amendment to the constitution!
      • More questions than answers!
      • Python for data mining rocks!
  26. Future Directions: To be continued...
      • Further automate and optimize
      • More explorations
      • Scale up the pipeline
      • Model the life history of a legal process
  27. Acknowledgements
      • FGV - Direito Rio
      • FGV - EMAp
      • Brazilian Supreme Court
      • Asla Sá (for kindly lending us her server)
