Legal Informatics with AWS CloudSearch - AWS Michigan

Bommarito Consulting will be covering legal informatics on AWS, giving the first AWS Michigan talk on CloudSearch. The presentation walks through both solutions in detail, including a Solr vs. CloudSearch implementation comparison and code to take you from scratch to searching.

An outline of the presentation is provided below:
1. Generalize information retrieval more than you’re used to.
2. Build a basic case search engine with Solr/EC2.
3. Build a basic case search engine with CloudSearch.
4. Understand the relative strengths of Solr and CloudSearch.
5. Discuss other legal services that CloudSearch can augment.

  1. LEGAL INFORMATICS WITH CLOUDSEARCH
     AWS Michigan, October 9, 2012
  2. GOALS
     1. Generalize information retrieval more than you’re used to.
     2. Build a basic case search engine with Solr/EC2.
     3. Build a basic case search engine with CloudSearch.
     4. Understand the relative strengths of Solr and CloudSearch.
     5. Discuss other legal services that CloudSearch can augment.
     © Bommarito Consulting
  3. SEARCH: INFORMATION RETRIEVAL
     (Loosely defined; history has built silos where none might exist.)
     Wiki: “[the science] of obtaining information resources relevant to an information need from a collection of information resources.”
     Examples:
     • Search: Google, Yahoo, Bing
     • Citation: MLA, Harvard Bluebook
     • Classification: Dewey Decimal, LOC
     We’ll start by building a basic case search engine, like Lexis or West.
  4. SEARCH: INFORMATION RETRIEVAL
     [Diagram: Resources, Store, Engine; example information needs: “pages about Python”, “images of cats”, “locations near Ashley’s”]
     Examples:
     • Store=ext2, Engine=ext2
     • Store=btrfs, Engine=grep
     • Store=tar, Engine=inverted index
     • Store=Oracle, Engine=PL/SQL
     • Store=Postgres, Engine=Sphinx
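
To make the store/engine split concrete, here is a minimal sketch of the "inverted index" engine from the list above, written in Python. It is an illustration only, not code from the talk: the store is an in-memory dict rather than a tar archive, the second document is made up, and the function names are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(store):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in store.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical store: the first text is abridged from the deck's sample
# document, the second is invented for contrast.
store = {
    "Marbury v. Madison": "The clerks of the Department of State of the United "
                          "States may be called upon to give evidence of "
                          "transactions in the Department.",
    "Example v. Sample": "A made-up opinion that mentions evidence but no "
                         "government office.",
}
index = build_index(store)
print(sorted(search(index, "evidence")))
print(sorted(search(index, "department evidence")))
```

The store only has to hand back text by id; the engine owns tokenization and lookup. Lucene, and therefore Solr, fills the engine role with a far more capable version of the same idea.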
  5. SOLR
     [Diagram: Resources, Lucene, Solr; example information needs: “pages about Python”, “images of cats”, “locations near Ashley’s”]
     • All Apache-licensed (APL).
     • Lucene
       • The text search engine library of choice for Java developers.
     • Solr
       • Wraps Lucene in a blanket of RESTful goodness.
       • Many other search and architectural additions as well.
     • (You should know about Tika.)
     • (Yes, there is also ElasticSearch.)
  6. SOLR
     [Diagram: Resources, Lucene, Solr; example information needs: “pages about Python”, “images of cats”, “locations near Ashley’s”]
     Using Solr
     1. Deploy and configure Solr infrastructure.
     2. Configure your schema, indexing, etc.
     3. Seed index with initial data.
     4. If index updates, figure out how to do it without breaking things.
     5. Connect to search client with REST.
     6. Oops, we don’t have enough capacity for search, update volume, etc.
  7. SOLR
     OK, let’s try it.
     1. Deploy and configure Solr infrastructure.
     Launch and connect to an m1.small.
         mjbommar@cluster0:~$ ec2run ami-c371cdaa -t m1.small --region us-east-1 --key ec2-keypair -g tomcat
         mjbommar@cluster0:~$ ec2din $INSTANCE_ID | grep '^INSTANCE' | awk '{print $4}'
         mjbommar@cluster0:~$ ssh -i ~/.ssh/ec2 ubuntu@$INSTANCE_HOST
     Configure a Solr deployment under /opt/solr.
         ubuntu@domU$ apt-get update --fix-missing && apt-get install default-jdk tomcat6
         ubuntu@domU$ cd /opt
         ubuntu@domU$ wget http://www.apache.org/dist/lucene/solr/4.0.0-BETA/apache-solr-4.0.0-BETA.tgz
         ubuntu@domU$ tar xzf apache-solr-4.0.0-BETA.tgz
         ubuntu@domU$ echo "SOLR_HOME=/opt/solr" >> /etc/environment
         ubuntu@domU$ wget -O /etc/tomcat6/Catalina/localhost/solr.xml http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/solr.xml
         ubuntu@domU$ mkdir solr
         ubuntu@domU$ cp apache-solr-4.0.0-BETA/dist/apache-solr-4.0.0-BETA.war solr/solr.war
         ubuntu@domU$ cd solr
  8. SOLR
     2. Configure your schema, indexing, etc.
     Configure the initial collection schema and options.
         ubuntu@domU$ mkdir collection1 && cd collection1
         ubuntu@domU$ wget http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/solr-conf-scotus.tar.gz
         ubuntu@domU$ tar xzf solr-conf-scotus.tar.gz
         ubuntu@domU$ less conf/solrconfig.xml
         ubuntu@domU$ less conf/schema.xml
     Start Tomcat and make sure we’re clean.
         ubuntu@domU$ chown -R tomcat6:tomcat6 /opt/solr
         ubuntu@domU$ service tomcat6 restart
         ubuntu@domU$ tail -f /var/log/tomcat6/catalina.out
  9. SOLR
     2. Configure your schema, indexing, etc.
     Schema Example
         <schema name="scotus" version="1.5">
           <fields>
             <field name="title" type="string" indexed="true" stored="true" required="true" multiValued="false" />
             <field name="content" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
           </fields>
           <uniqueKey>title</uniqueKey>
           <types>
             <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
             <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
               <analyzer type="index">
                 <tokenizer class="solr.StandardTokenizerFactory"/>
                 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
                 …
  10. SOLR
      2. Configure your schema, indexing, etc.
      Solr Config Handler Example
          <requestHandler name="/query" class="solr.SearchHandler">
            <lst name="defaults">
              <str name="echoParams">explicit</str>
              <str name="wt">json</str>
              <str name="indent">true</str>
              <str name="df">text</str>
            </lst>
          </requestHandler>
  11. SOLR
      3. Seed index with initial data.
      Download sample document.
          ubuntu@domU$ wget http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/sample.xml
      Sample File
          <?xml version="1.0" encoding="UTF-8"?>
          <add>
            <doc>
              <field name="title">Marbury v. Madison</field>
              <field name="content">The clerks of the Department of State of the United States may be called upon to give evidence of transactions in the Department which are not of a confidential character.</field>
            </doc>
          </add>
      POST the sample document to the Solr update handler.
          ubuntu@domU$ curl --header 'Content-Type: application/xml' --data-binary @sample.xml "http://localhost:8080/solr/update?commit=true"
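
Once you have more than one case, you will probably want to generate the `<add>` message instead of hand-editing XML. The sketch below builds the same structure as the sample file with Python's standard library. The function name is hypothetical, the field names follow the schema shown earlier in the deck, and the abridged content string is only a stand-in.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field_name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value  # ElementTree escapes &, <, and > on output
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml([
    {
        "title": "Marbury v. Madison",
        "content": "The clerks of the Department of State of the United States "
                   "may be called upon to give evidence.",
    }
])
print(payload)
```

The resulting string can be written to a file and posted with the same curl command as above. Letting the library handle escaping matters for opinions that contain `&` or `<`.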
  12. SOLR
      5. Connect to search client with REST.
          ubuntu@domU$ curl "http://localhost:8080/solr/query?q=content:confidential"
          {
            "responseHeader":{
              "status":0,
              "QTime":0,
              "params":{
                "q":"content:confidential"}},
            "response":{"numFound":1,"start":0,"docs":[
                {
                  "title":"Marbury v. Madison",
                  "content":"The clerks of the Department of State of the United States may be called upon to give evidence of transactions in the Department which are not of a confidential character."}]
            }}
      Some helpful references:
      • http://wiki.apache.org/solr/CommonQueryParameters
      • http://wiki.apache.org/solr/SearchHandler
      • http://lucene.apache.org/core/4_0_0-BETA/index.html
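
A search client mostly needs two things: a correctly encoded query URL and a way to pull results out of the JSON. The following Python sketch shows both against an abridged copy of the response above. The helper names are hypothetical, and the base URL simply mirrors the local Tomcat deployment used in this walkthrough.

```python
import json
from urllib.parse import urlencode


def solr_query_url(base_url, field, term):
    """Build a URL for the /query handler with a percent-encoded q parameter."""
    return base_url + "/query?" + urlencode({"q": f"{field}:{term}"})


def titles_from_response(body):
    """Return (numFound, [titles]) from a Solr JSON response body."""
    response = json.loads(body)["response"]
    return response["numFound"], [doc["title"] for doc in response["docs"]]


# Abridged copy of the response shown on the slide.
sample_body = """
{
  "responseHeader": {"status": 0, "QTime": 0,
                     "params": {"q": "content:confidential"}},
  "response": {"numFound": 1, "start": 0, "docs": [
    {"title": "Marbury v. Madison",
     "content": "The clerks of the Department of State ..."}
  ]}
}
"""

print(solr_query_url("http://localhost:8080/solr", "content", "confidential"))
print(titles_from_response(sample_body))
```

Encoding the query through `urlencode` keeps multi-word or punctuated search terms from breaking the URL, which the bare curl example does not have to worry about.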
  13. SOLR
      Let’s summarize:
      1. Deploy and configure Solr infrastructure.
      2. Configure your schema, indexing, etc.
      3. Seed index with initial data.
      4. If index updates, figure out how to do it without breaking things.
      5. Connect to search client over RESTful interface.
      6. Oops, we don’t have enough capacity for search, update volume, etc. Scale!
      Overall, Solr is great. ElasticSearch makes some of these items easier as well. Can you imagine implementing this from scratch?
      However, items 1, 4, and 6 seem like low-hanging fruit for PaaS.
  14. CLOUDSEARCH
      [Diagram: Resources, CloudSearch; example information needs: “pages about Python”, “images of cats”, “locations near Ashley’s”]
      • Yet another AWS managed service.
        • Compare to RDS, but for search.
      • Solr vs. CloudSearch
        • Collection : Domain
        • Both RESTful
        • CloudSearch schema/text configuration much less flexible
  15. CLOUDSEARCH
      [Diagram: Resources, CloudSearch; example information needs: “pages about Python”, “images of cats”, “locations near Ashley’s”]
      Using CloudSearch
      1. Create a domain.
      2. Configure schema, indexing, access, etc.
      3. Seed index with data.
      4. Connect to search client with REST.
      5. Relax (well, just make sure your billing information is correct).
  16. CLOUDSEARCH
  17. CLOUDSEARCH
      OK, let’s try it.
      Make sure you have your JRE and AWS credentials configured.
      Install CloudSearch command line tools.
          $ cd /opt
          $ sudo wget http://s3.amazonaws.com/amazon-cloudsearch-data/cloud-search-tools-1.0.0.1-2012.03.05.tar.gz
          $ sudo tar xzf cloud-search-tools-1.0.0.1-2012.03.05.tar.gz
          $ export CS_HOME=/opt/cloud-search-tools-1.0.0.1-2012.03.05
          $ export PATH=$PATH:$CS_HOME/bin
          $ export AWS_CREDENTIAL_FILE=/home/user/.ec2/credentials
  18. CLOUDSEARCH
      1. Create a domain.
      Request domain creation.
          $ cs-create-domain -d rehnquist-express
      Monitor the domain until available. This can take more than 5-10 minutes.
          $ cs-describe-domain -d rehnquist-express
          $ grab-a-drink-or-two
  19. CLOUDSEARCH
      2. Configure schema, indexing, access, etc.
      Configure access policies. This can also take a while.
          $ cs-configure-access-policies -d rehnquist-express --update --allow IP_ADDRESS --service doc
          $ cs-configure-access-policies -d rehnquist-express --update --allow all --service search
          $ cs-configure-access-policies -d rehnquist-express --retrieve
      Configure the schema.
          $ cs-configure-fields -d rehnquist-express --name title --type text --option result
          $ cs-configure-fields -d rehnquist-express --name content --type text --option result
      Show current stopwords.
          $ cs-configure-text-options -d rehnquist-express -psw
          ===== Stop Words =====
          State: Active
          ======================
          a
          an
      Not shown: custom synonyms, stemming rules, stopwords, or ranking.
  20. CLOUDSEARCH
      3. Seed index with data.
          $ cd /tmp
          $ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2 && tar xjf US.tar.bz2
          $ for d in `find /tmp/US/ -type d`; do cs-generate-sdf --source "$d/*.html" -d rehnquist-express; done
          $ cs-index-documents -d rehnquist-express
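
`cs-generate-sdf` hides the document format it produces. If you need to build batches yourself (for example, from a custom content transformer), the sketch below shows roughly what a JSON batch looks like. Treat the details as assumptions to verify against the CloudSearch documentation: the `type`/`id`/`version`/`lang`/`fields` layout and the restriction of ids to lowercase letters, digits, and underscores are my recollection of the 2011-02-01 SDF format, and the file path and function names are hypothetical.

```python
import json
import re


def make_doc_id(path):
    """Lowercase the path and replace every character outside a-z0-9 with '_'.

    Assumption: ids may only use a-z, 0-9, and '_' and may not start with '_'.
    """
    doc_id = re.sub(r"[^a-z0-9]", "_", path.lower())
    return doc_id.lstrip("_") or "doc"


def make_sdf_batch(docs, version=1):
    """Build a JSON batch of 'add' operations from (path, title, content) tuples."""
    return json.dumps([
        {
            "type": "add",
            "id": make_doc_id(path),
            "version": version,
            "lang": "en",
            "fields": {"title": title, "content": content},
        }
        for path, title, content in docs
    ])


# Hypothetical path; the real layout of US.tar.bz2 may differ.
batch = make_sdf_batch([
    ("/tmp/US/5/5.US.137.html", "Marbury v. Madison", "The clerks of the ..."),
])
print(batch)
```

The field names match the `title` and `content` fields configured with `cs-configure-fields` in the previous step.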
  21. CLOUDSEARCH
      4. Connect to search client with REST.
          $ curl "http://search-rehnquist-express-5nvzkxgvupbufypbvcmg57lw7m.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q=confidential&return-fields=title"
          {"rank":"-text_relevance","match-expr":"(label 'confidential')","hits":{"found":449,"start":0,"hit":[
            {"id":"d__data_workspace_lle_data_us_454_454_us_170_80_1103_80_885_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_508_508_us_165_91_2054_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_340_340_us_332_21_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_484_484_us_19_86_422_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_510_510_us_1103___2_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_291_291_us___338_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_537_537_us_941_01_1521_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_537_537_us_941_01_1708_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_357_357_us_144_621_html","data":{}},
            {"id":"d__data_workspace_lle_data_us_351_351_us_345_503_html","data":{}}]},
           "info":{"rid":"b7c167f6c2da6dd31b0fda497afcf1775b775c683dee5a356e2b9115965a3eb688f6fc18e0a36950","time-ms":3,"cpu-time-ms":0}}
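
The response reports 449 matches but returns only ten hits, so a real client has to page. Here is a hedged Python sketch of how the page URLs could be generated. The `start` and `size` parameter names are my recollection of the 2011-02-01 search API and should be checked against the documentation; the endpoint hostname and function names are placeholders.

```python
from urllib.parse import urlencode

API_VERSION = "2011-02-01"  # API version used in the curl example above


def page_starts(found, size=10):
    """Return the start offset of every page needed to cover `found` hits."""
    return list(range(0, found, size))


def search_url(endpoint, query, start=0, size=10, return_fields="title"):
    """Build a search URL for one page of results."""
    params = urlencode({
        "q": query,
        "return-fields": return_fields,
        "start": start,  # assumed parameter name
        "size": size,    # assumed parameter name
    })
    return f"http://{endpoint}/{API_VERSION}/search?{params}"


# Placeholder endpoint; substitute your own domain's search endpoint.
endpoint = "search-example.us-east-1.cloudsearch.amazonaws.com"
starts = page_starts(449)
print(len(starts), "pages")
print(search_url(endpoint, "confidential", start=starts[-1]))
```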
  22. CLOUDSEARCH
      What about pricing?
      “Sticky” opex
          Type          Estimated Capacity*   $/hr.   $/mo.
          Small         1M docs               0.12    $86
          Large         4M docs               0.48    $346
          Extra Large   8M docs               0.68    $489
          * 1k documents, no result storage.
      Variable opex
      • $0.10 per 1,000 uploads
      • $0.98/GB per index
      • $0.12/GB network out
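
The monthly column is essentially the hourly rate times the hours in a month: at 720 hours (a 30-day month, which is my assumption) the three rates come to about $86.40, $345.60, and $489.60. The Python sketch below reproduces that arithmetic and adds the variable charges. The rates are the 2012 figures from the slide, not current prices, and reading "per 1,000 uploads" as upload batches and "per index" as per GB each time the index is rebuilt is my interpretation.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month

# Hourly rates from the slide (2012 pricing).
HOURLY_RATES = {"small": 0.12, "large": 0.48, "extra_large": 0.68}


def monthly_instance_cost(instance_type, hours=HOURS_PER_MONTH):
    """'Sticky' cost of keeping one search instance running."""
    return HOURLY_RATES[instance_type] * hours


def variable_cost(upload_batches, index_gb, transfer_out_gb):
    """Usage-based charges: uploads, indexing, and outbound transfer."""
    return (upload_batches / 1000) * 0.10 + index_gb * 0.98 + transfer_out_gb * 0.12


for name in HOURLY_RATES:
    print(name, round(monthly_instance_cost(name), 2))
print("variable:", round(variable_cost(1000, 1, 1), 2))
```

For a short-lived project such as a single discovery matter, pass a smaller `hours` value to see how little of the sticky cost you actually incur.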
  23. CLOUDSEARCH
      What about pricing?
      Building this sample:
      Some cycles on office servers not counted.
  24. CLOUDSEARCH
      Next steps:
      • Try CloudSearch with Boto:
        • http://boto.cloudhackers.com/en/latest/cloudsearch_tut.html
      • Write your own custom content transformer with Tika:
        • http://tika.apache.org/1.2/formats.html
      • Understand how types, search, faceting, and results interact and constrain in CloudSearch.
      • Understand how response times are affected by document count, document size, types, search, faceting, and results.
  25. SEARCH SUMMARY
      Solr
      • Can share infrastructure and application containers.
      • Highly flexible configuration via XML.
      • Can customize or extend by writing Java.
      CloudSearch
      • Managed service.
      • Highly scalable per unit of labor.
      • Like all AWS services, stop paying immediately when you’re done.
      Best solution probably depends on relative scarcity of SA/developer labor and document volume.
  26. THINK MORE VARIABLE
      Case search is a pretty simple, static example. You build a Lexis/West clone, monetize with some ads or a subscription, and figure out how to handle monthly or yearly updates as cheaply as possible.
      Think more variable: what might CloudSearch help with in legal services? Prime tasks should have durations or scales that are hard to meet with fixed assets.
  27. THINK MORE VARIABLE
      Here’s an idea, shamelessly copied from some of my marketing material:
      Imagine you’re a smaller law firm that specializes in HR disputes. As part of a time-sensitive non-solicitation claim filed by your client, you’ve subpoenaed email from fifteen employees at a client’s competitor. It’s Friday afternoon at 5PM, and you finally receive a hard drive with the emails. However, in an effort to overwhelm your small team, the other party has dumped 10GB of data on your plate. There’s no way you can search through this by hand. You have a hearing on Wednesday, but need to prepare a memo for your client by Monday morning. Do you disappoint your client and motion to reschedule? How could you possibly make the deadline? If only you could just press a button and get something like Google for your data…
      Discovery is a perfect task for CloudSearch.
      • Very tight deadlines.
      • Short project lifetimes.
      • Wide variety of data volumes.
  28. THINK MORE VARIABLE
      For public corporations, you might have enough compliance or discovery work to keep an application running 24/7.
      Enter KnowCave, a SolidLogic project.
      • Deep search
      • Natural language content alert
      • Responsive, multi-device interface
      http://knowcave.com
  29. REFERENCES
      • Bommarito Consulting Blog
        • Installing AWS Cloud Search Command Line Tools
        • Building an AWS CloudSearch domain for the Supreme Court
        • eDiscovery Consulting in the Cloud: Searching an Outlook mailbox and attachments
        • Generating AWS CloudSearch SDF for Emails
      • SolidLogic KnowCave
      • Electronic Discovery Reference Model
        • EDRM Stages Explained
      • Cornell Legal Information Institute
        • Federal Rules of Civil Procedure
      • Amazon Web Services
        • CloudSearch documentation
        • CloudSearch command line tools
          • Windows
          • Linux/Mac
      • International Association for Artificial Intelligence and Law
      • Boto
      • Apache Tika
      • Apache Lucene
      • Apache Solr
      • ElasticSearch
  30. THANKS!
      You can get these slides on my blog: http://bommaritollc.com/blog/. Here’s the post.
      • Michael J Bommarito II
      • CEO, Bommarito Consulting, LLC
      • Email: michael@bommaritollc.com
      • Web: http://bommaritollc.com/
