Sara's keyword searching metadata_lecture_revised


Sara's E-Discovery Consulting lecture to lawyers and paralegals on concept and keyword searching.

Published in: Technology, Business

  2. 2. WHAT IS KEYWORD SEARCHING? The term “keyword search” refers to a basic search technique that looks for one or more words within a collection of documents. Typically, a user types a search request, or query, into a search engine such as Google, which then returns only those documents that contain the search terms entered. The documents returned by the search engine are called the search results.
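As a rough illustration of the idea on this slide, here is a minimal Python sketch. The function names, the sample documents, and the rule that a document must contain every query word are assumptions made for the example, not a description of any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(documents, query):
    """Return the ids of documents that contain every word in the query.

    Assumption: "contains the search terms" means all terms must be present.
    """
    terms = tokenize(query)
    return [doc_id for doc_id, text in documents.items() if terms <= tokenize(text)]

# Hypothetical three-document collection.
documents = {
    "doc1": "Brake inspection completed on the Honda Civic.",
    "doc2": "Invoice for an oil change and tire rotation.",
    "doc3": "Customer reported that the Civic brake pedal felt soft.",
}

print(keyword_search(documents, "civic brake"))  # the search results
```

Note that this sketch matches whole words only, so a query for "brake" would not find "brakes"; that gap is what the stemming and wildcard options on later slides address.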
  3. 3. KEYWORD SEARCH AND TECHNIQUES Keyword searching in the EDRM (Electronic Discovery Reference Model) can apply an array of techniques to a variety of data. Often the data searched are the documents in a specific case, but even then the documents can take several forms. Understanding these forms helps not only the EDD consultant but also the client in choosing the best approach to the case.
  4. 4. KEY WORD SEARCHES AND TECHNIQUES (cont.) Computer files (known as Electronically Stored Information, or ESI) include documents created with Microsoft Word or PowerPoint, email stored as individual message files or together in an Outlook or Notes data file, OCR (Optical Character Recognition) files created from scanned paper documents, and even more exotic files such as those created by a CAD/CAM program. Important cases therefore need computer systems to store and manage this data.
  5. 5. KEYWORD SEARCHING AND WHY IT IS SIGNIFICANT IN EDISCOVERY Search tools and methodologies are significant because they have numerous applications during the e-discovery phase of the litigation lifecycle and surface the relevant information that clients need for their cases. Let us take a real-life example of the processes and challenges related to using search and how these challenges can be mitigated. Our example involves an automobile accident and a maintenance shop, or garage, which should have documented a failing brake system but may have been incompetent.
  6. 6. EXAMPLE 1 Let us say that Attorney John Doe is working on a new case involving a car accident. The plaintiff is claiming that his local garage failed to spot a failing brake system in the plaintiff’s 2004 Honda Civic. As a result, the failing brakes not only caused a major car accident but also caused property damage and bodily injury. Attorney Jacob Bacon, who is representing the defendant’s garage, has a database containing thousands of documents, including email to and from the plaintiff and the defendant, email from a mailing list for Honda enthusiasts that both plaintiff and defendant participated in, and OCR’d documents including maintenance records and receipts from the garage.
  7. 7. EXAMPLE 1 (continued) This time Attorney John Doe runs a concept search using the keywords Honda Civic, brakes, accident, and maintenance. As John Doe scrolls through the results he doesn’t see anything new, until he sees the word “stoppies”, which he is unfamiliar with. A little digging in the result set of documents lets him discover that “stoppies” is a behavior similar to wheelies that can result in damaged brakes. The documents containing this word reveal that the plaintiff frequently engaged in this dangerous behavior. Attorney Doe now has the ammunition he needs to win his case, using a concept he did not know in advance existed. What exactly is concept searching? Read on to find out.
  8. 8. CONCEPT SEARCHING We have discussed the notion of keyword searching, but based on our recent example of the failed brake system involving the Honda, let us examine what concept or “conceptual” searching is. Concept search is an automated method used to search electronically stored and unstructured text for information based on “ideas” or “concepts”. As we saw in our previous example of the automobile accident, the term “stoppies” was a concept related to brake damage that the attorney did not know to search for. The information retrieved in response to a concept query should be relevant to the ideas contained in the text of the query.
  9. 9. CONCEPT SEARCHING Example Let us say that you are hired by Oil/Gas Company X, which is being sued by a terminated employee for wrongful termination. Now, if we want to perform a search on the word “termination”, what other words or ideas related to “termination” can you think of? Here are some random words that might be found in e-mails related to termination: canned, let-go, hosed, fired, gatorated, sunset and beaches, retired, vacation, etc. With their advanced capabilities, concept search technologies can evaluate patterns and trends and produce results that can help lawyers and corporations with their litigation.
  10. 10. CONTEMPORARY EXAMPLE (CAN YOU SPELL ENRON?) We all may recall the Enron and WorldCom debacle, which highlighted corporate greed and was quite the scandal of the early 2000s. How would concept searching help incriminate the big bad wolves? Let us take an example of the “code” words Enron used to keep authorities or legal entities from finding its hidden crime. The term “Rawhide” was found in several of the Enron emails. “Rawhide” could mean a kind of leather or an old TV show, but in the context of the Enron emails, “Rawhide” actually refers to one of its off-books partnerships. “Raptor” was another of those problematic partnerships. So a Concept Search query in the Enron emails for “Raptor” would not net you documents about hawks, but rather about “Rawhide” and other off-books partnerships, even if the words “Raptor” and “Rawhide” did not actually appear in any particular document itself.
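The slides do not say how a concept engine works internally, so the following Python sketch is only a toy stand-in built on one assumption: terms that appear alongside the query word in some documents can be used to find other documents about the same idea. The sample emails, the stopword list, and the `min_shared` threshold are all invented for illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "on", "for", "to", "and"}  # hypothetical list

def words(text):
    """Return the set of lowercase word tokens in the text, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def concept_search(documents, query, min_shared=2):
    """Toy concept search based on shared context terms (an assumption).

    1. Seed documents are those that literally contain the query word.
    2. Context terms are the other words found in the seed documents.
    3. Any other document sharing at least `min_shared` context terms
       is returned as related, even though it never mentions the query.
    """
    query = query.lower()
    seeds = [d for d, text in documents.items() if query in words(text)]
    context = set()
    for d in seeds:
        context |= words(documents[d]) - {query}
    related = [
        d for d, text in documents.items()
        if d not in seeds and len(words(text) & context) >= min_shared
    ]
    return seeds + related

# Hypothetical emails loosely modeled on the slide's example.
emails = {
    "e1": "Raptor partnership kept debt off the books",
    "e2": "Rawhide partnership moved debt off the books",
    "e3": "Lunch menu for the cafeteria on Friday",
}

print(concept_search(emails, "raptor"))
```

Here the second email is returned for the query "raptor" because it shares context words such as "partnership" and "debt" with the first, while the unrelated third email is left out.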
  11. 11. BENEFITS OF CONCEPT SEARCH Increased likelihood of finding a larger number of relevant documents; less time spent perusing irrelevant documents; less time spent trying to come up with the right keywords; reduced time, cost and effort overall in retrieving the best documents in reply to the concept of your query in the context of the entire document collection.
  12. 12. EXAMPLE 2 Let us say that a major oil/gas company (Oil Company X) needs a vendor and hires us to assist with a lawsuit against Oil Company Y. The lawsuit concerns intellectual property theft in the year 2009, and Oil Company X argues that there are certain words or phrases that would incriminate Oil Company Y. How would we be able to assist our client in the most cost-efficient and time-efficient fashion? Keyword searching allows vendors to zoom in on the collected data and find the documents that would assist the client with the lawsuit in the most meaningful fashion. Understanding the reasoning behind keyword searching allows us to help our clients.
  13. 13. WHAT QUALIFIES AS KEYWORD SEARCHING? Keyword searches are most often used to identify documents that are either responsive or privileged. They are also widely used for large-scale culling and filtering of documents. Keywords often form a basic building block for constructing other, more complex compound searches. Such compound searches use other search elements such as Boolean logic.
  14. 14. PARAMETERS IN KEYWORD SEARCHING The syntax of the search string; use of the keywords with or without stemming; use of keywords with certain wildcard specifications and the syntax for said wildcards; case-sensitivity of keywords used in searches and whether the keyword should match both cases; the target data sources to be searched; whether the query can be applied to any specific fields such as email ‘To/From’ or ‘Subject’; and whether the query can be applied to any specific date range, such as an email ‘Sent Date’ between January 1, 2001 and December 31, 2001.
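Several of these parameters can be pictured as arguments to a search function. The Python sketch below is a hedged illustration only: the `Email` record, its field names, and the `search_emails` signature are hypothetical, and matching is done by simple substring comparison.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Email:
    """Hypothetical email record with a few searchable fields."""
    sender: str
    subject: str
    sent: date
    body: str

def search_emails(emails, keyword, field="body", case_sensitive=False,
                  start=None, end=None):
    """Return emails whose chosen field contains the keyword.

    field          -- which attribute to search ("sender", "subject", "body")
    case_sensitive -- whether upper and lower case must match exactly
    start, end     -- optional inclusive bounds on the sent date
    """
    hits = []
    for msg in emails:
        if start is not None and msg.sent < start:
            continue
        if end is not None and msg.sent > end:
            continue
        haystack, needle = getattr(msg, field), keyword
        if not case_sensitive:
            haystack, needle = haystack.lower(), needle.lower()
        if needle in haystack:  # substring match, an assumption of this sketch
            hits.append(msg)
    return hits

mailbox = [
    Email("garage@example.com", "Brake Inspection", date(2001, 3, 5), "Pads look worn."),
    Email("owner@example.com", "Re: brake inspection", date(2002, 1, 9), "Thanks."),
]

in_2001 = search_emails(mailbox, "brake", field="subject",
                        start=date(2001, 1, 1), end=date(2001, 12, 31))
print([m.subject for m in in_2001])
```

Switching `case_sensitive` to `True` with the keyword "brake" would skip the first message, whose subject capitalizes "Brake".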
  15. 15. BOOLEAN SEARCHES Boolean searches are used to combine the results of multiple searches as well as to designate ambiguity, as when you search for two or more terms but do not necessarily need both. Imagine you are at your local university library and want to perform a search in one of the library databases which houses many of the scholastic journals. You encounter a database form which asks you to enter the
  17. 17. EXAMPLES
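The original examples for this slide are not available in the transcript. As a stand-in, here is a small Python sketch of how Boolean operators can combine the results of separate keyword searches using set operations. The documents and the helper name `docs_containing` are assumptions for illustration.

```python
import re

def docs_containing(documents, term):
    """Return the set of document ids whose text contains the term as a word."""
    term = term.lower()
    return {
        doc_id for doc_id, text in documents.items()
        if term in re.findall(r"[a-z0-9]+", text.lower())
    }

# Hypothetical garage records.
records = {
    1: "Brake repair invoice",
    2: "Brake and clutch inspection",
    3: "Clutch replacement invoice",
}

brake = docs_containing(records, "brake")
clutch = docs_containing(records, "clutch")
invoice = docs_containing(records, "invoice")

print(brake & invoice)   # brake AND invoice: both terms required
print(brake | clutch)    # brake OR clutch: either term is enough
print(invoice - brake)   # invoice NOT brake: exclude documents with "brake"
```

AND narrows a result set, OR widens it, and NOT removes documents, which is why compound queries are usually built by combining simple keyword searches in this way.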
  18. 18. WILDCARD A wildcard is a character that may be used in a search term to represent one or more other characters. It also allows you to find words using patterns for a set of words and to find synonyms or forms of a word. The two most commonly used wildcards are: 1) The question mark (“?”) may be used to represent a single alphanumeric character in a search expression. For example, searching for the term “ho?se” would yield results which contain such words as “house” and “horse”.
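One way to picture wildcard matching is to translate the pattern into a regular expression. The Python sketch below does this for “?” as described on the slide. It also handles “*” as "zero or more alphanumeric characters", which is an assumption here since the transcript does not include the second wildcard's definition. The function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled, case-insensitive regex.

    "?" stands for exactly one alphanumeric character.
    "*" stands for zero or more alphanumeric characters (an assumption).
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("[A-Za-z0-9]")
        elif ch == "*":
            parts.append("[A-Za-z0-9]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_match(pattern, candidates):
    """Return the candidate words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in candidates if regex.fullmatch(w)]

vocabulary = ["house", "horse", "hose", "hoarse", "home"]
print(wildcard_match("ho?se", vocabulary))
```

Because “?” stands for exactly one character, "hose" and "hoarse" do not match "ho?se" in this sketch.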
  21. 21. FUZZY SEARCH Fuzzy search allows searching for word variations such as in the case of misspellings. Typically, such searching includes some form of distance and score computations between the specified word and the words in the corpus. Fuzzy search is specified using the operator: fuzzysearch.
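The "distance and score" idea can be sketched in Python with Levenshtein edit distance, one common choice of distance measure. Using it here is an assumption, as the slide does not name a specific algorithm, and the function names and `max_distance` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_search(term, vocabulary, max_distance=1):
    """Return (distance, word) pairs within max_distance, closest first."""
    term = term.lower()
    scored = [(edit_distance(term, w.lower()), w) for w in vocabulary]
    return sorted(pair for pair in scored if pair[0] <= max_distance)

corpus_words = ["brake", "break", "brakes", "bake", "garage"]
print(fuzzy_search("brake", corpus_words, max_distance=1))
```

Raising `max_distance` to 2 would also return "break", a common misspelling in brake-related documents, at the cost of more false hits.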
  23. 23. SYNONYM SEARCH A synonym search matches word variations that are determined to be synonyms of the word being searched. Such searching includes some form of dictionary- or thesaurus-based lookup (e.g. synonyms for “party” are gathering, get-together, festivity, etc.).
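A thesaurus-based lookup can be sketched in Python as query expansion: the search term is replaced by itself plus its synonyms. The tiny `THESAURUS` dictionary below is hypothetical and uses the slide's own example; a real tool would rely on a much larger dictionary.

```python
# Hypothetical thesaurus containing only the slide's example entry.
THESAURUS = {
    "party": {"gathering", "get-together", "festivity"},
}

def expand(term):
    """Return the term together with any synonyms listed in the thesaurus."""
    term = term.lower()
    return {term} | THESAURUS.get(term, set())

def synonym_search(documents, term):
    """Return ids of documents that mention the term or one of its synonyms.

    Matching is by simple substring, an assumption of this sketch.
    """
    variants = expand(term)
    return [
        doc_id for doc_id, text in documents.items()
        if any(v in text.lower() for v in variants)
    ]

notes = {
    "n1": "The office party is on Friday.",
    "n2": "A small get-together after the deposition.",
    "n3": "Quarterly budget review.",
}

print(synonym_search(notes, "party"))
```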
  24. 24. PROXIMITY SEARCH A proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.
  25. 25. PROXIMITY SEARCH EXAMPLE For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.
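The "red brick house" example can be sketched in Python as follows. This is an illustrative assumption about one way to measure distance (the number of non-query words falling between the matched terms), with an optional flag for the word-order constraint mentioned on the previous slide. The function name and parameters are hypothetical.

```python
import re
from itertools import product

def proximity_match(text, terms, max_gap, ordered=False):
    """Return True if every term occurs and the matched occurrences have
    at most max_gap other words between the first and last of them.

    ordered=True additionally requires the terms to appear in query order.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions = [[i for i, w in enumerate(tokens) if w == t.lower()] for t in terms]
    if any(not p for p in positions):
        return False  # at least one term is missing entirely
    for combo in product(*positions):  # try one occurrence per term
        if len(set(combo)) < len(combo):
            continue  # one word cannot stand in for two query terms
        if ordered and list(combo) != sorted(combo):
            continue
        gap = max(combo) - min(combo) + 1 - len(combo)  # words in between
        if gap <= max_gap:
            return True
    return False

query = ["red", "brick", "house"]
print(proximity_match("red house of brick", query, max_gap=2))
print(proximity_match("house made of red brick", query, max_gap=2))
```

Both phrases from the slide match with a gap of two or fewer words, while a text where "red", "house", and "brick" are scattered across a page would not. Trying every combination of occurrences is fine for a short example but would be slow on long documents.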
  26. 26. TRUNCATION SPECIFICATION AND STEMMING Truncation specification is one way to match word variations. Truncation allows for the final few characters to be left unspecified. Stemming specification is another method for matching word variations. Stemming is the process of finding the root form of a word. The stemming specification will match all morphological inflections of the word, so that if you enter the search term sing, the stemming matches would include singing, sang, and song. Note that even though a stemming search will return singing for a search term of sing, this is different from wildcard search. A wildcard search for sing* will not return sang or song, while it will return Singsing.
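The contrast between the two approaches can be sketched in Python. The `STEM_TABLE` lookup below is a hypothetical stand-in that simply mirrors the slide's example; real stemmers use rule-based or dictionary-based algorithms rather than a hand-written table.

```python
# Hypothetical lookup table mirroring the slide's "sing" example.
STEM_TABLE = {
    "sing": "sing", "sings": "sing", "singing": "sing",
    "sang": "sing", "song": "sing",
}

def stem(word):
    """Return the root form from the table, or the lowercase word itself."""
    word = word.lower()
    return STEM_TABLE.get(word, word)

def stemming_search(term, candidates):
    """Match words that share a root form with the search term."""
    target = stem(term)
    return [w for w in candidates if stem(w) == target]

def truncation_search(prefix, candidates):
    """Match words that start with the prefix, as a search for prefix* would."""
    prefix = prefix.lower()
    return [w for w in candidates if w.lower().startswith(prefix)]

found_words = ["sing", "singing", "sang", "song", "Singsing", "ring"]
print(stemming_search("sing", found_words))
print(truncation_search("sing", found_words))
```

As the slide notes, the stemming search picks up "sang" and "song" but not "Singsing", while the truncation search does the opposite.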
  27. 27. WHAT IS METADATA AND WHY IS IT IMPORTANT? Software programs embed various categories of metadata in the documents users create. Metadata is significant because it describes how, when, and by whom an electronic document was created, modified, and transmitted. Unlike paper documents, electronic documents are unique because they carry their history with them. Paper is boring and pertains to dinosaurs, as it merely shows us what a document said or looked like. An electronic document tells us where the document went and what it did.
  28. 28. METADATA AND EMAILS An e-mail carries information about its author, creation date, attachments, and the identities of all recipients, including who was CC’ed or BCC’ed. Metadata also connects attachments to e-mails. Information embedded in other file types may include document names, authors, number of times printed, etc. Track changes reflect modifications by each recipient.
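To make this concrete, the Python sketch below reads header metadata from a raw message using the standard library's `email` package. The sample message, its addresses, and the `email_metadata` helper are invented for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message in standard header format.
RAW_MESSAGE = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>
Cc: Carol Example <carol@example.com>
Subject: Brake inspection report
Date: Mon, 05 Mar 2001 09:30:00 -0500

See the attached report.
"""

def email_metadata(raw):
    """Extract author, recipients, subject, and sent date from a raw email."""
    msg = message_from_string(raw)
    return {
        "from": getaddresses(msg.get_all("From", [])),
        "to": getaddresses(msg.get_all("To", [])),
        "cc": getaddresses(msg.get_all("Cc", [])),
        "subject": msg["Subject"],
        "sent": parsedate_to_datetime(msg["Date"]),
    }

meta = email_metadata(RAW_MESSAGE)
print(meta["from"], meta["sent"].isoformat())
```

Each address comes back as a (display name, address) pair, which makes it straightforward to filter a collection by sender or recipient.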
  29. 29. METADATA AND PRESERVATION Some methods of document review fail to account for and preserve metadata. If a document is printed in the review or production process, its metadata is lost. Many lawyers believe they are conducting EDD when in fact they are working with electronic images of documents. The process of scanning and coding documents into a database does not capture original document metadata. Understand the difference between document metadata and file system metadata.
  30. 30. FILE SYSTEM METADATA When we think of file system metadata, think ‘file timestamps’. While ‘file metadata’ and ‘timestamps’ are often used interchangeably, they mean two completely different things. There are two separate sets of timestamps for office documents and several other file types. The first set is stored by the operating system (Windows, Linux, MacOS) and is different from the set stored in the file. The metadata stored in a file (Date Created, Date Last Saved, etc.) may also be referred to as the file’s timestamps and confused with what is stored by the operating system.
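The operating system's set of timestamps can be read without opening the file at all. The Python sketch below does so with `os.stat`. The helper name is hypothetical, and the sketch covers only file system metadata, not the dates stored inside a document.

```python
import os
import tempfile
from datetime import datetime, timezone

def file_system_timestamps(path):
    """Return the timestamps the operating system keeps for a file.

    These come from the file system, not from inside the document.
    """
    info = os.stat(path)

    def as_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "modified": as_utc(info.st_mtime),
        "accessed": as_utc(info.st_atime),
        # st_ctime is creation time on Windows but the time of the last
        # metadata change on Linux and most other Unix systems.
        "created_or_changed": as_utc(info.st_ctime),
        "size_bytes": info.st_size,
    }

# Demonstrate on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"maintenance record")
    sample_path = handle.name

stamps = file_system_timestamps(sample_path)
print(stamps["modified"].isoformat(), stamps["size_bytes"])
os.remove(sample_path)
```

Copying or moving a file can change these values while the dates embedded in the document stay the same, which is one reason the two sets should not be confused.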
  38. 38. ADVANCED (continued)