This document discusses triaging foreign-language documents in digital forensics investigations. It presents two scenarios in which examiners encounter non-English documents and must prioritize them for a limited pool of translators. An ideal solution would provide English executive summaries of the documents. The proposed solution uses named entity recognition to extract who and where information, name matching to identify people on watch lists, and concept dictionaries to find what topics are discussed. It would be implemented as a module in the open source Autopsy digital forensics platform to help investigators navigate and tag priority documents.
2. Scenarios / Problem Statement
1. Media triage is performed in the field. Triage
reveals dozens of non-English documents. The
translator is busy talking with the suspect.
2. Medium-dive analysis is performed at a base.
Even more documents are found. Limited
translators are available.
How does the examiner / operator prioritize the
documents for the translator?
3. Ideal Solution: Translated Gist
▪ A several-page non-English document turns into
an English executive summary.
▪ Allow user to understand who, what, and where
are mentioned.
▪ No one provides that solution today.
4. Our Proposed 70% Solution
▪ Show human-generated gists when they are known.
▪ Use Rosette Named Entity software to find names of
people, places, and organizations:
– Who and where
▪ Use name matching software to identify people on
watch lists.
▪ Use dictionaries to find concepts (financial, drugs,
IED).
– What
▪ Use graphical techniques to show relationships and
context.
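The watch list step above can be sketched as approximate string matching on names that have already been translated to English. This is a minimal illustration using Python's standard `difflib`, not the name matching software the slides refer to; the watch list entries, the `match_watch_list` helper, and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical watch list; real entries would come from the investigation.
WATCH_LIST = ["John Example", "Jane Sample"]


def normalize_name(name):
    """Casefold and collapse whitespace so case and spacing do not affect matching."""
    return " ".join(name.casefold().split())


def match_watch_list(name, watch_list, threshold=0.85):
    """Return (entry, score) pairs whose similarity to `name` meets the threshold.

    The threshold is an assumed value; a real deployment would tune it.
    Results are sorted with the best match first.
    """
    candidate = normalize_name(name)
    hits = []
    for entry in watch_list:
        score = SequenceMatcher(None, candidate, normalize_name(entry)).ratio()
        if score >= threshold:
            hits.append((entry, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # A spelling variant of a listed name should still be flagged.
    print(match_watch_list("Jon Example", WATCH_LIST))
```

Character-level similarity is only a stand-in here: it tolerates small spelling variants but knows nothing about transliteration rules, nicknames, or name order.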
5. Names
▪ Rosette® Entity Extractor:
– Uses statistical models, regular expressions, and
gazetteers to find names.
– Works on 17 languages.
▪ Rosette® Name Translator:
– Translates names from native language to English.
– Uses linguistic algorithms, dictionaries, and statistical
inference.
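Two of the three techniques listed for entity extraction, gazetteers and regular expressions, are simple enough to sketch in a few lines. The example below is an illustration only and does not use or imitate the Rosette API; the gazetteer entries, the e-mail pattern, and the `extract_entities` helper are assumptions, and the statistical model component is omitted entirely.

```python
import re

# Hypothetical gazetteer: known names mapped to an entity type.
GAZETTEER = {
    "kabul": "LOCATION",
    "red crescent": "ORGANIZATION",
}

# Hypothetical pattern-based entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_entities(text):
    """Return (mention, label, offset) tuples sorted by position in the text."""
    entities = []
    # Gazetteer lookup: case-insensitive whole-word match of each known name.
    for term, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append((m.group(), label, m.start()))
    # Regular expressions: entities recognizable by their shape alone.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append((m.group(), label, m.start()))
    return sorted(entities, key=lambda e: e[2])


if __name__ == "__main__":
    sample = "Meet the Red Crescent team in Kabul; write to contact@example.org."
    for mention, label, offset in extract_entities(sample):
        print(offset, label, mention)
```

Gazetteers and patterns only find names that are already known or have a fixed shape, which is why an extractor would also need statistical models for unseen names.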
6. Concept Dictionary
▪ User-generated dictionary based on concepts
that are important to the user.
▪ Contains both native-language and English words.
▪ Text in documents is normalized using Rosette
Base Linguistics.
▪ Concepts are identified in the native language or in English.
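A concept dictionary lookup of this kind can be sketched as follows. The normalization here (casefolding and stripping accents) is a crude stand-in chosen for the example and is far simpler than linguistic normalization such as lemmatization; the Spanish and English dictionary entries and all function names are hypothetical.

```python
import unicodedata

# Hypothetical user dictionary: concept -> native-language and English terms.
CONCEPTS = {
    "financial": {"bank", "banco", "dinero", "money"},
    "drugs": {"cocaine", "cocaína"},
}


def normalize(token):
    """Casefold and strip accents (assumed stand-in for linguistic normalization)."""
    decomposed = unicodedata.normalize("NFKD", token.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(concepts):
    """Map each normalized term to the set of concepts it belongs to."""
    index = {}
    for concept, terms in concepts.items():
        for term in terms:
            index.setdefault(normalize(term), set()).add(concept)
    return index


def find_concepts(text, index):
    """Count how many tokens in the text hit each concept."""
    counts = {}
    for raw in text.split():
        token = normalize(raw.strip(".,;:!?\"'()"))
        for concept in index.get(token, ()):
            counts[concept] = counts.get(concept, 0) + 1
    return counts


if __name__ == "__main__":
    index = build_index(CONCEPTS)
    print(find_concepts("Transfirió dinero al banco. Cocaína.", index))
```

Normalizing both the dictionary terms and the document tokens with the same function is what lets a single entry match its variants in the text.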
7. Navigation Techniques
▪ Goals:
– Provide summary of names and concepts.
– Provide context to know what was mentioned nearby.
▪ Finding the approach that works best is still
an area of research.
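One simple way to meet both goals is a hit summary plus a keyword-in-context view. The sketch below assumes the document has already been tokenized and that the set of hits (names or concept terms) is known; the function names and the window width are assumptions, not the interface the slides describe.

```python
from collections import Counter


def summarize_hits(tokens, hits):
    """Return (term, count) pairs for hit terms, most frequent first."""
    return Counter(tok for tok in tokens if tok in hits).most_common()


def context_windows(tokens, hits, width=3):
    """Return (term, snippet) pairs showing `width` tokens on each side of a hit."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok in hits:
            start = max(0, i - width)  # clamp at the start of the document
            snippet = " ".join(tokens[start:i + width + 1])
            windows.append((tok, snippet))
    return windows


if __name__ == "__main__":
    tokens = "the courier met Ahmed near the bank before Ahmed left".split()
    hits = {"Ahmed", "bank"}
    print(summarize_hits(tokens, hits))
    for term, snippet in context_windows(tokens, hits, width=2):
        print(term, "->", snippet)
```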
11. Deployment Platform
▪ Autopsy™ is an open source digital forensics
platform.
▪ Development started after our first Open Source
Digital Forensics Conference (OSDFCon) in 2010.
▪ Community wanted an end-to-end platform
instead of many stand-alone tools.
▪ Version 3.0 was released in September 2012.
▪ Received some US Army funding.
13. Autopsy Capabilities
▪ Ingests hard drives, media cards, and other digital
media.
▪ Identifies suspicious files based on:
– Keywords
– Hash databases
– File types
▪ Allows operator to quickly focus on recent user
activity:
– Web artifacts
– E-mail
▪ Provides fast results to enable field-based scenarios.
14. Autopsy Extensibility
▪ Ingest Modules analyze media on import
– Hash analysis, keyword search, registry, web artifacts
▪ Content viewers display files
– Text, image, text analytics, video triage, …
▪ Report modules generate final reports
– HTML, XML, …
21. Scenario 2: Medium Dive
▪ Media card, hard drive, or cell phone is added.
▪ File system is analyzed.
▪ User navigates media using:
– Hash lookup
– Keyword search
– Web browser activity
– E-mail analysis
▪ Uses triage module to evaluate documents as
they are found.
▪ Uses tags to flag priority files.
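To decide which files to tag first, the per-document results can be reduced to a single score and sorted. The sketch below is one possible heuristic, not the triage module's actual logic; the `DocSummary` fields, the weights, and the idea that a watch list hit outweighs a concept hit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DocSummary:
    """Hypothetical per-document triage results."""
    name: str
    watch_list_hits: int
    concept_hits: int
    entity_count: int


def priority_score(doc, weights=(10.0, 3.0, 1.0)):
    """Weighted sum of hits; the weights are assumed values, not tuned ones."""
    w_watch, w_concept, w_entity = weights
    return (w_watch * doc.watch_list_hits
            + w_concept * doc.concept_hits
            + w_entity * doc.entity_count)


def rank_documents(docs):
    """Return documents ordered from highest to lowest priority."""
    return sorted(docs, key=priority_score, reverse=True)


if __name__ == "__main__":
    docs = [
        DocSummary("letter.doc", watch_list_hits=0, concept_hits=1, entity_count=2),
        DocSummary("ledger.xls", watch_list_hits=1, concept_hits=0, entity_count=0),
        DocSummary("notes.txt", watch_list_hits=0, concept_hits=0, entity_count=3),
    ]
    for doc in rank_documents(docs):
        print(doc.name, priority_score(doc))
```

The ranked list only suggests an order for the translator; the operator still reviews and tags the files.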
22. 80% Solution
▪ Entity resolution integration.
▪ Topic classifiers.
▪ More advanced analysis relating concepts and
entities.
▪ More advanced interface approaches.