This talk will discuss how Rosette's capabilities (entity extraction, entity searching, document clustering, near-duplicate detection, and fact-relationship-event extraction) can be combined with a powerful search engine to facilitate information discovery and thematic analysis across a variety of sources and languages.
The term “Big Data” has many possible meanings — large volume, fast-moving, many sources — but the issues it creates are clear. Analysts have significantly more data available, but the tools to exploit this data haven’t kept pace.
Many legacy approaches to analytic systems — databases and custom applications around them — are not flexible enough to pull in data from new sources at a moment’s notice, are not able to import and share the new data quickly enough to provide actionable intelligence, and cannot scale up to hold the massive amounts of data being produced.
But even if today’s systems could handle all of the available data, how effectively could an analyst, presented with massive volumes of semi-structured, multilingual data from many sources, discover the relevant data and move it efficiently into the analytical process?
View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides
Big Data Triage with Rosette Human Language Technology Conference
1. Big Data Triage with Text Analytics
Steve Kearns
Director of Product Management
Basis Technology
Basis Technology – Human Language Technology Conference 2012 1
2. Agenda
• What is Big Data?
• Challenges of Big Data
• Text Analytics Technology
• Text Analytics for Big Data Triage
3. What is Big Data?
4. Big Data
• Volume
• Velocity
• Variety
6. Volume
http://mashable.com/2012/06/22/data-created-every-minute/
7. Velocity
• High-Throughput Sources:
– Digital Forensics
• Rapid Site Exploitation
• Many Hard Drives
• Rapidly Changing Sources:
– OSINT
• News
• Social Media
• High Throughput Storage, Analysis, Alerting
8. Variety
• Data Types
– DOMEX/DOCEX/MEDEX/OSINT
– Finished Intel
– Cables
– Intellipedia
– Harmony
– Biometrics
– Watch Lists
– Hard Drive -> File(s) -> Unstructured and Structured Content
– Sensor Data
• Structured / Unstructured
• Textual / Visual / Numeric
9. The Challenge: Finding Value
http://learn-how-to-be-happy.com/wp-content/uploads/2011/08/happy_face.jpg
10. Big Data Problems - Volume
• Where/How do you store it?
– Single database -> database cluster -> Hadoop/HDFS?
• Data quality?
– Manual review or annotation?
– People don’t scale
• Query
– If you can query at all: how fast, how complex, and against what?
– User Interface? SQL? Programming?
– How do you view results?
– Can you filter the results to refine your query?
– Thematic exploration, where the results of one query inform the next
– Security?
11. Big Data Problems - Velocity
• Time sensitive
– Value of information decreases over time
– How long from “publish” to “discoverable”?
• Rapid changes/updates
– Which updates are important?
– Which sources/users are important? Which may become important?
– Individual pieces of data may be meaningless, but what about in aggregate?
– Quality/Verification?
– Manual Review?
12. Big Data Problems - Variety
• Many Sources
– Often stored, formatted, and accessed differently
– Access, security?
– Many languages
– How reliable is each source?
• Few, if any, links
– Between sources
– Between documents
– Between information within documents
13. General Problems
• Computers are great at some things
• Humans are great at others
[Figure labels: 2 + 2; Scale; Human Language]
15. Text Analytics
Automated analytical methods operating on the written word to surface insights about the data.
Its purpose is to assist the human in finding things of relevance and interest.
17. Triage Example
Query: Al Qaeda
Name matches, with match scores:
• al-Qaeda 0.99
• القاعدة (al-Qa'idah) 0.99
• al-Qada 0.91
• al-Qaida 0.91
• Al-Qa'ida 0.91
• Al-Qaïda 0.91
• Al-Shabaab 0.78
• Al-Qaeda Sanctions List 0.74
• وتنظيم القاعدة 0.74
• al-Qaeda in Islamic Maghreb 0.7
Source text (news excerpt):
Baghdad military command spokesman Colonel Dhia al-Wakeel said the attacks bore the hallmarks of al-Qaeda. Thursday was the deadliest day in Iraq since March 20, when shootings and bombings claimed by an al-Qaeda affiliated group killed 50 people and wounded 255 nationwide.
Source text (reference excerpt):
Al-Qaeda has the following direct franchises:
§ Al-Qaeda in the Arabian Peninsula, which comprises Al-Qaeda in Saudi Arabia and Islamic Jihad of Yemen
§ Al-Qaeda in Iraq
§ Al-Qaeda Organization in the Islamic Maghreb
§ Al-Shabaab in Somalia
§ Egyptian Islamic Jihad
§ Libyan Islamic Fighting Group
§ East Turkestan Islamic Movement in Xinjiang, China
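The scored name variants in the triage example can be approximated with a very small fuzzy matcher. The sketch below is not Rosette's algorithm: it assumes plain string similarity (Python's `difflib`) after stripping case, diacritics, and punctuation, and it does not handle cross-script matches such as Arabic to Latin, which would need transliteration. The function names and the 0.7 threshold are illustrative assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and drop punctuation and spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks, hyphens, apostrophes, and spaces are not alphanumeric.
    return "".join(ch for ch in decomposed.lower() if ch.isalnum())


def match_score(query: str, candidate: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()


def rank_variants(query, candidates, threshold=0.7):
    """Return (candidate, score) pairs at or above the threshold, best first."""
    scored = [(c, round(match_score(query, c), 2)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    names = ["al-Qaeda", "al-Qaida", "Al-Qa'ida", "Al-Qaïda", "Al-Shabaab"]
    for name, score in rank_variants("Al Qaeda", names):
        print(f"{name}\t{score}")
```

A production name matcher would typically add transliteration, phonetic rules, and language-specific knowledge; the point here is only that variants are ranked by a score rather than matched exactly.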
18. Text Analytics : Language ID
[Figure: overlapping text samples in three languages, each labeled with its identified language: French, Russian, Japanese]
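As a rough illustration of the first step in language identification, the sketch below guesses the dominant writing system of a text from Unicode character names. This is an assumption-laden simplification, not how Rosette works: script detection can separate Russian, Japanese, and French samples like those on the slide, but it cannot tell apart languages that share a script (French and English, for example), which usually requires statistical models. The label strings are made up for this example.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in `text`."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrillic"] += 1
        elif name.startswith(("HIRAGANA", "KATAKANA", "CJK")):
            counts["Han/Kana"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
        else:
            counts["Other"] += 1
    return counts.most_common(1)[0][0] if counts else "Unknown"


if __name__ == "__main__":
    samples = [
        "Le président nigérian Olusegun Obasanjo a salué cette déclaration",
        "Программное обеспечение позволяет осуществлять поиск слов",
        "送信キーによりまとめて送信する",
    ]
    for sample in samples:
        print(dominant_script(sample), "<-", sample)
```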
19. Text Analytics: Lemmatization
Query: flying
Search Results:
• fly – 132 hits
• flying – 97 hits
• flew – 78 hits
• flown – 61 hits
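The lemmatization slide shows a query for "flying" also retrieving "fly", "flew", and "flown". A minimal sketch of that idea follows, assuming a tiny hand-written lemma table (`LEMMAS` is hypothetical; a real lemmatizer uses a full morphological dictionary per language). Both documents and queries are reduced to their lemma before lookup, so any inflected form finds all the others.

```python
from collections import defaultdict

# Hypothetical lemma table covering only the forms from the slide.
LEMMAS = {"fly": "fly", "flying": "fly", "flew": "fly", "flown": "fly", "flies": "fly"}


def lemma_of(token: str) -> str:
    """Map a token to its lemma, falling back to the lowercased token."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_lemma_index(docs: dict) -> dict:
    """Build an inverted index from lemma to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemma_of(token.strip(".,;:!?"))].add(doc_id)
    return index


def search(index: dict, query: str) -> list:
    """Return sorted ids of documents containing any form of the query word."""
    return sorted(index.get(lemma_of(query), set()))


if __name__ == "__main__":
    docs = {
        1: "The plane flew north.",
        2: "Birds are flying.",
        3: "He has flown before.",
        4: "The car drove.",
    }
    print(search(build_lemma_index(docs), "flying"))
```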
26. Big Data Processing
Collect
• Identify data sources
• Data cleansing
• Move data into analysis repository
Analyze
• Identify Entities, Facts, Relationships
• Link between Documents
• Link fact/entity between documents
Index
• Keyword search + metadata filters
• Thematic exploration – using metadata
• Cross-document links
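The three stages on this slide (collect, analyze, index) can be sketched end to end in a few dozen lines. Everything below is illustrative: the `GAZETTEER` lookup stands in for real entity extraction, and the `Document` and `Index` classes are hypothetical names, not part of any product mentioned in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field

PUNCT = ".,;:!?\"'"
# Hypothetical gazetteer standing in for a real entity extractor.
GAZETTEER = {"baghdad": "LOCATION", "iraq": "LOCATION", "al-qaeda": "ORGANIZATION"}


@dataclass
class Document:
    doc_id: str
    text: str
    entities: set = field(default_factory=set)


def collect(raw: dict):
    """Collect: normalize whitespace and drop empty documents."""
    for doc_id, text in raw.items():
        cleaned = " ".join(text.split())
        if cleaned:
            yield Document(doc_id, cleaned)


def analyze(doc: Document) -> Document:
    """Analyze: attach (name, type) entity metadata found via the gazetteer."""
    for token in doc.text.lower().split():
        token = token.strip(PUNCT)
        if token in GAZETTEER:
            doc.entities.add((token, GAZETTEER[token]))
    return doc


class Index:
    """Index: keyword search, entity filters, and cross-document links."""

    def __init__(self):
        self.keywords = defaultdict(set)
        self.entities = defaultdict(set)

    def add(self, doc: Document) -> None:
        for token in doc.text.lower().split():
            self.keywords[token.strip(PUNCT)].add(doc.doc_id)
        for name, _type in doc.entities:
            self.entities[name].add(doc.doc_id)

    def search(self, keyword: str, entity: str = None) -> list:
        hits = set(self.keywords.get(keyword.lower(), set()))
        if entity is not None:
            hits &= self.entities.get(entity.lower(), set())
        return sorted(hits)

    def linked(self, doc_id: str) -> list:
        """Documents sharing at least one entity with `doc_id`."""
        related = set()
        for ids in self.entities.values():
            if doc_id in ids:
                related |= ids
        related.discard(doc_id)
        return sorted(related)
```

The entity metadata produced in the analyze stage is what makes the filter in `search` and the links in `linked` possible; keyword indexing alone would support neither.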
27. Big Data Processing - Technology
Collect
• Source: News, Twitter, Database, file system, digital forensics, etc.
• Storage: HDFS, MongoDB, SQL, etc.
Analyze
• Platform: Hadoop, UIMA, Odyssey, Custom
• Analysis type: Language ID, Entity Extraction, Relationship Extraction, Document Clustering, Entity Linking
Index
• Fulltext Search: Solr, Accumulo, Lucene
• Structured Data: RDF, SQL, OrientDB, Neo4j, Cassandra, HDFS, etc.
28. Big Data Triage Requirements
• View results while still processing
– Incremental collection/analysis/indexing
• User Interface that allows exploration
– Dashboard
– Keyword Search
– Geo Search
– Entity Search
• Enables thematic exploration
– Metadata produced by Analysis makes this easier
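The requirements above (results visible while processing continues, plus metadata-driven exploration) can be illustrated with a small in-memory sketch. It assumes each incoming item already carries analysis metadata as a dictionary; `IncrementalIndex`, `ingest`, and the metadata keys are hypothetical names chosen for this example.

```python
from collections import Counter


class IncrementalIndex:
    """Searchable after every add, so partial results are always available."""

    def __init__(self):
        self.docs = []  # list of (lowercased text, metadata dict)

    def add(self, text: str, metadata: dict) -> None:
        self.docs.append((text.lower(), metadata))

    def search(self, keyword: str, **filters) -> list:
        """Keyword search narrowed by exact-match metadata filters."""
        keyword = keyword.lower()
        return [
            meta
            for text, meta in self.docs
            if keyword in text.split()
            and all(meta.get(key) == value for key, value in filters.items())
        ]

    def facets(self, keyword: str, field: str) -> Counter:
        """Count metadata values among the hits, to suggest the next query."""
        return Counter(meta[field] for meta in self.search(keyword) if field in meta)


def ingest(stream, index: IncrementalIndex):
    """Index one item at a time, yielding the running document count."""
    for text, metadata in stream:
        index.add(text, metadata)
        yield len(index.docs)
```

Because `ingest` is a generator, a caller can interleave indexing with queries: search after the first item, inspect `facets` to pick a filter, and keep ingesting. That loop, where the results of one query inform the next, is the thematic exploration the slide describes.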
33. Entity Search – Cross Language
34. Search/Filter/Explore
http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
35. Summary
Text Analytics enables Big Data Triage
36. Thank You!
For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090 or 800-697-2062