Big Data Triage with Text Analytics
Steve Kearns
Director of Product Management
Basis Technology
Basis Technology – Human Language Technology Conference 2012   1
Agenda

•    What is Big Data?
•    Challenges of Big Data
•    Text Analytics Technology
•    Text Analytics for Big Data Triage




What is Big Data?




Big Data



 •  Volume

 •  Velocity

 •  Variety




Volume




Volume




http://mashable.com/2012/06/22/data-created-every-minute/
Velocity

 •  High-Throughput Sources:
      –  Digital Forensics
            •  Rapid Site Exploitation
            •  Many Hard Drives
 •  Rapidly Changing Sources:
      –  OSINT
            •  News
            •  Social Media
 •  High Throughput Storage, Analysis, Alerting



Variety

 •  Data Types
      –    DOMEX/DOCEX/MEDEX/OSINT
      –    Finished Intel
      –    Cables
      –    Intellipedia
      –    Harmony
      –    Biometrics
      –    Watch Lists
      –    Hard Drive -> File(s) -> Unstructured and Structured Content
      –    Sensor Data
 •  Structured / Unstructured
 •  Textual / Visual / Numeric

The Challenge: Finding Value




http://learn-how-to-be-happy.com/wp-content/uploads/2011/08/happy_face.jpg
Big Data Problems - Volume

 •  Where/How do you store it?
      –  Single database -> database cluster -> Hadoop/HDFS?
 •  Data quality?
      –  Manual review or annotation?
      –  People don’t scale
 •  Query
      –    Can you query at all? If so, how fast, how complex, and on what data?
      –    User Interface? SQL? Programming?
      –    How do you view results?
      –    Can you filter the results to refine your query?
      –    Thematic exploration, where the results of one query inform the next
      –    Security?

Big Data Problems - Velocity

 •  Time sensitive
      –  Value of information decreases over time
      –  How long from “publish” to “discoverable”?
 •  Rapid changes/updates
      –  Which updates are important?
      –  Which sources/users are important? Which may become important?
      –  Individual pieces of data may be meaningless, but what about in
         aggregate?
      –  Quality/Verification?
      –  Manual Review?




Big Data Problems - Variety

 •  Many Sources
      –    Often stored, formatted, and accessed differently
      –    Access, security?
      –    Many languages
      –    How reliable is each source?
 •  Few, if any, links
      –  Between sources
      –  Between documents
      –  Between information within documents




General Problems

 •  Computers are great at some things
 •  Humans are great at others




   2 + 2
   Scale
   Human Language
  
Text Analytics




Text Analytics


            Automated analytical methods
            operating on the written word to
            surface insights about the data.

     Its purpose is to assist the human in
          finding things of relevance and
                      interest.


Text Analytics techniques




Triage Example



Query: Al Qaeda

Sample documents:

   Baghdad military command spokesman Colonel Dhia al-Wakeel said the
   attacks bore the hallmarks of al-Qaeda. Thursday was the deadliest day
   in Iraq since March 20, when shootings and bombings claimed by an
   al-Qaeda affiliated group killed 50 people and wounded 255 nationwide.

   Al-Qaeda has the following direct franchises:
    §  Al-Qaeda in the Arabian Peninsula, which comprises Al-Qaeda in
       Saudi Arabia and Islamic Jihad of Yemen
    §  al-Qaeda in Iraq
    §  Al-Qaeda Organization in the Islamic Maghreb
    §  Al-Shabaab in Somalia
    §  Egyptian Islamic Jihad
    §  Libyan Islamic Fighting Group
    §  East Turkestan Islamic Movement in Xinjiang, China

Name matches (score):

   al-Qaeda                       0.99
   القاعدة (al-Qa'idah)           0.99
   al-Qada                        0.91
   al-Qaida                       0.91
   Al-Qa'ida                      0.91
   Al-Qaïda                       0.91
   Al-Qaeda Sanctions List        0.74
   Al-Qaïda Libyenne              0.74
   وتنظيم القاعدة                 0.74
   al-Qaeda in Islamic Maghreb    0.7
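The scored matches on this slide can be approximated with a simple string-similarity sketch. The slides do not say how the scores are computed, so everything here is an assumption made for illustration: the function names `normalize`, `similarity`, and `rank` are hypothetical, and the scoring uses Python's standard `difflib.SequenceMatcher` rather than whatever the product uses. The numbers it produces will not match those on the slide.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents, and drop hyphens, apostrophes, and spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks.lower() if c.isalnum())


def similarity(query: str, candidate: str) -> float:
    """Similarity in [0, 1] between the normalized forms of two names."""
    return SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()


def rank(query: str, candidates: list, threshold: float = 0.7) -> list:
    """Return (candidate, score) pairs at or above the threshold, best first."""
    scored = [(c, round(similarity(query, c), 2)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


# Illustrative candidates taken from the spelling variants on the slide.
variants = ["al-Qaeda", "Al-Qaïda", "al-Qada", "Al-Shabaab"]
matches = rank("Al Qaeda", variants)
```

This sketch only compares Latin-script spellings. Matching the Arabic-script forms shown on the slide would additionally need transliteration, which is outside what this example attempts.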
  


Text Analytics : Language ID


[Figure: a mixed set of French, Russian, and Japanese documents is sorted into three groups by detected language.]

French
   La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg
   constituait un véritable changement dans la stratégie agricole de
   l'Europe, tandis que l'Irlande y a vu un gage de stabilité et de
   sécurité pour les agriculteurs.

Russian
   Программное обеспечение Basis Technology позволяет осуществлять поиск
   слов с близкими значениями, а также транслитерировать арабские и
   фарси-буквы в латинские.

Japanese
   FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。
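A rough sketch of how documents like these might be sorted by language follows. It is not the method used in the product shown. It assumes only three target languages, guesses from the dominant Unicode script, and uses a small hand-picked French word list (`FRENCH_HINTS`, hypothetical) to tell French from other Latin-script text. Production language identification typically relies on statistical models instead.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical hint list; a real system would use a trained model.
FRENCH_HINTS = {"le", "la", "les", "de", "du", "des", "que", "un", "une", "et"}


def script_of(ch: str) -> Optional[str]:
    """Map a character to a coarse script label using its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED")):
        # Assumption: in this three-language setting, CJK text is Japanese.
        return "Japanese"
    if name.startswith("CYRILLIC"):
        # Assumption: Cyrillic text is Russian.
        return "Russian"
    if name.startswith("LATIN"):
        return "Latin"
    return None


def guess_language(text: str) -> str:
    """Guess the language from the dominant script, then French word hints."""
    counts = Counter(s for s in map(script_of, text) if s)
    if not counts:
        return "unknown"
    script = counts.most_common(1)[0][0]
    if script != "Latin":
        return script
    words = {w.strip(".,;:\"'()").lower() for w in text.split()}
    return "French" if len(words & FRENCH_HINTS) >= 2 else "other Latin-script"


french = guess_language("La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg")
russian = guess_language("Программное обеспечение Basis Technology позволяет осуществлять поиск слов")
japanese = guess_language("FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。")
```

The Russian and Japanese samples both contain Latin letters ("Basis Technology", "FNP"), which is why the sketch counts characters per script and takes the majority rather than looking at the first character only.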




Text Analytics: Lemmatization



Search: flying

Results:
   fly       132 hits
   flying     97 hits
   flew       78 hits
   flown      61 hits
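The idea on this slide, where a search for one word form returns hits for every form that shares its lemma, can be sketched as below. The lemma table `LEMMAS` and the sample documents are hypothetical, and a real lemmatizer would use morphological analysis rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"fly": "fly", "flying": "fly", "flew": "fly", "flown": "fly"}


def lemma(token: str) -> str:
    """Return the dictionary form of a token, or the token itself if unknown."""
    return LEMMAS.get(token.lower(), token.lower())


def build_index(docs: dict) -> dict:
    """Map lemma -> surface form -> ids of documents containing that form."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for token in text.split():
            word = token.strip(".,;:!?\"'").lower()
            if word:
                index[lemma(word)][word].add(doc_id)
    return index


def search(index: dict, query: str) -> dict:
    """Return the hit count for each surface form sharing the query's lemma."""
    forms = index.get(lemma(query), {})
    return {form: len(ids) for form, ids in forms.items()}


docs = {
    "d1": "Birds fly south.",
    "d2": "The pilot flew home after flying all night.",
    "d3": "She has flown before and will fly again.",
}
index = build_index(docs)
hits = search(index, "flying")
```

Here a query for "flying" also surfaces documents that contain only "fly", "flew", or "flown", which plain keyword matching would miss.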




Text Analytics: Lemmatization (Arabic)

Search: فجر (Detonated)

Results:
   وتفجيرها    132 hits
   متفجرات      77 hits
   تفجيرات      32 hits
   فجرها        22 hits
   تفجرت         2 hits




Text Analytics: Entity Extraction




Text Analytics: Relationship Extraction




Text Analytics: Entity Search




Text Analytics: Document Clustering




Big Data Triage
Text Analytics
Big Data Processing

 •  Collect
      –  Identify data sources
      –  Data cleansing
      –  Move data into analysis repository
 •  Analyze
      –  Identify Entities, Facts, Relationships
      –  Link between Documents
      –  Link fact/entity between documents
 •  Index
      –  Keyword search + metadata filters
      –  Thematic exploration using metadata
      –  Cross-document links
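The three stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the architecture from the talk: `KNOWN_ENTITIES` is a hypothetical gazetteer standing in for a real entity extractor, whitespace normalization stands in for data cleansing, and all function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer standing in for a real entity extractor.
KNOWN_ENTITIES = {
    "Baghdad": "LOCATION",
    "Iraq": "LOCATION",
    "Dhia al-Wakeel": "PERSON",
}


def collect(raw_docs: dict) -> dict:
    """Collect: normalize whitespace as a stand-in for data cleansing."""
    return {doc_id: re.sub(r"\s+", " ", text).strip()
            for doc_id, text in raw_docs.items()}


def analyze(docs: dict):
    """Analyze: tag known entities and link documents that share an entity."""
    entities = {doc_id: {name for name in KNOWN_ENTITIES if name in text}
                for doc_id, text in docs.items()}
    links = defaultdict(set)  # entity name -> ids of documents mentioning it
    for doc_id, names in entities.items():
        for name in names:
            links[name].add(doc_id)
    return entities, links


def index(docs: dict, entities: dict):
    """Index: keyword postings plus entity-type metadata per document."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            postings[word].add(doc_id)
    metadata = {doc_id: {KNOWN_ENTITIES[n] for n in names}
                for doc_id, names in entities.items()}
    return postings, metadata


def search(postings, metadata, keyword, entity_type=None):
    """Keyword search, optionally filtered by an entity type in the metadata."""
    hits = postings.get(keyword.lower(), set())
    if entity_type is not None:
        hits = {d for d in hits if entity_type in metadata[d]}
    return sorted(hits)


raw = {
    "a": "Attacks  in Baghdad\nkilled 50 people.",
    "b": "Colonel Dhia al-Wakeel spoke in Baghdad.",
    "c": "Bombings were reported nationwide.",
}
docs = collect(raw)
entities, links = analyze(docs)
postings, metadata = index(docs, entities)
```

With this data, `links["Baghdad"]` connects documents "a" and "b" (a cross-document link), and `search(postings, metadata, "baghdad", entity_type="PERSON")` narrows the keyword hits to the document that also mentions a person.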
  

Big Data Processing - Technology


 •  Collect
      –  Source: News, Twitter, Database, file system, digital forensics, etc.
      –  Storage: HDFS, MongoDB, SQL, etc.
 •  Analyze
      –  Platform: Hadoop, UIMA, Odyssey, Custom
      –  Analysis type: Language ID, Entity Extraction, Relationship
         Extraction, Document Clustering, Entity Linking
 •  Index
      –  Fulltext Search: Solr, Accumulo, Lucene
      –  Structured Data: RDF, SQL, OrientDB, Neo4j, Cassandra, HDFS, etc.
  



Big Data Triage Requirements

 •  View results while still processing
      –  Incremental collection/analysis/indexing
 •  User Interface that allows exploration
      –    Dashboard
      –    Keyword Search
      –    Geo Search
      –    Entity Search
            •  Enables thematic exploration
      –  Metadata produced by Analysis makes this easier
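The first requirement, viewing results while processing is still under way, can be illustrated with a small in-memory index that is searchable between batches. This is a minimal sketch under assumptions: the class `IncrementalIndex`, the function `ingest`, and the batch size are all hypothetical, and a real deployment would use a search engine rather than a Python dictionary.

```python
from collections import defaultdict


class IncrementalIndex:
    """Inverted index that can be queried between document arrivals."""

    def __init__(self):
        self._postings = defaultdict(set)
        self.doc_count = 0

    def add(self, doc_id: str, text: str) -> None:
        """Index one document immediately so it becomes searchable."""
        for word in text.lower().split():
            self._postings[word.strip(".,;:!?")].add(doc_id)
        self.doc_count += 1

    def search(self, keyword: str) -> list:
        """Return ids of the documents indexed so far that contain keyword."""
        return sorted(self._postings.get(keyword.lower(), set()))


def ingest(index: IncrementalIndex, stream, batch_size: int = 2):
    """Index documents in batches, yielding the running count after each one
    so the caller can query partial results before the stream is finished."""
    pending = 0
    for doc_id, text in stream:
        index.add(doc_id, text)
        pending += 1
        if pending == batch_size:
            yield index.doc_count
            pending = 0
    if pending:
        yield index.doc_count


stream = [
    ("d1", "Bombings in Baghdad."),
    ("d2", "Markets reopened."),
    ("d3", "Baghdad curfew lifted."),
]
idx = IncrementalIndex()
# Query the index after every batch, while ingestion is still in progress.
snapshots = [(count, idx.search("baghdad")) for count in ingest(idx, stream)]
```

The first snapshot already returns "d1" after two documents, and the second adds "d3" once the rest of the stream has been indexed.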




Dashboard




Search and Filter




Foreign Language Search




Detailed Document View




Entity Search – Cross Language




Search/Filter/Explore




http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
Summary




          Text Analytics enables Big Data Triage
  

Thank You!



For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090 or 800-697-2062


More Related Content

Viewers also liked

Multilingual Search and Text Analytics with Solr - Open Source Search Conference
Multilingual Search and Text Analytics with Solr - Open Source Search ConferenceMultilingual Search and Text Analytics with Solr - Open Source Search Conference
Multilingual Search and Text Analytics with Solr - Open Source Search ConferenceBasis Technology
 
Gregor Stewart - OSIRA 2014
Gregor Stewart - OSIRA 2014Gregor Stewart - OSIRA 2014
Gregor Stewart - OSIRA 2014Basis Technology
 
Optimizing multilingual search in SOLR
Optimizing multilingual search in SOLROptimizing multilingual search in SOLR
Optimizing multilingual search in SOLRBasis Technology
 
Middeware2012 crowd
Middeware2012 crowdMiddeware2012 crowd
Middeware2012 crowdmjfrankli
 
Datafication of HR - Employee Benefits Live 2013
Datafication of HR - Employee Benefits Live 2013Datafication of HR - Employee Benefits Live 2013
Datafication of HR - Employee Benefits Live 2013Jon Ingham
 
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Francesco D'Orazio
 
Big Data in Human Resources
Big Data in Human ResourcesBig Data in Human Resources
Big Data in Human ResourcesMatthias Vallaey
 
C2 empowering modern human resources and talent management in the cloud
C2   empowering modern human resources and talent management in the cloudC2   empowering modern human resources and talent management in the cloud
C2 empowering modern human resources and talent management in the cloudDr. Wilfred Lin (Ph.D.)
 
Managing human resources at data centers 1.0
Managing human resources at data centers 1.0Managing human resources at data centers 1.0
Managing human resources at data centers 1.0aqel aqel
 
Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Steve Pell
 
HR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskHR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskLBi Software
 
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarBig Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarH3 HR Advisors, Inc.
 
Data Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRData Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRJosh Bersin
 

Viewers also liked (19)

Multilingual Search and Text Analytics with Solr - Open Source Search Conference
Multilingual Search and Text Analytics with Solr - Open Source Search ConferenceMultilingual Search and Text Analytics with Solr - Open Source Search Conference
Multilingual Search and Text Analytics with Solr - Open Source Search Conference
 
ASTMH Centennial Celebration PPT slides
ASTMH Centennial Celebration PPT slidesASTMH Centennial Celebration PPT slides
ASTMH Centennial Celebration PPT slides
 
Gregor Stewart - OSIRA 2014
Gregor Stewart - OSIRA 2014Gregor Stewart - OSIRA 2014
Gregor Stewart - OSIRA 2014
 
Optimizing multilingual search in SOLR
Optimizing multilingual search in SOLROptimizing multilingual search in SOLR
Optimizing multilingual search in SOLR
 
Making the Case for Global Public Health: Your Role in Changing the Conversat...
Making the Case for Global Public Health: Your Role in Changing the Conversat...Making the Case for Global Public Health: Your Role in Changing the Conversat...
Making the Case for Global Public Health: Your Role in Changing the Conversat...
 
Big data and hr
Big data and hrBig data and hr
Big data and hr
 
Middeware2012 crowd
Middeware2012 crowdMiddeware2012 crowd
Middeware2012 crowd
 
Clinical Manifestation and Pathogenesis of Obligately Intracellular Bacterial...
Clinical Manifestation and Pathogenesis of Obligately Intracellular Bacterial...Clinical Manifestation and Pathogenesis of Obligately Intracellular Bacterial...
Clinical Manifestation and Pathogenesis of Obligately Intracellular Bacterial...
 
Datafication of HR - Employee Benefits Live 2013
Datafication of HR - Employee Benefits Live 2013Datafication of HR - Employee Benefits Live 2013
Datafication of HR - Employee Benefits Live 2013
 
Patricia Walker_NARCH Keynote_June 2016
Patricia Walker_NARCH Keynote_June 2016Patricia Walker_NARCH Keynote_June 2016
Patricia Walker_NARCH Keynote_June 2016
 
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
 
Big Data in Human Resources
Big Data in Human ResourcesBig Data in Human Resources
Big Data in Human Resources
 
C2 empowering modern human resources and talent management in the cloud
C2   empowering modern human resources and talent management in the cloudC2   empowering modern human resources and talent management in the cloud
C2 empowering modern human resources and talent management in the cloud
 
Kenya Higgs 2.9.2016
Kenya Higgs 2.9.2016Kenya Higgs 2.9.2016
Kenya Higgs 2.9.2016
 
Managing human resources at data centers 1.0
Managing human resources at data centers 1.0Managing human resources at data centers 1.0
Managing human resources at data centers 1.0
 
Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss? Big data in HR: Why all the fuss?
Big data in HR: Why all the fuss?
 
HR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDeskHR Analytics and KPIs with LBi HR HelpDesk
HR Analytics and KPIs with LBi HR HelpDesk
 
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive WebinarBig Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
Big Trends in HR Tech for 2014 and Beyond - Human Resource Executive Webinar
 
Data Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HRData Science and Analytics in Human Resources - Moneyball comes to HR
Data Science and Analytics in Human Resources - Moneyball comes to HR
 

More from Basis Technology

Product Update: Customization with Rosette
Product Update: Customization with RosetteProduct Update: Customization with Rosette
Product Update: Customization with RosetteBasis Technology
 
Smart Matching for Screening Webinar - May 2020
Smart Matching for Screening Webinar - May 2020Smart Matching for Screening Webinar - May 2020
Smart Matching for Screening Webinar - May 2020Basis Technology
 
Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020Understanding Names with Neural Networks - May 2020
Understanding Names with Neural Networks - May 2020Basis Technology
 
Rosette Product Update (May 2019)
Rosette Product Update (May 2019)Rosette Product Update (May 2019)
Rosette Product Update (May 2019)Basis Technology
 
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadSimple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadBasis Technology
 
Basis Technology showcase at elasticsearch meetup in Japan
Basis Technology showcase at elasticsearch meetup in JapanBasis Technology showcase at elasticsearch meetup in Japan
Basis Technology showcase at elasticsearch meetup in JapanBasis Technology
 
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian Carrier
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian CarrierOSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian Carrier
OSDF 2013 - Autopsy 3: Extensible Desktop Forensics by Brian CarrierBasis Technology
 
HLT 2013 - Triaging Foreign Language Documents for MEDEX by Brian Carrier




Big Data Triage with Rosette Human Language Technology Conference

  • 1. Big Data Triage with Text Analytics Steve Kearns Director of Product Management Basis Technology Basis Technology – Human Language Technology Conference 2012 1
  • 2. Agenda •  What is Big Data? •  Challenges of Big Data •  Text Analytics Technology •  Text Analytics for Big Data Triage
  • 3. What is Big Data?
  • 4. Big Data •  Volume •  Velocity •  Variety
  • 5. Volume
  • 6. Volume http://mashable.com/2012/06/22/data-created-every-minute/
  • 7. Velocity •  High-Throughput Sources: –  Digital Forensics •  Rapid Site Exploitation •  Many Hard Drives •  Rapidly Changing Sources: –  OSINT •  News •  Social Media •  High Throughput Storage, Analysis, Alerting
  • 8. Variety •  Data Types –  DOMEX/DOCEX/MEDEX/OSINT –  Finished Intel –  Cables –  Intellipedia –  Harmony –  Biometrics –  Watch Lists –  Hard Drive -> File(s) -> Unstructured and Structured Content –  Sensor Data •  Structured / Unstructured •  Textual / Visual / Numeric
  • 9. The Challenge: Finding Value http://learn-how-to-be-happy.com/wp-content/uploads/2011/08/happy_face.jpg
  • 10. Big Data Problems - Volume •  Where/How do you store it? –  Single database -> database cluster -> Hadoop/HDFS? •  Data quality? –  Manual review or annotation? –  People don’t scale •  Query –  If you can, how fast, how complex and on what can you query? –  User Interface? SQL? Programming? –  How do you view results? –  Can you filter the results to refine your query? –  Thematic exploration, where the results of one query inform the next –  Security?
  • 11. Big Data Problems - Velocity •  Time sensitive –  Value of information decreases over time –  How long from “publish” to “discoverable”? •  Rapid changes/updates –  Which updates are important? –  Which sources/users are important? Which may become important? –  Individual pieces of data may be meaningless, but what about in aggregate? –  Quality/Verification? –  Manual Review?
  • 12. Big Data Problems - Variety •  Many Sources –  Often stored, formatted, and accessed differently –  Access, security? –  Many languages –  How reliable is each source? •  Few, if any, links –  Between sources –  Between documents –  Between information within documents
  • 13. General Problems •  Computers are great at some things •  Humans are great at others (figure labels: 2 + 2, Scale, Human Language)
  • 14. Text Analytics
  • 15. Text Analytics Automated analytical methods operating on the written word to surface insights about the data. Its purpose is to assist the human in finding things of relevance and interest.
  • 16. Text Analytics techniques
  • 17. Triage Example. Query: Al Qaeda. Name matches (score): al-Qaeda (0.99); [Arabic-script form, lost in extraction] (tal-Qa'idah) (0.99); Al -Qaeda (0.99); [Arabic-script form, lost in extraction] (al-Qa'idah) (0.99); al-Qada (0.91); al-Qaida (0.91); Al-Qa'ida (0.91); Al-Qaïda (0.91); al-Qaida Africa (0.78); Al-Qaeda Sanctions List (0.74); Al-Qaïda Libyenne (0.74); وتنظيم القاعدة (0.74); al-Qaeda in Islamic Maghreb (0.7). Matched text 1: "Baghdad military command spokesman Colonel Dhia al-Wakeel said the attacks bore the hallmarks of al-Qaeda. Thursday was the deadliest day in Iraq since March 20, when shootings and bombings claimed by an al-Qaeda affiliated group killed 50 people and wounded 255 nationwide." Matched text 2: "Al-Qaeda has the following direct franchises: §  Al-Qaeda in the Arabian Peninsula, which comprises Al Qaeda in Saudi Arabia and Islamic Jihad of Yemen §  Al-Qaeda in Iraq §  Al-Qaeda Organization in the Islamic Maghreb §  Al-Shabaab in Somalia §  Egyptian Islamic Jihad §  Libyan Islamic Fighting Group §  East Turkestan Islamic Movement in Xinjiang, China"
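One way to picture the ranked name matches on this slide is a normalize-then-compare scorer. The sketch below is a minimal illustration, not the matcher behind the slide: `normalize`, `score`, and `rank` are hypothetical helpers, the similarity measure is Python's `difflib.SequenceMatcher`, and the scores it produces will not equal the values listed on the slide. It handles Latin-script variants only; matching Arabic-script forms would need transliteration, which is left out here.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip diacritics, and drop spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]", "", no_marks.lower())


def score(query, candidate):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()


def rank(query, candidates):
    """Return (candidate, score) pairs, best match first."""
    scored = [(c, score(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, s in rank("Al Qaeda", ["Al-Shabaab", "al-Qaeda", "Al-Qaïda"]):
        print(f"{name}\t{s:.2f}")
```

Spelling variants that differ only in case, hyphenation, or accents collapse to the same normalized string, so they score highest; unrelated names that merely share a prefix fall toward the bottom of the list.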
  • 18. Text Analytics: Language ID (figure: a page of interleaved French, Russian, and Japanese passages, each labeled with its detected language: French, Russian, Japanese)
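As a rough illustration of what language identification does with a mixed-language page like the one on this slide, the sketch below first looks at the dominant Unicode script and then, for Latin-script text, counts stopwords. Everything here is an assumption for illustration: the `STOPWORDS` lists are tiny and hand-written, Cyrillic is mapped straight to Russian, and kana or CJK text is mapped straight to Japanese, which a production system would not do.

```python
import unicodedata

# Hypothetical stopword lists, far smaller than a real model would need.
STOPWORDS = {
    "fr": {"le", "la", "les", "de", "du", "des", "et", "que", "un", "une"},
    "en": {"the", "of", "and", "to", "in", "is", "that", "was"},
}


def dominant_script(text):
    """Return the most common script among the alphabetic characters."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Character names look like 'CYRILLIC SMALL LETTER A'.
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"


def guess_language(text):
    """Guess a language code from script first, then from stopwords."""
    script = dominant_script(text)
    if script == "CYRILLIC":
        return "ru"  # assumption: other Cyrillic-script languages are ignored
    if script in {"HIRAGANA", "KATAKANA", "CJK"}:
        return "ja"  # assumption: Chinese is not distinguished from Japanese
    if script == "ARABIC":
        return "ar"
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


if __name__ == "__main__":
    for sample in ["les présidents de quatre des cinq pays",
                   "В данный момент правительство США",
                   "私ごとになりますが"]:
        print(guess_language(sample), sample)
```

Script alone separates Russian and Japanese from French here; the stopword step is only needed to tell Latin-script languages apart, and it is unreliable on very short inputs.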
  • 19. Text Analytics: Lemmatization. Search: flying. Results: fly (132 hits), flying (97 hits), flew (78 hits), flown (61 hits)
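The effect shown on this slide, where a search for one word form also returns the other forms of the same verb, can be sketched by indexing documents under lemmas instead of surface forms. This is a minimal sketch under stated assumptions: `LEMMAS` is a hypothetical hand-written table (a real lemmatizer derives lemmas from morphological analysis), and the toy corpus is invented, so the hit counts on the slide are not reproduced.

```python
from collections import defaultdict

# Hypothetical lemma table covering only the forms from the slide.
LEMMAS = {"fly": "fly", "flies": "fly", "flying": "fly",
          "flew": "fly", "flown": "fly"}


def lemma_of(token):
    """Return the lemma for a token, falling back to the lowercased token."""
    t = token.lower()
    return LEMMAS.get(t, t)


def build_index(docs):
    """Map each lemma to the ids of documents containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemma_of(token.strip(".,;:!?"))].add(doc_id)
    return index


def search(index, query):
    """Look up documents by the lemma of a single query word."""
    return sorted(index.get(lemma_of(query), set()))


DOCS = {
    1: "The drone flew over the border.",
    2: "Two aircraft were flying in formation.",
    3: "The cargo was flown out on Tuesday.",
    4: "The convoy drove north.",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "flying"))
```

Because both the documents and the query are reduced to lemmas, a query for "flying" reaches documents that only contain "flew" or "flown", which a plain keyword match would miss.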
  • 20. Text Analytics: Lemmatization (Arabic). Search: فجر (Detonated). Results: وتفجيرها (132 hits), متفجرات (77 hits), تفجيرات (32 hits), فجرها (22 hits), تفجرت (2 hits)
  • 21. Text Analytics: Entity Extraction
  • 22. Text Analytics: Relationship Extraction
  • 23. Text Analytics: Entity Search
  • 24. Text Analytics: Document Clustering
  • 25. Big Data Triage Text Analytics
  • 26. Big Data Processing. Collect: •  Identify data sources •  Data cleansing •  Move data into analysis repository. Analyze: •  Identify Entities, Facts, Relationships •  Link between Documents •  Link fact/entity between documents. Index: •  Keyword search + metadata filters •  Thematic exploration – using metadata •  Cross-document links
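The three stages on this slide can be sketched as chained generators. This is only an illustration of the data flow, under loud assumptions: `collect`, `analyze`, and `index` are hypothetical functions, the "entity extraction" is a crude capitalized-word regex that also picks up sentence-initial words, and the sample documents are invented.

```python
import re
from collections import defaultdict


def collect(raw_docs):
    """Collect: cleanse whitespace and drop empty documents."""
    for doc_id, text in raw_docs:
        cleaned = " ".join(text.split())
        if cleaned:
            yield doc_id, cleaned


def analyze(docs):
    """Analyze: attach capitalized words as stand-in 'entities'."""
    for doc_id, text in docs:
        entities = set(re.findall(r"\b[A-Z][a-z]+\b", text))
        yield doc_id, text, entities


def index(analyzed):
    """Index: keyword postings plus entity-to-document links."""
    keywords, entity_docs = defaultdict(set), defaultdict(set)
    for doc_id, text, entities in analyzed:
        for word in re.findall(r"\w+", text.lower()):
            keywords[word].add(doc_id)
        for ent in entities:
            entity_docs[ent].add(doc_id)
    return keywords, entity_docs


RAW = [
    ("d1", "Bombings  in Baghdad   wounded dozens."),
    ("d2", "   "),
    ("d3", "Officials in Baghdad blamed an affiliated group."),
]
KEYWORDS, ENTITY_DOCS = index(analyze(collect(RAW)))
# Entities found in more than one document act as cross-document links.
LINKS = {e: sorted(d) for e, d in ENTITY_DOCS.items() if len(d) > 1}

if __name__ == "__main__":
    print(LINKS)
```

Keeping each stage a generator means documents flow through one at a time, so the same shape works whether the source is a handful of strings or a stream read from a file system.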
  • 27. Big Data Processing - Technology. Collect: •  Source: News, Twitter, Database, file system, digital forensics, etc. •  Storage: HDFS, MongoDB, SQL, etc. Analyze: •  Platform: Hadoop, UIMA, Odyssey, Custom •  Analysis type: Language ID, Entity Extraction, Relationship Extraction, Document Clustering, Entity Linking. Index: •  Fulltext Search: Solr, Accumulo, Lucene •  Structured Data: RDF, SQL, OrientDB, Neo4j, Cassandra, HDFS, etc.
  • 28. Big Data Triage Requirements •  View results while still processing –  Incremental collection/analysis/indexing •  User Interface that allows exploration –  Dashboard –  Keyword Search –  Geo Search –  Entity Search •  Enables thematic exploration –  Metadata produced by Analysis makes this easier
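Two of the requirements on this slide, viewing results while processing continues and exploring through metadata, can be illustrated with a toy in-memory index. `TriageIndex` and its methods are hypothetical names invented for this sketch, and the metadata fields (`lang`, `source`) are assumed examples of what an analysis stage might produce.

```python
from collections import Counter


class TriageIndex:
    """Toy index that stays searchable while documents are still arriving."""

    def __init__(self):
        self.docs = []

    def add(self, text, **metadata):
        """Ingest one document with analysis-produced metadata."""
        self.docs.append({"text": text, "meta": metadata})

    def search(self, keyword, **filters):
        """Keyword search narrowed by exact-match metadata filters."""
        hits = []
        for doc in self.docs:
            if keyword.lower() not in doc["text"].lower():
                continue
            if all(doc["meta"].get(k) == v for k, v in filters.items()):
                hits.append(doc)
        return hits

    def facets(self, hits, field):
        """Count metadata values across hits to suggest the next filter."""
        return Counter(doc["meta"].get(field) for doc in hits)


if __name__ == "__main__":
    idx = TriageIndex()
    idx.add("attack in Baghdad", lang="en", source="news")
    print(len(idx.search("baghdad")))      # usable after the first document
    idx.add("Baghdad curfew lifted", lang="en", source="social")
    hits = idx.search("baghdad")
    print(idx.facets(hits, "source"))      # facet counts guide the next query
```

The facet counts are what make thematic exploration practical: the results of one query show which metadata values are present, and those values become filters for the next query.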
  • 29. Dashboard
  • 30. Search and Filter
  • 31. Foreign Language Search
  • 32. Detailed Document View
  • 33. Entity Search – Cross Language
  • 34. Search/Filter/Explore http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
  • 35. Summary: Text Analytics enables Big Data Triage
  • 36. Thank You! For more information: Visit www.basistech.com Write to conference@basistech.com Call 617-386-2090 or 800-697-2062