Big Data Meets Metadata – Analyzing Large Data Sets



Presented by Jeremy Bentley | Smartlogic

As Big Data becomes more pervasive, the need for increased metadata management becomes critical to the understanding and mining of that content. Metadata is what unlocks the value of information assets. When metadata is well managed, the information assets are more useful and valuable. Badly managed metadata can make information assets less useful and less valuable — creating increased costs and risks related to those assets. During this presentation, we'll discuss the different types of metadata, the role of search and analytics in Big Data and the integration of Apache Solr with Content Intelligence to enable better metadata management of Big Data.




Big Data Meets Metadata – Analyzing Large Data Sets Presentation Transcript

  • 1. Smartlogic™, Lucene Revolution 2012. Jeremy Bentley, CEO
  • 2. 1st degree of order: Filing management
      • 80% of enterprise information is unstructured
      • Doubling every 19 months and accelerating [Gartner]
      • Increasing burden of compliance
      • Enterprise 2.0 additions
      • Big Data connotations
  • 3. 2nd degree of order: Index management
      • File plans and metadata schema
      • Manually applied classification
      • Low level of consistency and quality
  • 4. 3rd degree of order: Automation of 1st & 2nd degrees [diagram: Enterprise Content Management, Search, Portal Infrastructure, Document Management, SharePoint, Records Management, Publishing Systems, Process Management & Workflow, Digital Asset Management, eDiscovery]
  • 5. A 10 year Flatline [chart: User Search Satisfaction, 50% in 2001 and 48% in 2011]
      • 2001, IDC, "Quantifying Enterprise Search": Searchers are successful in finding what they seek 50% of the time or less
      • 2011, MindMetre/SmartLogic: More than half (52%) cannot find the information they need using their Enterprise search system
  • 6. The explosion of information [chart: terabytes of data, 4 TB in 1993-2001 and 80 TB in 2001-2009, a 20 times increase in information volume, with "?" beyond that. Source: the National Archives]
  • 7. Volume + other disruptive factors
      • Velocity
      • Variety
      • Complexity
      • Cross-organizational and cross-platform information needs
      • Changing requirements for information over time
      • Copyright © 2011 Smartlogic Semaphore Limited
  • 8. New 4th degree of order: Content Intelligence [diagram: Enterprise Content Management, Search, Portal Infrastructure, Document Management, SharePoint, Records Management, Publishing Systems, Process Management & Workflow, Digital Asset Management, eDiscovery]
  • 9. Content Intelligence [diagram: Information Manufacturing, Monetisation, Knowledge Recovery, Metadata, Data Loss Prevention, Risk & Compliance, Content Analytics]
  • 10. Knowing what you have
  • 11. Metadata [diagram of metadata types: Information, Process, Structural. Fields shown: Subject, Location, Project, Function (IT, HR, Finance), Expert, Creation Date, Modified Date, Author, Format (PDF, DOC, XLS), Publisher, Site, Protective Marker, Expiry, Retention]
  • 12. 4th degree of order: Content Intelligence [diagram: Content Intelligence Platform, FAST, SharePoint]
  • 13. What is Content Intelligence: Content Intelligence is the process of IDENTIFYING, CLASSIFYING, EXTRACTING, ANALYZING and SURFACING information based on its meaning and context to make timely and informed business decisions.
  • 15. Big Data + Content Intelligence [figure from Gartner, 2011]
  • 16. Semaphore – Three Core Capabilities
      • Ontology Manager (semantic model): build, manage and deploy vocabularies/libraries
      • Classification Server (apply to content): automate the metadata enrichment
      • Semantic Enhancement Server (expose to users, inform): explore data to find insights
  • 17. Enterprise Classification. Important requirements for Velocity/Volume:
      • Scalability for large volumes of content, users, metadata and systems
      • Easy integration with processing systems: search, content, records and document management systems as well as file shares and content migration tools
      • Support for all the organization's languages and data formats
  • 18. From Many Different Sources
  • 19. Metadata Generation [diagram of metadata types: Information, Process, Structural. Fields shown: Brand, Service, Geography, Products, Expert, Creation Date, Modified Date, Author, Format (PDF, DOC, XLS), Publisher, Site, Protective Marker, Retention, Expiry]
  • 20. Different Vocabulary and Ambiguity
      • Different vocabulary leads to missing results (You say / I say):
          • Perpetrator: Burglar, Thief
          • Swine Flu: Swine Influenza Virus, H1N1
          • Touchscreen: Touch screen, Multi-touch
      • Ambiguity leads to too many results (You say / What do you mean?):
          • Apple: A fruit? Fiona, a singer/songwriter? An electronics company?
          • Rights: Employment rights? Equal rights? Right of way?
          • Ford: Ford Motor, Forward Industrials (ticker=FORD), a shallow river crossing
  • 21. Without Accurate Metadata: "Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of 'false discoveries.' The trouble with seeking a meaningful needle in massive haystacks of data is that 'many bits of straw look like needles.'" - Trevor Hastie, Statistics Professor at Stanford University
  • 22. What Classification Must Handle (table columns: Capability, Included)
      • Look for all the vocabulary associated with topic/entity
      • Determine aboutness / avoid passing mentions
      • Address term ambiguity
      • Handle stemming errors
      • Determine if topics in the same context
      • Split documents into components
      • Generate scores (so most relevant content bubbles to top)
      • Show dynamic summaries to users
  • 23. Enhancing Metadata
      • Accurately classify content into subject areas defined in a taxonomy/ontology
      • Entity extraction (Text Mining)
      • Sentiment Analysis
      • Fact Extraction
  • 24. Physical Architecture [diagram]
      • Ontology Management Services
          • Ontology Manager Desktop (two shown) and Standalone Desktop: Win 7, Vista; 2 GB RAM; 2 GHz dual CPU
          • Ontology Manager Server: Ontology Instance 1 on port 8001, Ontology Instance 2 on port 8002; Win 7, Vista, 2003, 2008 + R2, or Linux; 2 GB RAM; 2 GHz CPU
          • Optional RDBMS data store: Oracle, MySQL, SQL Server 2005 + 2008 + 2008 R2
      • Content Classification Server
          • Classification Server: Classification Instance on port 5058; Windows Server 2003, 2008 (32bit/64bit) + R2, or Linux. CPU and RAM intensive. Scale to volume of content and number of publishing users
          • Classification Test Interface: Internet Explorer, Firefox
          • Rule and Template Editor: Win 7, Vista; 2 GB RAM; 2 GHz dual CPU
      • Semantic Enhancement Server / Search Enhancement Server
          • Search Enhancement Instance with GSA Extensions, FAST Extensions and SharePoint Extensions; Windows Server 2003, 2008 (32bit/64bit) + R2, or Linux; IIS/Apache HTTP Server. RAM and disk access intensive. Scale to expected peak search throughput
      • Integration Components
          • Google Classification Handler (Dispatcher, Proxy): Windows Server 2003, 2008 (32bit/64bit) + R2. Scale for throughput of GSA Indexing Crawler
          • Search Application Framework, Semaphore Document Processor, Document Library Components, Search Web Parts
          • Target platforms: Microsoft FAST ESP Server Farm, Microsoft Office SharePoint Server 2007 / 2010 Server Farm, Google Search Appliance, SOLR
  • 25. Leveraging Metadata Schemes
  • 26. Examples – Customer Service
  • 27. Examples – Following Trends
  • 28. Examples – Fact Extraction
  • 29. How Else Does Semaphore Help
      • Disambiguate queries
      • Perfectly formed filters organised by facet
      • Graphical drill down
      • Explore relationships
      • Supporting documents
  • 30. Happy, Successful Customers
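
The vocabulary and ambiguity problems on slide 20, together with the scoring and "passing mention" requirements on slide 22, can be illustrated with a small Python sketch. This is not Semaphore's implementation or API. The vocabulary, the context-clue lists, the function names and the `min_score` threshold are all hypothetical, chosen only to show the idea: alternative labels are folded into one preferred concept (so a document that says "H1N1" is not a missing result for "Swine Flu"), a minimum score filters out passing mentions, and surrounding words suggest a sense for an ambiguous term such as "Ford".

```python
import re

# Hypothetical mini-vocabulary: preferred concept -> alternative labels.
# The labels echo the slide's examples; the structure is an assumption.
VOCABULARY = {
    "Swine Flu": ["swine flu", "swine influenza virus", "h1n1"],
    "Touchscreen": ["touchscreen", "touch screen", "multi-touch"],
}

# Hypothetical context clues for an ambiguous term: sense -> evidence words.
AMBIGUOUS = {
    "ford": {
        "Ford Motor": ["car", "vehicle", "motor", "truck"],
        "River crossing": ["river", "stream", "shallow", "crossing"],
    },
}


def count_phrase(text, phrase):
    """Count case-insensitive, whole-phrase occurrences of phrase in text."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def classify(text, min_score=2):
    """Score each concept by label hits; drop concepts below min_score."""
    scores = {
        concept: sum(count_phrase(text, label) for label in labels)
        for concept, labels in VOCABULARY.items()
    }
    return {c: s for c, s in scores.items() if s >= min_score}


def disambiguate(text, term):
    """Return the sense with the most context clues, or None if undecided."""
    if count_phrase(text, term) == 0:
        return None
    best, best_hits = None, 0
    for sense, clues in AMBIGUOUS.get(term.lower(), {}).items():
        hits = sum(count_phrase(text, clue) for clue in clues)
        if hits > best_hits:
            best, best_hits = sense, hits
    return best


doc = ("H1N1 cases rose this week. The swine influenza virus spread "
       "quickly, and one clinic installed a touch screen.")
print(classify(doc))
print(disambiguate("We crossed the shallow river at the ford.", "ford"))
```

In this sketch the single "touch screen" mention scores below the threshold and is treated as a passing mention, while the two different labels for swine flu add up to one concept. A production classifier would also need the stemming, context and document-splitting capabilities listed on slide 22, which simple phrase counting does not cover.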