Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience


Presented by Mark Davis, CTO, Kitenga.

Kitenga's Analyst system uses the LucidWorks Enterprise REST API in a variety of ways, including for configuring collections and managing Solr schema. As part of the Kitenga platform, the ZettaSearch Designer empowers the end-user to dynamically drag-and-drop search widgets to create a specialized search interface. For a user to effectively design search UIs that meet their needs, they need to be able to understand the available schema fields that populate a given collection. ZettaSearch Designer interrogates the Solr infrastructure using the Lucid REST API to provide an overview of the available metadata. It is then easy for the user to build rich, facetted search experiences around the metadata library indexed into the collection. In this implementation overview, I will describe the design of ZettaSearch Designer, how it interacts with big data technologies like Hadoop as part of the indexing pipeline, and how it uses the LucidWorks API to enable user discovery of the metadata needed to create novel search user interfaces on the fly.
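
As a rough illustration of the flow described above, the following Python sketch takes a field listing of the kind a schema-interrogation call might return and turns the facetable fields into a Solr faceted query URL. The shape of the field records (`name`, `indexed`, `facet`), the host and port, and the `patents` collection name are assumptions made for this example and are not taken from the talk. The `facet`, `facet.field`, and `facet.mincount` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical field listing. The real record shape depends on the
# LucidWorks version and is NOT specified in the talk.
SAMPLE_FIELDS = [
    {"name": "title", "indexed": True, "facet": False},
    {"name": "company", "indexed": True, "facet": True},
    {"name": "memory_type", "indexed": True, "facet": True},
    {"name": "raw_body", "indexed": False, "facet": False},
]


def facetable_fields(fields):
    """Return the names of fields that are indexed and flagged for faceting."""
    return [f["name"] for f in fields if f.get("indexed") and f.get("facet")]


def faceted_query_url(solr_base, collection, query, facet_fields, rows=10):
    """Build a Solr /select URL that requests counts for each facet field."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # Hide facet values that match no documents.
        params.append(("facet.mincount", "1"))
        # Solr accepts one facet.field parameter per field.
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Assumed host, port, and collection name, for illustration only.
url = faceted_query_url(
    "http://localhost:8983/solr",
    "patents",
    "flash memory",
    facetable_fields(SAMPLE_FIELDS),
)
print(url)
```

A designer tool along these lines could bind each returned facet field to a drag-and-drop widget, which is the kind of schema-to-interface binding the abstract describes.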

Published in: Technology, Education

  1. 1. Kitenga: reinventing information. Mark Davis, Founder/CTO
  2. 2. Enabling Big Data Search via the Lucid ReST API
  3. 3. Big Data
      • Enormous transactional data
      • Enormous unstructured information
      • Too big for databases
      • New tools are needed
  4. 4. Volume, Velocity, Variety

      Decimal unit     Value   Binary usage   Binary unit      Value
      kilobyte (kB)    10^3    2^10           kibibyte (KiB)   2^10
      megabyte (MB)    10^6    2^20           mebibyte (MiB)   2^20
      gigabyte (GB)    10^9    2^30           gibibyte (GiB)   2^30
      terabyte (TB)    10^12   2^40           tebibyte (TiB)   2^40
      petabyte (PB)    10^15   2^50           pebibyte (PiB)   2^50
      exabyte (EB)     10^18   2^60           exbibyte (EiB)   2^60
      zettabyte (ZB)   10^21   2^70           zebibyte (ZiB)   2^70
      yottabyte (YB)   10^24   2^80           yobibyte (YiB)   2^80
  5. 5. Indexing Challenges
      • Complex, varied data
      • Compute-intensive metadata generation
      • Schema and collection management

      Gather: crawl; crack formats
      Extract Metadata: named entities; categories; machine learning; semantic analysis
      Index Resources: schema definition; collection management
  6. 6. Search Experience Challenges
      • Complex, varied data
      • Resource discovery
      • Facetted search experience management

      Initial Query: keyword guesses; category guidance
      Refine Query: analytic tools; facetted guidance
      Evaluate Relevance: read KWIC; read metadata; read document
  7. 7. The Solution
      • Enable fast metadata generation: Hadoop, Mahout, GPUs
      • Manage and control collections and schema: LucidWorks Enterprise API
  8. 8. SQL versus Search

      SQL                  Search
      RDBMS                Documents
      Transactional Data   Text Classification
      BI Tools             Taxonomies
                           Ontologies
  9. 9. Machine-Learning; Finite State Transducer (shown three times); Parts-of-Speech Tagging; Lemmatization; Tokenization
  10. 10. Resource Integration
      • Facet Browsing
      • Facet Charting
      • Spellcheck
      • Autosuggest
      • Query Language
      • Indexing
      • Metadata Extraction
  11. 11.
      • Start to proof of concept (POC) in a week
      • Open source intelligence problems
  12. 12. ZettaSearch: Facetted Search and Analytics
      • GOAL: Be more competitive
      • SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites
      • ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends
      • ACTION: Change R&D priorities and improve marketing approaches
      [Diagram labels: Sources; data; ZettaVox; entities; metadata; relationships]
  13. 13.
      • Understand IP among competitors
      • Assist legal team with litigation
      • Custom search experience
      • Custom extractors:
        - Electronic parts
        - Memory types
        - Flash memory
      (Slide footer date: 5/15/12)
  14. 14. Documents and corpus size by company

      Company      Documents   Size
      Dell           102,508   9 GB
      EMC            303,678   14 GB
      Huawei          11,912   890 MB
      Kingston         2,534   134 MB
      Lenovo           8,305   542 MB
      NEC              3,900   252 MB
      Nokia          174,681   22 GB
      Panasonic        5,804   473 MB
      Rim                181   8 MB
      Sharp USA       31,918   4.9 GB
      Total          645,421   60.2 GB
  15. 15. ZettaSearch: Facetted Search and Analytics
      • GOAL: Discover new drugs, detect side-effects, speed R&D
      • SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases
      • ANALYSIS: Extract named entities and relationships, classify and label; visually discover trends and relationships
      • ACTION: Change R&D priorities
      [Diagram labels: Sources; data; ZettaVox; entities; sequences; pathways; relationships]
  16. 16.
      • Lousy search (Google Search Appliance)
      • Internal regulators can't find by accession number
      • Custom extractors:
        - Accession number
        - Ontology of active ingredients
        - Drug names
      © 2012 Kitenga Proprietary
  17. 17. ZettaSearch: Facetted Search and Analytics
      • GOAL: Build "second screen experiences"
      • SOURCES: Wikipedia, IMDB, blogs
      • ANALYSIS: Extract named entities and relationships, preserve existing structural metadata
      • ACTION: Enable new media experiences
      [Diagram labels: Sources; data; ZettaVox; entities; metadata; relationships]
  18. 18.
      • Crawlers on Hadoop
      • Document format crackers on Hadoop
      • Extractors on Hadoop
      • Filters on Hadoop
      • HTTP documents to Solr sharded cluster
      • Intermediary files remain on HDFS for reprocessing
  19. 19.
      • Missing piece of the puzzle
      • Addresses the impedance mismatch between Big Data technologies and Solr search
      • Manage collections
      • Manage schema
  20. 20.
      • Create collections
      • Delete collections
      • Update collection properties
      • Create schema
      • Modify schema
  21. 21.
      • Schema interrogation
      • Schema binding to user experience
      • Facetted search
      • Embedded analytics
  22. 22.
      • Big Data search and analytics has many challenges:
        - Volume of data
        - Variety of data
        - Velocity of data
        - Extracting structure from unstructured information
      • Hadoop processing enables each of these aspects
      • Controlling indexing and search is enabled by the Lucid Imagination search API
      • We can enable complex user interactions with Big Data on a self-serve basis
  23. 23. [Architecture diagram. Labels recoverable from the slide: Analyst Browser (ZettaVox Author RIA); XML + JSON and ReST/JSON links; Enterprise servers; Cloud services (Amazon S3); Tomcat App Server; Tomcat Web Services; ZettaVox Services; Enterprise Manager; Cloud Manager; GPU Services Manager; Hadoop Services Manager; Search Indexing Manager; GPU MR Service Manager; Hadoop Server (Name node); Hadoop Server (Job Tracker); GPU and Hadoop Task Managers; Quantum4D; RDBMS; Entity Extraction; Mahout; Crawling]
  24. 24. [Architecture diagram, search indexing path. The Analyst Browser (ZettaVox Author RIA) talks ReST/JSON to Search Indexing on the enterprise servers.]
      Search Indexing operations:
      • Get collection information
      • Create new collection
      • Create fields
      • Delete fields
      • Edit fields
      [Other labels: Hadoop Server (Name node); Hadoop Server (Job Tracker); Hadoop Task Managers; Entity Extraction; Mahout; Crawling; Indexing]
  25. 25. Questions?
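
Slides 20 and 24 list the collection and field operations that the designer drives through the REST API. The Python sketch below builds, but does not send, one HTTP request per operation. The base URL, the `/collections` and `/fields` paths, the choice of HTTP verbs, and the JSON payload keys are all assumptions made for illustration; the slides do not show them, so the actual LucidWorks REST API documentation should be checked before relying on any of them.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumed base URL; not taken from the slides.
API_BASE = "http://localhost:8888/api"


def _json_request(method, path, payload=None):
    """Build (without sending) a request, JSON-encoding the payload if given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(f"{API_BASE}{path}", data=data, headers=headers, method=method)


def _collection_path(name):
    # Percent-encode the name so spaces or slashes cannot alter the path.
    return f"/collections/{quote(name, safe='')}"


def get_collection(name):
    return _json_request("GET", _collection_path(name))


def create_collection(name):
    return _json_request("POST", "/collections", {"name": name})


def delete_collection(name):
    return _json_request("DELETE", _collection_path(name))


def create_field(collection, field):
    """`field` is a dict such as {"name": "memory_type", "facet": True}."""
    return _json_request("POST", f"{_collection_path(collection)}/fields", field)


def edit_field(collection, field_name, changes):
    path = f"{_collection_path(collection)}/fields/{quote(field_name, safe='')}"
    return _json_request("PUT", path, changes)


def delete_field(collection, field_name):
    path = f"{_collection_path(collection)}/fields/{quote(field_name, safe='')}"
    return _json_request("DELETE", path)


req = create_field("patents", {"name": "memory_type", "facet": True})
print(req.get_method(), req.full_url)
```

Each function returns a `urllib.request.Request`, so a caller could pass it to `urllib.request.urlopen` against a live server. Separating request construction from sending keeps the operations easy to inspect and test offline.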