Building Enterprise Search Engines using Open Source Technologies

Enterprise search is a challenging problem for most organizations. Public search engines such as Google index content and use link popularity, in addition to basic keyword matches, to rank it. Enterprise search is different: it sometimes requires specially designed indexes as well as several processing steps.

At the U.S. Patent & Trademark Office, part of the Department of Commerce, a team of professionals is building the next generation of search tools using open source technologies. Like any large undertaking, it’s not a simple plug-and-play project.

Main topics to be covered in this talk:

+ Architectures for Large Scale Enterprise Search
+ Leveraging Apache Cassandra & Spark
+ Customizing / Configuring Apache SolR and Indexing
+ Writing a custom Parser for SolR in Scala

Published in: Software

  1. www.anant.us | solutions@anant.us | 202.905.2818
     1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
     Large Scale Search with Open Source Technologies
     Building Search Engines
  2. What do we do? Streamline, Organize & Unify Business Information
  3. Agenda
     • Challenge - Why does this matter?
     • Search Engine - 30k Foot View
     • Open - Lucene, Cassandra & Spark
     • Customizing - Apache Lucene/SolR
     • Custom Parser - Written in Scala
  4. Challenge – Why does this matter?
     Knowledge Project Information Client Service Information Corporate Guides Collaborative Documents Assets & Files Corporate Resources Appleseed Framework (Portal, Base, Search) G Drive Delta DropBox G Drive Delta Nutshell Dropbox Freshbooks G Drive G Sites (KB) G Drive Workflowy Evernote G Drive DropBox OwnCloud Pocket Leaves AIC (WP) Anant (WP)
  5. Search Engine – 30 Thousand Foot View
     The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
  6. Lucene – More than meets the eye
     Who Next?
     Think of it like a “NoSQL” database that has great indexing... everywhere.
  7. Cassandra – NoSQL With Structure
     Who Next?
     Think of it like a “NoSQL” database that has structure. Using “CQL” you can insert into and select from... just not join.
  8. Spark – Way Better MapReduce
     Who Next?
     Think of it like MapReduce if MapReduce had been created with Scala instead of Java, and with streams. It can also be up to 100 times faster for in-memory workloads.
  9. Configuring - SolR - 1/3
     SolR is like an eighteen wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
     • Configuration - Schema
       – Data Types
       – Pre-Processing
       – Collection Definitions
       – Managed vs. Unmanaged
     • Configuration - ZooKeeper
       – Synchronize Configurations
       – Distribute Shards
       – Manage Replicas
       – Elect Leaders
     • Configuration - SolrConfig
       – Handlers
       – Components
       – Indexing Configurations
       – Memory / Cache
       – File System
     • Lessons Learned
       – Try to use out of the box
       – Try to configure your way
       – Make sure to upgrade
       – Not everything can be configured
  10. Configuring - SolR - 2/3
      • Before Docker
        – Setup Zookeeper
          • Customize zoo.cfg
          • Setup Zookeeper Servers
        – Setup SolR Distro
          • Download SolR
          • Clean up SolR
          • Customize Schema.xml
          • Customize SolrConfig.xml
          • Setup Different Solr Servers
        – Start the Cloud
          • Custom Start Scripts
      • Today w/ Docker
        – docker run --name zookeeper -p 127.0.0.1:2181:2181 -p 127.0.0.1:2888:2888 -p 127.0.0.1:3888:3888 jplock/zookeeper
        – docker run --link zookeeper:ZK -i -p 127.0.0.1:8983:8983 -t dockerimages/docker-solr /bin/bash -c 'cd /opt/solr/example; java -jar -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT -DnumShards=2 start.jar';
      https://hub.docker.com/r/dockerimages/docker-solr/
      https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
      https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
  11. Configuring - SolR - 3/3
      • SolrConfig - Example
      • Schema - Example
      https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml
      https://wiki.apache.org/solr/SchemaXml
  12. SolR Cloud / Zookeeper
  13. User Interface - Super Advanced
  14. Customizing - SolR - 1/3
      SolR is like an eighteen wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
      • Customization - Parsing
        – Need Specialized Syntax?
        – Java or Scala Based
        – Open Plugin Structure
        – Several Examples
      • Customization - Highlighting
        – Need Special Coloring?
        – Specialized Syntax Aware
        – Open Plugin Structure
        – Several Examples
      • Customization - Term Counts
        – Need Specific Information?
        – Create the Logic
        – Register the Component
        – Complicated Examples
      • Lessons Learned
        – Major version upgrades = pain
        – Newer classes can be extended better
        – Long term investment
  15. Customizing - SolR - 2/3
      • Custom Component in Scala or Java
      • Installing the Component
      http://wiki.apache.org/solr/SolrPlugins
      http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
  16. Customizing - SolR - 3/3
  17. Creating a Custom Parser with Scala
      Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is.
      • Why a Specialized Syntax?
        – Legacy Syntax
        – Boolean AND Proximity Queries
        – Specialized Fielded Expressions
        – Ranges / Classifications
      • Why not ANTLR or JavaCC?
        – Old Parser was in Parboiled(1)
        – Parboiled2 was in Scala
        – No need to learn a separate Syntax for Creating Syntax
      • Lessons Learned
        – Parboiled2 Documentation = bad
        – Understand the syntax
        – Interactive REPL in Scala = good
        – Write tons of unit tests
        – Long term investment
      • Customizing SolR with Scala
        – Found a good Virtual Mentor
        – Learned Scala (not for Spark)
        – Started from the ground up
        – Reduced from ~1k to 400 LOC
  18. JavaCC vs. parboiled2 (Scala)
      • Java CC - SurroundQuery.jj
      • Scala based Parboiled2
  19. Questions & Contact
      www.anant.us | solutions@anant.us | 202.905.2818
      1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
      @anantcorp | facebook.com/anantCorp | linkedin.com/company/anant
      Rahul Singh, CEO & Founder | rahul@anant.us | linkedin.com/in/xingh
      • Brown Bag Session or Meetup?
      • Modern Enterprise
      • Mastering Services in the Service of Others
      • Hybrid Agile Project Management
      • Building Search Engines
      • CICD / DevOps
      • Connecting Internet Software
  20. Streamlined Data Integration / Data Pipelines
      Organized Knowledge Search / Data Warehouses
      Unified Interfaces Portals / Dashboards / Mobile
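
Slide 17 lists boolean and proximity queries among the reasons for a specialized syntax, but the transcript does not include the parser code shown on slide 18. As a rough illustration only, here is a minimal hand-written recursive-descent sketch in plain Scala. It does not use parboiled2, and it is not the parser from the talk. The syntax (`AND`, `OR`, `NEARn`, parentheses) and every name in it (`QuerySketch`, `Term`, `Near`, and so on) are assumptions made up for this example.

```scala
object QuerySketch {
  // AST for a tiny, hypothetical query language (not the legacy syntax from the talk).
  sealed trait Query
  final case class Term(text: String) extends Query
  final case class And(left: Query, right: Query) extends Query
  final case class Or(left: Query, right: Query) extends Query
  final case class Near(left: Query, right: Query, distance: Int) extends Query

  // Assumed proximity operator: NEAR followed by a maximum word distance, e.g. NEAR3.
  private val NearOp = "NEAR(\\d+)".r

  // Split on whitespace, treating parentheses as their own tokens.
  def tokenize(input: String): List[String] =
    input.replace("(", " ( ").replace(")", " ) ").trim.split("\\s+").filter(_.nonEmpty).toList

  private def isReserved(token: String): Boolean =
    token == "AND" || token == "OR" || token == "(" || token == ")" ||
      NearOp.pattern.matcher(token).matches()

  // Each parse function returns the parsed node plus the tokens it did not consume.
  // Precedence, lowest to highest: OR, AND, NEARn. Operators associate to the right.
  private def parseOr(tokens: List[String]): (Query, List[String]) = {
    val (left, rest) = parseAnd(tokens)
    rest match {
      case "OR" :: tail =>
        val (right, remaining) = parseOr(tail)
        (Or(left, right), remaining)
      case _ => (left, rest)
    }
  }

  private def parseAnd(tokens: List[String]): (Query, List[String]) = {
    val (left, rest) = parseNear(tokens)
    rest match {
      case "AND" :: tail =>
        val (right, remaining) = parseAnd(tail)
        (And(left, right), remaining)
      case _ => (left, rest)
    }
  }

  private def parseNear(tokens: List[String]): (Query, List[String]) = {
    val (left, rest) = parsePrimary(tokens)
    rest match {
      case NearOp(n) :: tail =>
        val (right, remaining) = parseNear(tail)
        (Near(left, right, n.toInt), remaining)
      case _ => (left, rest)
    }
  }

  // A primary is either a parenthesized sub-query or a single search term.
  private def parsePrimary(tokens: List[String]): (Query, List[String]) = tokens match {
    case "(" :: tail =>
      val (inner, rest) = parseOr(tail)
      rest match {
        case ")" :: remaining => (inner, remaining)
        case _ => throw new IllegalArgumentException("expected ')'")
      }
    case word :: tail if !isReserved(word) => (Term(word), tail)
    case other =>
      throw new IllegalArgumentException(
        s"unexpected input: ${other.headOption.getOrElse("end of query")}")
  }

  def parse(input: String): Query = {
    val (query, rest) = parseOr(tokenize(input))
    if (rest.nonEmpty) throw new IllegalArgumentException(s"unexpected token: ${rest.head}")
    query
  }
}
```

With these assumptions, `QuerySketch.parse("solar NEAR3 panel AND (roof OR tile)")` yields `And(Near(Term(solar),Term(panel),3),Or(Term(roof),Term(tile)))`. A Solr query parser plugin would then still need to translate such a tree into Lucene query objects, which is outside the scope of this sketch.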