Big data is a field concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex for traditional data-processing software.
2. Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
3. Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data.
If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
4. Lucene – More than meets the eye
Think of it like a “NoSQL” database with great indexing everywhere.
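Lucene earns that "great indexing" reputation from its inverted index: a map from each term to the documents containing it. A toy sketch of the idea in plain Java, for illustration only; Lucene layers analyzers, compressed postings lists, scoring, and segment merging on top of this concept.

```java
import java.util.*;

// Toy inverted index: term -> set of document ids.
// Concept sketch only, not Lucene's actual implementation.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // AND query: documents containing every term.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> hits = index.getOrDefault(term.toLowerCase(), Collections.emptySet());
            if (result == null) result = new HashSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```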
5. Cassandra – NoSQL With Structure
Think of it like a “NoSQL” database that has structure. Using CQL you can insert into and select from tables, just not join them.
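What that looks like in CQL, with a hypothetical keyspace and table (all names are made up for illustration):

```sql
-- CQL reads like SQL, but there is no JOIN; denormalize instead.
CREATE TABLE docs.pages (
    id uuid PRIMARY KEY,
    title text,
    body text
);

INSERT INTO docs.pages (id, title, body)
VALUES (uuid(), 'Intro', 'Hello Cassandra');

SELECT title FROM docs.pages WHERE id = ?;
```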
6. Spark – Way Better MapReduce
Think of it like MapReduce if MapReduce had been created in Scala instead of Java, with streams. It is also up to 100 times faster.
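Word count is the canonical MapReduce example; here it is in the map/reduce style using plain Java streams. This illustrates the programming model only, not Spark's API, though Spark's operations (flatMap, reduceByKey) read much the same while executing distributed across a cluster and keeping data in memory between stages.

```java
import java.util.*;
import java.util.stream.*;

// Word count in map/reduce style with plain Java streams.
public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // map phase: split each line into lowercased words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                // reduce phase: tally occurrences per word
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```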
7. Big Picture Overview
• UI - Static
(JS/CSS/HTML/jQuery)
• API - Java
(MySQL, Solr, JCS, OpenCV,
JMS, Jolt)
• Index - Solr
(One 200 Field Index)
• Indexing - Spark
(Cassandra to Spark)
• Data - Cassandra
(2 Keyspaces, 8-10 Tables)
• Ingesting - Java
(Quintessential Noodles)
8. Configuring - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
● Configuration - Schema
○ Data Types
○ Pre-Processing
○ Collection Definitions
○ Managed vs. Unmanaged
● Configuration - ZooKeeper
○ Synchronize Configurations
○ Distribute Shards
○ Manage Replicas
○ Elect Leaders
● Configuration - SolrConfig
○ Handlers
○ Components
○ Indexing Configurations
○ Memory / Cache
○ File System
● Lessons Learned
○ Try to use it out of the box first
○ Try to configure your way there
○ Make sure to upgrade
○ Not everything can be configured
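The schema bullets above (data types, pre-processing) live in Solr's schema. A representative fragment, with illustrative field names; the analyzer classes shown are standard Solr factories:

```xml
<!-- Hypothetical field definitions; field names are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="body"  type="text_general" indexed="true" stored="false"/>
```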
13. Customizing - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
● Customization - Parsing
○ Need Specialized Syntax?
○ Java or Scala Based
○ Open Plugin Structure
○ Several Examples
● Customization - Highlighting
○ Need Special Coloring?
○ Specialized Syntax Aware
○ Open Plugin Structure
○ Several Examples
● Customization - Term Counts
○ Need Specific Information?
○ Create the Logic
○ Register the Component
○ Complicated Examples
● Lessons Learned
○ Major version upgrades = pain
○ Newer classes can be extended better
○ Long term investment
14. Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
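Installing the component comes down to dropping the jar where Solr can see it and registering the class in solrconfig.xml. A sketch, with hypothetical jar, class, and parser names:

```xml
<!-- solrconfig.xml: load the plugin jar, then register the parser.
     Jar, class, and parser names here are illustrative. -->
<lib dir="./lib" regex="my-parser-plugin\.jar"/>
<queryParser name="myparser" class="com.example.solr.MyQParserPlugin"/>
```

Once registered, the parser can be selected per request with Solr's local-params syntax, e.g. `q={!myparser}title:(foo AND bar)`.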
16. Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it in
Scala made me see how much better the language is.
● Why a Specialized Syntax?
○ Legacy Syntax
○ Boolean AND Proximity Queries
○ Specialized Fielded Expressions
○ Ranges / Classifications
● Why not ANTLR or JavaCC?
○ Old parser was in Parboiled(1)
○ Parboiled2 was in Scala
○ No need to learn a separate syntax for creating syntax
● Lessons Learned
○ Parboiled2 documentation = bad
○ Understand the syntax
○ Interactive REPL in Scala = good
○ Write tons of unit tests
○ Long term investment
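The deck's real parser is a Parboiled2 grammar in Scala. As a dependency-free stand-in, here is a minimal recursive-descent sketch in Java of the flavor of grammar involved (bare words and field:word terms joined by AND); the grammar, names, and output format are illustrative only, and the real parser additionally handles proximity operators, ranges, and classifications:

```java
import java.util.*;

// Minimal recursive-descent parser for a toy query syntax:
//   query := term ("AND" term)*
//   term  := [field ":"] word
// Emits a Lucene-style string, e.g. "+title:solr +text:scala".
public class QueryParser {
    private final String[] tokens;
    private int pos = 0;

    public QueryParser(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    public String parse() {
        StringBuilder out = new StringBuilder();
        out.append('+').append(term());
        while (pos < tokens.length) {
            expect("AND");
            out.append(" +").append(term());
        }
        return out.toString();
    }

    private String term() {
        if (pos >= tokens.length) throw new IllegalArgumentException("expected term");
        String tok = tokens[pos++];
        // "field:word" stays as-is; bare words get a default field.
        return tok.contains(":") ? tok : "text:" + tok;
    }

    private void expect(String kw) {
        if (pos >= tokens.length || !tokens[pos].equals(kw))
            throw new IllegalArgumentException("expected " + kw);
        pos++;
    }
}
```

Writing tons of unit tests against a parser like this is cheap, which is exactly the lesson above.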
Customizing Solr with Scala
• Found a good virtual mentor
• Learned Scala (not for Spark)
• Started from the ground up
• Reduced from ~1k to 400 LOC
18. www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, Quickbooks, and more