Big data is a field concerned with ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex for traditional data-processing software.
2. Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
3. Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data.
If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
4. Lucene – More than meets the eye
Think of it like a “NoSQL” database with great indexing everywhere.
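Lucene earns that "great indexing" reputation from its inverted index: a map from each term to the documents containing it. A toy sketch of the idea in plain Java, for illustration only; Lucene layers analyzers, compressed postings lists, scoring, and segment merging on top of this concept.

```java
import java.util.*;

// Toy inverted index: term -> set of document ids.
// Concept sketch only, not Lucene's actual implementation.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // AND query: documents containing every term.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> hits = index.getOrDefault(term.toLowerCase(), Collections.emptySet());
            if (result == null) result = new HashSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```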
5. Cassandra – NoSQL With Structure
Think of it like a “NoSQL” database that has structure. Using CQL you can insert into and select from tables, just not join them.
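What that looks like in CQL, with a hypothetical keyspace and table (all names are made up for illustration):

```sql
-- CQL reads like SQL, but there is no JOIN; denormalize instead.
CREATE TABLE docs.pages (
    id uuid PRIMARY KEY,
    title text,
    body text
);

INSERT INTO docs.pages (id, title, body)
VALUES (uuid(), 'Intro', 'Hello Cassandra');

SELECT title FROM docs.pages WHERE id = ?;
```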
6. Spark – Way Better MapReduce
Think of it like MapReduce if MapReduce had been created in Scala instead of Java, with streams. It is also up to 100 times faster.
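Word count is the canonical MapReduce example; here it is in the map/reduce style using plain Java streams. This illustrates the programming model only, not Spark's API, though Spark's operations (flatMap, reduceByKey) read much the same while executing distributed across a cluster and keeping data in memory between stages.

```java
import java.util.*;
import java.util.stream.*;

// Word count in map/reduce style with plain Java streams.
public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // map phase: split each line into lowercased words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                // reduce phase: tally occurrences per word
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```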
7. Big Picture Overview
• UI - Static
(JS/CSS/HTML/jQuery)
• API - Java
(MySQL, Solr, JCS, OpenCV,
JMS, Jolt)
• Index - Solr
(One 200 Field Index)
• Indexing - Spark
(Cassandra to Spark)
• Data - Cassandra
(2 Keyspaces, 8-10 Tables)
• Ingesting - Java
(Quintessential Noodles)
8. Configuring - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
● Configuration - Schema
○ Data Types
○ Pre-Processing
○ Collection Definitions
○ Managed vs. Unmanaged
● Configuration - ZooKeeper
○ Synchronize Configurations
○ Distribute Shards
○ Manage Replicas
○ Elect Leaders
● Configuration - SolrConfig
○ Handlers
○ Components
○ Indexing Configurations
○ Memory / Cache
○ File System
● Lessons Learned
○ Try to use it out of the box first
○ Try to configure your way there
○ Make sure to upgrade
○ Not everything can be configured
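The schema bullets above (data types, pre-processing) live in Solr's schema. A representative fragment, with illustrative field names; the analyzer classes shown are standard Solr factories:

```xml
<!-- Hypothetical field definitions; field names are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="body"  type="text_general" indexed="true" stored="false"/>
```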
13. Customizing - Solr - 1/3
Solr is like an eighteen-wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want.
● Customization - Parsing
○ Need Specialized Syntax?
○ Java or Scala Based
○ Open Plugin Structure
○ Several Examples
● Customization - Highlighting
○ Need Special Coloring?
○ Specialized Syntax Aware
○ Open Plugin Structure
○ Several Examples
● Customization - Term Counts
○ Need Specific Information?
○ Create the Logic
○ Register the Component
○ Complicated Examples
● Lessons Learned
○ Major version upgrades = pain
○ Newer classes can be extended better
○ Long term investment
14. Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
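Installing the component comes down to dropping the jar where Solr can see it and registering the class in solrconfig.xml. A sketch, with hypothetical jar, class, and parser names:

```xml
<!-- solrconfig.xml: load the plugin jar, then register the parser.
     Jar, class, and parser names here are illustrative. -->
<lib dir="./lib" regex="my-parser-plugin\.jar"/>
<queryParser name="myparser" class="com.example.solr.MyQParserPlugin"/>
```

Once registered, the parser can be selected per request with Solr's local-params syntax, e.g. `q={!myparser}title:(foo AND bar)`.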
16. Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it in
Scala made me see how much better the language is.
● Why a Specialized Syntax?
○ Legacy Syntax
○ Boolean AND Proximity Queries
○ Specialized Fielded Expressions
○ Ranges / Classifications
● Why not ANTLR or JavaCC?
○ Old parser was in Parboiled(1)
○ Parboiled2 was in Scala
○ No need to learn a separate syntax for creating syntax
● Lessons Learned
○ Parboiled2 documentation = bad
○ Understand the syntax
○ Interactive REPL in Scala = good
○ Write tons of unit tests
○ Long term investment
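The deck's real parser is a Parboiled2 grammar in Scala. As a dependency-free stand-in, here is a minimal recursive-descent sketch in Java of the flavor of grammar involved (bare words and field:word terms joined by AND); the grammar, names, and output format are illustrative only, and the real parser additionally handles proximity operators, ranges, and classifications:

```java
import java.util.*;

// Minimal recursive-descent parser for a toy query syntax:
//   query := term ("AND" term)*
//   term  := [field ":"] word
// Emits a Lucene-style string, e.g. "+title:solr +text:scala".
public class QueryParser {
    private final String[] tokens;
    private int pos = 0;

    public QueryParser(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    public String parse() {
        StringBuilder out = new StringBuilder();
        out.append('+').append(term());
        while (pos < tokens.length) {
            expect("AND");
            out.append(" +").append(term());
        }
        return out.toString();
    }

    private String term() {
        if (pos >= tokens.length) throw new IllegalArgumentException("expected term");
        String tok = tokens[pos++];
        // "field:word" stays as-is; bare words get a default field.
        return tok.contains(":") ? tok : "text:" + tok;
    }

    private void expect(String kw) {
        if (pos >= tokens.length || !tokens[pos].equals(kw))
            throw new IllegalArgumentException("expected " + kw);
        pos++;
    }
}
```

Writing tons of unit tests against a parser like this is cheap, which is exactly the lesson above.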
Customizing Solr with Scala
• Found a good virtual mentor
• Learned Scala (not for Spark)
• Started from the ground up
• Reduced from ~1k to 400 LOC
18. www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, Quickbooks, and more