Research and Development
Big Data Technologies
Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data.
If you put everything you find in your index, you are going to spend a lot of time
telling people how to search.
Lucene – More than meets the eye
Who
Next?
Think of it like a “NoSQL” database that has great indexing...
everywhere.
Cassandra – NoSQL With Structure
Who
Next?
Think of it like a “NoSQL” database that has structure. Using
CQL, you can insert into and select from tables... just not join them.
Spark – Way Better MapReduce
Who
Next?
Think of it like MapReduce if MapReduce were created with
Scala, instead of Java, with streams. The Spark project also claims it runs
up to 100 times faster than Hadoop MapReduce for in-memory workloads.
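To make the map/reduce shape concrete, here is a minimal word-count sketch in plain Java streams. It is not Spark code and does not use any Spark API; the class name `WordCountSketch` and the sample lines are assumptions made for illustration. It only shows the "map each record, then reduce by key" pattern that Spark distributes across a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; plain JDK streams, not Spark.
class WordCountSketch {

    // "Map" each line to its words, then "reduce" by key to a count per word.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the keys sorted so the printed result is stable.
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Solr indexes data", "Spark moves data to Solr");
        System.out.println(wordCount(lines));
    }
}
```

In Spark the same pipeline would run over a distributed dataset instead of an in-memory list, which is where the speed-up over disk-based MapReduce comes from.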
Big Picture Overview
• UI - Static
(JS/CSS/HTML/jQuery)
• API - Java
(MySQL, Solr, JCS, OpenCV,
JMS, Jolt)
• Index - Solr
(One 200 Field Index)
• Indexing - Spark
(Cassandra to Spark)
• Data - Cassandra
(2 Keyspaces, 8-10 Tables)
• Ingesting - Java
(Quintessential Noodles)
Configuring - Solr - 1/3
Solr is like an eighteen wheel truck you can take apart and rebuild from the
ground up with only what you need, or add as much as you want.
● Configuration - Schema
○ Data Types
○ Pre-Processing
○ Collection Definitions
○ Managed vs. Unmanaged
● Configuration - ZooKeeper
○ Synchronize Configurations
○ Distribute Shards
○ Manage Replicas
○ Elect Leaders
● Configuration - SolrConfig
○ Handlers
○ Components
○ Indexing Configurations
○ Memory / Cache
○ File System
● Lessons Learned
○ Try to use out of the box
○ Try to configure your way
○ Make sure to upgrade
○ Not everything can be configured
Configuring - Solr - 2/3
● Before Docker
○ Set up ZooKeeper
■ Customize zoo.cfg
■ Set up ZooKeeper servers
○ Set up the Solr distribution
■ Download Solr
■ Clean up Solr
■ Customize schema.xml
■ Customize solrconfig.xml
■ Set up the different Solr servers
○ Start the cloud
■ Custom start scripts
○ https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
○ https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
● Today w/ Docker
○ docker run --name zookeeper \
    -p 127.0.0.1:2181:2181 \
    -p 127.0.0.1:2888:2888 \
    -p 127.0.0.1:3888:3888 \
    jplock/zookeeper
○ docker run --link zookeeper:ZK -i \
    -p 127.0.0.1:8983:8983 \
    -t dockerimages/docker-solr \
    /bin/bash -c '
      cd /opt/solr/example;
      java -jar \
        -Dbootstrap_confdir=./solr/collection1/conf \
        -Dcollection.configName=myconf \
        -DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT \
        -DnumShards=2 \
        start.jar';
○ https://hub.docker.com/r/dockerimages/docker-solr/
Configuring - Solr - 3/3
• SolrConfig - Example
https://cwiki.apache.org/confluence/display/solr/Configuring+solrconfig.xml
• Schema - Example
https://wiki.apache.org/solr/SchemaXml
Solr Cloud / Zookeeper
User Interface - Super Advanced
Customizing - Solr - 1/3
Solr is like an eighteen wheel truck you can take apart and rebuild from the
ground up with only what you need, or add as much as you want.
● Customization - Parsing
○ Need Specialized Syntax?
○ Java or Scala Based
○ Open Plugin Structure
○ Several Examples
● Customization - Highlighting
○ Need Special Coloring?
○ Specialized Syntax Aware
○ Open Plugin Structure
○ Several Examples
● Customization - Term Counts
○ Need Specific Information?
○ Create the Logic
○ Register the Component
○ Complicated Examples
● Lessons Learned
○ Major version upgrades = pain
○ Newer classes are easier to extend
○ Long term investment
Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
Customizing - Solr - 3/3
Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it in
Scala made me see how much better the language is.
● Why a Specialized Syntax?
○ Legacy Syntax
○ Boolean AND Proximity Queries
○ Specialized Fielded Expressions
○ Ranges / Classifications
● Why not ANTLR or JavaCC?
○ Old Parser was in Parboiled(1)
○ Parboiled2 was in Scala
○ No need to learn a separate syntax for creating syntax
● Lessons Learned
○ Parboiled2 Documentation = bad
○ Understand the syntax
○ Interactive REPL in Scala = good
○ Write tons of unit tests
○ Long term investment
Customizing Solr with Scala
Found a good Virtual Mentor
Learned Scala (not for Spark)
Started from the ground up
Reduced from ~1k to 400 LOC
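A minimal sketch of the kind of hand-written parser this slide describes, in plain Java rather than parboiled2. It assumes a hypothetical legacy syntax with AND, OR, parentheses, and a W/n proximity operator; the class name `LegacyQueryParser`, the operators, and the prefix-style output are illustrative assumptions, not the parser from the talk.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical legacy syntax: terms, AND, OR, parentheses, and "W/n" proximity.
// Precedence (lowest to highest): OR, AND, W/n.
class LegacyQueryParser {
    private final List<String> tokens;
    private int pos = 0;

    private LegacyQueryParser(String input) {
        // Pad parentheses with spaces so a whitespace split yields clean tokens.
        String spaced = input.replace("(", " ( ").replace(")", " ) ").trim();
        this.tokens = Arrays.asList(spaced.split("\\s+"));
    }

    // Parses the query and returns it in prefix form, e.g. AND(a, b).
    static String parse(String input) {
        LegacyQueryParser p = new LegacyQueryParser(input);
        String result = p.orExpr();
        if (p.pos < p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return result;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    private String orExpr() {
        String left = andExpr();
        while ("OR".equals(peek())) {
            pos++;
            left = "OR(" + left + ", " + andExpr() + ")";
        }
        return left;
    }

    private String andExpr() {
        String left = proximity();
        while ("AND".equals(peek())) {
            pos++;
            left = "AND(" + left + ", " + proximity() + ")";
        }
        return left;
    }

    private String proximity() {
        String left = primary();
        String op = peek();
        if (op != null && op.matches("W/\\d+")) {
            pos++;
            left = "NEAR" + op.substring(2) + "(" + left + ", " + primary() + ")";
        }
        return left;
    }

    private String primary() {
        String tok = peek();
        if (tok == null || tok.isEmpty() || tok.equals(")") || tok.equals("AND") || tok.equals("OR")) {
            throw new IllegalArgumentException("Expected a term or '(' but found: " + tok);
        }
        pos++;
        if (tok.equals("(")) {
            String inner = orExpr();
            if (!")".equals(peek())) {
                throw new IllegalArgumentException("Missing ')'");
            }
            pos++;
            return inner;
        }
        return tok;
    }

    public static void main(String[] args) {
        // Prints: AND(solr, OR(NEAR3(custom, parser), scala))
        System.out.println(parse("solr AND (custom W/3 parser OR scala)"));
    }
}
```

Each grammar rule is one small method, which keeps the precedence rules easy to unit test, in line with the "write tons of unit tests" lesson above.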
JavaCC vs. parboiled2 (Scala)
• JavaCC - SurroundQuery.jj
• Scala-based parboiled2
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, Quickbooks, and more
