Schema Training Mode

Abhishek Singh

How Solr can learn from your data to create a Schema automatically.

Schema Training Mode.
How to Train Your Schema?
Related JIRAs: SOLR-11741, SOLR-6939

WHAT IS A SCHEMA?
A mapping between each field name and its type.
Solr needs a schema to define which type each field
belongs to. Internally, it maps these types to Lucene
types.
FieldName   title    price   id
FieldType   String   Float   String

SCHEMALESS SOLR?
Solr can't function properly without a schema,
so it comes with a schemaless mode that builds a
schema in the background as you index.
However, it has its own problems.

Indexing Documents in Schemaless Mode
Doc 1:
{"title": "Fantastic Beasts", "price": 200, "distID": "2017-01-07"}
Doc 2:
{"title": "Train Your Dragon", "price": 22.3, "distID": "112-uuiw-0"}
Fails with errors:
Float not supported for fieldType Long!!
String not supported for fieldType Date!!
Types inferred from Doc 1:
FieldName      title    price   distID
TypeInferred   String   Long    Date
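
The failure above can be reproduced with a small simulation. This is only a sketch of the idea, not Solr's actual type-guessing code: the helper names guess_type and index_schemaless are hypothetical, and the rule "the first value seen locks the field type" is a simplifying assumption.

```python
from datetime import date


def guess_type(value):
    """Guess a field type from one value (simplified, hypothetical rules)."""
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # accepts e.g. "2017-01-07"
            return "Date"
        except ValueError:
            return "String"
    return "String"


def index_schemaless(docs):
    """Lock each field to the type of its first value, then report mismatches."""
    schema = {}  # field name -> type locked in by the first value seen
    errors = []
    for doc in docs:
        for field, value in doc.items():
            seen = guess_type(value)
            locked = schema.setdefault(field, seen)
            # Assumption: any differing type counts as a failure.
            if seen != locked:
                errors.append(f"{seen} not supported for fieldType {locked}")
    return schema, errors


docs = [
    {"title": "Fantastic Beasts", "price": 200, "distID": "2017-01-07"},
    {"title": "Train Your Dragon", "price": 22.3, "distID": "112-uuiw-0"},
]
schema, errors = index_schemaless(docs)
print(schema)
print(errors)
```

Because the schema is fixed by Doc 1 alone, Doc 2 is rejected even though a Float price and a String distID would have fit both documents.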

WHAT IF…
We could train our schema based on the data that we
have?

INTRODUCING…
Schema Training Mode
A set of APIs that let you create a schema by
learning from your data, without indexing it.

WHAT CAN IT DO?
Learn from the document stream and suggest the
following for every field:
  The most suitable FieldType
  SingleValued or MultiValued
Point out possible "type anomalies" in a document
stream.
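
A rough sketch of what such learning could look like follows. It is an illustration under stated assumptions, not the implementation from the JIRAs: the function names, the report layout, and the rule that an "anomaly" is any type seen less often than the dominant one are all hypothetical.

```python
from collections import Counter, defaultdict


def value_type(value):
    """Classify one value (simplified; dates are not detected here)."""
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    return "String"


def train(docs):
    """Scan documents without indexing them and build a per-field report."""
    type_counts = defaultdict(Counter)  # field -> how often each type occurred
    multi_valued = {}                   # field -> True once a list value is seen
    for doc in docs:
        for field, value in doc.items():
            is_list = isinstance(value, list)
            multi_valued[field] = multi_valued.get(field, False) or is_list
            for v in (value if is_list else [value]):
                type_counts[field][value_type(v)] += 1

    report = {}
    for field, counts in type_counts.items():
        seen = set(counts)
        if len(seen) == 1:
            field_type = next(iter(seen))
        elif seen <= {"Long", "Double"}:
            field_type = "Double"   # a Double field can hold both
        else:
            field_type = "String"   # fall back to the most general type
        dominant = counts.most_common(1)[0][0]
        report[field] = {
            "type": field_type,
            "multiValued": multi_valued[field],
            # Assumed definition: types other than the most frequent one.
            "anomalies": sorted(t for t in seen if t != dominant),
        }
    return report


docs = [
    {"title": "Fantastic Beasts", "price": 200, "tags": ["film", "fantasy"]},
    {"title": "Train Your Dragon", "price": 22.3, "tags": "animation"},
    {"title": "Paddington", "price": 15},
]
report = train(docs)
for field, suggestion in report.items():
    print(field, suggestion)
```

Here price is suggested as Double because both Long and Double values occur, and the less frequent Double value is listed as a possible anomaly for review.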

WHO NEEDS IT?
Multi-Tenant Search Platforms
Indexing Documents from Multiple Sources
Getting an Idea of Your Data
Getting started with SOLR

SCHEMA-TRAINING APIS
1. Get a Training ID:
   POST: /schema/train/start
   Response: <NewTrainingID>
2. Start Training:
   POST: /schema/train/<trainingID> -d [{f1:v1, f2:v2…}]
3. Get the Schema Trained So Far:
   GET: /schema/train/<trainingID>/trainedSchema
   Response: {Generated Schema}
4. Stop the Training:
   DELETE: /schema/train/<trainingID>
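
The four calls above could be driven from a client roughly as follows. This sketch only builds the requests and does not send them. The endpoints are the ones proposed on the slide and may not exist in a released Solr; the base URL, the collection name, and the training ID "abc123" are placeholders assumed for illustration.

```python
import json
from urllib.request import Request

# Hypothetical base URL; adjust host, port, and collection to your setup.
BASE = "http://localhost:8983/solr/mycollection"


def start_training():
    """Step 1: ask for a new training ID."""
    return Request(f"{BASE}/schema/train/start", method="POST")


def send_documents(training_id, docs):
    """Step 2: stream a batch of documents to the training session."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{BASE}/schema/train/{training_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_trained_schema(training_id):
    """Step 3: fetch the schema learned so far."""
    return Request(f"{BASE}/schema/train/{training_id}/trainedSchema", method="GET")


def stop_training(training_id):
    """Step 4: end the training session."""
    return Request(f"{BASE}/schema/train/{training_id}", method="DELETE")


training_id = "abc123"  # placeholder; the real ID would come from step 1
requests = [
    start_training(),
    send_documents(training_id, [{"title": "Fantastic Beasts", "price": 200}]),
    get_trained_schema(training_id),
    stop_training(training_id),
]
for req in requests:
    print(req.get_method(), req.full_url)
```

Against a server that actually exposes these endpoints, each request could be sent with urllib.request.urlopen(req).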

FIELD-HIERARCHY-TREE
[Figure: field-type hierarchy tree arranged by level, with nodes String, Double, Long, Boolean, and Date]
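
One way a field-type hierarchy can be used is to widen a field's type to the nearest common ancestor of all the types observed for it. The sketch below assumes a particular tree shape (String at the root, Double, Boolean, and Date beneath it, Long beneath Double); the exact shape of the tree in the original figure is not recoverable from the slide, and the names PARENT, ancestors, and widen are hypothetical.

```python
from functools import reduce

# Assumed hierarchy: child -> parent (String is the root).
PARENT = {
    "Long": "Double",
    "Double": "String",
    "Boolean": "String",
    "Date": "String",
    "String": None,
}


def ancestors(field_type):
    """Return the type followed by its ancestors, up to the root."""
    chain = []
    while field_type is not None:
        chain.append(field_type)
        field_type = PARENT[field_type]
    return chain


def widen(a, b):
    """Return the most specific type that can hold both a and b."""
    a_chain = ancestors(a)
    for candidate in ancestors(b):
        if candidate in a_chain:
            return candidate
    raise ValueError(f"no common ancestor for {a} and {b}")


# Types observed for one field across a document stream.
observed = ["Long", "Long", "Double"]
print(reduce(widen, observed))
print(widen("Long", "Date"))
```

Under this assumed tree, a field that sees Long and Double values settles on Double, while a field that sees Long and Date values falls back to String.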

TO DO…
Replay the data internally to index it.
Learn from the queries and suggest:
  Most suitable field types (string field vs. text field)
  DocValues: true vs. false
  Stored: true vs. false
  Default search fields (qf)
