Big Search w/ Big Data Principles
Lucene Revolution 2012
Eric Pugh | epugh@o19s.com | @dep4b
What is Big Search?
Who am I?
• Principal of OpenSource Connections - Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
AUTHOR (2nd edition!)
AGILISTA
Telling some stories

• Prototyping
• Application Development
• Maintaining Your Cluster
Not an intro to Cloud Computing
• See Indexing Big Data on Amazon AWS by Scott Stults @ 1:15 Thursday
• See How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud by Seshu Simhadi @ 2:55 Thursday
Not an intro to SolrCloud!
• See How SolrCloud Changes the User Experience In a Sharded Environment by Erick Erickson @ 2:55 Today
• See Solr 4: The SolrCloud Architecture by Mark Miller @ 10:45 Tomorrow
My Assumptions for Client X
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• Limited selection of tools available.
• Aggressive timeline.
• All the data must be searched per query.
• On Solr 3.x line
Telling some stories

• Prototyping
• Application Development
• Maintaining Your Cluster
Boy meets Girl Story
[Diagram: Metadata and Content Files flow into the Ingest Pipeline, which feeds four Solr instances.]
Bash Rocks
• Remote Solr stop/start scripts
• Remote Indexer stop/start scripts
• Performance Monitoring
• Content Extraction scripts (+Java)
• Ingestor Scripts (+Java)
• Artifact Deployment (CM)
Make it easy to change sharding
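public void run(Map options, List<SolrInputDocument> docs)
        throws InstantiationException, IllegalAccessException, ClassNotFoundException {
    // Load the strategy by class name so the sharding scheme can be swapped
    // without changing this method.
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);

    // Hand every document to the strategy, which decides where it is indexed.
    for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
    }
}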
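The class name on the slide suggests routing by a modulus over the shard count. Below is a minimal sketch of that idea. It assumes document ids are strings and that hashing the id is an acceptable way to spread documents. `ModShardRouter` and `shardFor` are hypothetical names, not the actual `ModShardIndexStrategy` code.

```java
/** Hypothetical sketch: pick one of N shards from a document id. */
final class ModShardRouter {
    private final int shardCount;

    ModShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns a shard index in [0, shardCount); floorMod keeps negative hash codes in range. */
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ModShardRouter router = new ModShardRouter(4);
        for (String id : new String[] {"DOC45242", "DOC45243", "DOC45244"}) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
    }
}
```

Changing the shard count changes where most ids land, which is one reason to keep the strategy behind an interface and expect to reindex when the layout changes.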
Separate JVM from Solr Cores
• Step 1: Fire up empty Solrs on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2)
• Step 4: Create an “aggregator” core, passing in URLs of Cores (&property.shards=).
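Steps 3 and 4 are easy to script. The sketch below builds the comma-separated shard list and the create URL for the aggregator core. The URL shape is copied from the slide and may differ on your install. The `run2_N` core naming, the `AggregatorUrls` class and its method names are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for steps 3 and 4; all names are illustrative. */
final class AggregatorUrls {

    /** Comma-separated core URLs: base/run_1,base/run_2,... */
    static String shardsValue(String solrBaseUrl, String runName, int coreCount) {
        List<String> urls = new ArrayList<String>();
        for (int i = 1; i <= coreCount; i++) {
            urls.add(solrBaseUrl + "/" + runName + "_" + i);
        }
        return String.join(",", urls);
    }

    /** Create-core URL in the shape shown on the slide, with the shard list URL-encoded. */
    static String createAggregatorUrl(String solrBaseUrl, String coreName, String shards) {
        try {
            return solrBaseUrl + "/admin?action=create&name=" + coreName
                    + "&property.shards=" + URLEncoder.encode(shards, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String base = "http://search1.o19s.com:8983/solr";
        String shards = shardsValue(base, "run2", 3);
        System.out.println(createAggregatorUrl(base, "run2_aggregator", shards));
    }
}
```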
Go Wide Quickly
[Diagram: search1.o19s.com, search2.o19s.com and search3.o19s.com each run one or more Solr instances on ports such as :8983, :8984 and :8985, and each instance hosts a stack of shards, from shard1 up to shard8 or shard12.]
Simple Pipeline
• Simple pipeline
• mv is atomic
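Why the atomic `mv` matters: within one filesystem a rename either has happened or has not, so the next stage never picks up a half-written file. Here is a minimal Java sketch of the same handoff. It assumes a `work` and a `ready` directory on the same filesystem. Those directory names and the `AtomicHandoff` class are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical pipeline handoff: write privately, then rename into the ready directory. */
final class AtomicHandoff {

    /** Writes content to a temp file in workDir, then atomically moves it into readyDir. */
    static Path publish(Path workDir, Path readyDir, String name, String content)
            throws IOException {
        Path tmp = workDir.resolve(name + ".tmp");
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE fails with AtomicMoveNotSupportedException rather than
        // silently falling back to copy-and-delete (for example across filesystems).
        return Files.move(tmp, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("pipeline");
        Path work = Files.createDirectory(base.resolve("work"));
        Path ready = Files.createDirectory(base.resolve("ready"));
        Path out = publish(work, ready, "batch-0001.xml", "<add/>");
        System.out.println("published " + out);
    }
}
```

A consumer that only ever lists the `ready` directory sees complete files or nothing.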
Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
Can you test your changes?
JVM tuning is black art
-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
Run, don’t Walk
Telling some stories

• Prototyping
• Application Development
• Maintaining Your Cluster
Grab some Data
#!/bin/sh
# Source: pull up to 500,000 documents from a remote core as CSV.
SOURCE_SOLR='http://ec2-107-20-92-190.compute-1.amazonaws.com:8983/solr/core0/select?q=*%3A*&start=0&rows=500000&wt=csv'

# Target: the CSV update handler of the local core.
TARGET_SOLR='http://localhost:8983/solr/us_patent_grant/update/csv'

wget -O output.csv "$SOURCE_SOLR"

# Load the CSV into the target core, then commit and optimize.
curl "$TARGET_SOLR?skipLines=1&commit=true&optimize=true" --data-binary @output.csv -H 'Content-type:text/plain; charset=utf-8'
Using Solr as a key/value store
• thousands of queries per second without real time get.
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• ??? with real time get?
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
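The two lookups differ only in handler and parameter name, which a small helper makes explicit. This is a sketch that mirrors the two URLs on the slide. `KeyValueLookup` and its methods are hypothetical names. It assumes ids contain only letters and digits, so no query escaping or URL encoding is applied.

```java
/** Hypothetical URL builder for key/value style lookups against a Solr core. */
final class KeyValueLookup {

    /** Lookup through the normal search handler: q=id:<id>. */
    static String selectUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/select?q=id:" + id + "&fl=" + fields;
    }

    /** Lookup through the real time get handler: id=<id>. */
    static String realTimeGetUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/get?id=" + id + "&fl=" + fields;
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/run2_enrichment";
        System.out.println(selectUrl(core, "DOC45242", "entities,html"));
        System.out.println(realTimeGetUrl(core, "DOC45242", "entities,html"));
    }
}
```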
[Diagram: a Solr Key/Value Cache sits alongside the Ingest Pipeline, which takes Metadata and Content Files and feeds four Solr instances.]
Push schema definition to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like id, type, timestamp, version
• Nothing required.
q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
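A field name like `intensity_i` hints at how the application can own the schema: it picks a typed suffix and lets a dynamic field rule in Solr do the rest. The sketch below assumes the Solr config declares dynamic fields with the suffix conventions of Solr's example schema (`*_i`, `*_l`, `*_f`, `*_d`, `*_b`, `*_s`). `DynamicFieldNamer` is a hypothetical name and this is not the talk's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: the application chooses field names by value type. */
final class DynamicFieldNamer {

    /** Appends a type suffix; assumes matching dynamic field rules exist in the Solr schema. */
    static String fieldName(String baseName, Object value) {
        if (value instanceof Integer) return baseName + "_i";
        if (value instanceof Long) return baseName + "_l";
        if (value instanceof Float) return baseName + "_f";
        if (value instanceof Double) return baseName + "_d";
        if (value instanceof Boolean) return baseName + "_b";
        if (value instanceof String) return baseName + "_s";
        throw new IllegalArgumentException("no suffix for " + value.getClass().getName());
    }

    public static void main(String[] args) {
        // A streetlamp reading as the application sees it.
        Map<String, Object> reading = new LinkedHashMap<String, Object>();
        reading.put("intensity", 70);
        reading.put("location", "Main St");

        // The same reading with Solr field names chosen by the application.
        Map<String, Object> doc = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, Object> e : reading.entrySet()) {
            doc.put(fieldName(e.getKey(), e.getValue()), e.getValue());
        }
        System.out.println(doc);
    }
}
```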
Don’t do expensive things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
Avro!
• Supports serialization of data readable from multiple languages
• It’s smart XML
• Handles forward and backward versions of an object
• Compact and fast to read.
[Diagram: the Ingest Pipeline, fed by Metadata and Content Files and backed by the Solr Key/Value Cache, hands .avro files to four Solr instances.]
No JavaBin
Give me /update/avro!
• Avoid Jarmageddon
• Reflection? Ugh.
[Diagram: Metadata and Content Files feed the Ingest Pipeline, which talks to a Solr Key/Value Cache on Solr 3.4 and to the search Solr instances on Solr 4, leaving the pipeline to ask: Which SolrJ version do I use?]
Telling some stories

• Prototyping
• Application Development
• Maintaining Your Cluster
Upgrade Lucene Indexes Easily
• Don’t reindex!
• Try out new features!
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
Indexing is Easy and Quick
CHEAP AND CHEERFUL
NRT versus BigData
The tension between scale and update rate
[Diagram: a scale running from under 10 million to 100's of millions; the stretch in between is the "Bad Place to Be".]
Grim Reaper
Delayed Replication
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/replication</str>
    <str name="pollInterval">36:00:00</str>
  </lst>
</requestHandler>
Enable/Disable
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">MY HARD QUERY</str>
    <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
Enable/Disable

• SOLR-3301
Provisioning

• Chef/Puppet
• ZooKeeper
• Have you versioned everything to build an
  index?
TYPICAL ENVIRONMENT
FLEXIBLE ENVIRONMENT
Do I need Failover?

• Can I build quickly?
• Do I have a reliable cluster?
• Am I spread across data centers?
• Is sooo 90’s....
Telling some stories

• Prototyping
• Application Development
• Maintaining Your Cluster
Some Other Thoughts
Don’t be Mesmerized
Scientific Method
Thank you!

• epugh@o19s.com
• @dep4b
• www.opensourceconnections.com

More Related Content

What's hot

Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Lucidworks
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Lucidworks
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloudVarun Thacker
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4thelabdude
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environmentlucenerevolution
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksShalin Shekhar Mangar
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesDataWorks Summit
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Electionravikgiitk
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyCominvent AS
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in ScalaAlex Payne
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...InfluxData
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Spark Summit
 
Constructing Web APIs with Rack, Sinatra and MongoDB
Constructing Web APIs with Rack, Sinatra and MongoDBConstructing Web APIs with Rack, Sinatra and MongoDB
Constructing Web APIs with Rack, Sinatra and MongoDBOisin Hurley
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 

What's hot (20)

Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloud
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
 
Scaling search with SolrCloud
Scaling search with SolrCloudScaling search with SolrCloud
Scaling search with SolrCloud
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networks
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Election
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoy
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
Constructing Web APIs with Rack, Sinatra and MongoDB
Constructing Web APIs with Rack, Sinatra and MongoDBConstructing Web APIs with Rack, Sinatra and MongoDB
Constructing Web APIs with Rack, Sinatra and MongoDB
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 

Viewers also liked

5 Factors Impacting Your Big Data Project’s Performance
5 Factors Impacting Your Big Data Project’s Performance5 Factors Impacting Your Big Data Project’s Performance
5 Factors Impacting Your Big Data Project’s PerformanceTyrone Systems
 
Chart of the week- 11th November 2016 - no Brexit effect in the trade data
Chart of the week- 11th November 2016 - no Brexit effect in the trade dataChart of the week- 11th November 2016 - no Brexit effect in the trade data
Chart of the week- 11th November 2016 - no Brexit effect in the trade dataKitty Ussher
 
A7 getting value from big data how to get there quickly and leverage your c...
A7   getting value from big data how to get there quickly and leverage your c...A7   getting value from big data how to get there quickly and leverage your c...
A7 getting value from big data how to get there quickly and leverage your c...Dr. Wilfred Lin (Ph.D.)
 
C6 deploying applications to your private cloud 7 to 10 times faster
C6   deploying applications to your private cloud 7 to 10 times fasterC6   deploying applications to your private cloud 7 to 10 times faster
C6 deploying applications to your private cloud 7 to 10 times fasterDr. Wilfred Lin (Ph.D.)
 
Big Data: Implications for Marketing and Strategy
Big Data: Implications for Marketing and StrategyBig Data: Implications for Marketing and Strategy
Big Data: Implications for Marketing and StrategyC.K. Kumar
 
Large-scale digitisation options at the Natural History Museum, London.
Large-scale digitisation options at the Natural History Museum, London.Large-scale digitisation options at the Natural History Museum, London.
Large-scale digitisation options at the Natural History Museum, London.Vince Smith
 
Methodological principles in dealing with Big Data, Reijo Sund
Methodological principles in dealing with Big Data, Reijo SundMethodological principles in dealing with Big Data, Reijo Sund
Methodological principles in dealing with Big Data, Reijo SundTilastokeskus
 
Analytics et Big Data, une histoire de cubes...
Analytics et Big Data, une histoire de cubes...Analytics et Big Data, une histoire de cubes...
Analytics et Big Data, une histoire de cubes...Mathias Kluba
 
Privacy in a digital world
Privacy in a digital worldPrivacy in a digital world
Privacy in a digital worldrobkitchin
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsDatameer
 
Case study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANCase study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANDataWorks Summit/Hadoop Summit
 
Hadoop Hbase - Introduction
Hadoop Hbase - IntroductionHadoop Hbase - Introduction
Hadoop Hbase - IntroductionBlandine Larbret
 
Chapter 02 The Internet
Chapter 02 The InternetChapter 02 The Internet
Chapter 02 The Internetxtin101
 
Chapter 06 Inside Computers and Mobile Devices
Chapter 06 Inside Computers and Mobile DevicesChapter 06 Inside Computers and Mobile Devices
Chapter 06 Inside Computers and Mobile Devicesxtin101
 
Chapter 05 Digital Safety and Security
Chapter 05 Digital Safety and SecurityChapter 05 Digital Safety and Security
Chapter 05 Digital Safety and Securityxtin101
 
Research on data journalism: What is there to investigate? Insights from a st...
Research on data journalism: What is there to investigate? Insights from a st...Research on data journalism: What is there to investigate? Insights from a st...
Research on data journalism: What is there to investigate? Insights from a st...Julian Ausserhofer
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detectionMk Kim
 

Viewers also liked (20)

5 Factors Impacting Your Big Data Project’s Performance
5 Factors Impacting Your Big Data Project’s Performance5 Factors Impacting Your Big Data Project’s Performance
5 Factors Impacting Your Big Data Project’s Performance
 
Chart of the week- 11th November 2016 - no Brexit effect in the trade data
Chart of the week- 11th November 2016 - no Brexit effect in the trade dataChart of the week- 11th November 2016 - no Brexit effect in the trade data
Chart of the week- 11th November 2016 - no Brexit effect in the trade data
 
Git Internals
Git InternalsGit Internals
Git Internals
 
Chapter 02
Chapter 02Chapter 02
Chapter 02
 
A7 getting value from big data how to get there quickly and leverage your c...
A7   getting value from big data how to get there quickly and leverage your c...A7   getting value from big data how to get there quickly and leverage your c...
A7 getting value from big data how to get there quickly and leverage your c...
 
C6 deploying applications to your private cloud 7 to 10 times faster
C6   deploying applications to your private cloud 7 to 10 times fasterC6   deploying applications to your private cloud 7 to 10 times faster
C6 deploying applications to your private cloud 7 to 10 times faster
 
Big Data: Implications for Marketing and Strategy
Big Data: Implications for Marketing and StrategyBig Data: Implications for Marketing and Strategy
Big Data: Implications for Marketing and Strategy
 
Large-scale digitisation options at the Natural History Museum, London.
Large-scale digitisation options at the Natural History Museum, London.Large-scale digitisation options at the Natural History Museum, London.
Large-scale digitisation options at the Natural History Museum, London.
 
Methodological principles in dealing with Big Data, Reijo Sund
Methodological principles in dealing with Big Data, Reijo SundMethodological principles in dealing with Big Data, Reijo Sund
Methodological principles in dealing with Big Data, Reijo Sund
 
Analytics et Big Data, une histoire de cubes...
Analytics et Big Data, une histoire de cubes...Analytics et Big Data, une histoire de cubes...
Analytics et Big Data, une histoire de cubes...
 
Privacy in a digital world
Privacy in a digital worldPrivacy in a digital world
Privacy in a digital world
 
Fight Fraud with Big Data Analytics
Fight Fraud with Big Data AnalyticsFight Fraud with Big Data Analytics
Fight Fraud with Big Data Analytics
 
Case study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPANCase study of online machine learning for display advertising in Yahoo! JAPAN
Case study of online machine learning for display advertising in Yahoo! JAPAN
 
Hadoop Hbase - Introduction
Hadoop Hbase - IntroductionHadoop Hbase - Introduction
Hadoop Hbase - Introduction
 
Chapter 02 The Internet
Chapter 02 The InternetChapter 02 The Internet
Chapter 02 The Internet
 
Chapter 06 Inside Computers and Mobile Devices
Chapter 06 Inside Computers and Mobile DevicesChapter 06 Inside Computers and Mobile Devices
Chapter 06 Inside Computers and Mobile Devices
 
Chapter 05 Digital Safety and Security
Chapter 05 Digital Safety and SecurityChapter 05 Digital Safety and Security
Chapter 05 Digital Safety and Security
 
Research on data journalism: What is there to investigate? Insights from a st...
Research on data journalism: What is there to investigate? Insights from a st...Research on data journalism: What is there to investigate? Insights from a st...
Research on data journalism: What is there to investigate? Insights from a st...
 
Bigdata based fraud detection
Bigdata based fraud detectionBigdata based fraud detection
Bigdata based fraud detection
 
Social media with big data analytics
Social media with big data analyticsSocial media with big data analytics
Social media with big data analytics
 

Similar to Big Search with Big Data Principles

ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataOpenSource Connections
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Shalin Shekhar Mangar
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesRahul Singh
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesAnant Corporation
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsAnshum Gupta
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0Anshum Gupta
 
Stardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF DatabaseStardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF DatabaseClark & Parsia LLC
 
Stardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF DatabaseStardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF Databasekendallclark
 
Apache Solr - search for everyone!
Apache Solr - search for everyone!Apache Solr - search for everyone!
Apache Solr - search for everyone!Jaran Flaath
 
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchBigData Faceted Search Comparison between Apache Solr vs. ElasticSearch
BigData Faceted Search Comparison between Apache Solr vs. ElasticSearchNetConstructor, Inc.
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst AgainVarun Thacker
 

Similar to Big Search with Big Data Principles (20)

ApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big DataApacheCon Europe 2012 -Big Search 4 Big Data
ApacheCon Europe 2012 -Big Search 4 Big Data
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
Big Search with Big Data Principles

  • 1. Big Search w/ Big Data Principles LuceneRevolution 2012 Eric Pugh | epugh@o19s.com | @dep4b
  • 2. What is Big Search?
  • 3. Who am i? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  • 4. 2nd edition! AUTHOR
  • 6. Telling some stories • Prototyping • Application Development • Maintaining Your Cluster
  • 7. Not an intro to cloud Computing • See Indexing Big Data on Amazon AWS by Scott Stults @ 1:15 Thursday • See How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud by Seshu Simhadi @ 2:55 Thursday
  • 8. Not an intro to SolrCloud! • See How SolrCloud Changes the User Experience In a Sharded Environment by Erick Erickson @ 2:55 Today • See Solr 4: The SolrCloud Architecture by Mark Miller @ 10:45 Tomorrow
  • 9. My Assumptions for Client X • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • Limited selection of tools available. • Aggressive timeline. • All the data must be searched per query. • On Solr 3.x line
  • 10. Telling some stories • Prototyping • Application Development • Maintaining Your Cluster
  • 11. Boy meets Girl Story [Diagram labels: Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 13. Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
  • 14. Make it easy to change sharding
  • 15. Make it easy to change sharding
        public void run(Map options, List<SolrInputDocument> docs) throws
                InstantiationException, IllegalAccessException, ClassNotFoundException {
            IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
                    "com.o19s.solr.ModShardIndexStrategy").newInstance();
            indexStrategy.configure(options);
            for (SolrInputDocument doc : docs) {
                indexStrategy.addDocument(doc);
            }
        }
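Slide 15 shows only the caller: an `IndexStrategy` is loaded by class name, configured, and handed each document. The slides do not show the interface or the body of `ModShardIndexStrategy`, so the following is a minimal sketch of what a modulo-based strategy could look like, not the talk's implementation. The names `IndexStrategySketch`, `ModShardSketch`, and the `shardCount` option are invented for illustration, and plain string IDs stand in for `SolrInputDocument` so the example runs without SolrJ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the talk's IndexStrategy; the real interface is not shown in the slides.
interface IndexStrategySketch {
    void configure(Map<String, String> options);
    void addDocument(String docId);
}

// Assumed behavior: route each document to shard number (hash(id) mod shardCount).
class ModShardSketch implements IndexStrategySketch {
    private final List<List<String>> shards = new ArrayList<>();

    @Override
    public void configure(Map<String, String> options) {
        // "shardCount" is an invented option name for this sketch.
        int shardCount = Integer.parseInt(options.getOrDefault("shardCount", "1"));
        shards.clear();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
    }

    // floorMod keeps the result non-negative even when hashCode() is negative.
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    @Override
    public void addDocument(String docId) {
        // A real strategy would send the document to that shard's Solr core here.
        shards.get(shardFor(docId)).add(docId);
    }

    List<String> shard(int index) {
        return shards.get(index);
    }
}
```

Because the shard count comes from configuration, trying a different shard size means changing one option and re-running the ingest, which is the point the slide title makes.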
  • 16. Separate JVM from Solr Cores • Step 1: Fire up empty Solrs on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
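Step 4 of slide 16 passes the aggregator core a list of core URLs, and slide 49 shows such a list as a comma-separated `shards` value. A small sketch of generating that value is below. The `runName_N` core naming follows the `run2_1`, `run2_2` pattern visible on slide 49; the class name, method name, and the idea of a fixed number of cores per server are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ShardsParamSketch {
    // Builds the comma-separated value for a "shards" parameter from a list of
    // Solr base URLs, assuming each server hosts cores named <runName>_1..<runName>_N.
    static String shardsParam(List<String> solrBaseUrls, String runName, int coresPerServer) {
        List<String> cores = new ArrayList<>();
        for (String base : solrBaseUrls) {
            for (int i = 1; i <= coresPerServer; i++) {
                cores.add(base + "/" + runName + "_" + i);
            }
        }
        return String.join(",", cores);
    }
}
```

Generating the list from one place avoids hand-editing long URL strings whenever the number of servers or cores changes.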
  • 18. [Diagram labels: search1.o19s.com, search2.o19s.com, search3.o19s.com; ports :8983, :8984, :8985; shard1, shard8, shard12]
  • 19. Simple Pipeline • Simple pipeline • mv is atomic
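Slide 19 relies on `mv` being atomic: a pipeline stage can write a file somewhere private and then rename it into the next stage's directory, so the consumer never picks up a half-written file. The same handoff can be sketched in Java with `Files.move` and `ATOMIC_MOVE`. The class name and the staging/inbox directory layout are assumptions for illustration; the slides do not describe the actual directory structure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicHandoffSketch {
    // Writes into a staging directory, then renames into the inbox in a single step,
    // so a reader polling the inbox never sees a partially written file.
    static Path publish(Path stagingDir, Path inboxDir, String name, String content) throws IOException {
        Path staged = stagingDir.resolve(name + ".tmp");
        Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the rename cannot be atomic,
        // for example when the two directories are on different file systems.
        return Files.move(staged, inboxDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

As with `mv`, the atomicity only holds when source and target sit on the same file system, so the staging and inbox directories should share a mount.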
  • 20. Don’t Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here. • ZooKeeper shines here.
  • 21. Can you test your changes?
  • 22. JVM tuning is black art -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
  • 23.
  • 25. Telling some stories • Prototyping • Application Development • Maintaining Your Cluster
  • 26. Grab some Data
        #!/bin/sh
        SOURCE_SOLR='http://ec2-107-20-92-190.compute-1.amazonaws.com:8983/solr/core0/select?q=*%3A*&start=0&rows=500000&wt=csv'
        TARGET_SOLR=http://localhost:8983/solr/us_patent_grant/update/csv
        wget -O output.csv $SOURCE_SOLR
        curl 'http://localhost:8983/solr/us_patent_grant/update/csv?skipLines=1&commit=true&optimize=true' --data-binary @output.csv -H 'Content-type:text/plain; charset=utf-8'
  • 27. Using Solr as a key/value store • thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • ??? with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
  • 28. Using Solr as key/value store [Diagram labels: Solr Key/Value Cache; Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 29. Using Solr as key/value store • thousands of queries per second without real time get. http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • ??? with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
  • 30. Push schema definition to the application • Not “schema less” • Just different owner of schema! • Schema may have common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
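Slide 30 moves ownership of the schema to the application, and its example query uses a suffixed field name (`intensity_i`). One way an application can own the schema is to derive field names from value types using dynamic-field suffixes. The sketch below assumes the suffix conventions from Solr's example schema (`_i`, `_l`, `_d`, `_b`, `_s`) are configured as dynamic fields; the class and method names are invented, and the slides do not say how Client X's application actually named its fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicFieldNamerSketch {
    // Picks a dynamic-field suffix from the Java type of the value.
    // Anything unrecognized falls back to a string field.
    static String fieldName(String name, Object value) {
        if (value instanceof Integer) return name + "_i";
        if (value instanceof Long) return name + "_l";
        if (value instanceof Double) return name + "_d";
        if (value instanceof Boolean) return name + "_b";
        return name + "_s";
    }

    // Renames every key in an application record to its suffixed Solr field name.
    static Map<String, Object> toSolrFields(Map<String, Object> record) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            fields.put(fieldName(e.getKey(), e.getValue()), e.getValue());
        }
        return fields;
    }
}
```

With this in place a new record type needs no schema change in Solr: the application decides which fields exist, and the suffix tells Solr how to index them.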
  • 31. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 32. Don’t do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
  • 33. Avro! • Supports serialization of data readable from multiple languages • It’s smart XML • Handles forward and reverse versions of an object • Compact and fast to read.
  • 34. Avro! [Diagram labels: Solr Key/Value Cache; .avro; Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 35. No JavaBin • Give me /update/avro! • Avoid Jarmaggeddon • Reflection? Ugh.
  • 36. No JavaBin [Diagram labels: Solr Key/Value Cache; Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 37. No JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4); Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 38. No JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4); Solr 4; Metadata; Content Files; Ingest Pipeline; Solr ×4]
  • 39. No JavaBin [Diagram labels: Solr Key/Value Cache (Solr 3.4); Solr 4; Metadata; Content Files; Ingest Pipeline; Solr ×4] Which SolrJ version do I use?
  • 40. Telling some stories • Prototyping • Application Development • Maintaining Your Cluster
  • 41. Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new features! David Lyle java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  • 42. Indexing is Easy and Quick
  • 45. The tension between scale and update rate Bad Place to Be > 100,000,000 < 10,000,000
  • 46. The tension between scale and update rate 10 million Bad Place 100’s of millions
  • 48. Delayed Replication
        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
            <str name="masterUrl">http://localhost:8983/solr/replication</str>
            <str name="pollInterval">36:00:00</str>
          </lst>
        </requestHandler>
  • 49. Enable/Disable
        <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
          <lst name="invariants">
            <str name="q">MY HARD QUERY</str>
            <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
          </lst>
          <lst name="defaults">
            <str name="echoParams">all</str>
          </lst>
          <str name="healthcheckFile">server-enabled.txt</str>
        </requestHandler>
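The `healthcheckFile` setting on slide 49 ties the ping handler's answer to the presence of a file: while `server-enabled.txt` exists the node reports healthy, and removing it lets a load balancer take the node out of rotation without stopping Solr. A minimal sketch of toggling that file from an operations tool is below. The class name is invented, and the directory Solr resolves the file against depends on the Solr version and configuration, so the `dir` argument here is an assumption to check against your own setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class HealthcheckToggleSketch {
    private final Path healthcheckFile;

    // dir is assumed to be the directory where Solr looks for the healthcheck file.
    HealthcheckToggleSketch(Path dir) {
        // File name taken from the slide's healthcheckFile setting.
        this.healthcheckFile = dir.resolve("server-enabled.txt");
    }

    // Puts the node back into rotation; does nothing if it is already enabled.
    void enable() throws IOException {
        if (!Files.exists(healthcheckFile)) {
            Files.createFile(healthcheckFile);
        }
    }

    // Takes the node out of rotation; does nothing if it is already disabled.
    void disable() throws IOException {
        Files.deleteIfExists(healthcheckFile);
    }

    boolean isEnabled() {
        return Files.exists(healthcheckFile);
    }
}
```

Both operations are safe to repeat, which matters when the toggle is driven by a script that may run more than once.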
  • 51. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index?
  • 54. Do I need Failover? • Can I build quickly? • Do I have a reliable cluster? • Am I spread across data centers? • Is sooo 90’s....
  • 55. Telling some stories • Prototyping • Application Development • Maintaining Your Cluster
  • 59.
  • 60. Thank you! • epugh@o19s.com • @dep4b • www.opensourceconnections.com

Editor's Notes

  2. Search was the original big data problem. Then Google search came along, and search wandered in the wilderness of internal enterprise search and ecommerce search. But now search is back, with a new, cooler name: “Big Data”. Search interfaces are the dominant metaphor for non-data-scientists working with big data sets.
  3. SOLR-284 back in July 07 was a first cut at a content extraction library before Tika came along.
  5. And I love Agile development processes. And I think of agile as business -> requirements -> development -> testing -> systems administration.
  12. And I don’t mean this as a shot against Hadoop, but with the right hardware, you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today building large, scaled-out ingestors.
  14. Notice our property style? Made it easy to read in properties in both Bash and Java!
  15. Try sharding at different sizes using Mod. Try sharding by month, or week, or hour, depending on your volume of data.
  18. We had huge leftover “enterprise” boxes with ginormous amounts of RAM and CPU.
  22. The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
  23. Again, horse racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
  26. I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don’t forget to store everything!)
  31. You have far fewer Solrs than you do indexer processes.
  41. Dollar Tree makes crap. Stores are always empty or missing items. You don’t want your indexing to be like that. The Space Shuttle cost 500 million dollars every time it launched. You don’t want your indexing process to be like launching the Space Shuttle.
  45. Runs every hour. Looks at log files to determine whether a Solr cluster is misbehaving.
  46. HAL 9000 misbehaved. Runs every hour. Looks at log files to determine whether a Solr cluster is misbehaving. Especially if you are on a cloud platform: they implement their servers on the cheapest commodity hardware.
  57. Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd, among others, has said that Big Data sometimes throws out thousands of years
  59. Nathan Marz