SlideShare a Scribd company logo
BUILDING A LIGHTWEIGHT
DISCOVERY INTERFACE FOR
CHINESE PATENTS
!
New York Solr/Lucene Meetup
ERIC PUGH | epugh@o19s.com | @dep4b
Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary	

• Member of Apache Software
Foundation	

• SOLR-284 UpdateRichDocuments
(July 07)	

• Fascinated by the art of software
development
Co-Author
N
extEdition
June!
Congrats to
Trey and Tim!
Agilista
Selected Customers
Telling some stories
war	

^
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup
• First USPTO application in
“the cloud”	

• Simple, and discoverable	

• Expresses our philosophy of
“Cloud meets Ocean”	

!
• Check it out at http://
gpsn.uspto.gov
Telling some stories
➡How to inject “Discovery” into your
app	

• The Cloud to the Rescue (sorta!)	

• Parsers and Parsers and Parsers	

• Don’t be Afraid to Share!
Flow of understanding
Data UnderstandingInformation
Building “Discovery”
Engine
UX DataTension
Grok data at gut level	

Look for outliers	

!
User Interviews	

Surveys	

Card Sorting	

Scenarios/Personas	

!
UX
Data
brainstorm
Mockups	

Proof of concept	

!
!
Where to spend time?
UX
Engine
Data
40%	

!
20%	

!
40%	

!
40%	

!
40%	

!
20%	

We spent	

!
Telling some stories
• How to inject “Discovery” into your app	

➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers	

• Don’t be Afraid to Share!
Boy meets Girl Story
Boy meets Girl Story
Metadata
Ingest	

Pipeline	

Discovery
UX
Content
Files
Nothing but JS and
Solr!
• Updates are quarterly	

• User state in browser	

• Solr is the “RESTful” API ;-)	

• KISS: EmberJS + Solr
How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootsrap CSS)
Browsers
Mobile/
Tablet
Third Party
Application
Servers
S3 BucketSolr
Yes, Solr is hangout out
there on the Net…
• Using Jetty container security to lock down
everything but the /select handler.	

• Yes, the /admin interface appears to load,
but no panels load.	

• Go ahead, do a delete query! I dare you.
Actually, please don’t. ;-)
Single 550 GB index
• Solr + Index are in a Amazon AMI image. 	

• Currently running two independent Solrs.	

• Optimize works! Still.	

• Elastic Load Balancer + AutoScale spins up
more Solr’s if needed.	

• Threw lots of “provisioned IOPS” atVM
A better
security proxy
from Alex?
https://github.com/
dergachev/solr-
security-proxy
Spyglass
• EmberJS based Widget framework	

• List of Results	

• Facets	

• Autocomplete	

• “Deploy” is just .html + .js. S3 bucket!	

• Tooling is a pain. EmberJS is complex!
BetterthenAjaxSolr!
Daniel Beach’s
project
https://github.com/
o19s/spyglass
Key scaling concept
behind GPSN:	

!
Cloud meets Ocean
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup
More prosaically…
Database
Server
Server
Server
Client
Client
Client
$
$
$
$
Lessons Learned
Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.	

• We used S3Funnel which is “rsync like”	

• We bought more network bandwidth for
our office
Never
underestimate
the bandwidth of
a station wagon
full of tapes
hurtling down
the highway.

–Andrew Tanenbaum, 1981
Data Size
0
250000
500000
750000
1000000
1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011
Patent Count
277871
Think about DataVolume
• Started with older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce nice,
need more visibility into progress..	

• Should have sharded our Search Index from the beginning
just to make indexing faster and cheaper process (500 gb
index!)	

• 8 shards dropped time from 12 hours to 2 hours.
Merging took 5!	

• We had too many steps in our pipeline
Building	
  a	
  Patents	
  Index
MachineCount
0
75
150
225
300
5 days 3 days 30 Minutes
1 5
300
Telling some stories
• How to inject “Discovery” into your app	

• The Cloud to the Rescue (sorta!)	

➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
Why so many pipelines?
Morphlines
Tika as a pipeline?
Lot’s of File Types
• Sometimes in ZIP archives, sometimes not!	

• multiple XML formats as well as CSV and
EDI	

• Purplebook,Yellowbook,
Redbook,Greenbook, Questel, SIPO…
Tika as a pipeline!
• Auto detects content type	

• Metadata structure has all the
key/value needed for Solr	

• Allows us to scale up with
Behemoth project (and
others!).
Lots of files!
HHHHHT APS1 ISSUE - 760106!
PATN!
WKU 039302717!
SRC 5!
APN 5328756!
APT 1!
ART 353!
APD 19741216!
TTL Golf glove!
ISD 19760106!
NCL 4!
ECL 1
<PatentGrant>!
<BibliographicData>!
<GrantIdentification>!
<DocumentKindCode>B1</DocumentKindCode>!
<GrantNumber>06644224</GrantNumber>!
<CountryCode>US</CountryCode>!
<IssueDateText>2003-11-11</IssueDateText>
Detector to pick File
public	
  class	
  GreenbookDetector	
  implements	
  Detector	
  {	
  
!
	
  	
  	
  	
  private	
  static	
  Pattern	
  pattern	
  =	
  Pattern.compile("PATN");	
  
	
  	
  	
  	
  	
  
	
  	
  	
  	
  @Override	
  
	
  	
  	
  	
  public	
  MediaType	
  detect(InputStream	
  stream,	
  Metadata	
  metadata)	
  throws	
  IOException	
  {	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  MediaType	
  type	
  =	
  MediaType.OCTET_STREAM;	
  
	
  	
  	
  	
  	
  	
  	
  	
  InputStream	
  lookahead	
  =	
  new	
  LookaheadInputStream(stream,	
  1024);	
  
	
  	
  	
  	
  	
  	
  	
  	
  String	
  extract	
  =	
  org.apache.commons.io.IOUtils.toString(lookahead,	
  "UTF-­‐8");	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  Matcher	
  matcher	
  =	
  pattern.matcher(extract);	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  if	
  (matcher.find())	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  type	
  =	
  GreenbookParser.MEDIA_TYPE;	
  
	
  	
  	
  	
  	
  	
  	
  	
  }	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  lookahead.close();	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  return	
  type;	
  
	
  	
  	
  	
  }	
  
	
  	
  	
  
}
Telling some stories
• How to inject “Discovery” into your app	

• The Cloud to the Rescue (sorta!)	

• Parsers and Parsers and Parsers	

➡Don’t be Afraid to Share!
Your Search solution
isn’t perfect
• Allow users to export data	

• Most business users want to work in Excel!
Accept it!	

• Allow other applications to build on top of
it.
GPSN has
• Lots of easy “Print to
PDF” options.	

• Data stored in S3 as:	

• individual patent files	

• chunky downloads.	

• Filtering to expand or
select specific data sets.	

• Permalinks: simple, very
sharable URLs.	

• Underlying Solr service
is exposed to public via
proxy. You can query
Solr yourself.	

• Need advance querying?
Use Lucene syntax in
search bar.
One more thought...
Measuring the impact
of our algorithms
changes is just getting
harder as we get
smarter.
www.quepid.com
Quepid: Give your Queries
some Love
W
e
need
betausers!
Thank you!
!
Questions?
• epugh@o19s.com	

• @dep4b	

• www.opensourceconnections.com	

• slideshare.com/o19s
Nervous about
speaking up? Ask
me later!

More Related Content

What's hot

Intro to Search
Intro to SearchIntro to Search
Intro to Search
Grant Ingersoll
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 
Indexing big data in the cloud
Indexing big data in the cloudIndexing big data in the cloud
Indexing big data in the cloud
OpenSource Connections
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
Grant Ingersoll
 
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
Lucidworks
 
Solr vs ElasticSearch
Solr vs ElasticSearchSolr vs ElasticSearch
Solr vs ElasticSearch
Dikshant Shahi
 
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, LucidworksSearching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Lucidworks
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
Grant Ingersoll
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Lucidworks
 
The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch plugins
Itamar
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
Josh Patterson
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
Grant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
Grant Ingersoll
 
Delhi elasticsearch meetup
Delhi elasticsearch meetupDelhi elasticsearch meetup
Delhi elasticsearch meetup
Bharvi Dixit
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
Josh Patterson
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
guest432cd6
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
Itamar
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easy
Itamar
 
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Lucidworks
 
Getting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLabGetting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLab
Melvin Hillsman
 

What's hot (20)

Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Indexing big data in the cloud
Indexing big data in the cloudIndexing big data in the cloud
Indexing big data in the cloud
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
 
Solr vs ElasticSearch
Solr vs ElasticSearchSolr vs ElasticSearch
Solr vs ElasticSearch
 
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, LucidworksSearching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
 
The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch plugins
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Delhi elasticsearch meetup
Delhi elasticsearch meetupDelhi elasticsearch meetup
Delhi elasticsearch meetup
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
Elasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easyElasticsearch Distributed search & analytics on BigData made easy
Elasticsearch Distributed search & analytics on BigData made easy
 
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
Ubiquitous Solr - A Database's Not-So-Evil Twin: Presented by Ayon Sinha, Wal...
 
Getting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLabGetting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLab
 

Viewers also liked

Facebook API for iOS
Facebook API for iOSFacebook API for iOS
Facebook API for iOS
Hidetoshi Mori
 
Solr pattern
Solr patternSolr pattern
Alphageeks meetup - facebook api
Alphageeks meetup - facebook apiAlphageeks meetup - facebook api
Alphageeks meetup - facebook api
Alphageeks
 
Intro to Agile Practices and Values
Intro to Agile Practices and ValuesIntro to Agile Practices and Values
Intro to Agile Practices and Values
OpenSource Connections
 
Lessons Learned with Spark at the US Patent & Trademark Office
Lessons Learned with Spark at the US Patent & Trademark OfficeLessons Learned with Spark at the US Patent & Trademark Office
Lessons Learned with Spark at the US Patent & Trademark Office
OpenSource Connections
 
Richmond Spin - How To Sell A Traditional Client
Richmond Spin - How To Sell A Traditional ClientRichmond Spin - How To Sell A Traditional Client
Richmond Spin - How To Sell A Traditional Client
OpenSource Connections
 
Lucene - 10 ans d'usages plus ou moins classiques
Lucene - 10 ans d'usages plus ou moins classiquesLucene - 10 ans d'usages plus ou moins classiques
Lucene - 10 ans d'usages plus ou moins classiques
Sylvain Wallez
 
Database History From Codd to Brewer
Database History From Codd to BrewerDatabase History From Codd to Brewer
Database History From Codd to Brewer
OpenSource Connections
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
Solr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big DataSolr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big Data
francelabs
 

Viewers also liked (11)

Facebook API for iOS
Facebook API for iOSFacebook API for iOS
Facebook API for iOS
 
Solr pattern
Solr patternSolr pattern
Solr pattern
 
Alphageeks meetup - facebook api
Alphageeks meetup - facebook apiAlphageeks meetup - facebook api
Alphageeks meetup - facebook api
 
Intro to Agile Practices and Values
Intro to Agile Practices and ValuesIntro to Agile Practices and Values
Intro to Agile Practices and Values
 
Solr implementation
Solr implementationSolr implementation
Solr implementation
 
Lessons Learned with Spark at the US Patent & Trademark Office
Lessons Learned with Spark at the US Patent & Trademark OfficeLessons Learned with Spark at the US Patent & Trademark Office
Lessons Learned with Spark at the US Patent & Trademark Office
 
Richmond Spin - How To Sell A Traditional Client
Richmond Spin - How To Sell A Traditional ClientRichmond Spin - How To Sell A Traditional Client
Richmond Spin - How To Sell A Traditional Client
 
Lucene - 10 ans d'usages plus ou moins classiques
Lucene - 10 ans d'usages plus ou moins classiquesLucene - 10 ans d'usages plus ou moins classiques
Lucene - 10 ans d'usages plus ou moins classiques
 
Database History From Codd to Brewer
Database History From Codd to BrewerDatabase History From Codd to Brewer
Database History From Codd to Brewer
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Solr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big DataSolr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big Data
 

Similar to Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup

From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
DataWorks Summit
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Nilesh Shah
 
Design for scale
Design for scaleDesign for scale
Design for scale
Doug Lampe
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Lucidworks
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
jaxconf
 
The Economies of Scaling Software
The Economies of Scaling SoftwareThe Economies of Scaling Software
The Economies of Scaling Software
Abdelmonaim Remani
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
The Hive
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
Tony Tam
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
Russell Jurney
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
Jonas Brømsø
 
Doug McCune - Using Open Source Flex and ActionScript Projects
Doug McCune - Using Open Source Flex and ActionScript ProjectsDoug McCune - Using Open Source Flex and ActionScript Projects
Doug McCune - Using Open Source Flex and ActionScript Projects
Doug McCune
 
System insight without Interference
System insight without InterferenceSystem insight without Interference
System insight without Interference
Tony Tam
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
Varun Thacker
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
Sri Ambati
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
Giivee The
 
Time Series Anomaly Detection with Azure and .NETT
Time Series Anomaly Detection with Azure and .NETTTime Series Anomaly Detection with Azure and .NETT
Time Series Anomaly Detection with Azure and .NETT
Marco Parenzan
 
Moving Quickly with Data Services in the Cloud
Moving Quickly with Data Services in the CloudMoving Quickly with Data Services in the Cloud
Moving Quickly with Data Services in the Cloud
Matthew Dimich
 

Similar to Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup (20)

From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Design for scale
Design for scaleDesign for scale
Design for scale
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
The economies of scaling software - Abdel Remani
The economies of scaling software - Abdel RemaniThe economies of scaling software - Abdel Remani
The economies of scaling software - Abdel Remani
 
The Economies of Scaling Software
The Economies of Scaling SoftwareThe Economies of Scaling Software
The Economies of Scaling Software
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Inside Wordnik's Architecture
Inside Wordnik's ArchitectureInside Wordnik's Architecture
Inside Wordnik's Architecture
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
 
Doug McCune - Using Open Source Flex and ActionScript Projects
Doug McCune - Using Open Source Flex and ActionScript ProjectsDoug McCune - Using Open Source Flex and ActionScript Projects
Doug McCune - Using Open Source Flex and ActionScript Projects
 
System insight without Interference
System insight without InterferenceSystem insight without Interference
System insight without Interference
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Time Series Anomaly Detection with Azure and .NETT
Time Series Anomaly Detection with Azure and .NETTTime Series Anomaly Detection with Azure and .NETT
Time Series Anomaly Detection with Azure and .NETT
 
Moving Quickly with Data Services in the Cloud
Moving Quickly with Data Services in the CloudMoving Quickly with Data Services in the Cloud
Moving Quickly with Data Services in the Cloud
 

More from OpenSource Connections

Encores
EncoresEncores
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
OpenSource Connections
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
OpenSource Connections
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
OpenSource Connections
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
OpenSource Connections
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
OpenSource Connections
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
OpenSource Connections
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
OpenSource Connections
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
OpenSource Connections
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
OpenSource Connections
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
OpenSource Connections
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
OpenSource Connections
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
OpenSource Connections
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
OpenSource Connections
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
OpenSource Connections
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
OpenSource Connections
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
OpenSource Connections
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
OpenSource Connections
 

More from OpenSource Connections (20)

Encores
EncoresEncores
Encores
 
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 

Recently uploaded

Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
ssuser1915fe1
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
ChristopherTHyatt
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite SolutionIPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Networks
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
digitalxplive
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Neo4j
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
Safe Software
 
WhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring AppsWhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring Apps
HackersList
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Kunal Gupta
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
kumarjarun2010
 
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
Edge AI and Vision Alliance
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Muhammad Ali
 
CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
moinahousna
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
alexjohnson7307
 

Recently uploaded (20)

Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
 
How to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdfHow to build a generative AI solution A step-by-step guide (2).pdf
How to build a generative AI solution A step-by-step guide (2).pdf
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite SolutionIPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite Solution
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
 
WhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring AppsWhatsApp Spy Online Trackers and Monitoring Apps
WhatsApp Spy Online Trackers and Monitoring Apps
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSECHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
CHAPTER-8 COMPONENTS OF COMPUTER SYSTEM CLASS 9 CBSE
 
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
 
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
Litestack talk at Brighton 2024 (Unleashing the power of SQLite for Ruby apps)
 
CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
 

Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup

  • 1. BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS ! New York Solr/Lucene Meetup ERIC PUGH | epugh@o19s.com | @dep4b
  • 2. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  • 9. • First USPTO application in “the cloud” • Simple, and discoverable • Expresses our philosophy of “Cloud meets Ocean” ! • Check it out at http:// gpsn.uspto.gov
  • 10. Telling some stories ➡How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 11. Flow of understanding Data UnderstandingInformation
  • 13. Grok data at gut level Look for outliers ! User Interviews Surveys Card Sorting Scenarios/Personas ! UX Data brainstorm Mockups Proof of concept ! !
  • 14. Where to spend time? UX Engine Data 40% ! 20% ! 40% ! 40% ! 40% ! 20% We spent !
  • 15. Telling some stories • How to inject “Discovery” into your app ➡The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 16. Boy meets Girl Story
  • 17. Boy meets Girl Story Metadata Ingest Pipeline Discovery UX Content Files
  • 18. Nothing but JS and Solr! • Updates are quarterly • User state in browser • Solr is the “RESTful” API ;-) • KISS: EmberJS + Solr
  • 19. How we built it EmberJS Single Page Search App HTML XML JSON Server Dashboard GPSN UI (Bootsrap CSS) Browsers Mobile/ Tablet Third Party Application Servers S3 BucketSolr
  • 20. Yes, Solr is hangout out there on the Net… • Using Jetty container security to lock down everything but the /select handler. • Yes, the /admin interface appears to load, but no panels load. • Go ahead, do a delete query! I dare you. Actually, please don’t. ;-)
  • 21. Single 550 GB index • Solr + Index are in a Amazon AMI image. • Currently running two independent Solrs. • Optimize works! Still. • Elastic Load Balancer + AutoScale spins up more Solr’s if needed. • Threw lots of “provisioned IOPS” atVM
  • 22. A better security proxy from Alex? https://github.com/ dergachev/solr- security-proxy
  • 23. Spyglass • EmberJS based Widget framework • List of Results • Facets • Autocomplete • “Deploy” is just .html + .js. S3 bucket! • Tooling is a pain. EmberJS is complex! BetterthenAjaxSolr!
  • 25. Key scaling concept behind GPSN: ! Cloud meets Ocean
  • 29. Don’t Move Files • Copying 5 TB data up to S3 was very painful. • We used S3Funnel which is “rsync like” • We bought more network bandwidth for our office
  • 30. Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
 –Andrew Tanenbaum, 1981
  • 31. Data Size 0 250000 500000 750000 1000000 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 Patent Count 277871
  • 32. Think about DataVolume • Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce nice, need more visibility into progress.. • Should have sharded our Search Index from the beginning just to make indexing faster and cheaper process (500 gb index!) • 8 shards dropped time from 12 hours to 2 hours. Merging took 5! • We had too many steps in our pipeline
  • 33. Building  a  Patents  Index MachineCount 0 75 150 225 300 5 days 3 days 30 Minutes 1 5 300
  • 34. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) ➡Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 35. Why so many pipelines? Morphlines
  • 36. Tika as a pipeline?
  • 37. Lot’s of File Types • Sometimes in ZIP archives, sometimes not! • multiple XML formats as well as CSV and EDI • Purplebook,Yellowbook, Redbook,Greenbook, Questel, SIPO…
  • 38. Tika as a pipeline! • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project (and others!).
  • 39. Lots of files! HHHHHT APS1 ISSUE - 760106! PATN! WKU 039302717! SRC 5! APN 5328756! APT 1! ART 353! APD 19741216! TTL Golf glove! ISD 19760106! NCL 4! ECL 1 <PatentGrant>! <BibliographicData>! <GrantIdentification>! <DocumentKindCode>B1</DocumentKindCode>! <GrantNumber>06644224</GrantNumber>! <CountryCode>US</CountryCode>! <IssueDateText>2003-11-11</IssueDateText>
  • 40. Detector to pick File public  class  GreenbookDetector  implements  Detector  {   !        private  static  Pattern  pattern  =  Pattern.compile("PATN");                    @Override          public  MediaType  detect(InputStream  stream,  Metadata  metadata)  throws  IOException  {   !                MediaType  type  =  MediaType.OCTET_STREAM;                  InputStream  lookahead  =  new  LookaheadInputStream(stream,  1024);                  String  extract  =  org.apache.commons.io.IOUtils.toString(lookahead,  "UTF-­‐8");   !                Matcher  matcher  =  pattern.matcher(extract);   !                if  (matcher.find())  {                          type  =  GreenbookParser.MEDIA_TYPE;                  }   !                lookahead.close();                                    return  type;          }         }
  • 41. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers ➡Don’t be Afraid to Share!
  • 42. Your Search solution isn’t perfect • Allow users to export data • Most business users want to work in Excel! Accept it! • Allow other applications to build on top of it.
  • 43. GPSN has • Lots of easy “Print to PDF” options. • Data stored in S3 as: • individual patent files • chunky downloads. • Filtering to expand or select specific data sets. • Permalinks: simple, very sharable URLs. • Underlying Solr service is exposed to public via proxy. You can query Solr yourself. • Need advance querying? Use Lucene syntax in search bar.
  • 45. Measuring the impact of our algorithms changes is just getting harder as we get smarter.
  • 46. www.quepid.com Quepid: Give your Queries some Love W e need betausers!
  • 47. Thank you! ! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s Nervous about speaking up? Ask me later!