Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchange DC
BUILDING A LIGHTWEIGHT DISCOVERY
INTERFACE FOR CHINESE PATENTS
ERIC PUGH | firstname.lastname@example.org | @dep4b
Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
• Member of Apache Software Foundation
• SOLR-284 UpdateRichDocuments
• Fascinated by the art of software
Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce is nice,
but we needed more visibility into progress.
• Should have sharded our search index from the beginning
just to make indexing a faster and cheaper process (500 GB
• 8 shards dropped indexing time from 12 hours to 2 hours.
Merging took 5!
• We had too many steps in our pipeline
5 days 3 days 30 Minutes
Detector to pick File
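The sharding lesson above can be illustrated with a minimal sketch of splitting documents across 8 shards so that each shard can be indexed in parallel. This is not the routing the project actually used, which the slides do not describe; the hash choice, the function names `shard_for` and `partition`, and the sample patent ids are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8  # the shard count mentioned in the talk


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document id to a shard number in [0, num_shards)."""
    # md5 is used only as a stable, evenly spread hash, not for security.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(doc_ids, num_shards: int = NUM_SHARDS):
    """Group document ids by target shard so shards can be indexed in parallel."""
    batches = defaultdict(list)
    for doc_id in doc_ids:
        batches[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(batches)


if __name__ == "__main__":
    # Hypothetical patent ids, for illustration only.
    ids = [f"CN{n:09d}" for n in range(1000)]
    for shard, batch in sorted(partition(ids).items()):
        print(shard, len(batch))
```

Because the shard is derived from the id alone, the same document always lands on the same shard, so a re-run of the pipeline does not scatter duplicates across shards.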
Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
Your BigData solution
• Allow users to export data
• Most business users want to work in Excel!
• Allow other applications to build on top of
• Lots of easy “Print to
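Since most business users want to work in Excel, one plausible export path is to turn search result records into a CSV file. The sketch below assumes the records are plain dictionaries, as they would be after parsing a JSON search response; the function name `docs_to_excel_csv` and the field names in the usage example are hypothetical and not taken from the talk.

```python
import csv
import io


def docs_to_excel_csv(docs, fields):
    """Render a list of dict records as CSV bytes that Excel can open."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for doc in docs:
        row = {}
        for name in fields:
            # Missing fields become empty cells.
            value = doc.get(name, "")
            # Multi-valued fields are joined into a single cell.
            if isinstance(value, (list, tuple)):
                value = "; ".join(str(v) for v in value)
            row[name] = value
        writer.writerow(row)
    # "utf-8-sig" prepends a byte order mark, which helps Excel detect
    # UTF-8 and display Chinese text correctly.
    return buffer.getvalue().encode("utf-8-sig")
```

A caller could write the returned bytes straight to a file or an HTTP response, for example `open("patents.csv", "wb").write(docs_to_excel_csv(docs, ["id", "title"]))`.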
• Data stored in S3 as:
• individual patent files
• chunky downloads.
• Filtering to expand or
select specific data sets.
• Permalinks: simple, very
• Underlying Solr service
is exposed to the public via
a proxy. You can query
• Need advanced querying?
Use Lucene syntax in
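As a rough illustration of permalinks and Lucene-syntax querying against a proxied Solr service, the sketch below builds a shareable query URL. The host and path in `SOLR_SELECT` are hypothetical, since the slides do not give the real endpoint, and the field names `title` and `applicant` in the usage example are assumptions. The parameters `q`, `rows`, `start` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical proxy endpoint; the real host and path are not in the slides.
SOLR_SELECT = "https://search.example.org/solr/patents/select"


def permalink(lucene_query: str, rows: int = 10, start: int = 0) -> str:
    """Build a shareable URL for a query written in Lucene syntax."""
    params = {"q": lucene_query, "rows": rows, "start": start, "wt": "json"}
    # urlencode percent-escapes quotes, colons and spaces in the query.
    return f"{SOLR_SELECT}?{urlencode(params)}"


if __name__ == "__main__":
    # Field names here are assumptions, for illustration only.
    print(permalink('title:battery AND applicant:"Example Corp"'))
```

Because the whole search state lives in the query string, the same URL can be bookmarked, emailed, or called by another application.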