Searching Chinese Patents Presentation at Enterprise Data World

Searching Chinese Patents:

Challenges and Solutions When Building
an Innovative Discovery Interface
ERIC PUGH | epugh@o19s.com | @dep4b

Who am I?
• Principal at OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary

• Member of Apache Software
Foundation

• SOLR-284 UpdateRichDocuments
(July 07)

Risks
• Cloud new at USPTO

• Discovery is tenuous concept

• Conflicting User Goals

• Fixed Budget: trade scope for
budget/quality

Telling some stories
➡How to inject “Discovery” into your
app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Flow of understanding
Data, Understanding, Information

Building “Discovery”
[Diagram: tension between Engine, UX, and Data]

Grok data at gut level

Look for outliers

User Interviews

Surveys

Card Sorting

Scenarios/Personas

UX + Data brainstorm
Mockups

Proof of concept


Where to spend time?
[Chart: share of time across UX, Engine, and Data. Extracted values: 40%, 20%, 40%, then 40%, 40%, 20%, with the label “We spent”.]

• How to inject “Discovery” into your app

➡The Cloud to the Rescue (sorta!)


Boy meets Girl Story
[Diagram labels: Metadata, Content Files, Ingest Pipeline, Discovery UX]

How we built it
[Architecture diagram labels]
• EmberJS Single Page Search App
• HTML, XML, JSON
• Server Dashboard
• GPSN UI (Bootstrap CSS)
• Browsers, Mobile/Tablet, Third Party Application
• Servers
• S3 Bucket, Solr

Solr as a NoSQL
Datastore
• Used “atomic updates” to merge three source datasets into a single final dataset.

• All text displayed in the application is stored in Solr.

• Dynamic schema supports many languages; en and cn right now.
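
A minimal sketch of what an atomic-update request body could look like, assuming Solr's JSON update format with the `set` modifier. The document id and field name (`CN-1234567`, `title_en`) are hypothetical, not taken from GPSN, and the class only builds the payload; posting it to a Solr update handler is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Builds a Solr atomic-update JSON payload (illustrative; field names are hypothetical). */
final class AtomicUpdateSketch {

    /** Escapes backslashes and double quotes only; control characters are not handled. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Returns a one-document update that sets the given fields on the document
     * with this id, without resending the document's other fields.
     */
    static String setFields(String id, Map<String, String> fields) {
        StringJoiner doc = new StringJoiner(",", "[{", "}]");
        doc.add("\"id\":\"" + escape(id) + "\"");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            doc.add("\"" + escape(field.getKey()) + "\":{\"set\":\""
                    + escape(field.getValue()) + "\"}");
        }
        return doc.toString();
    }

    public static void main(String[] args) {
        // Fields arriving from a second source dataset for an already-indexed patent.
        Map<String, String> fromSecondSource = new LinkedHashMap<>();
        fromSecondSource.put("title_en", "Golf glove");
        System.out.println(setFields("CN-1234567", fromSecondSource));
    }
}
```

Running one such pass per source dataset lets each pass add its own fields to documents that earlier passes created.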

Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.

• We used S3Funnel which is “rsync like”

• We bought more network bandwidth for
our office

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
–Andrew Tanenbaum, 1981
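
A back-of-the-envelope sketch of why moving 5 TB over the wire hurts. The link speeds below are assumptions for illustration, not our measured office bandwidth, and protocol overhead is ignored.

```java
/** Rough upload time for a dataset over a network link. */
final class TransferTimeSketch {

    /** Hours needed to move `terabytes` (decimal TB) at `megabitsPerSecond`, ignoring overhead. */
    static double hours(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;                 // 1 TB = 10^12 bytes
        double seconds = bits / (megabitsPerSecond * 1e6);  // 1 Mbit = 10^6 bits
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // Assumed sustained link speeds in Mbit/s.
        for (double mbps : new double[] {10, 100, 1000}) {
            System.out.printf("5 TB at %.0f Mbit/s: %.1f hours%n", mbps, hours(5, mbps));
        }
    }
}
```

At an assumed sustained 100 Mbit/s, 5 TB works out to roughly 111 hours, a bit over four and a half days.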

Data Size
[Chart: Patent Count by year, 1985 to 2011; y-axis from 0 to 1,000,000; one labeled value: 277871]

Think about Data Volume
• Started with the older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!).

• 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5 hours!

• We had too many steps in our pipeline.
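
A minimal sketch of hash-based shard assignment, assuming documents are keyed by a string id and split across a fixed number of shards. This illustrates the general idea only; it is not how Solr itself routes documents, and the `US` id prefix is made up.

```java
import java.util.Arrays;

/** Assigns each document id to one of N shards by hashing the id. */
final class ShardRouterSketch {

    /** Returns a shard number in [0, shardCount) that is stable for a given id. */
    static int shardFor(String docId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        int[] counts = new int[8];
        for (int i = 0; i < 100_000; i++) {
            counts[shardFor("US" + i, 8)]++;   // hypothetical ids
        }
        System.out.println(Arrays.toString(counts));
    }
}
```

Because each shard can be indexed on its own machine, the indexing time is bounded by the largest shard rather than by the whole collection.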

Building a Patents Index
[Chart: Machine Count vs. time to build the index]
• 1 machine: 5 days
• 5 machines: 3 days
• 300 machines: 30 minutes
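
A small worked calculation of speedup and parallel efficiency, assuming the chart labels pair up as 1 machine taking 5 days, 5 machines taking 3 days, and 300 machines taking 30 minutes. That pairing is a reading of the extracted labels, not a stated result.

```java
/** Speedup and parallel efficiency relative to a single-machine baseline. */
final class ScalingSketch {

    /** How many times faster the parallel run is than the baseline. */
    static double speedup(double baselineMinutes, double parallelMinutes) {
        return baselineMinutes / parallelMinutes;
    }

    /** Fraction of ideal linear scaling achieved with the given machine count. */
    static double efficiency(double speedup, int machines) {
        return speedup / machines;
    }

    public static void main(String[] args) {
        double baseline = 5 * 24 * 60;           // 5 days on 1 machine, in minutes
        double fiveMachines = 3 * 24 * 60;       // 3 days on 5 machines
        double threeHundredMachines = 30;        // 30 minutes on 300 machines

        double s5 = speedup(baseline, fiveMachines);
        double s300 = speedup(baseline, threeHundredMachines);
        System.out.printf("5 machines: speedup %.2f, efficiency %.2f%n", s5, efficiency(s5, 5));
        System.out.printf("300 machines: speedup %.0f, efficiency %.2f%n", s300, efficiency(s300, 300));
    }
}
```

Under that reading, 300 machines give a 240x speedup, which is 80% of ideal linear scaling.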

Key scaling concept behind GPSN:
Cloud meets Ocean

More prosaically…
[Diagram: three Clients, three Servers, one Database; four “$” symbols]


➡Parsers and Parsers and Parsers

Why so many pipelines?
Morphlines

Lots of File Types
• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…

Tika as a pipeline!
• Auto-detects content type

• Metadata structure has all the key/value pairs needed for Solr

• Allows us to scale up with the Behemoth project (and others!).
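
A standard-library sketch of the kind of content sniffing that Tika's auto-detection does for us: peek at the first bytes, decide on a type, and leave the stream untouched for the parser. This is a stand-in under simple assumptions (ZIP local-file-header magic, a leading `<` for XML, a `PATN` marker for Greenbook), not Tika's actual detection logic.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Guesses a file kind from the first 1024 bytes without consuming the stream. */
final class SniffSketch {

    enum Kind { ZIP, XML, GREENBOOK, UNKNOWN }

    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1024);
        byte[] head = in.readNBytes(1024);
        in.reset();   // rewind so the parser sees the stream from the start

        // ZIP archives with at least one entry begin with "PK\u0003\u0004".
        if (head.length >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return Kind.ZIP;
        }
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.stripLeading().startsWith("<")) {
            return Kind.XML;
        }
        if (text.contains("PATN")) {
            return Kind.GREENBOOK;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "PATN\nWKU 039302717\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```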

Lots of files!
HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1

<PatentGrant>
<BibliographicData>
<GrantIdentification>
<DocumentKindCode>B1</DocumentKindCode>
<GrantNumber>06644224</GrantNumber>
<CountryCode>US</CountryCode>
<IssueDateText>2003-11-11</IssueDateText>
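
A minimal sketch of turning the tagged text excerpt above into key/value pairs, assuming each line is a short tag, whitespace, then the value. Real files are messier than this; in this sketch a repeated tag simply overwrites the earlier value, and bare markers such as `PATN` are skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses "TAG value" lines into an ordered map (illustrative only). */
final class GreenbookLinesSketch {

    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Split on the first run of whitespace: tag on the left, value on the right.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length == 2) {
                fields.put(parts[0], parts[1]);
            }
            // Lines with a tag but no value (e.g. PATN) carry no field and are ignored.
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "PATN\nWKU 039302717\nAPN 5328756\nTTL Golf glove\nISD 19760106\n";
        System.out.println(parse(record));
    }
}
```

The resulting map has the same shape as the key/value metadata a parser hands to the indexer.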

Detector to pick File
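import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // Greenbook files carry a "PATN" record marker near the top.
    private static final Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        MediaType type = MediaType.OCTET_STREAM;

        // Tika may pass a null stream when only metadata is available.
        if (stream == null) {
            return type;
        }

        // Read at most the first 1024 bytes; closing the lookahead
        // resets the underlying stream for the parser that runs next.
        try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            Matcher matcher = pattern.matcher(extract);
            if (matcher.find()) {
                type = GreenbookParser.MEDIA_TYPE;
            }
        }
        return type;
    }
}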




➡Don’t be Afraid to Share!

Your Big Data solution isn’t perfect
• Allow users to export data

• Most business users want to work in Excel. Accept it!

• Allow other applications to build on top of your application.
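
A minimal sketch of an Excel-friendly export, assuming search results are already available as rows of strings. It writes CSV with the usual quoting rules (fields containing commas, quotes, or line breaks are wrapped in quotes, and embedded quotes are doubled). The column names are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Serializes rows of strings as CSV text that spreadsheet tools can open. */
final class CsvExportSketch {

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Joins fields with commas and ends every row with CRLF. */
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream().map(CsvExportSketch::quote).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n", "", "\r\n"));
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("grant_number", "title"),          // hypothetical column names
                List.of("06644224", "Glove, golf"));
        System.out.print(toCsv(rows));
    }
}
```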

GPSN has
• Lots of easy “Print to PDF” options.

• Data stored in S3 as:

- individual patent files

- chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very sharable URLs.

• Underlying Solr service is exposed to the public via a proxy. You can query Solr yourself.

• Need advanced querying? Use Lucene syntax in the search bar.
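
A minimal sketch of building a query URL against a Solr select handler, assuming the standard `q`, `rows`, and `wt` request parameters. The host, collection name, and field names (`title`, `country`) are hypothetical and are not GPSN's actual endpoint or schema.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a Solr /select URL for a Lucene-syntax query string. */
final class SolrQueryUrlSketch {

    static String selectUrl(String baseUrl, String luceneQuery, int rows) {
        // URL-encode the query so quotes, colons, and spaces survive the trip.
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Hypothetical host, collection, and fields.
        String url = selectUrl("https://search.example.com/solr/patents",
                "title:\"golf glove\" AND country:US", 10);
        System.out.println(url);
    }
}
```

The same Lucene-syntax string (a fielded phrase plus a boolean clause) is the kind of query that can be typed straight into the search bar.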

Measuring the impact of our algorithm changes is just getting harder with Big Data.

www.quepid.com
Quepid: Give your Queries some Love
We need beta users!

Thank you!
Questions?
• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s

Nervous about speaking up? Ask me later!
