BUILDING A LIGHTWEIGHT DISCOVERY
INTERFACE FOR CHINESE PATENTS
ERIC PUGH | epugh@o19s.com | @dep4b
Building a Lightweight Discovery Interface for Chinese Patents, Presented by Eric Pugh at SolrExchage DC

  1. BUILDING A LIGHTWEIGHT DISCOVERY INTERFACE FOR CHINESE PATENTS | ERIC PUGH | epugh@o19s.com | @dep4b
  2. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  3. Co-Author • Next Edition May!
  4. Congrats to Trey and Tim! (Tim is here somewhere)
  5. Agilista
  6. Selected Customers
  7. Telling some stories
  8. Telling some war stories
  9. Risks • Cloud new at USPTO • Discovery is tenuous concept • Conflicting User Goals • Fixed Budget: trade scope for budget/quality
  10. • First USPTO application in “the cloud” • Simple, and discoverable • Expresses our philosophy of “Cloud meets Ocean” • Check it out at http://gpsn.uspto.gov
  11. Telling some stories ➡ How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  12. Flow of understanding: Data • Understanding • Information
  13. Building “Discovery”: UX • Data • Tension
  14. Building “Discovery”: Engine • UX • Data • Tension
  15. Grok data at gut level • Look for outliers • User Interviews • Surveys • Card Sorting • Scenarios/Personas • UX • Data • brainstorm • Mockups • Proof of concept
  16. Where to spend time? UX, Engine, Data: 40%, 20%, 40%
  17. Where to spend time? UX, Engine, Data: 40%, 20%, 40% • We spent: 40%, 40%, 20%
  18. Telling some stories • How to inject “Discovery” into your app ➡ The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  19. Boy meets Girl Story
  20. Boy meets Girl Story
  21. Boy meets Girl Story
  22. Boy meets Girl Story: Metadata • Ingest Pipeline • Discovery • UX • Content • Files
  23. How we built it: EmberJS Single Page Search App • HTML • XML • JSON • Server Dashboard • GPSN UI (Bootstrap CSS) • Browsers • Mobile/Tablet • Third Party Application Servers • S3 Bucket • Solr
  24. Lessons Learned
  25. Don’t Move Files • Copying 5 TB of data up to S3 was very painful. • We used S3Funnel, which is “rsync like” • We bought more network bandwidth for our office
  26. Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. –Andrew Tanenbaum, 1981
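To put the "Don’t Move Files" lesson in numbers, here is a back-of-the-envelope sketch of how long a 5 TB upload takes at a given sustained link speed. The link speeds are assumptions chosen for illustration; the talk does not say what bandwidth the office had, and the class and method names are hypothetical.

```java
/** Rough upload-time estimate for a bulk copy. Link speeds are illustrative assumptions. */
final class TransferEstimate {
    private TransferEstimate() {}

    /** Hours needed to move {@code terabytes} (decimal TB) over a link sustaining {@code megabitsPerSecond}. */
    static double hours(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;                  // 1 TB = 10^12 bytes, 8 bits per byte
        double seconds = bits / (megabitsPerSecond * 1e6);   // 1 Mbit/s = 10^6 bits per second
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        double[] linkSpeedsMbps = {10, 100, 1000};           // assumed speeds, not taken from the talk
        for (double mbps : linkSpeedsMbps) {
            System.out.printf("5 TB at %.0f Mbit/s: %.1f hours%n", mbps, hours(5, mbps));
        }
    }
}
```

At a sustained 100 Mbit/s the copy needs roughly 111 hours, more than four and a half days, before any retries or protocol overhead. That is the arithmetic behind the station wagon quote.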
  27. Data Size [chart: Patent Count by year, 1985 to 2011; y-axis 0 to 1,000,000; labeled value 277871]
  28. Data Size [chart: Patent Count by year, 1985 to 2011; y-axis 0 to 1,000,000; labeled value 277871]
  29. Think about Data Volume • Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress. • Should have sharded our Search Index from the beginning just to make indexing a faster and cheaper process (500 GB index!) • 8 shards dropped time from 12 hours to 2 hours. Merging took 5! • We had too many steps in our pipeline
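The sharding point can be illustrated with a minimal sketch that splits document IDs across a fixed number of shards by hash, so each shard can be indexed in parallel. This is a generic hash-partitioning example, not the routing GPSN or Solr actually used; the class name, method names, and sample IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic hash partitioning of document IDs; not Solr's own routing implementation. */
final class ShardRouter {
    private ShardRouter() {}

    /** Returns a stable shard number in [0, numShards) for the given document ID. */
    static int shardFor(String docId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Groups document IDs into one list per shard. */
    static List<List<String>> partition(List<String> docIds, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : docIds) {
            shards.get(shardFor(id, numShards)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("US06644224", "US03930271", "CN101234567"); // made-up sample IDs
        List<List<String>> shards = partition(ids, 8);                           // 8 shards, as in the talk
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Because the shard is a pure function of the ID, the same document always lands on the same shard, which lets separate indexing workers run without coordinating.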
  30. Building a Patents Index [chart: Machine Count, y-axis 0 to 300; 1 machine: 5 days, 5 machines: 3 days, 300 machines: 30 minutes]
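Reading the chart as 1 machine taking 5 days, 5 machines taking 3 days, and 300 machines taking 30 minutes (an interpretation of the flattened slide text), the speedup and per-machine efficiency can be worked out as follows. The class and method names are hypothetical.

```java
/** Speedup arithmetic for the index-build timings, as read from the slide. */
final class ScalingMath {
    private ScalingMath() {}

    /** How many times faster a run is than the single-machine baseline. */
    static double speedup(double baselineMinutes, double minutes) {
        return baselineMinutes / minutes;
    }

    /** Fraction of ideal linear scaling achieved: 1.0 means every machine contributes fully. */
    static double efficiency(double speedup, int machines) {
        return speedup / machines;
    }

    public static void main(String[] args) {
        double baseline = 5 * 24 * 60;               // 1 machine: 5 days, in minutes
        int[] machines = {5, 300};
        double[] minutes = {3 * 24 * 60, 30};        // 5 machines: 3 days; 300 machines: 30 minutes
        for (int i = 0; i < machines.length; i++) {
            double s = speedup(baseline, minutes[i]);
            System.out.printf("%d machines: speedup %.2fx, efficiency %.2f%n",
                    machines[i], s, efficiency(s, machines[i]));
        }
    }
}
```

Under that reading, 5 machines give about a 1.67x speedup (efficiency 0.33), while 300 machines give a 240x speedup (efficiency 0.80).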
  31. Key scaling concept behind GPSN: Cloud meets Ocean
  32. More prosaically… [diagram: Database • 3 × Server • 3 × Client • 4 × $]
  33. More prosaically… [diagram: Database • 3 × Server • 3 × Client • 5 × $]
  34. More prosaically… [diagram: Database Server • 3 × Client • 5 × $]
  35. More prosaically… [diagram: Database Server • 5 × Client • 5 × $]
  36. More prosaically… [diagram: Database Server • 5 × Client • 7 × $]
  37. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) ➡ Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  38. Why so many pipelines? Morphlines
  39. Tika as a pipeline?
  40. Lots of File Types • Sometimes in ZIP archives, sometimes not! • Multiple XML formats as well as CSV and EDI • Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
  41. Tika as a pipeline! • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project (and others!).
  42. Lots of files!
        HHHHHT APS1 ISSUE - 760106
        PATN
        WKU 039302717
        SRC 5
        APN 5328756
        APT 1
        ART 353
        APD 19741216
        TTL Golf glove
        ISD 19760106
        NCL 4
        ECL 1

        <PatentGrant>
        <BibliographicData>
        <GrantIdentification>
        <DocumentKindCode>B1</DocumentKindCode>
        <GrantNumber>06644224</GrantNumber>
        <CountryCode>US</CountryCode>
        <IssueDateText>2003-11-11</IssueDateText>
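The Greenbook sample on this slide reads as one tag and one value per line (for example `TTL Golf glove`). A minimal sketch of turning such lines into key/value pairs, the shape a Solr document needs, might look like the following. It assumes the tag is simply the first whitespace-delimited token, which may not match the real Greenbook column layout; the class name is hypothetical and this is not the project's GreenbookParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits "TAG value" lines into a map; assumes whitespace-delimited tags (a simplification). */
final class GreenbookLines {
    private GreenbookLines() {}

    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();   // keeps the file's field order
        for (String line : text.split("\\R")) {               // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int space = trimmed.indexOf(' ');
            String tag = space < 0 ? trimmed : trimmed.substring(0, space);
            String value = space < 0 ? "" : trimmed.substring(space + 1).trim();
            fields.putIfAbsent(tag, value);                    // first occurrence wins in this sketch
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "PATN\nWKU 039302717\nAPD 19741216\nTTL Golf glove\nISD 19760106";
        parse(sample).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

Marker lines with no value, such as `PATN`, map to an empty string. Repeated tags would need a multi-valued structure that this sketch leaves out.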
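  43. Detector to pick File
        import java.io.IOException;
        import java.io.InputStream;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.tika.detect.Detector;
        import org.apache.tika.io.LookaheadInputStream;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.mime.MediaType;

        public class GreenbookDetector implements Detector {

            // Greenbook files carry a "PATN" marker near the top of the file.
            private static Pattern pattern = Pattern.compile("PATN");

            @Override
            public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
                // Fall back to the generic binary type when the marker is absent.
                MediaType type = MediaType.OCTET_STREAM;
                // Peek at the first 1024 bytes without consuming the caller's stream.
                InputStream lookahead = new LookaheadInputStream(stream, 1024);
                String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");

                Matcher matcher = pattern.matcher(extract);

                if (matcher.find()) {
                    type = GreenbookParser.MEDIA_TYPE;
                }

                // Closing the lookahead resets the underlying stream for the parser.
                lookahead.close();
                return type;
            }
        }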
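The key idea in the detector is peeking at the start of a stream and then handing the stream on untouched. The sketch below shows the same idea with only the JDK's mark/reset support, so it runs without Tika on the classpath. It is an illustration of the technique, not a replacement for Tika's `LookaheadInputStream`; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** JDK-only illustration of "peek, then rewind" content sniffing. */
final class LookaheadSketch {
    static final int LOOKAHEAD_BYTES = 1024;

    private LookaheadSketch() {}

    /** Returns true if "PATN" appears in the first 1024 bytes; the stream is rewound afterwards. */
    static boolean startsLikeGreenbook(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(LOOKAHEAD_BYTES);
        try {
            byte[] buffer = new byte[LOOKAHEAD_BYTES];
            int total = 0;
            int n;
            // read() may return fewer bytes than asked for, so loop until full or end of stream.
            while (total < buffer.length
                    && (n = stream.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
            String prefix = new String(buffer, 0, total, StandardCharsets.UTF_8);
            return prefix.contains("PATN");
        } finally {
            stream.reset();   // rewind so a parser can read from the beginning
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "HHHHHT APS1 ISSUE - 760106\nPATN\nWKU 039302717\n"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(sample);
        System.out.println("Greenbook? " + startsLikeGreenbook(in));
        System.out.println("First byte after detection: " + (char) in.read());
    }
}
```

Streams that do not support mark/reset, such as a plain `FileInputStream`, can be wrapped in a `BufferedInputStream` before calling the method.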
  44. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers ➡ Don’t be Afraid to Share!
  45. Your BigData solution isn’t perfect • Allow users to export data • Most business users want to work in Excel! Accept it! • Allow other applications to build on top of it.
  46. GPSN has • Lots of easy “Print to PDF” options. • Data stored in S3 as: • individual patent files • chunky downloads. • Filtering to expand or select specific data sets. • Permalinks: simple, very sharable URLs. • Underlying Solr service is exposed to the public via proxy. You can query Solr yourself. • Need advanced querying? Use Lucene syntax in the search bar.
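Since the Solr service is reachable through a proxy, a client can build a standard Solr `select` request with a Lucene-syntax query. The sketch below only constructs the URL. The base URL, core name, and `title` field are placeholders, because the talk does not give the proxy's path or the schema; `q`, `rows`, and `wt` are standard Solr request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds a Solr select URL; host, core, and field names used in main() are placeholders. */
final class SolrQueryUrl {
    private SolrQueryUrl() {}

    static String selectUrl(String baseUrl, String luceneQuery, int rows) {
        // URL-encode the query so quotes, colons, and spaces survive the trip.
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and field; substitute the real proxy URL and schema fields.
        String url = selectUrl("http://example.org/solr/patents", "title:\"golf glove\"", 10);
        System.out.println(url);
    }
}
```

The same query string typed into the search bar would use the Lucene syntax directly (`title:"golf glove"`), without the URL encoding.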
  47. One more thought...
  48. Measuring the impact of our algorithm changes is just getting harder with Big Data.
  49. www.quepid.com • Quepid: Give your Queries some Love
  50. www.quepid.com • Quepid: Give your Queries some Love • We need beta users!
  51. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s • Nervous about speaking up? Ask me later!
