Your SlideShare is downloading. ×
0
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Searching Chinese Patents Presentation at Enterprise Data World

454

Published on

War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.

War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
454
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
3
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Searching Chinese Patents: Challenges and Solutions When Building an Innovative Discovery Interface ERIC PUGH | epugh@o19s.com | @dep4b
  • 2. Who am I? • Principal at OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07)
  • 3. Co-Author N extEdition M ay!
  • 4. Agilista
  • 5. Selected Customers
  • 6. Telling some stories war ^
  • 7. Risks • Cloud new at USPTO • Discovery is tenuous concept • Conflicting User Goals • Fixed Budget: trade scope for budget/quality
  • 8. Telling some stories ➡How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 9. Flow of understanding Data UnderstandingInformation
  • 10. Building “Discovery” Engine UX DataTension
  • 11. Grok data at gut level Look for outliers ! User Interviews Surveys Card Sorting Scenarios/Personas ! UX Data brainstorm Mockups Proof of concept ! !
  • 12. Where to spend time? UX Engine Data 40% ! 20% ! 40% ! 40% ! 40% ! 20% We spent !
  • 13. Telling some stories • How to inject “Discovery” into your app ➡The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 14. Boy meets Girl Story
  • 15. Boy meets Girl Story Metadata Ingest Pipeline Discovery UX Content Files
  • 16. How we built it EmberJS Single Page Search App HTML XML JSON Server Dashboard GPSN UI (Bootsrap CSS) Browsers Mobile/ Tablet Third Party Application Servers S3 BucketSolr
  • 17. Solr as a NoSQL Datastore • Used “atomic updates” to merge three source datasets into single final dataset. • All text displayed in application stored in Solr. • Dynamic schema supports many languages, en, cn right now.
  • 18. Lessons Learned
  • 19. Don’t Move Files • Copying 5 TB data up to S3 was very painful. • We used S3Funnel which is “rsync like” • We bought more network bandwidth for our office
  • 20. Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
 –Andrew Tanenbaum, 1981
  • 21. Data Size 0 250000 500000 750000 1000000 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 Patent Count 277871
  • 22. Think about DataVolume • Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce nice, need more visibility into progress.. • Should have sharded our Search Index from the beginning just to make indexing faster and cheaper process (500 gb index!) • 8 shards dropped time from 12 hours to 2 hours. Merging took 5! • We had too many steps in our pipeline
  • 23. Building  a  Patents  Index MachineCount 0 75 150 225 300 5 days 3 days 30 Minutes 1 5 300
  • 24. Key scaling concept behind GPSN: ! Cloud meets Ocean
  • 25. More prosaically… Database Server Server Server Client Client Client $ $ $ $
  • 26. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) ➡Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 27. Why so many pipelines? Morphlines
  • 28. Tika as a pipeline?
  • 29. Lot’s of File Types • Sometimes in ZIP archives, sometimes not! • multiple XML formats as well as CSV and EDI • Purplebook,Yellowbook, Redbook,Greenbook, Questel, SIPO…
  • 30. Tika as a pipeline! • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project (and others!).
  • 31. Lots of files! HHHHHT APS1 ISSUE - 760106! PATN! WKU 039302717! SRC 5! APN 5328756! APT 1! ART 353! APD 19741216! TTL Golf glove! ISD 19760106! NCL 4! ECL 1 <PatentGrant>! <BibliographicData>! <GrantIdentification>! <DocumentKindCode>B1</DocumentKindCode>! <GrantNumber>06644224</GrantNumber>! <CountryCode>US</CountryCode>! <IssueDateText>2003-11-11</IssueDateText>
  • 32. Detector to pick File public  class  GreenbookDetector  implements  Detector  {   !        private  static  Pattern  pattern  =  Pattern.compile("PATN");                    @Override          public  MediaType  detect(InputStream  stream,  Metadata  metadata)  throws  IOException  {   !                MediaType  type  =  MediaType.OCTET_STREAM;                  InputStream  lookahead  =  new  LookaheadInputStream(stream,  1024);                  String  extract  =  org.apache.commons.io.IOUtils.toString(lookahead,  "UTF-­‐8");   !                Matcher  matcher  =  pattern.matcher(extract);   !                if  (matcher.find())  {                          type  =  GreenbookParser.MEDIA_TYPE;                  }   !                lookahead.close();                                    return  type;          }         }
  • 33. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers ➡Don’t be Afraid to Share!
  • 34. Your BigData solution isn’t perfect • Allow users to export data • Most business users want to work in Excel. Accept it! • Allow other applications to build on top of of your application.
  • 35. GPSN has • Lots of easy “Print to PDF” options. • Data stored in S3 as: • individual patent files • chunky downloads. • Filtering to expand or select specific data sets. • Permalinks: simple, very sharable URLs. • Underlying Solr service is exposed to public via proxy. You can query Solr yourself. • Need advance querying? Use Lucene syntax in search bar.
  • 36. One more thought...
  • 37. Measuring the impact of our algorithms changes is just getting harder with Big Data.
  • 38. www.quepid.com Quepid: Give your Queries some Love W e need betausers!
  • 39. Thank you! ! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s Nervous about speaking up? Ask me later!

×