Building a lightweight discovery interface for Chinese patents

The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network (GPSN), the next-generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the first public USPTO application deployed in the cloud, and it allowed a very small development team to build a discovery interface across millions of patents.

This case study will cover:
• How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• Some of the innovative methods for converting XML-formatted data to usable information.
• Parsing through 5 TB of raw TIFF image data and converting it to a modern, web-friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third-party systems to build additional functionality on top of GPSN.

Published in: Technology
Notes
  • SOLR-284 back in July 07 was a first cut at a content extraction library before Tika came along.
  • I love Agile development processes, and I think of agile as spanning business -> requirements -> development -> testing -> systems administration.
  • USPTO and SIPO (China's State Intellectual Property Office) are committed to sharing patent data, in order to:
    simplify patent protections,
    reduce conflicting patent claims,
    facilitate business, by making it easier for Chinese and American companies to collaborate.
    Part of the MOU was to put China’s patent data online, but it was treated as somewhat of a checkbox feature.
  • We got excited!
  • I won’t be sugar coating them. Most speakers focus on how wonderfully they did, and while I am inordinately proud of GPSN, it wasn’t a perfect project.
  • The Goal
  • Building discovery capabilities is about the tension between UX and data needs.
    The engine provides the flow between them!
  • There are things that can be done in parallel, but after brainstorming, everything needs to go hand in hand.
    An issue we’ve seen is that it’s easy for the UX folks to document ideas, but the data folks often get bogged down in the minutiae of the source data. We need to surface that to a level people can work with.
  • Ideally you are focused on the UX and Data, your “tooling” shouldn’t get in the way.
    We were bitten by having to do a lot of knowledge transfer and verification of our cloud deployment because it was the first time, which meant some data issues bit us.
  • Issues that came up: User sophistication. We have both public users and expert Patent Examiner (PE) users. We tilted towards the public, with a layer of PE features.
    The data, in English, was bad. So we surfaced more of the original Chinese, and especially the image data.
    Core users, such as patent attorneys, want the original image of a patent. It’s a trust thing.
    Google-like simple search, but with more powerful queries.
  • One of the most common tropes in storytelling is about a boy meeting a girl. Think Wall-E and Eve. They meet, shenanigans happen, and then happiness.
    Well, GPSN followed one of the most common tropes of discovery: merging clean metadata with content, and building a UI.
    This pretty much describes every project.
  • $ is the cost of computation, doing work.
  • Scott pointed out Tika.
  • Walking out of a Federal building with 4 hard drives in my backpack!
  • Mention office hours.
  • Transcript

    • 1. Building a Lightweight Discovery Interface for Chinese Patents Strata 2014 Santa Clara Eric Pugh | epugh@o19s.com | @dep4b
    • 2. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
    • 3. Co-Author: Next Edition Mar!
    • 4. Agilista
    • 5. Selected Customers
    • 6. Telling some war stories
    • 7. • First USPTO application in “the cloud” • Simple, and discoverable • Expresses our philosophy of “Cloud meets Ocean”
    • 8. Risks • Cloud new at USPTO • Discovery is tenuous concept • Conflicting User Goals • Fixed Budget: trade scope for budget/quality
    • 9. Telling some stories ➡How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
    • 10. Flow of understanding: Data -> Information -> Understanding
    • 11. Building “Discovery”: UX, Data, and Engine, with Tension between UX and Data
    • 12. UX: User Interviews, Card Sorting, Scenarios/Personas • Data: Grok data at gut level, Look for outliers • Brainstorm • Surveys, Mockups, Proof of concept
    • 13. Where to spend time? UX, Engine, Data: 40% 20% 40% 40% 40% 20% We spent
    • 14. Walk through results http://gpsn.uspto.gov
    • 15. Telling some stories • How to inject “Discovery” into your app ➡The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
    • 16. Boy meets Girl Story
    • 17. Boy meets Girl Story Content Files Ingest Pipeline Metadata Discovery UX
    • 18. How we built it
    • 19. Lessons Learned
    • 20. Don’t Move Files • Copying 5 TB data up to S3 was very painful. • We used S3Funnel which is “rsync like” • We bought more network bandwidth for our office
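To see why copying 5 TB over an office connection hurts, a back-of-the-envelope calculation helps. The sketch below is only arithmetic: the 5 TB figure comes from the slide, while the link speeds and the 80% efficiency factor are assumptions chosen for illustration, not numbers from the project.

```java
// Rough transfer-time estimate for a bulk upload.
// 5 TB is from the talk; link speeds and efficiency are illustrative assumptions.
final class TransferTime {

    /** Hours needed to move `bytes` over a link of `bitsPerSecond`, at an efficiency in (0, 1]. */
    static double hours(double bytes, double bitsPerSecond, double efficiency) {
        if (bitsPerSecond <= 0 || efficiency <= 0 || efficiency > 1) {
            throw new IllegalArgumentException("bandwidth must be positive and efficiency in (0, 1]");
        }
        double seconds = (bytes * 8) / (bitsPerSecond * efficiency);
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        double fiveTerabytes = 5e12;              // decimal terabytes
        double[] linksMbps = {10, 100, 1000};     // hypothetical office uplinks
        for (double mbps : linksMbps) {
            double h = hours(fiveTerabytes, mbps * 1e6, 0.8);
            System.out.printf("%6.0f Mbit/s -> %8.1f hours (%.1f days)%n", mbps, h, h / 24);
        }
    }
}
```

Under these assumptions a 100 Mbit/s uplink needs close to six days of uninterrupted transfer, which is the point of the station wagon quote that follows.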
    • 21. Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. –Andrew Tanenbaum, 1981
    • 22. Data Size 277871
    • 23. Think about Data Volume • Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce nice, need more visibility into progress. • Should have sharded our Search Index from the beginning just to make indexing a faster and cheaper process (500 GB index!) • 8 shards dropped time from 12 hours to 2 hours. Merging took 5! • We had too many steps in our pipeline
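One way to picture the sharding lesson: if every document is deterministically assigned to one of N shards, the shards can be indexed in parallel by separate workers. The sketch below is a generic hash-based router, not the routing the GPSN team or Solr actually uses. The class name, the CRC32 choice, and the `CN…` document ids are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical router: assigns each document id to one of `shardCount` shards.
// This is NOT Solr's own routing; it only illustrates splitting work evenly.
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns a shard number in [0, shardCount) that is stable for a given id. */
    int shardFor(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8); // 8 shards, as on the slide
        int[] counts = new int[8];
        for (int i = 0; i < 100_000; i++) {
            counts[router.shardFor("CN" + i)]++; // made-up patent ids
        }
        for (int shard = 0; shard < counts.length; shard++) {
            System.out.println("shard " + shard + ": " + counts[shard] + " docs");
        }
    }
}
```

The slide's caveat still applies: if the shards must later be merged into a single index, the merge time has to be counted against the indexing time saved.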
    • 24. Building a Patents Index
    • 25. Cloud meet Ocean
    • 26. More prosaically… [Diagram: Clients, Servers, and a Database, with $ marking where computation costs money]
    • 27. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) ➡Parsers and Parsers and Parsers • Don’t be Afraid to Share!
    • 28. Morphlines Why so many pipelines?
    • 29. Tika as a pipeline?
    • 30. Lots of File Types • Sometimes in ZIP archives, sometimes not! • Multiple XML formats as well as CSV and EDI • Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
    • 31. Tika as a pipeline! • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project (and others!).
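    • 32. Detector to pick File

        import java.io.IOException;
        import java.io.InputStream;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.tika.detect.Detector;
        import org.apache.tika.io.LookaheadInputStream;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.mime.MediaType;

        public class GreenbookDetector implements Detector {
            // Marker string used to recognize a Greenbook file.
            private static final Pattern pattern = Pattern.compile("PATN");

            @Override
            public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
                // Default when the marker is not found.
                MediaType type = MediaType.OCTET_STREAM;
                // Peek at the first 1024 bytes only.
                InputStream lookahead = new LookaheadInputStream(stream, 1024);
                String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");
                Matcher matcher = pattern.matcher(extract);
                if (matcher.find()) {
                    // GreenbookParser is the project's own parser class (not shown on the slide).
                    type = GreenbookParser.MEDIA_TYPE;
                }
                // Closing the lookahead stream resets the underlying stream.
                lookahead.close();
                return type;
            }
        }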
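The detector on slide 32 relies on a "peek, then rewind" pattern: read a bounded prefix, look for a marker, and leave the stream where it started so the real parser sees the whole file. For readers without Tika at hand, the same idea can be sketched with only the JDK (Java 11+ for `readNBytes`). The class and method names here are hypothetical, and this is an illustration of the pattern rather than a replacement for Tika's `LookaheadInputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether an ASCII marker occurs in the first
// `limit` bytes of a stream, then rewinds the stream to where it started.
final class MagicSniffer {

    static boolean containsMagic(InputStream in, String magic, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit); // remember the current position
        try {
            byte[] head = in.readNBytes(limit); // reads at most `limit` bytes
            // ISO-8859-1 maps each byte to one char, so no bytes are lost in decoding.
            String text = new String(head, StandardCharsets.ISO_8859_1);
            return text.contains(magic);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                in.reset(); // rewind so the caller can parse from the start
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        // Made-up sample content that starts with the marker from the slide.
        byte[] sample = "PATN\nsample record".getBytes(StandardCharsets.US_ASCII);
        ByteArrayInputStream in = new ByteArrayInputStream(sample);
        System.out.println("marker found: " + containsMagic(in, "PATN", 1024));
        System.out.println("first byte after sniffing: " + (char) in.read());
    }
}
```

Streams that do not support mark/reset, such as a raw `FileInputStream`, would need to be wrapped in a `BufferedInputStream` first.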
    • 33. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers ➡Don’t be Afraid to Share!
    • 34. Your solution isn’t perfect • Allow users to export data • Most business users want to work in Excel! Accept it! • Allow other applications to build on top of it.
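If business users want Excel, the simplest export is usually CSV. The sketch below shows the quoting rules that tend to matter in practice (commas, quotes, and line breaks inside a field). The class name and the column names are hypothetical, and nothing here is claimed to be how GPSN implements its export.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical CSV writer using RFC 4180 style quoting.
final class CsvExport {

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\""); // double any embedded quotes
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Renders a header row plus data rows, each terminated by CRLF. */
    static String toCsv(List<String> header, List<List<String>> rows) {
        StringBuilder out = new StringBuilder(line(header));
        for (List<String> row : rows) {
            out.append(line(row));
        }
        return out.toString();
    }

    private static String line(List<String> fields) {
        return fields.stream().map(CsvExport::escape).collect(Collectors.joining(",")) + "\r\n";
    }

    public static void main(String[] args) {
        // Made-up rows; the columns are illustrative only.
        String csv = toCsv(
                List.of("id", "title"),
                List.of(List.of("CN1", "Battery, rechargeable"),
                        List.of("CN2", "A \"smart\" valve")));
        System.out.print(csv);
    }
}
```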
    • 35. GPSN has • Lots of easy “Print to PDF” options. • Data stored in S3 as: individual patent files, chunky downloads. • Filtering to expand or select specific data sets. • Permalinks: simple, very sharable URLs. • Underlying Solr service is exposed to public via firewall. You can query Solr yourself.
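"You can query Solr yourself" in practice means building an HTTP request against Solr's standard `/select` handler. The sketch below only constructs such a URL (Java 10+ for `URLEncoder.encode(String, Charset)`). The host, the core name `patents`, and the field names `title` and `year` are placeholders, since the slides do not give the real endpoint or schema. `q`, `rows`, and `wt` are standard Solr query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper that builds a URL for Solr's /select request handler.
final class SolrQueryUrl {

    static String build(String coreUrl, String query, int rows) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps parameter order stable
        params.put("q", query);
        params.put("rows", Integer.toString(rows));
        params.put("wt", "json"); // ask for a JSON response
        String queryString = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return coreUrl + "/select?" + queryString;
    }

    public static void main(String[] args) {
        // Placeholder host, core, and field names; substitute the real ones.
        String url = build("http://solr.example.org/solr/patents",
                "title:battery AND year:2010", 10);
        System.out.println(url);
    }
}
```

Because the whole query lives in the URL, the same string also works as a shareable permalink, which is the other feature the slide highlights. Encoding as UTF-8 means Chinese query terms survive the round trip as well.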
    • 36. One more thought...
    • 37. Measuring the impact of our algorithm changes is just getting harder with Big Data.
    • 38. Quepid: Give your Queries some Love. We need beta users! www.quepid.io
    • 39. Office Hours Thurs 10:50 AM What's Up with the Lucene Community?
    • 40. Questions? Nervous about speaking up? Ask me later! • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s
