SlideShare a Scribd company logo
Searching Chinese Patents: 	

Challenges and Solutions When Building
an Innovative Discovery Interface
ERIC PUGH | epugh@o19s.com | @dep4b
Who am I?
• Principal at OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary	

• Member of Apache Software
Foundation	

• SOLR-284 UpdateRichDocuments
(July 07)
Co-Author
N
extEdition
M
ay!
Agilista
Selected Customers
Telling some stories
war	

^
Risks
• Cloud new at USPTO	

• Discovery is tenuous concept	

• Conflicting User Goals	

• Fixed Budget: trade scope for
budget/quality
Telling some stories
➡How to inject “Discovery” into your
app	

• The Cloud to the Rescue (sorta!)	

• Parsers and Parsers and Parsers	

• Don’t be Afraid to Share!
Flow of understanding
Data UnderstandingInformation
Building “Discovery”
Engine
UX DataTension
Grok data at gut level	

Look for outliers	

!
User Interviews	

Surveys	

Card Sorting	

Scenarios/Personas	

!
UX
Data
brainstorm
Mockups	

Proof of concept	

!
!
Where to spend time?
UX
Engine
Data
40%	

!
20%	

!
40%	

!
40%	

!
40%	

!
20%	

We spent	

!
Telling some stories
• How to inject “Discovery” into your app	

➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers	

• Don’t be Afraid to Share!
Boy meets Girl Story
Boy meets Girl Story
Metadata
Ingest	

Pipeline	

Discovery
UX
Content
Files
How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootsrap CSS)
Browsers
Mobile/
Tablet
Third Party
Application
Servers
S3 BucketSolr
Solr as a NoSQL
Datastore
• Used “atomic updates” to merge three
source datasets into single final dataset.	

• All text displayed in application stored in
Solr.	

• Dynamic schema supports many languages,
en, cn right now.
Lessons Learned
Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.	

• We used S3Funnel which is “rsync like”	

• We bought more network bandwidth for
our office
Never
underestimate
the bandwidth of
a station wagon
full of tapes
hurtling down
the highway.

–Andrew Tanenbaum, 1981
Data Size
0
250000
500000
750000
1000000
1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011
Patent Count
277871
Think about DataVolume
• Started with older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce nice,
need more visibility into progress..	

• Should have sharded our Search Index from the beginning
just to make indexing faster and cheaper process (500 gb
index!)	

• 8 shards dropped time from 12 hours to 2 hours.
Merging took 5!	

• We had too many steps in our pipeline
Building	
  a	
  Patents	
  Index
MachineCount
0
75
150
225
300
5 days 3 days 30 Minutes
1 5
300
Key scaling concept
behind GPSN:	

!
Cloud meets Ocean
More prosaically…
Database
Server
Server
Server
Client
Client
Client
$
$
$
$
Telling some stories
• How to inject “Discovery” into your app	

• The Cloud to the Rescue (sorta!)	

➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
Why so many pipelines?
Morphlines
Tika as a pipeline?
Lot’s of File Types
• Sometimes in ZIP archives, sometimes not!	

• multiple XML formats as well as CSV and
EDI	

• Purplebook,Yellowbook,
Redbook,Greenbook, Questel, SIPO…
Tika as a pipeline!
• Auto detects content type	

• Metadata structure has all the
key/value needed for Solr	

• Allows us to scale up with
Behemoth project (and
others!).
Lots of files!
HHHHHT APS1 ISSUE - 760106!
PATN!
WKU 039302717!
SRC 5!
APN 5328756!
APT 1!
ART 353!
APD 19741216!
TTL Golf glove!
ISD 19760106!
NCL 4!
ECL 1
<PatentGrant>!
<BibliographicData>!
<GrantIdentification>!
<DocumentKindCode>B1</DocumentKindCode>!
<GrantNumber>06644224</GrantNumber>!
<CountryCode>US</CountryCode>!
<IssueDateText>2003-11-11</IssueDateText>
Detector to pick File
public	
  class	
  GreenbookDetector	
  implements	
  Detector	
  {	
  
!
	
  	
  	
  	
  private	
  static	
  Pattern	
  pattern	
  =	
  Pattern.compile("PATN");	
  
	
  	
  	
  	
  	
  
	
  	
  	
  	
  @Override	
  
	
  	
  	
  	
  public	
  MediaType	
  detect(InputStream	
  stream,	
  Metadata	
  metadata)	
  throws	
  IOException	
  {	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  MediaType	
  type	
  =	
  MediaType.OCTET_STREAM;	
  
	
  	
  	
  	
  	
  	
  	
  	
  InputStream	
  lookahead	
  =	
  new	
  LookaheadInputStream(stream,	
  1024);	
  
	
  	
  	
  	
  	
  	
  	
  	
  String	
  extract	
  =	
  org.apache.commons.io.IOUtils.toString(lookahead,	
  "UTF-­‐8");	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  Matcher	
  matcher	
  =	
  pattern.matcher(extract);	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  if	
  (matcher.find())	
  {	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  type	
  =	
  GreenbookParser.MEDIA_TYPE;	
  
	
  	
  	
  	
  	
  	
  	
  	
  }	
  
!
	
  	
  	
  	
  	
  	
  	
  	
  lookahead.close();	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  return	
  type;	
  
	
  	
  	
  	
  }	
  
	
  	
  	
  
}
Telling some stories
• How to inject “Discovery” into your app	

• The Cloud to the Rescue (sorta!)	

• Parsers and Parsers and Parsers	

➡Don’t be Afraid to Share!
Your BigData solution
isn’t perfect
• Allow users to export data	

• Most business users want to work in Excel.
Accept it!	

• Allow other applications to build on top of
of your application.
GPSN has
• Lots of easy “Print to
PDF” options.	

• Data stored in S3 as:	

• individual patent files	

• chunky downloads.	

• Filtering to expand or
select specific data sets.	

• Permalinks: simple, very
sharable URLs.	

• Underlying Solr service
is exposed to public via
proxy. You can query
Solr yourself.	

• Need advance querying?
Use Lucene syntax in
search bar.
One more thought...
Measuring the impact
of our algorithms
changes is just getting
harder with Big Data.
www.quepid.com
Quepid: Give your Queries
some Love
W
e
need
betausers!
Thank you!
!
Questions?
• epugh@o19s.com	

• @dep4b	

• www.opensourceconnections.com	

• slideshare.com/o19s
Nervous about
speaking up? Ask
me later!

More Related Content

What's hot

How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
Josh Patterson
 
Coscup
CoscupCoscup
Coscup
Giivee The
 
DrupalCon 2011 Highlight
DrupalCon 2011 HighlightDrupalCon 2011 Highlight
DrupalCon 2011 Highlight
Supakit Kiatrungrit
 
DefCore: The Interoperability Standard for OpenStack
DefCore: The Interoperability Standard for OpenStackDefCore: The Interoperability Standard for OpenStack
DefCore: The Interoperability Standard for OpenStack
Mark Voelker
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
Giivee The
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
Josh Patterson
 
OpenStack: Toward a More Resilient Cloud
OpenStack: Toward a More Resilient CloudOpenStack: Toward a More Resilient Cloud
OpenStack: Toward a More Resilient Cloud
Mark Voelker
 
Getting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLabGetting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLab
Melvin Hillsman
 
Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)
Jeremy Edberg
 
What's beyond Virtualization - The Future of Cloud Platforms
What's beyond Virtualization - The Future of Cloud PlatformsWhat's beyond Virtualization - The Future of Cloud Platforms
What's beyond Virtualization - The Future of Cloud Platforms
Derek Collison
 
Joyent circa 2006 (Scale with Rails)
Joyent circa 2006 (Scale with Rails)Joyent circa 2006 (Scale with Rails)
Joyent circa 2006 (Scale with Rails)
bcantrill
 
Securing Hadoop in an Enterprise Context (v2)
Securing Hadoop in an Enterprise Context (v2)Securing Hadoop in an Enterprise Context (v2)
Securing Hadoop in an Enterprise Context (v2)
Hellmar Becker
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Hellmar Becker
 
Markup languages and warp-speed documentation
Markup languages and warp-speed documentationMarkup languages and warp-speed documentation
Markup languages and warp-speed documentation
Lois Patterson
 
OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015
Mark Voelker
 
What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?
DATAVERSITY
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Josh Patterson
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
Trent Hornibrook
 
Operations for databases: the agile/devops journey
Operations for databases: the agile/devops journeyOperations for databases: the agile/devops journey
Operations for databases: the agile/devops journey
Eduardo Piairo
 
Drupal 8 Configuration Management Initiative Update
Drupal 8 Configuration Management Initiative UpdateDrupal 8 Configuration Management Initiative Update
Drupal 8 Configuration Management Initiative Updateheyrocker
 

What's hot (20)

How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Coscup
CoscupCoscup
Coscup
 
DrupalCon 2011 Highlight
DrupalCon 2011 HighlightDrupalCon 2011 Highlight
DrupalCon 2011 Highlight
 
DefCore: The Interoperability Standard for OpenStack
DefCore: The Interoperability Standard for OpenStackDefCore: The Interoperability Standard for OpenStack
DefCore: The Interoperability Standard for OpenStack
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
OpenStack: Toward a More Resilient Cloud
OpenStack: Toward a More Resilient CloudOpenStack: Toward a More Resilient Cloud
OpenStack: Toward a More Resilient Cloud
 
Getting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLabGetting a Neural Network Up and Running with OpenLab
Getting a Neural Network Up and Running with OpenLab
 
Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)
 
What's beyond Virtualization - The Future of Cloud Platforms
What's beyond Virtualization - The Future of Cloud PlatformsWhat's beyond Virtualization - The Future of Cloud Platforms
What's beyond Virtualization - The Future of Cloud Platforms
 
Joyent circa 2006 (Scale with Rails)
Joyent circa 2006 (Scale with Rails)Joyent circa 2006 (Scale with Rails)
Joyent circa 2006 (Scale with Rails)
 
Securing Hadoop in an Enterprise Context (v2)
Securing Hadoop in an Enterprise Context (v2)Securing Hadoop in an Enterprise Context (v2)
Securing Hadoop in an Enterprise Context (v2)
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking Ecosystem
 
Markup languages and warp-speed documentation
Markup languages and warp-speed documentationMarkup languages and warp-speed documentation
Markup languages and warp-speed documentation
 
OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015
 
What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
 
Operations for databases: the agile/devops journey
Operations for databases: the agile/devops journeyOperations for databases: the agile/devops journey
Operations for databases: the agile/devops journey
 
Drupal 8 Configuration Management Initiative Update
Drupal 8 Configuration Management Initiative UpdateDrupal 8 Configuration Management Initiative Update
Drupal 8 Configuration Management Initiative Update
 

Viewers also liked

Oliver Schmid's Value Presentation
Oliver Schmid's Value Presentation Oliver Schmid's Value Presentation
Oliver Schmid's Value Presentation
Oliver Schmid
 
G&S 2017 Presentación Corporativa
G&S 2017   Presentación CorporativaG&S 2017   Presentación Corporativa
G&S 2017 Presentación Corporativa
Javier Gonzales
 
APM Presentación Corporativa 2012
APM Presentación Corporativa 2012APM Presentación Corporativa 2012
APM Presentación Corporativa 2012
APM-Shipley
 
Presentación Corporativa Indra
Presentación Corporativa IndraPresentación Corporativa Indra
Presentación Corporativa Indra
alekope
 
Eeb corporativa vf
Eeb corporativa vfEeb corporativa vf
Eeb corporativa vf
Grupo Energía Bogotá
 
Connectu media Company Presentation
Connectu media Company PresentationConnectu media Company Presentation
Connectu media Company Presentation
marykayhoal
 
Deutsche EuroShop | Company Presentation | 09/16
Deutsche EuroShop | Company Presentation | 09/16 Deutsche EuroShop | Company Presentation | 09/16
Deutsche EuroShop | Company Presentation | 09/16
Deutsche EuroShop AG
 
Company_Presentation_2015_-_Short
Company_Presentation_2015_-_ShortCompany_Presentation_2015_-_Short
Company_Presentation_2015_-_ShortHitesh Mathur
 
Click consulting group corporate presentation 2014
Click consulting group corporate presentation 2014Click consulting group corporate presentation 2014
Click consulting group corporate presentation 2014
S. Milena C Bonila
 
Karcher Company presentation
Karcher Company presentation Karcher Company presentation
Karcher Company presentation
Karcher Cleaning Systems Private Limited
 
Rcpower presentation original es
Rcpower presentation original esRcpower presentation original es
Rcpower presentation original esrcpowerrd
 
Presentación corporativa Abengoa 2017
Presentación corporativa Abengoa 2017Presentación corporativa Abengoa 2017
Presentación corporativa Abengoa 2017
Abengoa
 
Company presentation Aura Light Spain
Company presentation Aura Light SpainCompany presentation Aura Light Spain
Company presentation Aura Light Spain
AuraLight00
 
Abengoa's Corporate Presentation 2017
Abengoa's Corporate Presentation 2017Abengoa's Corporate Presentation 2017
Abengoa's Corporate Presentation 2017
Abengoa
 
Deutsche EuroShop | Company Presentation | 02/17
Deutsche EuroShop | Company Presentation | 02/17 Deutsche EuroShop | Company Presentation | 02/17
Deutsche EuroShop | Company Presentation | 02/17
Deutsche EuroShop AG
 
ERTMS Solutions general company presentation
ERTMS Solutions general company presentationERTMS Solutions general company presentation
ERTMS Solutions general company presentation
ERTMS Solutions
 
PJ Software Company Presentation
PJ Software Company PresentationPJ Software Company Presentation
PJ Software Company Presentation
PJ Software
 
2014 LinkedIn Company Presentation
2014 LinkedIn Company Presentation2014 LinkedIn Company Presentation
2014 LinkedIn Company Presentation
LinkedIn
 

Viewers also liked (20)

Oliver Schmid's Value Presentation
Oliver Schmid's Value Presentation Oliver Schmid's Value Presentation
Oliver Schmid's Value Presentation
 
G&S 2017 Presentación Corporativa
G&S 2017   Presentación CorporativaG&S 2017   Presentación Corporativa
G&S 2017 Presentación Corporativa
 
APM Presentación Corporativa 2012
APM Presentación Corporativa 2012APM Presentación Corporativa 2012
APM Presentación Corporativa 2012
 
Presentación Corporativa Indra
Presentación Corporativa IndraPresentación Corporativa Indra
Presentación Corporativa Indra
 
Eeb corporativa vf
Eeb corporativa vfEeb corporativa vf
Eeb corporativa vf
 
KP314
KP314KP314
KP314
 
Company Presentation
Company PresentationCompany Presentation
Company Presentation
 
Connectu media Company Presentation
Connectu media Company PresentationConnectu media Company Presentation
Connectu media Company Presentation
 
Deutsche EuroShop | Company Presentation | 09/16
Deutsche EuroShop | Company Presentation | 09/16 Deutsche EuroShop | Company Presentation | 09/16
Deutsche EuroShop | Company Presentation | 09/16
 
Company_Presentation_2015_-_Short
Company_Presentation_2015_-_ShortCompany_Presentation_2015_-_Short
Company_Presentation_2015_-_Short
 
Click consulting group corporate presentation 2014
Click consulting group corporate presentation 2014Click consulting group corporate presentation 2014
Click consulting group corporate presentation 2014
 
Karcher Company presentation
Karcher Company presentation Karcher Company presentation
Karcher Company presentation
 
Rcpower presentation original es
Rcpower presentation original esRcpower presentation original es
Rcpower presentation original es
 
Presentación corporativa Abengoa 2017
Presentación corporativa Abengoa 2017Presentación corporativa Abengoa 2017
Presentación corporativa Abengoa 2017
 
Company presentation Aura Light Spain
Company presentation Aura Light SpainCompany presentation Aura Light Spain
Company presentation Aura Light Spain
 
Abengoa's Corporate Presentation 2017
Abengoa's Corporate Presentation 2017Abengoa's Corporate Presentation 2017
Abengoa's Corporate Presentation 2017
 
Deutsche EuroShop | Company Presentation | 02/17
Deutsche EuroShop | Company Presentation | 02/17 Deutsche EuroShop | Company Presentation | 02/17
Deutsche EuroShop | Company Presentation | 02/17
 
ERTMS Solutions general company presentation
ERTMS Solutions general company presentationERTMS Solutions general company presentation
ERTMS Solutions general company presentation
 
PJ Software Company Presentation
PJ Software Company PresentationPJ Software Company Presentation
PJ Software Company Presentation
 
2014 LinkedIn Company Presentation
2014 LinkedIn Company Presentation2014 LinkedIn Company Presentation
2014 LinkedIn Company Presentation
 

Similar to Searching Chinese Patents Presentation at Enterprise Data World

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
jixuan1989
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
CareerBuilder.com
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
Big Data Spain
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Nilesh Shah
 
Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017
Jim Adcock
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
DataWorks Summit
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
Denny Lee
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
StampedeCon
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
Russell Jurney
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
Andrew Musselman
 

Similar to Searching Chinese Patents Presentation at Enterprise Data World (20)

Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017Movin on Up - SPEngage Phoenix 2017
Movin on Up - SPEngage Phoenix 2017
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 

More from OpenSource Connections

Encores
EncoresEncores
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
OpenSource Connections
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
OpenSource Connections
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
OpenSource Connections
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
OpenSource Connections
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
OpenSource Connections
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
OpenSource Connections
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
OpenSource Connections
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
OpenSource Connections
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
OpenSource Connections
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
OpenSource Connections
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
OpenSource Connections
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
OpenSource Connections
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
OpenSource Connections
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
OpenSource Connections
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
OpenSource Connections
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
OpenSource Connections
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
OpenSource Connections
 

More from OpenSource Connections (20)

Encores
EncoresEncores
Encores
 
Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 

Recently uploaded

20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 

Recently uploaded (20)

20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 

Searching Chinese Patents Presentation at Enterprise Data World

  • 1. Searching Chinese Patents: Challenges and Solutions When Building an Innovative Discovery Interface ERIC PUGH | epugh@o19s.com | @dep4b
  • 2. Who am I? • Principal at OpenSource Connections - Solr/Lucene Search Consultancy http://bit.ly/OSCCommercialSummary • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07)
  • 7.
  • 8. Risks • Cloud new at USPTO • Discovery is tenuous concept • Conflicting User Goals • Fixed Budget: trade scope for budget/quality
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14. Telling some stories ➡How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 15. Flow of understanding Data UnderstandingInformation
  • 17. Grok data at gut level Look for outliers ! User Interviews Surveys Card Sorting Scenarios/Personas ! UX Data brainstorm Mockups Proof of concept ! !
  • 18. Where to spend time? UX Engine Data 40% ! 20% ! 40% ! 40% ! 40% ! 20% We spent !
  • 19. Telling some stories • How to inject “Discovery” into your app ➡The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 20. Boy meets Girl Story
  • 21. Boy meets Girl Story Metadata Ingest Pipeline Discovery UX Content Files
  • 22. How we built it EmberJS Single Page Search App HTML XML JSON Server Dashboard GPSN UI (Bootsrap CSS) Browsers Mobile/ Tablet Third Party Application Servers S3 BucketSolr
  • 23. Solr as a NoSQL Datastore • Used “atomic updates” to merge three source datasets into single final dataset. • All text displayed in application stored in Solr. • Dynamic schema supports many languages, en, cn right now.
  • 25. Don’t Move Files • Copying 5 TB data up to S3 was very painful. • We used S3Funnel which is “rsync like” • We bought more network bandwidth for our office
  • 26. Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
 –Andrew Tanenbaum, 1981
  • 27. Data Size 0 250000 500000 750000 1000000 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 Patent Count 277871
  • 28. Think about DataVolume • Started with older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce nice, need more visibility into progress.. • Should have sharded our Search Index from the beginning just to make indexing faster and cheaper process (500 gb index!) • 8 shards dropped time from 12 hours to 2 hours. Merging took 5! • We had too many steps in our pipeline
  • 29. Building  a  Patents  Index MachineCount 0 75 150 225 300 5 days 3 days 30 Minutes 1 5 300
  • 30. Key scaling concept behind GPSN: ! Cloud meets Ocean
  • 32. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) ➡Parsers and Parsers and Parsers • Don’t be Afraid to Share!
  • 33. Why so many pipelines? Morphlines
  • 34. Tika as a pipeline?
  • 35. Lot’s of File Types • Sometimes in ZIP archives, sometimes not! • multiple XML formats as well as CSV and EDI • Purplebook,Yellowbook, Redbook,Greenbook, Questel, SIPO…
  • 36. Tika as a pipeline! • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project (and others!).
  • 37. Lots of files! HHHHHT APS1 ISSUE - 760106! PATN! WKU 039302717! SRC 5! APN 5328756! APT 1! ART 353! APD 19741216! TTL Golf glove! ISD 19760106! NCL 4! ECL 1 <PatentGrant>! <BibliographicData>! <GrantIdentification>! <DocumentKindCode>B1</DocumentKindCode>! <GrantNumber>06644224</GrantNumber>! <CountryCode>US</CountryCode>! <IssueDateText>2003-11-11</IssueDateText>
  • 38. Detector to pick File public  class  GreenbookDetector  implements  Detector  {   !        private  static  Pattern  pattern  =  Pattern.compile("PATN");                    @Override          public  MediaType  detect(InputStream  stream,  Metadata  metadata)  throws  IOException  {   !                MediaType  type  =  MediaType.OCTET_STREAM;                  InputStream  lookahead  =  new  LookaheadInputStream(stream,  1024);                  String  extract  =  org.apache.commons.io.IOUtils.toString(lookahead,  "UTF-­‐8");   !                Matcher  matcher  =  pattern.matcher(extract);   !                if  (matcher.find())  {                          type  =  GreenbookParser.MEDIA_TYPE;                  }   !                lookahead.close();                                    return  type;          }         }
  • 39. Telling some stories • How to inject “Discovery” into your app • The Cloud to the Rescue (sorta!) • Parsers and Parsers and Parsers ➡Don’t be Afraid to Share!
  • 40. Your BigData solution isn’t perfect • Allow users to export data • Most business users want to work in Excel. Accept it! • Allow other applications to build on top of of your application.
  • 41. GPSN has • Lots of easy “Print to PDF” options. • Data stored in S3 as: • individual patent files • chunky downloads. • Filtering to expand or select specific data sets. • Permalinks: simple, very sharable URLs. • Underlying Solr service is exposed to public via proxy. You can query Solr yourself. • Need advance querying? Use Lucene syntax in search bar.
  • 43. Measuring the impact of our algorithms changes is just getting harder with Big Data.
  • 44. www.quepid.com Quepid: Give your Queries some Love W e need betausers!
  • 45. Thank you! ! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com • slideshare.com/o19s Nervous about speaking up? Ask me later!