ArXiv.org
250,000 documents
47,000 registered users
1 million+ downloads per year

Cost Per Paper
$10000 Commercial Journal
$1000 Non-Profit Journal
$10 arXiv
Goal: Process increasing number of submissions at
constant or declining cost
arXiv has an active core of users: 10% of users are
responsible for about 1/3 of all submissions, and 50% of all
users have logged in (to submit or update a paper) in the
past 1.5 years.
Authentication and Access Control
Recently moved from an HTTP authentication/Berkeley database system
to a system based on cookies and a relational database.
Currently, all registered users (who haven’t been suspended) can
submit to all subject classes in all archives. The original submitter, or
somebody with the paper password, can update the paper.
People are allowed to register depending on their e-mail address:
abc@university.edu can register, but xyz@company.com can’t unless
company=ibm,lucent,…; this list is hard to maintain (we have to block
popular ISPs in every country), exceptions are dealt with manually at
great cost (each case takes detective work), and there are many people
in .edu (alumni, non-research staff) who shouldn’t be able to submit.
Because registration and submission are linked, the user database can’t be
used to offer other services such as e-mail notification and personalization.
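The address-based registration rule described above can be sketched as follows. This is only an illustration: the domain spellings `ibm.com` and `lucent.com`, the blocked ISP domain, and the function name `may_register` are assumptions, since the slides give only "company=ibm,lucent,…" and do not show the real lists or code.

```python
# Hypothetical lists: the slides name only "ibm" and "lucent" as allowed
# companies and do not list the blocked ISPs.
ALLOWED_SUFFIXES = (".edu",)
ALLOWED_COMPANY_DOMAINS = {"ibm.com", "lucent.com"}
BLOCKED_DOMAINS = {"example-isp.net"}  # stand-in for a popular ISP


def may_register(email: str) -> bool:
    """Return True if the address passes the domain-based rule."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return False  # not a usable address
    if domain in BLOCKED_DOMAINS:
        return False
    if domain in ALLOWED_COMPANY_DOMAINS:
        return True
    # Everything else is accepted only under an allowed suffix such as .edu.
    return domain.endswith(ALLOWED_SUFFIXES)
```

The sketch also shows the weakness the slide points out: any `.edu` address passes, including alumni and non-research staff, and every exception means editing one of the lists by hand.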
Endorsements and Trust Management
In the new system, everyone will be able to register. Users who
registered under the old system will still be able to upload to
any archive or subject class, but new users will need to be
endorsed by an author with a publication history in that
category. Burden shifts from one senior staff person to 47,000
registered users. User database can be used
[Diagram labels: Administrators, Grandfathered Users, Endorser, Endorsee, Endorsement code]
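A minimal sketch of the endorsement rule, assuming a very simple in-memory user record. The class `User`, its fields, and the threshold `MIN_PAPERS_TO_ENDORSE` are hypothetical; the slides say only that an endorser needs "a publication history in that category" and that grandfathered users keep their old rights.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    grandfathered: bool = False  # registered under the old system
    suspended: bool = False      # an administrator disabled this account
    papers: dict = field(default_factory=dict)      # category -> paper count
    endorsed_for: set = field(default_factory=set)  # categories endorsed in


MIN_PAPERS_TO_ENDORSE = 1  # hypothetical threshold, not given in the slides


def can_endorse(endorser: User, category: str) -> bool:
    """An endorser needs a publication history in the category."""
    return (not endorser.suspended
            and endorser.papers.get(category, 0) >= MIN_PAPERS_TO_ENDORSE)


def endorse(endorser: User, endorsee: User, category: str) -> bool:
    """Record an endorsement; return False if the endorser is not eligible."""
    if not can_endorse(endorser, category):
        return False
    endorsee.endorsed_for.add(category)
    return True


def can_submit(user: User, category: str) -> bool:
    """Grandfathered users submit anywhere; others need an endorsement."""
    if user.suspended:
        return False
    return user.grandfathered or category in user.endorsed_for
```

Because registration no longer implies permission to submit, the same user table could serve accounts that never submit at all.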
Web-based interface for administrators:
• View user history and publications
• Monitor endorsement process
• Manage authority records
• Disable ability to submit or endorse
• Keep “institutional memory”
Future Directions
• Flexible Submission Queue (Currently submissions are
published the following evening; we can’t easily delay a
submission)
• Validating Metadata Form (Force users to clean up entry
errors, so administrators don’t have to)
• Automatic Protection (Suspicious submissions and
endorsements will be automatically delayed)
• New Search Engine based on Lucene
• Retrofit e-mail notification (current awareness) to use new
user database
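One way the flexible queue and automatic protection items could fit together is sketched below. It assumes each submission carries a `suspicious` flag set elsewhere and that flagged items are held for a fixed number of extra days; `HOLD_DAYS`, the `Submission` record, and the function names are all hypothetical, and the slides do not describe how suspicion is detected.

```python
from dataclasses import dataclass
from datetime import date, timedelta

HOLD_DAYS = 3  # hypothetical extra delay for flagged submissions


@dataclass
class Submission:
    paper_id: str
    received: date
    suspicious: bool = False


def release_date(sub: Submission) -> date:
    """Normal items go out the day after receipt; flagged items wait longer."""
    delay = 1 + (HOLD_DAYS if sub.suspicious else 0)
    return sub.received + timedelta(days=delay)


def due_for_announcement(queue, today):
    """Return the ids of queued submissions whose release date has arrived."""
    return [s.paper_id for s in queue if release_date(s) <= today]
```

Storing a release date per submission, rather than publishing everything received by a cutoff, is what would let an administrator delay a single paper without touching the rest of the queue.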
Classifying Articles with the
Support Vector Machine
Paul Ginsparg
Paul Houle
Thorsten Joachims
Jae-Hoon Sul
Goal: identify papers in existing archives that are relevant to
a new subject archive, q-bio (Quantitative Biology)
Active Training of SVM
[Figure legend: Training: q-bio; Training: not q-bio; Other far from margin; Other close to margin]

The SVM finds the maximum-margin hyperplane. We do a first training run on one
year of data, then identify other papers that lie close to the dividing line.
We iteratively classify these by hand to refine the classification.
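The selection step in this loop can be sketched in plain Python. This is an assumption-laden illustration, not the authors' code: the trainer below is simple subgradient descent on the regularised hinge loss (an approximation to a soft-margin linear SVM), the two-dimensional points stand in for real text features, and the function names and the `band` parameter are hypothetical.

```python
def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Subgradient descent on (lam/2)*|w|^2 + hinge loss; labels are +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision(w, b, xi) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]  # regulariser shrinks w
            if violated:  # point is inside the margin or misclassified
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def close_to_margin(w, b, pool, band=1.0):
    """Indices of unlabeled points whose score falls inside the margin band."""
    return [i for i, x in enumerate(pool) if abs(decision(w, b, x)) < band]


# Toy usage: label +1 for "q-bio", -1 for "not q-bio".
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
pool = [(5.0, 5.0), (0.1, 0.0), (-5.0, -5.0), (0.2, -0.1)]
to_review = close_to_margin(w, b, pool)  # candidates for hand labeling
```

Points returned by `close_to_margin` are the ones a person would label next; adding them to the training set and retraining gives one round of the iteration described above.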
Classifier performance improves as the size of a category
increases.
Time Series Analysis of Content
and Usage Information
Paul Ginsparg
Jon Kleinberg
Kleinberg’s algorithm uses a hidden Markov model to detect bursts of
word usage in arXiv titles, revealing intellectual trends in the last
decade of high-energy physics theory.
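A simplified two-state sketch in the spirit of this kind of burst detection is given below. It is not taken from the slides: the scaling factor `s`, the transition weight `gamma`, the function name, and the yearly counts in the usage lines are all assumptions made for illustration. Each time step is assigned a baseline or burst state so that the total of fit costs plus the cost of entering the burst state is minimal, found by dynamic programming.

```python
import math


def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Return one 0/1 state per time step (1 = burst) with minimum total cost.

    relevant[t] is the number of titles containing the word at step t and
    totals[t] the number of titles. Assumes 0 < sum(relevant) < sum(totals).
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)   # baseline rate
    p1 = min(0.9999, s * p0)           # elevated rate in the burst state
    up_cost = gamma * math.log(n)      # price of entering the burst state

    def fit_cost(p, r, d):
        # Negative log-likelihood of r hits in d titles; the binomial
        # coefficient is omitted because it is the same for both states.
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    cost = [0.0, float("inf")]         # the sequence starts in the baseline
    back = []
    for r, d in zip(relevant, totals):
        new_cost, choices = [], []
        for state, p in enumerate((p0, p1)):
            # Moving up costs up_cost; staying or moving down is free.
            via = [cost[prev] + (up_cost if state > prev else 0.0)
                   for prev in (0, 1)]
            prev = 0 if via[0] <= via[1] else 1
            new_cost.append(via[prev] + fit_cost(p, r, d))
            choices.append(prev)
        cost = new_cost
        back.append(choices)

    # Trace the cheapest path backwards from the best final state.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for choices in reversed(back):
        states.append(state)
        state = choices[state]
    return states[::-1]


# Made-up counts: a word appears in about 1% of titles, then about 8%.
hits = [1, 1, 1, 8, 9, 8, 1, 1]
titles = [100] * 8
bursts = detect_bursts(hits, titles)
```

The transition cost keeps a single noisy step from being reported as a burst: the word has to stay elevated long enough for the better fit to pay for the switch.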
[Figure annotations: Announcement; Cited by other papers; Web Link Added]
Review papers have a distinctive pattern of use: an initial spike after
announcement, followed by a long nearly-constant tail.

Arxiv.org: Research And Development Directions
