Arxiv.org: Research And Development Directions

ArXiv.org
250,000 documents
47,000 registered users
1 million+ downloads per year

Cost Per Paper
$10000

Commercial Journal

$1000

Non-Profit Journal

$10

arXiv

Goal: Process increasing number of submissions at
constant or declining cost

arXiv has an active core of users: 10% of users are
responsible for about 1/3 of all submissions, 50% of all
users have logged in (to submit or update a paper) in the
past 1.5 years

Authentication and Access Control
Recently moved from an http authentication/Berkeley database system
to a system based on cookies and a relational database.
Currently, all registered users (who haven’t been suspended) can
submit to all subjects classes in all archives – the original submitter or
somebody with the paper password can update the paper.
People are allowed to register depending on their E-mail address:
abc@university.edu can register, but xyz@company.com can’t unless
company=ibm,lucent,…; this list is hard to maintain (we have to block
popular ISPs in every country), exceptions are dealt with manually at
great cost (each case takes detective work), and there are many people
in .edu (alumni, non-research staff) who shouldn’t be able to submit.
Because registration and submission are linked, user database can’t be
used to offer other services: e-mail notification, personalization.

Endorsements and Trust Management
Administrators

Grandfathered Users

In new system, everyone will be able to register. Users who
registered under the old system will still be able to upload to
any archive or subject class, but new users will need to be
endorsed by an author with a publication history in that
category. Burden shifts from one senior staff person to 47,000
registered users. User database can be used

Endorsee

d
En

Endorser

en
m
rse
o

e
od
tc

Web-based interface for administrators:
• View user history and publications
• Monitor endorsement process
• Manage authority records
• Disable ability to submit or endorse
• Keep “institutional memory”

Future Directions
•Flexible Submission Queue (Currently submissions are
published the following evening – we can’t easily delay a
submission)
•Validating Metadata Form (Force users to clean up entry
errors, so administrators don’t have to)
• Automatic Protection (Suspicious submissions and
endorsements will be automatically delayed)
• New Search Engine based on Lucene
• Retrofit e-mail notification (current awareness) to use new
user database.

Classifying Articles with the
Support Vector Machine
Paul Ginsparg
Paul Houle
Thorsten Joachims
Jae-Hoon Sul
Goal: identify papers in existing archives that are relevant to
a new subject archive, q-bio (Quantitative Biology)

Active Training of SVM
Training: q-bio
Training: not q-bio
Other far from margin
Other close to margin

SVM finds maximum-margin hyperplane. We do first training run on one
year of data, then identify other papers that lie close to the dividing line.
We iteratively classify these by hand to refine the classification

Classifer performance improves as the size of a category
increases.

Time Series Analysis of Content
and Usage Information
Paul Ginsparg
Jon Kleinberg

Kleinberg’s algorithm uses a hidden Markov model to detect bursts of
word usage in arXiv titles, reveals intellectual trends in the last
decade of high-energy physics theory.

Announcement

Cited by other papers
Web Link Added

Review papers have a distinctive pattern of use: an initial spike after
announcement, followed by a long nearly-constant tail.

Arxiv.org: Research And Development Directions

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to Arxiv.org: Research And Development Directions

Similar to Arxiv.org: Research And Development Directions (20)

More from Paul Houle

More from Paul Houle (20)

Recently uploaded

Recently uploaded (20)

Arxiv.org: Research And Development Directions

Editor's Notes