A tutorial explaining how to upload multiple images or multi-page documents to the Internet Archive using Python and a Raspberry Pi 4:
- How to name files for successful ingest
- How to prepare metadata (information about the files)
- How to set up a Raspberry Pi 4 out of the box
- Installing the operating system
- Configuring the operating system
- Installing the Internet Archive Python library via command line
- Connecting to your Internet Archive account
- Uploading batches of files to your account
Batch uploading to the Internet Archive using Python
1. Batch uploading to the Internet Archive using Python... and a Raspberry Pi 4
Alison Harvey
Special Collections and Archives
Cardiff University
2. This tutorial will explain:
- How to upload multiple images or multi-page documents to the
Internet Archive using Python (without a Linux PC)
- How to name files for successful ingest
- How to prepare metadata (information about the files)
- How to set up a Raspberry Pi 4 out of the box
- Installing the operating system
- Configuring the operating system
- Installing the Internet Archive Python library via command line
- Connecting to your Internet Archive account
- Uploading batches of files to your account
3. You will need:
- Raspberry Pi 4
- Monitor
- HDMI to micro HDMI cable, to connect the monitor to the Raspberry Pi
- USB keyboard and mouse
- Empty 64GB SD card and adapter - to use as your Raspberry Pi’s hard drive
- Image files or .zip files on a USB stick/external hard drive
- A .csv file of metadata saved in the same location
4. File-naming individual images for ingest
- You may upload jpg, jpeg, jp2, tif, tiff, png, gif or bmp files
- Filenames will form part of the final URL, and must be unique to the Internet
Archive. Use a file-naming convention to create a code that is meaningful to
you, but which is unlikely to have already been used.
- This might be three letters (e.g. ABC) to describe the collection, two digits to
describe the year the images were created (e.g. 19 for 2019), and three digits for a
running number per file (001, 002, 003 etc.):
- ABC19001.tif
- ABC19002.tif
- ABC19003.tif
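The naming convention above can be sketched in Python as follows. This is only an illustration: the function name `make_filename` and its parameters are hypothetical, and the three-letter prefix and .tif extension are taken from the example rather than required by the Internet Archive.

```python
def make_filename(prefix, year, number, extension="tif"):
    """Build a name such as ABC19001.tif from its three parts."""
    if not (len(prefix) == 3 and prefix.isalpha()):
        raise ValueError("prefix should be three letters, e.g. ABC")
    if not 1 <= number <= 999:
        raise ValueError("running number must fit in three digits")
    # Two-digit year, then a zero-padded three-digit running number.
    return f"{prefix.upper()}{year % 100:02d}{number:03d}.{extension}"


names = [make_filename("ABC", 2019, n) for n in range(1, 4)]
print(names)  # ['ABC19001.tif', 'ABC19002.tif', 'ABC19003.tif']
```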
5. File-naming multi-page texts for ingest
- It is possible to upload multiple images belonging to a single text, and compile
them into a zip file for ingest as a single digital object.
- Follow the file-naming advice as for images, but use the running number
portion of the filename to indicate the order in which the pages should appear
in the final presentation, e.g.
- Page 1 = XYZ19001.tif
- Page 2 = XYZ19002.tif
- Page 3 = XYZ19003.tif
(Each filename = [prefix] + [running number], e.g. XYZ19 + 001.)
- Then, zip all images belonging to the same document/book into a single file
- Filename it with the prefix, followed by _images.zip, e.g. XYZ19_images.zip
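If you would rather not zip each document by hand, the step above could be scripted with Python's standard zipfile module. This is a minimal sketch, assuming all pages of one document sit in a single folder and share the same prefix; the function name `zip_pages` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_pages(folder, prefix):
    """Zip every file in `folder` whose name starts with `prefix`
    into <prefix>_images.zip, and return the path of the zip file."""
    folder = Path(folder)
    archive_path = folder / f"{prefix}_images.zip"
    # Sorting by name keeps the pages in running-number order.
    pages = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.suffix.lower() != ".zip"
    )
    with zipfile.ZipFile(archive_path, "w") as archive:
        for page in pages:
            # arcname stores just the filename, not the full folder path.
            archive.write(page, arcname=page.name)
    return archive_path
```

For example, `zip_pages("/home/pi", "XYZ19")` would create /home/pi/XYZ19_images.zip from XYZ19001.tif, XYZ19002.tif and so on.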
6. Preparing image metadata
Use Excel or Google Sheets to create a spreadsheet with the following column headings.
Mandatory:
- identifier
- file
- title
- description
- collection
- subject
Optional:
- contributor
- date
- publisher
- creator
- language
- licenseurl
- mediatype
Headings must be spelt and capitalised exactly as above.
7. Mandatory fields: identifier
This is the image filename without the file extension or, for texts, the zip filename without _images.zip.
It will be used as part of the final URL, and it must be unique to the Internet
Archive, e.g.
Images: ABC19001
Texts: XYZ19
8. Mandatory fields: file
This is the path to your file, including the file extension.
This must begin with a forward slash. You will be storing files in your Raspberry
Pi’s ‘home’ folder: /home/pi
Your filepath will be:
Images: /home/pi/ABC19001.tif
Texts: /home/pi/XYZ19_images.zip
Use a new row in the spreadsheet for every file.
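The identifier and file values for each row follow mechanically from the filename, so they could be derived rather than typed. The sketch below assumes the naming convention described earlier and the /home/pi folder; the function name `identifier_and_path` is hypothetical.

```python
from pathlib import PurePosixPath


def identifier_and_path(filename, folder="/home/pi"):
    """Return the (identifier, file) pair for one spreadsheet row."""
    name = PurePosixPath(filename)
    identifier = name.stem  # filename without its extension
    # Zipped texts also drop the _images part: XYZ19_images.zip -> XYZ19
    if name.suffix == ".zip" and identifier.endswith("_images"):
        identifier = identifier[: -len("_images")]
    return identifier, str(PurePosixPath(folder) / name.name)


print(identifier_and_path("ABC19001.tif"))
# ('ABC19001', '/home/pi/ABC19001.tif')
print(identifier_and_path("XYZ19_images.zip"))
# ('XYZ19', '/home/pi/XYZ19_images.zip')
```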
9. Mandatory fields: title
This will appear on your account home page. Try to make it short but descriptive.
Think about what information would be of most use to someone browsing through the
visual content of your home page.
10. Mandatory fields: description
Here, you can add any additional information that will not fit in the title field.
Images, e.g. Cathays Park: aerial photograph looking south west, 1962:
Black and white photo. Showing Cathays Park and College Buildings.
Cardiff Castle and Cardiff Arms Park visible in background. Dimensions:
160mm (h) x 210mm. Image area: 149mm (h) x 210mm (w).
Texts, e.g. On holiday in wartime, France 1914:
Handwritten and hand-illustrated journal of travels across France in the
autumn of 1914. Provenance: Deposited by Frith-Beard in 1932. Archival
ref: 410.
11. Mandatory fields: collection
Unless you already have collections set up on your account, use the default
collections:
Images: opensource_image
Texts: opensource
The collections must be spelt and capitalised exactly as above.
12. Mandatory fields: subject
This field can be searched, but also allows users to filter
items on the same topic.
Think about what information would be most helpful for
your users to be able to filter. Use terminology, spelling,
and capitalisation consistently, so that all matches group
successfully under a single heading.
If you have multiple subjects, use the column headings:
- subject[0]
- subject[1]
- subject[2]... and so on.
Always begin counting at 0, and do not add spaces.
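The numbered headings can be generated from a list of terms, which avoids miscounting or stray spaces. A small sketch follows; the function name `indexed_columns` and the example subject terms are hypothetical.

```python
def indexed_columns(field, values):
    """Map a list of values to column headings field[0], field[1], ...
    A single value keeps the plain heading."""
    if len(values) == 1:
        return {field: values[0]}
    # enumerate starts counting at 0, and the f-string adds no spaces.
    return {f"{field}[{i}]": value for i, value in enumerate(values)}


print(indexed_columns("subject", ["Cardiff", "Aerial photographs"]))
# {'subject[0]': 'Cardiff', 'subject[1]': 'Aerial photographs'}
```

The same helper would work for multilingual texts by passing "language" as the field name.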
13. Optional fields: contributor
If you add this field to every item you upload, it can be used as a quick means of
identifying and extracting information about all your items.
Use advanced search to run a query on your contributor name that will return all
items you have ever uploaded:
and Archives, Cardiff University
‘Contributor’ could be your own name, or the name of your organisation. Make sure
it is detailed enough to be unique, to ensure that you only retrieve your own results.
14. Optional fields: date
This field allows collections to be searched and filtered by date.
This field must be machine-readable, expressing the date as either
YYYY (e.g. 1982) or YYYY-MM-DD (e.g. 1982-11-26)
If you do not have this information, or the date is estimated, leave
this field blank, and use the description field to indicate either that
the item is undated or that its date is uncertain.
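Before saving the spreadsheet, the date column could be checked against the two accepted patterns. This is a sketch of such a check, not the Internet Archive's own validation; the function name `is_valid_date` is hypothetical.

```python
import re
from datetime import date


def is_valid_date(text):
    """Accept a blank cell, YYYY, or a real calendar date as YYYY-MM-DD."""
    if text == "":
        return True  # blank is allowed for undated or uncertain items
    if re.fullmatch(r"\d{4}", text):
        return True
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        try:
            date.fromisoformat(text)  # rejects e.g. month 13 or day 32
            return True
        except ValueError:
            return False
    return False


for value in ["1982", "1982-11-26", "26/11/1982", "1982-13-01"]:
    print(value, is_valid_date(value))
```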
15. Optional fields: publisher, creator
Add this information if you have it.
Creator names can be used to filter content. Present them in a consistent
format to ensure that all matches group successfully under a single heading,
such as:
Surname, forename, yyyy birth date-yyyy death date
Owen, Morfydd, 1891-1918
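One way to keep creator names consistent is to build them from their parts rather than typing them freehand. The helper below is a hypothetical sketch of the format shown above; omitting the dates when they are unknown is an assumption, not a rule from the Internet Archive.

```python
def format_creator(surname, forename, born=None, died=None):
    """Return 'Surname, forename, yyyy-yyyy', leaving out unknown dates."""
    parts = [surname, forename]
    if born and died:
        parts.append(f"{born}-{died}")
    return ", ".join(parts)


print(format_creator("Owen", "Morfydd", 1891, 1918))
# Owen, Morfydd, 1891-1918
```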
16. Optional fields: language
As well as allowing users to search and filter by the language of the text,
completing this field helps the Internet Archive to apply OCR to your items.
OCR, or Optical Character Recognition, analyses the shapes of letters found in
images of printed text, and converts them into machine-encoded text. Users are
then able to search for words and phrases found inside the digital objects.
For multilingual texts, use the column headings: language[0], language[1] etc.
Always use the relevant ISO 639-2 code for your language, e.g.
- English (eng)
- Welsh (wel)
- Arabic (ara)
17. Optional fields: licenseurl
This field applies a license to your content, which tells users what they are allowed to do with it.
Visit Creative Commons to generate an appropriate license, and copy the url into the spreadsheet,
e.g. http://creativecommons.org/licenses/by/4.0/
18. Optional fields: mediatype
Images: image
Texts: texts
This field classifies the object as image or text for the purpose of filtering.
The types must be spelt and capitalised exactly as above.
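Because the collection and mediatype values depend only on whether a row describes an image or a zipped text, they could be filled in from the filename. The sketch below uses the default collections and accepted image types named in this tutorial; the function name `default_fields` is hypothetical, and you would need different values if your account has its own collections.

```python
from pathlib import PurePosixPath

# Image types listed earlier in this tutorial as accepted for upload.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".jp2", ".tif", ".tiff", ".png", ".gif", ".bmp"}


def default_fields(filename):
    """Return the default collection and mediatype for one file."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return {"collection": "opensource_image", "mediatype": "image"}
    if suffix == ".zip":
        return {"collection": "opensource", "mediatype": "texts"}
    raise ValueError(f"unsupported file type: {filename}")


print(default_fields("ABC19001.tif"))
print(default_fields("XYZ19_images.zip"))
```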
19. Saving as csv
When your table of metadata is complete, with an item on each row, you are ready
to save as csv.
Saving to csv directly from Excel can cause errors - if you have been working in
Excel, paste cells into a Google Sheets document when your metadata is complete.
From Google Sheets, select File > Download > Comma-separated values. Save the
csv file in the same location as your image files or zip files.
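Once the csv file is on the machine that will do the upload, it may be worth checking it before starting a long batch. The following sketch uses Python's standard csv module to confirm that the mandatory headings are present and that every path in the file column points to a real file. The function name `check_metadata` and the wording of the messages are hypothetical.

```python
import csv
import os

MANDATORY = ["identifier", "file", "title", "description", "collection", "subject"]


def check_metadata(csv_path):
    """Return a list of problems found in the metadata csv (empty if none)."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        headings = reader.fieldnames or []
        for name in MANDATORY:
            # A field may be split into indexed columns such as subject[0].
            if name not in headings and f"{name}[0]" not in headings:
                problems.append(f"missing column: {name}")
        # Row 1 is the heading row, so the first item is on row 2.
        for row_number, row in enumerate(reader, start=2):
            path = row.get("file") or ""
            if not os.path.isfile(path):
                problems.append(f"row {row_number}: file not found: {path}")
    return problems
```

Calling `check_metadata("/home/pi/metadata.csv")` (a placeholder filename) and getting back an empty list would suggest the spreadsheet is ready to use.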
20. Setting up a Raspberry Pi: install and run OS imager
Download Raspberry Pi imager for Windows or Mac on your usual PC.
Insert the SD card in the adapter, plug into PC, and run Raspberry Pi imager.
21. Setting up a Raspberry Pi: erase and format SD card
First, prepare the SD card by erasing all previous data and formatting it, ready to
flash the new OS.
Under Choose OS, scroll down and select Erase.
Under Choose Storage, select the SD card.
Select Write.
22. Setting up a Raspberry Pi: flash the OS to the SD card
Under Choose OS, select Raspberry Pi OS (32 bit).
Under Choose Storage, select the SD card.
Select Write. This will flash the OS to the card - it may take several minutes.
23. Setting up a Raspberry Pi: getting connected
Eject the SD card from the PC, and remove from its adapter.
Insert the card into the back of the Raspberry Pi as shown.
24. Setting up a Raspberry Pi: getting connected
Connect the monitor with the HDMI-Micro HDMI cable.
Connect the keyboard and mouse.
Finally, connect the power cable, and switch on power.
25. Setting up a Raspberry Pi: installing the OS
The Raspberry Pi will boot (this may take several minutes, as it installs the OS).
When it’s complete, it will look like this. Work through the following set up stages.
26. Setting up a Raspberry Pi: location and language
27. Setting up a Raspberry Pi: change default password
32. Copy files to the Raspberry Pi
Connect your USB stick or external hard drive to your Raspberry Pi
Copy across all image files or zip files due to be transferred. Make a note of the
name of your .csv file.
Save all files to /home/pi
If you want to create folders to organise files, do so under /home/pi, but
remember to update the file paths in your csv file to reflect the new folders.
33. Installing and configuring the Internet Archive Python library
Open the command line (top menu bar) and enter these commands:
$ sudo pip install internetarchive
$ ia configure
Enter your Internet Archive credentials.
If you have stored images and csv in a folder below /home/pi/, use cd to navigate
to the correct location of your files.
34. Installing and configuring the Internet Archive Python library
Enter the following command, replacing [filename] with the name of your csv file.
This tells the Pi where to look for your metadata. Then the metadata tells it where
to find the files to upload, and how to describe them.
$ ia upload --spreadsheet=[filename].csv
Depending on how many files you are uploading, the programme may run for several
hours. Do not close the command line or disconnect the Raspberry Pi.
Each file will be added to your Internet Archive account as it completes. It can take
up to 24 hours for the final documents to render on the live site.
Congratulations - you have batch uploaded to the Internet Archive!
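The same batch upload could also be driven from a Python script rather than the ia command. The sketch below is an approximation of what the spreadsheet upload does, not the library's actual implementation: it reads each row, folds indexed headings such as subject[0] and subject[1] into one list, and passes the result to the `upload` function provided by the internetarchive library. The names `rows_to_uploads` and `upload_all` are hypothetical, and the script assumes `ia configure` has already stored your credentials.

```python
import csv


def rows_to_uploads(csv_path):
    """Turn each spreadsheet row into (identifier, file path, metadata)."""
    uploads = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            collected = {}
            for key, value in row.items():
                # Skip the two non-metadata columns and any empty cells.
                if key is None or key in ("identifier", "file") or not value:
                    continue
                # subject[0], subject[1] ... all belong to the field "subject".
                base = key.split("[")[0]
                collected.setdefault(base, []).append(value)
            # Single values become plain strings; repeated values stay as lists.
            metadata = {k: v[0] if len(v) == 1 else v for k, v in collected.items()}
            uploads.append((row["identifier"], row["file"], metadata))
    return uploads


def upload_all(csv_path):
    """Upload every row of the csv to the Internet Archive, one item at a time."""
    # Imported here so rows_to_uploads can be tried without the library installed.
    from internetarchive import upload

    for identifier, path, metadata in rows_to_uploads(csv_path):
        print(f"Uploading {path} as {identifier}")
        upload(identifier, files=[path], metadata=metadata)
```

Running `rows_to_uploads` on its own first is a safe way to see exactly what would be sent before calling `upload_all`.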