[MM2023] Ducho: A Unified Framework for the Extraction of Multimodal Features in Recommendation

Daniele Malitesta¹, Giuseppe Gassi², Claudio Pomo¹, Tommaso Di Noia¹
Politecnico di Bari, Bari (Italy)
email: ¹firstname.lastname@poliba.it, ²g.gassi@studenti.poliba.it
The 31st ACM International Conference on Multimedia
Ottawa, ON, Canada, November 1, 2023
Open Source Track

The 31st ACM International Conference on Multimedia (Ottawa, October 29 - November 03, 2023)
Outline
● Introduction and motivations
● Architecture
● Extraction pipeline
● Ducho as Docker application
● Demonstrations
● Conclusion and future work

Introduction and motivations

Recommendation systems leveraging multimodal data
Multimodal-aware recommender systems [Malitesta et al.] exploit multimodal (i.e., audio, visual, textual) content
data to augment the representation of items, thus tackling known issues such as dataset sparsity and the inexplicable
nature of users’ actions (e.g., views, clicks) on online platforms.
[Figure: the multimodal recommendation pipeline. INPUT: user u, item i, and MODALITIES m1, m2, m3, ...
(1) MULTIMODAL FEATURE EXTRACTOR φ_m(·)
(2) MULTIMODAL REPRESENTATION: (a) JOINT μ(·), (b) COORDINATE μ_m(·)
(3) MULTIMODAL FUSION: (a) EARLY FUSION γ_e(·), (b) LATE FUSION γ_l(·)
(4) INFERENCE ρ(·), producing r
Guiding questions: Which? How? When?]
[Malitesta et al.] 2023. Formalizing Multimedia Recommendation through Multimodal Deep Learning. Under review at TORS. Available online at: arXiv:2309.05273.

The multimodal recommendation pipeline
Despite being the initial stage in the multimodal recommendation pipeline, the extraction of meaningful
multimodal features is paramount to delivering high-quality recommendations [Deldjoo et al.].
[Deldjoo et al.] 2021. A Study on the Relative Importance of Convolutional Neural Networks in Visually-Aware Recommender Systems. In CVPR Workshops. Computer Vision Foundation / IEEE, 3961–3967.

Current issues in multimodal feature extraction
However, diverse multimodal extraction procedures are currently used in the literature.
This poses limitations:
• difficult interdependencies across various multimodal recommendation frameworks 👎
• no shared interfaces among popular libraries for the extraction of pre-trained deep learning features 👎

We present Ducho!
We present Ducho, our unified framework for the extraction of multimodal features in recommendation.
To this day, Ducho:
✓ integrates widely adopted deep learning libraries (i.e., TensorFlow, PyTorch, and Transformers) by establishing a shared interface 🙂
✓ extracts and processes audio, visual, and textual features 😊
✓ allows items and user-item interactions [Anelli et al.] as extraction sources 😃
✓ offers an easily configurable extraction pipeline through a YAML-based file 🤩
Modalities   Sources                  Backends
             Items   Interactions     TensorFlow   PyTorch   Transformers
Audio        ✓       ✓                -            ✓         ✓
Visual       ✓       ✓                ✓            ✓         -
Textual      ✓       ✓                -            -         ✓
[Anelli et al.] 2022. Reshaping Graph Recommendation with Edge Graph Collaborative Filtering and Customer Reviews. In DL4SR@CIKM (CEUR Workshop Proceedings, Vol. 3317). CEUR-WS.org.

The overall framework

Dataset modules
• Manages the loading and processing of the input
• A general, shared schema, with three separate implementations for Audio, Visual, and Textual datasets
• Images and audio require a folder path; text requires a TSV file
• Two sources for the modalities: items or user-item interactions
• Handles the pre-processing of data
• Saves the multimodal features in NumPy array format
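
The input conventions listed above (a folder for images and audio, a TSV file for text) can be sketched in a few lines of Python. This is only an illustration, not Ducho's actual code: the function names `list_media_files` and `read_text_rows`, the accepted file extensions, and the assumption that the TSV file has a header row are all hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical extension sets; the real framework may accept other formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
AUDIO_EXTENSIONS = {".wav", ".mp3"}


def list_media_files(folder, extensions):
    """Return the sorted files in `folder` whose suffix is in `extensions`."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


def read_text_rows(tsv_path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Keeping one loader per input convention mirrors the idea of a shared schema with modality-specific implementations; how the real modules are organized may differ.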

Extractor modules
• Builds an extraction model from a pre-trained network
• Provides three different implementations, one for each modality
• Exposes a wide range of pre-trained models for the three backends
• The user should indicate the (list of) extraction layers and the pre-trained model, following the official naming/indexing scheme of each backend
• For the textual modality, the user can indicate the task the model is pre-trained on (e.g., sentiment analysis)
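
To make the idea of user-selected extraction layers concrete, here is a toy, framework-free sketch. It runs a forward pass through an ordered list of named layers and keeps the outputs of the layers the user asked for. The `ToyModel` class, its layers, and the arithmetic they perform are hypothetical stand-ins; real backends expose their own layer names and indices.

```python
class ToyModel:
    """A stand-in for a pre-trained network: an ordered list of named layers."""

    def __init__(self, layers):
        self.layers = layers  # list of (name, callable) pairs, applied in order

    def extract(self, x, output_layers):
        """Run a forward pass and return the outputs of the requested layers."""
        known = {name for name, _ in self.layers}
        missing = [name for name in output_layers if name not in known]
        if missing:
            raise KeyError(f"unknown layers: {missing}")
        captured = {}
        for name, layer in self.layers:
            x = layer(x)
            if name in output_layers:
                captured[name] = x
        return captured


# Hypothetical three-layer "network" operating on plain Python lists.
model = ToyModel([
    ("features", lambda v: [2 * a for a in v]),
    ("classifier.0", lambda v: [a + 1 for a in v]),
    ("classifier.3", lambda v: [sum(v)]),
])
features = model.extract([1, 2, 3], ["classifier.0", "classifier.3"])
```

In PyTorch, a comparable capture of intermediate outputs is commonly done with `register_forward_hook`; that variant is not shown here.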

Runner
• Orchestrator of Ducho
• Instantiates, calls, and manages all modules
• Triggers the complete extraction pipeline
• Customized through the Configuration component
• A YAML-based file is used to override (some of) the default settings
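
Overriding only some of the default settings amounts to a recursive merge of two nested mappings. The sketch below assumes the YAML file has already been parsed into a Python dict; the `DEFAULTS` values and the `merge_config` helper are hypothetical and only illustrate the override behaviour, not Ducho's real defaults or code.

```python
import copy

# Hypothetical defaults; the real default settings are defined by the framework.
DEFAULTS = {
    "dataset_path": "./local/data",
    "gpu list": -1,
    "visual": {
        "items": {"input_path": "images", "output_path": "visual_embeddings"},
    },
}


def merge_config(defaults, overrides):
    """Return a new dict where `overrides` replaces matching keys in `defaults`.

    Nested dicts are merged key by key; any other value is replaced wholesale.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The user only specifies what differs from the defaults.
user_config = {"gpu list": 0, "visual": {"items": {"input_path": "my_images"}}}
config = merge_config(DEFAULTS, user_config)
```

Settings the user does not mention (here, `output_path`) keep their default values, and the defaults themselves are left untouched.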

Runner (configuration file)
dataset_path: ./local/data/demo1
gpu list: 0
visual:
    items:
        input_path: images
        output_path: visual_embeddings
        model: [
            { name: VGG19, output_layers: classifier.3, ...},
            { name: Xception, output_layers: avg_pool, ...},
        ]
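
One way to read a configuration like the one above is as a list of extraction jobs, one per (modality, source, model, layer) combination. The sketch below mirrors the configuration as a Python dict, leaving out the fields elided with `...`; the `iter_jobs` helper and the `MODALITIES` and `SOURCES` tuples are hypothetical names introduced for illustration only.

```python
# The configuration above as a Python dict (elided fields are left out).
config = {
    "dataset_path": "./local/data/demo1",
    "gpu list": 0,
    "visual": {
        "items": {
            "input_path": "images",
            "output_path": "visual_embeddings",
            "model": [
                {"name": "VGG19", "output_layers": "classifier.3"},
                {"name": "Xception", "output_layers": "avg_pool"},
            ],
        }
    },
}

MODALITIES = ("audio", "visual", "textual")
SOURCES = ("items", "interactions")


def iter_jobs(config):
    """Yield one (modality, source, model name, layer) tuple per extraction."""
    for modality in MODALITIES:
        for source in SOURCES:
            section = config.get(modality, {}).get(source)
            if not section:
                continue  # this modality/source pair is not configured
            for model in section["model"]:
                layers = model["output_layers"]
                if not isinstance(layers, (list, tuple)):
                    layers = [layers]  # accept a single layer or a list
                for layer in layers:
                    yield (modality, source, model["name"], layer)


jobs = list(iter_jobs(config))
```

For this configuration, `jobs` contains two entries, one for VGG19 at `classifier.3` and one for Xception at `avg_pool`.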

0. Pipeline configuration
1. Load and preprocess step
2. Build of the extraction model
3. Output save
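
The four steps can be strung together in a small end-to-end sketch. Everything here is a stand-in under stated assumptions: the function names are hypothetical, the "extractor" is a toy character-count function rather than a pre-trained network, and writing one `.npy` file per item is an assumption about the output layout. Only `numpy.save` and `numpy.load` are real library calls.

```python
import tempfile
from pathlib import Path

import numpy as np


def configure(overrides):
    """Step 0: combine (hypothetical) defaults with user overrides."""
    config = {"output_path": "embeddings", "dim": 4}
    config.update(overrides)
    return config


def load_and_preprocess(texts):
    """Step 1: toy pre-processing (strip and lowercase each text)."""
    return [text.strip().lower() for text in texts]


def build_extractor(dim):
    """Step 2: return a stand-in 'model' mapping a text to a fixed-size vector."""
    def extract(text):
        vector = np.zeros(dim, dtype=np.float32)
        for position, char in enumerate(text):
            vector[position % dim] += ord(char)
        return vector
    return extract


def save_outputs(features, output_dir):
    """Step 3: write one .npy file per item, named by item index."""
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, vector in enumerate(features):
        path = output_dir / f"{index}.npy"
        np.save(path, vector)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as root:
    config = configure({"dim": 8})
    samples = load_and_preprocess(["  Red Shoes ", "Blue Hat"])
    extractor = build_extractor(config["dim"])
    features = [extractor(sample) for sample in samples]
    saved = save_outputs(features, Path(root) / config["output_path"])
    reloaded_shape = np.load(saved[0]).shape
```

Each step only depends on the output of the previous one, which is what lets a single runner trigger the whole pipeline from one configuration.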

Ducho as Docker application

Dockerization of Ducho
To fully exploit the GPU-speedup capabilities of the selected backends, we dockerize Ducho into an out-of-the-box Docker image, which provides:
✓ CUDA 11.8
✓ cuDNN 8
✓ Ubuntu 22.04
✓ Python 3.8
✓ pip
✓ All needed Python packages

Demo 1: visual + textual item features
Task: fashion recommendation
Input data: fashion data with images (visual) and item metadata (textual)
Extraction: VGG19 and Xception (visual), Sentence-BERT pre-trained for semantic textual similarity (textual)
Output: NumPy arrays for both visual and textual features

Demo 2: audio + textual item features
Task: song recommendation
Input data: music genres dataset with songs (audio) and music genre (textual)
Extraction: Hybrid Demucs (audio) and Sentence-BERT pre-trained for semantic textual similarity (textual)
Output: audio features may require some time…

Demo 3: textual item/interaction features
Task: product recommendation
Input data: Amazon recommendation dataset with reviews (textual interactions) and product descriptions (textual items)
Extraction: Multilingual BERT-based model pre-trained on customers’ reviews for the task of sentiment analysis (textual)
Output: NumPy arrays (for the interactions, they are mapped to the user-item pair)
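
Mapping interaction features to the user-item pair can be pictured as keying each extracted vector by the pair it came from. The sketch below is purely illustrative: the `interaction_key` helper, the `_` separator, the sample reviews, and the toy "extractor" (text length and word count) are all hypothetical and do not reflect Ducho's actual naming or models.

```python
def interaction_key(user_id, item_id, separator="_"):
    """Build a hypothetical identifier for a user-item interaction."""
    return f"{user_id}{separator}{item_id}"


def toy_extract(text):
    """Stand-in for a textual extractor: character count and word count."""
    return [len(text), len(text.split())]


# Made-up reviews, one per user-item interaction.
reviews = [
    {"user": "u1", "item": "i9", "review": "Great fit"},
    {"user": "u2", "item": "i9", "review": "Too small"},
]

features_by_pair = {
    interaction_key(row["user"], row["item"]): toy_extract(row["review"])
    for row in reviews
}
```

Item-level features need only the item identifier as a key, whereas interaction-level features need both identifiers, since the same item can be reviewed by several users.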

Conclusion
● Ducho, a unified framework for the extraction of multimodal features in recommendation
● Three main modules: Dataset, Extractor, and Runner
● Multimodal pipeline highly configurable through a YAML-based file
● Dockerization of Ducho into an out-of-the-box application
● Three demonstrations to show all of Ducho’s functionalities
Future work
● Adopt all available backends for all modalities
● Implement a general extraction interface to use the same naming/indexing scheme
● Integrate the extraction of low-level features

Useful resources
Wondering why we called our framework Ducho?
Check out the Italian TV series “Boris” 🤓

Don’t forget to check out our theoretical/experimental survey
