We explore the feasibility of surface-related multiple elimination by a two-step separation in which primaries and multiples are separated in the latent space of a convolutional autoencoder. First, we train a convolutional autoencoder to produce orthogonal embeddings of primaries and multiples. Second, we train another network to classify the latent-space embedding of the target data into the respective wave types, and we decode the predictions back to the data domain. In addition, we propose an end-to-end workflow for generating synthetic seismic data that is realistic enough to support knowledge transfer from training on synthetic data to inference on field data. We evaluate the two-step separation approach in a synthetic setup and highlight the strengths and weaknesses of using masks in the encoder latent space for surface-related multiple elimination.
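
The two-step separation can be written compactly as follows. This is only one possible formalization: the symbols are notational assumptions introduced for illustration, not the authors' notation. Here $\mathbf{d}$ denotes the recorded data, $\mathbf{p}$ the primaries, $\mathbf{s}$ the surface-related multiples, $E$ and $D$ the encoder and decoder of the autoencoder, $g$ the second (classification) network, $K$ the latent dimension, and $\odot$ the elementwise product. The elementwise mask form is likewise assumed.

```latex
% Requires amsmath. All symbols are illustrative assumptions.
\begin{align}
  \mathbf{d} &= \mathbf{p} + \mathbf{s},
    && \text{data as a sum of primaries and multiples} \\
  \langle E(\mathbf{p}),\, E(\mathbf{s}) \rangle &\approx 0,
    && \text{step 1: orthogonal embeddings} \\
  \mathbf{z} = E(\mathbf{d}), \quad
  \mathbf{m} &= g(\mathbf{z}) \in [0,1]^{K},
    && \text{step 2: latent classification mask} \\
  \hat{\mathbf{p}} = D(\mathbf{m} \odot \mathbf{z}), \quad
  \hat{\mathbf{s}} &= D\big((\mathbf{1} - \mathbf{m}) \odot \mathbf{z}\big).
    && \text{decoding back to the data domain}
\end{align}
```

Under this reading, the closer the two embeddings are to orthogonal, the less the mask $\mathbf{m}$ has to trade off leakage of multiples into $\hat{\mathbf{p}}$ against damage to the primaries.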