The attention-based encoder-decoder model has achieved impressive results on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. Inspired by SpecAugment and BERT, this study proposes a semantic-mask-based regularization for training such end-to-end (E2E) models. Although the approach is applicable to the encoder-decoder framework with any type of neural network architecture, we study a transformer-based model for ASR, conduct experiments on the LibriSpeech 960h and TEDLIUM2 datasets, and achieve state-of-the-art performance on the test sets among E2E models.
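To make the idea concrete, here is a minimal sketch of semantic masking in NumPy: unlike SpecAugment's random time masks, whole spans of frames corresponding to words (taken from a forced alignment) are masked together. The function name, the alignment format, and the choice of filling masked frames with the utterance mean are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def semantic_mask(features, word_spans, mask_prob=0.15, rng=None):
    """Mask the feature frames of randomly selected words.

    features   : (T, F) array, e.g. log-mel filterbank frames
    word_spans : list of (start_frame, end_frame) pairs, one per word,
                 e.g. obtained from a forced alignment (assumed input)
    mask_prob  : probability of masking each word's span
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    fill = features.mean()  # illustrative choice: fill with utterance mean
    for start, end in word_spans:
        if rng.random() < mask_prob:
            masked[start:end] = fill  # mask the whole word span at once
    return masked

# toy example: 100 frames, 80 mel bins, three "words"
feats = np.random.default_rng(0).standard_normal((100, 80))
spans = [(0, 30), (30, 70), (70, 100)]
out = semantic_mask(feats, spans, mask_prob=0.5)
```

Because entire word spans are hidden, the decoder is pushed to predict the masked word from linguistic context rather than acoustic evidence, which is the BERT-like aspect of the regularization.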