The document presents a model for performing heterogeneous in-memory data integration in a Hadoop-based data lake, particularly in the bioinformatics domain. It introduces an algorithm that utilizes the overlap coefficient for data integration, achieving better results than manual integration performed by specialists. The study includes experiments with eight bioinformatics datasets, demonstrating the model's efficiency, and serves as a guide for building a data lake from scratch.