In this talk we will dive deep into data pre-processing, the data preparation part of a Data Scientist's work. Why is data pre-processing such an important topic for aspiring Data Scientists and Machine Learning Engineers to pay attention to? How do you process terabytes of static and moving (i.e. streaming) schemaless data? How do you ensure horizontal scalability up to petabytes when you expect such growth? We'll share several insights into how Apache Spark, a fast and general engine for large-scale data processing, and Apache Kafka helped us deal with 80% of our Data Scientists' work. Why do you need such high-caliber tools as Spark or Kafka, when is it viable to use them, and how can you avoid them? What are the pitfalls of distributed processing with Spark and Kafka? How can Google Cloud Platform help and cut costs by up to 90%? We'll share what we've learned along the (hard) way.