Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engineering Efforts by 70%: Spark Summit East talk by Joel Cumming

Moving at the speed of a startup often means rapid iterative development, which can lead to a patchwork of systems and processes. In the early days at Kik (one of the most popular chat apps among U.S. teens), the data team was able to move extremely quickly but often at the expense of scalable data engineering. In this session, Kik’s head of data will share the eight things they did to save time and money. The team took their data stack from a complex combination of systems and processes to a scalable, simple, and robust platform leveraging Apache Spark and Databricks to make data super easy for everyone in the company to use.

Published in: Data & Analytics
  1. Scaling Through Simplicity: How a 300 million user chat app reduced data engineering efforts by 70%. Joel Cumming, Kik Interactive
  2. At Kik, we believe that everyone has the right to be curious.
  3. Data should be available to everyone and should be super easy to use.
  4. We have dashboards to glance at, reports to analyze, and a data lake for exploration.
  5. However, Kik is a startup and we have to move very quickly.
  6. Moving quickly often comes at the expense of scalable data engineering.
  7. How can we compete with Facebook and Google (and their data teams) with a tiny team and very little time to master new tools?
  8. Data v1 @ Kik
  9. Data v1 @ Kik: Data Lake & Transformations; Exploration & Analysis; KPIs
  10. We decided to make 8 changes
  11. Change 1: Streamline Data Collection via Kinesis Firehose (Old vs. New diagram; diagram label: "script")
  12. Change 2: Standardize Transformations with Spark SQL (Old vs. New diagram)
  13. Change 3: Build a Data Lake (Caspian) in S3 (Old vs. New diagram)
  14. Change 4: Move from EMR to Managed Spark (Old vs. New diagram)
  15. Change 5: Collaborate via Notebooks (Old vs. New diagram)
  16. Change 6: Get Serious About Committing Code (Old vs. New diagram)
  17. Change 7: Move to Airflow for Orchestration Flexibility (Old vs. New diagram)
  18. Change 8: Standardize Reporting on re:dash (Old vs. New diagram)
  19. Data v2 @ Kik
  20. Recall: Data v1 @ Kik: Data Lake & Transformations; Exploration & Analysis; KPIs
  21. Data v2 @ Kik: Scaling through Simplicity. Data Lake & Transformations; Exploration & Analysis; KPIs; SQL
  22. New data is available within an hour in a query-optimized format. Transformations can be built and scheduled in minutes. Reports can be developed just as quickly.
  23. We estimate we save about 70% of our prior effort. (Chart: % effort savings, based on hours invested in related activities, v1 vs. v2, for each change: Data Collection, Spark SQL, Data Lake, Managed Spark, Notebooks, Committing Code, Better Orchestration, Standardize Reporting; axis from 0 to 20.)
  24. What’s Next?
  25. (1) Spark as a DW? (2) Structured Streaming (3) Data Lake Cataloging
  26. Thank You. joel@kik.com