We've updated our privacy policy. Click here to review the details. Tap here to review the details.
Activate your 30 day free trial to unlock unlimited reading.
Activate your 30 day free trial to continue reading.
Download to read offline
In this talk, we will discuss several advantages of the Spark RDD API for developing custom applications when compared to pure SQL-like interfaces such as Hive. In particular, we will describe how to control data distribution, avoid data skew, and implement application specific optimizations in order to build performant and reliable data pipelines. In order to illustrate these ideas, we will share our experiences redesigning a large-scale, complex (100+ stage) language model training pipeline for Spark that was originally built in Hive. The final Spark based pipeline is modular, readable, and more maintainable when compared to previous set of HQL queries. In addition to the qualitative improvements, we also observed a significant reduction in both resource usage and data landing time. Finally, we will also describe Spark optimizations that we implemented for this workload that can be applied toward batch workloads in general.
In this talk, we will discuss several advantages of the Spark RDD API for developing custom applications when compared to pure SQL-like interfaces such as Hive. In particular, we will describe how to control data distribution, avoid data skew, and implement application specific optimizations in order to build performant and reliable data pipelines. In order to illustrate these ideas, we will share our experiences redesigning a large-scale, complex (100+ stage) language model training pipeline for Spark that was originally built in Hive. The final Spark based pipeline is modular, readable, and more maintainable when compared to previous set of HQL queries. In addition to the qualitative improvements, we also observed a significant reduction in both resource usage and data landing time. Finally, we will also describe Spark optimizations that we implemented for this workload that can be applied toward batch workloads in general.
You just clipped your first slide!
Clipping is a handy way to collect important slides you want to go back to later. Now customize the name of a clipboard to store your clips.The SlideShare family just got bigger. Enjoy access to millions of ebooks, audiobooks, magazines, and more from Scribd.
Cancel anytime.Unlimited Reading
Learn faster and smarter from top experts
Unlimited Downloading
Download to take your learnings offline and on the go
You also get free access to Scribd!
Instant access to millions of ebooks, audiobooks, magazines, podcasts and more.
Read and listen offline with any device.
Free access to premium services like Tuneln, Mubi and more.
We’ve updated our privacy policy so that we are compliant with changing global privacy regulations and to provide you with insight into the limited ways in which we use your data.
You can read the details below. By accepting, you agree to the updated privacy policy.
Thank you!