Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)


Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website!
    Are you sure you want to  Yes  No
    Your message goes here

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)

  1. 1. Data Storage Tips for Optimal Spark Performance Vida Ha Spark Summit West 2015
  2. 2. Today’s Talk About Me Vida Ha - Solutions Engineer at Databricks Poor Data File Storage Choices Result in: • Exceptions that are difficult to diagnose and fix. • Slow Spark Jobs. • Inaccurate Data Analysis. 2 Goal: Understand the Best Practices for Storing and working with Data in Files with
  3. 3. Agenda • Basic Data Storage Decisions • File Sizes • Compression Formats • Data Format Best Practices • Plain Text, Structured Text, Data Interchange, Columnar • General Tips 3
  4. 4. 4 Basic Data Storage Decisions
  5. 5. Choosing a File Size • Too Small • Lots of time opening file handles. • Too Big • Files need to be splittable. • Optimize for your File System • Good Rule of Thumb: 64 MB - 1 GB 5
  6. 6. Choosing a Compression Format • The Obvious • Minimize the Compressed Size of Files. • Decoding characteristics. • Pay attention to Splittable vs. NonSplittable. • Common Compression Formats for Big Data: • gzip, Snappy, bzip2, LZO, and LZ4. • Columnar formats for Structured Data - Parquet. 6
  7. 7. 7 Data Format Best Practices
  8. 8. Plain Text • sc.textFile() splits the file into lines. • So keep your lines a reasonable size. • Or use a different method to read data in. • Keep file sizes < 1 GB if compressed with a non-splittable format. 8
  9. 9. Basic Structured Text Tiles • Use Spark transformations to ETL the data. • Optionally use Spark SQL for analyzing. • Handle the inevitable malformed data with Spark: • Use a filter transformation to drop bad lines. • Or use a map function to fix bad lines. • Includes CSV, JSON, XML. 9
  10. 10. CSV • Use the relatively new Spark-CSV package. • Spark SQL Malformed Data Tip: • Use a where clause and sanity check fields. 10
  11. 11. JSON • Ideally have one JSON object per line. • Otherwise, DIY parsing the JSON. • Prefer specifying a schema over inferSchema. • Watch out for arbitrary number of keys. • Inferring Schema will result in an executor OOM error. • Spark SQL Malformed Data Tip: • Bad inputs have a column called _corrupt_record and other columns will be null. 11
  12. 12. XML • Not an ideal Big Data Format. • Typically not one XML object per line. • So you’ll need to DIY parse. • Not a lot of built in library support for XML. • Watch out for very large compressed files. • May need to employ the General Tips to parse. 12
  13. 13. Data Interchange Formats with a Schema • Good Practice to enforce an API with backward compatibility. • Avro, Parquet, and Thrift are common ones. • Usually, good compression. • Data format itself is not corrupt. • But underlying records still be. 13
  14. 14. Avro • Use DataFile Writer to write multiple objects. • Use spark-avro package to read in. • Don’t transfer “AvroKey” across driver and workers. • AvroKey is not serializable. • Pull out fields of interest or convert to JSON. 14
  15. 15. Protocol Buffers • Need to figure out how to encode multiple in a file. • Encode in Sequence Files or other similar file format. • Prepend the number of bytes before reading the next message. • Base64 encode one per line. • Currently, no built in Spark package to support. • Opportunity to contribute to open source community. • For now, convert to JSON and read into Spark SQL. 15
  16. 16. Columnar Formats: Parquet & ORD • Great for use with Spark SQL. • Parquet is actually best practice for Spark SQL. • Makes it easy to pull only certain records at a time. • Worth time to encode if multiple passes on data. • Do not support appending one record at a time. • Not good for collecting log records. 16
  17. 17. 17 General Tips
  18. 18. Reuse Hadoop Libraries • HadoopRDD & NewHadoopRDD • Reuse Hadoop File format libraries. • Hive SerDe’s are supported in Spark SQL • Load your SerDe jar onto your Spark Cluster. • Issue a SHOW CREATE TABLE command. • Enter the create table command (with EXTERNAL) 18
  19. 19. Controlling the Output File Size • Spark generally writes one file per partition. • Use coalesce to write out less files. • Use repartition to write out more files. • This will cause a shuffle of the data. • If repartition is too expensive: • Call mapPartition to get an iterator and write out as many (or few) files as you would like. 19
  20. 20. sc.wholeTextFiles() • The whole file should fit into memory. • Good for file formats that aren’t splittable by line. • Such as XML files. • Will need to performance tune. 20
  21. 21. Use “filename” as the RDD element • filenames = sc.parallelize([“s3:/file1”, “s3:/ file2”, …]) • Allows you to function on a single file. •…call_function_on_filename) • Use to decode non-standard compression formats. • Use to split up files that aren’t separable by line. 21
  22. 22. Thank you. Any questions?