Zeppelin and Spark SQL Demystified
1. Zeppelin and Spark SQL
AWS BIG DATA demystified
Omid Vahdaty, Big Data Ninja
2. Agenda
● What is Zeppelin?
● What is Spark SQL?
● Motivation?
● Features?
● Performance?
● Demo?
3. Zeppelin
Apache Zeppelin is a completely open, web-based notebook that enables interactive data analytics. It is a multi-purpose notebook that brings data ingestion, data exploration, visualization, sharing, and collaboration features.
4. Zeppelin out of the box features
● Web Based GUI.
● Supported languages
○ SparkSQL
○ PySpark
○ Scala
○ SparkR
○ JDBC (Redshift, Athena, Presto, MySQL, ...)
○ Bash
● Visualization
● Users, Sharing and Collaboration
● Advanced Security features
● Built in AWS S3 support
● Orchestration
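One way these features combine in practice: a single Zeppelin note can mix interpreters per paragraph. A minimal sketch, assuming the default interpreter bindings; the bucket and table names below are placeholders, not from the deck:

```
%sh
# inspect raw input files on S3 (bucket name is a placeholder)
aws s3 ls s3://my-bucket/raw/

%sql
-- aggregate with Spark SQL; Zeppelin can chart the result directly
SELECT country, COUNT(*) AS users
FROM analytics.users
GROUP BY country

%pyspark
# drop into PySpark in the same note when SQL is not enough
df = spark.table("analytics.users")
print(df.count())
```

Each `%interpreter` directive starts a new paragraph, which is what makes the "system commands + SQL + visualization" combination possible in one document.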
5. What is Spark SQL?
● Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
● HiveQL support
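A minimal sketch of what that "extra information" buys, as a Zeppelin PySpark paragraph; the S3 path and column names are illustrative:

```
%pyspark
# Reading Parquet gives Spark SQL the schema up front
df = spark.read.parquet("s3a://my-bucket/events/")

# Because the structure is known, only the 'country' column is read
# and the filter can be pushed down to the Parquet scan --
# optimizations the RDD API cannot apply to opaque functions
print(df.filter(df.country == "US").select("country").count())
```

The same query over an RDD of raw records would scan every column of every row.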
6. Why Spark SQL?
● Simple
● Scalable
● Performance - faster than Hive
● External tables on S3
● Cost Reduction
● Decrease the GAP between Data Science and Data Engineering: HiveQL for ALL
● Gets us one step closer to using SparkR / PySpark / Scala
● JDBC connections enabled via the Thrift server.
● Concurrency via the YARN scheduler :)
● Joins run better here than in Hive. [still not Redshift-level]
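The "external tables on S3" point can be sketched as HiveQL like the following; the database, schema, and bucket are hypothetical, and dropping the table leaves the S3 files untouched:

```
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
  user_id BIGINT,
  country STRING
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/events/'
```

This is a large part of the cost-reduction story: the data stays in S3 and the EMR cluster can be sized (or terminated) independently of it.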
7. Why Not Spark SQL?
● Buggy
● Not as fast as Scala
● No seamless code ↔ SQL interoperability
● Known issues:
○ Performance over S3 → bad
○ INSERT OVERWRITE → not overwriting … bug?
○ Chunk size control → bug?
○ Dynamic partitions … non-trivial
○ Beeline client/server version mismatch (CLI)
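On the "dynamic partitions … non-trivial" point: an INSERT that derives partition values from the data typically fails until the Hive settings below are relaxed. A sketch with placeholder table names:

```
%sql
-- both settings are required before a dynamic-partition insert;
-- the defaults are stricter and reject the statement
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- 'dt' is not given a literal value, so each row's dt
-- decides which partition it lands in
INSERT OVERWRITE TABLE analytics.events_by_day
PARTITION (dt)
SELECT user_id, country, dt
FROM analytics.events_staging;
```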
8. Why Spark SQL + Zeppelin?
● Sexy look and feel of a SQL web client
● Back up your SQL easily and automatically via S3
● Share your work
● Orchestration & Scheduler for your nightly job
● Combine system commands + sql + visualization.
● Advanced Security features.
● Combine all the DBs you need in one place, including data transfer.
● Get one step closer to Spark and Scala
● Visualize your data easily.
9. Performance of Spark SQL
● EMR is already pre-configured in terms of Spark settings:
○ spark.executor.instances (--num-executors)
○ spark.executor.cores (--executor-cores)
○ spark.executor.memory (--executor-memory)
● ~10x faster than Hive in select aggregations
● ~5x faster than Hive when working on top of S3
● The performance penalty is greatest on:
○ INSERT OVERWRITE
○ Writes to S3
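The three EMR settings above map to the spark-submit flags shown in parentheses. Overriding them by hand might look like this; the values and script name are illustrative, not a recommendation:

```
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_job.py
```

On EMR it is usually better to leave these at the cluster defaults unless profiling says otherwise.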
11. Tuning Spark SQL
For writes:
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
For reads:
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
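One way to apply all of the settings above in one place is a spark-defaults.conf fragment; this is a sketch, using the `spark.hadoop.` prefix to forward the two Hadoop-level keys to the Hadoop configuration:

```
# spark-defaults.conf -- the tuning settings above, consolidated
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
spark.hadoop.parquet.enable.summary-metadata                  false
spark.sql.parquet.mergeSchema                                 false
spark.sql.parquet.filterPushdown                              true
spark.sql.hive.metastorePartitionPruning                      true
```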
12. Performance Testing -- Data Transformation
Read/write from AWS S3         Hive     Spark SQL
Aggregation query              10 min   1 min
Text gzip → Parquet            10 min   ~2 min
Same as above with s3DistCp    10 min   ~4 min
Text gzip → Parquet gzip       10 min   ~18 min
Text Parquet → Parquet-gzip    -        ~2 min
Parquet-gzip → Parquet-gzip    -        ~2 min
● Observations
○ Penalty on S3 write
○ No penalty on S3 read, even if uncompressed
○ Compression is not always good...
13. Takeaway messages on the performance challenges
● Chunk size has no impact on performance, but helps with parallelism.
● Most Spark configs have minor impact.
● Use s3a.
● Use compression carefully:
○ [in the CREATE TABLE definition]
○ Choose the compression algorithm carefully.
● S3DistCp:
○ Is slower than a direct write to S3 with compression.
○ Makes you want to kill yourself when you work with dynamic partitions.
● Bottom-line takeaways:
○ Don't compress when transforming gzip text files → Parquet.
○ Compress when transforming uncompressed Parquet → gzip Parquet.
○ No configuration needed beyond s3a and compression.
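The "compression in the CREATE TABLE definition" takeaway might look like the following HiveQL sketch; the table names are placeholders, and `parquet.compression` is the Hive table property that selects the Parquet codec:

```
%sql
-- compressed Parquet output, declared at table-creation time
-- rather than via session-level settings
CREATE TABLE analytics.events_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'GZIP')
AS
SELECT * FROM analytics.events_raw;
```

Per the results on slide 12, this is the form to use for the uncompressed-Parquet → gzip-Parquet case, and to avoid for gzip-text → Parquet.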