Meetup#4, Apache Spark as SQL Engine

Apache Spark as SQL Engine
Data Engineering Approach
Dmitry Timofeev, Data Analyst, Wrike Inc.

Wrike is a collaborative task and project
management platform
wrike.com

What is Apache Spark?
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics.
• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data
sources including HDFS, Cassandra, HBase, and S3.
Apache Spark™ is a fast, general-purpose in-memory engine
for large-scale data processing.

Where did it come from?
Original white papers
• "Spark: Cluster Computing with Working Sets" by Matei
Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott
Shenker, Ion Stoica. University of California, Berkeley
• "Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing" by Matei
Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur
Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
Scott Shenker, Ion Stoica. University of California, Berkeley

A few words about data analysts
Or why they don't want to write code and
just want to query, query, query
• We know SQL
• We love ETL

Spark SQL
Spark SQL is Spark's module for working with structured data.
• DataFrames: seamlessly mix SQL queries with
Spark programs;
• Connect to any data source the same way: Hive,
Avro, Parquet, JSON and JDBC;
• Server mode: connect to Spark SQL with your
favorite DB client over JDBC.
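
The "same way for any data source" point can be sketched in PySpark roughly as follows. This is a minimal sketch, not code from the talk: it assumes Spark 2.0 or later (for `createOrReplaceTempView`), an existing `SparkSession` passed in as `spark`, and the helper names `jdbc_options` and `register_view` are hypothetical. Paths, hosts and credentials in the usage comments are placeholders.

```python
def jdbc_options(url, table, user, password, driver=None):
    """Build the option dict for Spark's built-in "jdbc" data source."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if driver:
        # Optional JDBC driver class name, e.g. "org.postgresql.Driver".
        opts["driver"] = driver
    return opts


def register_view(spark, view_name, fmt, path=None, **options):
    """Load a source through the generic reader and expose it to SQL.

    The same call shape covers file formats (json, parquet) and jdbc;
    only `fmt` and the options change.
    """
    reader = spark.read.format(fmt).options(**options)
    df = reader.load(path) if path else reader.load()
    df.createOrReplaceTempView(view_name)
    return df


# Hypothetical usage (placeholder path, host and credentials):
# register_view(spark, "events", "json", path="/data/events.json")
# register_view(spark, "users", "jdbc",
#               **jdbc_options("jdbc:postgresql://db-host:5432/app",
#                              "public.users", "analyst", "secret"))
# spark.sql("SELECT COUNT(*) FROM events").show()
```

Once a source is registered as a view, it can be queried with plain SQL regardless of where the data physically lives.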

Spark SQL
Distributed SQL Engine. Integration with BI tools

Spark SQL
Distributed SQL Engine and my favorite DB tool

Spark SQL
Mix SQL queries with Spark programs
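
As an illustration of mixing SQL with program code, the sketch below registers a plain Python function so that a SQL query can call it, then keeps working on the query result with the DataFrame API. It is an assumption-laden example rather than the code shown at the meetup: the `tasks` data, the column names and the `age_bucket` thresholds are invented, and `spark` is assumed to be an existing `SparkSession` (Spark 2.0 or later).

```python
def age_bucket(days_open):
    """Plain Python logic that would be clumsy to repeat in every SQL query."""
    if days_open is None:
        return "unknown"
    if days_open <= 7:
        return "fresh"
    if days_open <= 30:
        return "aging"
    return "stale"


def bucket_report(spark, rows):
    """Count tasks per age bucket.

    `rows` is a list of (task_id, days_open) tuples; both column names
    are hypothetical.
    """
    # Imported here so the pure-Python function above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark.createDataFrame(rows, ["task_id", "days_open"]) \
        .createOrReplaceTempView("tasks")
    # Make the Python function callable from SQL.
    spark.udf.register("age_bucket", age_bucket, StringType())

    # SQL side: the query an analyst would write.
    counts = spark.sql("""
        SELECT bucket, COUNT(*) AS n
        FROM (SELECT age_bucket(days_open) AS bucket FROM tasks) t
        GROUP BY bucket
    """)
    # Program side: continue on the SQL result with the DataFrame API.
    return counts.orderBy(F.col("n").desc())
```

The analyst-facing part stays ordinary SQL, while the bucketing rule lives in one testable Python function.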

Where did it come from?
Original white papers
• "Spark SQL: Relational Data Processing in Spark" by
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin
Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng,
Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei
Zaharia. Databricks Inc., MIT CSAIL, AMPLab, UC Berkeley

Conclusion
• You can easily create scalable infrastructure;
• Do you dream about cross-DB joins?
Welcome!
• Do you want to join logs with regular DBs?
Welcome!
• Your analysts are not programmers? Not a
problem!
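
A cross-DB join that also pulls in log files might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the log line format, the table and column names (`users`, `subscriptions`, `plan_name`, `account_id`), and the function names. `spark` is assumed to be an existing `SparkSession` (Spark 2.0 or later), and `users_db` / `billing_db` are option dicts for Spark's "jdbc" source pointing at two different databases.

```python
def parse_log_line(line):
    """Parse one line of a hypothetical log format: '<timestamp> <user_id> <action>'."""
    ts, user_id, action = line.strip().split(" ", 2)
    return (ts, int(user_id), action)


def events_per_plan(spark, log_path, users_db, billing_db):
    """Join raw logs with tables from two separate databases in one SQL query.

    `users_db` and `billing_db` hold jdbc options (url, dbtable, user,
    password); all table and column names below are assumptions.
    """
    # Logs: plain text files parsed into rows, then exposed to SQL.
    lines = spark.sparkContext.textFile(log_path)
    spark.createDataFrame(lines.map(parse_log_line), ["ts", "user_id", "action"]) \
        .createOrReplaceTempView("logs")

    # Two independent databases, each registered as a view.
    spark.read.format("jdbc").options(**users_db).load() \
        .createOrReplaceTempView("users")
    spark.read.format("jdbc").options(**billing_db).load() \
        .createOrReplaceTempView("subscriptions")

    # From here on it is ordinary SQL across all three sources.
    return spark.sql("""
        SELECT s.plan_name, l.action, COUNT(*) AS events
        FROM logs l
        JOIN users u ON u.id = l.user_id
        JOIN subscriptions s ON s.account_id = u.account_id
        GROUP BY s.plan_name, l.action
    """)
```

The setup code is written once by whoever maintains the infrastructure; after that, analysts only need the SQL at the end.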

Your questions?
To make our team more awesome we need:
UX Data Analyst
Billing Operations Analyst
Data Engineer
hr-spb@team.wrike.com
