Ema Orhian
@emaorhian
Jaws - Data Warehouse with Spark SQL
About me
• Big Data analytics / Machine Learning
• 4+ years of experience with the Hadoop ecosystem
• 2 years of experience with Spark
http://bigdataresearch.io/
• Co-founder of Big Data Research Group
• Provides open source solutions around Big Data analytics
http://atigeo.com/
Agenda
• jaws-spark-sql-rest (Jaws) intro
• Main features
• Architecture
• Scaling
• Resource manager
• Working with Tachyon
• Working with Parquet files
• Configuring the Spark SQL context
• Demo
• Shared Spark SQL context
• Concurrent query runs
• Query history
• Paged results
• Query editor
Jaws
• Highly scalable and resilient data warehouse explorer
• RESTful alternative to Spark SQL JDBC, and more …
• Supports Spark 0.9.1/Shark through Spark 1.5
• Supports Hive/MapReduce
https://github.com/atigeo/jaws-spark-sql-rest
Main features
• Submit queries concurrently and asynchronously
• Provides persisted logs, query history, and results with paging
• Pluggable persistence layer (Cassandra/HDFS)
• Supports load balancing with query cancellation
• Provides a metadata browser
• In-memory Parquet warehouse with Tachyon
• Configuration file to fine-tune the Spark context
• Pluggable UI
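The first two features (asynchronous, concurrent submission and paged results) can be illustrated with a small in-process sketch. This is not the Jaws API: the `QueryService` class, its method names, and the status strings are hypothetical, and a thread pool stands in for the Spark SQL context.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class QueryService:
    """Toy stand-in for an asynchronous query service (hypothetical, not Jaws)."""

    def __init__(self, run_query, max_workers=4):
        self._run_query = run_query      # callable: query text -> list of rows
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}               # query id -> Future

    def submit(self, query):
        """Start the query in the background and return its id immediately."""
        query_id = uuid.uuid4().hex
        self._futures[query_id] = self._pool.submit(self._run_query, query)
        return query_id

    def status(self, query_id):
        """Report progress without blocking the caller."""
        future = self._futures[query_id]
        if not future.done():
            return "IN_PROGRESS"
        return "FAILED" if future.exception() else "DONE"

    def results(self, query_id, offset=0, limit=100):
        """Return one page of rows; blocks until the query has finished."""
        rows = self._futures[query_id].result()
        return rows[offset:offset + limit]


if __name__ == "__main__":
    # A fake executor that returns five rows per query.
    service = QueryService(lambda q: [f"{q}-row{i}" for i in range(5)])
    first, second = service.submit("a"), service.submit("b")  # run concurrently
    print(service.results(first, offset=2, limit=2))
    print(service.status(second))
```

The caller gets a query id back right away and fetches results page by page later, which is the interaction pattern a REST front end needs.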
Jaws architecture
Scaling
Resource manager
• Standalone mode
• Mesos
‣ Fine-grained mode
‣ Coarse-grained mode
• YARN
Canceling a query
Results persistence
• Queries with a limited number of results:
‣ Cassandra
‣ HDFS
• Queries with an unlimited number of results:
‣ HDFS
‣ Tachyon
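The rule on this slide can be written down as a small lookup. The sketch below only encodes which backends the slide lists for each kind of query; the `Store` enum and `choose_result_store` function are hypothetical names, not part of Jaws.

```python
from enum import Enum


class Store(Enum):
    CASSANDRA = "cassandra"
    HDFS = "hdfs"
    TACHYON = "tachyon"


# Backends listed on the slide for each kind of query.
ALLOWED_STORES = {
    True: (Store.CASSANDRA, Store.HDFS),   # limited number of results
    False: (Store.HDFS, Store.TACHYON),    # unlimited number of results
}


def choose_result_store(limited: bool, preferred: Store) -> Store:
    """Return `preferred` if it is valid for this kind of query, else raise."""
    allowed = ALLOWED_STORES[limited]
    if preferred not in allowed:
        names = ", ".join(store.value for store in allowed)
        raise ValueError(
            f"{preferred.value} cannot hold these results; use one of: {names}"
        )
    return preferred
```

For example, `choose_result_store(False, Store.CASSANDRA)` raises a `ValueError`, since the slide lists only HDFS and Tachyon for unlimited results.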
Working with Tachyon
• Persists unlimited results in Tachyon
• Registers tables over Parquet files from Tachyon
Tachyon benefits:
★ In-memory storage system
★ Shares data between applications at memory speed
Working with Parquet files
• Registers tables on top of Parquet files
• Files stored on HDFS or Tachyon
• Table metadata stored in Cassandra (feature for Spark versions before 1.3)
Parquet
★ Columnar format
★ Nested data structures
★ Supports schema evolution
★ Efficient compression
Configuring Jaws
• Cassandra
• HDFS
• Spray
• Application
• Spark
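sparkConfiguration {
  # Pick one master URL: "spark://devbox.local:7077" (standalone),
  # "mesos://devbox.local:5050" (Mesos) or "yarn-client" (YARN)
  spark-master="spark://devbox.local:7077"
  # Mesos only: true for coarse-grained mode, false for fine-grained mode
  spark-mesos-coarse=false
  spark-cores-max=100
  spark-executor-instances=10
}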
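The dashed keys in this block line up with standard Spark properties (`spark.master`, `spark.mesos.coarse`, `spark.cores.max`, `spark.executor.instances`). The sketch below shows one way such keys could be translated. Whether Jaws does exactly this is an assumption, and `to_spark_properties` is a hypothetical helper.

```python
def to_spark_properties(jaws_conf: dict) -> dict:
    """Translate dashed 'spark-...' keys into dotted Spark property names.

    Assumption: each dash maps to a dot; keys without the 'spark-' prefix
    are ignored. Values are rendered as strings, as Spark properties are.
    """
    props = {}
    for key, value in jaws_conf.items():
        if not key.startswith("spark-"):
            continue
        # Spark expects lowercase "true"/"false" for boolean properties.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        props[key.replace("-", ".")] = text
    return props


if __name__ == "__main__":
    conf = {
        "spark-master": "spark://devbox.local:7077",
        "spark-mesos-coarse": False,
        "spark-cores-max": 100,
        "spark-executor-instances": 10,
    }
    for name, value in to_spark_properties(conf).items():
        print(f"{name}={value}")
```

In Spark itself, `spark.cores.max` applies to standalone and Mesos coarse-grained deployments, while `spark.executor.instances` applies to YARN, so only the settings that match the chosen master take effect.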
Demo