Mixing Analytic Workloads with Greenplum and Apache Spark

© Copyright 2017 Pivotal Software, Inc. All Rights Reserved.
Kong Yew, Chan
Product Manager
kochan@pivotal.io

Agenda
■ Apache Spark for analytic workloads
■ Mixing workloads with Greenplum and Spark
■ Using the Greenplum-Spark connector

Pivotal Data Suite Use Case Applied to Predictive Maintenance
Analytical workloads are changing as businesses demand streaming and real-time processing

The Data Lake is Valuable, but not a Panacea
Many operations require the features of mature, relational MPP data platforms:
• ACID-compliant transactions
• Full ANSI SQL compliance
• Immediate consistency vs eventual consistency
• Hundreds or thousands of concurrent queries
• Queries involving complex, multi-way joins that require a sophisticated optimizer

Does Spark Replace the Data Warehouse?
Spark is an in-memory processing system that complements the data warehouse rather than replacing it
Reasons:
• In-memory processing
• Memory limitations
• Data movement

What if we could leverage the best qualities of the data warehouse and the best qualities of Spark?

Why use Apache Spark for processing data?
Features:
• Up to 100x performance gain over Hadoop MapReduce with in-memory analytical processing
• SQL for structured data processing
• Advanced analytics for machine learning, graph and streaming
Use Cases:
• Data exploration
• Interactive analytics
• Stream processing

Why use Greenplum for processing data?
Features:
● Processes analytics over the entire dataset (in memory and on disk)
● Provides full ANSI SQL for structured data processing
● Advanced analytics for machine learning (MADlib), graph, geospatial, and text
Use Cases:
● Large-scale data processing
● Advanced analytics for enterprise use cases

Mixing Analytic Workloads
Best for Greenplum
● Analytics over the entire dataset
● Processing multi-structured data
Best for Spark
● Limited data that fits Spark’s in-memory platform
● ETL processing (streaming, micro-batches)
● Data exploration

Using the Greenplum-Spark connector

Use Case: Financial Services
Use Cases:
● Analyzing financial risk
Benefits:
● Faster in-memory processing
● Expand data processing to Spark
[Diagram labels: MPP database, GPDB-Spark connector (parallel data transfer), Spark executor running financial risk algorithms]

Greenplum-Spark connector (GSC)
High-speed parallel data transfer between GPDB and Spark
● Easy to use
● Optimized for performance
● Complements the Spark ecosystem
[Diagram labels: in-memory processing (Spark), MPP database (GPDB)]

Greenplum-Spark architecture
● Uses GPDB segments to transfer data to Spark executors
● Scales dynamically (Kubernetes, YARN, Mesos)
● Supports Spark programming languages (Python, Scala, Java, R)
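To make the parallel-transfer idea concrete, the sketch below splits the value range of a partition column into contiguous slices, so that each slice could be fetched by a separate Spark task. It is plain Scala with no Spark or Greenplum dependency. `PartitionSketch`, `RangeSlice`, `sliceRange`, and the bounds in the usage note are hypothetical names and values chosen for illustration; this is not the connector's actual partitioning algorithm.

```scala
// Hypothetical illustration only: not the Greenplum-Spark connector's real logic.
object PartitionSketch {
  // A half-open range [lower, upper) of partition-column values for one task.
  final case class RangeSlice(lower: Long, upper: Long) {
    // SQL predicate that selects only the rows belonging to this slice.
    def predicate(column: String): String =
      s"$column >= $lower AND $column < $upper"
  }

  // Split the inclusive range [min, max] into at most `numSlices`
  // contiguous, non-overlapping slices of roughly equal width.
  def sliceRange(min: Long, max: Long, numSlices: Int): Seq[RangeSlice] = {
    require(numSlices > 0, "numSlices must be positive")
    require(max >= min, "max must not be smaller than min")
    val total = max - min + 1                                     // values covered
    val step = math.max(1L, (total + numSlices - 1) / numSlices)  // ceiling division
    (min to max by step).map(lo => RangeSlice(lo, math.min(lo + step, max + 1)))
  }
}
```

For example, `PartitionSketch.sliceRange(19000L, 19009L, 3)` yields three slices, and calling `predicate("airlineid")` on each gives a filter that one task could apply independently of the others.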

Easy to use
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Connection and table options passed to the Greenplum-Spark connector
val gscOptionMap = Map(
  "url" -> "jdbc:postgresql://gpmaster.domain/tutorial",
  "user" -> "user1",
  "password" -> "pivotal",
  "dbschema" -> "faa",
  "dbtable" -> "otp_c",
  "partitionColumn" -> "airlineid"
)
// Load the Greenplum table faa.otp_c as a Spark DataFrame
val gpdf = spark.read.format("greenplum")
  .options(gscOptionMap)
  .load()
// Exiting paste mode, now interpreting.
gpdf: org.apache.spark.sql.DataFrame = [flt_year: smallint, flt_quarter: smallint ... 44 more fields]
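The option map above hard-codes the password. A minimal sketch of one alternative, assuming the six keys shown on the slide are the ones to supply, is to build the map through a small helper and check it before handing it to Spark. `GscOptionsSketch`, `build`, and `missingKeys` are hypothetical names, and treating all six keys as required is an assumption made here, not something the slide states.

```scala
// Hypothetical helper: not part of the Greenplum-Spark connector.
object GscOptionsSketch {
  // Keys used in the slide's example; treated as required purely for illustration.
  val expectedKeys: Seq[String] =
    Seq("url", "user", "password", "dbschema", "dbtable", "partitionColumn")

  // Build the option map, taking the password from the caller
  // instead of embedding it in the source.
  def build(host: String, database: String, user: String, password: String,
            schema: String, table: String, partitionColumn: String): Map[String, String] =
    Map(
      "url" -> s"jdbc:postgresql://$host/$database",
      "user" -> user,
      "password" -> password,
      "dbschema" -> schema,
      "dbtable" -> table,
      "partitionColumn" -> partitionColumn
    )

  // Report which expected keys are absent from an option map.
  def missingKeys(options: Map[String, String]): Seq[String] =
    expectedKeys.filterNot(options.contains)
}
```

A caller might pass `sys.env.getOrElse("GPDB_PASSWORD", "")` as the password (the variable name is hypothetical) and then supply the resulting map to `spark.read.format("greenplum").options(...)` exactly as on the slide.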

Performance optimization (Column Projection)
scala> :paste
// Entering paste mode (ctrl-D to finish)
gpdf.select("origincityname", "flt_month", "airlineid", "carrier").show()
// Exiting paste mode, now interpreting.
+---------------+---------+---------+-------+
| origincityname|flt_month|airlineid|carrier|
+---------------+---------+---------+-------+
|    Detroit, MI|       12|    19386|     NW|
|    Houston, TX|       12|    19704|     CO|
|    Houston, TX|       12|    19704|     CO|
...
+---------------+---------+---------+-------+
only showing top 20 rows
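Column projection means that only the requested columns need to leave the database: here 4 of the table's 46 fields. As a rough illustration of the idea, the plain-Scala sketch below builds a query that names just those columns. `ProjectionSketch` and `projectedQuery` are hypothetical names, and the SQL it produces is an assumption about what a data source could send, not the connector's actual query.

```scala
// Hypothetical illustration of column projection; not the connector's real query builder.
object ProjectionSketch {
  // Build a SELECT that names only the requested columns,
  // falling back to "*" when no projection is given.
  def projectedQuery(schema: String, table: String, columns: Seq[String]): String = {
    val columnList = if (columns.isEmpty) "*" else columns.mkString(", ")
    s"SELECT $columnList FROM $schema.$table"
  }
}
```

With the columns from the slide, `ProjectionSketch.projectedQuery("faa", "otp_c", Seq("origincityname", "flt_month", "airlineid", "carrier"))` produces a query that skips the other 42 fields.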

Performance optimization (Predicate Pushdown)
scala> :paste
// Entering paste mode (ctrl-D to finish)
gpdf.select("origincityname", "flt_month", "airlineid", "carrier")
  .filter("cancelled = 1").filter("flt_month = 12")
  .orderBy("airlineid", "origincityname")
  .show()
// Exiting paste mode, now interpreting.
+---------------+---------+---------+-------+
| origincityname|flt_month|airlineid|carrier|
+---------------+---------+---------+-------+
|    Detroit, MI|       12|    19386|     NW|
|    Houston, TX|       12|    19704|     CO|
...
+---------------+---------+---------+-------+
only showing top 20 rows
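Predicate pushdown means the filters are evaluated inside the database, so rows that fail them never travel to Spark. The sketch below shows the general mechanism in plain Scala: a small set of filter types is translated into a SQL WHERE clause. The `Filter`, `EqualTo`, and `GreaterThan` types are local, hypothetical definitions loosely modeled on Spark's data source filter classes, and the translation is an assumption for illustration rather than the connector's implementation.

```scala
// Hypothetical illustration of predicate pushdown; not the connector's real code.
object PushdownSketch {
  // Minimal filter model covering only integer comparisons.
  sealed trait Filter
  final case class EqualTo(column: String, value: Int) extends Filter
  final case class GreaterThan(column: String, value: Int) extends Filter

  // Translate one filter into a SQL condition.
  def toSql(filter: Filter): String = filter match {
    case EqualTo(column, value)     => s"$column = $value"
    case GreaterThan(column, value) => s"$column > $value"
  }

  // Combine all filters with AND; an empty list yields no WHERE clause.
  def whereClause(filters: Seq[Filter]): String =
    if (filters.isEmpty) "" else filters.map(toSql).mkString(" WHERE ", " AND ", "")
}
```

The two filters from the slide, `EqualTo("cancelled", 1)` and `EqualTo("flt_month", 12)`, translate to ` WHERE cancelled = 1 AND flt_month = 12`, which the database can apply before any rows are transferred.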

Benefits of the Greenplum Spark connector
● Faster data transfer between GPDB and Spark (75x faster than the JDBC connector)
● Easy to use
● Performance (column projection, predicate pushdown)

Key Takeaways
● Run mixed workloads across both Greenplum and Spark
● Leverage both the Greenplum and Spark ecosystems

Start Your Journey Today!
● Pivotal Greenplum and Spark Connector: pivotal.io/pivotal-greenplum and greenplum-spark.docs.pivotal.io
● Pivotal Data Science: pivotal.io/data-science
● Apache MADlib: madlib.apache.org
● Greenplum Database Channel
