SlideShare a Scribd company logo
1 of 12
Jack Gudenkauf
Senior Architect
HPE Big Data Professional Services
Structured Streaming
for Columnar Data
Warehouses
Our solution is a
Parallel Streaming Transformation Loader
Application
Agenda
 Benefits
 Features
 Architecture
 A self-service “ETL”
Sources, Transformations, Sinks
PSTL Benefits
Analysts and Data Scientists spend up to 80% of their time cleaning and reshaping data.
With PSTL, they will spend 20%
Features
Of our Autonomous Spark Solution
Highly Performant and Highly
Scalable
• Distributed systems
• Scale out clusters
Operational Robustness
• Observability through metrics
and dashboards
• Strong durability guarantees
• Exactly-once End-to-end
Self-Serve
• No developer code needed for
o Streaming data ingestion
o Processing semi-structured
data formats
o Filtering data
o Data Transformations or
loading
 PSTL provides all the necessary glue for complex streaming
solutions that would otherwise take many person months of
architecture/design, development, testing, and operational
infrastructure
 PSTL is a highly customizable Application framework that
enables Developers to create common code to dynamically add
functionality in SQL over streaming data sources
 PSTL removes DevOps complexity of; high availability,
scalability, deployment, autonomous job management,
monitoring, and alerting
 PSTL is a no-code needed Spark Structured Streaming App with
support for Change Data Capture / Slowly Changing Dimensions
Vertica Cluster
MPP Relational
Data Warehouse
Native &
External
Tables
Confluent™ REST Proxy
JSON/AVRO Messages
HDFS/Hive/
ORC/Parquet
Structured
Non-relational
HDFS
/RAW/Topic/YYYY/
MM/DD/HH
Semi-
Structured
Data
OpenTSDB
HBase
Raw Data Topics in JSON, AVRO, Protobuf, Delimited, Byte[]
Processed/Transformed Data Topics
Web,
Mobile,
IoT
DBMS
Applications
Parallel Streaming Transformation Loader
Hadoop™ Spark™ Cluster
Custom
Extensions
PSTL
CDC/SDC
Kafka™
Cluster
Centralized Data Hub
Libhdfs++
Sinks
Vertica, Hive,
OpenTSDB, Kafka, …
job {
structured-streaming {
sources {
logs {
format = kafka
options {
kafka.bootstrap.servers = "kafka:9092"
subscribe = kafkatopic
} } }
transforms = """
CREATE TEMPORARY VIEW transform01 AS
SELECT topic, partition, offset, timestamp,
split(cast(value as string), '|') values
FROM logs;
CREATE TEMPORARY VIEW transform02 AS
SELECT vhash(values[0] , values[1]) uniqueKey,
values[0] col0, values[1] col1
FROM transform01; """
sinks {
store_vertica {
format = vertica
dependsOn = transform02
options {
vertica.url = "jdbc:vertica://vertica:5433/db"
vertica.table = "public.logs"
vertica.property.user = dbadmin
vertica.property.password = changeMe!
kafka.topic = vertica-logs
kafka.property.bootstrap.servers = "kafka:9092"
} } } } }
We also support static sources (e.g., hdfs) refreshed on time intervals
Additional sinks
Kafka, S3, hdfs (ORC, Parquet), OpenTSDB/Hbase, console, …
Additional Transforms (Catalyst Expressions)
fprotobuf(), favro(), fconfluent(), fJSON()
tavro(), try(), vhash(vertica hash), hive udf(), …
bin/pstl-jobs --initial-contacts 127.0.0.1:2552
--start --job-id job.conf
Each job is a simple config file of Sources, Transforms & Sinks
Transform(s)
Source(s)
Sink2Sink1
Job DAGjob {
structured-streaming {
sources {
logs {
format = kafka
… } } }
transforms = """
…
CREATE TEMPORARY VIEW transform04 AS
SELECT topic, partition, offset, col0, col1,
try(ThrowOnOdd(offset)) try_result
FROM transform03
…
CREATE TEMPORARY VIEW transform06 AS
SELECT
topic, partition, offset, col0, col1
, uniqueKey
, try_result.value, try_result.isSuccess
, try_result.isFailure, try_result.failure.message
, try_result.failure.stackTrace
FROM transform04
WHERE try_result.isFailure = true
"""
sink1, sink2, …} } } } }
PSTL Transform library
Try/catch() Catalyst Expression
For data validation per row
using SQL set based logic !
job {
structured-streaming {
sources {
logs { format = kafka … } } }
transforms = """
CREATE TEMPORARY VIEW transform01 AS
SELECT
topic, partition, offset, kafkaMsgCreatedAt
, structCol.*, posexplode(anArray)
FROM (
SELECT
topic, partition, offset, timestamp as kafkaMsgCreatedAt,
favro('{"type":"record","name":"testAllAvroData","namespace":"com.mynamespace","fields":[{"name":"c1_nul
l","type":["null","string"]},…
,{"name":"anArray","type":{"type":"array","items":{"type":"record","name":"anArray","fields":[{"name":"a_name
","type":"string”…’
, value) as structCol
FROM logs)
"””
Thank You
Stop by the Vertica Booth #403
Jack Gudenkauf
Senior Architect
HPE Big Data Professional Services
https://twitter.com/_JG
https://github.com/jackghm
Backup
Spark DataFrame to/from Avro
// use case: Using PSTL expressions in a Spark Kafka Producer App
// serialize the DataFrame to an Avro encoded byte array
FunctionRegistryOps(spark).register(AvroSerializeFunction.registration)
val df_Avro = spark.sql("select tavro(*) as value from(select * from df)")
// debug
df_Avro.show(5, truncate = false)
df_Avro.createOrReplaceTempView("avrobytes")
FunctionRegistryOps(spark).register(AvroDeserializeFunction.registration)
val df1 = spark.sql(s"select favro('$avroSchema', value) as structCol from
avrobytes")

More Related Content

What's hot

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsDatabricks
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentDataWorks Summit/Hadoop Summit
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingGerger
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraScyllaDB
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internalsSigmoid
 
Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra Vipin Mishra
 
Examining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail FilesExamining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail FilesBobby Curtis
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDatabricks
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Spark Summit
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark StreamingKnoldus Inc.
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Bobby Curtis
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 

What's hot (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Supporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined FunctionsSupporting Over a Thousand Custom Hive User Defined Functions
Supporting Over a Thousand Custom Hive User Defined Functions
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Apache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster ComputingApache Spark, the Next Generation Cluster Computing
Apache Spark, the Next Generation Cluster Computing
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache Cassandra
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
 
Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra Oracle Goldengate training by Vipin Mishra
Oracle Goldengate training by Vipin Mishra
 
Examining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail FilesExamining Oracle GoldenGate Trail Files
Examining Oracle GoldenGate Trail Files
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 

Similar to PSTL Spark Summit West 2017

Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufStructured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufDatabricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014Claudiu Barbura
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonClaudiu Barbura
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2Bill Liu
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...StreamNative
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC
 

Similar to PSTL Spark Summit West 2017 (20)

Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufStructured Streaming for Columnar Data Warehouses with Jack Gudenkauf
Structured Streaming for Columnar Data Warehouses with Jack Gudenkauf
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Chti jug - 2018-06-26
Chti jug - 2018-06-26Chti jug - 2018-06-26
Chti jug - 2018-06-26
 
Jug - ecosystem
Jug -  ecosystemJug -  ecosystem
Jug - ecosystem
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 

Recently uploaded (20)

꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 

PSTL Spark Summit West 2017

  • 1. Jack Gudenkauf Senior Architect HPE Big Data Professional Services Structured Streaming for Columnar Data Warehouses
  • 2. Our solution is a Parallel Streaming Transformation Loader Application Agenda  Benefits  Features  Architecture  A self-service “ETL” Sources, Transformations, Sinks
  • 3. PSTL Benefits Analysts and Data Scientists spend up to 80% of their time cleaning and reshaping data. With PSTL, they will spend 20%
  • 4. Features Of our Autonomous Spark Solution Highly Performant and Highly Scalable • Distributed systems • Scale out clusters Operational Robustness • Observability through metrics and dashboards • Strong durability guarantees • Exactly-once End-to-end Self-Serve • No developer code needed for o Streaming data ingestion o Processing semi-structured data formats o Filtering data o Data Transformations or loading  PSTL provides all the necessary glue for complex streaming solutions that would otherwise take many person months of architecture/design, development, testing, and operational infrastructure  PSTL is a highly customizable Application framework that enables Developers to create common code to dynamically add functionality in SQL over streaming data sources  PSTL removes DevOps complexity of; high availability, scalability, deployment, autonomous job management, monitoring, and alerting  PSTL is a no-code needed Spark Structured Streaming App with support for Change Data Capture / Slowly Changing Dimensions
  • 5. Vertica Cluster MPP Relational Data Warehouse Native & External Tables Confluent™ REST Proxy JSON/AVRO Messages HDFS/Hive/ ORC/Parquet Structured Non-relational HDFS /RAW/Topic/YYYY/ MM/DD/HH Semi- Structured Data OpenTSDB HBase Raw Data Topics in JSON, AVRO, Protobuf, Delimited, Byte[] Processed/Transformed Data Topics Web, Mobile, IoT DBMS Applications Parallel Streaming Transformation Loader Hadoop™ Spark™ Cluster Custom Extensions PSTL CDC/SDC Kafka™ Cluster Centralized Data Hub Libhdfs++ Sinks Vertica, Hive, OpenTSDB, Kafka, …
  • 6. job { structured-streaming { sources { logs { format = kafka options { kafka.bootstrap.servers = "kafka:9092" subscribe = kafkatopic } } } transforms = """ CREATE TEMPORARY VIEW transform01 AS SELECT topic, partition, offset, timestamp, split(cast(value as string), '|') values FROM logs; CREATE TEMPORARY VIEW transform02 AS SELECT vhash(values[0] , values[1]) uniqueKey, values[0] col0, values[1] col1 FROM transform01; """ sinks { store_vertica { format = vertica dependsOn = transform02 options { vertica.url = "jdbc:vertica://vertica:5433/db" vertica.table = "public.logs" vertica.property.user = dbadmin vertica.property.password = changeMe! kafka.topic = vertica-logs kafka.property.bootstrap.servers = "kafka:9092" } } } } } We also support static sources (e.g., hdfs) refreshed on time intervals Additional sinks Kafka, S3, hdfs (ORC, Parquet), OpenTSDB/Hbase, console, … Additional Transforms (Catalyst Expressions) fprotobuf(), favro(), fconfluent(), fJSON() tavro(), try(), vhash(vertica hash), hive udf(), … bin/pstl-jobs --initial-contacts 127.0.0.1:2552 --start --job-id job.conf Each job is a simple config file of Sources, Transforms & Sinks
  • 7. Transform(s) Source(s) Sink2Sink1 Job DAGjob { structured-streaming { sources { logs { format = kafka … } } } transforms = """ … CREATE TEMPORARY VIEW transform04 AS SELECT topic, partition, offset, col0, col1, try(ThrowOnOdd(offset)) try_result FROM transform03 … CREATE TEMPORARY VIEW transform06 AS SELECT topic, partition, offset, col0, col1 , uniqueKey , try_result.value, try_result.isSuccess , try_result.isFailure, try_result.failure.message , try_result.failure.stackTrace FROM transform04 WHERE try_result.isFailure = true """ sink1, sink2, …} } } } } PSTL Transform library Try/catch() Catalyst Expression For data validation per row using SQL set based logic !
  • 8. job { structured-streaming { sources { logs { format = kafka … } } } transforms = """ CREATE TEMPORARY VIEW transform01 AS SELECT topic, partition, offset, kafkaMsgCreatedAt , structCol.*, posexplode(anArray) FROM ( SELECT topic, partition, offset, timestamp as kafkaMsgCreatedAt, favro('{"type":"record","name":"testAllAvroData","namespace":"com.mynamespace","fields":[{"name":"c1_nul l","type":["null","string"]},… ,{"name":"anArray","type":{"type":"array","items":{"type":"record","name":"anArray","fields":[{"name":"a_name ","type":"string”…’ , value) as structCol FROM logs) "””
  • 9. Thank You Stop by the Vertica Booth #403 Jack Gudenkauf Senior Architect HPE Big Data Professional Services https://twitter.com/_JG https://github.com/jackghm
  • 11.
  • 12. Spark DataFrame to/from Avro // use case: Using PSTL expressions in a Spark Kafka Producer App // serialize the DataFrame to an Avro encoded byte array FunctionRegistryOps(spark).register(AvroSerializeFunction.registration) val df_Avro = spark.sql("select tavro(*) as value from(select * from df)") // debug df_Avro.show(5, truncate = false) df_Avro.createOrReplaceTempView("avrobytes") FunctionRegistryOps(spark).register(AvroDeserializeFunction.registration) val df1 = spark.sql(s"select favro('$avroSchema', value) as structCol from avrobytes")

Editor's Notes

  1. We will discuss our Spark SS application our team has been customizing for very large customers
  2. Our PSTL Application is a complete end-to-end distributed Big Data pipeline
  3. Analysts can easily create *STREAMING* data ingestion pipelines for Columnar Stores Unified – not a a bunch of separate infrastructure code
  4. You could author your own solution on Spark but.. It would take a very long time Hit the ground running with our PSTL implementation A complete solution end-to-end solution
  5. Sources – We also did a POC using Apache Ignite vs. Spark 2.x for stream mapping, joins, etc.. Note that each sink is processed concurrenlty so there is no serial coordination amoung sinks A sink failure will cause a circuit breaker to begin exponential backoff until the sink is healthy. That is, auto-recovery Sinks – OpenTSDB for time series Agg Dashboard
  6. No bash scripts, no scheduling, self-serve
  7. We build a parse plan once, job DAG, from you config file. Sinks are concurrent with curcuit breakers and exponential back off
  8. Our tavro() and other high performance Spark catalyst expressions can be used from spark-shell, IntelliJ etc.. For example, if you are authoring a Spark App that is a Kafka Consumer Or Producer you can simply turn your DataFrame into an Avro encoded Kafka message. My github example
  9. 2 Sinks in the same job – Sinks are concurrent and have circuit breakers with exponential backoff AVRO raw data sink and ORC Sink in addition to a Vertica sink (not shown)