This document describes the Parallel Streaming Transformation Loader (PSTL), which uses Kafka, Spark, and Vertica for real-time data ingestion and analytics. The PSTL works as follows:
1. The PSTL ingests streaming data from Kafka into Spark RDDs in parallel.
2. Spark transforms the data, which includes assigning IDs and hashing records to partitions.
3. The transformed data is written in parallel from the Spark partitions directly to Vertica for analytics and querying.
4. Using this approach, a parallel copy into Vertica loaded 2.42 billion rows in under 8 minutes, which works out to more than 5 million rows per second.
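
The flow in steps 2 and 3 can be illustrated with a small, self-contained sketch. It is not the PSTL implementation: it uses only the Python standard library, with an in-memory dictionary standing in for Vertica and a thread pool standing in for Spark's parallel partitions. The names `NUM_PARTITIONS`, `partition_for`, `transform`, `load_partition`, and `parallel_load` are hypothetical, and the choice of CRC32 as the hash is an assumption made for the example.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for the example


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a record key to a partition index with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def transform(records):
    """Assign a sequential ID to each (key, payload) record and group by partition."""
    partitions = defaultdict(list)
    for row_id, (key, payload) in enumerate(records):
        partitions[partition_for(key)].append((row_id, key, payload))
    return partitions


def load_partition(partition_id, rows, sink):
    """Stand-in for one bulk load per partition; stores the rows in a dict."""
    sink[partition_id] = list(rows)
    return len(rows)


def parallel_load(partitions):
    """Write every partition concurrently and return the sink and total row count."""
    sink = {}
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        futures = [
            pool.submit(load_partition, pid, rows, sink)
            for pid, rows in partitions.items()
        ]
        total = sum(f.result() for f in futures)
    return sink, total


if __name__ == "__main__":
    records = [("user-%d" % i, {"value": i}) for i in range(1000)]
    sink, total = parallel_load(transform(records))
    print("loaded", total, "rows across", len(sink), "partitions")
```

Because the hash is stable, a given key always lands in the same partition, so each writer handles a disjoint slice of the data and the writers need no coordination with one another. In an actual Spark job the per-partition write would more likely sit inside something like `foreachPartition`, with each task issuing its own Vertica `COPY`; the summary does not say which mechanism the PSTL uses.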