
Integrating Apache Pulsar with Big Data Ecosystem



At the Apache Pulsar Beijing Meetup, Yijie Shen presented the current state of Apache Pulsar's integration with the big data ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink and Presto to form a unified data processing system.

Published in: Internet


  1. 1. Integrating Apache Pulsar with Big Data Ecosystem. Yijie Shen, 2019-08-17
  2. 2. Data analytics with Apache Pulsar
  3. 3. Why so many analytic frameworks? Each kind has its best fit.
     • Interactive engine
       • Time critical
       • Medium data size
       • Rerun on failure
     • Batch engine
       • The amount of data can be very large
       • Could run on a huge cluster
       • Fine-grained fault tolerance
     • Streaming
       • Ever-running jobs
       • Time critical
       • Needs scalability as well as resilience to failures
     • Serverless
       • Simple processing logic
       • Processing data with high velocity
     Don’t ask, I don’t know.
  4. 4. Why Apache Pulsar fits all? It’s a Pulsar Meetup, dude...
  5. 5. Pulsar – A cloud-native architecture
     • Stateless serving
     • Durable storage
  6. 6. Pulsar – Segment-based storage
     • Managed ledger
       • The storage layer for a single topic
     • Ledger
       • Single writer, append-only
       • Replicated to multiple bookies
  7. 7. Pulsar – Infinite stream storage
     • Reduce storage cost
       • Offload segments to tiered storage one by one
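The segment model on slides 6 and 7 can be illustrated with a small in-memory toy. This is only a sketch under stated assumptions: `SegmentedTopicSketch`, `Segment`, and the fixed entries-per-segment rollover rule are hypothetical names and simplifications, not Pulsar or BookKeeper APIs, and replication to bookies is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not Pulsar or BookKeeper APIs.
class SegmentedTopicSketch {
    /** One append-only segment (a "ledger" in the slides). */
    static final class Segment {
        final List<String> entries = new ArrayList<>();
        boolean sealed = false;     // a sealed segment receives no further writes
        boolean offloaded = false;  // true once moved to cheaper tiered storage
    }

    private final int maxEntriesPerSegment; // assumed rollover rule, for illustration
    final List<Segment> segments = new ArrayList<>();

    SegmentedTopicSketch(int maxEntriesPerSegment) {
        this.maxEntriesPerSegment = maxEntriesPerSegment;
        segments.add(new Segment());
    }

    /** Single writer: appends to the last (open) segment, rolling over when it is full. */
    void append(String entry) {
        Segment open = segments.get(segments.size() - 1);
        if (open.entries.size() >= maxEntriesPerSegment) {
            open.sealed = true;
            open = new Segment();
            segments.add(open);
        }
        open.entries.add(entry);
    }

    /** Offloads the oldest sealed segment still on primary storage; returns its index, or -1. */
    int offloadOldestSealed() {
        for (int i = 0; i < segments.size(); i++) {
            Segment s = segments.get(i);
            if (s.sealed && !s.offloaded) {
                s.offloaded = true;
                return i;
            }
        }
        return -1;
    }

    /** Readers see one continuous stream regardless of where each segment lives. */
    List<String> readAll() {
        List<String> out = new ArrayList<>();
        for (Segment s : segments) {
            out.addAll(s.entries);
        }
        return out;
    }
}
```

Only sealed segments are eligible for offloading in this sketch, which is what lets them move one at a time while the open segment keeps accepting writes.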
  8. 8. Pulsar Schema
     • Consensus on data at the server side
       • Built-in schema registry
       • Data schema on a per-topic basis
     • Send and receive typed messages directly
       • Validation
       • Multi-version
  9. 9. Durable and ordered source
     • Failures are inevitable for engines
     • Re-schedule failed tasks
       • Tasks are assigned a fixed (start, end] range in Spark
       • Tasks recover from a checkpoint (start in Flink
     • Exactly-once
       • Based on message order in the topic
       • Seek & read
     • Messages are kept alive by the subscription
       • Move the subscription cursor on commit
     [Figure: task1 and task2 reading from a durable cursor]
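The replay argument on slide 9 can be sketched with a toy in-memory topic. Everything here is an assumption made for illustration: `DurableCursorSketch` is a hypothetical class, offsets are plain integers counted from 1 rather than Pulsar message IDs, and no Pulsar client call is used.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a hypothetical in-memory topic, not the Pulsar client API.
class DurableCursorSketch {
    private final List<String> log = new ArrayList<>(); // ordered, durable messages
    private int cursor = 0; // messages with offset <= cursor have been released

    void publish(String msg) {
        log.add(msg);
    }

    /** Seek & read: returns the messages with offset in (start, end], offsets counted from 1. */
    List<String> read(int start, int end) {
        if (start < cursor) {
            throw new IllegalStateException("range already released by the subscription");
        }
        return new ArrayList<>(log.subList(start, end));
    }

    /** Moves the cursor only after the engine has committed results up to 'end'. */
    void commit(int end) {
        cursor = Math.max(cursor, end);
    }

    int cursor() {
        return cursor;
    }
}
```

Because the log is ordered and nothing is released before `commit`, a rescheduled task that re-reads the same (start, end] range sees exactly the messages the failed attempt saw, which is the property the exactly-once bullet relies on.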
  10. 10. Two levels of reading API
      • Consumer
        • Subscribe / seek / receive
        • Per topic partition
        • Pulsar-Spark, Pulsar-Flink
      • Segment
        • Read directly from bookies
        • For parallelism
        • Presto
  11. 11. Processing typed records
      • Regard Pulsar as structured storage
      • Fetch the schema as the first step
        • With the Pulsar Admin API
        • Dynamic / multi-versioned schemas are not supported in Spark/Flink
        • But you could try AUTO_CONSUME
      • SerDe your messages into InternalRow / Row
        • Avro schema and Avro/JSON/Protobuf messages
        • Or parse the Avro record as we do in pulsar-spark[1]
      • Message metadata as metadata fields
        • __key, __publishTime, __eventTime, __messageId, __topic
  12. 12. Topic/Partition add/delete discovery
      • Streaming jobs are long-running
      • Topics & partitions may be added or removed during a job
      • Periodically check topic status
        • Spark: during incremental planning
        • Flink: with a monitoring thread in each task
      Pulsar-Spark as an example
      • Happens during logical planning
        • getBatch(start: Option[Offset], end: Offset)
        • Discover topic differences between start and end
        • Start – the last end
        • End – getOffset()
      • Connector
        • Provides the available offsets for all topics/partitions on each getOffset
        • Creates a DataFrame/Dataset based on existing topics/partitions
      • Structured Streaming (SS) takes care of the rest
      Offset { topicOffsets: Map[String, MessageId] }
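The "discover topic differences between start and end" step on slide 12 can be sketched as a comparison of two offset maps. This is a minimal illustration, not the connector's code: `TopicDiffSketch` and its method names are hypothetical, and `Long` stands in for Pulsar's `MessageId`.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: Long stands in for Pulsar's MessageId; all names here are hypothetical.
class TopicDiffSketch {
    /** Topics/partitions present in 'end' but not in 'start' were added since the last batch. */
    static Set<String> added(Map<String, Long> start, Map<String, Long> end) {
        Set<String> result = new TreeSet<>(end.keySet());
        result.removeAll(start.keySet());
        return result;
    }

    /** Topics/partitions present in 'start' but not in 'end' were removed since the last batch. */
    static Set<String> removed(Map<String, Long> start, Map<String, Long> end) {
        Set<String> result = new TreeSet<>(start.keySet());
        result.removeAll(end.keySet());
        return result;
    }

    /** Topics/partitions present in both maps whose offset moved forward. */
    static Set<String> advanced(Map<String, Long> start, Map<String, Long> end) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> e : end.entrySet()) {
            Long previous = start.get(e.getKey());
            if (previous != null && e.getValue() > previous) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With `start` taken as the previous batch's `end` and `end` taken from the latest `getOffset()`, the three sets tell the planner which partitions to start reading, stop reading, or continue reading in the next batch.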
  13. 13. Various APIs use Pulsar as source
      Spark:
        val df = spark
          // The next two lines are missing from the slide text; they are assumed
          // from the pulsar-spark connector's documented streaming usage.
          .readStream
          .format("pulsar")
          .option("service.url", "pulsar://...")
          .option("admin.url", "http://...")
          .option("topic", "topic1")
          .load()
      Flink:
        val props = new Properties()
        props.setProperty("service.url", serviceUrl)
        props.setProperty("admin.url", adminUrl)
        props.setProperty("partitionDiscoveryIntervalMillis", "5000")
        props.setProperty("startingOffsets", "earliest")
        env.addSource(new FlinkPulsarSource(props))
      Presto:
        show tables in pulsar."public/default";
        select * from pulsar."public/default".generator_test;
  14. 14. Pulsar-Spark and Pulsar-Flink
      • Pulsar-Spark, based on Spark 2.4, is now open source
      • Pulsar-Flink, based on Flink 1.9, will be open-sourced soon
      • Roadmap for these two projects
        • End-to-end exactly-once with Pulsar transaction support
        • Fine-grained batch parallelism at the segment level
      Pulsar-Spark / Pulsar-Flink