
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling


Since mid-2016, Spark-as-a-Service has been available to researchers in Sweden from the RISE SICS ICE Data Center at www.hops.site. In this session, Dowling will discuss the challenges in building multi-tenant Spark structured streaming applications on YARN that are metered and easy to debug. The platform, called Hopsworks, is an entirely UI-driven environment built with only open-source software. Learn how they use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications; how they use Grafana and InfluxDB for monitoring Spark streaming applications; and, finally, how Apache Zeppelin can provide interactive visualizations and charts to end-users.

This session will also show how Spark applications are run within a ‘project’ on a YARN cluster, with the novel property that Spark applications are metered and charged to projects. Projects are securely isolated from each other and include support for project-specific Kafka topics; that is, Kafka topics are protected from access by users who are not members of the project. In addition, hear about the experiences of their users (over 150 users as of early 2017): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and the novel solutions for helping researchers debug and optimize Spark applications.



  1. Structured Spark Streaming-as-a-Service with Hopsworks. Jim Dowling, Assoc. Prof, Royal Institute of Technology, Stockholm. Hops: @hopshadoop http://github.com/hopshadoop http://www.hops.io
  2. Spark Streaming-as-a-Service in Sweden • SICS ICE Datacenter research environment • Hopsworks: Spark/Flink/Kafka/TensorFlow-as-a-service, built on Hops Hadoop (www.hops.io) • >150 active users. Swedish National Day today!
  3. Self-Service Spark-Streaming: “I want to Spark up, all by myself.”
  4. Structured Spark Streaming (diagram: Data Src → Pipe → Spark Streaming → SQL, with Monitor and Logs; WAL & Checkpoints in S3/HDFS)
  5. Smoking Spark Streaming • A pipe for my data: Kafka • A cloud for my smoke: Hops Hadoop • A way to roll A/B tests: Jupyter/Zeppelin • A lookout for trouble: Grafana/Influx • A log with evidence: ELK Stack
  6. Structured Spark Streaming (Hops) (diagram: Data Src → Kafka → Spark Streaming → SQL; Parquet and Spark Batch Analytics; running on HopsFS, YARN, InfluxDB, and Elastic; Public Cloud or On-Premise)
  7. General Data Protection Regulation (GDPR). Ostrich Day: 2018-05-25. http://www.computerweekly.com/news/450295538/D-Day-for-GDPR-is-25-May-2018
  8. PRIVACY-BY-DESIGN WITH PROJECTS
  9. Projects for Multi-tenancy: a Project is a grouping of Users and Data (diagram: Proj-42, Proj-X, and Proj-Company-Wide, each grouping items such as a Kafka Topic, a Hive DB, hdfs://My/Data, and a SparkSQL Table)
  10. Manage Projects Like GitHub (Spark Summit, Hopsworks, June 2017)
  11. Project Roles • Data Owner privileges: import/export data, manage membership, share data/topics • Data Scientist privileges: write and run code • We delegate administration of privileges to users
  12. EASY SHARING IN A SECURE ENVIRONMENT
  13. Share Datasets/Topics like Dropbox. Share any Data Source/Sink: HDFS Datasets, Kafka Topics, etc.
  14. Custom Python Environments with Conda. Python libraries are usable by any framework (Spark/TensorFlow).
  15. Case Study: IoT Data Platform (diagram: Sensor Nodes feed a Field Gateway into the IoT Infrastructure's Ingestion, Storage, and Analysis stages, serving tenants ACME and Evil Corp)
  16. Multi-Cloud IoT Data Platform (diagram: Field Gateways feed the IoT Data Platform; customers ACME, DontBeEvil Corp, and Evil-Corp run on AWS, Google Cloud, and Oracle Cloud; User Apps control IoT Devices)
  17. Cloud Native Solution (diagram: the Spark Streaming App writes to customer-controlled stores, ACME's S3 and GCS, with authorization). The IoT Company cedes control of the data to its Customers.
  18. Hopsworks Solution: Project per Customer (diagram: an IoT Project with a Kafka Topic for Generic Analytics; an ACME Project with an ACME Topic, ACME HDFS, and a Parquet Data Stream for Shared Custom Analytics; ACME manages membership)
  19. Spark Streaming Tooling (Hopsworks Self-Service) • Hops Hadoop • Apache Kafka • ELK Stack • Grafana/InfluxDB • Jupyter/Apache Zeppelin
  20. HopsFS – Scale-out HDFS
  21. HopsFS: Next Generation HDFS*. Faster: 16x performance on a Spotify workload. Bigger: 37x the number of files. Scale Challenge Winner (2017). *https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
  22. Kafka Self-Service UI. Manage & Share: Topics, ACLs, Avro Schemas.
  23. Tuning Kafka/Spark Streaming • Key Kafka tuning parameters: number of topics, number of partitions per topic • Spark Streaming tuning considerations: match the number of executors to the number of partitions; ensure balanced data across partitions
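As a sketch of the executor-to-partition matching above (the values are illustrative, not from the talk): for a Kafka topic with 8 partitions, a job could be given 8 total executor cores so that each core consumes one partition:

```properties
# spark-defaults.conf sketch: 4 executors x 2 cores = 8 cores,
# one per partition of an 8-partition topic (adjust to your topic)
spark.executor.instances   4
spark.executor.cores       2
spark.executor.memory      4g
```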
  24. Realtime Logs • YARN aggregates logs on job completion – no good for streaming jobs • Collect logs and make them searchable in real time using Logstash, Elasticsearch, and Kibana – Log4j is auto-configured to write to Logstash. http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/
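Following the approach in the linked blog post, the Log4j side of this can be a plain SocketAppender pointed at a Logstash log4j input; the host and port below are placeholders, not values from the talk:

```properties
# log4j.properties sketch: ship driver/executor logs to Logstash
# (SocketAppender sends serialized LoggingEvents, matching
# Logstash's log4j input; host/port are placeholders)
log4j.rootLogger=INFO, console, logstash
log4j.appender.logstash=org.apache.log4j.net.SocketAppender
log4j.appender.logstash.RemoteHost=logstash.example.com
log4j.appender.logstash.Port=4560
log4j.appender.logstash.ReconnectionDelay=10000
```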
  25. Realtime Logs: Elasticsearch, Logstash, Kibana (ELK Stack)
  26. Resource Monitoring/Alerting • $SPARK_HOME/conf/metrics.properties: different sinks supported (JMX, Graphite, Servlet/JSON, CSV, Console, Slf4j) • StreamingQueryListener: asynchronous monitoring of all queries for a Spark session; can send query progress to a Kafka topic for inspection
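The metrics.properties route above might look like the following sketch; the host and port are placeholders, and InfluxDB is assumed to accept Graphite-protocol writes (as the talk's Graphite/InfluxDB pairing suggests):

```properties
# metrics.properties sketch: report all Spark metrics via the
# Graphite sink every 10s (host/port are placeholders)
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=influxdb.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```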
  27. Resource Monitoring/Alerting: Graphite/InfluxDB and Grafana
  28. Zeppelin for Prototyping Streaming Apps [https://github.com/knockdata/spark-highcharts]
  29. Project Quotas • Per-project quotas: storage in HDFS, CPU in YARN (Uber-style pricing) • Sharing is not copying (Datasets/Topics)
  30. Secure Spark Streaming
  31. SSL/TLS Everywhere • For each project, a user has a different SSL/TLS (X.509) certificate; client-side SSL certs are used for authentication • Services are also issued SSL/TLS certificates • Kafka performs access control to topics using certs
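On the client side, this scheme boils down to standard Kafka SSL properties like the following sketch; the paths and passwords are placeholders (in Hopsworks the per-project certs are materialized for the user, as later slides show):

```properties
# Kafka client SSL sketch: the client cert both encrypts traffic and
# identifies the project member for Kafka's topic ACLs
# (paths and passwords are placeholders)
security.protocol=SSL
ssl.truststore.location=/var/ssl/kafka.client.truststore.jks
ssl.truststore.password=changeit
ssl.keystore.location=/var/ssl/kafka.client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```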
  32. SSL/TLS Certificate Generation (diagram: a user, alice@gmail.com, logs in with LDAP, password, or 2FA; the Project Mgr adds/removes users; Hopsworks sends a CSR to an Intermediate Certificate Authority, which in turn sends a CSR to the Root CA; certs are inserted into/removed from MySQL Cluster and used by HDFS, Spark, Kafka, and YARN)
  33. Distributing Certs for Spark Streaming (flow): (1) Alice launches a Spark job from Hopsworks; (2) Hopsworks gets certs and service endpoints from MySQL Cluster; (3) it submits the YARN job with its config; (4) YARN materializes the certs as private LocalResources; (5) the Spark Streaming app reads the certs; (6) HopsUtil gets the schema; (7) the app produces/consumes; (8) access to topics is authenticated with the certs.
  34. Simplified Structured Spark Streaming
  35. Basic Structured Spark Streaming Program • Query • Input source • Output sink (mode: append/complete/update) • Trigger interval • Checkpoint location

     val cloudtrailEvents = …
     val streamingETLQuery = cloudtrailEvents
       .withColumn("date", $"timestamp".cast("date"))
       .writeStream
       .trigger(ProcessingTime("10 seconds"))
       .format("parquet")
       .partitionBy("date")
       .option("path", "/cloudtrail")
       .option("checkpointLocation", "/cloudtrail.checkpoint/")
       .start()

     https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
  36. Secure Structured Spark Streaming • Streaming apps also need to know: credentials and endpoints; monitoring.properties; how to shut down gracefully • The HopsUtil API hides this complexity.
  37. HopsUtil simplifies Secure Spark/Kafka (https://github.com/hopshadoop/hops-kafka-examples). Without HopsUtil:

     Properties props = new Properties();
     props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
     props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
     props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
         org.apache.kafka.common.serialization.StringSerializer.class);
     props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
         io.confluent.kafka.serializers.KafkaAvroSerializer.class);
     props.put("producer.type", "sync");
     props.put("serializer.class", "kafka.serializer.StringEncoder");
     props.put("request.required.acks", "1");
     props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
     props.put("ssl.keystore.password", "test1234");
     props.put("ssl.key.password", "test1234");
     ProducerConfig config = new ProducerConfig(props);
     String userSchema = "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
         + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
     Schema.Parser parser = new Schema.Parser();
     Schema schema = parser.parse(userSchema);
     GenericRecord avroRecord = new GenericData.Record(schema);
     avroRecord.put("name", "testUser");
     Producer<String, Object> producer = new Producer<>(config);
     ProducerRecord<String, Object> message = new ProducerRecord<>("topicName", avroRecord);
     producer.send(message);

     With HopsUtil:

     SparkProducer producer = HopsUtil.getSparkProducer();
  38. Kafka Producer in HopsWorks:

     SparkConf sparkConf = new SparkConf().setAppName(HopsUtil.getJobName());
     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
     ...
     SparkProducer producer = HopsUtil.getSparkProducer();
     ...
     producer.produce(message);
     ...
     HopsUtil.shutdownGracefully(jsc);
  39. Streaming Consumer in HopsWorks:

     SparkConf sparkConf = new SparkConf().setAppName(HopsUtil.getJobName());
     JavaSparkContext jsc = new JavaSparkContext(sparkConf);
     ...
     DataStreamReader dsr = HopsUtil.getSparkConsumer().getKafkaDataStreamReader();
     Dataset<Row> lines = dsr.load();
     ...
     StreamingQuery queryFile = logEntries.writeStream()
     ...
     HopsUtil.shutdownGracefully(queryFile);
  40. Demo
  41. Hops Roadmap • HopsFS: multi-data-center high availability; small files; 2-level erasure coding • HopsYARN: TensorFlow on Spark with GPUs-as-a-Resource • Open Datasets: P2P dataset sharing
  42. Spark Streaming on Hopsworks • Hopsworks is a new Data Platform built on Hops • Spark Streaming is a first-class citizen • Built-in tooling for Spark Streaming
  43. Hops Heads. Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Zahin Azher Rashid, Robin Andersson, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Filotas Siskos. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
  44. Thank You. Follow us: @hopshadoop • Star us: http://github.com/hopshadoop/hopsworks • Join us: http://www.hops.io
