Spark summit-east-dowling-feb2017-full


Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can either be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications run within a project on a YARN cluster, with the novel property that they are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark Streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark Streaming applications, how we use Grafana and Graphite for monitoring Spark Streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.

Published in: Technology

  1. 1. Spark Streaming-as-a-Service with Kafka and YARN Jim Dowling KTH Royal Institute of Technology, Stockholm Senior Researcher, SICS CEO, Logical Clocks AB
  2. 2. Spark Streaming-as-a-Service in Sweden • SICS ICE: datacenter research environment • Hopsworks: Spark/Flink/Kafka/Tensorflow/Hadoop-as-a-service – Built on Hops Hadoop – >130 active users
  3. 3. Hadoop is not a cool kid anymore!
  4. 4. Hadoop’s Evolution 2009 2016 ?
  5. 5. Hadoop’s Evolution 2009 2016 ? Tiny Brain (NameNode, ResourceMgr) Huge Body (DataNodes)
  6. 6. Build out Hadoop’s Brain with External Weakly Consistent MetaData Services Google-Glass Approach to Intelligence
  7. 7. NameNodes NDB HDFS Client DataNodes >37X Capacity >16 X Throughput HopsFS
  8. 8. Larger Brains => Bigger, Faster* 16x Performance on Spotify Workload *Usenix FAST 2017, HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
  9. 9. Hopsworks • Projects – Datasets/Files – Topics – Jobs/Notebooks Hadoop • Clusters • Users • Jobs/Applications • Files • ACLs • Sys Admins • Kerberos Larger Brains => More Intelligent* *HMGA2 gene mutations correlated with increased intracranial volume as well as enhanced IQ. User-Friendly Concepts
  10. 10. YARN Spark Streaming Support • Apache Kafka • ELK Stack – Real-time Logs • Grafana/InfluxDB – Monitoring Hopsworks YARN aggregates logs on job completion
  11. 11. Kafka Self-Service UI Manage & Share • Topics • ACLs • Avro Schemas
  12. 12. Logs Elasticsearch, Logstash, Kibana (ELK Stack)
  13. 13. Monitoring/Alerting InfluxDB and Grafana StreamingMetrics.streaming.lastReceivedBatch_records == 0
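The alert condition on slide 13 fires when a streaming application has stopped receiving records. The slide does not show how the rule is evaluated, so the following is only a minimal sketch of the idea in plain Java. The class name `StalledStreamAlert` and the "several consecutive zero samples" window are assumptions made for illustration, not part of Hopsworks, Grafana or InfluxDB.

```java
/**
 * Hypothetical sketch: decide whether to alert on a stalled stream, given
 * recent samples of the StreamingMetrics.streaming.lastReceivedBatch_records
 * metric (oldest first). Requiring several consecutive zeros is an assumption
 * made here to avoid alerting on a single empty batch.
 */
final class StalledStreamAlert {
    private StalledStreamAlert() {}

    /** Returns true when the most recent `window` samples are all zero. */
    static boolean shouldAlert(long[] samples, int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        if (samples.length < window) {
            return false; // not enough data to decide
        }
        for (int i = samples.length - window; i < samples.length; i++) {
            if (samples[i] != 0) {
                return false; // at least one recent batch carried records
            }
        }
        return true;
    }
}
```

For example, `StalledStreamAlert.shouldAlert(new long[] {120, 0, 0, 0}, 3)` reports a stall, while a series ending in a non-empty batch does not.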
  14. 14. Zeppelin for Prototyping Streaming Apps
  15. 15. Debugging Spark with Dr. Elephant • Analyzes Spark Jobs for errors and common problems using pluggable heuristics • Doesn’t show killed jobs • No online support for streaming apps yet
  16. 16. Integration as Microservices in Hopsworks • Project-based Multi-tenancy • Self-Service UI • Simplifying Spark Streaming Apps
  17. 17. Projects in Hopsworks [diagram: Proj-All, Proj-X, Proj-42; Shared Topic; Topic; /Projs/My/Data; CompanyDB]
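The abstract states that Kafka topics are protected from users who are not project members, and slide 17 shows a topic shared between projects. A minimal sketch of such a check follows. The `TopicAcl` class and its methods are hypothetical and are not the Hopsworks or Kafka authorizer API; the only detail taken from the slides is the `Project__user` naming shown on slide 20.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of project-based topic access: a topic belongs to one
 * project and may be shared with others. Users are identified by
 * project-specific names of the form "Project__user" (as on slide 20).
 */
final class TopicAcl {
    private static final String SEPARATOR = "__";

    private final String ownerProject;
    private final Set<String> sharedWith = new HashSet<>();

    TopicAcl(String ownerProject) {
        this.ownerProject = ownerProject;
    }

    /** Grants another project access to the topic without copying it. */
    void shareWith(String project) {
        sharedWith.add(project);
    }

    /** Allows access only for users of the owning project or a project it is shared with. */
    boolean canAccess(String projectUser) {
        int idx = projectUser.indexOf(SEPARATOR);
        if (idx <= 0) {
            return false; // not a project-specific user name
        }
        String project = projectUser.substring(0, idx);
        return project.equals(ownerProject) || sharedWith.contains(project);
    }
}
```

With a topic owned by `Proj-42` and shared with `Proj-X`, the users `Proj-42__alice` and `Proj-X__bob` are allowed, while a user of any other project is rejected.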
  18. 18. User roles 18 Data Owner - Import/Export data - Manage Membership - Share DataSets, Topics Data Scientist - Write and Run code Self-Service Administration – No Administrator Needed
  19. 19. Notebooks, Data sharing and Quotas • Zeppelin Notebooks in HDFS, Jobs launcher UI. • Sharing is not Copying – Datasets/Topics • Per-Project quotas – Storage in HDFS – CPU in YARN (Uber-style Pricing)
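Slide 19 mentions per-project CPU quotas in YARN with "Uber-style pricing" but does not give the formula. The sketch below assumes that this means a base price scaled by a load-dependent multiplier and charged per vcore-second; the class `ProjectQuota`, the credit unit and the multiplier are all assumptions for illustration only.

```java
/**
 * Hypothetical sketch of metering a job against a project's CPU quota.
 * Assumption: cost = vcore-seconds * base price * load multiplier, where the
 * multiplier rises when the cluster is busy. None of this is the actual
 * Hopsworks implementation.
 */
final class ProjectQuota {
    private long remainingCredits;

    ProjectQuota(long initialCredits) {
        if (initialCredits < 0) {
            throw new IllegalArgumentException("initial credits must not be negative");
        }
        this.remainingCredits = initialCredits;
    }

    /** Computes the cost of a job in credits, rounded to the nearest credit. */
    static long cost(long vcoreSeconds, long basePricePerVcoreSecond, double loadMultiplier) {
        if (vcoreSeconds < 0 || basePricePerVcoreSecond < 0 || loadMultiplier < 0) {
            throw new IllegalArgumentException("arguments must not be negative");
        }
        return Math.round(vcoreSeconds * basePricePerVcoreSecond * loadMultiplier);
    }

    /** Deducts the cost if the project can afford it; returns false otherwise. */
    boolean charge(long cost) {
        if (cost > remainingCredits) {
            return false;
        }
        remainingCredits -= cost;
        return true;
    }

    long remaining() {
        return remainingCredits;
    }
}
```

Under these assumptions, a job that used 600 vcore-seconds at a base price of 2 credits and a multiplier of 1.5 costs 1800 credits, which is deducted from the project's remaining quota.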
  20. 20. Dynamic roles ProjectA Authenticate ProjectB HopsFS YARN Kafka SSL/TLS Certificates Secure Impersonation ProjectA__alice ProjectB__alice
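Slide 20 shows the same person appearing as `ProjectA__alice` and `ProjectB__alice`, one identity per project. A small sketch of building and parsing such names follows. The helper class `ProjectUsers` and its validation rules are hypothetical; only the double-underscore naming pattern comes from the slide.

```java
import java.util.Objects;

/**
 * Hypothetical helper for project-specific user names such as
 * "ProjectA__alice". The "__" separator follows slide 20; the validation
 * rules are assumptions.
 */
final class ProjectUsers {
    private static final String SEPARATOR = "__";

    private ProjectUsers() {}

    /** Builds the project-specific user name for a user within a project. */
    static String projectUser(String project, String user) {
        Objects.requireNonNull(project, "project");
        Objects.requireNonNull(user, "user");
        if (project.isEmpty() || user.isEmpty() || project.contains(SEPARATOR)) {
            throw new IllegalArgumentException("invalid project or user name");
        }
        return project + SEPARATOR + user;
    }

    /** Extracts the project part of a project-specific user name. */
    static String projectOf(String projectUser) {
        int idx = projectUser.indexOf(SEPARATOR);
        if (idx <= 0) {
            throw new IllegalArgumentException("not a project-specific user name: " + projectUser);
        }
        return projectUser.substring(0, idx);
    }
}
```

Because the project is part of the identity, a service that sees `ProjectB__alice` can decide access for ProjectB without consulting alice's roles in any other project.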
  21. 21. Look Ma, no Kerberos • Each project-specific user is issued an SSL/TLS (X.509) certificate for both authentication and encryption. • Services are also issued SSL/TLS certificates. – Same root CA as user certs
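For a Kafka client, certificate-based authentication and encryption as described on slide 21 comes down to a handful of standard client settings. The sketch below builds them with plain `java.util.Properties`. The property keys are standard Kafka client configuration names; the class name `KafkaSslProps` and all paths, hosts and passwords passed in are placeholders, not values used by Hopsworks.

```java
import java.util.Properties;

/**
 * Sketch of Kafka client settings for mutual TLS. The keys are standard
 * Kafka client configuration names; every value is supplied by the caller
 * and is a placeholder in this example.
 */
final class KafkaSslProps {
    private KafkaSslProps() {}

    static Properties sslClientProperties(String bootstrapServers,
                                          String keystorePath,
                                          String truststorePath,
                                          String password) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Encrypt traffic and authenticate with X.509 certificates.
        props.setProperty("security.protocol", "SSL");
        // Keystore: the client's own certificate and private key.
        props.setProperty("ssl.keystore.location", keystorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        // Truststore: the root CA used to verify the brokers.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", password);
        return props;
    }
}
```

The resulting `Properties` object can be extended with serializer settings and passed to a Kafka producer or consumer constructor.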
  22. 22. Simplifying Spark Streaming Apps • Spark Streaming Applications need to know – Credentials • Hadoop, Kafka, InfluxDB, Logstash – Endpoints • Kafka Broker, Kafka SchemaRegistry, ResourceManager, NameNode, InfluxDB, Logstash • The HopsUtil API hides this complexity. – Location/security transparent Spark applications
  23. 23. Secure Streaming App with Kafka Developer 1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints 2. Create: Kafka Properties file with certs and broker details 3. Create: Producer/Consumer using Kafka Properties 4. Download: the Schema for the Topic from the Schema Registry 5. Distribute: X.509 certs to all hosts on the cluster 6. Cleanup securely. These steps are replaced by calls to the HopsUtil API Operations
  24. 24. Streaming Producer in HopsWorks
      JavaSparkContext jsc = new JavaSparkContext(sparkConf);
      String topic = HopsUtil.getTopic(); // Optional
      SparkProducer producer = HopsUtil.getSparkProducer();
      Map<String, String> message = …
      producer.produce(message);
  25. 25. Streaming Consumer in HopsWorks
      JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
      String topic = HopsUtil.getTopic(); // Optional
      String consumerGroup = HopsUtil.getConsumerGroup(); // Optional
      SparkConsumer consumer = HopsUtil.getSparkConsumer(jssc);
      JavaInputDStream<ConsumerRecord<String, byte[]>> messages = consumer.createDirectStream();
      jssc.start();
  26. 26. Less code to write
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
      props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);
      // The legacy keys producer.type, serializer.class and request.required.acks belong to the
      // old Scala producer; with KafkaProducer only the acks setting applies.
      props.put(ProducerConfig.ACKS_CONFIG, "1");
      props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
      props.put("ssl.keystore.password", "test1234");
      props.put("ssl.key.password", "test1234");
      String userSchema = "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
          + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
      Schema.Parser parser = new Schema.Parser();
      Schema schema = parser.parse(userSchema);
      GenericRecord avroRecord = new GenericData.Record(schema);
      avroRecord.put("name", "testUser");
      Producer<String, Object> producer = new KafkaProducer<>(props);
      ProducerRecord<String, Object> message = new ProducerRecord<>("topicName", avroRecord);
      producer.send(message);
      Lots of Hard-Coded Endpoints Here!
      SparkProducer producer = HopsUtil.getSparkProducer();
      Map<String, String> message = …
      producer.produce(message);
      Massively Simplified Code for Secure Spark Streaming/Kafka
  27. 27. Distributing Certs for Spark Streaming 1. Launch Spark Job Distributed Database 2. Get certs, service endpoints YARN Private LocalResources Spark Streaming App 4. Materialize certs 3. YARN Job, config 6. Get Schema 7. Consume Produce 5. Read Certs Hopsworks HopsUtil 8. Read ACLs for authentication
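Steps 4 and 5 on slide 27 ("Materialize certs", "Read Certs"), together with the "Cleanup securely" step on slide 23, imply writing certificate material to a local file that only the job's user can read and removing it afterwards. The following is a minimal sketch of that idea using `java.nio.file`. The class `CertMaterializer` and its method names are hypothetical, and the sketch assumes a POSIX file system; it does not show how Hopsworks or YARN actually distribute the files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

/**
 * Hypothetical sketch: write keystore bytes to a local file that only the
 * owner can read or write, and delete it when the job is done.
 * Assumes a POSIX file system (createFile fails otherwise).
 */
final class CertMaterializer {
    private CertMaterializer() {}

    /** Creates the file with owner-only permissions, then writes the bytes. */
    static Path materialize(Path dir, String fileName, byte[] keystoreBytes) {
        Path target = dir.resolve(fileName);
        try {
            // Create with rw------- up front so the contents are never world-readable.
            Files.createFile(target,
                PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-------")));
            Files.write(target, keystoreBytes);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Removes the materialized file; returns true if a file was deleted. */
    static boolean cleanup(Path file) {
        try {
            return Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A job would call `materialize` before creating its Kafka client and `cleanup` in a shutdown hook or `finally` block, so that key material does not outlive the application on the host.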
  28. 28. Multi-Tenant IoT Scenario [diagram: six Sensor Nodes; Field Gateway; IoT Cloud Platform with Ingestion, Storage, Analysis; tenants ACME, Evil Corp, DontBeEvil Corp]
  29. 29. IoT Scenario ACME DontBeEvil Corp Evil-Corp AWS Google Cloud Oracle Cloud User Apps control IoT Devices IoT Company: Analyze Data, Data Services for Clients ACME DontBeEvil Corp Evil Corp
  30. 30. Cloud-Native Analytics Solution [diagram: IoT Company; Spark Streaming App; [Authorization]; ACME; S3, GCS, Oracle] Each customer needs its own Analytics Infrastructure
  31. 31. Hopsworks Solution using Projects [diagram: IoT Company Project with Gateway Topic; ACME Project with ACME Topic, ACME Dataset; Data Stream; Analytics Reports]
  32. 32. Hopsworks Solution [diagram: ACME Project with ACME Topic, Spark Streaming App [Authorized], ACME Dataset, Spark Batch Job, ACME Analytics Reports]
  33. 33. Karamel/Chef for Automated Installation Google Compute Engine BareMetal
  34. 34. DEMO
  35. 35. Hops Roadmap • HopsFS – HA support for Multi-Data-Center – Small files, 2-Level Erasure Coding • HopsYARN – Tensorflow with isolated GPUs • Hopsworks – P2P Dataset Sharing – Jupyter, Presto, Hive
  36. 36. Summary • Hops is a new distribution of Hadoop – Tinker-friendly and open-source. • Hopsworks provides first-class support for Spark-Streaming-as-a-Service – With support services like Kafka, ELK Stack, Zeppelin, Grafana/InfluxDB.
  37. 37. Hops Team Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas Siskos. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
  38. 38. Thank You. We totally understand it’s going to be America First Spark Streaming first, but can we take this chance to say Hopsworks second! @hopshadoop Hops