Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can either be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by Jim Dowling
1. Spark Streaming-as-a-Service with Kafka and YARN
Jim Dowling
KTH Royal Institute of Technology, Stockholm
Senior Researcher, SICS
CEO, Logical Clocks AB
2. Spark Streaming-as-a-Service in Sweden
• SICS ICE: datacenter research environment
• Hopsworks: Spark/Flink/Kafka/Tensorflow/Hadoop-as-a-service
– Built on Hops Hadoop (www.hops.io)
– >130 active users
15. Debugging Spark with Dr. Elephant
• Analyzes Spark jobs for errors and common problems using pluggable heuristics
• Doesn’t show killed jobs
• No online support for streaming apps yet
16. Integration as Microservices in Hopsworks
• Project-based Multi-tenancy
• Self-Service UI
• Simplifying Spark Streaming Apps
18. User roles
Data Owner
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist
- Write and Run code
Self-Service Administration – No Administrator Needed
19. Notebooks, Data sharing and Quotas
• Zeppelin Notebooks in HDFS, Jobs launcher UI.
• Sharing is not Copying
– Datasets/Topics
• Per-Project quotas
– Storage in HDFS
– CPU in YARN (Uber-style Pricing)
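The slide does not spell out how CPU usage is priced, so the following is only a minimal sketch of what metering against a per-project CPU quota with a load-dependent multiplier could look like. The `Project` class, the utilization threshold, and the surge factor are all hypothetical values chosen for illustration, not Hopsworks' actual scheme.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical project record holding a remaining CPU budget."""
    name: str
    cpu_quota_seconds: float


def price_multiplier(cluster_utilization: float,
                     threshold: float = 0.8,
                     surge: float = 2.0) -> float:
    """Return a higher price when the cluster is busy (assumed policy)."""
    return surge if cluster_utilization >= threshold else 1.0


def charge(project: Project, vcores: int, seconds: float,
           cluster_utilization: float) -> float:
    """Deduct the cost of a finished container from the project's quota."""
    cost = vcores * seconds * price_multiplier(cluster_utilization)
    if cost > project.cpu_quota_seconds:
        raise RuntimeError(f"project {project.name} has exhausted its CPU quota")
    project.cpu_quota_seconds -= cost
    return cost


if __name__ == "__main__":
    demo = Project("demo", cpu_quota_seconds=1000.0)
    print(charge(demo, vcores=2, seconds=100, cluster_utilization=0.5))
    print(charge(demo, vcores=2, seconds=100, cluster_utilization=0.9))
    print(demo.cpu_quota_seconds)
```

In this sketch the same container costs twice as much once utilization crosses the threshold, which is one simple way to read "Uber-style" pricing.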
21. Look Ma, no Kerberos
• Each project-specific user is issued an SSL/TLS (X.509) certificate for both authentication and encryption.
• Services are also issued SSL/TLS certificates.
– Same root CA as user certs
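As a rough illustration of certificate-based authentication without Kerberos, the sketch below builds a client-side TLS context with Python's standard `ssl` module. The function name and all file paths are hypothetical; the slides do not show how Hopsworks clients load their certificates.

```python
import ssl
from typing import Optional


def make_client_context(ca_file: Optional[str] = None,
                        cert_file: Optional[str] = None,
                        key_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a TLS client context for mutual authentication.

    ca_file:   root CA that signed the service certificates (assumed path)
    cert_file: the project user's X.509 certificate (assumed path)
    key_file:  the matching private key (assumed path)
    """
    # PROTOCOL_TLS_CLIENT enables hostname checking and requires a valid
    # server certificate by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        # Trust only the shared root CA rather than the system store.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Present the user's certificate so the service can authenticate us.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

Because user and service certificates share one root CA, a single trusted CA file is enough for each side to verify the other.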
22. Simplifying Spark Streaming Apps
• Spark Streaming Applications need to know
– Credentials
• Hadoop, Kafka, InfluxDB, Logstash
– Endpoints
• Kafka Broker, Kafka SchemaRegistry, ResourceManager, NameNode, InfluxDB, Logstash
• The HopsUtil API hides this complexity.
– Location/security transparent Spark applications
23. Secure Streaming App with Kafka
Developer
1. Discover: Schema Registry and Kafka/InfluxDB/ELK Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: Producer/Consumer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
These steps are replaced by calls to the HopsUtil API
Operations
https://github.com/hopshadoop/hops-kafka-examples
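To give a sense of the manual work in step 2 above, here is a minimal sketch that assembles a Kafka client properties file for SSL. The configuration keys are standard Kafka client settings; the helper names, broker addresses, store paths, and password are hypothetical placeholders, and this is not the HopsUtil API itself.

```python
from typing import Dict, List


def kafka_ssl_properties(brokers: List[str], truststore: str,
                         keystore: str, password: str) -> Dict[str, str]:
    """Collect the settings a Kafka client needs to connect over SSL."""
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SSL",
        "ssl.truststore.location": truststore,
        "ssl.truststore.password": password,
        "ssl.keystore.location": keystore,
        "ssl.keystore.password": password,
        "ssl.key.password": password,
    }


def to_properties_text(props: Dict[str, str]) -> str:
    """Render the settings as key=value lines (no escaping of special characters)."""
    return "\n".join(f"{k}={v}" for k, v in sorted(props.items())) + "\n"


if __name__ == "__main__":
    props = kafka_ssl_properties(
        ["broker1.example.com:9091", "broker2.example.com:9091"],  # assumed hosts
        "/tmp/truststore.jks",  # assumed path
        "/tmp/keystore.jks",    # assumed path
        "changeit",             # assumed password
    )
    print(to_properties_text(props), end="")
```

Every application would otherwise repeat this setup and also have to clean up the key material afterwards, which is the kind of boilerplate the slide says is replaced by library calls.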
29. IoT Scenario
[Diagram: an IoT company analyzes data and provides data services for its clients ACME, DontBeEvil Corp, and Evil-Corp, shown alongside AWS, Google Cloud, and Oracle Cloud. User apps control the IoT devices.]
30. Cloud-Native Analytics Solution
[Diagram: ACME with S3, plus GCS and Oracle storage, each behind an authorization step, connected to the IoT company's Spark Streaming app.]
• Each customer needs its own analytics infrastructure
35. Hops Roadmap
• HopsFS
– HA support for Multi-Data-Center
– Small files, 2-Level Erasure Coding
• HopsYARN
– Tensorflow with isolated GPUs
• Hopsworks
– P2P Dataset Sharing
– Jupyter, Presto, Hive
36. Summary
• Hops is a new distribution of Hadoop
– Tinker-friendly and open-source.
• Hopsworks provides first-class support for Spark-Streaming-as-a-Service
– With support services like Kafka, ELK Stack, Zeppelin, Grafana/InfluxDB.
37. Hops Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Roberto Bampi, Fabio Buso, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, Robin Andersso, ArunaKumari Yedurupaka, Tobias Johansson, August Bonds, Tiago Brito, Filotas Siskos.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
38. Thank You.
We totally understand it’s going to be
America First Spark Streaming first, but
can we take this chance to say
Hopsworks second!
http://www.hops.io
@hopshadoop