This document discusses methods for understanding the evolution of open source cloud systems like OpenStack. It presents the authors' solution of using tracing techniques to analyze OpenStack's data and message flows for logical operations such as creating and deleting VMs. Key findings from tracing OpenStack releases include significant behavioral changes between releases, hundreds of database queries and AMQP messages required for operations, and the involvement of components like Keystone, Glance, Nova, and Neutron. The authors propose using their techniques to inject faults and build a knowledge base to aid future problem diagnosis.
Opaque: A Data Analytics Platform with Strong Security | Spark Summit East talk… | Spark Summit
As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to an attacker who has compromised the operating system or hypervisor. Trusted hardware such as Intel SGX has recently become available in latest-generation processors. Such hardware enables arbitrary computation on encrypted data while shielding it from a malicious OS or hypervisor. However, it still suffers from a significant side channel: access pattern leakage.
We present Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection (a.k.a. obliviousness). Opaque achieves these guarantees by introducing new oblivious distributed relational operators that provide a 2,000x performance gain over state-of-the-art oblivious systems, as well as novel query planning techniques for these operators implemented using Catalyst.
Apache Spark Performance is too hard. Let's make it easier | Databricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi | Databricks
This session will start with a recap of what sparklyr is and how it can be used to analyze, visualize and perform machine learning in Spark from R. We will walk through installation, configuration, data wrangling with SQL or dplyr, modeling in MLlib or H2O, and extending sparklyr by calling Scala functions from R or writing Scala modules accessible from R. You’ll then get a detailed update on new sparklyr features. After sparklyr 0.4 was released to CRAN last year, RStudio released 0.5, which implements new connections, features and architecture changes worth reviewing. We will wrap up with a discussion of use cases relevant to the R ecosystem, demonstrating how to model data using popular R frameworks through seamless interactions between Spark and R using sparklyr.
Transactional writes to cloud storage with Eric Liang | Databricks
Eric discusses the three dimensions along which to evaluate HDFS versus S3: cost, SLAs (availability and durability), and performance. He then provides a deep dive on the challenges of writing to cloud storage with Apache Spark and shares transactional commit benchmarks on Databricks I/O (DBIO) compared to Hadoop.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi | Databricks
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms – deep learning – remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty of tuning and configuration.
In this session, you’ll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You’ll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.
Building cloud-enabled genomics workflows with Luigi and Docker | Jacob Feala
Talk given at Bio-IT 2016, Cloud Computing track
Abstract:
As bioinformatics scientists, we tend to write custom tools for managing our workflows, even when viable, open-source alternatives are available from the tech community. Our field has, however, begun to adopt Docker containers to stabilize compute environments. In this talk, I will introduce Luigi, a workflow system built by engineers at Spotify to manage long-running big data processing jobs with complex dependencies. Focusing on a case study of next generation sequencing analysis in cancer genomics research, I will show how Luigi can connect simple, containerized applications into complex bioinformatics pipelines that can be easily integrated with compute, storage, and data warehousing on the cloud.
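The Luigi model the abstract describes, tasks that declare upstream dependencies, mark completion with an output, and do work in a run method, can be sketched with the standard library alone. This is a hypothetical miniature whose method names mirror Luigi's conventions; it is not Luigi's real API or scheduler.

```python
import os
import tempfile

# Minimal sketch of Luigi's task model: a task declares upstream
# dependencies (requires), a completion marker (output), and work (run).
# A task whose output already exists is skipped, which is the property
# that lets long pipelines resume after a failure.

WORKDIR = tempfile.mkdtemp()

class Task:
    def requires(self):          # upstream tasks; default: none
        return []
    def output(self):            # path whose existence marks completion
        return os.path.join(WORKDIR, type(self).__name__ + ".done")
    def complete(self):
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

def build(task):
    """Depth-first scheduler: satisfy dependencies, then run the task."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

class AlignReads(Task):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("aligned\n")

class CallVariants(Task):
    def requires(self):
        return [AlignReads()]
    def run(self):
        with open(self.output(), "w") as f:
            f.write("variants\n")

build(CallVariants())
print(sorted(os.listdir(WORKDIR)))  # ['AlignReads.done', 'CallVariants.done']
```

In real Luigi, each containerized pipeline step would be one such task, with outputs living in cloud storage rather than a temp directory.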
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn… | Spark Summit
We all dread “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up spark yarn applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in the distributed yarn environment. Sqrrl has developed a testing framework for observing vital statistics of spark jobs including executor-by-executor memory and CPU usage over time for both the JDK and python portions of pyspark yarn containers. This talk will detail the methods we use to collect, store, and report spark yarn resource usage. This information has proved to be invaluable for performance and regression testing of the spark jobs in Sqrrl Enterprise.
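The question the abstract calls surprisingly tricky, "How much memory did my application use?", has at least a crude single-process answer in the Python standard library; collecting it per executor across a YARN cluster is what the talk's framework adds. This snippet is only an illustration of the local building block, not Sqrrl's tooling.

```python
import resource

# Peak resident set size of this process, as reported by the kernel.
# On Linux, ru_maxrss is in kilobytes; on macOS it is in bytes.
peak_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Allocating memory moves the high-water mark up; it never goes down.
blob = bytearray(50 * 1024 * 1024)  # hold ~50 MB live

peak_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
assert peak_after >= peak_before
print(f"peak RSS: {peak_after} (platform-dependent units)")
```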
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung | Spark Summit
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting changes in current and upcoming Apache Spark 2.x releases.
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk… | Spark Summit
Due to Spark, writing big data applications has never been easier…at least until they stop being easy! At Lightbend we’ve helped our customers out of a number of hidden Spark pitfalls. Some crop up often; the ever-persistent OutOfMemoryError, the confusing NoSuchMethodError, shuffle and partition management, etc. Others occur less frequently; an obscure configuration affecting SQL broadcasts, struggles with speculating, a failing stream recovery due to RDD joins, S3 file reading leading to hangs, etc. All are intriguing! In this session we will provide insights into their origins and show how you can avoid making the same mistakes. Whether you are a seasoned Spark developer or a novice, you should learn some new tips and tricks that could save you hours or even days of debugging.
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, Store, and Visualize Data with InfluxDB and Grafana | InfluxDays Virtual Experience NA 2020
CaffeOnSpark Update: Recent Enhancements and Use Cases | DataWorks Summit
By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016.
In this talk, we will update the audience on recent developments in CaffeOnSpark. We will highlight new features and capabilities: a unified data layer supporting multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.
We plan to share some interesting use cases from Yahoo, including image classification, NSFW image detection, and automatic identification of eSports game highlights. We will offer an interactive demo of image auto captioning using CaffeOnSpark in a Hadoop based notebook.
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp… | Databricks
This session will explain how NetApp simplifies the process of analyzing IoT data with Apache Spark clusters across data centers and the cloud, using NetApp Private Storage (NPS) for AWS/Azure, NetApp Data Fabric and NetApp Connectors for NFS and S3. IoT data originates at the edge in different geographical locations, and it can arrive at different data centers or the cloud depending on sensor location. The challenge is how to combine these different data streams across different datacenters to generate wider insights.
Learn how NetApp Data Fabric helps solve this challenge. In the Data Fabric architecture, the IoT data is ingested via Kafka into an Apache Spark cluster running in AWS/Azure, but the data is stored in an NPS-provisioned NFS share through NFS Connector. The IoT data in NPS can then be moved to on-prem datacenters, or on-prem IoT data can be moved to NPS or ONTAP Cloud for processing in AWS/Azure using NetApp SnapMirror, FlexClone or NFS Connector. We’ll also review how NetApp StorageGRID object storage maintains IoT data for archival purposes using an S3 target. These options allow you to analyze IoT data from AWS, StorageGRID, HDFS or NFS, providing a feasible solution for deploying Spark clusters across datacenters.
Takeaways will include identifying Spark challenges that can be remedied by extending your Spark environment to take advantage of NPS; understanding how NPS and StorageGRID can provide a cost-effective alternative for dev/test, DR for Spark analytics; and understanding Spark architecture and deployment options that utilize data from multiple locations, including on-prem and cloud-based repositories.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg… | Chester Chen
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes, with the pipeline handling data on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques used. Key highlights include: how to understand bottlenecks in Spark applications; whether to cache your Spark DAG to avoid rereading your input data; how to effectively use accumulators to avoid unnecessary Spark actions; how to inspect heap and non-heap memory usage across hundreds of executors; how changing the layout of data can save long-term storage cost; how to effectively use serializers and compression to save network and disk traffic; and how to amortize the cost of your application by multiplexing jobs, along with different techniques for reducing memory footprint, runtime, and on-disk usage. The team was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
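The "to cache or not to cache" highlight above is a general recomputation trade-off: without caching, every action re-evaluates the lineage back to the source. Outside Spark it can be sketched with a call counter; this is a toy illustration, not Marmaray code.

```python
from functools import lru_cache

calls = {"n": 0}

def read_input():
    """Stands in for an expensive source read (Kafka scan, file parse)."""
    calls["n"] += 1
    return list(range(5))

# Without caching, every downstream "action" re-reads the input,
# the way an uncached Spark DAG re-evaluates its lineage per action.
a = sum(read_input())
b = len(read_input())
assert calls["n"] == 2

# Caching trades memory for the materialized result to avoid the re-read,
# like persisting an intermediate DataFrame used by several actions.
calls["n"] = 0
cached_read = lru_cache(maxsize=1)(read_input)
a = sum(cached_read())
b = len(cached_read())
assert calls["n"] == 1
```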
Web-Scale Graph Analytics with Apache® Spark™ | Databricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries, and hear about real-world applications.
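Connected components, the example algorithm above, reduces to union-find on a single machine; a sketch follows. (The distributed GraphFrames implementation uses different, research-based techniques for billion-node scale; this toy only shows what the algorithm computes.)

```python
def connected_components(n, edges):
    """Union-find with path halving over vertices 0..n-1.
    Returns a label per vertex; vertices in the same component
    share a label."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)          # union the two components

    return [find(x) for x in range(n)]

labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
# {0,1,2} share a label, {3,4} share another, {5} is alone
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4]
assert len({labels[0], labels[3], labels[5]}) == 3
```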
Learn how to properly shape partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
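Of the strategies listed, salting is the easiest to show in miniature: a hot key is split into k pseudo-keys so a hash partitioner can spread its records across partitions, at the cost of a second, cheap aggregation that strips the salt and combines partial results. A toy sketch with a stable hash (Spark's partitioner differs in detail):

```python
import zlib

def part(key, n_partitions):
    """Stable hash partitioner (CRC32, since Python's hash() is salted)."""
    return zlib.crc32(key.encode()) % n_partitions

records = [("hot", i) for i in range(1000)] + [(f"cold{i}", i) for i in range(100)]
N = 8

# Unsalted: every "hot" record hashes to the same partition -> skew.
unsalted = [0] * N
for k, _ in records:
    unsalted[part(k, N)] += 1

# Salted: append a per-record suffix, splitting "hot" into 8 pseudo-keys
# ("hot#0".."hot#7") that can land on different partitions.
salted = [0] * N
for i, (k, _) in enumerate(records):
    salted[part(f"{k}#{i % 8}", N)] += 1

assert max(unsalted) >= 1000          # skew: one partition holds all hot rows
assert sum(salted) == sum(unsalted)   # no records lost by salting
print("max partition load:", max(unsalted), "->", max(salted))
```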
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
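A pattern underlying several of the bullets above is maintaining state across micro-batches, which Spark Streaming exposes as updateStateByKey. The shape of that pattern can be sketched without Spark itself; the batch data here is hypothetical and stdlib-only.

```python
from collections import Counter

def update_state(running, batch):
    """Fold one micro-batch of word events into the running counts --
    the shape of Spark Streaming's updateStateByKey pattern."""
    running = running.copy()   # keep each batch's update pure
    running.update(batch)
    return running

# Three simulated micro-batches arriving over time.
batches = [["spark", "streaming"], ["spark"], ["fault", "tolerant", "spark"]]

state = Counter()
for batch in batches:
    state = update_state(state, batch)

assert state["spark"] == 3
assert state["streaming"] == 1
print(dict(state))
```

In real Spark Streaming the state is partitioned by key and checkpointed for fault tolerance; the fold itself is what the application author writes.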
Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the series of "Apache Spark Hands-on"
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
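The transformation/action split in the topic list is the heart of the RDD model: transformations build a lazy plan, and only actions force evaluation. A pyspark-free teaching sketch of that laziness (a toy, not Spark's implementation):

```python
class TinyRDD:
    """Lazy pipeline: map/filter record a plan; collect/count execute it."""
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    # -- transformations: return a new TinyRDD, compute nothing yet --
    def map(self, f):
        return TinyRDD(self._data, self._plan + (("map", f),))
    def filter(self, p):
        return TinyRDD(self._data, self._plan + (("filter", p),))

    # -- actions: run the recorded plan over the data --
    def collect(self):
        out = iter(self._data)
        for kind, f in self._plan:
            out = map(f, out) if kind == "map" else filter(f, out)
        return list(out)
    def count(self):
        return len(self.collect())

rdd = TinyRDD(range(10))
pipeline = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; the actions below trigger evaluation.
assert pipeline.collect() == [0, 4, 16, 36, 64]
assert pipeline.count() == 5
```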
Apache Spark 2.0: Faster, Easier, and Smarter | Databricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
This presentation gives an overview of SPEC Cloud (TM) IaaS 2016, the first industry-standard benchmark that measures the performance of infrastructure-as-a-service clouds. More details on the benchmark at https://www.spec.org/cloud_iaas2016/ .
A Survey of Container Security in 2016: A Security Update on Container Platforms | Salman Baset
This talk is an update on container security in 2016. It describes the security measures that containers provide, shows how containers offer out-of-the-box protections that are prone to configuration errors when running applications directly on a host, and finally lists the ongoing work on container security in the community.
Originally presented at API Strat and Practice conference in Boston 2016 by me and Mandy Whaley, this presentation shows the multiple archetypes that you could encounter while trying to govern APIs at your company.
Open Source Tools for Container Security and Compliance @Docker LA Meetup 2/13 | Zach Hill
A data- and policy-driven approach to container security and compliance using open-source Anchore. Presented at the Docker LA Meetup on 2/13/2017, including demos.
Docker containers & the Future of Drupal testing | Ricardo Amaro
Story of an investigation to improve cloud
The sad VirtualMachine story
Containers and non-containers
DEMO - Drupal Docker
Drupal Testbots story in a Glance
Docker as a testing automation factor
DEMO - Docker Testbot
Integration path
Introduction to Infrastructure as Code & Automation / Introduction to Chef | Nathen Harvey
Your customers expect you to continuously deliver delightful experiences. This means that you’ll need to continuously deliver application and infrastructure updates. Hand-crafted servers lovingly built and maintained by a system administrator are a thing of the past. Golden images are fine for initial provisioning but will quickly fail as your configuration requirements change over time.
It’s time for you to fully automate the provisioning and management of your infrastructure components. Welcome to the world of infrastructure as code! In this new world, you’ll be able to programmatically provision and configure the components of your infrastructure.
Disposable infrastructure whose provisioning, configuration, and ongoing maintenance are fully automated allows you to change the way you build and deliver applications. Move your applications and infrastructure towards continuous delivery.
In this talk, we’ll explore the ideas behind “infrastructure as code” and, specifically, look at how Chef allows you to fully automate your infrastructure. If you’re brave enough, we’ll even let you get your hands on some Chef and experience the delight of using Chef to build and deploy some infrastructure components.
Priming Your Teams For Microservice Deployment to the CloudMatt Callanan
You think of a great idea for a microservice and want to ship it to production as quickly as possible. Of course you'll need to create a Git repo with a codebase that reuses libraries you share with other services. And you'll want a build and a basic test suite. You'll want to deploy it to immutable servers using infrastructure as code that dev and ops can maintain. Centralised logging, monitoring, and HipChat notifications would also be great. Of course you'll want a load balancer and a CNAME that your other microservices can hit. You'd love to have blue-green deploys and the ability to deploy updates at any time through a Continuous Delivery pipeline. Phew! How long will it take to set all this up? A couple of days? A week? A month?
What if you could do all of this within 30 minutes? And with a click of a button soon be receiving production traffic?
Matt introduces "Primer", Expedia's microservice generation and deployment platform that enables rapid experimentation in the cloud, explains how it's caused unprecedented rates of learning, and shares tips and tricks on how to build one yourself, with practical takeaways for everyone from the startup to the enterprise.
Video: https://www.youtube.com/watch?v=Xy4EkaXyEs4
Meetup: http://www.meetup.com/Devops-Brisbane/events/225050723/
DOXLON November 2016 - Data Democratization Using SplunkOutlyer
In this session, Neil Roy Chowdhury - Lead Splunk Consultant @ Strft - looks at using Splunk to foster collaboration between dev and ops teams in a safe and secure way. We focus on the need for semantic logging and what part data models can play in everyone speaking the same language, not just for dev and ops teams, but for information security and other business areas too.
S.R.E - create ultra-scalable and highly reliable systemsRicardo Amaro
Site Reliability Engineering enables agility and stability.
SREs use Software Engineering to automate themselves out of the job.
My advice, if you want to implement this change in your company, is to start with action items, alter your training and hiring, implement error budgets, do blameless postmortems, and reduce toil.
https://events.drupal.org/dublin2016/sessions/sre-create-ultra-scalable-and-highly-reliable-systems
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
This talk presents how we accelerated deep learning processing from preprocessing to inference and training on Apache Spark at SK Telecom. At SK Telecom, we have half of the Korean population as our customers. To support them, we have 400,000 cell towers, which generate logs with geospatial tags.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Introduction to the Snowflake data warehouse and its architecture for a big data company. Centralized data management. Snowpipe and the COPY INTO command for data loading. Stream loading and batch processing.
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) Surendar S
This document provides very useful and meaningful concepts about SnapLogic, and will be especially useful for beginner/intermediate level SnapLogic learners.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Author: Stefan Papp, Data Architect at "The unbelievable Machine Company". An overview of big data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
Data gravity is a reality when dealing with massive amounts of data and globally distributed systems. Processing this data requires distributed analytics processing across InterCloud. In this presentation we will share our real-world experience with storing, routing, and processing big data workloads on Cisco Cloud Services and Amazon Web Services clouds.
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing you to write streaming applications quickly and easily. It supports both Java and Scala. In this workshop we are going to explore Apache Kafka, Zookeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
ScyllaDB Open Source 5.0 is the latest evolution of our monstrously fast and scalable NoSQL database – powering instantaneous experiences with massive distributed datasets.
Join us to learn about ScyllaDB Open Source 5.0, which represents the first milestone in ScyllaDB V. ScyllaDB 5.0 introduces a host of functional, performance and stability improvements that resolve longstanding challenges of legacy NoSQL databases.
We’ll cover:
- New capabilities including a new IO model and scheduler, Raft-based schema updates, automated tombstone garbage collection, optimized reverse queries, and support for the latest AWS EC2 instances
- How ScyllaDB 5.0 fits into the evolution of ScyllaDB – and what to expect next
- The first look at benchmarks that quantify the impact of ScyllaDB 5.0's numerous optimizations
This will be an interactive session with ample time for Q & A – bring us your questions and feedback!
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy to use APIs that unify, under one large umbrella, many different types of data processing workloads from ETL, to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
We will be showing the use case of the implementation of a Data Pipeline in the maritime domain @Windward via Spark applications.
The process was converting a Monolith application to a fully distributed and scalable application.
We'll be talking about all the tools and the process of taking an idea and developing Spark applications around it. We will show the development of an application end to end, from DevOps to the method of thinking about the development of applications, showing use cases and the lessons learned at Windward Ltd. I hope that after the talk you will have some more practical tools to "Spark" your way around.
Similar to Dissecting Open Source Cloud Evolution: An OpenStack Case Study
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Dissecting Open Source Cloud Evolution: An OpenStack Case Study
1. Dissecting Open Source Cloud Evolution: An OpenStack Case Study
Salman Baset, Chunqiang Tang, Byung Chul Tak, Long Wang
IBM T. J. Watson Research Center
June 26th, 2013
2. Open source cloud projects
IaaS
PaaS
SaaS
Broadly two types:
(1) Native (listed here)
(2) Adapters (e.g., Netflix on EC2)
S. Baset, CQ Tang, B. Tak, L. Wang 2
3. Timeline for cloud open source
[Timeline figure spanning 2001-2012, marking the launches of Amazon EC2, Google App Engine, and the open source cloud projects.]
4. Two characteristics of open source cloud systems
• Distributed multi-component architecture
– Example: OpenStack and Cloud Foundry have more than 10 components for their IaaS controllers
• Rapid development by a community of developers
5. Rapid development
• Open source cloud projects are being developed and released at a rapid pace
– OpenStack: releases every six months
– Eucalyptus: every four months
– OpenShift Enterprise: every four months
• Compare it to
– Linux kernel: 2-3 months (3.x – 3.(x+1))
– Ubuntu distro releases: every six months
• Major cloud providers are consuming OpenStack directly from the development trunk
– Two weeks behind the trunk
6. Why understand evolution?
• Evolution:
– A git commit or a major release
• Research perspective
– How do logical operations (e.g., create a VM) change across major versions?
• Developer perspective
– What is the impact of my committed changes?
• Provider perspective
– Continuous deployment and delivery
• How does a provider gain confidence in deploying a new release in production?
• What is the impact of new changes and configuration options on logical operations?
– Message flow, performance evaluation, fault injection, etc.
7. Methods for understanding evolution
• Static
– Source code
– Documentation
• Dynamic
– Log analysis
• Lab and/or production
– Tracing message flow
• With or without code instrumentation
• Automatic correlation of message flow with logs
• Lab and/or production
– Fault injection
– Performance study
• Lab
8. Our solution
• Without source code modification
– Tracing
– Tracing with log correlation
– Fault injection
• Other solutions
– Google Dapper (built RPC framework leveraging callbacks)
– Twitter Zipkin (attach identifiers to requests)
9. Summary of our solution: Tracing
• This simplified diagram shows one example path for one user request.
• A path is the series of system events such as RECEIVE and SEND across servers, captured using the LD_PRELOAD technique.
• Prior art: vPath constructs such a causal path of system activities initiated by user requests.
[Diagram: a request traverses an Apache web server, an application server, and a database server; on each host a monitoring agent sits between the application and the kernel and catches the handling thread's RECEIVE and SEND events, from which the causal path is reconstructed.]
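The path construction above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the event tuples are hypothetical monitoring-agent output, and the two rules applied follow the vPath idea — events in the same thread are causally ordered, and a SEND to a host happens before the next unmatched RECEIVE on that host from the sender.

```python
# Hypothetical event records as a monitoring agent might emit them:
# (host, thread_id, event, peer) in capture order. Format is illustrative.
EVENTS = [
    ("web", 1, "RECEIVE", "client"),
    ("web", 1, "SEND",    "app"),
    ("app", 7, "RECEIVE", "web"),
    ("app", 7, "SEND",    "db"),
    ("db",  3, "RECEIVE", "app"),
    ("db",  3, "SEND",    "app"),
    ("app", 7, "RECEIVE", "db"),
    ("app", 7, "SEND",    "web"),
]

def causal_edges(events):
    """Stitch per-thread SEND/RECEIVE events into causal edges (vPath-style).

    Rule 1: consecutive events in the same (host, thread) are causally ordered.
    Rule 2: a SEND to host H is matched with the next unmatched RECEIVE at H
            coming from the sending host.
    Returns edges as (earlier_index, later_index) pairs into `events`.
    """
    edges = []
    last_in_thread = {}   # (host, tid) -> index of previous event
    pending_sends = {}    # (dst_host, src_host) -> FIFO of SEND indices
    for i, (host, tid, ev, peer) in enumerate(events):
        key = (host, tid)
        if key in last_in_thread:
            edges.append((last_in_thread[key], i))  # program-order edge
        last_in_thread[key] = i
        if ev == "SEND":
            pending_sends.setdefault((peer, host), []).append(i)
        elif ev == "RECEIVE":
            queue = pending_sends.get((host, peer))
            if queue:
                edges.append((queue.pop(0), i))      # message edge
    return edges
```

Running `causal_edges(EVENTS)` links the web server's SEND to the application server's RECEIVE, and so on down to the database and back, yielding the request's causal path.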
10. Summary of our solution: Tracing with queues
• The path breaks if there are queues in the middle.
– The Apache web server inserts a message in the queue and returns
– The application server retrieves the message from the queue and performs work
– How do we correlate these messages?
• Augment path information with unique message information
– e.g., transaction ids
• Run only one logical operation in the system if no unique message information is available
[Diagram: the same three-tier path as before, but with a message queue between the Apache web server and the application server; the monitoring agents still catch per-thread RECEIVE/SEND events, and the path across the queue must be re-joined using message content.]
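Re-joining the two path halves across a queue can be sketched as a lookup on unique message content. The message format and transaction-id field here are illustrative assumptions, not the actual AMQP payloads:

```python
import json

# Hypothetical captured events on either side of a queue. When the queue breaks
# the syscall-level path, we fall back to a transaction id in the payload.
producer_sends = [
    {"host": "apache", "payload": json.dumps({"txn": "req-42", "op": "create_vm"})},
    {"host": "apache", "payload": json.dumps({"txn": "req-43", "op": "delete_vm"})},
]
consumer_receives = [
    {"host": "appserver", "payload": json.dumps({"txn": "req-43", "op": "delete_vm"})},
    {"host": "appserver", "payload": json.dumps({"txn": "req-42", "op": "create_vm"})},
]

def join_across_queue(sends, receives):
    """Pair each queue SEND with the RECEIVE carrying the same transaction id,
    returning txn -> (producer_host, consumer_host)."""
    txn = lambda m: json.loads(m["payload"])["txn"]
    recv_by_txn = {txn(m): m for m in receives}
    return {txn(s): (s["host"], recv_by_txn[txn(s)]["host"]) for s in sends}
```

If the messages carry no unique identifier, this matching is ambiguous — which is why the slide's fallback is to run only one logical operation at a time.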
11. Summary of our solution: Log Analysis
• Key idea
– Combine the log information and the causality (path) discovery technique
– Trace low-level system calls to infer causality and understand how an application executes
– Monitor log files and link log file entries to observed low-level system calls
– Linking the two together yields improved semantics for problem diagnosis
12. Diagram: Detecting Log Writes
• During a normal run,
– Maintain a mapping between fd and file name string
– Maintain a list of known/discovered log files
• On 'write' system calls,
– Check the parameters and see if it is a 'write' on one of the log files.
– If it is, and the data to be written contains alerting keywords such as 'ERROR', then this is a log write due to some error.
– This 'write' event will be annotated appropriately.
<Fragment of a path> Recv → Read → write → Send, where the write carries parameters fd=5, offset=2048, data="ERROR: …"

fd-to-log-file mapping:
fd   application  log file name
9    Websphere    /var/log/was.log
14   DB2          /var/log/db2/access.log
5    DB2          /usr/local/db2/fie22xlv.log
8    DB2          /usr/local/db2/fie23xlv.log
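The bookkeeping described on this slide can be sketched as a small class. The callback names and any keyword beyond 'ERROR' are illustrative assumptions, not the actual agent interface:

```python
import re

class LogWriteDetector:
    """Sketch of fd-to-filename bookkeeping for detecting error log writes:
    track open() of known log files, then annotate write() calls whose
    payload contains an alerting keyword."""

    # 'ERROR' is from the slide; FATAL/WARN are assumed additional keywords.
    ALERT = re.compile(r"ERROR|FATAL|WARN")

    def __init__(self, log_files):
        self.log_files = set(log_files)
        self.fd_map = {}  # fd -> log file name

    def on_open(self, fd, path):
        if path in self.log_files:
            self.fd_map[fd] = path

    def on_close(self, fd):
        self.fd_map.pop(fd, None)

    def on_write(self, fd, data):
        """Return an annotation for the path event, or None."""
        path = self.fd_map.get(fd)
        if path is not None and self.ALERT.search(data):
            return ("ERROR_LOG_WRITE", path, data)
        return None
```

A write of "ERROR: Record missing." to a tracked log fd gets annotated; routine writes and writes to untracked fds pass through unannotated.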
13. Fault Injection for Building up a Knowledge Base for Future Problem Diagnosis
• Inject errors, observe the application's behavior, and build a knowledge base for future problem diagnosis
– Alter the return value of a system call, e.g., to mimic a network communication error
– Observe the logging reaction.
– Repeat this for each system call and for each request.
– Accumulate the observed logging reactions as a knowledge base.
• When an error message is logged in a production system, use the knowledge base to infer the probability of different root causes
– Construct a Bayesian Belief Network for inference
• In the example figure, fault injection changes the return value of the 'Read' event to -1. This triggers an error to be logged at a later part of the path.
[Diagram: in the path Recv → Read → write → Send, fault injection alters the Read event's return value from 1024 to -1; a newly appeared write event with data="ERROR: Record missing." then shows up later in the path as the reaction to the injected error.]
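The injection loop on this slide can be sketched as follows. The trace, the replay model, and the logged message are toy stand-ins for the real system, where the return value is altered via system-call interception:

```python
def run_fault_injection(request_trace, replay):
    """Alter one system call's return value at a time, replay the request,
    and accumulate (observed log message -> injected faults) as a
    knowledge base for later root-cause inference."""
    kb = {}
    for i, syscall in enumerate(request_trace):
        logged = replay(request_trace, inject_at=i)  # logged line, if any
        if logged:
            kb.setdefault(logged, []).append(syscall)
    return kb

def toy_replay(trace, inject_at):
    # Toy application model: forcing the Read's return value to -1 makes
    # the application log an error; other faults are absorbed silently.
    if trace[inject_at] == "Read":
        return "ERROR: Record missing."
    return None

kb = run_fault_injection(["Recv", "Read", "write", "Send"], toy_replay)
# Seeing "ERROR: Record missing." in a production log can now be traced
# back to a failed Read as a candidate root cause.
```

The accumulated `kb` is the raw material for the Bayesian Belief Network mentioned above: each entry gives the faults known to produce a given log message.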
14. Brewing complexity: Evolution of OpenStack loc *

Release  Released  Nova     Cinder  Glance  Keystone  Quantum  Swift   Total
Austin   Oct 2010  17,288   -       -       -         -        12,979  30,627
Bexar    Feb 2011  27,734   -       3,629   -         -        16,014  47,377
Cactus   Apr 2011  43,947   -       4,927   -         -        16,665  65,539
Diablo   Sep 2011  66,395   -       9,961   12,451    -        15,591  91,947
Essex    Apr 2012  87,750   -       15,698  11,555    -        17,646  149,596
Folsom   Sep 2012  103,637  31,241  20,271  13,939    42,118   19,114  230,320
Grizzly  Apr 2013  120,968  49,797  21,261  20,071    60,485   23,035  321,081

* CRLF and not python loc
Methodology
wc -l `find . | grep -E '\.py$' | grep -v test | grep -v 'doc'`
wc -l `find . | grep -E '\.sh$' | grep test | grep -v 'doc'`
17. OpenStack tracing
• Understand OpenStack data and message flow for logical operations, e.g.,
– Create a VM
– Delete a VM
– List VMs
– Create a volume
– Add or remove a volume to/from a VM
– Create a floating IP address
– Add or remove a floating IP address from a VM
– Create or destroy a virtual network
• Understand
– REST calls
– Data flow
– AMQP flow
– Timing information
• Build data consistency tool
• Gather data for generating performance load
• Build a performance model
18. Key observations from tracing OpenStack (1/2)
• OpenStack is evolving very rapidly. Significant behavior changes from one release to another.
• Total tables
– Grizzly: 105 tables (160 with nova shadow tables), 53 in Diablo
• Creating a VM (grizzly)
– 139 SELECT queries, 37 INSERT queries, 74 UPDATE queries
– 12 tables are touched for INSERT and UPDATE
• In Diablo (Sep 2011), there were 450 SELECT, 4 INSERT, and 9 UPDATE queries
– 717K read, 458K write
– 655 send() calls to AMQP, 414 recv() calls
• Deleting a VM
– Only a single record is deleted from the database (the rest are archived)
• Request-id
– Instance and request-id are stored in the database (but only after updating quota) and before a request is sent to the scheduler.
• Quota management
– Entries are inserted in the database to indicate resource allocation for a VM. Negative or NULL entries are inserted for deallocation. Each quota entry has an expiration time (one day). E.g., core, fixedIP, etc.
• VM state and task state
– networking, block_device_mapping, spawning
• Keystone
– Token verification is optimized in Grizzly using caches (for flavor=keystone) and PKI
19. Key observations from tracing OpenStack (2/2)
• Development of a data consistency checking tool
– Orphan iptable rules (not associated with a VM transaction) => security holes
– Orphan data in tables due to errors in VM creation etc. => audit and clean up
– Orphan virsh data => audit and clean up
20. Methodology
• Run OpenStack in a machine (w/ and w/o timers disabled)
• Diablo, Essex, Folsom, Grizzly
• Ubuntu, RabbitMQ, MySQL
• Use curl to send API requests to OpenStack
– flavor=keystone
– Image has three parts
• AMI, ram disk, kernel image
– For keystone, PKI-based token verification was also used in grizzly
– Each service's tokens were created before issuing a create or delete VM call
• Use our technique to capture message interaction, generate flow, run message analytics, and insert faults (ongoing)
• curl_createserver.sh
AUTHTOKEN=$1
curl -i http://9.47.240.166:8774/v2/3283d689d02c41248fc82c202e82055a/servers -X POST \
  -H "X-Auth-Project-Id: admin" \
  -H "User-Agent: python-novaclient" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "X-Auth-Token: ${AUTHTOKEN}" \
  -d '{"server": {"name": "test1", "imageRef": "de8882fb-94b3-4105-a212-c0a7fd8ab1e9", "flavorRef": "1", "max_count": 1, "min_count": 1, "networks": [{"uuid": "48de54f9-2a60-4f28-9740-d6317086c32a"}] }}'
22. Keystone REST flow for creating a server (grizzly)
(Sequence diagram, flattened; participants: User, Keystone, nova-api, glance-api, glance-registry)
– User → Keystone: Credentials; Keystone → User: Token (role)
– User → Keystone: Get services and endpoints + token; Keystone → User: Services + endpoints
– User → nova-api: Token + CreateInstance; nova-api → Keystone: Verify + token
– nova-api → glance-api: Token + GetImage; glance-api → Keystone: Verify + token
– glance-api → glance-registry: Token + GetImage; glance-registry → Keystone: Verify + token; glance-registry → glance-api: image
– glance-api → nova-api: image
– nova-api: CreateInstance success; nova-api → User: Accepted
23. Create a VM: overview (1/4)
• Which OpenStack component is issuing SELECT queries?

Component (role)                         Diablo  Essex  Folsom-nova-net  Folsom-quantum  Grizzly-nova-net  Grizzly-quantum
keystone (auth)                          422     54     358              484             82                243
nova-api (API server)                    4       11     11               9               10                10
nova-compute (agent on compute node)     4       5      13               14              0                 0
nova-conductor (controller agent)        n/a     n/a    n/a              n/a             15                16
nova-network (network agent on compute)  13      19     17               n/a             20                n/a
nova-scheduler (scheduler)               1       2      1                1               4                 4
glance-registry (image registry server)  6       4      8                8               8                 8
quantum-server (network API server)      n/a     n/a    n/a              44              n/a               62
24. Create a VM: overview (2/4)
• How many HTTP requests with respect to SELECT calls? The second line per component lists the REST calls received (shown in red on the slide).

Component        Diablo         Essex   Folsom-nova-net  Folsom-quantum  Grizzly-nova-net  Grizzly-quantum
keystone         422            54      358              484             82                243
  REST rcvd      30 GET         9 GET   17 GET           23 GET          3 GET             6 GET, 2 POST
nova-api         4              11      11               9               10                10
  REST rcvd      1 POST         1 POST  1 POST           1 POST          1 POST            1 POST
nova-compute     4              5       13               14              0                 0
nova-conductor   n/a            n/a     n/a              n/a             15                16
nova-network     13             19      17               n/a             20                n/a
nova-scheduler   1              2       1                1               4                 4
glance-api       0              0       0                0               0                 0
  REST rcvd      2 GET, 5 HEAD  4 HEAD  8 HEAD           8 HEAD          8 HEAD            8 HEAD
glance-registry  6              4       8                8               8                 8
  REST rcvd      7 GET          4 GET   8 GET            8 GET           8 GET             8 GET
quantum-server   n/a            n/a     n/a              44              n/a               62
  REST rcvd      n/a            n/a     n/a              5 GET, 1 POST   n/a               9 GET, 1 POST
25. Why so many SELECT queries in keystone?
• In Diablo, for every keystone GET, 14 SELECT queries are issued, except for the first query (16)
• In Essex, for every keystone GET, 6 SELECT queries are issued
• In Folsom-nova-net/quantum, for every keystone GET, 21 SELECT queries are issued, except for the first query (22)
• In Grizzly-nova-net, 27 SELECT queries for each request except for the first (1).
– Keystone tokens are also cached, so subsequent queries do not result in full keystone token authentication
• If PKI token verification is used, the number of SELECT queries sent by keystone drops to 7 from 82.
26. Create a VM: overview (3/4)
• What if there is no keystone?

Keystone enabled:
        Diablo  Essex  Folsom-nova-net  Folsom-quantum  Grizzly-nova-net  Grizzly-quantum
SELECT  450     95     409              560             139               343
INSERT  4       4      23               24              37                40
UPDATE  6       10     60               58              74                70

Keystone disabled:
        Diablo  Essex  Folsom-nova-net  Folsom-quantum  Grizzly-nova-net  Grizzly-quantum
SELECT  28      41     51               76              57                100
INSERT  4       4      23               24              37                38
UPDATE  6       10     60               58              74                70
28. Tables touched for create VM in grizzly-nova-net

SELECT:
2 block_device_mapping
6 compute_node_stats
6 fixed_ips
1 floating_ips
8 images
4 instance_actions
2 instance_actions_events
1 instance_info_caches
4 networks
2 provider_fw_rules
5 quotas
4 quota_usages
2 reservations
7 role
1 security_group_rules
3 security_groups
4 virtual_interfaces

INSERT:
12 compute_node_stats
1 instance_actions
2 instance_actions_events
1 instance_id_mappings
1 instance_info_caches
1 instances
13 instance_system_metadata
4 reservations
1 security_group_instance_association
1 virtual_interfaces

UPDATE:
6 compute_nodes
44 compute_node_stats
3 fixed_ips
2 instance_actions_events
1 instance_info_caches
8 instances
8 quota_usages
2 reservations
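A per-table breakdown like the one above can be computed from the captured SQL with a small tally. A sketch assuming plain SQL text is available; the regexes cover only the simple statement shapes seen in these traces:

```python
import re
from collections import Counter

PATTERNS = {
    "SELECT": re.compile(r"SELECT\s.*?\sFROM\s+(\w+)", re.I | re.S),
    "INSERT": re.compile(r"INSERT\s+INTO\s+(\w+)", re.I),
    "UPDATE": re.compile(r"UPDATE\s+(\w+)", re.I),
}

def tables_touched(statements):
    """Tally which tables each statement type touches."""
    counts = {kind: Counter() for kind in PATTERNS}
    for sql in statements:
        for kind, pat in PATTERNS.items():
            match = pat.match(sql.strip())
            if match:
                counts[kind][match.group(1).lower()] += 1
                break
    return counts
```

Feeding the full create-VM trace through this produces tables of the shape shown above, one Counter per statement type.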
29. Data flow for creating a server (grizzly) (1/2)

Components: nova-api, nova-scheduler, nova-conductor, nova-compute

Create server – check quota:
INSERT INTO reservations (instances, expires, usageid1)
INSERT INTO reservations (ram, expires, usageid2)
INSERT INTO reservations (core, expires, usageid3)
– Create reservations. No request id. Default: expire after a day if not updated.
UPDATE quota_usages (usageid1)
UPDATE quota_usages (usageid2)
UPDATE quota_usages (usageid3)
– Update quotas. What if nova-api dies here? Then the quota updates can persist until the reservations expire or are cleaned up.

Check if images exist.

Create the instance in the database:
INSERT INTO instances ('instance_uuid')
INSERT INTO security_group_instance_association ('instance_uid')
INSERT INTO instance_system_metadata ('image_kernel_id', 'instance_uuid')
INSERT INTO instance_system_metadata ('instance_type_memory_mb')
INSERT INTO instance_system_metadata ('instance_type_swap')
INSERT INTO instance_system_metadata ('instance_type_vcpu_weight')
INSERT INTO instance_system_metadata ('instance_type_root_gb')
INSERT INTO instance_system_metadata ('instance_type_id')
INSERT INTO instance_system_metadata ('image_ramdisk_id')
INSERT INTO instance_system_metadata ('instance_type_name')
INSERT INTO instance_system_metadata ('instance_type_ephemeral_gb')
INSERT INTO instance_system_metadata ('instance_type_rxtx_factor')
INSERT INTO instance_system_metadata ('instance_type_flavorid')
INSERT INTO instance_system_metadata ('instance_type_flavorid')
INSERT INTO instance_system_metadata ('image_base_image_ref')
INSERT INTO instance_info_caches ('instance_uuid')
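The reservation/expiry behavior flagged above can be illustrated in isolation. A toy model of the pattern (class and method names are ours, and the store is in-memory; Nova does this against the reservations and quota_usages tables), showing why a crashed nova-api holds quota only until the reservation expires:

```python
RESERVATION_TTL = 24 * 3600  # default: reservations expire after a day

class Quota:
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0
        self.reservations = {}  # reservation id -> (amount, expires_at)

    def reserve(self, rid, amount, now):
        reserved = sum(a for a, _ in self.reservations.values())
        if self.in_use + reserved + amount > self.limit:
            raise RuntimeError("quota exceeded")
        self.reservations[rid] = (amount, now + RESERVATION_TTL)

    def commit(self, rid):
        amount, _ = self.reservations.pop(rid)
        self.in_use += amount

    def expire(self, now):
        # Periodic cleanup: reclaims quota if the caller died mid-request.
        for rid, (_, expires_at) in list(self.reservations.items()):
            if expires_at <= now:
                del self.reservations[rid]

q = Quota(limit=1)
q.reserve("r1", 1, now=0)        # nova-api reserves, then crashes...
q.expire(now=RESERVATION_TTL + 1)
assert q.reservations == {}      # ...quota is stuck only until expiry
```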
30. Data flow for creating a server (grizzly) (2/2)

Components: nova-api, nova-scheduler, nova-compute, nova-conductor, nova-network

INSERT INTO instance_id_mappings ('instance_uuid')
Update time in quota_usages table
INSERT INTO instance_actions (instance_uuid, request_id)
– This request is key: it associates the instance id with a request id, but it occurs after the quota and reservations have already been updated. BAD!!!
Send to scheduler (request_id)
INSERT INTO instance_actions_events (scheduling)
INSERT INTO instance_actions_events (compute_run)
Libvirt – create instance
UPDATE instances (task_state = NULL)
GET images from glance
UPDATE instances (host, node)
UPDATE compute_node_stats *
INSERT INTO compute_node_stats
UPDATE instances (task_state = networking)
31. How many SQL queries for create VM before a request is sent to:

scheduler:
        Diablo  Essex  Folsom-nova-network  Folsom-quantum  Grizzly-nova-network  Grizzly-quantum
SELECT  202     10     27                   289             98                    138
INSERT  0       0      3                    10              21                    21
UPDATE  0       0      3                    9               7                     7

compute:
        Diablo  Essex  Folsom-nova-network  Folsom-quantum  Grizzly-nova-network  Grizzly-quantum
SELECT  371     52     292                  290             100                   140
INSERT  3       3      10                   10              22                    22
UPDATE  1       2      10                   10              8                     8

Total for the entire create VM:
        Diablo  Essex  Folsom-nova-network  Folsom-quantum  Grizzly-nova-network  Grizzly-quantum
SELECT  450     95     409                  560             139                   343
INSERT  4       4      23                   24              37                    40
UPDATE  6       10     60                   58              74                    70
32. Create VM total message bytes – read() or recv()
(Excludes any image transfer.)

                 Diablo           Essex            Folsom-nova-network  Folsom-quantum   Grizzly-nova-network
keystone         154841           23090            198493               269920           41888
nova-api         65596            81836            75507                21435            22766
nova-compute     155233 (113701)  157660 (105460)  202163 (163107)      206003 (167383)  106396 (110721)
nova-conductor   n/a              n/a              n/a                  n/a              371614
nova-network     98101            77184            62509                n/a              103100
nova-scheduler   3380             38477            16465                19688            29674
glance-registry  36764            16632            45798                46104            30494
glance-api       17440            6326             32386                32716            11248
quantum-server   n/a              n/a              n/a                  46533            n/a
quantum-dhcp     n/a              n/a              n/a                  3722             n/a
Total            531355           401205           582185               650615           717180
33. Create VM total message bytes – write() or send()

                 Diablo   Essex   Folsom-nova-network  Folsom-quantum   Grizzly-nova-network
keystone         115606   15129   128957               174884           25364
nova-api         50704    70995   25449                20265            22693
nova-compute     99899    109136  127436               126143 (122363)  74864 (68352)
nova-conductor   n/a      n/a     n/a                  n/a              222228
nova-network     74106    63446   46123                n/a              57321
nova-scheduler   2964     30182   17662                21993            26997
glance-registry  23095    11006   18210                18196            20329
glance-api       8841     5038    10226                10220            8705
quantum-server   n/a      n/a     n/a                  25986            n/a
quantum-dhcp     n/a      n/a     n/a                  84               n/a
Total            375447   305156  374499               403507           458501
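Byte totals like those in the two tables above can be gathered by intercepting read()/recv() and write()/send() per process, e.g. with strace. A sketch that sums return values from strace-style lines (the exact line format is an assumption about strace's output, which varies by version and flags):

```python
import re

# Matches lines such as:  read(12, "..."..., 16384) = 837
SYSCALL_RE = re.compile(r"\b(read|recv|recvfrom|write|send|sendto)\(.*\)\s*=\s*(\d+)")

def bytes_by_direction(lines):
    """Sum syscall return values, split into received vs sent bytes."""
    totals = {"recv": 0, "send": 0}
    for line in lines:
        m = SYSCALL_RE.search(line)
        if m:
            direction = "recv" if m.group(1) in ("read", "recv", "recvfrom") else "send"
            totals[direction] += int(m.group(2))
    return totals
```

Running one such tally per traced process yields exactly the per-component rows shown above.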
34. Create a VM: message exchange with RabbitMQ – send()
Cells are message count (total bytes).

                Diablo     Essex      Folsom-nova-network  Folsom-quantum  Grizzly-nova-network
nova-api        23 (3392)  35 (4769)  23 (8600)            11 (5254)       11 (4062)
nova-compute    18 (1316)  18 (1430)  18 (3782)            1 (21)          306 (67874)
nova-network    31 (1816)  45 (1018)  32 (2159)            n/a             14 (1786)
nova-scheduler  23 (2392)  12 (2976)  12 (7388)            12 (9737)       7 (11567)
nova-conductor  n/a        n/a        n/a                  n/a             317 (82717)
quantum-server  n/a        n/a        n/a                  36 (4498)       n/a
quantum-dhcp    n/a        n/a        n/a                  4 (84)          n/a
35. Create a VM: message exchange with RabbitMQ – recv()
Cells are message count (total bytes).

                Diablo     Essex      Folsom-nova-network  Folsom-quantum  Grizzly-nova-network
nova-api        16 (833)   25 (1609)  16 (833)             7 (328)         7 (328)
nova-compute    14 (3442)  14 (2369)  14 (8752)            1 (9479)        230 (94463)
nova-network    18 (1808)  26 (3045)  19 (7298)            n/a             8 (2699)
nova-scheduler  8 (2479)   8 (2918)   8 (5307)             8 (5345)        4 (3861)
nova-conductor  n/a        n/a        n/a                  n/a             172 (58721)
quantum-server  n/a        n/a        n/a                  24 (396)        n/a
quantum-dhcp    n/a        n/a        n/a                  4 (3726)        n/a
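The "count (bytes)" cells in the two RabbitMQ tables reduce to a simple aggregation over per-message trace events. A sketch with an event-tuple format of our own devising:

```python
from collections import defaultdict

def amqp_summary(events):
    """events: (component, direction, nbytes) tuples, one per AMQP message.
    Returns "count (bytes)" cells per component and direction."""
    agg = defaultdict(lambda: {"send": [0, 0], "recv": [0, 0]})
    for component, direction, nbytes in events:
        cell = agg[component][direction]
        cell[0] += 1        # message count
        cell[1] += nbytes   # total payload bytes
    return {comp: {d: "%d (%d)" % (n, b) for d, (n, b) in cells.items()}
            for comp, cells in agg.items()}

events = [("nova-api", "send", 3000), ("nova-api", "send", 392),
          ("nova-api", "recv", 328)]
# amqp_summary(events)["nova-api"] == {"send": "2 (3392)", "recv": "1 (328)"}
```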
36. Create a VM: send() and recv() call counts, grizzly-nova-net

recv() calls:
2176 nova-compute
172 nova-conductor
1667 glance-api
139 glance-registry
3 keystone
5429 nova-api
12 nova-network
4 nova-scheduler

send() calls:
308 nova-compute
317 nova-conductor
17 glance-api
9 glance-registry
3 keystone
19 nova-api
19 nova-network
7 nova-scheduler

The very large recv() counts come from single-byte recv() calls in the webob library.
37. Conclusions
• Complexity is brewing under OpenStack. Beware!
• Build distributed applications with tracing in mind
• Flow diff
– Through an interactive page
• Ongoing and future work
– Fault injection and log correlation
– Leverage the tool for other projects, e.g., CloudFoundry