Altair PBS Works Suite is the industry's most advanced suite of software for high-performance computing. It includes PBS Access, focused on engineers and researchers, and PBS Control, aimed at administrators and HPC managers.
What does a typical day as an SRE look like? In this presentation I will discuss the challenges we face while running a SaaS platform that is used 24/7/365 around the globe. In doing so, we have embraced the core principles described in the Google SRE handbook. While we are not Google by any means, most of the principles apply to our daily work one way or another. Having a fully distributed team running a distributed system can be quite challenging. In this talk I’ll be covering:
- Core SRE principles
- How Instana has applied them to our daily work
- Lessons learned along the way
In our experience, many problems with production workflows can be traced back to unexpected values in the input data. In a complex pipeline, it can be difficult and costly to trace the root cause of errors. Here we outline our work developing an open source data validation framework built on Apache Spark. Our goal is a tool that easily integrates into existing workflows to automatically make data validation a vital initial step of every production workflow. Our tool is aimed at data scientists and data engineers, who are not necessarily Scala/Python programmers. Our users specify a configuration file that details the data validation checks to be completed. This configuration file is parsed into appropriate queries that are executed with Apache Spark. A status report is logged, which is used to notify developers/maintainers and to establish a historical record of validator checks. This work was inspired by the many great ideas behind Google's TensorFlow Extended (TFX) platform, in particular TensorFlow Data Validation (TFDV). As such we provide optional functionality for our users to visualize their data using Facets Overview and Facets Dive.
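The configuration-to-query idea can be illustrated with a small PySpark sketch. This is a hypothetical simplification, not the actual framework: the rule format, table, and column names are invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("validator-sketch").getOrCreate()

# Normally parsed from a user-supplied configuration file
config = [
    {"table": "events", "column": "latency_ms", "check": "min", "threshold": 0},
]

for rule in config:
    df = spark.table(rule["table"])  # assumes the table is registered
    observed = df.agg({rule["column"]: rule["check"]}).first()[0]
    status = "PASS" if observed >= rule["threshold"] else "FAIL"
    # the logged status feeds notifications and the historical record
    print(f"{rule['table']}.{rule['column']} {rule['check']}={observed}: {status}")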
A Practical Enterprise Feature Store on Delta Lake – Databricks
The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and has increased the rate of production model delivery by 15x. The approach described focuses on practicality. It is informed by innovative approaches such as Feast, but our primary goal is evolutionary extensions of existing patterns that can be applied to any Delta Lake architecture.
Key Takeaways:
– Understand the key use cases that motivate the feature store from both a data science and engineering perspective.
– Consider edge cases where there may be opportunities for simplification such as “online” predictions.
– Review a typical logical data model for a feature store and how that can be applied to your business domain.
– Consider options for physical storage of the feature store in the Delta Lake.
– Understand common access patterns including metadata-based feature discovery.
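As a rough illustration of the physical-storage takeaway above, here is a minimal PySpark sketch of the basic pattern: features kept in a keyed Delta table that a labeled "spine" joins against at training time. The paths, column names, and data are hypothetical, and a Delta-enabled Spark environment is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-store-sketch").getOrCreate()

features = spark.createDataFrame(
    [("cust-1", 0.42, 3), ("cust-2", 0.11, 7)],
    ["customer_id", "spend_ratio", "txn_count_30d"],
)

# One row per entity key; Delta gives versioned, upsert-friendly storage
features.write.format("delta").mode("overwrite").save("/lake/features/customer")

# Training-time access: join features to a labeled spine by entity key
spine = spark.createDataFrame([("cust-1", 1)], ["customer_id", "label"])
training = spine.join(
    spark.read.format("delta").load("/lake/features/customer"),
    on="customer_id", how="left",
)
training.show()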
Meetup: Streaming Data Pipeline Development – Timothy Spann
In this interactive session, Tim will lead participants through how to best build streaming data pipelines. He will cover how to build applications from some common use cases and highlight tips, tricks, best practices and patterns.
He will show how to build the easy way and then dive deep into the underlying open source technologies including Apache NiFi, Apache Flink, Apache Kafka and Apache Iceberg.
If you wish to follow along, please download open source projects beforehand. You can also download this helpful streaming platform: https://docs.cloudera.com/csp-ce/latest/installation/topics/csp-ce-installing-ce.html
All source code and slides will be shared for those interested in building their own FLaNK Apps. https://www.flankstack.dev/
You can join the meeting virtually here:
https://cloudera.zoom.us/j/91603330726
Speaker - Tim Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, big data, IoT, cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal, and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on big data, cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit, and many more. He holds a BS and MS in computer science.
An Approach to Data Quality for Netflix Personalization Systems – Databricks
Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests.
The 'macro view' on BigQuery:
We start with an overview and some typical uses, then move to project hierarchy, access control, and security.
At the end, we touch on tools and demos.
Splunk: Druid on Kubernetes with Druid-operator – Imply
We went through the journey of deploying Apache Druid clusters on Kubernetes (K8s) and created a druid-operator (https://github.com/druid-io/druid-operator). This talk introduces the Druid Kubernetes operator, how to use it to deploy Druid clusters, and how it works under the hood. We will share how we use this operator to deploy Druid clusters at Splunk.
Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Druid is a complex stateful distributed system, and a Druid cluster consists of multiple web services such as Broker, Historical, Coordinator, Overlord, and MiddleManager, each deployed with multiple replicas. Deploying a single web service on K8s requires creating a few K8s resources via YAML files, and this multiplies with the multiple services inside a Druid cluster. Doing it for multiple Druid clusters (dev, staging, and production environments) makes it even more tedious and error-prone.
K8s enables the creation of application-specific extensions (for applications such as Druid), called “Operators”, that combine Kubernetes and application-specific knowledge into a reusable K8s extension that makes deploying complex applications simple.
Operating PostgreSQL at Scale with Kubernetes – Jonathan Katz
The maturation of containerization platforms has changed how people think about creating development environments and has eliminated many inefficiencies for deploying applications. These concepts and technologies have made their way into the PostgreSQL ecosystem as well, and tools such as Docker and Kubernetes have enabled teams to run their own “database-as-a-service” on the infrastructure of their choosing.
All this sounds great, but if you are new to the world of containers, it can be very overwhelming to find a place to start. In this talk, which centers around demos, we will see how you can get PostgreSQL up and running in a containerized environment with some advanced sidecars in only a few steps! We will also see how it extends to a larger production environment with Kubernetes, and what the future holds for PostgreSQL in a containerized world.
We will cover the following:
* Why containers are important and what they mean for PostgreSQL
* Create a development environment with PostgreSQL, pgadmin4, monitoring, and more
* How to use Kubernetes to create your own "database-as-a-service"-like PostgreSQL environment
* Trends in the container world and how they will affect PostgreSQL
At the conclusion of the talk, you will understand the fundamentals of how to use container technologies with PostgreSQL and be on your way to running a containerized PostgreSQL environment at scale!
This is a presentation of the popular NoSQL database Apache Cassandra which was created by our team in the context of the module "Business Intelligence and Big Data Analysis".
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ... – DevGAMM Conference
This talk covers the journey of designing and implementing a data platform for the game analytics industry. I will describe the modern data stack: what tools and approaches are available on the market, and how leading game companies engineer data analytics solutions and make better games with data insights.
This presentation describes how to configure and leverage ProxySQL with
AWS Aurora,
Azure Database for MySQL
and CloudSQL for MySQL.
It details the various benefits, configuration, and monitoring.
Benchmarking is hard. Benchmarking databases, harder. Benchmarking databases that follow different approaches (relational vs document) is even harder.
But the market demands these kinds of benchmarks. Despite the different data models that MongoDB and PostgreSQL expose, many organizations face the challenge of picking either technology. And performance is arguably the main deciding factor.
Join this talk to discover the numbers! After $30K spent on public cloud and months of testing, there are many different scenarios to analyze. Benchmarks in three distinct categories have been performed: OLTP, OLAP, and a comparison of MongoDB 4.0 transaction performance with PostgreSQL's.
What would be faster, MongoDB or PostgreSQL?
This tutorial covers all parallel replication implementations in MariaDB 10.0 and 10.1 and MySQL 5.6, 5.7 and 8.0 (including how it works in Group Replication).
MySQL and MariaDB have different types of parallel replication. In this tutorial, we present the different implementations that allow us to understand their limitations and tuning parameters. We cover how to make parallel replication faster and what to avoid for maximizing its benefits. We also present tests from Booking.com workloads.
Some of the subjects covered are group commit and optimistic parallel replication in MariaDB, the parallelism interval of MySQL and its Write Set optimization, and the “slowing down the master to speed up the slave” optimization.
After this tutorial, you will know everything you need to implement and tune parallel replication in your environment. But more importantly, we will show how you can test the benefits of parallel replication in a non-disruptive way before deployment.
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#chella
Audit logging is one of the most critical features in an enterprise-ready database in terms of security compliance. Furthermore, live traffic troubleshooting is critical for operators to troubleshoot production issues quickly. While past versions have lacked these critical features, the Cassandra team understood the need for better solutions, and in the upcoming release of Cassandra both of these features now come out of the box, which makes Cassandra even more awesome to work with. Cassandra now supports audit logging and query logging as part of C* itself. In this talk, the audience will learn how to enable, configure, and tune audit logging for their C* clusters, and how to log live traffic/queries for several needs, including troubleshooting or even live traffic replay.
Fully Utilizing Spark for Data Validation – Databricks
Data validation is becoming more important as companies have increasingly interconnected data pipelines. Validation serves as a safeguard to prevent existing pipelines from failing without notice. Currently, the most widely adopted data validation framework is Great Expectations. They have support for both Pandas and Spark workflows (with the same API). Great Expectations is a robust data validation library with a lot of features. For example, Great Expectations always keeps track of how many records are failing a validation, and stores examples for failing records. They also profile data after validations and output data documentation.
These features can be very useful, but if a user does not need them, they are expensive to generate. What are the options if we need a more lightweight framework? Pandas has some data validation frameworks that are designed to be lightweight. Pandera is one example. Is it possible to use a lightweight Pandas-based framework on Spark? In this talk, we’ll show how this is possible with a library called Fugue. Fugue is an open-source framework that lets users port native Python code or Pandas code to Spark. We will show an interactive demo of how to extend Pandera (or any other Pandas-based data validation library) to a Spark workflow.
There is also a deficiency in the current frameworks we will address in the demo. With big data, there is a need to apply different validation rules for each partition. For example, data that encompasses a lot of geographic regions may have different acceptable ranges of values (think of currency). Since the current frameworks are designed to apply a validation rule to the whole DataFrame, this can’t be done. Using Fugue and Pandera, we can apply different validation rules on each partition of data.
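A minimal sketch of that per-partition pattern, assuming current Pandera and Fugue APIs (the schemas, column names, data, and ranges are invented for illustration):

import pandas as pd
import pandera as pa
from fugue import transform

# One validation schema per partition value (per-region price ranges)
schemas = {
    "US": pa.DataFrameSchema({"price": pa.Column(float, pa.Check.in_range(0, 1_000))}),
    "JP": pa.DataFrameSchema({"price": pa.Column(float, pa.Check.in_range(0, 150_000))}),
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Fugue hands each partition to this function as a Pandas DataFrame,
    # so a plain Pandas-based validator works unchanged.
    return schemas[df["region"].iloc[0]].validate(df)

df = pd.DataFrame({"region": ["US", "US", "JP"], "price": [9.99, 500.0, 120_000.0]})

validated = transform(df, validate, schema="*", partition={"by": "region"})
# Passing engine=spark (a SparkSession) would run the same validation on Spark.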
An AI-Powered Chatbot to Simplify Apache Spark Performance Management – Databricks
Sarah: My Spark SQL query failed. How can I fix it?
Jeeves: Your Spark query driver went out of memory.
Jeeves: You can set spark.driver.memory to 2.2GB and rerun the query to complete it successfully.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of performance problems quickly. Instead of just being stuck to screens displaying performance logs and metrics, users can now have a more refreshing experience and consume performance insights via a two-way conversation with their own personal Spark expert. This talk will give an overview of the chatbot, its architecture, and how it fits into a complex Spark environment. The chatbot connects to a large number of sources to get the data that powers its AI algorithms. It can detect anomalies in performance and push key insights via alerts to users when they need them the most. The chatbot can also be told to take actions like creating tickets and making configuration changes. You will learn how to build chatbots that tackle your complex data operations challenges with AI algorithms and automation, keeping a cool head at all times.
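For reference, the fix Jeeves proposes corresponds to a one-line Spark setting. A hypothetical illustration; note that spark.driver.memory must be set before the driver JVM starts, so in practice it is passed at launch (e.g., spark-submit --driver-memory 2200m) or when building the session in a fresh process:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("retry-after-oom")
    .config("spark.driver.memory", "2200m")  # roughly the 2.2GB suggested above
    .getOrCreate()
)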
Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?
In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into the architecture of our current data quality solution. We’ll cover what worked, what didn’t work so well, and what we're working on next. We’ll conclude with some tips & lessons learned for ensuring high quality on big data.
This talk was presented at DataWorks/Hadoop Summit 2017 on June 13, 2017.
In their webinar "Big Data Fabric 2.0 Drives Data Democratization", Ben Szekley, Cambridge Semantics’ SVP of Field Operations, and guest speaker, Forrester’s Noel Yuhanna, author of the Forrester report “Big Data Fabric 2.0 Drives Data Democratization”, explored why data-driven businesses are making a big data fabric part of their data strategy: to minimize data complexity, integrate siloed data, deliver real-time trusted insights, and create new business opportunities. These are the slides from that webinar.
This is a presentation from Bengaluru TechDay (October 2019) for Oracle database admins and architects, presented by Karthik P R (CEO, Mydbops). He explains the possible high availability options in the MySQL ecosystem.
https://www.meetup.com/All-India-Oracle-Users-Group-Bangalore-Chapter/events/265252214/
This introductory workshop is aimed at data analysts & data engineers who are new to Apache Spark, and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
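To give a flavor of the labs listed above, here is a tiny self-contained example of the two APIs side by side (names and data invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workshop-sketch").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 36)], ["name", "age"])
df.createOrReplaceTempView("people")

# The same question answered with the DataFrame API and with Spark SQL
df.where(df.age > 35).select("name").show()
spark.sql("SELECT name FROM people WHERE age > 35").show()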
goto; London: Keeping your Cloud Footprint in Check – Coburn Watson
Presented on the "Lean" track at goto; London September 17th, 2015. Covers how Netflix manages cloud cost efficiency in light of innovation and reliability drivers.
This tutorial gives a brief and interesting introduction to modern stream computing technologies. Participants can learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial unveils the key fundamentals behind various kinds of design choices. The last section also offers some forecasts of technology developments in this domain.
This talk gives a quick intro to what’s come to be known as the software-defined data center. Its enablers are recent hardware trends combined with advances in software technology that together allow creating an infrastructure that makes life a lot easier for operations.
AME-1934: Enable Active-Active Messaging Technology to Extend Workload Balan... – wangbo626
Session Type : Breakout Session
Date/Time : Thu, 26-Feb, 10:30 AM-11:30 AM
Venue : Mandalay Bay
Room : Surf Ballroom E
Description:
Active-Active is the target model of the modern data center. Its successful adoption involves not only the mainframe but also heterogeneous and peripheral distributed platforms, which makes it much more complex to implement. Data synchronization is at the heart of the various Active-Active technologies, and messaging technology has been chosen for its implementation.
This session gives an overview of Active-Active technologies on both z and distributed platforms, highlights how Active-Active delivers the benefits of both high availability and workload balancing, and discusses China customer cases implementing messaging-based Active-Active.
HHM-3474: MQ messaging technologies and support for high availability and acti... – Pete Siddall
Active-Active is the target messaging model for the modern data center. But its successful adoption must encompass not only the mainframe, but also heterogeneous and peripherally distributed platforms, which makes it much more complex to implement. Data synchronization is at the heart of the various Active-Active technologies, and the right messaging technology must therefore be chosen for its implementation. This session gives an overview of Active-Active technologies on both z Systems and distributed platforms. It highlights how Active-Active provides the benefits of both high availability and workload balancing. We will also discuss customer cases on how to implement messaging-based Active-Active.
HPC and cloud distributed computing, as a journey – Peter Clapham
Introducing an internal cloud brings new paradigms, tools, and infrastructure management. When placed alongside traditional HPC, the new opportunities are significant. But getting to the new world with micro-services, autoscaling, and autodialing is a journey that cannot be achieved in a single step.
Why is building a big data platform hard? What are the key aspects involved in providing a "Serverless" experience for data folks? And how does Databricks solve the infrastructure problems and provide the "Serverless" experience?
It’s one thing to support many data sources with megabytes of data. It’s a completely different problem supporting thousands of data sources with terabytes of data every day. How do you create systems that scale infinitely?
The answer is: you don’t. You cannot design for infinite scalability. Rather, consider a pod approach where each pod supports a defined capacity. Scalability results from the deployment of multiple cooperating pods.
Systems handling extremely large data sources with significant processing requirements are difficult at best to validate. Attempting to deploy such a system without well understood capacity limits is destined for failure.
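A toy Python sketch of the pod idea (the capacity number and names are hypothetical): each pod has a validated capacity, and scaling out means deploying another pod rather than growing one without bound.

from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    capacity: int            # max data sources this pod is validated for
    sources: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.sources) < self.capacity

pods = [Pod("pod-1", 1000)]

def assign(source: str) -> Pod:
    for pod in pods:
        if pod.has_room():
            pod.sources.append(source)
            return pod
    # No room: scale out by deploying another pod of known capacity
    new = Pod(f"pod-{len(pods) + 1}", 1000)
    new.sources.append(source)
    pods.append(new)
    return new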
This was first presented at Cloud Expo NYC.
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe... – Lightbend
In this guest webinar with Chris McDermott, Lead Data Engineer at HPE, learn how HPE InfoSight–powered by Lightbend Platform–has emerged as the go-to solution for providing real-time metrics and predictive analytics across various network, server, storage, and data center technologies.
In-Stream Processing Service Blueprint, Reference architecture for real-time ... – Grid Dynamics
What is it about? In-Stream Event Processing is a new approach for building near real-time big data systems, with a rapidly growing user base and applications like clickstream analytics, preventive maintenance, and fraud detection. The maturity of some open source projects enables building an enterprise-grade In-Stream Processing service in-house. However, the open source world comprises many competing projects of different maturity and different perspectives, so the task of selecting effective and efficient projects is not straightforward. In the talk I’ll present a blueprint of an In-Stream Processing Service: enterprise-grade, reliable and scalable, cloud-ready, and built from 100% open source components.
LC3 Beijing, June 26 2018 – Sahdev Zala, Guangya
Our slide deck, used at LinuxCon+ContainerCon+CLOUDOPEN China 2018, on Kubernetes cluster design considerations and our journey to a 1000+ node single cluster with IBM Cloud.
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing – HPC DAY
HPC DAY 2017 - http://www.hpcday.eu/
Altair's PBS Pro: Your Gateway to HPC Computing
Dr. Jochen Krebs | Director Enterprise Sales Central & Eastern Europe at Altair
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch, and process events. Until now, supporting all of these needs has required a different system for each task: stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in the development and operation of such applications, raising the barrier to adoption in the enterprise. In this talk, Karthik will outline the need to unify these capabilities in a single system and make it easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next-generation distributed pub-sub system that was originally developed and deployed at Yahoo and is running in production in more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming, and lightweight compute for events. Furthermore, he will provide real-life use cases showing how Apache Pulsar is used for event processing, ranging from data processing tasks to web processing applications.
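As a flavor of that unified model, here is a minimal sketch with the Apache Pulsar Python client (pip install pulsar-client); the broker address, topic, and subscription name are placeholders:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/events")
producer.send(b"hello pulsar")

consumer = client.subscribe(
    "persistent://public/default/events",
    subscription_name="demo-sub",   # Exclusive by default; Shared gives queuing semantics
)
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()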
Similar to 20 Altair PBS Professional Features in 20 minutes, 2018
How to Position Your Globus Data Portal for Success: Ten Good Practices – Globus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Enhancing Research Orchestration Capabilities at ORNL – Globus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Advanced Flow Concepts Every Developer Should Know – Peter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... – Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
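A hedged sketch of the remote-triggering step with the Globus Compute SDK (pip install globus-compute-sdk); the endpoint UUID is a placeholder, and the function body stands in for an actual vLLM call on the remote system:

from globus_compute_sdk import Executor

def run_inference(prompt: str) -> str:
    # Executed on the remote endpoint; a real version would invoke vLLM here
    return f"echo: {prompt}"

ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

with Executor(endpoint_id=ENDPOINT_ID) as ex:
    future = ex.submit(run_inference, "Hello from my laptop")
    print(future.result())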
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge of how to organize and improve your code review process.
Globus Compute with IRI Workflows - GlobusWorld 2024 – Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of this work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
SOCRadar Research Team: Latest Activities of IntelBroker – SOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what has happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
A Comprehensive Look at Generative AI in Retail App Testing – kalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Designing for Privacy in Amazon Web Services – KrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed us to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv... – Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Why React Native as a Strategic Advantage for Startup Innovation – ayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities, among other benefits. With React Native, developers can write code once and run it on both iOS and Android devices, saving time and resources and leading to shorter development cycles and hence a faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR – Tier1 app
Even though at the surface level ‘java.lang.OutOfMemoryError’ appears to be one single error, underneath there are 9 types of OutOfMemoryError. Each type has different causes, diagnosis approaches, and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Top Nidhi software solution free download – vrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... – Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Accelerate Enterprise Software Engineering with Platformless – WSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce? – XfilesPro
Worried about document security while sharing them in Salesforce? Fret no more! Here are the top-notch security standards XfilesPro upholds to ensure strong security for your Salesforce documents while sharing with internal or external people.
To learn more, read the blog: https://www.xfilespro.com/how-does-xfilespro-make-document-sharing-secure-and-seamless-in-salesforce/
2. Feature Overview
USABILITY | RECLAIM RESOURCES | AUTOMATIONS | AUTO HEALTH CHECK | ASYNC THROUGHPUT
FLEXI RESERVATIONS | ALLOCATION MGMT | ARM64 READY | BURST BUFFER READY | NVIDIA DCGM READY
HIGH AVAILABILITY | OS PROVISIONING | MULTI SCHEDULER | DYNAMIC RESOURCES | HOOKS
ENERGY AWARE | TOPOLOGY AWARE | CLOUD BURSTING | CONTAINERS | CGROUPS
3. Hooks
• PBS Plugin (“Hooks”) Framework
• Unified data model built on industry-standard Python
• Augment core capabilities on-the-fly
• No re-compiling required, preserving PBS Pro core stability
• Hook events at major state transition points
• Use cases
• Routing jobs
• Managing job resource requests
• Managing access to resources for users and jobs
• Ensuring efficient use of resources
• Ensuring that jobs run properly
• Converting requests to usable format
• Controlling interactive jobs
• Communicating information to users
• Helping to schedule jobs
• Managing user activity
• Enabling accounting and validation
• Allocation management
• Helping manage job execution
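To make the framework concrete, here is a minimal sketch of a queuejob hook in the hooks' Python API; the walltime policy is an invented example, and deployment (e.g., via qmgr) is omitted:

import pbs

e = pbs.event()          # the triggering event (here: a job being queued)
job = e.job

if job.Resource_List["walltime"] is None:
    # reject returns the job to the user with a message
    e.reject("Please request a walltime, e.g. qsub -l walltime=01:00:00")
else:
    e.accept()           # let the job continue through the server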
4. Dynamic Resources
• Represent elements that are outside of the control of PBS
• Modular
• Scalable
• Rich rules with hooks
• License as a resource
• Global license Managers
• Storage
• User quotas
• Scratch spaces on nodes
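A sketch of how a dynamic resource is typically fed: a script that prints a single number, which the scheduler reads each cycle. The license query here is a placeholder (a real script would ask FlexNet/RLM or similar for free tokens), and the sched_config line in the comment follows the documented server_dyn_res format:

#!/usr/bin/env python3
# Registered in sched_config roughly as:
#   server_dyn_res: "app_lic !/usr/local/bin/lic_count.py"
import random

def query_license_server() -> int:
    # Placeholder: substitute a real query against your license manager
    return random.randint(0, 16)

print(query_license_server())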
6. OS Provisioning
• Operating System as a Resource
• Integrate with third-party OS provisioning tools
• Provisioning / Orchestration – Bare metal
• Install required Operating system or application on bare metal
• Post install automation support
• Multi boot systems
• Workstation grids
[Diagram: PBS Pro driving provisioning tools to run applications such as WRF, LS-DYNA, and OpenFOAM]
7. High Availability
• High Availability built in
• No third party software required
• All critical services moved in real time
• No loss of service availability
• Transparent
• Notifications
• Full feature manageability tools
• Maintain quorum
• Interventions and servicing
8. Cgroups
• Ensures jobs have access to requested resources
• Can restrict resources for PBS jobs, preventing OOM conditions
• Ensures accurate resource accounting
• Provides resource enforcement at the kernel level instead of the MoM polling for usage
• Consistent job runtime
9. Containers
• Lightweight virtualized environment for traditional HPC apps
• Number of containers that can be run on a host
• Time to launch a container
• All the goodies of containers (App maintenance)
• Conflicting requirements for applications (e.g., an app can run only on CentOS 6, or needs an older library)
• Ease of packaging application into their own “containers” with all dependencies included.
• Natural extension to cgroups and cpusets
• resource constraining, CPU pinning, etc.
10. Cloud bursting
[Diagram: PBS Works bursting to Microsoft Azure, Amazon Web Services, GCP, and Oracle clouds]
• On-demand use of cloud resources to maximize efficiency
• Improve responsiveness, adding capacity exactly when needed
• Automatic governance and cost controls via site-defined policy and quotas
• Understands on-premise utilization, ensuring bursting only when cost-efficient
• Vendor-agnostic: no lock-in
• Fast: 1,000+ nodes in minutes
11. Topology Aware
[Chart: before/after average runtimes, ~45% faster (actual customer-reported results)]
• Inter-node & intra-node placement
• Switches, clusters, and NUMA
• All networks
• Infiniband, Ethernet, custom
• Dynamic (runtime changeable)
• Support for all popular topologies
12. Energy Aware
[Chart: DoD HPCMP estimated yearly savings]
• Eliminate energy waste with no loss in service
• turn off idle machines and backfill holes
• A/C savings by scheduling work onto cooler nodes
• Power capping: power_budget=0.5MW
• fit more hardware into smaller datacenters
• run in degraded mode during power emergencies
• Per-job power profiles: power=600W
• Power saving mode: off, standby, …
• Power ramping: slow up/down
• Energy accounting: energy=64.2kWh
13. NVIDIA DCGM Ready
• Pre-job node risk identification and GPU resource allocation
• Automated monitoring of node health
• Reduced job terminations due to GPU failures
• Increased system resilience via intelligent routing decisions
• Increased job throughput via topology optimization
• Optimized job scheduling through GPU load and health monitoring
14. Burst Buffer Ready
• Stage/cache data between an application computation and the PFS
• Use as private scratch on compute nodes
• Out of core memory
• Shared Storage, provides multiple jobs the same access to data
• Shared inputs
• Ensembles analysis
• In-transit analysis
• Compute Node Swap
• over-commit compute node memory.
• Job script support
• Native client integrations through hooks
15. ARM64
• Fujitsu Post K supercomputer will be powered by 64-bit Arm processors.
• HPE - Sandia National Lab: ARM based Astra Supercomputer
• Fast evolving ecosystem
• Support for ARMv8 in PBS Pro starting with v18
16. Allocation Management
• Supports compute, storage and budget ($)
• Manages grants, quotas, budgets, limits, etc.
• Implements charge-back business logic
• Includes reporting tools
• PBS Pro add-on module
17. Flexi Reservations
• Resource Reservation
• SLA
• Predictable workloads, e.g., weather models
• Standing Reservations
• Allow reservations to start early or run over schedule
18. Throughput Mode
• Scheduler can run asynchronously
• doesn’t wait for each job to be accepted by MoM
• 10,000 jobs/minute
• Add-on hierarchical scheduler
• Handles small, short-job workloads
• Deploys per-user/project or site-wide
• Automatically adjusts to demand
• Built-in fairshare and limits
• Scales to millions of jobs
19. Auto Health check
• Handling failures at scale
• Degraded Hardware health
• Mean time between failures of hardware components
• Improve Productivity
• Job failures prevented
• Improved throughput
• Improve admin productivity
• Offline nodes with possible causes
• Notifications
20. Automations
• HPC and High Throughput Workflows
• Directed acyclic graphs
• Expressed as job dependencies between two or more jobs (see the sketch after this list)
• Specifying the order in which jobs in a set should execute
• Requesting a job run only if an error occurs in another job
• Holding jobs until a particular job starts or completes execution
• Cylc
• Open Source project founded by NIWA
• Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, and more
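A hedged sketch of wiring up such a dependency chain from Python; the script names are hypothetical, while qsub's -W depend=afterok syntax is standard PBS:

import subprocess
from typing import Optional

def qsub(script: str, depend_on: Optional[str] = None) -> str:
    """Submit a job, optionally after another job finishes OK; returns the job ID."""
    cmd = ["qsub"]
    if depend_on:
        cmd += ["-W", f"depend=afterok:{depend_on}"]
    cmd.append(script)
    return subprocess.check_output(cmd, text=True).strip()

pre = qsub("preprocess.sh")
solve = qsub("solve.sh", depend_on=pre)          # runs only if preprocess succeeds
post = qsub("postprocess.sh", depend_on=solve)   # runs only if solve succeeds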
21. Reclaim Resources
• Releasing Unneeded Vnodes from Your Job
• User level: -W release_nodes_on_stageout=true
• Admin: pbs_release_nodes
• Shrink to fit Jobs
• Jobs that are internally checkpointed.
• Jobs using periodic PBS checkpointing
• Jobs whose real running time might be much less than the expected time
22. Usability
• Manage, Monitor and Measure
• Backward compatibility
• Behaves as a platform
• REST web service
• Data exchange formats for upstream processing and integrations
• Unlimited feature extensions
23. Feature Overview
USABILITY | RECLAIM RESOURCES | AUTOMATIONS | AUTO HEALTH CHECK | ASYNC THROUGHPUT
FLEXI RESERVATIONS | ALLOCATION MGMT | ARM64 READY | BURST BUFFER READY | NVIDIA DCGM READY
HIGH AVAILABILITY | OS PROVISIONING | MULTI SCHEDULER | DYNAMIC RESOURCES | HOOKS
ENERGY AWARE | TOPOLOGY AWARE | CLOUD BURSTING | CONTAINERS | CGROUPS
Editor's Notes
Mana – 9216 cores
Harold – 500 nodes
AbUtil – 80 nodes
Overall Benefits:
Eliminate waste with no loss in service (as we turn off idle machines and backfill holes)
A/C savings by scheduling work onto cooler nodes
Power capping means you can fit more hardware into smaller datacenters (provision only for used power, not peak power)
Power capping can also be used to run in degraded mode during power emergencies / disasters
Measure, report, charge-back power use
Note: not running a job twice (because PBS mitigates system failures) is also very green
Staging copies files from the PFS to the Burst Buffer for execution and then stages the data out
Cache moves data implicitly (read-ahead and write-behind); useful for the following
Checkpoint/Restart
Periodic output
Application libraries
Open Source project founded by NIWA, New Zealand
Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, …