Scylla Summit 2022: What’s New in ScyllaDB Operator for Kubernetes – ScyllaDB
This document summarizes the Scylla Operator for Kubernetes, including its developers, features, releases, and roadmap. Key points include:
- The Scylla Operator manages and automates tasks for Scylla clusters on Kubernetes.
- Features include seedless mode, security enhancements, performance tuning, and improved stability.
- It follows a rapid 6-week release cycle and supports the latest two releases.
- Future plans include additional performance optimizations, persistent storage support, TLS encryption, and multi-datacenter capabilities.
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO... – ScyllaDB
Outbrain is the world's largest content discovery platform. Learn about their use case with Scylla, where they lowered latency while doing 20X the IOPS of Cassandra.
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records – ScyllaDB
In this talk, we will discuss Happn's war story about migrating a Cassandra 2.1 cluster containing more than 68 billion records in a counter table to ScyllaDB Open Source.
To watch all of the recordings hosted during Scylla Summit 2022, visit our website here: https://www.scylladb.com/summit.
Spark can run on Kubernetes containers in two ways - as a static cluster or with native integration. As a static cluster, Spark pods are manually deployed without autoscaling. Native integration treats Kubernetes as a resource manager, allowing Spark to dynamically acquire and release containers like in YARN. It uses Kubernetes custom controllers to create driver pods that then launch worker pods. This provides autoscaling of resources based on job demands.
This document discusses optimizing Spark write-heavy workloads to S3 object storage. It describes problems with eventual consistency, renames, and failures when writing to S3. It then presents several solutions implemented at Qubole to improve the performance of Spark writes to Hive tables and directly writing to the Hive warehouse location. These optimizations include parallelizing renames, writing directly to the warehouse, and making recover partitions faster by using more efficient S3 listing. Performance improvements of up to 7x were achieved.
20180503 kube con eu kubernetes metrics deep dive – Bob Cotton
Kubernetes generates a wealth of metrics. Some are exposed explicitly by the Kubernetes API server, the Kubelet, and cAdvisor, while others arise implicitly from observing events, as in the kube-state-metrics project. A subset of these metrics is used within Kubernetes itself to make scheduling decisions; other metrics can be used to determine the overall health of the system or for capacity planning purposes.
Kubernetes exposes metrics from several places, some available internally, others through add-on projects. In this session you will learn about:
- Node level metrics, as exposed from the node_exporter
- Kubelet metrics
- API server metrics
- etcd metrics
- cAdvisor metrics
- Metrics exposed from kube-state-metrics
Lookout on Scaling Security to 100 Million Devices – ScyllaDB
The massive increase of security-related data requires companies to respond with new approaches to ingestion. Learn how Lookout has changed its approach for ingesting telemetry to meet their goal of growing from 1.5 million devices to 100 million devices and beyond, using Kafka Connect and switching from AWS DynamoDB to Scylla.
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP... – Globus
Gladier is a framework that combines instruments, storage, and compute using loosely coupled services to automate and simplify data-intensive experimental research workflows. It uses Globus services for remote data management and funcX for scalable remote execution. Gladier provides tools and a client to define workflows that span multiple facilities and orchestrate the execution of actions like data transfers, analysis, and publication. Example deployments at facilities like the Advanced Photon Source automate workflows for experiments like XPCS and crystallography.
Hoodie: How (And Why) We built an analytical datastore on Spark – Vinoth Chandar
Exploring a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Then, discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
This document discusses using Apache Geode and ActiveMQ Artemis to build a scalable IoT platform. It introduces IoT and the MQTT protocol. ActiveMQ Artemis is described as a high performance message broker that is embeddable and supports clustering. Geode is presented as a distributed in-memory data platform for building data-intensive applications that require high performance, scalability, and availability. Example users of Geode include large companies handling billions of records and thousands of transactions per second. Key capabilities of Geode like regions, functions, querying, and continuous queries are summarized.
This meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to shed light on the similarities and differences between the platforms. If you have questions about how the platforms compare, this is the meeting to attend!
Scaling for India's Cricket Hungry Population (Bhavesh Raheja & Namit Mahuvak... – confluent
Hotstar is a media entertainment platform with a large user base in India. Last year during the Indian Premier League, Hotstar introduced "Watch N Play", a real-time cricket prediction game in which over 33 million unique users answered over 2 billion questions and won more than 100 million rewards, built with Kafka as the backbone. In the game, the user guesses the outcome of the next ball. If they guess right before the actual outcome, they score points to climb up the ladder and receive rewards along the way. To support potentially millions of users with differing stream times and device latencies, we used topics to separate logical streams and partitions to scale to 1M requests/second. We'll talk about how we reduced the end-to-end latency to make the user experience as real-time as possible. We'll focus on the operational challenges that we faced and how we overcame them by building automation around them to make our operational lives easier. We also have a war story around ghost topic creation, where, after deletion, topics would get magically created without any create-topic requests. This led us to some interesting revelations and a very important lesson: long-living producers and consumers with very short-lived topics are a recipe for disaster.
The document discusses various concepts related to Apache Geode including clients, functions, serialization, transactions, and data colocation. It describes how clients can maintain local caches, run queries, and register for notifications. The document also covers how functions allow for distributed concurrent processing by pushing code to servers and executing across the distributed system.
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond – ScyllaDB
Beyond the immediate schema changes supported in Scylla Open Source 5.0, learn how the Raft consensus infrastructure will enable radical new capabilities. Discover how it will enable more dynamic topology changes, tablets, immediate consistency, better and faster elasticity, and simplification to repair operations.
To watch all of the recordings hosted during Scylla Summit 2022, visit our website here: https://www.scylladb.com/summit.
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac... – Tim Vaillancourt
This document provides an overview of Apache Spark, a fast and general engine for large-scale data processing. It discusses how Spark can be used to query and summarize data stored in different data sources like MongoDB, MySQL, and Redis in a single Spark job. The document then demonstrates a Spark job that retrieves weather station data from MongoDB and MySQL, aggregates it, stores the results in Redis, and retrieves the top 10 results.
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual... – Brandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz
A demo and explanation of building a streaming application using Spark Streaming, Node.js and Redis with a real time visualization. Includes discussion of internals of Spark and Spark streaming including RDD partitioning and code and data distribution and cluster resource allocation.
MySQL NDB Cluster 8.0 SQL faster than NoSQL – Bernd Ocklin
MySQL NDB Cluster running SQL faster than most NoSQL databases. Benchmark results, comparisons, and an introduction to NDB's parallel distributed in-memory query engine. MySQL Day before FOSDEM 2020.
The document discusses Reactive Slick, a new version of the Slick database access library for Scala that provides reactive capabilities. It allows parallel database execution and streaming of large query results using Reactive Streams. Reactive Slick is suitable for composite database tasks, combining async tasks, and processing large datasets through reactive streams.
Tim Vaillancourt is a senior technical operations architect specializing in MongoDB. He has over 10 years of experience tuning Linux for database workloads and monitoring technologies like Nagios, MRTG, Munin, Zabbix, Cacti, and Graphite. He discussed the various MongoDB storage engines including MMAPv1, WiredTiger, RocksDB, and TokuMX. Key metrics for monitoring the different engines include lock ratio, page faults, background flushing times, checkpoints/compactions, replication lag, and scanned/moved documents. High-level operating system metrics like CPU, memory, disk, and network utilization are also important for ensuring MongoDB has sufficient resources.
Leveraging Cassandra for real-time multi-datacenter public cloud analytics – Julien Anguenot
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Large volume data analysis on the Typesafe Reactive Platform – Martin Zapletal
The document discusses several topics related to distributed machine learning and distributed systems including:
- Reasons for using distributed machine learning being either due to large data volumes or hopes of increased speed
- Failure rates of hardware and network links in large data centers
- Examples of database inconsistencies and data loss caused by network partitions in different distributed databases
- Key aspects of distributed data processing including data storage, integration, computing primitives, and analytics
This document summarizes transactions in Apache Geode, including:
- The semantics of repeatable read and optimistic concurrency control.
- The transaction API for basic, suspend/resume, and single entry operations.
- The implementation of using ThreadLocal to isolate transactions and conflict detection at commit.
- Handling transactions with replicated and partitioned regions, including failure scenarios.
- Support for client-initiated transactions, data colocation, and interaction with other Geode features.
- Roadmap items like distributed transactions and XA integration.
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches to processing Big Data on a Spark cluster for real-time analytics, machine learning, and iterative BI, and also discuss the pros and cons of using Spark in the Azure cloud.
Why is building a big data platform hard? What are the key aspects involved in providing a "serverless" experience for data folks? And how does Databricks solve the infrastructure problems and provide that "serverless" experience?
M|18 Understanding the Architecture of MariaDB ColumnStore – MariaDB plc
The document provides an overview of MariaDB ColumnStore, including its history, components, disk storage architecture, and its processes for writing and querying data. It was presented by Andrew Hutchings, the lead software engineer for MariaDB ColumnStore, who has previous experience with MySQL, HP, and other companies. The presentation covers the technical use cases for ColumnStore, differences from row-oriented databases, and optimizations for ColumnStore.
This document provides an overview of Kafka Streams, a stream processing library built on Apache Kafka. It discusses how Kafka Streams addresses limitations of traditional batch-oriented ETL processes by enabling low-latency, continuous stream processing of real-time data across diverse sources. Kafka Streams applications are fault-tolerant distributed applications that leverage Kafka's replication and partitioning. They define processing topologies with stream processors connected by streams. State is stored in fault-tolerant state stores backed by change logs.
Running Airflow Workflows as ETL Processes on Hadoop – clairvoyantllc
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement, and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a Python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
Leveraging Intra-Node Parallelization in HPCC Systems – HPCC Systems
This document discusses leveraging intra-node parallelization in HPCC Systems to improve the performance of set similarity joins (SSJ). It describes a naïve approach to computing SSJ that suffers from memory exhaustion and straggling executors. The presented approach replicates and groups independent data using hashing to address these issues while enabling efficient use of multiple CPU cores through multithreading. Experiments show the approach scales to larger datasets and achieves better performance by increasing the number of threads per executor. Lessons learned include that less complex optimizations are more robust in distributed environments.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex – Apache Apex
Apache Apex is a next-gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks, and an easy-to-use API for developers to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance, and processing guarantees, its programming model, and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
This document provides an overview of scientific workflows, including what they are, common elements, and problems they help address. A workflow is a formal way to express a calculation as a series of tasks with dependencies. Workflow tools automate task execution, data management, scheduling, and more. They can help scale applications from a local system to large clusters. An example is provided of how the CyberShake project uses the Pegasus workflow system to automate probabilistic seismic hazard analysis calculations involving hundreds of thousands of tasks and petabytes of data.
Kubernetes @ Squarespace (SRE Portland Meetup October 2017) – Kevin Lynch
In this presentation I talk about our motivation to converting our microservices to run on Kubernetes. I discuss many of the technical challenges we encountered along the way, including networking issues, Java issues, monitoring and alerting, and managing all of our resources!
Xin Wang (Apache Storm committer/PMC member) covered the relationship between streaming and messaging platforms, and the challenges of and tips for using Storm.
The document provides an introduction to the National Supercomputing Centre (NSCC) high performance computing cluster. It describes the 1 petaflop system consisting of 1300 nodes and 13 petabytes of storage. The system uses PBS Pro for job scheduling and includes compilers, libraries, developer tools, and applications for engineering, science, and industry users from organizations such as A*STAR, NUS, and NTU.
This document provides an outline of manycore GPU architectures and programming. It introduces GPU architectures, the GPGPU concept, and CUDA programming. It discusses the GPU execution model, CUDA programming model, and how to work with different memory types in CUDA like global, shared and constant memory. It also covers streams and concurrency, CUDA intrinsics and libraries, performance profiling and debugging. Finally, it mentions directive-based programming models like OpenACC and OpenMP.
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ... – NETWAYS
In this talk we’ll introduce an open source project being used to monitor large Power Systems clusters, such as in the IBM collaboration with Oak Ridge and Lawrence Livermore laboratories for the Summit project, a large deployment of custom AC922 Power Systems nodes augmented by GPUs that work in tandem to implement the (currently) largest Supercomputer in the world.
Data is collected out-of-band directly from the firmware layer and then redistributed to various components using an open source component called Crassd. In addition, in-band operating-system and service level metrics, logs and alerts can also be collected and used to enrich the visualization dashboards. Open source components such as the Elastic Stack (Elasticsearch, Logstash, Kibana and select Beats) and Netdata are used for monitoring scenarios appropriate to each tool’s strengths, with other components such as Prometheus and Grafana in the process of being implemented. We’ll briefly discuss our experience to put these components together, and the decisions we had to make in order to automate their deployment and configuration for our goals. Finally, we lay out collaboration possibilities and future directions to enhance our project as a convenient starting point for others in the open source community to easily monitor their own Power Systems environments.
Performance Benchmarking: Tips, Tricks, and Lessons Learned – Tim Callaghan
Presentation covering 25 years worth of lessons learned while performance benchmarking applications and databases. Presented at Percona Live London in November 2014.
Kafka Connect: Operational Lessons Learned from the Trenches (Elizabeth Benne... – confluent
At Stitch Fix, we maintain a distributed Kafka Connect cluster running several hundred connectors. Over the years, we've learned invaluable lessons for keeping our connectors going 24/7. As many conference goers probably know, event driven applications require a new way of thinking. With this new paradigm comes unique operational considerations, which I will delve into. Specifically, this talk will be an overview of: 1) Our deployment model and use case (we have a large distributed Kafka Connect cluster that powers a self-service data integration platform tailored to the needs of our Data Scientists). 2) Our favorite operational tools that we have built for making things run smoothly (the jobs, alerts and dashboards we find most useful. A quick run down of the admin service we wrote that sits on top of Kafka Connect). 3) Our approach to end-to-end integrity monitoring (our tracer bullet system that we built to constantly monitor all our sources and sinks). 4) Lessons learned from production issues and painful migrations (why, oh why did we not use schemas from the beginning?? Pausing connectors doesn't do what you think it does... rebalancing is tricky... jar hell problems are a thing of the past, upgrade and use plugin.path!). 5) Future areas of improvement. The target audience member is an engineer who is curious about Kafka Connect or currently maintains a small to medium sized Kafka Connect cluster. They should walk away from the talk with increased confidence in using and maintaining a large Kafka Connect cluster, and should be armed with the hard won experiences of our team. For the most part, we've been very happy with our Kafka Connect powered data integration platform, and we'd love to share our lessons learned with the community in order to drive adoption.
Performance Analysis: new tools and concepts from the cloud – Brendan Gregg
Talk delivered at SCaLE10x, Los Angeles 2012.
Cloud Computing introduces new challenges for performance analysis, for both customers and operators of the cloud. Apart from monitoring a scaling environment, issues within a system can be complicated when tenants are competing for the same resources and are invisible to each other. Other factors include rapidly changing production code and wildly unpredictable traffic surges. For performance analysis in the Joyent public cloud, we use a variety of tools including Dynamic Tracing, which allows us to create custom tools and metrics and to explore new concepts. In this presentation I'll discuss a collection of these tools and the metrics that they measure. While these are DTrace-based, the focus of the talk is on which metrics are proving useful for analyzing real cloud issues.
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 – Codemotion
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through best practices for using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase in traffic typical of Black Friday.
Vinetalk is a software abstraction layer that allows cluster managers like Mesos and Kubernetes to offer fractions of GPU resources, enabling more efficient sharing of accelerators. Existing cluster managers cannot share accelerators because device drivers do not support it. Vinetalk implements an abstraction layer that decouples executors from vendor-specific drivers, representing accelerators as virtual access queues. This allows multiple tasks to use the same physical accelerator concurrently. Vinetalk has been shown to reduce queuing times for tasks sharing a GPU compared to Mesos alone. It is also easier for developers to use, hiding proprietary device APIs, and it carries a low overhead of 1-5%, due mainly to memory transfers.
SystemML is an Apache project that provides a declarative machine learning language for data scientists. It aims to simplify the development of custom machine learning algorithms and enable scalable execution on everything from single nodes to clusters. SystemML provides pre-implemented machine learning algorithms, APIs for various languages, and a cost-based optimizer to compile execution plans tailored to workload and hardware characteristics in order to maximize performance.
A Dataflow Processing Chip for Training Deep Neural Networks – inside-BigData.com
In this deck from the Hot Chips conference, Chris Nicol from Wave Computing presents: A Dataflow Processing Chip for Training Deep Neural Networks.
Watch the video: https://wp.me/p3RLHQ-k6W
Learn more: https://wavecomp.ai/
and
http://www.hotchips.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The National Supercomputing Centre (NSCC) operates a 1 petaflop supercomputing cluster with 1300 nodes for research and industry users. It has 13 petabytes of storage including a 265TB burst buffer and uses a Mellanox 100Gbps network. The cluster scheduler is PBS Pro and supports applications in areas like engineering, science, life sciences and more.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you... – Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Building RAG with self-deployed Milvus vector database and Snowpark Container... – Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Generative AI Deep Dive: Advancing from Proof of Concept to Production – Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
UiPath Test Automation using UiPath Test Suite series, part 5 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
What do a Lego brick and the XZ backdoor have in common? – Speck&Tech
ABSTRACT: At first glance, what a Lego brick and the XZ backdoor have in common might seem to be only that both are building blocks, or dependencies, of creative and software projects. The reality is that a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She was an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training sessions. She previously worked on LibreOffice migrations and training for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (which is where her nickname, deneb_alpha, comes from).
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... – Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI – Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! – SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Removing Uninteresting Bytes in Software Fuzzing – Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two tools from critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
National Security Agency - NSA mobile device best practices
Intro to HPC
1. PRESENTED BY:
Introduction to HPC
Interacting with High Performance Computing Systems
6/7/18 1
Virginia Trueheart, MSIS
Texas Advanced Computing Center
vtrueheart@tacc.utexas.edu
2. An Overview of HPC: What is it?
• High Performance Computing
• Parallel processing for advanced computation
• A “Supercomputer”, “Large Scale System”, or “Cluster”
• The same parts as your laptop
• Processors, coprocessors, memory, operating system, etc.
• Specialized for scale & efficiency
• Scale and Speed
• Thousands of nodes
• High bandwidth, low latency network for large scale I/O
6/7/18 2
3. An Overview of HPC: Stampede2
• Peak performance: 18 PF, rank 12 in Top 500 (2017)
• 4,200 68-core Knights Landing (KNL) nodes
• 1,736 48-core Skylake (SKX) nodes
• 368,928 cores and 736,512GB memory in total
• Interconnect: Intel’s Omni-Path Fabric Network
• Three Lustre Filesystems
• Funded by NSF through grant #ACI-1134872
6/7/18 3
4. An Overview of HPC: Architecture
6/7/18 4
[Architecture diagram: users connect over the Internet via ssh to a login node; from there, sbatch or idev launches work onto the knl- and skx- compute nodes across the Omni-Path fabric; the $HOME, $WORK, and $SCRATCH filesystems are shared by all Stampede2 nodes.]
5. Ex: SKX Compute Node
6/7/18 5
Model: Intel Xeon Platinum 8160 ("Skylake")
Cores per Node: 48 cores on two sockets (24 cores/socket)
Hardware Threads per Core: 2
Hardware Threads per Node: 96
Clock Rate: 2.1 GHz
RAM: 192 GB
Cache: 57 MB per socket
Local Storage: 144 GB /tmp partition
7. An Overview of HPC: Using a System
• Why would I use HPC resources?
• Large scale problems
• Parallelization and Efficiency
• Collaboration
• How do I find an HPC resource?
• Check with your institution
• Check with national scientific groups (NSF in the US)
6/7/18 7
8. Overview: What Can I Find on a System
• Modules and Software
• Basic compilers and libraries
• Popular packages
• Licensed software
• Build Your Own!
• github or other direct sources
• pip, wget, curl, etc
• You won’t have sudo access
6/7/18 8
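To make the "build your own" point concrete, here is a minimal sketch of installing software without sudo by targeting your home directory (the package name and download URL are placeholders, not from the deck):

    # Python packages: install into your per-user site directory
    pip install --user numpy

    # Source builds: point the install prefix at $HOME
    wget https://example.org/tool-1.0.tar.gz    # placeholder URL
    tar xzf tool-1.0.tar.gz
    cd tool-1.0
    ./configure --prefix=$HOME/local
    make && make install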
9. What Do I Need to Get Started?
• User Account
• Allocation & Project
• Two Factor Authentication
6/7/18 9
10. SSH Protocols
• Secure Shell
• Encrypted network protocol to access a secure system over an unsecured network
• Automatically generated public-private key pairs
• Your Wi-Fi -> Stampede2 or other secure machine
• File transfers (scp & rsync)
• Options
• .ssh/config
• Host & Username
• Make connecting easier
• Passwordless Login
6/7/18 10
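As an illustration of the .ssh/config options the slide mentions, a minimal sketch (the host alias, username, and key file are hypothetical):

    # ~/.ssh/config
    Host stampede2
        HostName stampede2.tacc.utexas.edu
        User my_tacc_username
        IdentityFile ~/.ssh/id_rsa

    # The alias shortens both logins and file transfers:
    #   ssh stampede2
    #   scp results.tar.gz stampede2:backups/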
11. Where am I?
• Login Nodes
• Manage files
• Build software
• Submit, monitor and manage jobs
• Compute Nodes
• Running jobs
• Testing applications
6/7/18 11
12. Allocations
• Active Project with a Project Instructor attached
• Service Units (SUs)
• SUs billed (node-hrs) = ( # nodes ) x ( wall clock hours ) x ( charge rate per node-hour )
• Shared Systems
• Be a good citizen
6/7/18 12
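A worked example of the SU formula (the numbers are illustrative, not from the deck): a job that runs on 4 nodes for 2 wall clock hours in a queue charged at 1 SU per node-hour bills 4 x 2 x 1 = 8 SUs against the project's allocation.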
13. Filesystems
• Division of Labor
• Linux Cluster (Lustre) filesystems that look like a single hard disk
• Small I/O is hard on the system
• Striping large data (OST, MDS)
• Partitions
• $HOME: 10GB, $WORK: 1TB, $SCRATCH: Unlimited
• Shared system
15. Filesystems: Cont.
• Where am I?
• pwd – print working directory
• cd – change directory
• cd .. – move up one directory
• New Files
• mkdir – make directory
• Editors – vi(m), nano, emacs
• mv – move a file to another location
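A short session tying these commands together (the directory and file names are arbitrary examples):

    pwd                    # where am I?
    mkdir analysis         # create a new directory
    cd analysis            # move into it
    vim notes.txt          # create/edit a file (nano or emacs work too)
    cd ..                  # move back up one level
    mv analysis $WORK/     # relocate the directory to the $WORK filesystem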
16. Types of Code
• Serial Code
• Single tasks, one after the other
• Single node/single core (much like your laptop, albeit a very, very small one)
• Parallel Code
• Array or “embarrassingly parallel” jobs
• Many node/many core
• Uses MPI
• Hybrid codes
17. Message Passing Interface
• ibrun is TACC-specific
• A "wrapper" for mpirun
• Executes serial and parallel jobs across the entire node allocation (see the sketch below)
• MPI Functions
• Allows communication between all cores and all nodes
• Move data between parts of the job that need it
• Point-to-Point or Collective Communication
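A hedged sketch of how ibrun is typically invoked inside a job; my_mpi_app stands in for your own MPI executable:

    # ibrun reads the node and task counts from the SLURM job environment,
    # so no -np flag is needed as with a bare mpirun: one MPI rank is
    # started per task requested with -n
    ibrun ./my_mpi_app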
19. Submitting a Job
• Why submit?
• Larger jobs, more nodes
• You don’t have to watch it in real time
• Run multiple jobs simultaneously
• Queues
• Pick the queues that suit your needs
• Don’t request more resources than you need
• Remember this is a shared resource
• Never run on a login node!
20. Stampede2 Queues
Queue Name     | Node Type      | Max Nodes per Job | Max Duration | Max Jobs in Queue | Charge Rate
development    | KNL cache-quad | 16 nodes          | 2 hrs        | 1                 | 1 SU
normal         | KNL cache-quad | 256 nodes         | 48 hrs       | 50                | 1 SU
large**        | KNL cache-quad | 2048 nodes        | 48 hrs       | 5                 | 1 SU
long           | KNL cache-quad | 32 nodes          | 96 hrs       | 2                 | 1 SU
flat-quadrant  | KNL flat-quad  | 24 nodes          | 48 hrs       | 2                 | 1 SU
skx-dev        | SKX            | 4 nodes           | 2 hrs        | 1                 | 1 SU
skx-normal     | SKX            | 128 nodes         | 48 hrs       | 25                | 1 SU
skx-large**    | SKX            | 868 nodes         | 48 hrs       | 3                 | 1 SU
21. Submitting a Job cont.
• sbatch
• Simple Linux Utility for Resource Management (SLURM)
• Linux/Unix workload manager
• Allocates resources
• Executes and monitors jobs
• Evaluates and manages pending jobs
• Using a Scheduler
• Gets you off of the login nodes (shared resource)
• Means you can walk away and do other things
22. Submission Options
Option | Argument      | Comments
-p     | queue_name    | Submit to the queue (partition) designated by queue_name
-J     | job_name      | Job name
-N     | total_nodes   | Required. Define the resources you need by specifying either (1) "-N" and "-n", or (2) "-N" and "--ntasks-per-node"
-n     | total_tasks   | Total MPI tasks in this job. When using this option in a non-MPI job, it is usually best to set it to the same value as "-N"
-t     | hh:mm:ss      | Required. Wall-clock time limit for the job
-o     | output_file   | Direct the job's standard output to output_file (without the -e option, error output also goes to this file)
-e     | error_file    | Direct the job's error output to error_file
-d=    | afterok:jobid | Dependency: this run starts only after the specified job finishes successfully
-A     | projectnumber | Charge the job to the specified project/allocation number
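Putting these options together, a minimal batch script might look like the following sketch; the job name, queue choice, executable, and project number are placeholders to adapt:

    #!/bin/bash
    #SBATCH -J my_job             # job name
    #SBATCH -p skx-normal         # SKX production queue (see the table above)
    #SBATCH -N 2                  # total nodes
    #SBATCH -n 96                 # total MPI tasks (2 nodes x 48 cores)
    #SBATCH -t 01:00:00           # wall-clock limit, hh:mm:ss
    #SBATCH -o my_job.o%j         # stdout file; %j expands to the job ID
    #SBATCH -A MY-PROJECT         # allocation to charge

    ibrun ./my_mpi_app            # launch the MPI executable across all tasks

Submit it from a login node with sbatch my_job.slurm; SLURM queues it, allocates the nodes, and runs the script on the first allocated node.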
23. Managing Your Jobs
• qlimits – restrictions for all queues
• sinfo – monitor queues in real time
• squeue – monitor jobs in real time
• showq – similar output to squeue
• scancel – manually cancel a job
• scontrol – detailed information about the configuration of a job
• sacct – accounting data about your jobs
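Typical usage of the monitoring commands above (123456 is a placeholder job ID):

    squeue -u $USER             # list your pending and running jobs
    scontrol show job 123456    # detailed configuration of one job
    sacct -j 123456             # accounting data once the job has run
    scancel 123456              # manually cancel the job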
25. On Node Monitoring
• cat /proc/cpuinfo
• A readout of the CPU info on the node
• top
• See all of the processes running and which are consuming the most resources
• free -g
• Basic printout of memory consumption
• Remora
• An open-source tool developed by TACC that can help you track memory, CPU usage, I/O activity, and more (see the sketch below)
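A sketch of wrapping a run with Remora, assuming it is installed as a module on your system (my_app is a placeholder for your own program):

    module load remora          # make the tool available
    remora ./my_app             # run the program and collect memory/CPU/I/O statistics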
26. Modules
• Modules
• TACC uses a tool called Lmod
• Add, remove, and swap software packages
• Saves you from having to build your own
• Commands
• module spider <package> - search for a package
• module list – see currently loaded modules
• module avail – list all available packages
• module load <package> - load a specific package or version
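A short Lmod session illustrating these commands; the package names are examples and vary by system:

    module spider python        # search the full package tree for python
    module avail                # list packages compatible with the current stack
    module load python3         # load a package (a default version is chosen)
    module list                 # confirm what is loaded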
28. Further Information
Main Website: www.tacc.utexas.edu
User Support: www.portal.tacc.utexas.edu
Email: info@tacc.utexas.edu
SHI: http://shinstitute.org/
XSEDE: https://portal.xsede.org
Editor's Notes
High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.
Petaflop: a unit of computing speed equal to one thousand million million (10^15) floating-point operations per second.
Blue Waters is about 13.34 PF
What a job does on the node will always be faster than what it does on the filesystem, because the filesystem sits across the Omni-Path interconnect. High throughput, low latency. Do things on the node, then write out.
So what does a node look like on a super computer? My Mac here is 1 node with 4 cores and a similar clock rate. The full machine is 4,200 KNL compute nodes AND 1,736 SKX compute nodes.
Technically 72, but due to the way the KNL handles data, only 68 are actually "functional" as far as the system is concerned.
68 cores altogether, or 48 split between two sockets. Different kinds of responses. Figure out what you're doing and what your code can be made to accommodate.
You're working on climate data, so your problem sets are huge. How many years? How granular? Do you want to develop visualizations? Predictive scaling? We could track Harvey in real time and predict, after the fact, where the deepest water was to help rescue crews manage their responses. We also kept legacy data for future study.
The application process varies: depending on the model, access may be free, but access is in high demand, so you have to prove that you'll be doing valuable science on the machine.
Now that you know a bit about what HPC systems involve, let’s try it out. What do I need to get started?
We'll come back to complex options later in the customizing-your-environment section. Programmers are lazy; we don't want to type more than we have to. On Windows, PuTTY is an SSH client.
Object storage target and metadata server
Performance increases with filesystem size, but there are also caveats for long-term storage associated with these resources.
This is a rehash of Saturday, but it will help you move around in what we're going to do next.
We'll get to multithreading in a minute and how that actually works vs. hyperthreading.
A lot of people will use the fact that there are more cores on a node; some people will run across it repeatedly for more throughput.
Point-to-point is communication between two processes, so task 1 on core 1 talks to core 2, etc. Collective communication implies a synchronization point: all of your tasks must get to a certain point before moving to the next step (a barrier), or a broadcast, where a single point sends the same data out to the other processors. You can get super fancy with this and build tree structures to help your code. Paired with things like vectorization, this becomes very important for increasing the performance of your code at larger scale.
You can do this for as many logical cores as are on the node, or you can do it between nodes. But specify which you're doing in your code or you're just going to trip it up.
Point-to-point can be send/receive, or a send or a receive separately, and can go between any single core. This can be in order or not, but it's not the same thing being pushed out across multiple cores; that is a "broadcast", and the reverse can be "gathering", as in scatter and gather.
So many more options available. What’s available? See the queues!
Varies based on the number of nodes the system has and the frequency of use. Large queues are special, as they take up so much of the system, and a poorly run job could (a) cause system problems and (b) cost you a lot of SUs that you won't get refunded.
Cache-quad and flat-quad have to do with the way memory is distributed. We keep most of the KNLs in cache mode because the response time is faster, but sometimes that means there is less room for certain operations. If your code is heavier on the memory, switch to flat-quad so you can get the full memory available. Instead of 16+96 you get 112 GB flat.
Cache Mode. In this mode, the fast MCDRAM is configured as an L3 cache. The operating system transparently uses the MCDRAM to move data from main memory. In this mode, the user has access to 96GB of RAM, all of it traditional DDR4. Most Stampede2 KNL nodes are configured in cache mode.
Flat Mode. In this mode, DDR4 and MCDRAM act as two distinct Non-Uniform Memory Access (NUMA) nodes. It is therefore possible to specify the type of memory (DDR4 or MCDRAM) when allocating memory. In this mode, the user has access to 112GB of RAM: 96GB of traditional DDR and 16GB of fast MCDRAM. By default, memory allocations occur only in DDR4. To use MCDRAM in flat mode, use the numactl utility or the memkind library; see Managing Memory for more information. If you do not modify the default behavior you will have access only to the slower DDR4.
This is how the system knows what's trying to run. It helps manage resources and tries to keep things "fair". It is generally automated, though it can be manipulated by admins.
These aren't all the options, but they cover the most commonly used ones and the ones that are required. You have to tell the system what you are trying to do, and this is how you communicate with it via the workload manager.
These will all provide you with useful feedback about your job while it is pending and while it is running. They can also tell you about the state of the system.
Helps you track your jobs and work towards improving performance and/or debugging
Lua based module system. A modulefile contains the necessary information to allow a user to run a particular application or provide access to a particular library. All of this can be done dynamically without logging out and back in. Modulefiles for applications modify the user's path to make access easy. Modulefiles for Library packages provide environment variables that specify where the library and header files can be found. It is also very easy to switch between different versions of a package or remove the package.
module help