A tour of PySpark Streaming in Apache Spark, with an example that calculates CPU usage using the Docker stats API. Two buzzwordy technologies for the price of one.
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containers (2015-02-24, Washington DC Apache Spark Interactive)
2. Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containers
Rich Seymour @rseymour
Washington DC Area Apache Spark Interactive Meetup
10. Docker
Open-sourced by dotCloud in March 2013
Switched from LXC to libcontainer in March 2014
Written in Go
Allows us to contain dependencies with cgroups, namespaces, capabilities, netlink, netfilter, etc.
Currently Linux only, but supported on Amazon, Google, Microsoft, and Red Hat cloud offerings
Gives us a registry and a method for pulling binary diffs
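The demo in this deck reads per-container CPU counters from the Docker stats API. As a minimal sketch of one sample fetch using the Docker SDK for Python (the SDK, the pip package, and the container name are my assumptions, not from the slides):

# Assumes: pip install docker, and a running container named "iso_small" (hypothetical).
import docker

client = docker.from_env()
container = client.containers.get("iso_small")

# stream=False returns a single stats snapshot as a dict, including the
# stats['cpu_stats']['cpu_usage']['total_usage'] counter used later in the deck.
sample = container.stats(stream=False)
print(sample["cpu_stats"]["cpu_usage"]["total_usage"])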
15. “a fast and general engine for large-scale data processing”
16. Spark
Apache project born out of UC Berkeley’s Algorithms, Machines, and People Lab (AMPLab)
Java / Scala / Python APIs for computing on resilient distributed datasets across a cluster of multicore machines.
17. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection of elements that can be operated on in parallel.”
18. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection of elements that can be operated on in parallel.”
Immutable: can’t be changed over time. If you want to preserve a change, create a new RDD on the left of the equals sign.
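A quick illustration of that point (a minimal sketch, assuming an existing SparkContext sc, as the pyspark shell provides):

# Transformations never modify an RDD in place; they return a new one.
nums = sc.parallelize([1, 2, 3])
doubled = nums.map(lambda x: x * 2)  # new RDD "on the left of the equals sign"

print(nums.collect())     # [1, 2, 3]  -- the original RDD is untouched
print(doubled.collect())  # [2, 4, 6]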
19. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection of elements that can be operated on in parallel.”
Partitioned: often split up by key with a partitioner, if your RDD is made up of key-value pairs:
my_rdd = [(1, "Apple"), (2, "IBM")]
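To see the partitioning in the shell, something like this works (a sketch, again assuming sc; glom() collects each partition into a list so you can inspect where the pairs landed):

# Hash-partition the key-value pairs into two partitions by key.
pairs = sc.parallelize([(1, "Apple"), (2, "IBM")]).partitionBy(2)
print(pairs.glom().collect())  # e.g. [[(2, 'IBM')], [(1, 'Apple')]]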
25. Pipes circa 1964 – Doug McIlroy
Summary--what's most important.
To put my strongest concerns into a nutshell:
1. We should have some ways of coupling programs like garden hose--screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also.
2. Our loader should be able to do link-loading and controlled establishment.
3. Our library filing scheme should allow for rather general indexing, responsibility, generations, data path switching.
4. It should be possible to get private system components (all routines are system components) for buggering around with.
M. D. McIlroy
October 11, 1964
Interesting side notes: http://www.cs.dartmouth.edu/~doug/sieve/
26. Resilient Distributed Datasets (RDDs)
“Represents an immutable, partitioned collection of elements that can be operated on in parallel.”
27. Wolfsburg – Inside the Volkswagen Plant, photo by Roger
https://www.flickr.com/photos/24736216@N07/5869083813/
Such parallel efficiency!
31. PySpark is in some ways just helpers for functional programming in Python
32. Functional Programming in Python
Please check out this article by Mary Rose Cook (@maryrosecook), in which she writes:
“Functional code is characterized by one thing: the absence of side effects”
https://codewords.hackerschool.com/issues/one/an-introduction-to-functional-programming
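That property is exactly what Spark relies on: a side-effect-free function can be pickled, shipped to workers, and safely re-run on failure. A small illustration (mine, not from the deck):

# Side-effecting style: the function mutates shared state, which breaks
# once copies of it run in separate worker processes.
totals = []
def tally(x):
    totals.append(x * 2)

# Functional style: the output depends only on the input, so it is safe
# to serialize and run anywhere, any number of times.
doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]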
36. cpu shares
Each container gets 1024 CPU shares by default. Unless you specify a different value, everything is equal. As soon as you do, the scheduler steps in.
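For example, starting two CPU-burning containers with unequal shares (a sketch using the Docker SDK for Python; the image, container names, and share values are my assumptions):

import docker

client = docker.from_env()

# Both containers spin a busy loop; under contention the 1024-share
# container should get roughly twice the CPU of the 512-share one.
burn = "sh -c 'while true; do :; done'"
small = client.containers.run("busybox", burn, cpu_shares=512,
                              detach=True, name="iso_small")
large = client.containers.run("busybox", burn, cpu_shares=1024,
                              detach=True, name="iso_large")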
37. Thermodynamics...
The idea was to run docker containers that try to use all of the CPU, but then limit them using cgroups and see how well that works.
"Tolman & Einstein" by Los Angeles Times. Original uploader was Tillman at en.wikipedia; transferred from en.wikipedia, transfer stated to be made by User:chazchaz101. (Original text: Los Angeles Times photographic archive, UCLA Library [1]). Licensed under Public Domain via Wikimedia Commons:
https://commons.wikimedia.org/wiki/File:Tolman_%26_Einstein.jpg#mediaviewer/File:Tolman_%26_Einstein.jpg
45.
def calculate_cpu_percent(prev_cpu, prev_sys, stats):
    # Fraction of total system CPU this container used since the last
    # sample, scaled by the number of CPUs and expressed as a percent.
    cpu_percent = 0.0
    cpu_delta = float(stats['cpu_stats']['cpu_usage']['total_usage']) - prev_cpu
    system_delta = float(stats['cpu_stats']['system_cpu_usage']) - prev_sys
    if system_delta > 0.0 and cpu_delta > 0.0:
        cpu_percent = (cpu_delta / system_delta) * \
            float(len(stats['cpu_stats']['cpu_usage']['percpu_usage'])) * 100.0
    return cpu_percent

Really easy to do in a for loop (e.g.
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2640-L2665 and
https://github.com/docker/docker/blob/ea8cb16af7e8c83a264a1d1c48db3cacd4cc082b/api/client/commands.go#L2758-L2771),
but not straightforward over a DStream of RDDs.
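The for-loop version might look like this (a sketch, assuming the Docker SDK's streaming stats call and a container handle as in the earlier sketch):

# Each iteration yields one decoded stats dict, roughly once per second.
prev_cpu = prev_sys = 0.0
for stats in container.stats(stream=True, decode=True):
    print(calculate_cpu_percent(prev_cpu, prev_sys, stats))
    prev_cpu = float(stats['cpu_stats']['cpu_usage']['total_usage'])
    prev_sys = float(stats['cpu_stats']['system_cpu_usage'])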
54.
keyed_up = (stats
            .map(safe_load)
            .filter(lambda x: x is not None)
            .flatMap(key_up)
            .filter(lambda x: x is not None)
            .groupByKeyAndWindow(20, 5))

(('isosystem_large_1', 'total_usage'), 54016836033.0)
(('isosystem_large_1', 'total_usage'), 54016936033.0)
(('isosystem_large_1', 'total_usage'), 54017036033.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870000000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870200000.0)
(('isosystem_large_1', 'system_cpu_usage'), 52812200870400000.0)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)
(('isosystem_large_1', 'tot_cpus'), 4)

So groupByKeyAndWindow groups every 20 seconds of data by key, then slides by 5 seconds to keep a moving delta.
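The safe_load and key_up helpers are not shown in this excerpt; a plausible sketch, reconstructed from the output above (my guess, not the deck's code):

import json

def safe_load(line):
    # Parse one JSON stats record; return None on bad input so the
    # downstream filter drops it instead of failing the batch.
    try:
        return json.loads(line)
    except ValueError:
        return None

def key_up(stats):
    # Fan one stats dict out into ((container, metric), value) pairs
    # matching the windowed output shown above.
    name = stats.get('name')
    cpu = stats['cpu_stats']
    return [((name, 'total_usage'), float(cpu['cpu_usage']['total_usage'])),
            ((name, 'system_cpu_usage'), float(cpu['system_cpu_usage'])),
            ((name, 'tot_cpus'), len(cpu['cpu_usage']['percpu_usage']))]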