Summary of the lessons we learned with Docker (Dockerfile, storage, distributed networking) during the first iteration of the AdamCloud project (Fall 2014).
The AdamCloud project (part I) was presented here:
http://www.slideshare.net/davidonlaptop/bdm29-adamcloud-planification
AdamCloud: a cloud infrastructure for a genomics project. The AdamCloud project aims to simplify the installation of the AmpLab genomics pipeline (Snap, Adam, Avocado).
The results of the first iteration (part II) were presented here:
http://www.slideshare.net/davidonlaptop/bdm32-adam-cloud-part-2-43514904
1. AdamCloud (Part 2): Lessons learned from Docker
Sébastien Bonami, IT Engineering Student, and David Lauzon, Researcher
École de technologie supérieure (ÉTS)
Presented at Big Data Montreal #32 + DevOps Montreal, January 12th 2015
2. Plan
● AdamCloud Project
● Docker Introduction
● Lessons learned from Docker
o Dockerfiles
o Data Storage
o Networking
o Monitoring
● Conclusion
4. AdamCloud Goal
● Main goal: provide a portable infrastructure for processing genomics data
● Requirements:
o A series of software tools must be chained in a pipeline
o Centralize configuration for multiple environments
o Simple installation procedure for new students
5. Potential solution
● For genomics: Adam project developed at Berkeley AmpLab
o Snap, Adam, Avocado
o (uses Spark, HDFS)
● For infrastructure:
o Docker ?
6. Adam Genomic Pipeline
[Pipeline diagram] Hardware: Sequencer Machine → AmpLab genomics projects: Snap → Adam → Avocado. File formats along the pipeline: Fastq file (up to 250 GB) → Sam file → Parquet file → Parquet file (~10 MB).
7. AdamCloud - Environments
3 different environments:
● Development (laptop)
o All services in 1 single host
● Demo
o Mac mini cluster
● Testing
o ÉTS servers (for larger genomes)
8. Docker Introduction
From now on, we will talk about Docker, leaving AdamCloud aside. For simplicity, we chose to use MySQL to demonstrate some examples about learning Docker.
9. Docker Introduction - Key Concepts
[Lifecycle diagram] Dockerfile --build--> Image --run--> Container --commit--> Image; Image --push/pull--> Docker Hub Registry (Internet).
● Dockerfile: text file, size ~KB; installation & config instructions
● Image: composed of many read-only layers; typical size ~hundred(s) of MB; can have multiple versions (akin to Git tags)
● Container: shares the image's read-only layers, plus 1 private writeable layer (copy-on-write); initial size = 0 bytes; can be stopped, started, paused, etc.
● Docker Hub Registry: free public hosting
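A minimal command-line sketch of this lifecycle (the names myimage, mycontainer and myuser are invented for illustration):

$ docker build -t myimage .                  # Dockerfile -> image
$ docker run -d --name mycontainer myimage   # image -> container
$ docker commit mycontainer myimage:v2       # container -> new image
$ docker tag myimage:v2 myuser/myimage       # name it under your Docker Hub account
$ docker push myuser/myimage                 # image -> Docker Hub (after docker login)
$ docker pull myuser/myimage                 # Docker Hub -> local image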
10. Docker Introduction - How does it work?
[Architecture diagram] Hardware → Host OS Kernel → Docker Daemon → Container 1, Container 2, ...
● Docker Daemon: sets up & manages the LXC containers
● Docker Storage Backend: stores the image and container data layers locally
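To check which storage backend your own daemon uses, docker info reports it; the driver names below are only examples of what you might see (aufs, devicemapper, btrfs, etc.):

$ docker info | grep -i driver
Execution Driver: native-0.2
Storage Driver: aufs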
12. Lesson 0: Playing with Docker
# Install Docker
$ sudo sh -c "echo deb https://get.docker.com/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
$ sudo apt-get update && sudo apt-get install -y --force-yes lxc-docker

# Create a new interactive (-i) container with a tty (-t) from the image ubuntu, start a bash shell, and automatically remove the container when it exits (--rm=true)
$ docker run -ti --rm=true ubuntu bash
root@e0a1dad9f7fa:/# whoami; hostname
root
e0a1dad9f7fa

You are now “inside” the container with the id e0a1dad9f7fa.
14. Dockerfiles - MySQL Example (1/3)
$ mkdir mysql-docker/
$ vi mysql-docker/Dockerfile
# Contents of file mysql-docker/Dockerfile [1]
# Pull base image (from Docker Hub)
FROM ubuntu:14.04
# Install MySQL
RUN apt-get update
RUN apt-get install -y mysql-server
[1] Source: https://registry.hub.docker.com/u/dockerfile/mysql/dockerfile/
15. Dockerfiles - MySQL Example (2/3)
# Contents of file mysql-docker/Dockerfile (continued)
# Configure MySQL: listening interface, log error, etc.
RUN sed -i 's/^\(bind-address\s.*\)/# \1/' /etc/mysql/my.cnf
RUN sed -i 's/^\(log_error\s.*\)/# \1/' /etc/mysql/my.cnf
RUN echo "mysqld_safe &" > /tmp/config
RUN echo "mysqladmin --silent --wait=30 ping || exit 1" >> /tmp/config
RUN echo "mysql -e 'GRANT ALL PRIVILEGES ON *.* TO \"root\"@\"%\" WITH GRANT OPTION;'" >> /tmp/config
RUN bash /tmp/config && rm -f /tmp/config
16. Dockerfiles - MySQL Example (3/3)
# Contents of file mysql-docker/Dockerfile (continued)
# Define default command
CMD ["mysqld_safe"]
# Expose guest port. Not required, but facilitates management
# NEVER expose the public port in the Dockerfile
EXPOSE 3306
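The public port mapping is then chosen at run time instead; a sketch of the two usual options:

$ docker run -d -p 3306:3306 mysql-image   # explicitly publish container port 3306 on host port 3306
$ docker run -d -P mysql-image             # or let Docker map every EXPOSEd port to a random free host port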
17. Dockerfiles - Building MySQL image
$ docker build -t mysql-image mysql-docker/
Sending build context to Docker daemon 2.56 kB
[...]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
[...]
19. Lesson 1: Dialog-less installs
# Contents of file mysql-docker/Dockerfile (showing differences)
[...]
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server
[...]

$ docker build -t mysql-image mysql-docker/
[...]
Successfully built d5cb85b206a4
That's our image ID

$ docker run -d mysql-image
5f3695d8f5e4dfc836156f645dbf6b647e264e58a25b4e2a9724b7522591b9bc
That's our container ID (we can use a prefix as long as it is unique)
20. Lesson 1: Testing the connectivity
# Finding the IP address of our container
$ docker inspect 5f3695d8f5e4 | grep IPAddress | cut -d'"' -f4
172.17.0.102

# From the host, we can now connect to our MySQL box inside the container using the Docker network bridge
$ mysql -uroot -h 172.17.0.102 -e "SHOW DATABASES;"
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
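The grep/cut pipeline works, but docker inspect also accepts a Go template, which is less fragile; same container, assuming a Docker version with the --format option:

$ docker inspect --format '{{ .NetworkSettings.IPAddress }}' 5f3695d8f5e4
172.17.0.102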
24. Lesson 2: Layers - What are they?
● Think of a layer as a directory of files (or blocks)
● All these “physical” layers are combined into a “logical” file system for each individual container
o Union file system
o Copy-on-write
o Like a stack: higher layers may override lower layers
25. Lesson 2: Layers - Purpose (1/4)
● Blazing fast container instantiation
o To create a new instance from an image, Docker simply creates a new empty read-write layer
Great, but we could achieve this goal with 1 single layer per image + 1 layer per container. Why 17 layers?
26. Lesson 2: Layers - Purpose (2/4)
● Faster image modification
o Changing/adding a Dockerfile instruction causes only the modified layer(s) and those following it to be rebuilt
How often do you plan on changing your Dockerfiles?
27. Lesson 2: Layers - Purpose (3/4)
● Faster distribution
o When distributing the image (via docker push) and downloading it (via docker pull, or docker build), only the affected layer(s) are sent.
28. Lesson 2: Layers - Purpose (4/4)
● Minimize disk space
o All the containers located on the same Docker host and descended from the same image hierarchy will share layers.
o The Ubuntu Docker image is 200 MB
o 1000 containers based on Ubuntu only take 200 MB total (+ the additional packages they require)
Will you have multiple variants (config and/or versions) of MySQL on the same machine? How many MySQL servers will you have on the same machine?
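The copy-on-write effect can be observed with docker ps -s, which reports each container's private read-write layer size separately from the shared (virtual) image size; the numbers below are illustrative:

$ docker ps -s
CONTAINER ID  IMAGE               ...  SIZE
5f3695d8f5e4  mysql-image:latest  ...  2 B (virtual 348.9 MB)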
29. Lesson 2: Layers - Layer Genocide
In this example, all our MySQL containers will be the same. Therefore, we'll only be needing 1 single layer.
$ cp -r mysql-docker/ mysql-docker-grouped
$ vi mysql-docker-grouped/Dockerfile
30. Lesson 2: Layers - Combine multiple RUN instructions
# Contents of file mysql-docker-grouped/Dockerfile
[...]
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server && \
    sed -i 's/^\(bind-address\s.*\)/# \1/' /etc/mysql/my.cnf && \
    sed -i 's/^\(log_error\s.*\)/# \1/' /etc/mysql/my.cnf && \
    echo "mysqld_safe &" > /tmp/config && \
    echo "mysqladmin --silent --wait=30 ping || exit 1" >> /tmp/config && \
    echo "mysql -e 'GRANT ALL PRIVILEGES ON *.* TO \"root\"@\"%\" WITH GRANT OPTION;'" >> /tmp/config && \
    bash /tmp/config && rm -f /tmp/config
[...]
31. Lesson 2: Layers - Docker History
$ docker build -t mysql-image-grouped mysql-docker-grouped/
[...]
Successfully built d5cb85b206a4
$ docker history mysql-image-grouped
IMAGE         CREATED            CREATED BY                                       SIZE
11ccd4cc6c82  About an hour ago  /bin/sh -c #(nop) EXPOSE map[3306/tcp:{}]        0 B
59c9467d3360  About an hour ago  /bin/sh -c #(nop) CMD [mysqld_safe]              0 B
0993d316210d  About an hour ago  /bin/sh -c apt-get update && DEBIAN_FRONT        151 MB
86ce37374f40  6 weeks ago        /bin/sh -c #(nop) CMD [/bin/bash]                0 B
dc07507cef42  6 weeks ago        /bin/sh -c apt-get update && apt-get dist-upg    0 B
78e82ee876a2  6 weeks ago        /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/    1.895 kB
3f45ca85fedc  6 weeks ago        /bin/sh -c rm -rf /var/lib/apt/lists/*           0 B
61cb619d86bc  6 weeks ago        /bin/sh -c echo '#!/bin/sh' > /usr/sbin/polic    194.8 kB
5bc37dc2dfba  6 weeks ago        /bin/sh -c #(nop) ADD file:d11cc4a4310c270539    192.5 MB
511136ea3c5a  19 months ago                                                       0 B

Freed 7 layers! Our Dockerfile now only adds 3 layers on top of the base image: RUN, CMD, EXPOSE.
33. Lesson 3: Staying fit - Compacting layers
Some commands, like apt-get update, create temporary files which can be safely discarded after use. We can save space and create smaller images by deleting those files.
$ cp -r mysql-docker-grouped/ mysql-docker-cleaned
$ vi mysql-docker-cleaned/Dockerfile
34. Lesson 3: Staying fit - Removing temporary files
# Contents of file mysql-docker-cleaned/Dockerfile (partial)
[...]
RUN apt-get update && \
    apt-get install -y mysql-server && \
    rm -fr /var/lib/apt/lists/* && \
[...]
$ docker build -t mysql-image-cleaned mysql-docker-cleaned/
[...]
Successfully built 032798b8e064
Remember: you’ll need to run
apt-get update again next time
you want to install something
35. Lesson 3: Staying fit - Local Docker images
$ docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
mysql-image-cleaned latest 032798b8e064 2 hours ago 322.8 MB
mysql-image-grouped latest 11ccd4cc6c82 2 hours ago 343.6 MB
mysql-image latest d5cb85b206a4 3 hours ago 348.9 MB
ubuntu 14.04 86ce37374f40 6 weeks ago 192.7 MB
The cleaned image occupies 17% less space than the original
mysql-image (comparing virtual sizes, i.e. the space added on top of
the shared ubuntu base) [1].
MySQL is small; the impact can be much bigger for other
applications.
[1] ((348.9 - 192.7) - (322.8 - 192.7)) / (348.9 - 192.7) ≈ 17%
37. Lesson 3: Staying fit - docker diff
● Shows the differences between a container and the image it was created from
o Useful to see which files have been modified/created while writing
your Dockerfile; sample output below
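Sample session (container name hypothetical); docker diff prefixes each path with A (added), C (changed) or D (deleted):
$ docker diff mysql-01
C /etc/mysql
C /etc/mysql/my.cnf
A /var/lib/mysql/ibdata1
D /tmp/config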
39. Lesson 4: Reproducibility - Package version
● In a few months, your Dockerfile may build a different image than it does today
RUN apt-get install -y mysql-server
RUN apt-get install -y mysql-server=5.5.40-0ubuntu0.14.04.1
Specifying the package version explicitly (the second form) is better
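One way to discover the exact version string to pin is apt-cache policy run inside the base image (output illustrative for ubuntu:14.04 at the time):
$ docker run --rm ubuntu:14.04 \
    sh -c 'apt-get update -qq && apt-cache policy mysql-server'
mysql-server:
  Installed: (none)
  Candidate: 5.5.40-0ubuntu0.14.04.1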
40. Lesson 4: Reproducibility - Dependency version
RUN apt-get install -y libaio1=0.3.109-4 mysql-common=5.5.40-0ubuntu0.14.04.1 \
    libmysqlclient18=5.5.40-0ubuntu0.14.04.1 libwrap0=7.6.q-25 \
    libdbi-perl=1.630-1 libdbd-mysql-perl=4.025-1 libterm-readkey-perl=2.31-1 \
    mysql-client-core-5.5=5.5.40-0ubuntu0.14.04.1 \
    mysql-client-5.5=5.5.40-0ubuntu0.14.04.1 \
    mysql-server-core-5.5=5.5.40-0ubuntu0.14.04.1 psmisc=22.20-1ubuntu2 \
    mysql-server-5.5=5.5.40-0ubuntu0.14.04.1 libhtml-template-perl=2.95-1 \
    mysql-server=5.5.40-0ubuntu0.14.04.1 tcpd=7.6.q-25
The previous solution should be enough…
But if you need a stronger guarantee of reproducibility:
A. Specify the package versions for the dependencies as well
B. And/or use a caching proxy (apt cache proxy, Maven proxy, etc.)
41. Lesson 5: Prototry
A quick and dirty attempt to develop a working
model of software. The original intent is to
rewrite the ProtoTry, using lessons learned, but
schedules never permit. Also known as legacy
code. [1]
[1] Michael Duell, Ailments of Unsuitable Project-Disoriented Software, http://www.fsfla.org/~lxoliva/fun/prog/resign-patterns
42. Lesson 5: Prototry - Docker Hub Registry
● Before writing your own Dockerfile, try a build from
someone else
o https://registry.hub.docker.com/
o Official builds
o Trusted (automated) builds
o Other builds
For advanced setup,
see these images:
● jenkins
● dockerfile/java
43. Lesson 5: Prototry - Using other people's images
PROs:
● Faster to get started
● Better tested
CONs:
● You may end up with a mixed stack to support
○ e.g. different versions of Java
○ Ubuntu vs Debian vs CentOS
● Not all sources use all the best practices described in this presentation
For medium to large organisations / heavy Docker users:
best to fork and write your own Dockerfiles
44. Lesson 5: Prototry - Potential image hierarchy
● myorg-base: FROM ubuntu:14.04
○ Organization-wide tools (e.g. vim, etc.)
● myorg-java: FROM myorg-base:1.0
○ OpenJDK | OracleJDK
● myorg-python: FROM myorg-base:1.0
○ Install Python 2.7
● java-app3: FROM myorg-java:oracle-jdk7
● python-app1: FROM myorg-python:2.7
● python-app2: FROM myorg-python:2.7
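A minimal sketch of the root of such a hierarchy (repository name and package list hypothetical):
# Hypothetical myorg-base/Dockerfile
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y vim curl && \
    rm -rf /var/lib/apt/lists/*
# Build with: docker build -t myorg-base:1.0 myorg-base/
# Children then start with: FROM myorg-base:1.0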
46. Lesson 6: Inside Container Pattern
● Nothing to do - that's the default Docker behavior
o Application data is stored along with the infrastructure (container) data
● If the container is restarted, the data is still there
● If the container is deleted, the data is gone
47. Lesson 6: Host Directory Pattern
● A directory on the host
● To share data across containers on the
same host
● For example, put the source code on the
host and mount it inside the container with
the “-v” flag
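A minimal sketch (host and container paths hypothetical):
$ docker run -ti -v /home/dev/myapp/src:/opt/src ubuntu:14.04 bash
root@c1:/# ls /opt/src    # edits made on the host are immediately visible here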
48. Lesson 6: Data-Only Container Pattern
● Runs on a barebones image
● Declare volumes with the VOLUME instruction in the Dockerfile or the “-v” flag at run time
● Then use the “--volumes-from” flag to mount all of its volumes in another container (see the sketch below)
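A sketch using the mysql-image built earlier (container names hypothetical):
# Data-only container: the command exits immediately, but the volume persists
$ docker run --name mysql-data -v /var/lib/mysql ubuntu:14.04 true
# The actual MySQL container mounts every volume defined by mysql-data
$ docker run -d --name mysql-01 --volumes-from mysql-data mysql-image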
50. Lesson 7: Storage backend - Overview
● Options:
o VFS
o AUFS (default, docker < 0.7)
o DeviceMapper
Direct LVM
Loop LVM (default in Red Hat)
o Btrfs (experimental)
o OverlayFS (experimental)
Red Hat[1] says the
fastest backends are:
1. OverlayFS
2. Direct LVM
3. BtrFS
4. Loop LVM
Look up your current Docker backend:
$ docker info | grep Driver
[1] http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
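Sample output on a devicemapper host (illustrative for Docker 1.x):
$ docker info | grep Driver
Storage Driver: devicemapper
Execution Driver: native-0.2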
51. Lesson 7: Storage backend - VFS & AUFS
● Both are very basic (NOT for PROD)
● Both store each layer as a separate directory with
regular files
● VFS
o No Copy-on-Write (CoW)
● AUFS
o Original Docker backend
o File-level Copy-on-Write (CoW)
VFS & AUFS can be
useful to understand how
Docker works
Do not use in PROD
52. Lesson 7: Storage backend - DeviceMapper (1/2)
● Already used by the Linux kernel for LVM2 (logical volume management)
o Block-level Copy-on-Write (CoW)
o Unused blocks do not use space
● Uses thin pool provisioning to implement CoW snapshots
o Each pool requires 2 block devices: data & metadata
o By default, uses loopback mounts on sparse regular files
# ls -alhs /var/lib/docker/devicemapper/devicemapper
506M -rw-------. 1 root root 100G Sep 10 20:15 data
1.1M -rw-------. 1 root root 2.0G Sep 10 20:15 metadata
Loop LVM
53. Lesson 7: Storage backend - DeviceMapper (2/2)
● In production:
o Use real block devices! (Direct LVM)
o Ideally, data & metadata each on its own spindle
o Additional configuration is required
Docker does not
do that for you
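A sketch of that additional configuration using the daemon's storage options of the era (device names hypothetical; these flags were later deprecated in favor of dm.thinpooldev):
# Point the devicemapper backend at real block devices (Direct LVM)
$ docker -d --storage-driver=devicemapper \
    --storage-opt dm.datadev=/dev/sdb1 \
    --storage-opt dm.metadatadev=/dev/sdc1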
54. Lesson 7: Storage backend - Btrfs & OverlayFS
Btrfs:
● Requires /var/lib/docker to be on a btrfs file system
● Block-level Copy-on-Write (CoW) using Btrfs’s snapshotting
● Each layer stored as a Btrfs subvolume
● No SELinux
OverlayFS:
● Supports page cache sharing across containers (claims a huge RAM saving)
● Lower FS contains the base image (XFS or EXT4)
● Upper FS contains the deltas
● No SELinux
56. Docker networking
● The Ethernet bridge “docker0” is created when the Docker daemon starts
● Virtual subnet on the host (default: 172.17.42.1/16)
● Each container gets a pair of virtual Ethernet interfaces: one end inside the container, the other attached to the bridge
● You can remove “docker0” and use your own bridge if you want (see below)
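For example (bridge name hypothetical; -b was the daemon flag of that era):
$ ip addr show docker0    # inspect the default bridge and its subnet
$ docker -d -b mybridge0  # start the daemon using your own bridge instead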
57. Weave
Why Weave?
● Docker's built-in functionality doesn't provide a solution for connecting containers on multiple hosts
● Weave creates a virtual network to permit a distributed environment (common in the real world)
58. Weave
How does it work?
● Virtual routers establish TCP connections to
each other with a handshake
● These connections are duplex
● The routers use “pcap” to capture and forward packets
● Traffic between local containers is excluded
61. Weave - getting started
● First host: weave-01
● Second host: weave-02

# On weave-01: start the weave router in a container
$ sudo weave launch
$ sudo weave run 10.0.0.1/24 -ti --name ubuntu-01 ubuntu:14.04

# On weave-02: start the weave router in a container and peer it with weave-01
$ sudo weave launch weave-01
$ sudo weave run 10.0.0.2/24 -ti --name ubuntu-02 ubuntu:14.04

Note: 10.0.0.1/24 is CIDR notation; “weave run” invokes “docker run -d” (running as a daemon)
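A quick sanity check under these assumptions: both containers sit on the same 10.0.0.0/24 virtual subnet, so from ubuntu-01 the container on the other host should answer:
$ sudo docker attach ubuntu-01
root@ubuntu-01:/# ping -c 1 10.0.0.2   # ubuntu-02, running on the other host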
65. cAdvisor
● New tool from Google
● Specialized for Docker containers
PROs:
● Great web interface
● Docker image available (18 MB) to try it in seconds
● Stats can be exported to InfluxDB (data mining left to do)
CONs:
● Needs more maturity
● Missing metrics
○ No data for Disk I/O
● Only keeps the last 60 metrics locally (not configurable)
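Trying it really does take seconds; the run command suggested by the cAdvisor README at the time looked roughly like this (tag and mounts may have changed since):
$ sudo docker run \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:rw \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --publish=8080:8080 --detach=true --name=cadvisor \
    google/cadvisor:latest
# The web UI is then available on http://localhost:8080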
68. AdamCloud - The next steps
● Docker + Weave = success
● Open-source the project and merge it
upstream into the AmpLab genomic
pipeline.
● Support for Amazon EC2 environments
● Improve administration of Docker
containers
o Monitoring, orchestration, provisioning
69. Docker Conclusion
● 1 Docker container = 1 background daemon
● Container isolation is not like a VM
● Use exact versions of images and packages, and keep a trace of what you used
● Docker is less interesting for multi-tenant use cases (no SSH in the containers)
● Docker is FAST and VERSATILE
● cAdvisor is an interesting monitoring tool, but limited
● Docker is perfect for short-lived apps (no long-term data persistence)
● Data-intensive apps should review the Docker docs carefully; start by
looking at Direct LVM.
70. References
● Jonathan Bergknoff - Building good docker images, http://jonathan.bergknoff.com/journal/building-good-docker-images
● Michael Crosby - Dockerfile Best Practices, http://crosbymichael.com/dockerfile-best-practices.html
● Michael Crosby - Dockerfile Best Practices - take 2, http://crosbymichael.com/dockerfile-best-practices-take-2.html
● Nathan Leclaire - The Dockerfile is not the source of truth for your image, http://nathanleclaire.com/blog/2014/09/29/the-dockerfile-is-not-the-source-of-truth-for-your-image/
● Docker Documentation - Understanding Docker, https://docs.docker.com/introduction/understanding-docker/
● Docker Documentation - Docker User Guide, https://docs.docker.com/userguide/
● Docker Documentation - Dockerfile Reference, https://docs.docker.com/reference/builder/
● Docker Documentation - Command Line (CLI) User Guide, https://docs.docker.com/reference/commandline/cli/
● Docker Documentation - Advanced networking, http://docs.docker.com/articles/networking/
● Project Atomic - Supported Filesystems, http://www.projectatomic.io/docs/filesystems/
● Red Hat Developer Blog - Comprehensive Overview of Storage Scalability in Docker, http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
● Linux Kernel Documentation - DeviceMapper Thin Provisioning, https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt
● weave - the Docker network, http://zettio.github.io/weave/
● GitHub - google/cadvisor, https://github.com/google/cadvisor