Slides for my entry to the DC Apache Spark Meetup, Spark Bake Off. I built a demo of a distributed, real-time TCP packet analysis system with Apache Spark, Azure Event Hubs, and Power BI.
This document outlines a final year project proposal to automate replication in OpenStack. The project aims to address the lack of automated replication when using OpenStack's Swift component for object storage. The objectives are to propose, design, and implement an automation framework for replication in OpenStack. The methodology will include installing and configuring OpenStack, developing a data management program, and testing the framework. The expected result is an automated replication solution in OpenStack that can be managed through the dashboard interface.
Slide deck for the fourth data engineering lunch, presented by guest speaker Will Angel. It covered the topic of using Airflow for data engineering. Airflow is a scheduling tool for managing data pipelines.
A look at Kubeless, a serverless framework on top of Kubernetes. We take a look at what serverless is and why it matters, then introduce Kubeless, which leverages Kubernetes API resources to provide a Function-as-a-Service solution.
This document discusses machine learning infrastructure on Kubernetes. It describes how Kubernetes now supports stateful applications and data processing workloads through new abstractions. It introduces Kubeflow, which provides tools like JupyterHub, Tensorflow Training Controller, and Tensorflow Serving to make it easier to build and run machine learning workflows on Kubernetes. It also discusses efforts to run Apache Spark and Apache Airflow on Kubernetes to enable machine learning pipelines. The goal is for Kubernetes to provide a platform to orchestrate full machine learning workflows and leverage various frameworks.
Learn about the core functions and architecture of Zentral. Zentral is an open-source hub for processing event streams from osquery and other sources into the Elastic Stack. Besides support for distinct osquery features like file carving, Zentral provides numerous integrations for inventory acquisition and alerting.
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows defining and monitoring cron jobs, automating DevOps tasks, moving data periodically, and building machine learning pipelines. Many large companies use Airflow for tasks like data ingestion, analytics automation, and machine learning workflows. The author proposes using Airflow to manage data movement and automate tasks for their organization to benefit business units. Instructions are provided on installing Airflow using pip, Docker, or Helm along with developing sample DAGs connecting to Azure services like Blob Storage, Cosmos DB, and Databricks.
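For orientation, a minimal sketch of such a DAG is shown below, assuming a pip-installed Airflow 2.x; the DAG id and task body are invented placeholders rather than anything from the original deck, and real Azure work would go through provider hooks such as WasbHook.

```python
# A minimal Airflow 2.x DAG sketch (assumes `pip install apache-airflow`).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def move_data():
    # Placeholder body: a real task would call an Azure provider hook
    # (e.g. WasbHook for Blob Storage) to move the data.
    print("moving data between Azure services")


with DAG(
    dag_id="azure_data_move",        # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="move_data", python_callable=move_data)
```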
This document outlines steps to configure a Lambda function to send logs and events to Splunk Cloud in real-time. It involves setting up a Splunk index and HTTP Event Collector (HEC), creating an HEC token, and modifying the source type. A standalone Splunk Lambda function is created that can be invoked by other application Lambda functions to log events to Splunk Cloud. The application Lambda is modified to invoke the Splunk Lambda after starting EC2 instances to log instance details.
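To illustrate the HEC leg of that setup, here is a hedged, standard-library-only sketch of a Lambda handler posting one event to Splunk; the URL and token are placeholders, while the `/services/collector/event` path and the `Authorization: Splunk <token>` header follow Splunk's documented HEC interface.

```python
# Sketch of a Lambda handler that forwards its input event to Splunk HEC.
import json
import os
import urllib.request

# Placeholders, expected to be set as Lambda environment variables.
HEC_URL = os.environ.get(
    "SPLUNK_HEC_URL",
    "https://example.splunkcloud.com:8088/services/collector/event")
HEC_TOKEN = os.environ.get("SPLUNK_HEC_TOKEN", "REPLACE_ME")


def lambda_handler(event, context):
    payload = json.dumps({"event": event, "sourcetype": "aws:lambda"}).encode("utf-8")
    req = urllib.request.Request(
        HEC_URL,
        data=payload,
        headers={"Authorization": f"Splunk {HEC_TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # raises on non-2xx responses
        return {"statusCode": resp.status}
```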
Dask Tutorial at PyConDE / PyData Karlsruhe 2018. These were the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
Tableapp architecture migration story for GCPUG.TW (Yen-Wen Chen)
This document summarizes the migration of a web application called TABLEAPP from AWS to GCP. It describes the original AWS architecture, problems encountered like slow scaling, and goals for the migration like improving performance and reducing costs. It then details experiments with Docker containers and Kubernetes on GCP and AWS. The selected solution deployed Kubernetes on GCP's Container Engine for auto-scaling and easy management. The new GCP architecture integrated Kubernetes, Cloud SQL, Cloud Storage and other services. This resulted in faster deployment times, higher performance, better log collection and a 40% reduction in costs compared to the original AWS architecture.
Using Libvirt with Cluster API to manage baremetal Kubernetes (Himani Agrawal)
There are many different tools available to bootstrap and manage Kubernetes clusters on various platforms, but they don't all interoperate.
The Cluster API project attempts to solve this issue by creating a common declarative API, tools, and best practices for deploying, configuring, and managing Kubernetes on multiple platforms. It supports many public cloud provider plugins, as well as on-premise deployment with vSphere and OpenStack.
However, running Cluster API on raw baremetal KVM machines without an IaaS vendor remains a challenge. In this talk, we will show you how we use and configure Libvirt to create a custom provider for Cluster API. Libvirt is an open-source API and management tool widely used for managing various virtualization platforms. This enables us to manage Kubernetes on baremetal KVM/QEMU machines easily, with extensibility to run on Xen and other platforms not supported by Cluster API.
Building an analytics workflow using Apache Airflow (Yohei Onishi)
This document discusses using Apache Airflow to build an analytics workflow. It begins with an overview of Airflow and how it can be used to author workflows through Python code. Examples are shown of using Airflow to copy files between S3 buckets. The document then covers setting up a highly available Airflow cluster, implementing continuous integration/deployment, and monitoring workflows. It emphasizes that Google Cloud Composer can simplify deploying and managing Airflow clusters on Google Kubernetes Engine and integrating with other Google Cloud services.
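A bare-bones version of the S3-to-S3 copy mentioned above might look like the sketch below; it assumes a recent `apache-airflow-providers-amazon` package, and every bucket and key name is a placeholder.

```python
# Sketch of an S3-to-S3 copy task using the Amazon provider's operator.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CopyObjectOperator

with DAG(
    dag_id="s3_copy_example",        # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # run on manual trigger only
) as dag:
    S3CopyObjectOperator(
        task_id="copy_report",
        source_bucket_name="source-bucket",     # placeholder
        source_bucket_key="reports/today.csv",  # placeholder
        dest_bucket_name="dest-bucket",         # placeholder
        dest_bucket_key="reports/today.csv",
    )
```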
The data science team at Zymergen is applying machine learning techniques to identify genetic targets, work that is supported by extensive analytical automation that systematically identifies outliers, removes process-related bias, and quantifies performance improvements. We’re using Apache Airflow to construct robust data pipelines that allow us to produce clean, reliable inputs to our predictive models. In this talk, I’ll discuss the unique data processing challenges we face in working with high-throughput, biological data and provide an overview of how we’re using Apache Airflow to meet those challenges.
This document provides lessons learned from optimizing Apache Spark for NoSQL databases like Riak. Some key lessons include (the first is sketched in code after this list):
1. Parallelizing operations whenever possible to avoid overloading Riak with too many direct key-based gets or secondary index queries.
2. Being smart about data mapping between NoSQL data structures and Spark DataFrames/RDDs for efficient processing.
3. Optimizing performance at all levels from the network protocol to data locality optimizations.
4. Being flexible in supporting multiple languages and deployment environments for Spark and NoSQL integrations.
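One generic way to read lesson 1 in Spark terms is to push lookups down to the executors and batch them per partition instead of issuing per-key gets from the driver. The sketch below shows only that generic pattern; `fetch_from_store` is a stub standing in for a real Riak client call.

```python
# Generic PySpark sketch of partition-level batched lookups against a key-value store.
from pyspark.sql import SparkSession


def fetch_from_store(key):
    # Stub for a real client get; a Riak connector would batch or use coverage
    # queries here instead of naive per-key gets.
    return {"key": key}


def fetch_partition(keys):
    # One logical client per partition keeps concurrent connections bounded
    # by the partition count rather than the key count.
    for key in keys:
        yield (key, fetch_from_store(key))


spark = SparkSession.builder.appName("parallel-kv-gets").getOrCreate()
keys = ["user:%d" % i for i in range(10_000)]  # placeholder key set
pairs = spark.sparkContext.parallelize(keys, numSlices=32).mapPartitions(fetch_partition)
print(pairs.take(5))
```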
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma (Spark Summit)
Learn about the Big Data processing ecosystem at Netflix and how Apache Spark sits in this platform. I talk about typical data flows and data pipeline architectures used at Netflix and address how Spark is helping us gain efficiency in our processes. As a bonus, I'll touch on some unconventional use cases, contrary to typical warehousing/analytics solutions, that are being served by Apache Spark.
This document discusses using Prometheus for application monitoring on Kubernetes. It describes the current monitoring systems in use and their limitations. Prometheus is introduced as an open-source monitoring system developed by SoundCloud. Two approaches are presented for using Prometheus on Kubernetes - running Prometheus on EC2 instances and pointing it at Kubernetes, or using the Prometheus Operator which automates Prometheus configuration based on Kubernetes resources. The Prometheus Operator approach is recommended for its simplified configuration.
A quick introduction to Apache Spark and how it fits into the cognitive world: how we can use it to support cognitive solutions as well as to create distributed algorithms to predict and perform other machine learning tasks.
Tao Feng gave a presentation on Airflow at Lyft. Some key points:
1) Lyft uses Apache Airflow for ETL workflows with over 600 DAGs and 800 DAG runs daily across three AWS Auto Scaling Groups of worker nodes.
2) Lyft has customized Airflow with additional UI links, DAG dependency graphs, and integration with internal tools.
3) Lyft is working to improve the backfill experience, support DAG-level access controls, and explore running Airflow with Kubernetes executors.
4) Tao discussed challenges like daylight saving time issues and long-running tasks occupying slots, and thanked other Lyft engineers contributing to Airflow.
Introduction to Streaming Distributed Processing with Storm (Brandon O'Brien)
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introduces streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrates a working example of a WordCount topology for the SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
This document discusses a knowledge graph system and its components. It describes the system's architecture including data extraction, processing, storage and querying. It also covers the system's applications in risk management, various processing methods like streaming, batch and reasoning, and how it supports thousands of entities with billions of relationships. Finally, it provides contact details for the system.
Airflow is a platform created by Airbnb to automate and schedule workflows. It uses a Directed Acyclic Graph (DAG) structure to define dependencies between tasks, and allows scheduling tasks on a timetable or triggering them manually. Some key features include monitoring task status, resuming failed tasks, backfilling historical data, and a web-based user interface. While additional databases are required for high availability, Airflow provides a flexible way to model complex data workflows as code.
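To make the dependency idea concrete, here is a tiny hedged example using Airflow's `>>` operator; the task names and schedule are invented for illustration, and `catchup=True` is what enables the backfilling of historical runs mentioned above.

```python
# Two-task DAG showing Airflow's dependency operator and backfill via catchup.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="extract_then_load",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,                    # backfill runs for past dates
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    extract >> load                  # `load` runs only after `extract` succeeds
```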
From AWS to GCP, TABLEAPP Architecture Story (Yen-Wen Chen)
TABLEAPP is migrating from AWS to GCP due to scaling issues with their AWS architecture. They propose using Kubernetes on GCP to containerize their application and allow for easier auto-scaling. This will eliminate wasted resources and slow provisioning times. They present a new GCP architecture using Kubernetes, Cloud SQL, Cloud Load Balancing, and other GCP services. Migrating has reduced costs by 40% while maintaining availability and performance.
Two Years In Production With Kubernetes - An Experience Report (Kasper Nissen)
This document summarizes a presentation about two years of experience using Kubernetes in production. It discusses how the company shifted to being application-oriented rather than machine-oriented, and introduced tools like Shuttle and Ham to improve developer experience and implement continuous delivery. It also covers how they used Kops to manage Kubernetes clusters across multiple availability zones and Dextre to improve node rollouts. While there were initial challenges, the presenter concludes that Kubernetes was the right choice and has allowed the company to scale their services.
Running Airflow Workflows as ETL Processes on Hadoop (clairvoyantllc)
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement, and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a Python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
Paco Nathan has received certification as an Apache Spark 1.1.0 developer from Databricks and O'Reilly Media. The certification verifies that Paco Nathan has successfully completed the requirements to be considered a certified developer on Apache Spark. The certification is valid and was issued on July 16, 2016.
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1) (Sascha Dittmann)
In this session, we use a practical scenario to show how concrete tasks can be solved with HDInsight in practice:
- Basics of HDInsight for Windows Server and Windows Azure
- Working with Windows Azure HDInsight
- Implementing MapReduce jobs with JavaScript and .NET code
Azure API App metrics with Application Insights (Nicolas Takashi)
This document discusses Azure Application Insights, which provides metrics and telemetry for applications. It introduces Application Insights, addresses concerns about overhead, lists the types of data that can be collected, and promises a demonstration.
Application Insights is a Microsoft Azure service that helps application developers understand whether their applications are available, performing well, and successful. It provides a 360° view that alerts developers to problems quickly and shows what customers are doing so that work items can be prioritized. The document discusses using Application Insights for Java applications; the service is currently in preview and working towards general availability. It seeks external help with Application Insights extensions and support for technologies like Java, PHP, Node.js, Ruby, and Python.
Big data streaming with Apache Spark on Azure (Willem Meints)
A talk for the Breda Dev meetup in which I showed what challenges microservices architectures bring for data analysis and how you can tackle these challenges with Apache Spark on Azure.
The document introduces Azure Functions as a serverless compute option on the Azure platform. It provides an overview of Azure's compute services spectrum, positioning Azure Functions as a highly agile and scalable option with less complexity compared to other services like virtual machines, cloud services, and service fabric. The document also includes information about event sponsor BlueMetal, an interactive design and technology architecture firm, and contact details for following up.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
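As a rough sketch of the ingest leg described above, the snippet below consumes a Kafka topic with Spark Streaming's direct approach (the Spark 1.x/2.x-era `pyspark.streaming.kafka` API, removed in Spark 3); the topic, broker, and scoring rule are placeholders.

```python
# Sketch of Kafka -> Spark Streaming ingestion with a trivial stand-in scoring rule.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="fraud-events")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["transactions"], {"metadata.broker.list": "broker:9092"})

# Keep only suspicious-looking events; real rules/models replace this filter.
flagged = stream.map(lambda kv: kv[1]).filter(lambda event: "suspicious" in event)
flagged.foreachRDD(lambda rdd: rdd.foreach(print))  # HBase/HDFS/Solr sinks in practice

ssc.start()
ssc.awaitTermination()
```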
This document discusses Azure IoT Hub and Azure Stream Analytics. Azure IoT Hub is a fully managed service that enables reliable and secure bidirectional communication between IoT devices and backend solutions. It provides device-to-cloud and cloud-to-device messaging at scale, with security credentials and access control. Azure Stream Analytics is a low-cost event processing engine that helps uncover real-time insights from streaming data sources. It allows developers to use SQL-like queries to develop solutions faster and elastically scales in the cloud. The document outlines how these services can be used to build IoT solutions that process and analyze real-time device data.
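As a small sketch of the device-to-cloud half, the snippet below sends a single telemetry message with the `azure-iot-device` Python SDK; the connection string and reading are placeholders.

```python
# Sends one device-to-cloud message to Azure IoT Hub
# (assumes `pip install azure-iot-device`).
import json

from azure.iot.device import IoTHubDeviceClient, Message

# Placeholder device connection string.
CONN_STR = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
client.connect()

reading = Message(json.dumps({"temperature": 22.5}))  # placeholder telemetry
reading.content_type = "application/json"
client.send_message(reading)

client.shutdown()
```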
2016-08-25 TechExeter - going serverless with Azure (Steve Lee)
This document discusses serverless computing options on Microsoft Azure, including Azure Functions, Logic Apps, and Mobile Apps. Azure Functions allow developers to write small code fragments or "nanoservices" that run in ephemeral containers in a serverless computing environment. Logic Apps enable the creation of declarative, event-driven workflows to automate business processes. Mobile Apps provide backend services like user authentication, data synchronization, and push notifications for mobile applications. The document argues that serverless options on Azure simplify development by allowing developers to focus on their code while outsourcing server management.
The document discusses software scope, which involves determining project goals, tasks, costs, and deadlines. It also describes functions, performance, constraints, and interfaces. The first step in software project planning is to determine scope by assessing functions and performance allocated to software. Scope is identified by asking the customer questions about goals, benefits, problems, and environment. An example of determining scope for a conveyor line sorting system is provided.
This document provides an overview of big data and how Azure HDInsight can be used to work with big data. It discusses the evolution of data from gigabytes to exabytes and the big data utility gap where most data is stored but not analyzed. It then discusses how to store everything, analyze anything, and build the right thing using big data. Examples are provided of companies generating large amounts of data. An overview of the Hadoop ecosystem is given along with examples of using Hive and Pig on HDInsight to query and analyze large datasets. A case study of Klout is also summarized.
Azure Stream Analytics: Analyse Data in Motion (Ruhani Arora)
The document discusses evolving approaches to data warehousing and analytics using Azure Data Factory and Azure Stream Analytics. It provides an example scenario of analyzing game usage logs to create a customer profiling view. Azure Data Factory is presented as a way to build data integration and analytics pipelines that move and transform data between on-premises and cloud data stores. Azure Stream Analytics is introduced for analyzing real-time streaming data using a declarative query language.
Spark on Azure HDInsight - Spark Meetup Seattle (Judy Nash)
Since HDInsight launched Spark clusters last year, the HDInsight Spark team's mission has been making Spark easy to use and production-ready. In the process, we have explored many open source technologies such as Livy, Jupyter, and Zeppelin. In this talk, we will demo top customer features, deep-dive into the HDInsight Spark architecture, and share learnings from building the perfect cluster.
Speakers: Judy Nash and Lin Chan
This document provides an overview of serverless computing using Azure Functions. It discusses the benefits of serverless such as increased server utilization, instant scaling, and reduced time to market. Serverless allows developers to focus on business logic rather than managing servers. Azure Functions is introduced as a way to develop serverless applications using triggers and bindings in languages like C#, Node.js, Python and more. Common serverless patterns are also presented.
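For a feel of the programming model, here is a minimal sketch of an HTTP-triggered function in the Python v1 model; the HTTP trigger binding itself lives in an accompanying `function.json`, which is omitted here.

```python
# Minimal HTTP-triggered Azure Function (Python v1 programming model).
import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    # The `req` binding is declared in function.json (omitted in this sketch).
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```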
With all the outstanding education technologies available these days, it's now possible to turn an online course into a full ecosystem of best-in-breed technologies and content providers. Come to this session to learn what that ecosystem can look like! We'll discuss how to use open educational resources (OERs) to replace expensive textbooks, and tips for finding, reviewing, and implementing the best tools right inside your LMS/VLE. We'll also look at best practices for building and adopting an open-centric strategy in your organization's teaching and learning environment.
Azure IoT Hub on a Toradex Colibri VF61 – Part 1 - Sending data to the cloud (Toradex)
The concept of the Internet of Things is intrinsically related to the sending of data to the internet and its so-called cloud services. Learn how to join a Toradex Single Board Computer solution with the Azure IoT Hub service to send and receive messages in our next blog. It will help you to develop an IoT application which can read field sensors, present results, and demonstrate business intelligence. Toradex is an Azure IoT certified partner.
Serverless is a new framework that allows developers to easily harness AWS Lambda and Api Gateway to build and deploy full fledged API services without needing to deal with any ops level overhead or paying for servers when they're not in use. It's kinda like Heroku on-demand for single functions.
This document discusses Hopsworks, a Spark-as-a-Service platform built on Hops Hadoop. It provides:
- Secure multi-tenant Spark and Kafka clusters hosted on-premise using YARN.
- Project-based access control and quotas for storage and compute.
- Simplified development of secure Spark Streaming applications with Kafka using automatically distributed certificates.
- Support for Zeppelin notebooks, automated installation, and tools like DrElephant for job monitoring.
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
On-premise Spark as a Service with YARN (Jim Dowling)
Hopsworks provides on-premise Spark-as-a-Service on YARN in Sweden. Built on Hops Hadoop, it offers multi-tenant Spark/Kafka/Flink jobs as a service. Hopsworks uses X.509 certificates for authentication instead of Kerberos, provides project-based access control and quotas, and simplifies writing secure Spark Streaming applications with Kafka.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
The document discusses the design and implementation of Spark Streaming connectors for real-time data sources like Azure Event Hubs. It covers key aspects like connecting Event Hubs to Spark Streaming, designing the connector to minimize resource usage, ensuring fault tolerance through checkpointing and recovery, and managing message offsets and processing rates in a distributed manner. The connector design addresses challenges like long-running receivers, extra resource requirements, and data loss during failures. Lessons from the initial receiver-based approach informed the design of a more efficient solution.
Create a Varnish cluster in Kubernetes for Drupal caching - DrupalCon North A... (Ovadiah Myrgorod)
Varnish is a caching proxy usually used for high-profile Drupal sites. However, configuring Varnish is not an easy task and requires a lot of work. It is even more difficult when it comes to creating a scalable cluster of Varnish nodes.
Fortunately, there is a solution. I've been working on the kube-httpcache project (https://github.com/mittwald/kube-httpcache), which takes care of many things such as routing, scaling, broadcasting, config-reloading, etc.
If you need to run more than one instance of Varnish, this session is for you. You will learn how to:
* Launch a single instance of Varnish in Kubernetes.
* Configure Varnish for Drupal.
* Scale Varnish from 1 to N nodes as part of the cluster.
* Make your Varnish cluster resilient.
* Reload Varnish configs on the fly.
* Properly invalidate cache for multiple Varnish nodes.
This session requires some basic understanding of Docker and Kubernetes; however, I will provide some intro if you are new to it.
Join this session and enjoy!
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks (Slim Baltagi)
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Highlights and Challenges from Running Spark on Mesos in Production by Morri ... (Spark Summit)
This document discusses AppsFlyer's experience running Spark on Mesos in production for retention data processing and analytics. Key points include:
- AppsFlyer processes over 30 million installs and 5 billion sessions daily for retention reporting across 18 dimensions using Spark, Mesos, and S3.
- Challenges included timeouts and errors when using Spark's S3 connectors due to the eventual consistency of S3, which was addressed by using more robust connectors and configuration options.
- A coarse-grained Mesos scheduling approach was found to be more stable than fine-grained, though it has limitations like static core allocation that future Mesos improvements may address.
- Tuning jobs for coarse-
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing, with libraries for SQL, streaming, and advanced analytics.
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Michal Malohlava talks about the PySparkling Water package for Spark and Python users.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
This document summarizes a presentation about bridging the gap between RESTful APIs and Linked Data using GitHub and SPARQL queries. It discusses how grlc maps GitHub repositories of SPARQL queries to Swagger API specifications and endpoints to provide RESTful access to Linked Data without having to code and maintain separate APIs. Features like content negotiation, pagination, caching and containerization are described to improve the usability and performance of the generated APIs. The presentation concludes by demonstrating how grlc allows flexible organization of SPARQL queries and separation of query curation from client applications.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
JConWorld: Continuous SQL with Kafka and Flink (Timothy Spann)
JConWorld: Continuous SQL with Kafka and Flink
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
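As a rough illustration of that flow, the PyFlink sketch below declares a Kafka-backed table and runs a continuous query over it; the topic, brokers, and schema are placeholders, and the Flink Kafka connector jar must be on the classpath.

```python
# Continuous SQL over a Kafka topic with PyFlink's Table API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Placeholder topic/brokers/schema; requires the flink-sql-connector-kafka jar.
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# A continuous aggregation; a real job would INSERT INTO another Kafka-backed table.
t_env.execute_sql(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id").print()
```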
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html
https://www.youtube.com/channel/UCDIDMDfje6jAvNE8DGkJ3_w?view_as=subscriber
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Developing Apache Spark jobs in .NET using Mobius (shareddatamsft)
Slides used for the talk "Developing Apache Spark Jobs in .NET using Mobius" at dotnetfringe 2016 (http://lanyrd.com/2016/netfringe/sfcxpx).
Apache Spark is an open source data processing framework built for big data processing and analytics. Ease of programming and high performance relative to the traditional big data tools and platforms and a unified API to solve a diverse set of complex data problems drove the rapid adoption of Spark in the industry. Apache Spark APIs in Scala, Java, Python and R cater to a wide range of big data professionals and a variety of functional roles. Mobius is an open source project that aims to bring Spark's rich set of capabilities to the .NET community. Mobius project added C# as another first-class programming language for Apache Spark and currently supports RDD, DataFrame and Streaming API. With Mobius, developers can build Spark jobs in C# and reuse their existing .NET libraries with Apache Spark. Mobius is open-sourced at http://github.com/Microsoft/Mobius. This project has received great support from the .NET community and positive feedback from the Spark enthusiasts
Jump Start with Apache Spark 2.0 on Databricks (Anyscale)
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
We present Spark Serving, a new spark computing mode that enables users to deploy any Spark computation as a sub-millisecond latency web service backed by any Spark Cluster. Attendees will explore the architecture of Spark Serving and discover how to deploy services on a variety of cluster types like Azure Databricks, Kubernetes, and Spark Standalone. We will also demonstrate its simple yet powerful API for RESTful SparkSQL, SparkML, and Deep Network deployment with the same API as batch and streaming workloads. In addition, we will explore the "dual architecture": HTTP on Spark. This architecture converts any spark cluster into a distributed web client with the familiar and pipelinable SparkML API. These two contributions provide the fundamental spark communication primitives to integrate and deploy any computation framework into the Spark Ecosystem. We will explore how Microsoft has used this work to leverage Spark as a fault-tolerant microservice orchestration engine in addition to an ETL and ML platform. And will walk through two examples drawn from Microsoft's ongoing work on Cognitive Service composition, and unsupervised object detection for Snow Leopard recognition.
Similar to DC Spark bake off - Realtime TCP Packet Analysis using Spark and Azure Event Hubs
UI5con 2024 - Bring Your Own Design System (Peter Muessig)
How do you combine the OpenUI5/SAPUI5 programming model with a design system that makes its controls available as Web Components? Since OpenUI5/SAPUI5 1.120, the framework supports the integration of arbitrary Web Components. This makes it possible, for example, to natively embed your design system's own Web Components, created with tools such as Stencil. The integration embeds the Web Components so that they can be used naturally in XMLViews, like standard UI5 controls, and can be bound with data binding. Learn how you can make use of the Web Components base class in OpenUI5/SAPUI5 to integrate your own Web Components, and get inspired by the solution to generate a custom UI5 library providing the control wrappers for the native ones.
Hand Rolled Applicative User Validation Code Kata (Philip Schwarz)
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather to provide a small, rough-and-ready exercise to reinforce your muscle memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
By engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
What to do when you have a perfect model for your software but you are constrained by an imperfect business model?
This talk explores the challenges of bringing modelling rigour to the business and strategy levels, and talking to your non-technical counterparts in the process.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr... (XfilesPro)
Wondering how X-Sign gained popularity in such a short span of time? This eSign functionality of XfilesPro DocuPrime has many advancements to offer for Salesforce users. Explore them now!
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions (Peter Muessig)
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can easily be extended to your needs. This session showcases various tooling extensions that can significantly boost your development experience: working truly offline, transpiling the code in your project to use even newer versions of ECMAScript than the 2022 edition currently supported by the UI5 tooling, consuming any npm package of your choice in your project, using different kinds of proxies, and even stitching UI5 projects together during development to mimic your target environment.
Using Query Store in Azure PostgreSQL to Understand Query Performance (Grant Fritchey)
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies (Quickdice ERP)
Explore the seamless transition to e-invoicing with this comprehensive guide tailored for Saudi Arabian businesses. Navigate the process effortlessly with step-by-step instructions designed to streamline implementation and enhance efficiency.
The Key to Digital Success: A Comprehensive Guide to Continuous Testing Integ... (kalichargn70th171)
In today's business landscape, digital integration is ubiquitous, demanding swift innovation as a necessity rather than a luxury. In a fiercely competitive market with heightened customer expectations, the timely launch of flawless digital products is crucial for both acquisition and retention—any delay risks ceding market share to competitors.
Preparing Non-Technical Founders for Engaging a Tech Agency (ISH Technologies)
Preparing non-technical founders before engaging a tech agency is crucial for the success of their projects. It starts with clearly defining their vision and goals, conducting thorough market research, and gaining a basic understanding of the relevant technologies. Setting realistic expectations and preparing a detailed project brief are essential steps. Founders should select a tech agency with a proven track record and establish clear communication channels. Additionally, addressing legal and contractual considerations and planning for post-launch support are vital to ensure a smooth and successful collaboration. This preparation empowers non-technical founders to communicate their needs effectively and work seamlessly with their chosen tech agency. Visit our site for more details. Contact us today: www.ishtechnologies.com.au
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech is a premier mobile app development company in Noida, providing cutting-edge solutions for businesses.
Visit us at: https://www.dronainfotech.com/mobile-application-development/
WWDC 2024 Keynote Review: For CocoaCoders Austin (Patrick Weigel)
Overview of WWDC 2024 Keynote Address.
Covers: Apple Intelligence, iOS18, macOS Sequoia, iPadOS, watchOS, visionOS, and Apple TV+.
Understandable dialogue on Apple TV+
On-device app controlling AI.
Access to ChatGPT with a guest appearance by Chief Data Thief Sam Altman!
App Locking! iPhone Mirroring! And a Calculator!!
DC Spark bake off - Realtime TCP Packet Analysis using Spark and Azure Event Hubs
1. Washington DC Area Apache Spark Interactive
Spark Bake-off
Team Name: Silvio Fiorito
Solution Title: Real-time Packet Analysis using Spark
2. Team Introductions
Silvio Fiorito
– Background in development and app security
– Started working with Hadoop in 2012
– Started using Spark at v0.6 in early 2013
– Built a few prototypes for low-latency query services with Spark/Shark and then SparkSQL
– Twitter: @granturing
3. Solution Overview
Real-time TCP packet analysis of geographically distributed hosts
– Must support high throughput from many hosts
– 3 demo VMs (2 x Azure & 1 x AWS)
Local Flume agent pushes events to Azure Event Hub
Events are partitioned and persisted up to 7 days
Spark Streaming app ingests streams
– Reconstruct packets
– Lookups for geo-ip and port description
– Clusters using pre-trained k-means model
– Saves data to Azure Table Storage and publishes on Service Bus Topic
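As a speculative reconstruction, the sketch below re-expresses that pipeline in today's Structured Streaming API (the original demo predates it and used DStreams). The connection string and model path are placeholders; packet reconstruction, geo-IP lookup, and feature assembly are elided; and the azure-event-hubs-spark connector is assumed to be on the classpath.

```python
# Hedged sketch: Event Hubs -> Spark -> k-means scoring -> (console stand-in sink).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.clustering import KMeansModel

spark = SparkSession.builder.appName("tcp-packet-analysis").getOrCreate()
sc = spark.sparkContext

conn = "<EVENT-HUB-CONNECTION-STRING>"  # placeholder
ehConf = {
    # The connector expects the connection string encrypted via its helper class.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn),
}

raw = spark.readStream.format("eventhubs").options(**ehConf).load()
packets = raw.select(col("body").cast("string"))  # packet reconstruction elided

model = KMeansModel.load("/models/packet-kmeans")  # hypothetical pre-trained model
# `transform` assumes a 'features' vector column has been assembled upstream.
scored = model.transform(packets)

(scored.writeStream
       .outputMode("append")
       .format("console")  # stand-in for the Azure Table Storage / Service Bus sinks
       .start()
       .awaitTermination())
```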
6. Final Comments & Questions
With more time
– Add true anomaly detection with MLlib
– Test on hosts with real traffic
– Wire up end-to-end with d3.js viz and SparkSQL backend
– Integrate with existing IDS/IPS rules
– Bad IPs lookup