Building and deploying an analytics service in the cloud is a challenge; maintaining it is a bigger one. In a world where users increasingly provision cluster instances on the fly, use them for analytics or other workloads, and shut them down when the jobs are done, containers and container orchestration are more relevant than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share our experience of building such a service and deploying it on a Kubernetes cluster. We will discuss the requirements that an enterprise-grade Hadoop/Spark cluster running on containers places on a container orchestrator.
This talk will cover in detail how Kubernetes can meet our needs for resource management, scheduling, networking and network isolation, volume management, and more. We will discuss how we replaced our home-grown container orchestrator, which managed the container lifecycle and resources according to our requirements, with Kubernetes. We will also present the orchestrator features that are helping us deploy and patch thousands of containers, along with a list of areas that we believe can be improved or enhanced in a container orchestrator.
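The core job the talk describes, replacing a home-grown orchestrator that keeps containers matching a desired state, can be sketched as a reconciliation loop. This is a toy illustration only, not the Kubernetes API; all names are hypothetical.

```python
# Toy sketch of the reconciliation idea behind a container orchestrator:
# compare desired replica counts against running containers and converge.
def reconcile(desired, running):
    """Return (to_start, to_stop) lists of container names.

    desired: {component: replica count wanted}
    running: {component: replica count currently running}
    """
    to_start, to_stop = [], []
    for name, want in desired.items():
        have = running.get(name, 0)
        if want > have:
            to_start += [f"{name}-{i}" for i in range(have, want)]
        elif want < have:
            to_stop += [f"{name}-{i}" for i in range(want, have)]
    return to_start, to_stop

# Two datanode containers are missing, so the loop would start them.
start, stop = reconcile({"datanode": 3}, {"datanode": 1})
```

A real orchestrator layers scheduling, networking, and volume management on top of this loop, which is exactly the feature list the talk evaluates Kubernetes against.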
Speaker
Rachit Arora, SSE, IBM
Providing truly interactive and scalable BI on Hadoop has proven to be one of the biggest challenges preventing legacy EDW OLAP systems from completing their transition to Hadoop. While we have all seen benchmarks that run consecutive queries and claim success, thousands of concurrent business users sending complex generated queries from their dashboards over billions of records, at interactive speed, is yet to be seen.
In this session we will discuss how an architecture that replaces the full-scan, brute-force approach with adaptive indexing and auto-generated cubes can dramatically reduce the resources and effort per query, resulting in interactive performance for high-concurrency workloads, and explain how this is achieved with minimal data engineering effort. We will also discuss how this architecture can be seamlessly integrated with Hive to provide a complete OLAP-on-Hadoop solution.
The session will include a live demo of complex business dashboards connected to Hive and accessing billions of rows at interactive speed.
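The intuition behind auto-generated cubes can be shown in a few lines: aggregate once at ingest time so the dashboard query becomes a lookup instead of a scan. This is a toy illustration, not JethroData's implementation.

```python
from collections import defaultdict

# Toy illustration of why a pre-aggregated cube avoids full scans.
# Source rows: (region, month, revenue).
rows = [("US", "2017-01", 100), ("US", "2017-02", 150), ("EU", "2017-01", 80)]

# Build a cube keyed on (region, month) once, at ingest time.
cube = defaultdict(int)
for region, month, revenue in rows:
    cube[(region, month)] += revenue

# The dashboard query "revenue for US in 2017-01" is now a dictionary
# lookup; its cost no longer grows with the number of raw rows.
us_jan = cube[("US", "2017-01")]
```

With billions of rows, the per-query saving compounds across thousands of concurrent users, which is the concurrency story the session focuses on.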
Speaker
Boaz Raufman, CTO and Co-Founder, JethroData
Realizing the promise of portable data processing with Apache Beam (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g., Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premises), and give a glimpse at some of the challenges Beam aims to address in the future.
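Beam's "run anywhere" idea boils down to separating what a pipeline computes from which engine executes it. The hypothetical mini-runner below is not Beam's API; it only illustrates that separation.

```python
# A pipeline described as data: a list of (kind, function) transforms.
# Each "runner" interprets the same description its own way.
pipeline = [("map", lambda x: x * 2), ("filter", lambda x: x > 2)]

def direct_runner(pipeline, data):
    """Naive in-process interpreter, standing in for Beam's DirectRunner."""
    for kind, fn in pipeline:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

# The identical pipeline object could be handed to a Spark- or
# Flink-backed runner; only the interpretation changes, not user code.
result = direct_runner(pipeline, [1, 2, 3])
```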
It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage.
One of the central questions today is finding insights from data in these storage systems in an interactive manner, at a fraction of the cost.
Interactive Query, leveraging Hive on LLAP from Apache Hive 2.1, brings interactivity to your complex data-warehouse-style queries on large datasets stored in commodity cloud storage.
In this session, you will learn how technologies such as Low Latency Analytical Processing (LLAP) and Hive 2.x make it possible to analyze petabytes of data with sub-second latency in common file formats such as CSV and JSON, without converting to columnar formats like ORC or Parquet. We will go deep into LLAP's performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We will also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI to query their data lake interactively without moving data out of it.
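Much of LLAP's sub-second latency on cloud storage comes from keeping hot column data in memory and evicting cold data. The sketch below shows the general caching pattern with simple LRU eviction; it is illustrative only, not LLAP's actual cache policy or code.

```python
from collections import OrderedDict

# Toy column-chunk cache: hits are served from memory, misses fall
# through to (slow) cloud storage, and the least recently used chunk
# is evicted when the cache is full.
class ColumnCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()

    def get(self, key, load):
        if key in self.chunks:
            self.chunks.move_to_end(key)    # mark as recently used
            return self.chunks[key]
        value = load(key)                    # cache miss: read remote storage
        self.chunks[key] = value
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict least recently used
        return value

cache = ColumnCache(capacity=2)
cache.get("a", lambda k: "chunk-a")
cache.get("b", lambda k: "chunk-b")
cache.get("a", lambda k: "chunk-a")  # refreshes "a"
cache.get("c", lambda k: "chunk-c")  # evicts "b"
```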
Speaker
Ashish Thapliyal, Principal Program Manager, Microsoft Corp
An elastic batch- and stream-processing stack with Pravega and Apache Flink
Stream processing is a popular paradigm that is becoming more relevant as many applications provide low-latency response time and new application domains emerge that naturally demand data to be processed in motion. One particularly attractive characteristic of the stream-processing paradigm is that it conceptually unifies batch processing (bounded/static historic data) and continuous near-real-time data processing (unbounded streaming event data).
Implementing a unified batch and streaming data architecture is in practice not seamless—near-real-time event data and bulk historic data use different storage systems (message queues or logs vs. file systems or object stores). Consequently, running the same analysis now and at some arbitrary time in the future (e.g., months, possibly years ahead) means dealing with different data sources and APIs. Few systems are capable of handling both near-real-time streaming workloads and large batch workloads at the same time. And streaming workloads tend to be inherently dynamic, requiring both storage and compute to adjust continuously for maximum resource efficiency.
In this talk, we present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” while dynamically accommodating workload variations in a novel way. Pravega enables the ingestion capacity of a stream to grow and shrink according to workload and sends signals downstream to enable Flink to scale accordingly.
Pravega offers permanent streaming storage, exposing an API that enables applications to access data either in near-real time or at any arbitrary time in the future, in a uniform fashion. Apache Flink's SQL and streaming APIs provide a common interface for processing continuous near-real-time data, sets of historic data, or combinations of both. A deep integration between these two systems gives end-to-end exactly-once semantics for pipelines of streams and stream processing and lets both systems jointly scale and adjust automatically to changing data rates.
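The "everything as a stream" idea above, one append-only log serving both historical (batch) reads and tail (streaming) reads through one API, can be sketched in a few lines. This is purely illustrative and not Pravega's API.

```python
# Toy unified log: the same read() call serves batch-style history
# and stream-style tail reads, so "now" and "months later" use one API.
class StreamLog:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1   # position (offset) of the event

    def read(self, start=0, end=None):
        """start=0 reads all history; start=head reads only new events."""
        return self.events[start:end]

log = StreamLog()
for e in ["order", "ship", "deliver"]:
    log.append(e)

history = log.read(0)   # batch-style: everything written so far
tail = log.read(2)      # stream-style: only the newest events
```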
Speakers:
Stephan Ewen, Co-Founder/ CTO, Data Artisans
Flavio Junqueira, Engineering Lead, Pravega by DellEMC
Accelerating TensorFlow with RDMA for high-performance deep learning
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
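One of the features listed above, message coalescing, has a simple core idea: pack many small gradient updates into fewer, larger sends to amortize per-message overhead. The sketch below is illustrative only; the real work happens inside the RDMA-enhanced gRPC channel.

```python
# Toy message coalescing: batch small payloads up to a byte budget.
def coalesce(messages, max_batch_bytes):
    """Group byte strings into batches of at most max_batch_bytes each."""
    batches, current, size = [], [], 0
    for msg in messages:
        if current and size + len(msg) > max_batch_bytes:
            batches.append(b"".join(current))   # flush the full batch
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(b"".join(current))
    return batches

# Four 2-byte updates become two 4-byte sends: half as many
# network round trips for the same data volume.
sends = coalesce([b"aa", b"bb", b"cc", b"dd"], max_batch_bytes=4)
```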
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
Stream processing has become the de facto standard for building real-time ETL and streaming analytics applications. We see batch workloads moving into stream processing to act on data and derive insights faster. With the explosion of data carrying "perishable insights," such as IoT and machine-generated data, stream processing combined with predictive analytics is driving tremendous business value. This is evidenced by the proliferation of stream-processing frameworks: the proven and evolving Apache Storm, and newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming.
Today, users have to choose among these frameworks and understand the benefits of each; on top of that, they have to learn new APIs and operationalize their applications. To create value faster, we are introducing a new open source tool: Streamline. It is a self-service framework that eases building streaming applications and deploying them across whichever frameworks/engines users prefer, in a snap. It simplifies integration with machine learning models for scoring and classification of data for predictive analytics, and it provides an elegant way to build analytics dashboards that derive business insights from streaming data and make them easy for business users to consume.
In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline capabilities to show how it simplifies building real-time streaming analytics applications.
Speaker:
Priyank Shah, Staff Software Engineer, Hortonworks
Running secured Spark job in Kubernetes compute cluster and integrating with ...
This presentation will provide technical design and development insights for running a secured Spark job in a Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and access data from HDFS using a Kerberos principal and delegation token.
The first part of this presentation will present the design and best practices for deploying and running Spark in Kubernetes integrated with HDFS: creating an on-demand multi-node Spark cluster at job submission, installing and resolving software dependencies (packages), executing and monitoring the workload, and finally disposing of the resources on job completion. The second part covers the design and development details of setting up a Spark-on-Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by seamlessly creating and renewing Kerberos delegation tokens from the end user's Kerberos principal.
All the techniques covered in this presentation are essential for setting up a Spark-on-Kubernetes compute cluster that accesses data securely from a distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
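The token-renewal requirement for long-running jobs follows a simple pattern: renew well before expiry so executors never hold a stale token. The sketch below is hypothetical (names, intervals, and the renewal call are illustrative, not Hadoop or Kerberos APIs).

```python
import time

# Toy delegation-token renewer: a long-running job checks periodically
# and renews once the token passes a fraction of its lifetime.
class TokenRenewer:
    def __init__(self, lifetime, renew_fraction=0.75):
        self.lifetime = lifetime            # token lifetime in seconds
        self.renew_fraction = renew_fraction
        self.issued_at = time.monotonic()

    def needs_renewal(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.issued_at >= self.lifetime * self.renew_fraction

    def renew(self, now=None):
        # In a real cluster this would contact the NameNode / KDC
        # with the job's Kerberos credentials.
        self.issued_at = time.monotonic() if now is None else now

r = TokenRenewer(lifetime=100.0)
t0 = r.issued_at
early = r.needs_renewal(now=t0 + 10)   # token still fresh
late = r.needs_renewal(now=t0 + 80)    # past 75% of lifetime: renew
```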
Speaker
Joy Chakraborty, Data Architect
Present and future of unified, portable, and efficient data processing with A...
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We'll focus on the present state of the community and the current status of the Beam ecosystem. We'll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and Streaming SQL. Finally, we'll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Stream processing is emerging as a popular paradigm for data processing architectures because it handles the continuous nature of most data and computation, removing artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than batch processing.
We will look at these and other myths and debunk them using Apache Flink as an example. We will discuss Apache Flink's approach to high-performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We will also take a sneak preview of the next steps for Flink.
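Myth 1 above is worth a tiny counter-example: with replayable input and a sink that deduplicates by offset, a failure-and-replay still yields exactly-once *results*. This sketch is illustrative only, not Flink's checkpointing code.

```python
# Toy exactly-once sink: duplicate deliveries caused by a replay are
# detected by offset, so the aggregated result stays correct.
class DedupSink:
    def __init__(self):
        self.seen = set()
        self.total = 0

    def write(self, offset, value):
        if offset in self.seen:     # duplicate delivery after a replay
            return
        self.seen.add(offset)
        self.total += value

sink = DedupSink()
events = [(0, 5), (1, 7), (2, 3)]
for off, v in events:
    sink.write(off, v)
for off, v in events[1:]:           # simulate a replay after a failure
    sink.write(off, v)
# total reflects each event exactly once despite at-least-once delivery
```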
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: a short introductory lecture on Apache Spark covering the core modules (SQL, Streaming, MLlib, GraphX), followed by a demo, lab exercises, and a Q&A session. The lecture will be followed by lab time to work through the exercises and ask questions.
Objective: to provide a quick, hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries.
Lab prerequisites: registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16 GB of RAM available (note that the sandbox is over 10 GB in size, so we recommend downloading it before the crash course).
Speaker: Robert Hryniewicz
Data Ingest Self Service and Management using NiFi and Kafka
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding them all by hand. In this talk we will share the best practices we developed to handle our 300% increase in feeds through self-service. Self-service capabilities will increase your team's velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
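The self-service idea in the list above is essentially configuration over construction: a new feed becomes a registry entry and the ingest pipeline is generated from a template. The sketch below is hypothetical; the names are illustrative, not NiFi's or Kafka's APIs.

```python
# Toy feed registry: one template, many feeds. Adding a feed is a
# config entry instead of a hand-built pipeline.
TEMPLATE = {"source": None, "topic": None, "schema_check": True}

def register_feed(name, source):
    """Generate an ingest config for a new feed from the template."""
    config = dict(TEMPLATE)
    config["source"] = source
    config["topic"] = f"ingest.{name}"   # convention-based Kafka topic name
    return config

# A data owner registers a feed without touching pipeline internals.
feed = register_feed("orders", source="sftp://partner/orders")
```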
We discuss the current state of LLAP (Live Long and Process), the concurrent, sub-second execution engine for analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, and Azure), JIT-friendly operator pipelines, asynchronous I/O, data prefetching, and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries and enabling the system to preempt lower-priority tasks without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution: the elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, and future work, including how LLAP fits into a unified, secure DataFrame access layer.
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architectural option for building your next-generation ETL data pipeline in near real time. But what does it take to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada's (RBC) journey of moving away from traditional ETL batch processing with Teradata toward using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized "client event" ingestion point for the bank's internal systems through either a web service or a daily batch text-file feed. ESS allows downstream reporting applications and end users to query these centralized events.
We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly of trying to adopt these technologies in the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
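The "standardization" step an ESS-style service performs can be sketched as mapping heterogeneous source events onto one canonical schema before they reach Kafka. The field names below are hypothetical, not RBC's actual schema.

```python
# Toy event standardizer: rename source fields to a canonical schema
# and flag events with missing required fields for a dead-letter path.
CANONICAL_FIELDS = ("client_id", "event_type", "timestamp")

def standardize(raw, field_map):
    """field_map: {canonical field name: source field name}."""
    event = {canon: raw.get(src) for canon, src in field_map.items()}
    missing = [f for f in CANONICAL_FIELDS if event.get(f) is None]
    return event, missing   # events with gaps go to a dead-letter queue

event, missing = standardize(
    {"cust": "C42", "type": "login", "ts": "2017-06-01T12:00:00"},
    {"client_id": "cust", "event_type": "type", "timestamp": "ts"},
)
```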
Speakers
Darryl Sutton, T4G, Principal Consultant
Kenneth Poon, RBC, Director, Data Engineering
At ING we needed a way to take data science models from exploration into production. I will give this talk from my experience as a senior ops engineer on the exploration and production Hadoop environments. For this we use OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how it is done at ING, including how to set up a Docker container running a data science model using Hive, Python, and Spark. I'll explain how to use Dockerfiles to build Docker images, add all the needed components inside the image, and run different versions of software in different containers.
At the end I will also give a demo of how it runs and how it is automated using Git with a webhook connecting to Jenkins, which starts the Docker service that connects to the big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientists.
Speaker
Lennard Cornelis, Ops Engineer, ING
Omid: scalable and highly available transaction processing for Apache Phoenix
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure the correctness of operations, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid—an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid and Tephra are now configurable choices for the Phoenix transaction processing backend, enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration required introducing many new features and operations to Omid and will become generally available in early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Senior Research Scientist, Yahoo Research, Oath
Bringing complex event processing to Spark Streaming
Complex event processing (CEP) is about identifying business opportunities and threats in real time by detecting patterns in data and taking appropriate automated action. Example business use cases for CEP include location-based marketing, smart inventories, targeted ads, Wi-Fi offloading, fraud detection, churn prediction, fleet management, predictive maintenance, security incident event management, and many more. While Spark Streaming provides a distributed, resilient framework for ingesting events in real time, effort is still needed to build CEP applications. This is because CEP use cases require correlation of events, which in turn requires us to treat every incoming event as a discrete occurrence in time, whereas Spark Streaming treats the entire batch of events as a single occurrence. Many CEP use cases also require alerts to be fired even when there is no incoming event. An example of such a use case is firing an alert when an order-shipped event is NOT received within the SLA window following an order-received event. At Oracle we have adopted a few neat techniques, such as running continuous query engines as long-running tasks and using empty batches as triggers, to bring complex event processing to Spark Streaming.
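To make the SLA pattern above concrete, here is a simplified, hypothetical sketch of the timeout logic; it is not Oracle's implementation. In a real deployment this state would live inside a continuous query engine running as a long-lived Spark Streaming task, with empty micro-batches acting as clock ticks so the alert can fire even when no new event arrives.

```python
from dataclasses import dataclass

@dataclass
class Event:
    order_id: str
    kind: str      # "received" or "shipped"
    ts: float      # event time in seconds

class SlaMonitor:
    """Fires an alert when order-shipped does not follow order-received in time."""

    def __init__(self, sla_seconds):
        self.sla = sla_seconds
        self.pending = {}   # order_id -> timestamp of the order-received event

    def on_event(self, ev):
        if ev.kind == "received":
            self.pending[ev.order_id] = ev.ts
        elif ev.kind == "shipped":
            self.pending.pop(ev.order_id, None)

    def on_tick(self, now):
        """Called on every micro-batch, even an empty one (the clock tick)."""
        late = [oid for oid, t in self.pending.items() if now - t > self.sla]
        for oid in late:
            del self.pending[oid]
        return late   # order ids that breached the SLA

monitor = SlaMonitor(sla_seconds=60)
monitor.on_event(Event("A", "received", 0.0))
monitor.on_event(Event("B", "received", 10.0))
monitor.on_event(Event("B", "shipped", 30.0))
print(monitor.on_tick(now=90.0))   # order "A" breached the 60s SLA
```

The key point is the `on_tick` path: the alert is driven by the passage of time, not by an incoming event, which is exactly what plain batch-at-a-time processing does not give you for free.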
Join us to learn more about CEP for Spark, the fastest-growing data processing platform in the world.
Speakers
Prabhu Thukkaram, Senior Director, Product Development, Oracle
Hoyong Park, Architect, Oracle
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
Speakers
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Druid and Hive Together: Use Cases and Best Practices
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. On the other hand, Druid excels at low-latency, interactive queries over streaming data and making data available for queries in real time. Although the high-level messaging presented by both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary.
By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency real-time streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers.
The agenda of the talk:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
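As a small illustration of the integration discussed above, the sketch below builds a Hive DDL statement that stores a table through Druid's storage handler, making Druid-indexed data queryable from Hive. The table and column names are invented for illustration, and exact table properties vary by Hive/Druid version.

```python
# Illustrative only: a Hive table backed by Druid via the Druid storage
# handler. In practice this DDL would be submitted through beeline or a
# Hive JDBC client, not assembled in Python.
druid_backed_ddl = """
CREATE TABLE page_views (
  `__time` TIMESTAMP,   -- Druid requires a time column named __time
  page     STRING,
  user_id  STRING,
  views    BIGINT
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY");
"""

print(druid_backed_ddl.strip().splitlines()[0])
```

Once such a table exists, Hive queries against it are pushed down to Druid where possible, which is what delivers the low-latency interactive behavior described above.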
Apache Ambari is an extensible framework that simplifies provisioning, managing, and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer, thereby providing a consistent approach for managing and monitoring the services. This also provides a natural extension point for operators and the community to bring in their own add-on services and plug the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been the strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack, so a Hadoop operator can only deploy services defined in that stack; a cluster can have only a single instance of a service; and a host can have only a single instance of a component. Considering the various use cases that cannot be enabled due to these limitations, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that haven't been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package (mpack) is a self-contained deployment artifact that includes all the details for deploying, managing, and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug their services into a stack and lets them define their own upgrade story for these custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will also discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases: supporting multiple instances of a service in a cluster, multiple instances of a component on a host, and future-proofing the Ambari architecture to leverage some of the advancements happening in the Hadoop community, such as YARN services (YARN-4692). We will wrap up with a brief overview of other improvements planned for future releases of Ambari.
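To make the mpack model above concrete, here is a purely hypothetical sketch of what a management-package descriptor might contain. The field names are invented for illustration and are not the actual Ambari mpack schema; the point is that the package carries its own services and its own upgrade story, independent of any stack.

```python
import json

# Hypothetical descriptor: all field names are illustrative, not the real
# Ambari mpack format.
mpack = {
    "name": "custom-streaming-mpack",
    "version": "1.0.0",
    "services": [
        {"name": "NIFI", "version": "1.2.0"},
        {"name": "KAFKA", "version": "0.10.1"},
    ],
    # The mpack defines its own upgrade story, decoupled from any stack.
    "upgrade": {"from_versions": ["0.9.*"], "orchestration": "rolling"},
}

print(json.dumps(mpack, indent=2))
```

Under such a model, the operator composes a cluster from several packages rather than picking every service from one monolithic stack.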
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
This paper will present the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards low-latency consumers.
We will also look into the evolution of the service in terms of prominent features added since its initial launch and the motivation behind them: some were customer asks, while others were driven by optimizing the efficiency and footprint of the deployed infrastructure. Some of the features we will touch upon are:
* Delivery Completeness Audit WebService
* Publisher Daemon & Client API Robustness
* Aggregated HDFS File Delivery
* Filters for Low Latency Delivery
* Schema Registry
* Adaptive Rate Limiting
* Various Load Balancing Techniques
* Event Deduplication
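To illustrate one of the features above, here is a minimal token-bucket sketch of the kind of rate limiting a publisher daemon might apply. This is not Yahoo's actual implementation; the "adaptive" variant described in the talk would additionally adjust the refill rate based on downstream backpressure.

```python
import time

class TokenBucket:
    """Admit events at a sustained rate, allowing short bursts."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # sustained refill rate
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False            # caller should retry, buffer, or shed load

bucket = TokenBucket(rate_per_sec=100, burst=10)
accepted = sum(bucket.allow() for _ in range(50))
print(accepted)   # roughly the burst size, since the loop runs far faster than refill
```

An adaptive limiter would replace the fixed `rate_per_sec` with a feedback signal, e.g. shrinking it when delivery latency to a sink climbs.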
Aggregated Daily Metrics
* Events Ingested: 250 Billion
* Bytes Ingested (Uncompressed): 700 Terabytes
* Bytes Delivered (Batch + Near Real Time): 1.5 Petabytes
* Near Real Time Delivery (Storm & Kafka) Latency: 95th percentile 500 ms - 1 second
* Batch Delivery Latency (Aggregated into 1-minute files): 95th percentile within 3 minutes
* Production H/W Footprint: 651
* Total Active Event Schema Types: ~200
Underlying Technology Stack: ZeroMQ, Apache Avro, libevent, Apache HttpComponents
The paper will conclude with the next steps we’re considering as a logical evolution for Data Highway in light of considerable developments in similar open source projects such as Apache Kafka.
Many organizations today process many types of data in many formats. Most often this data is free-form, and as the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema. A schema helps data consumers know what type of data to expect, and shields them from immediate impact if an upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate with and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
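The core registry behavior described above can be sketched in a few lines: schema versions are registered under a subject and resolved by version, so a consumer pinned to version 1 keeps working after version 2 appears. This toy model is illustrative only; a real registry also enforces compatibility rules (e.g. Avro backward compatibility) and emits change notifications.

```python
class SchemaRegistry:
    """Toy in-memory registry: subjects map to ordered schema versions."""

    def __init__(self):
        self.subjects = {}   # subject -> list of schemas, index = version - 1

    def register(self, subject, schema):
        versions = self.subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)            # 1-based version number

    def get(self, subject, version=None):
        versions = self.subjects[subject]
        return versions[-1] if version is None else versions[version - 1]

reg = SchemaRegistry()
v1 = reg.register("clickstream", {"fields": ["ts", "url"]})
v2 = reg.register("clickstream", {"fields": ["ts", "url", "user_id"]})
print(v1, v2)                       # 1 2
print(reg.get("clickstream", 1))    # old consumers still resolve version 1
```

The upstream producer moves to version 2 without breaking consumers, which is exactly the decoupling the abstract argues for.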
Building and deploying an analytic service on cloud is a challenge; maintaining the service is a bigger one. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, the relevance of containers and container orchestration is more important than ever. In short, customers are looking for serverless Spark clusters. The intent of this presentation is to share what serverless Spark is and the benefits of running Spark in a serverless manner.
Storage Requirements and Options for Running Spark on Kubernetes
In a world of serverless computing, users tend to be frugal when it comes to expenditure on compute, storage, and other resources, and paying for them when they aren’t in use becomes a significant factor. Offering Spark as a service on cloud presents unique challenges, and running Spark on Kubernetes presents many of them, especially around storage and persistence. Spark workloads have very specific storage requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even more stringent when the same must be offered as a service to enterprises that need to meet GDPR and other compliance requirements such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
This talk will help people running Kubernetes or the Docker runtime in production understand the various storage options available, which are most suitable for running Spark workloads on Kubernetes, and what more can be done.
Present and future of unified, portable, and efficient data processing with A...
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than Batch Processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries.
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speaker: Robert Hryniewicz
Data Ingest Self Service and Management using NiFi and Kafka
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding them all by hand. In this talk we will share the best practices we developed to handle this threefold increase in feeds through self service. Self-service capabilities will increase your team's velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
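A hypothetical sketch of what self-service feed onboarding can look like: instead of wiring each feed by hand, the platform accepts a small declarative spec, validates it, and tooling turns it into a NiFi flow plus a Kafka topic. The field names below are invented for illustration and are not from the talk.

```python
# Required fields a feed spec must declare before the platform accepts it.
REQUIRED = {"name", "source", "format", "kafka_topic"}

def validate_feed(spec):
    """Reject incomplete specs up front, before any flow is generated."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"feed spec missing fields: {sorted(missing)}")
    return spec

feed = validate_feed({
    "name": "web_clicks",
    "source": "sftp://landing/web/",   # illustrative source location
    "format": "json",
    "kafka_topic": "raw.web_clicks",
    "retention_days": 7,               # optional governance metadata
})
print(feed["name"])
```

Pushing validation into a spec like this is what makes automatic debugging and lightweight governance tractable: every feed is described the same way, so tooling can reason about all of them.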
We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next-generation ETL data pipeline in near real time. But what does it take to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada’s (RBC) journey of moving away from traditional ETL batch processing with Teradata towards using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized “client event” ingestion point for the bank’s internal systems through either a web service or a daily text-file batch feed. ESS allows downstream reporting applications and end users to query these centralized events.
We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speakers
Darryl Sutton, T4G, Principal Consultant
Kenneth Poon, RBC, Director, Data Engineering
At ING we needed a way to implement Data science models from exploration into production. I will do this talk from my experience on the exploration and production Hadoop environment as a senior Ops engineer. For this we are using OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how this is done at ING. Also how to set up a docker container running a data science model using Hive, Python, and Spark. I’ll explain how to use Docker files to build Docker images, add all the needed components inside the Docker image, and how to run different versions of software in different containers.
In the end I will also give a demo of how it runs and is automated using Git with webhook connecting to Jenkins and start the docker service that will connect to a big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientist.
Speaker
Lennard Cornelis, Ops Engineer, ING
Omid: scalable and highly available transaction processing for Apache PhoenixDataWorks Summit
Apache Phoenix is an OLTP and operational analytics for Hadoop. To ensure operations correctness, Phoenix requires that a transaction processor guarantees that all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has been using the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix the support for Apache Omid—an open source transaction processor for HBase that is used at Yahoo at a large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid, as well as Tephra, are now configurable choices for the Phoenix transaction processing backend, being enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration requires introducing many new features and operations to Omid and will become generally available early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Senior Research Scientist, Yahoo Research, Oath
Bringing complex event processing to Spark streamingDataWorks Summit
Complex event processing (CEP) is about identifying business opportunities and threats in real time by detecting patterns in data and taking appropriate automated action. Example business use cases for CEP include location-based marketing, smart inventories, targeted ads, Wi-Fi offloading, fraud detection, churn prediction, fleet management, predictive maintenance, security incident event management, and many more. While Spark Streaming provides a distributed resilient framework for ingesting events in real time, effort is still needed to build CEP applications. This is because CEP use cases require correlation of events, which in turn requires us to treat every incoming event as a discrete occurrence in time. Spark Streaming treats the entire batch of events as single occurrence. Many CEP use cases also require alerts to be fired even when there is no incoming event. An example of such use case is to fire an alert when an order-shipped event is NOT received within the SLA times following an order-received event. At Oracle we have adopted a few neat techniques like running continuous query engines as long running tasks, using empty batches as triggers, etc. to bring complex event processing to Spark Streaming.
Join us to learn more on CEP for Spark, the fastest growing data processing platform in the world.
Speakers
Prabhu Thukkaram, Senior Director, Product Development, Oracle
Hoyong Park, Architect, Oracle
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Databricks
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although not optimized for ingesting streaming data and making it available for queries in realtime. On the other hand, Druid excels at low-latency, interactive queries over streaming data and making data available in realtime for queries. Although the high level messaging presented by both projects may lead you to believe they are competing for same use case, the technologies are in fact extremely complementary solutions.
By combining the rich query capabilities of Hive with the powerful realtime streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low latency realtime streaming analytics solutions. In this talk we will discuss the motivation to combine Hive and Druid together alongwith the benefits, use cases, best practices and benchmark numbers.
The Agenda of the talk will be -
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer; thereby providing a consistent approach for managing and monitoring the services. This also provided a natural extension point for operators and the community to bring in their own add-on services and “plug-in” the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been that there is a strong one-on-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack, a cluster can have only a single instance of a service and a host can have only a single instance of a component. Taking into consideration various use case scenarios that cannot be enabled due to these limitations there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that wouldn’t have been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug-in their services into a stack and also can define their own upgrade story for these custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NIFI from HDF.
Further, we will also discuss about the architectural changes needed to enable a multi instance architecture in future Ambari releases to support deploying multiple instances of a service in a cluster, deploying multiple instances of a component on a host as well as future proofing the Ambari architecture to leverage some of the advancements happening in the Hadoop community like YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ... – DataWorks Summit
This paper will present the architecture and features of Data Highway Rainbow, Yahoo's hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into the primary Yahoo data centers which host a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards low-latency consumers.
We will also look into the evolution of the service since its initial launch in terms of prominent features added and the motivation behind these features; some were customer asks, while others were driven by optimizing the efficiency and footprint of the deployed infrastructure. Some of the features we will touch upon are:
* Delivery Completeness Audit WebService
* Publisher Daemon & Client API Robustness
* Aggregated HDFS File Delivery
* Filters for Low Latency Delivery
* Schema Registry
* Adaptive Rate Limiting
* Various Load Balancing techniques
* Event Deduplication
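The adaptive rate limiting mentioned above can be illustrated with a token bucket whose refill rate is tunable at runtime (a sketch of the general technique only; Data Highway's actual implementation is not described here, and all names below are ours):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter; an adaptive limiter would tune
    refill_rate at runtime based on observed load."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, now=None):
        if now is None:
            now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
t0 = time.monotonic()
results = [bucket.allow(now=t0) for _ in range(5)]  # burst of 5: only 3 admitted
later = bucket.allow(now=t0 + 2.0)                  # 2 s later: tokens refilled
```

Passing an explicit clock value keeps the demonstration deterministic; in production the limiter would read the clock itself.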
Aggregated Daily Metrics:
* Events Ingested: 250 Billion
* Bytes Ingested (Uncompressed): 700 Terabytes
* Bytes Delivered (Batch + Near Real Time): 1.5 Petabytes
* Near Real Time Delivery (Storm & Kafka) Latency: 95th percentile 500 ms - 1 second
* Batch Delivery Latency (Aggregated into 1-minute files): 95th percentile within 3 minutes
* Production H/W Footprint: 651
* Total Active Event Schema Types: ~200
Underlying Technology Stack: ZeroMQ, Apache Avro, libevent, Apache HttpComponents
The paper will conclude with the next steps we’re considering as a logical evolution for Data Highway in light of considerable developments in similar open source projects such as Apache Kafka.
Many organizations currently process various types of data in different formats, and most often this data is free-form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema. A schema helps data consumers have an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
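To make the core idea concrete, here is a minimal in-memory sketch of schema registration with versioning and a naive backward-compatibility check (illustrative only; this is not the actual Schema Registry API, and the subject and field names are hypothetical):

```python
class SchemaRegistry:
    """In-memory sketch: each subject holds a list of schema versions,
    and a new version must keep every field of the previous one
    (a deliberately naive backward-compatibility rule)."""
    def __init__(self):
        self.schemas = {}  # subject -> list of schema dicts (index = version - 1)

    def register(self, subject, schema):
        versions = self.schemas.setdefault(subject, [])
        if versions:
            missing = set(versions[-1]) - set(schema)
            if missing:
                raise ValueError(f"incompatible: drops fields {sorted(missing)}")
        versions.append(schema)
        return len(versions)  # the new version number

    def get(self, subject, version=None):
        versions = self.schemas[subject]
        return versions[-1] if version is None else versions[version - 1]

registry = SchemaRegistry()
v1 = registry.register("page_view", {"user_id": "long", "url": "string"})
v2 = registry.register("page_view",
                       {"user_id": "long", "url": "string", "referrer": "string"})
latest = registry.get("page_view")  # consumers always see the newest version
```

A real registry adds pluggable compatibility modes (backward, forward, full) and persists versions durably, but the register/lookup contract is the same.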
Building and deploying an analytic service on the cloud is a challenge; a bigger challenge is maintaining the service. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, the relevance of containers and container orchestration is greater than ever. In short, customers are looking for serverless Spark clusters. The intent of this presentation is to share what serverless Spark is and what the benefits of running Spark in a serverless manner are.
Storage Requirements and Options for Running Spark on Kubernetes – DataWorks Summit
In a world of serverless computing, users tend to be frugal with expenditure on compute, storage, and other resources, and paying for these when they aren't in use becomes a significant factor. Offering Spark as a service on the cloud presents very unique challenges, and running Spark on Kubernetes raises many of them, especially around storage and persistence. Spark workloads have unique storage requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even stricter when the same platform must be offered as a service to enterprises that need to manage GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
This talk will help people using Kubernetes or the Docker runtime in production understand the various storage options available, which of them are most suitable for running Spark workloads on Kubernetes, and what more can be done.
Secure Your Containers: What Network Admins Should Know When Moving Into Prod... – Cynthia Thomas
This session offers techniques for securing Docker containers and hosts using open source network virtualization technologies to implement microsegmentation. Come learn real tips and tricks that you can apply to keep your production environment secure.
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain... – NETWAYS
Kata Containers provide the workload isolation and security advantages of VMs while maintaining the deployment speed and usability of containers. With Kata Containers, instead of relying on namespaces, small virtual machines are created on the kernel and are strongly isolated. Kata Containers is based on the KVM hypervisor, so its level of isolation is equivalent to that of typical hypervisors. This session will focus on a live production phase when choosing Kata instead of Docker, and why it is preferable.
Although containers provide software-level isolation of resources, the kernel must be shared, so their isolation level in terms of security is not as high as that of hypervisors. This session teaches how to shift from Docker as the de facto standard to Kata Containers and how to obtain a higher level of security.
Oscon 2017: Build your own container-based system with the Moby project – Patrick Chanezon
Docker Community Edition—an open source product that lets you build, ship, and run containers—is an assembly of modular components built from an upstream open source project called Moby. Moby provides a “Lego set” of dozens of components, the framework for assembling them into specialized container-based systems, and a place for all container enthusiasts to experiment and exchange ideas.
Patrick Chanezon and Mindy Preston explain how you can leverage the Moby project to assemble your own specialized container-based system, whether for IoT, cloud, or bare-metal scenarios. Patrick and Mindy explore Moby’s framework, components, and tooling, focusing on two components: LinuxKit, a toolkit to build container-based Linux subsystems that are secure, lean, and portable, and InfraKit, a toolkit for creating and managing declarative, self-healing infrastructure. Along the way, they demo how to use Moby, LinuxKit, InfraKit, and other components to quickly assemble full-blown container-based systems for several use cases and deploy them on various infrastructures.
Best Practices for Running Kafka on Docker Containers – BlueData, Inc.
Docker containers provide an ideal foundation for running Kafka-as-a-Service on-premises or in the public cloud. However, using Docker containers in production environments for Big Data workloads using Kafka poses some challenges – including container management, scheduling, network configuration and security, and performance.
In this session at Kafka Summit in August 2017, Nanda Vijyaydev of BlueData shared lessons learned from implementing Kafka-as-a-Service with Docker containers.
https://kafka-summit.org/sessions/kafka-service-docker-containers
Dockerized containers are the current wave promising to revolutionize IT. Everybody is talking about containers, but a lot of people remain confused about how they work and why they are different from, or better than, virtual machines. In this session, Black Duck container and virtualization expert Tim Mackey will demystify containers, explain their core concepts, and compare and contrast them with the virtual machine architectures that have been the staple of IT for the last decade.
Cloud orchestration major tools comparison – Ravi Kiran
Cloud Orchestration major tools comparison (including history, installation, market share, and integration with other public cloud systems for each tool). For any clarification contact kiran79@techgeek.co.in
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis – DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
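The WAL durability contract the Log Service must honor, append is acknowledged only after the bytes are safely on disk, and replay stops at a corrupt or truncated tail, can be sketched in a few lines (our illustration of the general contract, not Ratis code):

```python
import os
import struct
import tempfile
import zlib

class WriteAheadLog:
    """Length-prefixed, checksummed log records, fsync'd on every append."""
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, payload: bytes):
        header = struct.pack(">II", len(payload), zlib.crc32(payload))
        self.f.write(header + payload)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability point: record survives a crash

    def close(self):
        self.f.close()

def replay(path):
    """Recover records in order, stopping at a truncated or corrupt tail."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack(">II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # partial write from a crash: discard the tail
            out.append(payload)
    return out

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append(b"put row1")
wal.append(b"put row2")
wal.close()
recovered = replay(path)
```

A replicated log service like Ratis makes the same guarantee across machines via RAFT quorum acknowledgement rather than a single local fsync.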
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi – DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
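As a sketch of what "insert into Phoenix SQL tables" amounts to, here is a helper that builds a parameterized Phoenix UPSERT from a record dict (the table and column names are hypothetical, and this constructs the statement only, without a JDBC connection):

```python
def phoenix_upsert(table, record):
    """Build a parameterized Phoenix UPSERT statement from a record dict.
    Columns are sorted so the statement text is deterministic."""
    cols = sorted(record)
    placeholders = ", ".join("?" for _ in cols)
    sql = f"UPSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    return sql, [record[c] for c in cols]

# Hypothetical crime record, loosely modeled on the Philadelphia dataset
sql, params = phoenix_upsert(
    "PHILLY_CRIME",
    {"DC_KEY": "201901-123",
     "DISPATCH_DATE": "2019-01-05",
     "TEXT_GENERAL_CODE": "Thefts"},
)
```

Using `?` placeholders rather than string interpolation is what keeps the pipeline safe against injection when the values come from an external feed.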
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... – DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution action from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... – DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
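One way GeoMesa-style systems linearize two-dimensional coordinates into sortable row keys is a Z-order (Morton) curve, which interleaves the bits of the two dimensions so that nearby points tend to get nearby keys. A simplified sketch (our illustration; GeoMesa's actual key layout is more elaborate):

```python
def interleave32(x: int, y: int) -> int:
    """Interleave the low 16 bits of x and y into a 32-bit Z-order value."""
    z = 0
    for i in range(16):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bits
    return z

def z_index(lon: float, lat: float, bits: int = 16) -> int:
    """Quantize lon/lat to a grid, then Z-order the grid coordinates."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave32(x, y)

a = z_index(-77.0, 38.9)   # point near Washington, DC
b = z_index(-77.1, 38.8)   # nearby point: index differs only in low bits
c = z_index(139.7, 35.7)   # Tokyo: far away, very different index
```

Because nearby points share key prefixes, a spatial query becomes a small set of row-key range scans, which is exactly the kind of operation HBase and Accumulo execute efficiently.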
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered scaling to support the world catalog, and how they have overcome them.
Many individuals and organizations have a desire to utilize NoSQL technology but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
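The depth-prefixed row-key idea behind the dirlist example can be sketched with a sorted key list standing in for an Accumulo table (a simplification; the real example also stores file metadata and a separate index for wildcard search):

```python
import bisect

def row_key(path: str) -> str:
    """Depth-prefixed row key, in the spirit of Accumulo's dirlist example:
    sorting by key groups all entries of one directory together."""
    return f"{path.count('/'):03d}{path}"

class DirTable:
    def __init__(self):
        self.keys = []  # sorted row keys, standing in for a NoSQL table

    def put(self, path: str):
        bisect.insort(self.keys, row_key(path))

    def list_dir(self, d: str):
        """List children of d with a single contiguous range scan."""
        prefix = f"{d.count('/') + 1:03d}{d.rstrip('/')}/"
        lo = bisect.bisect_left(self.keys, prefix)
        out = []
        for k in self.keys[lo:]:
            if not k.startswith(prefix):
                break  # left the directory's key range: stop the scan
            out.append(k[3:])  # strip the 3-digit depth prefix
        return out

t = DirTable()
for p in ["/home", "/home/alice", "/home/alice/notes.txt", "/home/bob", "/var"]:
    t.put(p)
children = t.list_dir("/home")
```

The depth prefix is what keeps `/home/alice/notes.txt` out of a listing of `/home`: grandchildren sort under a different prefix, so the scan never touches them.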
HBase Global Indexing to support large-scale data ingestion at Uber – DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes across many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records would be treated as inserts and re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our ingestion jobs. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints; as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
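The bookkeeping role of the Global Indexing component can be sketched as a key-to-file mapping (HBase in Uber's design; a plain dict here, and the record keys and HDFS paths below are hypothetical):

```python
class GlobalIndex:
    """Sketch of ingestion bookkeeping: the index remembers which data
    file holds each record key, so an incoming change can be routed as
    an update to its existing file instead of a duplicate insert."""
    def __init__(self):
        self.key_to_file = {}

    def tag(self, record_key, target_file_if_new):
        if record_key in self.key_to_file:
            # known key: route the change to the file that already holds it
            return ("update", self.key_to_file[record_key])
        # unseen key: record where the new row will land
        self.key_to_file[record_key] = target_file_if_new
        return ("insert", target_file_if_new)

index = GlobalIndex()
first = index.tag("trip-42", "hdfs://data/trips/part-0001")
second = index.tag("trip-42", "hdfs://data/trips/part-0002")
```

The second call is tagged as an update to part-0001 even though the ingestion job proposed a new file; without that lookup the record would be written twice, which is exactly the duplication problem the abstract describes.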
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix – DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
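The snapshot-isolation read rule at the heart of an Omid-style design, a reader sees only versions committed before its start timestamp, can be sketched as follows (illustrative only; Omid's actual protocol also handles conflict detection, commit tables, and fault tolerance):

```python
import itertools

class TimestampOracle:
    """Monotonic timestamp source, the centerpiece of Omid-style designs."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next(self):
        return next(self._counter)

class MVCCStore:
    """Multi-versioned store: each key holds (commit_ts, value) versions."""
    def __init__(self, oracle):
        self.oracle = oracle
        self.versions = {}

    def write(self, key, value):
        commit_ts = self.oracle.next()
        self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def snapshot_read(self, key, start_ts):
        """Return the newest value committed strictly before start_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts < start_ts]
        return visible[-1] if visible else None

oracle = TimestampOracle()
store = MVCCStore(oracle)
store.write("balance", 100)   # committed at ts = 1
reader_ts = oracle.next()     # reader's snapshot taken at ts = 2
store.write("balance", 80)    # committed at ts = 3, after the snapshot
seen = store.snapshot_read("balance", reader_ts)
```

The reader observes 100 even though a newer write exists, which is what lets analytics run over a consistent snapshot while transactions keep committing.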
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi – DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, and cloud service logs needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
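The row- and column-level entitlement check described above can be sketched as a policy-driven filter (the roles, columns, and predicates here are hypothetical illustrations, not Florida Blue's actual policies or Ranger's API):

```python
def apply_policy(rows, role, policies):
    """Return only the rows and columns a role is entitled to see.
    A policy names allowed columns and an optional row predicate."""
    policy = policies[role]
    allowed = policy["columns"]
    keep = policy.get("row_filter", lambda r: True)
    return [{c: r[c] for c in allowed if c in r} for r in rows if keep(r)]

# Hypothetical tenant policies: one role is region-restricted,
# the other sees different columns with no row restriction.
policies = {
    "care_provider": {"columns": {"member_id", "diagnosis"},
                      "row_filter": lambda r: r["region"] == "FL"},
    "sales_agent":   {"columns": {"member_id", "plan"}},
}

rows = [
    {"member_id": 1, "diagnosis": "A10", "plan": "gold", "region": "FL"},
    {"member_id": 2, "diagnosis": "B20", "plan": "silver", "region": "GA"},
]
care_view = apply_policy(rows, "care_provider", policies)
sales_view = apply_policy(rows, "sales_agent", policies)
```

In the federated setup the abstract describes, the role passed in would come from each tenant's own IAM system, while the policy evaluation stays centralized.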
Presto: Optimizing Performance of SQL-on-Anything Engine – DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... – DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project, and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud – DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi – DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger – DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
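The tokenization step described above can be sketched with a keyed hash: deterministic, so joins and group-bys still work on tokenized data, but not reversible without the key (a sketch of the general technique, not Ranger's actual mechanism; the key below is a placeholder):

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-rotate-me"  # placeholder; in practice fetched from a KMS

def tokenize(value: str) -> str:
    """Deterministic keyed token: the same input always yields the same
    token, so equality joins survive anonymization, but the original
    value cannot be recovered without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

t1 = tokenize("alice@example.com")
t2 = tokenize("alice@example.com")  # identical token: joins still match
t3 = tokenize("bob@example.com")    # different input, different token
```

De-anonymization in a scheme like this requires keeping a key-protected mapping from token back to value; the HMAC itself is one-way by design.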
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... – DataWorks Summit
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs along with the default big data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing big data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative big data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark – DataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes, and it can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
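The read-partitioning idea, grouping reads that share k-mers so each cluster approximates one molecule of origin, can be sketched with union-find on a toy dataset (our illustration of the underlying principle; SpaRC's distributed Spark implementation differs):

```python
def kmers(seq, k=5):
    """All length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_reads(reads, k=5):
    """Union-find over reads: any two reads sharing a k-mer end up in
    the same cluster (a toy version of SpaRC's overlap grouping)."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # k-mer -> first read index seen carrying it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in owner:
                parent[find(idx)] = find(owner[km])  # merge clusters
            else:
                owner[km] = idx
    return [find(i) for i in range(len(reads))]

reads = ["ACGTACGTAC", "CGTACGTACG",  # overlapping reads: one cluster
         "TTTTTGGGGG"]                # shares no k-mers: its own cluster
labels = cluster_reads(reads)
```

Each resulting cluster can then be handed to an assembler independently, which is what makes the downstream per-gene or per-genome optimization possible.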
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
This keynote explores the key trends across hardware, cloud, and open source: how these areas are likely to mature and develop over the short and long term, and how organisations can position themselves to adapt and thrive.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud
1. Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs
2. Spark
Unified, open source, parallel, data processing framework for Big Data Analytics
• Spark Core Engine
• Cluster managers: YARN, Mesos, Standalone Scheduler, Kubernetes
• Libraries on top of the core:
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
4. Let's look into the role of a Data Scientist
• I want to run my analytics jobs
• Social media analytics
• Text analytics (structured and unstructured)
• I want to run queries on demand
• I want to run R scripts
• I want to submit Spark jobs
• I want to view the History Server logs of my application
• I want to view daemon logs
• I want to write notebooks
5. Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machines
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics
6. Spark Serverless Characteristics
• No servers to provision
• Scale with usage
• Availability and fault tolerance
• Never pay for idle
[Diagram: the Data Scientist's browser talks to a Notebook Server, Kernel, and History Server; a Data Engineer feeds data in via COS/ingestion]
7. Typical Hadoop/Spark Cluster - Setup
• Get the suitable hardware
• Prepare host machine
• Setup various networks
• Private
• Public
• Management
• Fetch the binaries for the install
• Prepare the blueprint/config file for the install
• Start the install
• Installs often fail; debug and retry again.
8. Earlier Experiments
Option | OS Provisioning | Config | Cluster Management / Updates
1 | Bare metal | Chef | Chef
2 | xCAT – Stateful (create your own VMs) | PostScripts | xCAT updateNode
3 | xCAT – Sysclone (image from current system) | Not needed | xCAT updateNode
4 | Bare metal | PostScripts | xCAT updateNode
5 | Cloud-provider-specific images | Not needed | Manual/scripts
6 | Standard ISO image | Anaconda post-scripts | Manual/scripts
9. How do I build Serverless Spark
• Option 1: Vanilla Containers – If I need to build with Kubernetes
• Repeatable
• Application Portability
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
10. Guiding Principles
• Virtualization helps repeatability, fewer failures, and speed
• Maintenance
• Performance (equivalent to bare metal)
• Use open source from an active community
• Cloud-agnostic
13. Docker in Hadoop Cluster on Cloud
• Each cluster node is a virtual node powered by Docker: each node of the cluster is a Docker container
• Docker containers run on a pool of bare metal hosts (Docker hosts)
• Each Hadoop cluster has multiple nodes/Docker containers spanning multiple hosts
• Docker
• Container management - Custom
• Multi host networking – Overlay Network
• Registry – Private
• Local Storage
14. Typical Clusters
[Diagram: three clusters, each inside its own network boundary. Each cluster has a Master node plus Data nodes (Data 1 through Data 5 in the largest), and per-cluster LDAP, KMS, MySQL, and SSH services.]
15. Docker Images
• Master node
• Data node
• Edge node
• Auxiliary service images
• LDAP
• MySQL
• Ambari server
• KMS
16. Multi-host Docker networking
• Weave-based overlay network among nodes
• One /26 private subnet per cluster (172.x.x.x)
• Master node has a public IP, with port forwarding
• Portable public IPs
• Network speed (shared with other masters)
• Edge node is accessible using a public IP
• Users can SSH in and run Hive, HBase, Hadoop, and Spark shells
• Private network
• High speed
• Secure
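The "one /26 private subnet per cluster" sizing can be sanity-checked with Python's standard ipaddress module. A minimal sketch; 172.30.0.0/26 is only an illustrative subnet, not the deck's actual allocation:

```python
import ipaddress

# One /26 private subnet per cluster, carved from 172.x.x.x space.
# 172.30.0.0/26 is an illustrative example value.
cluster_subnet = ipaddress.ip_network("172.30.0.0/26")

print(cluster_subnet.num_addresses)          # 64 addresses in a /26
print(len(list(cluster_subnet.hosts())))     # 62 usable hosts (minus network/broadcast)
print(cluster_subnet.is_private)             # True: inside RFC 1918 172.16.0.0/12
```

So each cluster can hold roughly 60 container nodes before it needs a larger subnet, which is plenty for the cluster sizes shown on the "Typical Clusters" slide.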
17. Network Architecture
[Diagram: three Docker hosts joined by a weave overlay network running over the 10 Gbps SoftLayer private network. Each host has bonded interfaces (bond0/bond1) to the 10 Gbps SoftLayer public and private networks, a docker0 bridge, an eth-weave interface, and docker port forwarding. Containers shown: master nodes, an edge node, and data nodes.]
* docker ICC=false (no inter-container communication over the docker0 network)
* All inter-container communication is through the weave network
* One weave private subnet per cluster (no communication across subnets)
19. Provisioning Infrastructure
• Cluster Manager that provides a REST API to create clusters
• API Gateway application
• Deployment agent
• Home-grown container orchestrator: deployer scripts that actually do all the work
• Prepare directory structure
• Prepare network
• Start containers with the right options for
• Volumes
• Ports
• IP
• Hostname
• Network
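The "start containers with the right options" step amounts to assembling a docker run invocation per node. A hypothetical sketch of how such a deployer script might build the argument list; the container name, paths, ports, and network name below are made-up placeholders, not the team's actual values:

```python
def build_docker_run(name, hostname, ip, network, volumes, ports):
    """Assemble a `docker run` argv list from the per-node options the
    deployer computes (volumes, ports, IP, hostname, network)."""
    cmd = ["docker", "run", "-d", "--name", name, "--hostname", hostname]
    for host_dir, container_dir in volumes:
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    for host_port, container_port in ports:
        cmd += ["-p", f"{host_port}:{container_port}"]
    # --ip requires a user-defined network (here, a per-cluster overlay)
    cmd += ["--net", network, "--ip", ip]
    return cmd

# Hypothetical master-node container on a per-cluster overlay network:
cmd = build_docker_run(
    name="cluster42-master", hostname="master.cluster42",
    ip="172.30.0.2", network="weave-cluster42",
    volumes=[("/data/cluster42/master", "/var/lib/hadoop")],
    ports=[(8443, 8443)],
)
print(" ".join(cmd))
```

Replacing this home-grown assembly with Kubernetes means the same options become fields of a pod spec instead of hand-built CLI flags.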
21. How is a cluster created?
[Sequence: API Gateway, Cluster Manager (backed by a DB), and Deployer Agents]
1. Create Cluster (request arrives at the API Gateway)
2. Manage resources (Cluster Manager, recorded in the DB)
3. Create Cluster (Cluster Manager dispatches to the Deployer Agents)
4. PrepareNode
5. Get node details
6. Get blueprint
7. Install IOP
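The create-cluster sequence can be sketched as a small driver that walks the same seven steps. Component names mirror the slide; the step bodies are illustrative stand-ins, not the real Cluster Manager or Deployer Agent APIs:

```python
def create_cluster(num_nodes):
    """Walk the seven-step provisioning flow, returning a log of steps.
    Each step would be a REST call in the real system; here they are stubs."""
    log = []
    log.append("1. API Gateway: Create Cluster request received")
    log.append("2. Cluster Manager: manage resources (reserve hosts, record in DB)")
    log.append("3. Cluster Manager: Create Cluster dispatched to deployer agents")
    for node in range(1, num_nodes + 1):
        log.append(f"4. Deployer Agent: PrepareNode data{node}")
    log.append("5. Deployer Agents: get node details")
    log.append("6. Deployer Agents: get blueprint/config for the install")
    log.append("7. Deployer Agents: install IOP (Hadoop/Spark stack)")
    return log

for line in create_cluster(num_nodes=3):
    print(line)
```

Note there is no master deployer: any agent can run the per-node steps, which is why step 4 fans out across nodes.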
22. How do I build Serverless Spark
• Option 2: Function as a Service
• Single-node cluster, or no cluster at all
• Spark local mode
• All-in-one image
• Resource limitations
• Design limitations
23. How do I build Serverless Spark
• Option 3 : Kubernetes
* Slide from Kubernetes Scheduler Design & Discussion
24. What does Kubernetes bring in?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Kubernetes has lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization
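Since Spark 2.3, spark-submit can target Kubernetes directly as the cluster manager. A minimal sketch assembling the documented flags; the API server URL, namespace-free image name, and jar path are placeholders, not a real deployment:

```python
# spark-submit flags for Spark's native Kubernetes support (Spark 2.3+).
# The API server URL, image, and jar path below are placeholder values.
spark_submit = [
    "spark-submit",
    "--master", "k8s://https://k8s-apiserver.example.com:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=5",
    "--conf", "spark.kubernetes.container.image=example/spark:2.3.0",
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
]
print(" ".join(spark_submit))
```

The driver runs as a pod and asks the Kubernetes scheduler for executor pods, so the resource management, scheduling, and lifecycle work of the home-grown orchestrator is delegated to Kubernetes.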
26. Conclusion
• Spark Serverless: a need for Data Scientists
• Kubernetes enables
• Spark clusters in the background, with kernels up and running in seconds
• High availability
• Auto-scaling with Spark monitoring and Kubernetes deployment features
• No extra cost for idle time
27. References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler
Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications.
Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark, providing libraries for scheduling and basic I/O; Spark offers hundreds of high-level operators that make it easy to build parallel apps.
Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are specifically written to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark using its standalone cluster mode, on cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
Prepare: even though you have the right data, it may not be in the right format or structure for analysis. That's where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis.
Data scientist: primarily responsible for building predictive analytic models and insights. He will analyze data that's been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and build applications using Jupyter Notebooks and RStudio.
After the data scientist shares his analytical outputs, an application developer can build apps like a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
As a data scientist, what I was required to do.
On-prem to virtualization: as demand in my organization for the service increased, I decided to move to virtualized VMs to handle many requests on demand, but it was still painful.
Then I decided to try services offered on the cloud, like EMR, IBM Analytics Engine, or Microsoft HDInsight, but there I needed to order clusters and configure them to suit my workloads, and keep them running even when I did not want to use them.
Cover what it takes to install a Hadoop/Spark cluster.
With Spark and Hadoop, serverless is a whole new game: "Function as a Service".
- History, logs, performance, as if I am doing stuff on prem
Let's take the case of a Data Scientist.
Notebook: kernel, logs, History Server expectations from Spark serverless
Data Engineers and Scientists sending requests to Serverless Spark
Setup is hard
6.8 to 6.7 example
Optimal settings for workloads
We do have such an offering; it is deployed in Bluemix.
xCAT is Extreme Cluster/Cloud Administration Toolkit, xCAT offers complete management for clusters, Grids, Clouds, Datacenters, and many other things. It is agile, extensible, and based on years of system administration best practices and experience. It enables you to:
Provision Operating Systems on physical or virtual machines: RHEL, CentOS, Fedora, SLES, Ubuntu, AIX, Windows, VMWare, KVM, PowerVM, PowerKVM, zVM.
Provision using scripted install, stateless, statelite, iSCSI, or cloning
What problems we faced:
- Managing hardware as a service is complex
Build solutions for container orchestration
Container security and OS patching
PROBLEM: INSTALLATION AND CUSTOMIZATION
Data persistence
- First we tried an imaging solution, and then I did a docker run command. That was just magic.
"How do I back up a container?"
How do I handle a restart of the host machine?
"What's my patch-management strategy for my running containers?"
How do I network them?
Where is my data going to be?
How do I manage them?
500 bare metal hosts
We spin up containers
We are not using an orchestrator as of now; will talk about why
Our own registry
Local storage
A monolithic application being broken down; we plan to break it down further into services
Stress the benefits of separating things out, e.g. I can change LDAP to IPA, MySQL to another DB, or the KMS version
Separate metrics for these containers
Stress on security
Stress on extensibility to adopt newer networking solutions
Mention that it is not recommended to have SSH
Talk about IP forwarding, the multiple networks we have, and the need for those networks in the cluster
Explain here the CPU allocation and scheduling strategies, how we wanted to schedule them, and the reasoning behind taking advantage of Hadoop's built-in redundancy and replication to lower the chances of failure
Talk here about orchestration needs, CPUSET, and local disks, and how many of the current orchestrators do not fit. Talk about the monitoring and logging aspects of the containers.
Resource manager, layout maker
The deployer can be any of the machines; there is no master deployer here
Wait for the nodes to be prepared
Once up, start the install, which is config driven: yum install hadoop? Oh, I have it. Set up the DB? Oh, I have it.
AWS Lambda, Apache OpenWhisk (IBM), Microsoft Azure Functions: no cluster, just executors
Not the experience I am used to. History Server logs? Monitoring tools?
Inability to communicate directly: Spark, using a DAG execution framework, spawns jobs with multiple stages. For inter-stage communication, Spark requires data transfer across executors. Many Function-as-a-Service platforms do not allow communication between two function invocations. This poses a challenge for running executors in this environment.
Extremely limited runtime resources: many function invocations are currently limited to a maximum execution duration of 5 minutes, 1536 MB of memory, and 512 MB of disk space. Spark loves memory, can have a large disk footprint, and can spawn long-running tasks. This makes functions a difficult environment to run Spark on.
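A quick back-of-the-envelope check of those limits against a modest Spark workload. The 1536 MB / 512 MB / 5-minute figures come from the text above; the executor-side numbers are illustrative assumptions, not fixed Spark requirements:

```python
# FaaS limits quoted above vs. a modest, assumed Spark executor profile.
faas_limit_mb = 1536           # max memory per function invocation
faas_disk_mb = 512             # max scratch disk per invocation
faas_max_seconds = 5 * 60      # max execution duration

executor_memory_mb = 4096      # a common, modest spark.executor.memory setting
shuffle_spill_mb = 2048        # shuffle-heavy stages can spill this much to disk
long_task_seconds = 3600       # batch stages routinely run for an hour

print(executor_memory_mb > faas_limit_mb)    # executor doesn't fit the memory cap
print(shuffle_spill_mb > faas_disk_mb)       # spill exceeds the scratch disk
print(long_task_seconds > faas_max_seconds)  # tasks outlive the invocation window
```

All three comparisons come out in Spark's disfavor, which is the design limitation the slide is pointing at.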
Introduction to Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
It manages containers for me
It manages high availability
It gives me the flexibility to choose the resources I want and the persistence I want
Kubernetes has lots of add-on services: third-party logging, monitoring, and security tools
Reduced operational costs
Improved infrastructure utilization
Little to no latency
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework.
You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio.
The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization.
Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data.
What is Analytics Engine?
You can use Analytics Engine to build and deploy clusters within minutes, with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.