Parallel Linear Regression in Iterative Reduce and YARN | DataWorks Summit
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.
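One common way to parallelize SGD in an iterative map/reduce style is parameter averaging: each worker runs sequential SGD on its own data partition, a master averages the resulting weight vectors, and the average seeds the next round. The sketch below is a minimal single-process illustration of that idea, not the talk's actual implementation. The function names, learning rate, round count, and synthetic data are all assumptions made for illustration.

```python
import random

def sgd_epoch(weights, rows, lr):
    """One sequential SGD pass over a single worker's partition (squared-error loss)."""
    w = list(weights)
    for x, y in rows:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # Gradient of 0.5 * err^2 with respect to w is err * x.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def average(models):
    """Master step: element-wise mean of the workers' weight vectors."""
    n = len(models)
    return [sum(col) / n for col in zip(*models)]

def parallel_sgd(partitions, n_features, rounds=50, lr=0.05):
    """Hypothetical driver: alternate local SGD passes with a global average."""
    w = [0.0] * n_features
    for _ in range(rounds):
        # In a real cluster each call would run on a separate worker;
        # here the workers are simulated one after another.
        local = [sgd_epoch(w, part, lr) for part in partitions]
        w = average(local)
    return w

def make_data(n, seed=0):
    """Synthetic, noise-free data from y = 2 + 3x (assumed for the demo)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        rows.append(([1.0, x1], 2.0 + 3.0 * x1))  # leading 1.0 is the bias feature
    return rows

rows = make_data(400)
partitions = [rows[i::4] for i in range(4)]  # four simulated workers
weights = parallel_sgd(partitions, n_features=2)
```

Averaging once per pass keeps communication low, since workers exchange only a weight vector per round rather than a gradient per example. The trade-off is that workers drift apart between synchronizations.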
MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS | Mats Kindahl
This presentation from MySQL Connect gives a brief introduction to Big Data and the tooling used to gain insights into your data. It also introduces an experimental prototype of the MySQL Applier for Hadoop, which can be used to incorporate changes from MySQL into HDFS using the replication protocol.
A session focused on ramping you up on what Hadoop is, how it works and what it's capable of. We will also look at what Hadoop 2.x and YARN bring to the table and some future projects in the Hadoop space to keep an eye on.
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale | sparktc
GPUs are increasingly used in a broad range of applications, such as machine learning, image processing and risk analytics, to achieve higher performance and lower costs (energy footprints). At the same time, Spark has become a very popular distributed application framework for data processing and complex analytics.
MapR M7: Providing an enterprise quality Apache HBase API | mcsrivas
Provides an overview of M7, which is the first unified data platform for tables and files. Does a deep dive into the MapR architecture, especially containers, and how M7 tables integrate with the rest of the MapR architecture, including volumes, management and Hadoop.
Describes some of the problems with Apache HBase, and how M7 from MapR solves many of these issues.
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs | Chris Fregly
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs @ Strata London, May 24 2017
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Advanced Spark and TensorFlow Meetup May 23 2017 @ Hotels.com London
We'll discuss how to deploy TensorFlow, Spark, and Scikit-learn models on GPUs with Kubernetes across multiple cloud providers including AWS, Google, and Azure, as well as on-premises.
In addition, we'll discuss how to optimize TensorFlow models for high-performance inference using the latest TensorFlow XLA (Accelerated Linear Algebra) framework including the JIT and AOT Compilers.
Github Repo (100% Open Source!)
https://github.com/fluxcapacitor/pipeline
http://pipeline.io
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal... | Big Data Montreal
An account of deploying a Hadoop cluster in production from scratch on real hardware. Gives a brief overview of the Hadoop stack and its components, the major deployment and configuration challenges, and experience with performance and application tuning. Includes some “war stories” about issues we faced in operation and the benefits of a DevOps approach to running Hadoop apps.
LAS16-305: Smart City Big Data Visualization on 96Boards | Linaro
LAS16-305: Smart City Big Data Visualization on 96Boards
Speakers: Naresh Bhat, Ganesh Raju
Date: September 28, 2016
★ Session Description ★
Cities are identified as smart cities based on what data they use for predictive analytics and how they use it. Smart City as a phrase can have a wide spectrum of meaning, but there are two key things (Data and Analytics) that ‘smart’ refers to in smart city. IoT, which is gaining so much market attention, brings the power to drive the implementation. Data collection, storage and analytics offer a great deal of potential. This talk will go over a sample use case scenario utilizing an ODPi-based Hadoop ecosystem and H2O visualizations for analytics.
★ Resources ★
Etherpad: pad.linaro.org/p/las16-305
Presentations & Videos: http://connect.linaro.org/resource/las16/las16-305/
★ Event Details ★
Linaro Connect Las Vegas 2016 – #LAS16
September 26-30, 2016
http://www.linaro.org
http://connect.linaro.org
RAPIDS – Open GPU-accelerated Data Science | Data Works MD
RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline, from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Master's degree in Electrical & Computer Engineering from Georgia Tech and a Bachelor's from Clemson University.
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ... | DataWorks Summit
Using the latest advancements from TensorFlow including the Accelerated Linear Algebra (XLA) Framework, JIT/AOT Compiler, and Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow Models in a GPU-based production environment.
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% Open Source Community Edition. All code and Docker images are available to reproduce on your own CPU or GPU-based cluster.
* Bio *
Chris Fregly is Founder and Research Engineer at PipelineIO, a Streaming Machine Learning and Artificial Intelligence Startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly Video Series High Performance TensorFlow in Production.
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allow both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/ and https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl... | Mathieu Dumoulin
Examines the unique features of the MapR Converged Data Platform and how they can support production-grade enterprise machine learning. Ends with a live demo using H2O. Presented at Hadoop Summit Tokyo 2016.
Microsoft Project Olympus AI Accelerator Chassis (HGX-1) | inside-BigData.com
In this video from the Open Compute Summit, Siamak Tavallaei from Microsoft presents an overview of the Microsoft Project Olympus AI Accelerator Chassis, also known as the HGX-1.
Watch the presentation video: http://wp.me/p3RLHQ-guX
RAPIDS: GPU-Accelerated ETL and Feature Engineering | Keith Kraus
The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Shift into High Gear: Dramatically Improve Hadoop & NoSQL Performance | MapR Technologies
MapR Architecture Presentation given at Strata + Hadoop World 2013 by MapR CTO & Co-Founder M.C. Srivas.
Prior to co-founding MapR, Srivas ran one of the major search infrastructure teams at Google where GFS, BigTable and MapReduce were used extensively. He wanted to provide that powerful capability to everyone, and started MapR on his vision to build the next-generation platform for Big Data. His strategy was to evolve Hadoop and bring simplicity of use, extreme speed and complete reliability to Hadoop users everywhere, and make it seamlessly easy for enterprises to use this powerful new way to get deep insights. Srivas brings to MapR his experiences at Google, Spinnaker Networks, and Transarc in building game-changing products that advance the state of the art.
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling | Databricks
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks' Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both Spark UI and TensorBoard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
Operating multi-tenant clusters requires careful capacity planning so that big data projects and applications launch on time, within the expected budget, and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques and methodology applied on a per-project or user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations we incorporated to arrive at the most appropriate calculation across these three primary deployments. We will discuss the data sources for calculations, resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters, with a detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features added to Hadoop over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn’t want job throughput increased by 2x? Most likely you’ve heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: namely around improved job throughput and cluster utilization, as well as around permitting different computational frameworks to run on Hadoop. What you probably haven’t heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster as well as YARN specific configuration settings. In this session we’ll start with a list of recommended YARN configurations, and then step through the most common use-cases we’ve seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others’ misconfigurations to get your YARN cluster configured right the first time.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon... | StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail”. When developing systems the adage can be taken a step further, “If you fail to plan FOR FAILURE, you plan to fail”. At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and what the consequences are for different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine what failure mode semantics are important for a real-time event processing system.
Evolution of Drupal and the Drupal community | Angela Byron
The Drupal project has experienced phenomenal growth over its more than 14 years, growing from a small hobby project to over 1 million known installations and over 1 million Drupal.org users, and more than doubling the active contributors and commits in Drupal core between Drupal 7 and Drupal 8. Thousands of people also depend on Drupal in some way for a living.
This talk will "de-mystify" some recent developments in the community, from the technical direction of Drupal 8, to various project governance changes, to the increasing role of the Drupal Association on Drupal.org. We'll look at both the historical context that brought those changes about, and talk about how they'll help us scale to the next 1 million sites and users.
Neuro-symbolic is not enough, we need neuro-*semantic* | Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply applying machine learning to just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... | Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
2. About this talk
Share @twitterhadoop’s efforts, experience, and lessons learned in moving thousands of users and multi-petabyte workloads from Hadoop 1 to Hadoop 2
@twitterhadoop
2 / 29 v1.0
3. Use cases
Personalization
Graph analysis, Recommendations, Trends, User/topic modeling
Analytics
A/B testing, user behavior analysis, API analytics
Growth
Network Digest, People Recommendations, Email
Revenue
Engagement prediction, Ad targeting, ads analytics, marketplace optimization
Nielsen Twitter TV Rating
Tweet impressions processing
Backups & Scribe Logs
MySQL backups, Manhattan backups, FrontEnd scribe logs
Many more...
4. Hadoop and Data pipeline
[Diagram: data pipeline connecting TFE; Search, Ads, etc.; Partners; MySQL; SVN, Git, ...; Vertica; and Manhattan with the Hadoop clusters: real time, processing, warehouse, cold, backups, hbase, and tst (shown twice)]
5. Elephant Scale
➔ Tens of thousands of Hadoop servers (mix of hardware)
➔ Hundreds of thousands of disk drives
➔ A few hundred PB of data stored in HDFS
➔ Hundreds of thousands of daily Hadoop jobs
➔ Tens of millions of daily Hadoop tasks
Individual Cluster Stats
➔ More than 3500 nodes
➔ 30-50+ PB data stored in HDFS
➔ 35K RPC/second on NNs
➔ 30K+ jobs per day
➔ 10M+ tasks per day
➔ 6PB+ data crunched per day
6. Hadoop 1 Challenges (Q4-2012)
Growth:
Supporting Twitter growth, requests for new features on older branch, new Java
Scalability:
NameNode files/blocks, NN operations, GC pause, checkpointing
JobTracker GC pause, task assignment
Reliability:
SPOF NN and JT, NameNode restart delays
Efficiency:
Slot utilization, QoS, multi-tenancy, new features & frameworks
Maintenance:
Old codebase, numerous issues fixed in later versions, dev branch
8. Hadoop 2 Migration (Q2-Q4 2013)
Phase 1: Testing
➔ Apache 2.0.3 branch
➔ New hardware*, new OS and JVM
➔ Benchmarks and user jobs (lots of them…)
➔ Dependent component updates
➔ Data movement between different versions
➔ Metrics, alerts, and tools
Phase 2: Semi-production
➔ Production use cases running in 2 clusters in parallel
➔ Tuning/parameter updates and learnings
➔ Started contributing fixes back to community
➔ Educating users about new version and changes
➔ Benefits of Hadoop 2
Phase 3: Production
➔ Stable Apache 2.0.5 release with many fixes and backports
➔ Multiple internal releases
➔ Template for new clusters
➔ Ready to roll Apache 2.3 release
*http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter
9. CPU Utilization
[Chart: Hadoop 1 CPU utilization for one day (45% peaks)]
[Chart: Hadoop 2 CPU utilization for one day (85% peaks)]
11. Migration Challenge: web-based FS
Need a web-based FS to deal with H1/H2 interactions
● Hftp, based on cross-DC LogMover experience
● Apps broken because no FileNotFoundException (FNF) is raised on non-existent paths: HDFS-6143
● Faced challenges with cross-version checksums
12. Migration Challenge: hard-coded FS
1000s of occurrences of hdfs://${NN}/path and other absolute URIs
● For cluster1 dial hdfs://hadoop-cluster1-nn.dc CNAME
● For cluster2 dial …
Ideal: use logical paths and viewfs as defaultFS
More realistic and faster:
● HDFSCompatibleViewFS HADOOP-9985
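The difference between hard-coded URIs and logical paths can be sketched in a few lines of Python. This is only an illustration of the idea, not the HDFSCompatibleViewFS implementation from HADOOP-9985: the mount table, the host names (`nn-a.example`, `nn-b.example`), and the function names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mount table: logical path prefix -> physical NameNode URI.
# Host names are invented for illustration only.
MOUNT_TABLE = {
    "/user": "hdfs://nn-a.example/user",
    "/logs": "hdfs://nn-b.example/logs",
}


def to_logical(uri: str) -> str:
    """Drop the scheme and NameNode authority from an absolute hdfs:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        return uri  # anything that is not an hdfs:// URI is left untouched
    return parsed.path or "/"


def resolve(logical_path: str) -> str:
    """Map a logical path to a physical URI via the longest matching prefix."""
    for prefix in sorted(MOUNT_TABLE, key=len, reverse=True):
        if logical_path == prefix or logical_path.startswith(prefix + "/"):
            return MOUNT_TABLE[prefix] + logical_path[len(prefix):]
    raise KeyError(f"no mount point covers {logical_path}")
```

With logical paths, a job only ever names `/user/alice/data`; which NameNode serves it is decided in one place, the mount table, rather than in thousands of job configurations.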
13. Migration Challenge: Interoperability
Migration in progress: H1 job requires input from H2
● hftp://OMGwhatNN/has/my/path problem
● ideal: use viewfs on H1 resolving to correct H2-NN
● realistic: see above “hardcoded FS”
● Even if you know OMGwhatNN, is it active?
14. [Diagram: H1 client, cluster CNAME, and Active/Standby NameNode pairs]
Load the client-side mount table on the server side:
1. Redirect to the right namespace
2. Redirect to the active NameNode within the namespace
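The two redirect steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Twitter's server-side code: the mount table, the namespace names, the host names, and the HA state map are all hypothetical, and the `hftp://` scheme is used only because the slides mention it.

```python
# Hypothetical mount table: logical path prefix -> namespace name.
MOUNT_TABLE = {"/user": "ns-a", "/logs": "ns-b"}

# Hypothetical HA state per namespace: NameNode host -> "active" or "standby".
NAMENODES = {
    "ns-a": {"nn-a1.example": "standby", "nn-a2.example": "active"},
    "ns-b": {"nn-b1.example": "active", "nn-b2.example": "standby"},
}


def find_namespace(path: str) -> str:
    """Step 1: pick the namespace whose mount point is the longest prefix of path."""
    for prefix in sorted(MOUNT_TABLE, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return MOUNT_TABLE[prefix]
    raise KeyError(f"no namespace covers {path}")


def find_active(namespace: str) -> str:
    """Step 2: pick the NameNode currently marked active within the namespace."""
    for host, state in NAMENODES[namespace].items():
        if state == "active":
            return host
    raise RuntimeError(f"no active NameNode in {namespace}")


def redirect(path: str) -> str:
    """Return the URL a client would be redirected to for a logical path."""
    return f"hftp://{find_active(find_namespace(path))}{path}"
```

Because both lookups happen on the server, an H1 client needs to know only the cluster CNAME, not which NameNode owns a path or which one is currently active.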
15. Migration: Tools and Ecosystem
● Port/recompile/package:
o Data Access Layer/HCatalog,
o Pig,
o Cascading/Scalding
o ElephantBird
o hadoop-lzo
● PIG-3913 (local mode counters),
● Analytics team fixed PIG-2888 (performance)
● hRaven fixes:
o translation between slot_millis and mb_millis
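The slot_millis / mb_millis translation can be illustrated with simple arithmetic. The sketch below assumes that each Hadoop 1 slot corresponds to a fixed amount of memory (`mb_per_slot`, a hypothetical parameter); the actual conversion used in the hRaven fixes is not shown on the slide and may differ.

```python
def slot_millis_to_mb_millis(slot_millis: int, mb_per_slot: int) -> int:
    """Convert slot-time to memory-time, assuming a fixed memory size per slot."""
    return slot_millis * mb_per_slot


def mb_millis_to_slot_millis(mb_millis: int, mb_per_slot: int) -> int:
    """Inverse conversion; integer division drops any partial slot-millisecond."""
    return mb_millis // mb_per_slot
```

A common unit like this lets usage from slot-based Hadoop 1 jobs and container-based Hadoop 2 jobs be compared on one scale.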
16. HadOops found and fixed
● ViewFS can’t be used for public DistributedCache (DC)
o HADOOP-10191, YARN-1542
● getFileStatus RPC storm on public DC:
o YARN-1771
● No user-specified progress string in MR-AM UI task
o MAPREDUCE-5550
● Uberized jobs are great for scheduling small jobs, but ...
o can you kill them? MAPREDUCE-5841
o size correctly for map-only? YARN-1190
17. More HadOops
Incident: a job blacklists nodes by logging terabytes
● need capping, but userlog.limit.kb loses valuable log tail
● RollingFileAppender for MR-AM/tasks MAPREDUCE-5672
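The trade-off above (cap total log size, but keep the most recent lines) is what a rolling appender provides. The sketch below shows the behavior with Python's standard `logging.handlers.RotatingFileHandler` rather than log4j's RollingFileAppender; the file name, sizes, and function names are arbitrary choices for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_capped_logger(path: str, max_bytes: int = 1024, backups: int = 1):
    """Create a logger whose on-disk footprint stays near max_bytes * (backups + 1)."""
    logger = logging.getLogger(f"capped.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger, handler


def run_demo():
    """Write far more than the cap, then report total bytes on disk and the last line kept."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "task.log")
    logger, handler = make_capped_logger(path)
    for i in range(1000):
        logger.info("line %04d", i)
    handler.close()
    total = sum(
        os.path.getsize(os.path.join(workdir, name)) for name in os.listdir(workdir)
    )
    with open(path) as f:
        last = f.read().splitlines()[-1]
    return total, last
```

Older output is discarded as files roll over, so disk use stays bounded while the tail of the log, usually the part needed to debug a failure, survives.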
18. Diagnostics improvement
App/Job/Task kill:
● DAG processors/users can say why
o MAPREDUCE-5648, YARN-1551
● MR-AM: “speculation”, “reducer preemption”
o MAPREDUCE-5692, MAPREDUCE-5825
● Thread Dumps
o On task timeout: MAPREDUCE-5044
o On demand from CLI/UI: MAPREDUCE-5784, ...
19. UX/UI improvements
● NameNode state and cluster stats
● App size in MB on RM Apps Page
● RM Scheduler UI improvements: queue descriptions, bug fixes in min/max resource calculation
● Task Attempt state filtering in MR-AM
HDFS-5928, YARN-1945, HDFS-5296...
20. YARN reliability improvements
● Unhealthy nodes / positive feedback
o drain containers instead of killing: YARN-1996
o don’t rerun maps when all reduces committed: MAPREDUCE-5817
● RM crashes: JIRAs fixed either just internally or publicly
o YARN-351, YARN-502
21. MapReduce usability
● Memory.mb as a single tunable: Xmx, sort.mb auto-set
o mb is optimized on case-by-case basis
o MAPREDUCE-5785
● Users want newer artifacts like guava: job.classloader
o MAPREDUCE-5146 / 5751 / 5813 / 5814
● Help users debug
o thread dump on timeout, and on demand via UI
o educate users about heap dumps on OOM and java profiling
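The "single tunable" idea can be sketched as a function that derives the JVM heap and the sort buffer from one container memory value. The fractions and the cap below are illustrative assumptions, not the values used in MAPREDUCE-5785, and the dictionary keys are shortened names rather than real configuration properties.

```python
def derive_settings(
    memory_mb: int,
    heap_fraction: float = 0.8,   # assumed share of the container given to the JVM heap
    sort_fraction: float = 0.25,  # assumed share of the heap given to the sort buffer
    sort_mb_cap: int = 2047,      # assumed upper bound on the sort buffer
) -> dict:
    """Derive -Xmx and sort.mb from a single memory.mb value."""
    heap_mb = int(memory_mb * heap_fraction)
    sort_mb = min(int(heap_mb * sort_fraction), sort_mb_cap)
    return {
        "memory.mb": memory_mb,
        "java.opts": f"-Xmx{heap_mb}m",
        "sort.mb": sort_mb,
    }
```

With this approach a user who doubles `memory.mb` gets a proportionally larger heap and sort buffer without touching three settings that are easy to leave inconsistent.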
22. Multi-DC environment
MR clients across latency boundaries. Submit fast:
● moving split calculation to MR-AM: MAPREDUCE-207
DSCP bit coloring for DataXfer
● HDFS-5175
● Hftp (switched to Apache Commons HttpClient)
DataXfer throttling (client RW)
23. YARN: Beyond Java & MapReduce
● MR-AM and other REST APIs across the stack for easy integration in non-JVM tools.
● Vowpal Wabbit: (production)
o no extra spanning tree step
● Spark (semi-production)
24. Ongoing Project: Shared Cache
MapReduce function shipping: computation->data
● Teams have jobs with 100’s of jars uploaded via libjars
o Ideal: manage a jar repo on HDFS
o Reference jars via DistributedCache instead of uploading
o Real: currently hard to coordinate
● YARN-1492: Manage artifacts cache transparently
● Measure it:
o YARN-1529: Localization overhead/cache hits NM metrics
o MAPREDUCE-5696: Job localization counters
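One way to picture a transparently managed artifact cache is a store keyed by a checksum of each jar's contents, so that identical jars submitted by different jobs are uploaded only once. The sketch below is a toy in-memory model under that assumption; the class, its counters, and the choice of SHA-256 are illustrative and are not taken from the YARN-1492 design.

```python
import hashlib


class SharedCache:
    """Toy content-addressed cache; a dict stands in for storage on HDFS."""

    def __init__(self):
        self._store = {}   # checksum -> jar bytes
        self.uploads = 0   # artifacts actually stored
        self.hits = 0      # submissions satisfied by an existing copy

    def put(self, content: bytes) -> str:
        """Store content if unseen and return the checksum used to reference it."""
        key = hashlib.sha256(content).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = content
            self.uploads += 1
        return key

    def get(self, key: str) -> bytes:
        """Fetch an artifact by checksum."""
        return self._store[key]
```

The `uploads` and `hits` counters play the role of the localization metrics mentioned above: they show how much upload work the cache avoids.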
25. Upcoming Challenges
● Reduce ops complexity:
o grow to 10K+-node clusters
o try to avoid adding more clusters
● Scalability limits for NN, RM
● NN heap sizes: large Java heap vs namespace splitting
● RPC QoS Issues
● NN startup: long initial block report processing
● Integrating non-MR frameworks with hRaven
26. Future Work Ideas
● Productize RM HA and work-preserving restart
● HDFS Readable Standby NN
● Whole DAG in a single NN namespace
● Contribute to HDFS-5477 - Dedicated BM service
● NN SLA: fairshare for RPC queues: HADOOP-10598
● Finer lock granularity in NN
27. Summary: Hadoop 2 @ Twitter
● No JT bottleneck: Lightweight RM + MR-AM
● High compute density with flexible slots
● Reduced NN bottleneck using Federation
● HDFS HA removes the angst to try out new NN configs
● Much closer to upstream to consume/contribute fixes
o Development on 2.3 branch
● Adopting new frameworks on YARN
28. Conclusion
Migrating 1000+ users/use cases is anything but trivial
… however,
● Hadoop 2 made it worthwhile
● Hadoop 2 contributions:
o 40+ patches committed
o ~40 in review
29. Thank you! Questions?
@JoinTheFlock about.twitter.com/careers
@TwitterHadoop
Catch up with us in person
@LohitVijayaRenu
@GeraShegalov
Editor's Notes
With scale and growth like this, Twitter faced different kinds of challenges with Hadoop 1. The JT used to run >20K jobs per day.
The JobTracker caches a number of jobs per user and does not take job size into account. JT full GCs were frequent.
This covers the reasoning behind why Twitter had to choose different namespaces. As of now, all DataNodes talk to all NameNodes; we have been thinking about different combinations in which subsets of DataNodes talk to different namespaces as well.
We decided to build new Hadoop 2 clusters instead of worrying about migrating or upgrading Hadoop 1 clusters, which avoided major downtime issues. Around phase two is when users started seeing the benefits of moving to Hadoop 2. Simple fixes went a long way toward helping lots of customers.