The GPU Open Analytics Initiative (GOAI) is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
Predictive Maintenance Using Recurrent Neural Networks - Justin Brandenburg
My presentation from AnacondaCON 2018, where I discussed using recurrent neural networks, Python, TensorFlow, and the MapR Platform to develop and deploy a predictive maintenance model for an IoT device in the manufacturing industry.
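The windowing step behind a sequence model like this can be sketched in plain Python; the function name, sensor values, and window size below are illustrative, not taken from the talk:

```python
def make_windows(series, window, horizon=1):
    """Slice a univariate sensor series into (window, target) training pairs.

    Each input is `window` consecutive readings; the target is the reading
    `horizon` steps after the window ends (the value the RNN learns to predict).
    """
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        pairs.append((series[start:end], series[end + horizon - 1]))
    return pairs

temps = [70, 71, 73, 78, 85, 97]          # readings from one machine sensor
pairs = make_windows(temps, window=3)     # 3 past readings -> next reading
```

An RNN built in TensorFlow would then train on these (window, target) pairs to predict the next reading and flag drift toward failure thresholds.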
Data analytics, Spark, Hadoop, and AI have become fundamental tools to drive digital transformation. A critical challenge is moving from isolated experiments to an organizational or enterprise production infrastructure. In this talk, we break apart the modern data analytics workflow to focus on the data challenges across different phases of the analytics and AI life cycle. By presenting a unified approach to data storage for AI and analytics, organizations can reduce costs, modernize their data strategy, and build a sustainable enterprise data lake. By anticipating how Hadoop, Spark, TensorFlow, Caffe, and traditional analytics like SAS and HPC can share data, IT departments and data science practitioners can not only co-exist but speed time to insight. We will present the tangible benefits of a reference architecture using real-world installations that span proprietary and open-source frameworks. Using intelligent software-defined shared storage, users are able to eliminate silos, reduce multiple data copies, and improve time to insight.
PALLAVI GALGALI, Offering Manager, IBM, and DOUGLAS O'FLAHERTY, Portfolio Product Manager, IBM
Accelerate AI w/ Synthetic Data Using GANs - Renee Yao
Presentation from the Strata Data Conference, September 2018
Description:
Synthetic data will drive the next wave of deployment and application of deep learning in the real world, across problems involving speech recognition, image classification, object recognition, and language. All industries and companies will benefit: synthetic data can create conditions through simulation instead of authentic situations (virtual worlds let you avoid the cost of damages, spare human injuries, and sidestep other risks), and it gives an unparalleled ability to test products, and interactions with them, in any environment.
Join us for this introductory session to learn more about how generative adversarial networks (GANs) are successfully used to improve data generation. We will cover specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs... - Databricks
Risk-management analytics is applied in many fields, especially financial services. We present a framework for accelerated risk analytics and show a large-scale financial-sector application where this framework is used to run backtesting algorithms on risk-based securities such as options. These applications require highly computation-intensive operations on extremely large data sets, with objects numbering in the tens of billions.
An Intel FPGA and the FinLib library for financial applications are used to offload the computation; however, another challenging problem (which we have resolved) is how to feed data to the FPGA at optimal speed without custom coding. A combination of Apache Spark and Levyx’s persistent dataframes addresses this problem: these dataframes allow the computation to be absorbed from Spark and offloaded to FinLib in an automated way. This example can be expanded to many other areas of risk management, such as insurance and cybersecurity.
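As a rough illustration of the per-contract arithmetic a backtest like this offloads, here is a toy long-call sketch in plain Python; the function names and numbers are illustrative and have nothing to do with FinLib's actual API:

```python
def call_pnl(strike, premium, settle_prices):
    """P&L of holding one long call to expiry, for each historical settle price."""
    return [max(s - strike, 0.0) - premium for s in settle_prices]

def backtest(strike, premium, history):
    """Aggregate per-scenario P&Ls into simple backtest statistics."""
    pnls = call_pnl(strike, premium, history)
    return {"total": sum(pnls), "wins": sum(p > 0 for p in pnls)}

result = backtest(strike=100.0, premium=2.0, history=[95.0, 101.0, 104.0])
```

At the scale the abstract describes (tens of billions of objects), exactly this kind of embarrassingly parallel inner loop is what gets pushed to the FPGA.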
Solution Brief: Real-Time Pipeline Accelerator - BlueData, Inc.
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics, leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
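The micro-batch model at the heart of a Spark Streaming pipeline like the ones described above can be illustrated without a cluster; this toy pure-Python sketch (names and event shape hypothetical) groups a stream into fixed-size batches and aggregates each before a downstream write:

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Group an event stream into fixed-size micro-batches, Spark Streaming style."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def count_by_key(stream, batch_size):
    """Per-batch aggregation: count events by key, the kind of stage that would
    feed a Cassandra write in a real pipeline."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(e["key"] for e in batch)
    return dict(totals)

clicks = [{"key": "home"}, {"key": "cart"}, {"key": "home"}, {"key": "home"}]
counts = count_by_key(clicks, batch_size=2)
```

In the real stack, Kafka supplies the unbounded event stream and Spark Streaming performs the batching and aggregation across executors.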
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... - DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science for its users. Because of the relative immaturity of data science platform infrastructure integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced streaming analytics application.
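The deployment-and-discovery pattern described above can be sketched as a registry that looks up a model by name and version and applies it to records streaming past. This is a toy stand-in for illustration only, not Metron's actual API (Metron's discovery is ZooKeeper-backed and its queries go through its DSL):

```python
class ModelRegistry:
    """Toy stand-in for a model-as-a-service registry with name/version lookup."""
    def __init__(self):
        self._models = {}

    def deploy(self, name, version, fn):
        self._models[(name, version)] = fn

    def apply(self, name, version, record):
        return self._models[(name, version)](record)

registry = ModelRegistry()
# a trivial "model": flag any domain longer than 20 characters as suspicious
registry.deploy("dga_detector", "1.0", lambda r: len(r["domain"]) > 20)

scores = [registry.apply("dga_detector", "1.0", r)
          for r in [{"domain": "example.com"},
                    {"domain": "xk3v9q2m4lz8p1w7r5t6y0.biz"}]]
```

The point of the pattern is that the streaming topology never hard-codes a model: it resolves whatever is currently deployed under that name and version.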
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where he looks after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied computer science and bioinformatics, and he is now pursuing an MBA at Imperial.
Undertaking a digital journey starts with clearly articulating the success factors for the entire journey, and our experience from the field has shown this to be an Achilles heel for most CXOs across Fortune 500 organizations. Our findings were corroborated when a McKinsey study reported that only 15% of organizations are able to calculate the ROI of a digital initiative.
In this talk we will walk through demonstrated examples from multi-billion-dollar businesses of proven methodologies to measure the value of a digital enterprise. The panel will share experiences and provide actionable advice for immediate next steps around the following:
Successful metrics for measuring the value of Digital / IoT / AI / machine learning engagements
How 'Digital Traction Metrics' can provide actionable insights even before financial metrics have been reported
The best-in-class organizational constructs and futuristic employee-engagement methods to facilitate the digital revolution
Panelists for this session include:
• Christian Bilien - Head of Global Data at Societe Generale
• Pierre Alexandre Pautrat – Head of Big Data at BPCE/Natixis
• Ronny Fehling – VP, Airbus
• Juergen Urbanski – Silicon Valley Data Science
• Abhas Ricky - EMEA Lead, Innovation & Strategy, Hortonworks
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... - Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into Elasticsearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
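The parse-and-enrich stage described above, performed there by Metron and Storm, can be illustrated with a minimal sketch; the log format, field names, and asset data below are hypothetical:

```python
def parse_event(raw):
    """Split a 'timestamp|source_ip|action' log line into a structured record."""
    ts, ip, action = raw.split("|")
    return {"timestamp": ts, "source_ip": ip, "action": action}

def enrich(record, asset_db):
    """Attach asset-ownership context, as Metron enrichments attach context
    (threat intel, geo, asset data) to each parsed event."""
    record["owner"] = asset_db.get(record["source_ip"], "unknown")
    return record

assets = {"10.0.0.5": "payments-team"}
event = enrich(parse_event("2017-03-14T09:00:00|10.0.0.5|login_failed"), assets)
```

Structured, enriched records like this are what make the downstream Elasticsearch queries and batch analysis useful.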
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their processes, to help cut costs or improve efficiency.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow - Databricks
Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario and 'muting' other people and background noise. This is known as the cocktail party effect. For other people, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets, and smart hearing aids that will run deep neural networks.
Participants will gain insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
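A common baseline behind the separation models listed above is predicting a time-frequency mask and multiplying it onto the mixture spectrogram. Here is a toy ideal-binary-mask sketch in plain Python; real systems operate on STFT magnitudes, and the tiny matrices below are purely illustrative:

```python
def ideal_binary_mask(speech_mag, noise_mag):
    """1 where speech dominates a time-frequency bin, else 0."""
    return [[1 if s > n else 0 for s, n in zip(srow, nrow)]
            for srow, nrow in zip(speech_mag, noise_mag)]

def apply_mask(mixture_mag, mask):
    """Keep only the bins the mask marks as speech-dominated."""
    return [[x * m for x, m in zip(xrow, mrow)]
            for xrow, mrow in zip(mixture_mag, mask)]

speech = [[5, 1], [4, 0]]           # toy magnitude spectrograms
noise = [[1, 3], [2, 6]]
mixture = [[6, 4], [6, 6]]          # speech + noise
mask = ideal_binary_mask(speech, noise)
separated = apply_mask(mixture, mask)
```

A neural separator is trained to predict a mask like this from the mixture alone, since at inference time the clean speech and noise are of course unknown.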
An introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
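The batch-versus-stream distinction in this outline comes down to whether an aggregate is recomputed over a complete dataset or updated incrementally as elements arrive; a minimal sketch of the same statistic computed both ways:

```python
def batch_mean(data):
    """Batch processing: the full dataset is available; compute in one pass."""
    return sum(data) / len(data)

class StreamingMean:
    """Stream processing: the data is unbounded, so keep constant-size state
    and update it per arriving element."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, x):
        self.n += 1
        self.total += x
        return self.total / self.n   # current running mean

readings = [2.0, 4.0, 6.0]
stream = StreamingMean()
running = [stream.update(x) for x in readings]
```

The streaming version never holds the full dataset, which is exactly why unbounded sources require this style of state management.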
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr... - DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage, and get visibility into, the risks around their data. The challenge multiplies when data is moving and shared across different data silos such as on-premise Hadoop and public cloud infrastructures such as AWS, Azure, and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies, and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
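Step one, discovering where sensitive data is stored, often starts with pattern-based scanning. A minimal sketch with illustrative regexes follows; production scanners use far richer rules, context, and validation than these two patterns:

```python
import re

# illustrative detectors only; real classifiers validate matches in context
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    """Return which sensitive-data categories appear in a text blob."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

findings = scan("contact: alice@example.com, ssn 123-45-6789")
```

Running a scanner like this over each silo yields the inventory that the tracking and misuse-detection steps then build on.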
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23) - Jeff Hung
It is a common belief that Hadoop should run on physical servers. However, this requires a huge up-front capital investment with no guarantee of returns, so things usually end up as proving big data with not-that-big data. One approach to work around this dilemma is to run cloud computing in the cloud: with the elasticity that AWS provides, you can spend little but run big! But is it really a good idea? In this talk, we will try to answer that question, based on the results of a one-year journey with a real application and real big data.
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets - Kinetica
Enterprises are now faced with wrangling massive volumes of complex, streaming data from a variety of sources, a new paradigm known as extreme data. However, the traditional data integration model, based on structured batch data and stable data movement patterns, makes it difficult to analyze extreme data in real time. Join Matt Hawkins, Principal Solutions Architect at Kinetica, and Mark Brooks, Solution Engineer at StreamSets, as they share how innovative organizations are modernizing their data stacks with StreamSets and Kinetica to enable faster data movement and analysis.
In this webinar we'll explore:
The modern data architecture required for dealing with extreme data
How StreamSets enables continuous data movement and transformation across the enterprise
How Kinetica harnesses the power of GPUs to accelerate analytics on streaming data
A live demo of the StreamSets-Kinetica connector enabling high-speed data ingestion, queries, and data visualization
Reliable Data Ingestion in Big Data / IoT - Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical Big Data processing). In the past few years, new tools have emerged that are especially capable of handling this process of integrating data from outside, often called data ingestion. From the outside, they look very similar to the traditional enterprise service bus infrastructures that larger organizations often use to handle message-driven and service-oriented systems. But there are important differences: they are typically easier to scale horizontally, offer a more distributed setup, can handle high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets, and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
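One mechanic shared by Flume channels, NiFi connection queues, and Kafka's bounded buffers is a capacity-limited channel between source and sink whose fullness signals backpressure. A toy sketch of that idea (class and method names are illustrative, not any tool's API):

```python
from collections import deque

class BoundedChannel:
    """Bounded buffer between a source and a sink; a full channel tells the
    producer to slow down instead of silently dropping events."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, event):
        if len(self.queue) >= self.capacity:
            return False              # backpressure: caller must retry later
        self.queue.append(event)
        return True

    def poll(self):
        return self.queue.popleft() if self.queue else None

ch = BoundedChannel(capacity=2)
accepted = [ch.offer(e) for e in ("e1", "e2", "e3")]
drained = [ch.poll(), ch.poll(), ch.poll()]
```

The real tools differ mainly in where this buffer lives (in-process, on disk, or in a replicated broker) and how durable it is.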
Building a future-proof cyber security platform with Apache Metron - DataWorks Summit
QSight IT gives you insight into how we use Metron to secure our customers by continuously analyzing and monitoring users, applications, data, and networks. We show you how we implemented Metron as a replacement for our former security platform based on rule-based security. Since we are dealing with a non-conventional use case, serving many customers with one platform, we developed a business classification module that enables us to score threats according to the customer's input.
To be future ready, we are extending this rule-based detection with machine learning models for web defacement, suspicious URLs, UEBA, and many more to come.
In order to provide all the necessary information to the SOC analysts at a glance, we are developing a custom SOC application from which they can handle security alarms, analyze captured data, and have historical data at hand. We regard our new Metron-based security platform as an emerging giant: a future-proof cyber security platform!
Speaker
Bas van de Lustgraaf, Big Data Engineer, QSight IT
Machiel van Tilborg, BI Engineer, QSight IT
Big Data Day LA 2016 / Use Case Driven track - Shaping the Role of Data Scienc... - Data Con LA
At IRIS.TV, our business builds algorithmic solutions for video recommendation, with the end goal of delivering a great user experience, as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy featuring more prescriptive analytics, driven by our data science team.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea... - Spark Summit
Everybody agrees that IoT is changing the world… and creating new challenges for software developers, architects, and DevOps. How can we build efficient and highly scalable distributed applications using open-source technologies? What are the characteristics of data generated by IoT devices, and how do they differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open-source technologies like Apache Spark, Riak, and Mesos to build scalable IoT pipelines that ingest, store, and analyze huge amounts of data while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL, and object storage products for storing and archiving IoT data, then cover best practices for using Spark with the Riak NoSQL database. We will describe how Spark's advanced modules (Spark SQL, Spark Streaming, and MLlib) can solve problems common to IoT apps while using Riak for fast and scalable persistence. Finally, we will explain why Structured Streaming is a godsend for IoT data and make a case for time-series databases deserving a separate category in the NoSQL classification.
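One pattern implied by the time-series argument above is quantizing timestamps into bucket keys so that a device's readings co-locate in storage and a time-range query touches few partitions. A sketch with an illustrative key format (not Riak's actual key scheme):

```python
def bucket_key(device_id, epoch_seconds, bucket_seconds=3600):
    """Partition key: the device plus the hour-aligned bucket its reading
    falls into, so one device-hour of readings lands in one partition."""
    bucket_start = epoch_seconds - (epoch_seconds % bucket_seconds)
    return f"{device_id}:{bucket_start}"

# readings 60 s apart share a bucket; a later one starts a new bucket
keys = [bucket_key("sensor-7", t) for t in (7200, 7260, 10800)]
```

Choosing the bucket width is the design trade-off: wide buckets mean fewer partitions per query, narrow buckets mean better write spreading.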
AI on Spark for Malware Analysis and Anomalous Threat Detection - Databricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats: with thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will compare Spark to other tools we used for solving these problems.
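Anomaly detection on threat-count time series is often baselined with a rolling z-score before reaching for deep models. A pure-Python sketch follows; the window, threshold, and counts are illustrative, not Avast's actual configuration:

```python
import statistics

def zscore_anomalies(counts, window=3, threshold=2.0):
    """Flag points far from the mean of the preceding `window` observations."""
    flags = []
    for i, x in enumerate(counts):
        history = counts[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)       # not enough history yet
            continue
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history) or 1.0   # avoid divide-by-zero
        flags.append(abs(x - mu) / sigma > threshold)
    return flags

hourly_threats = [10, 11, 10, 50, 11]   # one spike among normal counts
flags = zscore_anomalies(hourly_threats)
```

A baseline like this also gives analysts a sanity check on what the neural model's alerts should at minimum catch.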
Big Data Day LA 2016 / NoSQL track - Analytics at the Speed of Light with Redi... - Data Con LA
Spark is in-memory; Redis is in-memory. The Spark-Redis connector gives Spark access to Redis' data structures as RDDs. Redis, with its blazing-fast performance and optimized in-memory data structures, reduces Spark processing time by up to 98%. In this talk, Dave will share the top use cases for Spark-Redis, such as time series, recommendations, and real-time bid management.
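A typical Redis time-series layout stores samples in a sorted set scored by timestamp (ZADD to write, ZRANGEBYSCORE to read a window). The pure-Python stand-in below sketches that pattern without a Redis server; real code would use a client such as redis-py, and the tick data is made up:

```python
import bisect

class SortedSet:
    """Minimal stand-in for a Redis sorted set: members ordered by score."""
    def __init__(self):
        self._items = []              # (score, member) pairs, kept sorted

    def zadd(self, score, member):
        bisect.insort(self._items, (score, member))

    def zrangebyscore(self, lo, hi):
        return [m for s, m in self._items if lo <= s <= hi]

ticks = SortedSet()
for ts, price in [(1000, "99.5"), (1010, "99.7"), (1999, "100.2"), (2050, "100.0")]:
    ticks.zadd(ts, price)

window = ticks.zrangebyscore(1000, 1999)    # all samples in one time window
```

Because the score is the timestamp, range reads map directly onto the time windows that Spark jobs or bid-management logic ask for.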
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speedup, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allow both graph representations and graph-based analytics to achieve similar speedups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
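The RAPIDS graph library mirrors NetworkX-style analytics on the GPU. As a CPU-side illustration of the kind of analytic it accelerates (a sketch in plain Python, not RAPIDS code), here is a minimal connected-components pass over an edge list:

```python
def connected_components(edges):
    """Label each vertex with a representative of its component via union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components
    return {v: find(v) for v in parent}

labels = connected_components([("a", "b"), ("b", "c"), ("x", "y")])
same = labels["a"] == labels["c"]           # a and c share a component
```

On GPUs this same analytic runs over edge lists held in GPU dataframes, which is what lets the whole pipeline (feature extraction, graph analytic, enrichment) stay device-resident.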
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application
built on Apache Storm and Hadoop. One of its core missions is to enable
advanced analytics through machine learning and data science to the
users. Because of the relative immaturity of data science platform
infrastructure integrated into Hadoop that is oriented to streaming
analytics applications, we have been forced to create the requisite
platform components out of necessity, utilizing many of the pieces of
the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and
how it utilizes a custom data science model deployment and autodiscovery
service that is tightly integrated with Hadoop via Yarn and Zookeeper.
We will discuss how we interact with the models deployed there via a
custom domain specific language that can query models as data streams
past. We will generally discuss the full-stack data science tooling that
has been created to enable data science at scale on an advanced analytics
streaming application.
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where is is looking after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.
Undertaking a digital journey starts with clearly articulating the success factors for the entire digital journey, and our experience from the field has shown it to be an Achilles heel for most CXOs, across Fortune 500 organizations. Our findings were corroborated when a Mckinsey study reported that only 15% of the organizations are able to calculate the ROI of a digital initiative.
In this talk we will deliberate on demonstrated examples from multi-billion dollar businesses around proven methodologies to measure the value of a digital enterprise. The panel will share experiences as well as provide actionable advice for immediate next steps around the following:
Successful metrics for measuring the value for Digital / IoT / AI/ Machine learning engagements
How can 'Digital Traction Metrics' help with actionable insights even before the Financial Metrics have been reported
What are the best in-class organizational constructs and futuristic employee engagement methods to facilitate the digital revolution
Panelists for this session include:
• Christian Bilien - Head of Global Data at Societe Generale
• Pierre Alexandre Pautrat – Head of Big Data at BPCE/Nattixis
• Ronny Fehling – VP , Airbus
• Juergen Urbanski – Silicon Valley Data Science
• Abhas Ricky - EMEA Lead, Innovation & Strategy, Hortonworks
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
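The hot/cold split described above (real-time alerting via Elasticsearch, long-term batch storage in S3) can be sketched in a few lines. This is only an illustrative stand-in, not Capital One's code: the field names, the severity rule, and the in-memory lists standing in for Elasticsearch and S3 are all assumptions for the example.

```python
import json

def enrich(raw: str) -> dict:
    """Parse a raw JSON event and attach a derived severity field."""
    event = json.loads(raw)
    event["severity"] = "high" if event.get("failed_logins", 0) > 3 else "low"
    return event

def route(event: dict, hot: list, cold: list) -> None:
    """Every event is archived; only high-severity events hit the hot path."""
    cold.append(event)       # cold path: batch store (ORC files in S3 in the real system)
    if event["severity"] == "high":
        hot.append(event)    # hot path: search/alerting (Elasticsearch in the real system)

hot_path, cold_path = [], []
for raw in ['{"host": "a", "failed_logins": 5}',
            '{"host": "b", "failed_logins": 1}']:
    route(enrich(raw), hot_path, cold_path)

print(len(hot_path), len(cold_path))  # -> 1 2
```

In the real pipeline, Kafka sits between the collection (NiFi) and both consumers, so the hot and cold paths read the same stream independently rather than sharing one loop.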
Explore IoT in Big Data while brewing beer. All verticals are instrumenting devices to learn more about their processes, helping them cut costs or improve efficiency.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
Many people are remarkably good at focusing their attention on one person or one voice in a multi-speaker scenario and 'muting' other people and background noise. This is known as the cocktail party effect. For others, separating audio sources is a challenge.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about the gains, pain points, and merits of the solutions as they relate to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal: real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets and smart hearing aids that will be running deep neural networks.
Participants will get insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices with regard to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
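The core idea behind mask-based separation can be shown without a network at all. In this toy sketch (my own illustration, not from the talk), we compute an "ideal ratio mask" directly from two known toy sources and apply it to their mixture; a real system trains a DNN to estimate such a mask in the time-frequency domain from the mixture alone. The frequencies and sample rate are arbitrary choices.

```python
import math

n = 8
# toy "speech" and "noise" signals (both positive over this short window)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(n)]
noise  = [0.5 * math.sin(2 * math.pi * 60 * t / 8000) for t in range(n)]
mix    = [s + v for s, v in zip(speech, noise)]

# ideal ratio mask: per-sample fraction of energy belonging to the target
eps = 1e-12  # avoids division by zero where both sources are silent
mask = [abs(s) / (abs(s) + abs(v) + eps) for s, v in zip(speech, noise)]

# apply the mask to the mixture to recover an estimate of the target
estimate = [m * x for m, x in zip(mask, mix)]

err = sum((e - s) ** 2 for e, s in zip(estimate, speech)) / n
print(f"reconstruction MSE: {err:.6f}")
```

The MSE is essentially zero here because both toy sources share the same sign at every sample; with real audio the ratio mask is an approximation, and the network's job is to predict it well enough from the mixture.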
An introduction to streaming data: the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
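The batch-versus-stream distinction above can be made concrete with a word count: the batch version needs the whole bounded dataset before it answers, while the streaming version maintains a running result that is queryable after every event. A minimal stdlib sketch:

```python
from collections import Counter

events = ["error", "ok", "error", "ok", "error"]

# Batch: one pass over a complete, bounded dataset, answer at the end.
batch_counts = Counter(events)

# Streaming: update incrementally as each event of an unbounded stream arrives.
stream_counts = Counter()
snapshots = []
for e in events:
    stream_counts[e] += 1
    snapshots.append(stream_counts["error"])  # result is queryable at any time

print(batch_counts["error"], snapshots)  # -> 3 [1, 1, 2, 2, 3]
```

Real stream processors add what this sketch omits: windowing, out-of-order event handling, and fault-tolerant state, which are exactly the research issues the line above alludes to.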
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage and get visibility into risks to their data. The challenge multiplies when data is moving and shared across different data silos such as on-premises Hadoop and public cloud infrastructures such as AWS, Azure and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies, and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
It is a common belief that Hadoop should run on physical servers. However, this requires a huge up-front capital investment with no guarantee of returns, so things usually end up as proving big data with not-that-big data. One way to work around this dilemma is to run cloud computing in the cloud: with the elasticity that AWS provides, you can spend little but run big! But is it really a good idea? In this talk, we will try to answer that question, based on the results of a one-year journey with a real application and real big data.
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsKinetica
Enterprises are now faced with wrangling massive volumes of complex, streaming data from a variety of different sources, a new paradigm known as extreme data. However, the traditional data integration model that’s based on structured batch data and stable data movement patterns makes it difficult to analyze extreme data in real time. Join Matt Hawkins, Principal Solutions Architect at Kinetica, and Mark Brooks, Solution Engineer at StreamSets, as they share how innovative organizations are modernizing their data stacks with StreamSets and Kinetica to enable faster data movement and analysis. In this webinar we’ll explore:
The modern data architecture required for dealing with extreme data
How StreamSets enables continuous data movement and transformation across the enterprise
How Kinetica harnesses the power of GPUs to accelerate analytics on streaming data
A live demo of the StreamSets-to-Kinetica connector enabling high-speed data ingestion, queries and data visualization
Reliable Data Ingestion in BigData / IoTGuido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years, new tools have emerged that are especially capable of handling the process of integrating data from outside, often called data ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
Building a future-proof cyber security platform with Apache MetronDataWorks Summit
Qsight IT gives you insight into how we use Metron to secure our customers by continuously analyzing and monitoring users, applications, data, and networks. We show you how we implemented Metron as a replacement for our former security platform based on rule-based security. Since we are dealing with a non-conventional use case, “serving many customers with one platform,” we developed a business classification module that enables us to score threats according to the customer’s input.
To be future ready, we are working on extending this rule-based way of detection with machine learning models for web defacement, suspicious URLs, UEBA, and many more to come.
In order to provide all the necessary information to the SOC analysts at a glance, we are developing a custom SOC application from where they can handle security alarms, analyze captured data, and have historical data at hand. We regard our new Metron based Security Platform as an emerging giant—a future-proof cyber security platform!
Speaker
Bas van de Lustgraaf, Big Data Engineer, QSight IT
Machiel van Tilborg, BI Engineer, QSight IT
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Data Con LA
At IRIS.TV, our business builds algorithmic solutions for video recommendation with the end goal of delivering a great user experience, as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy that features more prescriptive analytics, driven by our data science team.
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Spark Summit
Everybody agrees that IoT is changing the world… and creates new challenges for software developers, architects and DevOps. How can we build efficient and highly scalable distributed applications using open-source technologies? What are the characteristics of data generated by IoT devices, and how does it differ from traditional enterprise or Big Data problems? Which architectural patterns are beneficial for IoT use cases, and why do some trusted methods eventually turn out to be “anti-patterns”? This talk will show how to combine best-of-breed open-source technologies, like Apache Spark, Riak and Mesos, to build scalable IoT pipelines to ingest, store and analyze huge amounts of data, while keeping operational complexity and costs under control. We will discuss the pros and cons of using relational, NoSQL and object storage products for storing and archiving IoT data. Then we will cover best practices for using Spark with the Riak NoSQL database, and describe how Apache Spark’s advanced modules (Spark SQL, Spark Streaming and MLlib) can solve problems common to IoT apps, while using Riak for fast and scalable persistence. At the end, we will explain why Spark Structured Streaming is a godsend for IoT data and make a case for time series databases deserving a separate category in the NoSQL classification.
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will compare Spark with other tools we have used for solving these problems.
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
Spark is in-memory; Redis is in-memory. The Spark-Redis connector gives Spark access to Redis' data structures as RDDs. Redis, with its blazing fast performance and optimized in-memory data structures, reduces Spark processing time by up to 98%. In this talk, Dave will share the top use cases for Spark-Redis such as time series, recommendations and real-time bid management.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
End to End Machine Learning Open Source Solution Presented in Cisco Developer...Manish Harsh
The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA® based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
With advances in computer hardware such as 10-gigabit network cards, InfiniBand, and solid-state drives all becoming commodity offerings, the new bottleneck in big data technologies is very commonly the processing power of the CPU. In order to meet the computational demand desired by users, enterprises have had to resort to extreme scale-out approaches just to get the processing power they need. One of the best-known technologies in this space, Apache Spark, has numerous enterprises publicly talking about the challenges of running multiple 1000+ node clusters to give their users the processing power they need. This talk is based on work completed by NVIDIA’s Applied Solutions Engineering team. Attendees will learn how they were able to GPU-accelerate UDFs in PySpark using open source technologies such as Numba and PyGDF, the lessons they learned in the process, and how they were able to accelerate workloads in a fraction of the hardware footprint.
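Much of the speed-up described above comes from replacing a Python call per row with one call over a whole column, which a compiler like Numba can then turn into a GPU kernel. The stdlib sketch below shows only that interface difference, not the actual Numba or PyGDF APIs; the function and column are illustrative.

```python
def udf_per_row(x):
    """Row-at-a-time UDF: one interpreter call per row (the slow pattern)."""
    return x * 2 + 1

def udf_batched(column):
    """Columnar UDF: one call over the whole column. This is the shape of
    work a JIT compiler or GPU kernel wants, since it can vectorize the loop."""
    return [x * 2 + 1 for x in column]

column = list(range(5))
assert [udf_per_row(x) for x in column] == udf_batched(column)
print(udf_batched(column))  # -> [1, 3, 5, 7, 9]
```

In the GPU version, the column lives in device memory (a GPU DataFrame), so the kernel also avoids per-row serialization between the JVM and Python, which is a large part of classic PySpark UDF overhead.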
If you're like most of the world, you're in an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winner in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from petabytes to exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today, and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long-term view.
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline, from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Masters degree in Electrical & Computer Engineering from Georgia Tech and a Bachelors from Clemson University.
NVIDIA and Kinetica presented together about trends in GPU use cases across industries. GPU architecture basics were discussed, along with how GPUs compare with ASICs and FPGAs.
Kinetica presented their GPU-powered in-memory database platform, which provides capabilities for fast analytics, geospatial analytics and a real-time ML/deep learning execution engine.
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
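The in-memory data/compute-grid idea from the bullets above can be sketched in stdlib Python: data lives partitioned in memory, and the computation is shipped to each partition rather than the data to the code. Hazelcast's real API is Java and far richer; the partitions and word-count task here are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Data already resides partitioned across the grid's memory.
partitions = [
    ["big", "data", "fast"],
    ["fast", "data"],
    ["big", "big"],
]

def count_partition(words):
    """The 'map' step: runs where the data already is, no data movement."""
    return Counter(words)

# Each grid member processes its own partition in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_partition, partitions))

# The 'reduce' step merges the small partial results, not the raw data.
total = sum(partials, Counter())
print(total["big"], total["data"], total["fast"])  # -> 3 2 2
```

The contrast with Hadoop in the talk is that the partitions never touch disk between stages, which is what makes iterative and low-latency workloads practical on a grid.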
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is central to expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms only addresses a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports Property Graphs, Knowledge Graphs, Hyper-Graphs, Bipartite graphs, and the basic directed and undirected graph.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, then conclude with a preview of upcoming features, like graph query language support, and the general RAPIDS roadmap.
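cuGraph ingests graphs as edge-list DataFrame columns; a stdlib sketch of the same representation with plain lists shows the idea for a small bipartite (users x items) graph. The column names, vertices, and the degree/neighbor computation are illustrative, not cuGraph's API.

```python
from collections import defaultdict

# Edge list as two parallel "columns", the way a DataFrame would hold it.
src = ["u1", "u1", "u2", "u3"]
dst = ["itemA", "itemB", "itemA", "itemB"]

degree = defaultdict(int)
neighbors = defaultdict(set)
for s, d in zip(src, dst):
    # Treat edges as undirected for this bipartite example.
    degree[s] += 1
    degree[d] += 1
    neighbors[s].add(d)
    neighbors[d].add(s)

print(degree["itemA"], sorted(neighbors["itemB"]))  # -> 2 ['u1', 'u3']
```

The point of the DataFrame-first design is that these same `src`/`dst` columns can be handed unchanged to a machine learning step (cuML, in RAPIDS), with no conversion or copy off the GPU.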
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabaseKinetica
Freed from the constraints of storage, network and memory, many big data analytics systems now routinely reveal themselves to be compute bound. To compensate, big data analytic systems often sprawl horizontally (300-node Spark or NoSQL clusters are not unusual!) to bring in enough compute for the task at hand. High system complexity and crushing operational costs often result. As the world shifts from physical to virtual assets and methods of engagement, there is an increasing need for systems of intelligence to live alongside the more traditional systems of record and systems of analysis. New approaches to data processing are required to support the real-time workloads that drive these systems of intelligence.
Join 451 Research and Kinetica to learn:
•An overview of the business and technical trends driving widespread interest in real-time analytics
•Why systems of analysis need to be transformed and augmented with systems of intelligence bringing new approaches to data processing
•How a new class of solution—a GPU-accelerated, scale out, in-memory database–can bring you orders of magnitude more compute power, significantly smaller hardware footprint, and unrivaled analytic capabilities.
•Hear how other companies in a variety of industries, such as financial services, entertainment, pharmaceutical, and oil and gas, benefit from augmenting their legacy systems with a modern analytics database.
How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...Precisely
In this presentation from Syncsort and Cloudera, you'll learn how to bridge the technical, skill and cost gaps between mainframe and Hadoop. We discuss the top challenges of ingesting and processing mainframe data in Hadoop – and how to solve them.
Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...Steven Totman
Lightning talk from the Hadoop Summit 2013 in Amsterdam covering how Syncsort is helping make Hadoop ready for prime time. It includes the pluggable sort contribution - the impact on sort, join, aggregation, merge and filter in Hadoop - and Syncsort's ability to move mainframe data to Hadoop - Big Iron to Big Data.
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning data required to meet business goals.
As data analytics and data-driven intelligence take centre stage in today’s digital economy, logical data integration across the widest variety of data sources, with a proper security and governance structure in place, has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t by itself solve the problem: you still need to grab changes from the source, push them into Kafka, and consume the data from Kafka for processing. And if something unexpected happens, such as connectivity being lost on either the source or the target side, you don’t want to have to fix things by hand or start over because the data is out of sync.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
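The "don't start over when something breaks" requirement above boils down to checkpointed offsets: commit your position only after a change is processed, so a crash and restart resumes exactly where it left off, with nothing lost and nothing redone. A stdlib sketch of that pattern (the in-memory list stands in for a Kafka topic; the dict stands in for committed consumer offsets, both illustrative):

```python
log = [{"id": i} for i in range(5)]    # stands in for a Kafka topic/changelog

processed = []
offset_store = {"offset": 0}           # stands in for committed offsets

def consume(log, offset_store, crash_at=None):
    while offset_store["offset"] < len(log):
        i = offset_store["offset"]
        if crash_at is not None and i == crash_at:
            raise RuntimeError("connectivity lost")   # simulated failure
        processed.append(log[i]["id"])
        offset_store["offset"] = i + 1  # commit only AFTER successful processing

try:
    consume(log, offset_store, crash_at=3)   # fails mid-stream at record 3...
except RuntimeError:
    pass
consume(log, offset_store)                   # ...and resumes with no rework
print(processed)  # -> [0, 1, 2, 3, 4]
```

Committing after processing gives at-least-once delivery; exactly-once additionally requires the processing and the commit to be atomic, which is the harder problem production CDC tooling solves.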
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
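Reef proves (non-)matching for PCRE-style patterns, and Python's `re` module accepts the same feature set listed above (alternation, ranges, capture groups, lookarounds), so it is a handy way to sanity-check a pattern before committing to a proof. The password rule and email pattern below are my own illustrative examples, not ones from the paper, and of course `re` only matches; it produces no zero-knowledge proof.

```python
import re

# Lookaheads enforce "has a digit AND an uppercase letter"; the character
# class restricts the alphabet; {8,} is a minimum-length bound.
strong = re.compile(r"^(?=.*\d)(?=.*[A-Z])[A-Za-z0-9!?_-]{8,}$")

assert strong.match("Tr0ub4dor_x")
assert not strong.match("alllowercase1")   # no uppercase letter
assert not strong.match("Short1")          # under 8 characters

# Capture groups, as in the redacted-email provenance use case: prove
# something about a captured part without revealing the whole document.
m = re.match(r"From: (\S+)@(\S+)", "From: alice@example.org")
print(m.group(2))  # -> example.org
```

In Reef's password application, the prover shows a committed password matches `strong` without revealing the password itself, which is exactly what a plain regex engine cannot do.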
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
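At its simplest, a testing heatmap scores each area under test by failure likelihood times business impact and ranks the results. The sketch below is only an illustration of that idea; the module names and scores are made up, and Test Manager derives the real values from execution and defect data.

```python
# (failure likelihood, business impact), both on a 0..1 scale -- illustrative
modules = {
    "Order entry": (0.9, 0.8),
    "Invoicing":   (0.4, 0.9),
    "Master data": (0.2, 0.3),
}

# Risk score per module: likelihood x impact.
scores = {name: round(l * i, 2) for name, (l, i) in modules.items()}

# Render a crude ASCII heat bar, hottest first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    bar = "#" * int(score * 10)
    print(f"{name:12} {score:4} {bar}")
```

The ranking is the actionable output: testing effort goes to the top rows first, which is the prioritization the heatmap visualizes.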
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
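One concrete way test techniques shrink the number of tests (and therefore compute and carbon): replace the full combinatorial matrix with "each-choice" coverage, where every parameter value appears in at least one test. This is a simpler cousin of the pairwise techniques the talk alludes to; the parameters below are my own example.

```python
from itertools import product, zip_longest

browsers = ["chrome", "firefox", "safari"]
locales  = ["en", "fi"]
devices  = ["desktop", "mobile"]

# Full cartesian matrix: every combination, 3 * 2 * 2 runs.
full = list(product(browsers, locales, devices))

# Each-choice: zip the value lists together, padding shorter lists with
# their first value, so every individual value is exercised at least once.
each_choice = [
    (b or browsers[0], l or locales[0], d or devices[0])
    for b, l, d in zip_longest(browsers, locales, devices)
]

print(len(full), len(each_choice))  # -> 12 3
```

Each-choice catches faults triggered by a single parameter value; pairwise additionally covers every value *pair* and needs a few more runs, still far fewer than the full matrix.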
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
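Of the triggers mentioned above, the directory watcher is the easiest to demystify: poll a folder, diff against the last snapshot, and fire an action per new file. FME implements this natively; the stdlib sketch below (with an illustrative file name) only shows the logic of a single polling step.

```python
import os
import tempfile

def poll(directory, seen):
    """Return newly arrived files and update the seen-set in place."""
    current = set(os.listdir(directory))
    new = sorted(current - seen)   # files not present at the last poll
    seen |= current
    return new

with tempfile.TemporaryDirectory() as d:
    seen = set()
    poll(d, seen)                               # initial snapshot: empty dir
    open(os.path.join(d, "orders.csv"), "w").close()   # a file "arrives"
    triggered = poll(d, seen)                   # next poll detects it
    print(triggered)  # -> ['orders.csv']
```

Each name returned by `poll` is what the automation passes to its actions, typically as the path parameter of the workspace it runs.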
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
2.
SPARK ECOSYSTEM
The Glue of Big Data
• Spark has become almost synonymous with Hadoop and Big Data
• It is the interface/API for app-to-app communication in big data
• The processing layer for big data and the leading ML framework
3.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
4.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
5.
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 core parts: SQL, Streaming (micro-batched Spark functions), Machine Learning, and Graph
• Spark is currently optimizing its existing code base and adding usability; no GPU support yet
6.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
GPU/Spark In-Memory Processing (5-10x improvement, more code, language rigid, substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
7.
Pre-GPU DATA FRAME
Too Much Glue Code & Lack Of Standards
[Diagram: H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, and MapD on the GPU, with APP A and APP B on the CPU, connected by repeated "Copy & Convert" hops]
• For GPU applications to talk to each other, data must be copied and converted up to three times
• Each company has to build and maintain its own connectors to copy and convert
• Some products wanted direct connectors to other products: fewer hops, but more for them to maintain and develop
• ISVs were always starting from scratch: a barrier to entry and integration
• A standard was needed
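The glue-code explosion described above can be quantified: without a shared format, every ordered pair of applications needs its own copy-and-convert connector, while a standard interchange format needs only one adapter per application. A minimal back-of-the-envelope sketch:

```python
def connectors_without_standard(n_apps):
    # Every ordered pair of apps needs its own copy-and-convert path.
    return n_apps * (n_apps - 1)

def connectors_with_standard(n_apps):
    # Each app needs only one adapter to and from the shared format.
    return n_apps

# Six GPU products (H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, MapD):
print(connectors_without_standard(6))  # 30 pairwise connectors to maintain
print(connectors_with_standard(6))     # 6 adapters with a shared GPU Data Frame
```

This quadratic-versus-linear growth is why the slide concludes that a standard was needed.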
8.
Pre-GPU DATA FRAME
Data Movement Kills Performance
[Chart: the number of data handoffs grows with the volume of data; each handoff adds copy-and-convert overhead]
9.
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: APP A and APP B each load and read data through the CPU, then copy & convert it into their own private GPU data; H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD each maintain a separate GPU data format]
11.
INTEROPERABILITY IN BIG DATA
Lessons Learned From Apache Arrow & Parquet
• Both Apache Arrow and Apache Parquet are compressed columnar storage formats
• Arrow resides in memory, whereas Parquet resides on disk
• There is a major push in the big data world to remove the bottleneck of copying and converting data between systems, which was also a major issue in the GPU world
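The copy-and-convert problem Arrow addresses can be illustrated in plain Python: a contiguous columnar buffer can be handed to another consumer as a zero-copy view instead of being serialized and re-parsed. This is a stdlib-only sketch of the principle; real Arrow buffers apply the same idea in a language-independent format:

```python
from array import array

# A "column" stored contiguously, the way Arrow lays out data in memory.
prices = array("d", [10.5, 11.0, 9.75, 12.25])

# The producer hands out a view of the raw buffer: no copy, no conversion.
view = memoryview(prices)
assert view[2] == 9.75

# The view shares storage with the original column, so an update by the
# producer is visible through the view without any data movement.
prices[2] = 8.0
assert view[2] == 8.0
print(view.nbytes)  # 32: four float64 values in one contiguous buffer
```

The GPU Data Frame applies the same shared-buffer idea to GPU memory, so multiple GPU applications can read one copy of the data.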
12.
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: data is loaded once into a GPU Data Frame in GPU memory, based on Apache Arrow; H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD all read the same data]
13.
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics | @gpuoai
GPU Data Frame (GDF), based on Apache Arrow
Pipeline: Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
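The GDF pipeline above can be sketched as a chain of stages that pass one shared frame along rather than copying data out between tools. This is an illustrative stdlib-only sketch, not GOAI code: the stage names mirror the slide, and the dict-of-lists "frame" stands in for a real GPU Data Frame:

```python
def ingest_parse(raw):
    # Parse CSV-like text into a columnar frame (dict of column -> list).
    rows = [line.split(",") for line in raw.strip().splitlines()]
    return {"x": [float(r[0]) for r in rows],
            "y": [float(r[1]) for r in rows]}

def feature_engineering(frame):
    # Add a derived column in place: the same frame flows on, nothing is
    # copied out to another process or format.
    frame["x_sq"] = [v * v for v in frame["x"]]
    return frame

def train(frame):
    # Trivial "model" (the mean of y), a placeholder for an ML/DL stage.
    return sum(frame["y"]) / len(frame["y"])

frame = ingest_parse("1.0,2.0\n2.0,4.0\n3.0,6.0")
model = train(feature_engineering(frame))
print(model)  # 4.0
```

In GOAI the same hand-off happens on the GPU: each stage reads and extends the Arrow-backed frame in device memory instead of round-tripping through the CPU.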
19.
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing (25-100x improvement, less code, language flexible, primarily in-memory):
HDFS Read → Query → ETL → ML Train
GPU/Spark In-Memory Processing (5-10x improvement, more code, language rigid, substantially on GPU):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
End to End GPU Processing (GOAI) (25-100x improvement, same code, language flexible, primarily on GPU):
Arrow Read → Query → ETL → ML Train
20.
Expand GPU Usage
More Data, Less Hardware
[Chart: peak double-precision TFLOPS from 2008 to 2017; NVIDIA GPU growth far outpaces x86 CPU]
Scaling up and out with GPU co-processors
21.
ANACONDA
Python ETL for GPU
Numba: an open-source just-in-time optimizing Python compiler that uses LLVM to produce native machine instructions. Anaconda is the primary contributor to PyGDF.
Dask: a flexible parallel computing library for analytic computing, with dynamic task scheduling and big data collections. Anaconda is the primary contributor to Dask_GDF.
Jeremy Howard (deep learning researcher & educator; founder: fast.ai; faculty: USF & Singularity University; previously CEO: Enlitic, president: Kaggle, CEO: Fastmail):
"Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code"
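As an illustration of the kind of loop the tweet describes, degree-2 polynomial feature expansion is just a nested loop over column pairs, exactly the shape of numeric code that Numba's JIT can compile to native speed. This is a NumPy-only sketch of the computation, not the scikit-learn or Numba implementation itself:

```python
import numpy as np

def polynomial_features_deg2(X):
    """Append all pairwise column products (degree-2 terms) to X."""
    n_samples, n_features = X.shape
    cols = [X]
    for i in range(n_features):
        for j in range(i, n_features):
            # Each degree-2 feature is an elementwise product of two columns.
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack(cols)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
out = polynomial_features_deg2(X)
print(out.shape)  # (2, 5): 2 original columns plus x0*x0, x0*x1, x1*x1
```

A decorator such as Numba's @jit can compile exactly this kind of loop, which is how a short pure-Python rewrite can reach large speedups.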
29.
MAPD
MapD Core, MapD Immerse
LLVM Backend: LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures, and can run simultaneously on CPU and GPU.
Streaming: speed eliminates the need to pre-index or aggregate data. Compute resides on GPUs, freeing CPUs to parse and ingest. The newest data can be combined with billions of rows of "near historical" data.
Rendering: data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy, and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
30.
MAPD ARCHITECTURE
Open Source | Commercial
Visualization Libraries: JavaScript libraries, based on DC.js, that let users build custom web-based visualization apps powered by a MapD Core database.
LLVM: MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM-based compiler and run as NVIDIA GPU machine code.
Distributed Scale-out: MapD Core has native distributed scale-out capabilities. Users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions.
High Availability: MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput.
32.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve, as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data from devices requires a new approach that combines all data sources to allow more intelligent, advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, allowing analysts to label and train deep learning models faster and to validate machine learning predictions.
33.
RULES & PEOPLE DON'T SCALE
Current methods are too slow
Right now, financial services reports it takes an average of 98 days to detect an Advanced Threat, but retailers say it can be about seven months.
Once the security community moves beyond the mantras "encrypt everything" and "secure the perimeter," it can begin developing intelligent prioritization and response plans to various kinds of breaches, with a strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical deployment, and making them reliable for large networks. This is where the security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
34.
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank's Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
35.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
36.
MULTI MODEL APPROACH
No Silver Bullet In Cyber Security
nvGRAPH | https://github.com/h2oai/h2o4gpu
[Graph benchmark residue: # edges = E * 2^S, ~34M]
37.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
2. Event management is an accelerated analytics problem: combine all data sources for intelligent threat hunting and exploration at scale across machine data.
38.
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else, and how else, can we use GPU acceleration?
• Dashboards
• Accelerating the data pipeline
• Stream processing
• Building better models faster
• First: GPU databases
Pipeline stages: Data Ingestion, Data Processing, Visualization, Model Training, Inferencing
39.
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Production SIEM of a Fortune 500 enterprise: 450+ columns, ~250 million events per day
SIEM vs Big Data solution: 10-node cluster, ~$60k in hardware
Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
40.
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM

Typical Scenario                                          | Time Period | SIEM          | Big Data | Speed Up
1 Show all network communication from one host (IP)       | 1 Day       | 3h 20m 13s    | 1m 44s   | 114 times faster
  to multiple hosts (IPs)                                 | 1 Week      | Not Feasible* | 4m 05s   |
2 Retrieve failed logon attempts in Active Directory      | 1 Day       | 18m 26s       | 1m 37s   | 10 times faster
                                                          | 1 Week      | 2h 13m 45s    | 3m 10s   | 41 times faster
3 Search for malware (exe) in Symantec logs               | 1 Day       | 3h 24m 36s    | 1m 37s   | 125 times faster
                                                          | 1 Week      | Not Feasible* | 3m 22s   |
4 View all proxy logs for a specific domain               | 1 Day       | 4h 30m 13s    | 2m 54s   | 92 times faster
                                                          | 1 Week      | Not Feasible* | 1m 09s** |

Spark vs SIEM benchmarks from Accenture Labs (Strata NY, BSides LV)
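The speed-up column follows directly from the raw times in the table. A small helper (hypothetical, stdlib-only) shows the arithmetic for scenario 1:

```python
def to_seconds(t):
    """Parse a duration like '3h 20m 13s' into total seconds."""
    total = 0
    for part in t.split():
        unit = part[-1]          # trailing h / m / s
        value = int(part[:-1])
        total += value * {"h": 3600, "m": 60, "s": 1}[unit]
    return total

def speedup(siem_time, bigdata_time):
    return to_seconds(siem_time) / to_seconds(bigdata_time)

# Scenario 1, one day of data: 3h 20m 13s vs 1m 44s
print(speedup("3h 20m 13s", "1m 44s"))  # roughly 115x, in line with the table
```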
41.
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
[Bar chart: query time in milliseconds for Queries 1-4]
                 Query 1   Query 2   Query 3   Query 4
MapD DGX-1            21        80       150       372
MapD 4 x P100         30        99       269       696
Redshift 6-node     1560      1250      2250      2970
Spark 11-node      10190      8134     19624     85942
Source: MapD benchmarks on DGX from internal NVIDIA testing following the guidelines of Mark Litwintschik's blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster; Spark 2.1, 11 x m3.xlarge cluster with HDFS
42.
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve: combine machine learning, graph analysis, and applied statistics with deep learning to reduce false positives, detect threats faster, and empower analysts.
2. Event management is an accelerated analytics problem: combine all data sources for intelligent threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, allowing analysts to label and train deep learning models faster and to validate machine learning predictions.
44.
DATA PLATFORM-AS-A-SERVICE
SCALE
• Handles 1M events/second
• Auto-scales the cluster
HIGH AVAILABILITY
• Offers HA with no data loss
• Always-on architecture
• Data replication
SECURITY
• Data platform security implemented with VPCs in AWS
• Dashboard access using NVIDIA LDAP
SELF SERVICE
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
50.
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• 1/10th the hardware
• 1-2 orders of magnitude more performance
51.
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
• Real-time visualization of 100K+ nodes and 1M+ edges
• 50-100x faster clustering than other solutions
52.
LISTS DO NOT VISUALLY SCALE
Text search is a great starting point, but it does not scale: you do not see the 30K+ events, the IPs, the users, or how they relate.
54.
GRAPHS: A KEY MISSING VIEW
Unified Model
• Shows entities, events, and relationships
• Multipurpose: connect, see, interact
Visual
• Inspect individual items
• See behavior, patterns, and outliers
• Scale to enterprise workloads
55.
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
• Uni-graph (one node type: ip-ip). Example: network mapping. "Is it safe to reboot this?"
• Hypergraph (events link many entities: event, user, ip). Example: incident response. "Did this escalate?"
• Multi-graph (multiple node types: ip, user). Example: SSH trails. "Is a user crossing zones?"
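To make the multi-graph concrete, a tiny stdlib-only sketch (with hypothetical node names) models SSH logins as user-to-host edges tagged with the host's zone, and answers the slide's "is a user crossing zones?" question:

```python
from collections import defaultdict

# Each SSH login is an edge: (user, host, zone of host). Hypothetical data.
logins = [
    ("alice", "web-01", "dmz"),
    ("alice", "db-01", "internal"),
    ("bob", "web-02", "dmz"),
]

def users_crossing_zones(edges):
    zones_by_user = defaultdict(set)
    for user, host, zone in edges:
        zones_by_user[user].add(zone)
    # A user whose logins touch more than one zone may indicate
    # lateral movement worth a closer look.
    return {u for u, zones in zones_by_user.items() if len(zones) > 1}

print(users_crossing_zones(logins))  # {'alice'}
```

At enterprise scale the same query runs over millions of edges, which is where GPU graph analytics and visual tools like Graphistry come in.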
57.
CYBERWORKS
CYBERWORKS SIEM SDK: Purpose Built SDK For SIEM Analytics
Purpose
A platform to let analysts hunt and analyze data at greater scale and speed than traditional big data, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and deep learning frameworks.
Goals
• Open source ecosystem & select ISVs
• Integration points with leading security vendors: FireEye, Splunk, Palo Alto Networks
58.
CYBERWORKS ACTIVITIES
Continuous Improvement
• Use GPU-accelerated databases to improve hunting today, and to enrich and label data for deep learning.
• Connect accelerated databases to Splunk for event management, hunting, and exploration; use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and to connect or bolt on GPU databases and Graphistry.
• Use ML and graph analytics for feature extraction and behavioral analytics: an ensemble approach to detection. Expand deep learning training as more data is labeled and classified and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs.
• Generalize deep learning for supervised and unsupervised anomaly and threat detection (insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator, using best practices from DriveWorks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well.
• While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D.