This document discusses accelerating cyber threat detection with GPUs. It begins by noting that current detection methods are too slow, taking an average of 98 days for financial services and up to 7 months for retailers. It then discusses how attacks are becoming more sophisticated and provides examples. The document outlines principles for cybersecurity, including improving indication of compromise through combining machine learning, graph analysis, and other methods. It discusses building an anomaly detection platform using deep learning and GPUs for improved performance. It also covers using GPU databases and visualization to further accelerate analytics and hunting of threats.
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
The GPU Open Analytics Initiative, GOAI, is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and leveraging GPUs data scientist can analyze more data than ever with less hardware. Learn more about how GPU are accelerating data science (not just Deep Learning), and how to get started.
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
My presentation from AnacondaCON 2018 where I discussed using Recurrent Neural Networks, Python, Tensorflow and the MapR Platform to develop deploy a predictive maintenance model for an IoT device in the manufacturing industry.
Accelerate AI w/ Synthetic Data using GANsRenee Yao
Strata Data Conference in Sep 2018 Presentation
Description:
Synthetic data will drive the next wave of deployment and application of deep learning in the real world across a variety of problems involving speech recognition, image classification, object recognition and language. All industries and companies will benefit, as synthetic data can create conditions through simulation, instead of authentic situations (virtual worlds enable you to avoid the cost of damages, spare human injuries, and other factors that come into play); unparalleled ability to test products, and interactions with them in any environment.
Join us for this introductory session to learn more about how Generative Adversarial Networks (GAN) are successfully used to improve data generation. We will cover specific real-world examples where customers have deployed GAN to solve challenges in healthcare, space, transportation, and retail industries.
Renee Yao explains how generative adversarial networks (GAN) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in healthcare, space, transportation, and retail industries.
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
If you want to learn more about it, check out our webinar here: https://www.youtube.com/watch?v=EfGPY_NyYQ8&t=77s
The webinar was organized by GetinData on 2020. During the webinar, we shared our lessons learnt from building and running stream processing platform in production for over 2 years.
Watch more here: https://www.youtube.com/watch?v=EfGPY_NyYQ8
Author: Krzysztof Zarzycki
Linkedin: https://www.linkedin.com/in/kzarzycki/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Managing Big Data projects in a constantly changing environment - Rafał Zalew...GetInData
Watch our full performance given by our team during the Big Data Technology Warsaw Summit: https://www.youtube.com/watch?v=CBrq7z8ikaM
The nature of Big Data projects are nowadays one of its kind - they are not like the data warehousing initiatives in the old days, nor like cloud native applications projects, at least not yet. Variety of technologies, complicated architectures and rapidly changing landscape are just a few challenges that the IT Department is facing in such projects. When you add the number of stakeholders from different departments involved and that Big Data project is sometimes more like an R&D with unpredictable outcome, this makes a mix where the objectives can be easily lost. It is not a surprise that up to 85% of Big Data projects were pure failures (Gartner 2016).
In this talk we will share our experience in planning and executing Big Data initiatives in the organisations, with some use cases and good practices in mind
Watch our webinar here: https://www.youtube.com/watch?v=CBrq7z8ikaM
Speakers:
Rafał Małanij
Rafał Zalewski
Linkedin: https://www.linkedin.com/in/rafalzalewski/
___
Company:
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Deep Learning in Security—An Empirical Example in User and Entity Behavior An...Databricks
Recently, deep learning has delivered groundbreaking advances in many industries. In this presentation, Dr. Wang will share empirical experiences of applying deep learning to solving some specific security problems with real-world customer attack detection examples. He will also discuss the challenges and guidelines for successfully deploying deep learning, or general machine learning, in broader security.
This session will feature two deep learning examples. The first example is a user-behavior anomaly detection solution using Convolutional Neural Network (CNN). Since CNN is most effective for image processing, Dr. Wang will introduce an innovative way to encode a user’s daily behavior into multi-channel images. He will also share the experimental comparison results of CNN hyperparameter tuning. The second example is a stateful user risk scoring system using Long Short Term Memory (LSTM). Most of the modern attacks happen in a multi-stage fashion, i.e., infection -> command & control -> lateral movement -> data infiltration -> data exfiltration. In this case, the company uses LSTM to monitor the temporal state transition of each user over these.“
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
The GPU Open Analytics Initiative, GOAI, is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and leveraging GPUs data scientist can analyze more data than ever with less hardware. Learn more about how GPU are accelerating data science (not just Deep Learning), and how to get started.
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
My presentation from AnacondaCON 2018 where I discussed using Recurrent Neural Networks, Python, Tensorflow and the MapR Platform to develop deploy a predictive maintenance model for an IoT device in the manufacturing industry.
Accelerate AI w/ Synthetic Data using GANsRenee Yao
Strata Data Conference in Sep 2018 Presentation
Description:
Synthetic data will drive the next wave of deployment and application of deep learning in the real world across a variety of problems involving speech recognition, image classification, object recognition and language. All industries and companies will benefit, as synthetic data can create conditions through simulation, instead of authentic situations (virtual worlds enable you to avoid the cost of damages, spare human injuries, and other factors that come into play); unparalleled ability to test products, and interactions with them in any environment.
Join us for this introductory session to learn more about how Generative Adversarial Networks (GAN) are successfully used to improve data generation. We will cover specific real-world examples where customers have deployed GAN to solve challenges in healthcare, space, transportation, and retail industries.
Renee Yao explains how generative adversarial networks (GAN) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in healthcare, space, transportation, and retail industries.
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
If you want to learn more about it, check out our webinar here: https://www.youtube.com/watch?v=EfGPY_NyYQ8&t=77s
The webinar was organized by GetinData on 2020. During the webinar, we shared our lessons learnt from building and running stream processing platform in production for over 2 years.
Watch more here: https://www.youtube.com/watch?v=EfGPY_NyYQ8
Author: Krzysztof Zarzycki
Linkedin: https://www.linkedin.com/in/kzarzycki/
___
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Managing Big Data projects in a constantly changing environment - Rafał Zalew...GetInData
Watch our full performance given by our team during the Big Data Technology Warsaw Summit: https://www.youtube.com/watch?v=CBrq7z8ikaM
The nature of Big Data projects are nowadays one of its kind - they are not like the data warehousing initiatives in the old days, nor like cloud native applications projects, at least not yet. Variety of technologies, complicated architectures and rapidly changing landscape are just a few challenges that the IT Department is facing in such projects. When you add the number of stakeholders from different departments involved and that Big Data project is sometimes more like an R&D with unpredictable outcome, this makes a mix where the objectives can be easily lost. It is not a surprise that up to 85% of Big Data projects were pure failures (Gartner 2016).
In this talk we will share our experience in planning and executing Big Data initiatives in the organisations, with some use cases and good practices in mind
Watch our webinar here: https://www.youtube.com/watch?v=CBrq7z8ikaM
Speakers:
Rafał Małanij
Rafał Zalewski
Linkedin: https://www.linkedin.com/in/rafalzalewski/
___
Company:
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Deep Learning in Security—An Empirical Example in User and Entity Behavior An...Databricks
Recently, deep learning has delivered groundbreaking advances in many industries. In this presentation, Dr. Wang will share empirical experiences of applying deep learning to solving some specific security problems with real-world customer attack detection examples. He will also discuss the challenges and guidelines for successfully deploying deep learning, or general machine learning, in broader security.
This session will feature two deep learning examples. The first example is a user-behavior anomaly detection solution using Convolutional Neural Network (CNN). Since CNN is most effective for image processing, Dr. Wang will introduce an innovative way to encode a user’s daily behavior into multi-channel images. He will also share the experimental comparison results of CNN hyperparameter tuning. The second example is a stateful user risk scoring system using Long Short Term Memory (LSTM). Most of the modern attacks happen in a multi-stage fashion, i.e., infection -> command & control -> lateral movement -> data infiltration -> data exfiltration. In this case, the company uses LSTM to monitor the temporal state transition of each user over these.“
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where is is looking after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
Many people are amazing at focusing their attention on one person or one voice in a multi speaker scenario, and ‘muting’ other people and background noise. This is known as the cocktail party effect. For other people it is a challenge to separate audio sources.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about gains, pains points, and merits of the solutions as it relates to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal : Real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets and smart hearing aids that will be running deep neural networks .
Participants will get an insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices in regards to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
Second presentation in Savi's sponsoring of the Washington DC Spark Interactive. Discusses use of Spark with Drools to create expert systems-based analytics for the Internet of Things (IoT)
Threat Detection and Response at Scale with Dominique BrezinskiDatabricks
Security monitoring and threat response has diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right.
We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This is about the fruit of that collaboration.
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Evention
Nowadays many companies become data rich and intensive. They have millions of users generating billions of interactions and events per day.
These massive streams of complex events can be processed and reacted upon to e.g. offer new products, next best actions, communicate to users or detect frauds, and quicker we can do it, the higher value we can generate.
In this talk we will present, how in joint development with our client and in just few months effort we have built from ground up a complex event processing platform for their intensive data streams. We will share how the system runs marketing campaigns or detect frauds by following behavior of millions users in real-time and reacting on it instantly. The platform designed and built with Big Data technologies to infinitely and cost-effectively scale already ingests and processes billions of messages or terabytes of data per day on a still small cluster. We will share how we leveraged the current best of breed open-source projects including Apache Flink, Apache Nifi and Apache Kafka, but also what interesting problems we needed to solve. Finally, we will share where we’re heading next, what next use cases we’re going to implement and how.
Get updates about OpenACC. This month focuses on: A new OpenACC Online Course, book and number of exciting events highlighted in the OpenACC September Update
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
Monitoring tools for Big Data frameworks are needed in order to asses the quality of applications deployed on these frameworks. This presentation introduces a new monitoring platform for Big Data technologies (Apache Hadoop, Spark, Storm...) based on ELK stack, that is easy to deploy and provides a facile control of the monitored cluster.
Fit For Purpose: Preventing a Big Data LetdownInside Analysis
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast October 6, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=9982ad3a2603345984895f279e849d35
Gartner recently placed Big Data in its “trough of disillusionment,” reflective of many leaders’ struggle to prove the value of Hadoop within their organization. While the promise of enhanced data integration and enrichment is obvious, measurable results have remained elusive. This episode of The Briefing Room will outline how to successfully tie Big Data to existing business applications, preventing your next Hadoop project from being another “Big Data letdown.”
Register today to learn from veteran Analyst Dr. Robin Bloor as he discusses the importance of converging enterprise data integration with intelligence and scalability. He’ll be briefed by George Corugedo of RedPoint Global, who will provide concrete examples of how the convergence of scalable cloud platforms, ever-expanding data sources and intelligent execution can turn the Big Data hype into demonstrable business value.
Visit InsideAnalysis.com for more information.
Network and IT Ops Series: Build Production Solutions Neo4j
Jeff Morris, Director, Neo4j:Are you building a breakthrough product or extending an existing one? Do you need introduce new capabilities based on insights from data relationships? If so, you should consider embedding a graph database.
For software providers building products to assure quality network operations or security, using an embedded graph database may open new customer opportunities. Watch this webinar to learn how you can easily differentiate your applications and take your solutions to market faster with a native graph database like Neo4j.
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
Graphics cards (GPU) open up new ways of processing and analytics over big data, showing millisecond selections over billions of lines, as well as telling stories about data. #QikkDB
How to present data to be understood by everyone? Data analysis is for scientists, but data storytelling is for everyone. For managers, product owners, sales teams, the general public. #TellStory
Learn about high performance computing with GPU and how to present data with a rich Covid-19 data story example on the upcoming webinar.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
Insnap, a hyper-personalized ML-based platform acquired by The Honest Company, has been used to build a real-time data platform based on Apache Spark, Cassandra and Redshift. Users’ behavioral and transactional data have been used to build data models and ML models, and to drive use cases for marketing, growth, finance and operations.
Learn how Honest Company has used Spark as a workhorse for 1) collecting, ETL and storing data from various sources including mysql, mongo, jde, Google analytics, Facebook, Localytics and REST API; 2) building data models and aggregating and generating reports of revenue, order fulfillment tracking, data pipeline monitoring and subscriptions; 3) Using ML to build model for user acquisitions, LTV and recommendations use cases. Spark replaced the monolithic codebase with flexible, scalable and robust pipelines. Databricks helped The Honest Company to focus on data instead of maintaining infrastructure. While Honest users got delightful recommendations to improve experience, data users at Honest understood users much better in terms of segmenting with behavioral information and advanced ML models, leading to increased revenue and retention.
Nicolas Trésegnie, Chief Architect at SuperAwesome
Abstract: SuperAwesome's mission is to make the internet safer for kids. At the core of SuperAwesome's analytics is Druid. In this talk, we walk through how we run Druid on spot instances. We explain the consequences in terms of cost and reliability, how we managed to build a reliable system despite the risks, and how you could do the same.
Nicolas works as Chief Architect at SuperAwesome, where is is looking after the overall architecture of the systems and the infrastructure. He is all about automation and how technology can be used to achieve business goals. Nicolas studied Computer Science and Bioinformatics, and he is now pursuing an MBA at Imperial.
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
Many people are amazing at focusing their attention on one person or one voice in a multi speaker scenario, and ‘muting’ other people and background noise. This is known as the cocktail party effect. For other people it is a challenge to separate audio sources.
In this presentation I will focus on solving this problem with deep neural networks and TensorFlow. I will share technical and implementation details with the audience, and talk about gains, pains points, and merits of the solutions as it relates to:
* Preparing, transforming and augmenting relevant data for speech separation and noise removal.
* Creating, training and optimizing various neural network architectures.
* Hardware options for running networks on tiny devices.
* And the end goal : Real-time speech separation on a small embedded platform.
I will present a vision of future smart air pods, smart headsets and smart hearing aids that will be running deep neural networks .
Participants will get an insight into some of the latest advances and limitations in speech separation with deep neural networks on embedded devices in regards to:
* Data transformation and augmentation.
* Deep neural network models for speech separation and for removing noise.
* Training smaller and faster neural networks.
* Creating a real-time speech separation pipeline.
Second presentation in Savi's sponsoring of the Washington DC Spark Interactive. Discusses use of Spark with Drools to create expert systems-based analytics for the Internet of Things (IoT)
Threat Detection and Response at Scale with Dominique BrezinskiDatabricks
Security monitoring and threat response has diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right.
We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This is about the fruit of that collaboration.
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...Evention
Nowadays many companies become data rich and intensive. They have millions of users generating billions of interactions and events per day.
These massive streams of complex events can be processed and reacted upon to e.g. offer new products, next best actions, communicate to users or detect frauds, and quicker we can do it, the higher value we can generate.
In this talk we will present, how in joint development with our client and in just few months effort we have built from ground up a complex event processing platform for their intensive data streams. We will share how the system runs marketing campaigns or detect frauds by following behavior of millions users in real-time and reacting on it instantly. The platform designed and built with Big Data technologies to infinitely and cost-effectively scale already ingests and processes billions of messages or terabytes of data per day on a still small cluster. We will share how we leveraged the current best of breed open-source projects including Apache Flink, Apache Nifi and Apache Kafka, but also what interesting problems we needed to solve. Finally, we will share where we’re heading next, what next use cases we’re going to implement and how.
Get updates about OpenACC. This month focuses on: A new OpenACC Online Course, book and number of exciting events highlighted in the OpenACC September Update
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
Monitoring tools for Big Data frameworks are needed in order to asses the quality of applications deployed on these frameworks. This presentation introduces a new monitoring platform for Big Data technologies (Apache Hadoop, Spark, Storm...) based on ELK stack, that is easy to deploy and provides a facile control of the monitored cluster.
Fit For Purpose: Preventing a Big Data LetdownInside Analysis
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast October 6, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=9982ad3a2603345984895f279e849d35
Gartner recently placed Big Data in its “trough of disillusionment,” reflective of many leaders’ struggle to prove the value of Hadoop within their organization. While the promise of enhanced data integration and enrichment is obvious, measurable results have remained elusive. This episode of The Briefing Room will outline how to successfully tie Big Data to existing business applications, preventing your next Hadoop project from being another “Big Data letdown.”
Register today to learn from veteran Analyst Dr. Robin Bloor as he discusses the importance of converging enterprise data integration with intelligence and scalability. He’ll be briefed by George Corugedo of RedPoint Global, who will provide concrete examples of how the convergence of scalable cloud platforms, ever-expanding data sources and intelligent execution can turn the Big Data hype into demonstrable business value.
Visit InsideAnalysis.com for more information.
Network and IT Ops Series: Build Production Solutions Neo4j
Jeff Morris, Director, Neo4j:Are you building a breakthrough product or extending an existing one? Do you need introduce new capabilities based on insights from data relationships? If so, you should consider embedding a graph database.
For software providers building products to assure quality network operations or security, using an embedded graph database may open new customer opportunities. Watch this webinar to learn how you can easily differentiate your applications and take your solutions to market faster with a native graph database like Neo4j.
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
Graphics cards (GPU) open up new ways of processing and analytics over big data, showing millisecond selections over billions of lines, as well as telling stories about data. #QikkDB
How to present data to be understood by everyone? Data analysis is for scientists, but data storytelling is for everyone. For managers, product owners, sales teams, the general public. #TellStory
Learn about high performance computing with GPU and how to present data with a rich Covid-19 data story example on the upcoming webinar.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
A complex real-time data workflow implementation is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in a hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support a dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark eco-system, including AI and ML tools, to simplify the development process
By 2020, 50% of all new software will process machine-generated data of some sort (Gartner). Historically, machine data use cases have required non-SQL data stores like Splunk, Elasticsearch, or InfluxDB.
Today, new SQL DB architectures rival the non-SQL solutions in ease of use, scalability, cost, and performance. Please join this webinar for a detailed comparison of machine data management approaches.
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both reservoir for historical data and as a "system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
Undertaking a digital journey starts with clearly articulating the success factors for the entire digital journey, and our experience from the field has shown it to be an Achilles heel for most CXOs, across Fortune 500 organizations. Our findings were corroborated when a Mckinsey study reported that only 15% of the organizations are able to calculate the ROI of a digital initiative.
In this talk we will deliberate on demonstrated examples from multi-billion dollar businesses around proven methodologies to measure the value of a digital enterprise. The panel will share experiences as well as provide actionable advice for immediate next steps around the following:
Successful metrics for measuring the value for Digital / IoT / AI/ Machine learning engagements
How can 'Digital Traction Metrics' help with actionable insights even before the Financial Metrics have been reported
What are the best in-class organizational constructs and futuristic employee engagement methods to facilitate the digital revolution
Panelists for this session include:
• Christian Bilien - Head of Global Data at Societe Generale
• Pierre Alexandre Pautrat – Head of Big Data at BPCE/Nattixis
• Ronny Fehling – VP , Airbus
• Juergen Urbanski – Silicon Valley Data Science
• Abhas Ricky - EMEA Lead, Innovation & Strategy, Hortonworks
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
Discover how to avoid common pitfalls when shifting to an event-driven architecture (EDA) in order to boost system recovery and scalability. We cover Kafka Schema Registry, in-broker transformations, event sourcing, and more.
Businesses are generating more data than ever before.
Doing real time data analytics requires IT infrastructure that often needs to be scaled up quickly and running an on-premise environment in this setting has its limitations.
Organisations often require a massive amount of IT resources to analyse their data and the upfront capital cost can deter them from embarking on these projects.
What’s needed is scalable, agile and secure cloud-based infrastructure at the lowest possible cost so they can spin up servers that support their data analysis projects exactly when they are required. This infrastructure must enable them to create proof-of-concepts quickly and cheaply – to fail fast and move on.
ING CoreIntel - collect and process network logs across data centers in near ...Evention
Security is at the core of every bank activity. ING set an ambitious goal to have an insight into the overall network data activity. The purpose is to quickly recognize and neutralize unwelcomed guests such as malware, viruses and to prevent data leakage or track down misconfigured software components.
Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amount of data of a various type from all over the world. In our session we would like to share our experience in building scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieving these goals.
Why choosing good data format matters? How to manage kafka offsets? Why dealing with Elasticsearch is a love-hate relationship for us or how we just managed to put it all these pieces together.
Splunk App for Stream for Enhanced Operational Intelligence from Wire DataSplunk
Join us to learn what is new in Splunk App for Stream and how it can help you utilize wire/network data analytics to proactively resolve applications and IT operational issues and to efficiently analyze security threats in real-time, across your cloud and on-premises infrastructures.
Introduction of streaming data, difference between batch processing and stream processing, Research issues in streaming data processing, Performance evaluation metrics , tools for stream processing.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone to moving data around doesn’t solve the problem of needing to grab changes from the source pushing them into Kafka and consuming the data from Kafka to be processed. If something unexpected happens – like connectivity is lost on either the source or the target side, you don’t want to have to fix it or start over because the data is out of sync.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Keith Kraus
Traditional security tools like security information and event managers (SIEMs) are struggling to keep up with the terabytes of event data (250M to 2B events) being generated each day from an ever-growing number of devices. Cybersecurity has become a data problem, and enterprises need to reply with scalable solutions to enable effective hunting and combat evolving attacks. Rethinking the cybersecurity problem as a data-centric problem led Accenture Labs’s Cybersecurity team to use emerging big data tools along with new approaches such as graph databases and analysis to exploit the connected nature of the data to its advantage. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs’s Cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and Datastax Graph to accelerate cyber defense.
Leveraging Datastax Graph and Blazegraph allows Accenture Labs to greatly accelerate query and analysis performance compared to traditional security tools like SIEM. Josh, Michael, and Keith share the challenges of fitting cybersecurity data into each of the graph structures, as well as the ways they exploited the connectedness of events to discover new threats that would have been missed in traditional SIEM tools. In addition, they explain how they use GPUs to accelerate graph analysis by using Blazegraph DASL. Josh, Michael, and Keith end by demonstrating how to efficiently and effectively stream data into these graph databases using best-in-breed technologies such as Apache Kafka, Spark, and Flink and touch on why Kudu is becoming an integral part of Accenture’s technology stack. Utilizing these technologies, clients have supercharged their security analysts’ cyber-hunting abilities and are uncovering threats faster.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Accelerating Cyber Threat Detection With GPU
1. Joshua Patterson | Director of Applied Solutions Engineering | GTC DC 2017
@datametrician
ACCELERATING CYBER THREAT
DETECTION WITH GPU
2. 2
RULES & PEOPLE DON’T SCALE
Right now, financial services reports it takes an average of 98 days to detect
an Advance Threat but retailers say it can be about seven months.
Once the security community moves beyond the mantras “encrypt
everything” and “secure the perimeter,” it can begin developing intelligent
prioritization and response plans to various kinds of breaches – with a
strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical
deployment, and making them reliable for large networks. This is where the
security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
Current methods are too slow
3. 3
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank’s Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
4. 4
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning is essential to reduce false positives, detect threats faster, and empower analyst
to be more efficient.
2. Event management is an accelerated analytics problem, the volume and velocity of data
from devices requires a new approach that combines all data sources to allow for more
intelligent/advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allows analyst to label and
train Deep Learning models faster, and validate machine learning prediciton.
5. 5
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning is essential to reduce false positives, detect threats faster, and empower analyst
to be more efficient.
7. 7
DATA PLATFORM-AS-A-SERVICE
• Handles 1M events/second
• Auto-scales the cluster
automatically
SCALE
• Offers HA with no data-loss
• Always-on architecture
• Data replication
HIGH AVAILABILITY
• Data platform security has
been implemented with
VPCs in AWS
• Dashboard access using
NVIDIA LDAP
SECURITY
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
SELF SERVICE
11. 11
ANOMALY DETECTION USING DEEP LEARNING
Data
Platform
AD AI
Framework
(Keras +
TensorFlow)
NGC/NGN
GPU
Cluster
NGC/NGN
GPU
ClusterGPU Cloud
Anomaly
Detection
Top Features
Automated Alerts & Dashboards
Early Detection
Self Service
Better accuracy & less noise
12. 12
Raw Dataset
Feature Learning
Algorithm: Recurrent
Neural Network (RNN),
Autoencoders (AE)
Unsupervised Learning:
Multivariate-Gaussian
Supervised Learning:
Logistic Regression
Anomalies: Email alerts,
Dashboards
Time X1 X2
Time X1 X2 X’ X’’
Time X1 X2 X’ X’’ Y
1
0
Anomaly Post-
processing: Univariate
Analysis
Time X1 X2 Y Anomaly
Description
1 X1
0
Anomaly Detection
Feedback
from user
ANOMALY DETECTION FRAMEWORK
13. 13
ANOMALY DETECTION BENEFITS WITH DEEP LEARNING
Top Features
Automated Alerts &
Dashboards
Early Detection
Self Service
Better accuracy & less
noise
14. 14
ANOMALY DETECTION TRAINING
• Evolution
• CPU vs GPU
• Learnings :
• Manual feature extraction does not scale
• Dataset preparation is the long pole
• Training on CPU takes longer than data collection rate
V0:
Manual
Feature
Creation
V1:
Automatic
Feature
Creation
using DL
(Theano)
V2: Multi-
GPU support
+ TensorFlow
Serving
(Keras +
TensorFlow)
15. 15
INFERENCING V1
• Use Case: Detecting anomalies with user’s activity
• Inferencing flow from 10k feet
• Started with python scripts for windowed aggregation
Live
Streaming
Data
AD Platform
ETL
aggregations
for inferencing
73
103
154
0
50
100
150
200
10 MINS 30 MINS 60 MINS
Python Script Performance
• Learnings: Hard to scale for near real time. AD platform runs inferencing
every 3 mins as we are impacted by speed of data processing
16. 16
INFERENCING V2
• V2: To improve performance, we started using Presto
with data on S3 in JSON format
• Live data will be streamed from Kafka to S3. We use
Presto for our data warehousing needs
• Presto is an open-source distributed SQL query engine
optimized for low-latency, ad-hoc analysis of data*
20
4
25
6
30
8
0
5
10
15
20
25
30
35
PRESTO ON JSON PRESTO ON PARQUET
10 mins 30 mins 60 mins
• Learnings: Presto with Parquet has best performance but we need
to batch data at 30 secs interval. So it’s not completely real time
Improved Performance
17. 17
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning is essential to reduce false positives, detect threats faster, and empower analyst
to be more efficient.
2. Event management is an accelerated analytics problem, the volume and velocity of data
from devices requires a new approach that combines all data sources to allow for more
intelligent/advanced threat hunting and exploration at scale across machine data.
18. 18
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else and how else can we use
GPU acceleration?
• Dashboards
• Accelerating data pipeline
• Stream processing
• Building better models faster
• First: GPU databases
Data Ingestion
Data Processing
Visualization
Model Training
Inferencing
19. 19
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
vs
Big Data Solution
10 node cluster - ~$60k in hardware
Production SIEM of Fortune 500 Enterprise Data
450+ columns
~250 million events per day
SIEM
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
20. 20
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Typical Scenario Time Period SIEM Big Data Speed Up
1 Show all network communication from one host
(IP) to multiple hosts (IPs)
1 Day 3h 20m 13s 1m 44s 114 Times Faster
1 Week Not Feasible* 4m 05s
2 Retrieve failed logon attempts in Active
Directory
1 Day 18m 26s 1m 37s 10 Times Faster
1 Week 2h 13m 45s 3m 10s 41 Times Faster
3 Search for Malware (exe) in Symantec logs 1 Day 3h 24m 36s 1m 37s 125 Times Faster
1 Week Not Feasible* 3m 22s
4 View all proxy logs for a for specific domain 1 Day 4h 30m 13s 2m 54s 92 Times Faster
1 Week Not Feasible* 1m 09s**
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
21. 21
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
21 30
1560
80 99
1250
150
269
2250
372
696
2970
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
MapD DGX-1 MapD 4 x P100 Redshift 6-node Spark 11-node
Query 1 Query 2 Query 3 Query 4
TimeinMilliseconds
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of
Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
10190 8134 19624 85942
22. 22
MAPD
MapD Core MapD Immerse
LLVM Backend Rendering Streaming
LLVM creates one custom function that
runs at speeds approaching hand-written
functions. LLVM enables generic
targeting of different architectures + run
simultaneously on CPU/GPU.
Speed eliminates need to pre-index or
aggregate data. Compute resides on
GPUs freeing CPUs to parse + ingest.
Finally, newest data can be combined with
billions of rows of “near historical” data.
Data goes from compute (CUDA) to
graphics (OpenGL) pipeline without copy
and comes back as compressed PNG
(~100 KB) rather than raw data (> 1GB).
23. 23
MAPD ARCHITECTURE
Visualization Libraries
JavaScript libraries that allow
users to build custom web-
based visualization apps
powered by a MapD Core
database based on DC.js.
LLVM
MapD Core SQL queries are
compiled with a just-in-time
(JIT) LLVM based compiler,
and run as NVIDIA GPU
machine code.
Distributed Scale-out
MapD Core has native
distributed scale-out
capabilities. MapD Core users
can query and visualize larger
datasets with much smaller
cluster sizes than traditional
solutions.
High Availability
MapD Core has high
availability functionality that
provides durability and
redundancy. Ingest and
queries are load balanced
across servers for additional
throughput.
Open Source Commercial
24. 24
MAPD + IMMERSE VS ELASTIC + KIBANA
Elastic + Kibana
• Fantastic for complex search
• Scales easily (up to a point)
• Indexing consumes more storage (~4-6x)
• Kibana for KPI dashboarding?
MapD Core
• Very fast OLAP queries
• JIT LLVM query compiler
• GPUs for compute
• CPUs for parse + ingest
• Limited join support (for now)
Immerse
• c3/d3 + crossfilter = nice dashboards
• Backend rendering
27. 27
INFERENCING V3
• V3: Explored GPU databases like MapD to improve the performance for querying on streaming live data
• MapD offers constant query response times
• MapD has some SQL limitations. We use Presto as an interface & built a “MapD-> Presto” connector for full
ANSI Sql features
20
4
0.1
1.2
25
6
0.1
1.2
30
8
0.1
1.2
0
5
10
15
20
25
30
35
PRESTO ON JSON PRESTO ON PARQUET MAPD PRESTO + MAPD
GPU Database Performance
10 mins 30 mins 60 mins
ExecutionTime(seconds)
28. 28
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated,
subtle, and hidden in the massive volume and velocity of data. Combining machine
learning, graph analysis, and applied statistics, and integrating these methods with deep
learning is essential to reduce false positives, detect threats faster, and empower analyst
to be more efficient.
2. Event management is an accelerated analytics problem, the volume and velocity of data
from devices requires a new approach that combines all data sources to allow for more
intelligent/advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allows analyst to label and
train Deep Learning models faster, and validate machine learning predictions.
32. 32
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
1/10th the hardware
1-2 orders of
magnitude more
performance
33. 33
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
1/10th the hardware
1-2 orders of
magnitude more
performance
Real time visualization of 100K+ nodes 1M+ Edges
50-100x faster clustering than other solutions
34. 34
LISTS DO NOT VISUALLY SCALE
Text search is a great
starting point!
Does not scale
Do not see the 30K+ events
nor the IPs, users, nor how
they relate…
36. 36
GRAPHS:
A KEY MISSING
VIEW
Unified Model
Shows entities, events, and relationships
Multipurpose: connect, see, interact
Visual
Inspect individual items
See behavior, patterns, and outliers
Scale to enterprise workloads
37. 37
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
Uni
Ex: Network mapping
“Is it safe to reboot this?”
ip ip
Hyper
Ex: Incident response
“Did this escalate?”
Multi
Ex: SSH trails
“Is a user crossing zones?”
ip
user
userip
ip
userevent
event
user
ip
40. 40
EXPAND GPU USAGE
More Data, Less Hardware
0.0
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
2008 2010 2012 2014 2016 2017
Peak Double Precision
NVIDIA GPU x86 CPU
TFLOPS
Scaling up and out with
GPU co-processors
42. 42
GPU DATA FRAME
Pre-GOAI
H2O.ai
Graphistry Anaconda
Gunrock
BlazingDB MapD
CPU
APP A
APP B
Copy & Convert
Copy & Convert Copy & Convert
Copy & Convert Copy & ConvertCopy & Convert Copy & Convert
Evolution of Performance
No Copy & Converts - Full Interoperability
H2O.ai
Anaconda Gunrock
Graphistry
BlazingDB MapD
GPU Data
Frame
Apache
Arrow
43. 43
CYBERWORKS
CYBERWORKS SIEM SDK
Goals
• Open Source Ecosystem & Select ISVs
• Integration Points w/ leading security vendors
• FireEye
• Splunk
• Palo Alto Networks
Purpose
A platform to allow analysts to hunt and
analyze data faster at scale than traditional
big data to find unknown and zero day threats.
It will accelerate the threat detection
ecosystem and harden cyber defense utilizing
GPU ISVs and Deep Learning Frameworks.
Purpose Built SDK For SIEM Analytics
44. 44
CYBERWORKS ACTIVITIES
Continuous Improvement
Use GPU accelerated
databases to analyze
data to improve
hunting today, as well
as enrich and label
data for Deep Learning
Connect accelerated
DBs to Splunk for event
management, hunting,
and exploration. Use
Graphistry and MapD to
visualize the data for
anomaly and threat
detection in new ways.
The goal is to GPU
accelerate parts of
Splunk through
partnership and
connect/bolt on
GPUDBs/Graphistry
Use ML and Graph
Analytics for feature
extraction and behavioral
analytics, an ensemble
approach to detection.
Expand Deep Learning
training as more data is
labeled/classified, and
threats are caught faster,
building off DL techniques
used in GFN, other
groups, and external ISV.
Generalize Deep Learning for supervised
and unsupervised anomaly and threat
detection (Insider, APT, DDOS, etc…)
while building our own cyber security deep
learning accelerator. Use best practices
from Driveworks and other accelerators
and SDK as a reference architecture.
Leverage DL from other parts of the firm to
accelerate development as well.
While using Splunk
Cloud to protect
Nvidia, we create a
redundant path of data
to enable R&D.
nvGRAPH
46. 46
CYBERWORKS HARDWARE
Scale out Cluster
DGX Cluster
NAS
SIEM
Notebooks
End
User
3rd Party
Apps
Messaging
Queue
Accelerating your SIEM
47. 47
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics
GPU Data Frame (GDF)
Ingest/
Parse
Exploratory
Analysis
Feature
Engineering
ML/DL
Algorithms
Grid Search
Scoring
Model
Export
@gpuoai
Apache Arrow
48. 48
ANACONDA
Python ETL on GPU
A Python open-source just-in-time
optimizing compiler that uses LLVM to
produce native machine instructions.
Primary Contributor to PyGDF.
Dask is a flexible parallel computing
library for analytic computing with
dynamic task scheduling and big data
collections.
Primary contributor to Dask_GDF.
Jeremy Howard
Deep learning researcher & educator.
Founder: fast.ai; Faculty: USF &
Singularity University; // Previously - CEO:
Enlitic; President: Kaggle; CEO Fastmail
Rewrote @scikit_learn PolynomialFeatures in
@ContinuumIO Numba. Got a 40x speedup (would
be bigger with more data!) 12 lines of code