Self-driving computers: active learning workflows with human-interpretable ve... - Adam Gibson
Human-in-the-loop learning workflows that leverage deep learning to group and cluster data, along with techniques for accounting for machine learning failures.
Anomaly Detection and Automatic Labeling with Deep Learning - Adam Gibson
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatic labeling (and the pitfalls of doing so), and discover how you can deploy these techniques in your organization.
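To make the core idea concrete, here is a minimal sketch of VAE-based anomaly scoring on fixed-length windows of location data; the window size, network shape, and 99th-percentile threshold are illustrative assumptions, not details from the talk.

    # Minimal VAE anomaly-scoring sketch (assumptions: flattened windows,
    # tiny dense encoder/decoder, top-1% reconstruction error = anomaly).
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    WINDOW = 32   # hypothetical: 16 (lat, lon) pairs per flattened window
    LATENT = 4

    class Sampling(layers.Layer):
        """Reparameterization trick; also registers the KL term as a loss."""
        def call(self, z_mean, z_log_var):
            kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
            self.add_loss(kl)
            eps = tf.random.normal(tf.shape(z_mean))
            return z_mean + tf.exp(0.5 * z_log_var) * eps

    inputs = tf.keras.Input(shape=(WINDOW,))
    h = layers.Dense(16, activation="relu")(inputs)
    z = Sampling()(layers.Dense(LATENT)(h), layers.Dense(LATENT)(h))
    outputs = layers.Dense(WINDOW)(layers.Dense(16, activation="relu")(z))
    vae = tf.keras.Model(inputs, outputs)
    vae.compile(optimizer="adam", loss="mse")  # reconstruction + KL (add_loss)

    # Train on (mostly) normal windows, then flag the worst-reconstructed
    # windows as anomalies / candidates for human labeling.
    normal = np.random.randn(1024, WINDOW).astype("float32")  # placeholder data
    vae.fit(normal, normal, epochs=5, batch_size=64, verbose=0)
    errors = np.mean((normal - vae.predict(normal, verbose=0)) ** 2, axis=1)
    flagged = errors > np.percentile(errors, 99)  # top 1% = anomaly candidates

Because the rare "anomaly" class never needs labeled examples at training time, this kind of scoring is a natural front end for an automatic labeling pipeline.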
Netflix's success is credited to the pioneering ways the company introduced AI and ML into its products, services, and infrastructure. Machine learning is applied to solve a wide range of problems at Netflix.
When data grows in sample count, feature count, and model parameter count, training and serving become difficult in new ways. The slideshow presents an overview of what to expect and how to handle it.
Real-Time Image Recognition with Apache Spark with Nikita Shamgunov - Databricks
The future of computing is visual. With everything from smartphones to Spectacles, we are about to see more digital imagery and associated processing than ever before.
In conjunction, new computing models are rapidly appearing to help data engineers harness the power of this imagery. Vast cloud-platform resources and shared processing algorithms are moving the industry forward quickly, and pre-built models are readily available as well.
This session will examine the image recognition techniques available with Apache Spark, and how to put those techniques into production. It will further explore algebraic operations on tensors, and how that can assist in large-scale, high-throughput, highly-parallel image recognition. In particular, this session will showcase the use of Spark in conjunction with a high-performance database to operationalize these workflows.
Learn about a combination of:
-Architectural considerations in building an image recognition pipeline
-Advantages and pitfalls of specific approaches
-Real-time capabilities for instant matches
-Use of a fast relational datastore to persist data from Spark
You'll also see a live demonstration of constructing and executing a real-time image recognition pipeline.
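As a rough illustration of such a pipeline, the sketch below scores images with a pre-trained classifier inside Spark and persists results over JDBC; the MobileNetV2 model, paths, table name, and database endpoint are all assumptions for illustration, not details from the session.

    # Sketch: batch image scoring in Spark, persisted to a relational store.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("image-recognition").getOrCreate()

    # Spark's built-in binary-file reader; each row carries path + raw bytes.
    images = spark.read.format("binaryFile").load("s3://bucket/images/*.jpg")

    def score_partition(rows):
        # Load the model once per partition, not once per image.
        import io
        import numpy as np
        from PIL import Image
        from tensorflow.keras.applications import MobileNetV2
        from tensorflow.keras.applications.mobilenet_v2 import (
            preprocess_input, decode_predictions)
        model = MobileNetV2(weights="imagenet")
        for row in rows:
            img = Image.open(io.BytesIO(row.content)).convert("RGB")
            x = np.asarray(img.resize((224, 224)), dtype="float32")[None, ...]
            preds = model.predict(preprocess_input(x), verbose=0)
            yield (row.path, decode_predictions(preds, top=1)[0][0][1])

    scored = images.rdd.mapPartitions(score_partition).toDF(["path", "label"])

    # Persist matches to a fast relational store for instant lookups.
    (scored.write.format("jdbc")
           .option("url", "jdbc:mysql://dbhost/images")  # hypothetical endpoint
           .option("dbtable", "image_labels")
           .option("user", "app").option("password", "secret")
           .mode("append").save())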
Learn how to get started with BigDL, a distributed deep learning library for Apache Spark. You'll also see a demo of a deep learning application that uses BigDL running on a Spark cluster. The application identifies handwritten digits (0 to 9) using a LeNet-5 convolutional neural network trained and validated on the MNIST database.
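A hedged sketch of the LeNet-5 model definition using BigDL's 0.x Python API (whose layer names mirror Torch); module paths and the training call may differ across BigDL versions.

    # LeNet-5 sketch in the BigDL 0.x Python API; layer names mirror Torch.
    from bigdl.nn.layer import (Sequential, Reshape, SpatialConvolution, Tanh,
                                SpatialMaxPooling, Linear, LogSoftMax)

    def build_lenet5(class_num=10):
        model = Sequential()
        model.add(Reshape([1, 28, 28]))            # MNIST images: 1x28x28
        model.add(SpatialConvolution(1, 6, 5, 5))  # 6 feature maps, 5x5 kernel
        model.add(Tanh())
        model.add(SpatialMaxPooling(2, 2, 2, 2))
        model.add(SpatialConvolution(6, 12, 5, 5))
        model.add(Tanh())
        model.add(SpatialMaxPooling(2, 2, 2, 2))
        model.add(Reshape([12 * 4 * 4]))
        model.add(Linear(12 * 4 * 4, 100))
        model.add(Tanh())
        model.add(Linear(100, class_num))
        model.add(LogSoftMax())                    # pairs with ClassNLLCriterion
        return model

    # Training then runs through BigDL's Optimizer over an RDD of Samples,
    # roughly: Optimizer(model=build_lenet5(), training_rdd=train_rdd,
    #                    criterion=ClassNLLCriterion(), optim_method=SGD(),
    #                    end_trigger=MaxEpoch(10), batch_size=256).optimize()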
Distributed Inference on Large Datasets Using Apache MXNet and Apache Spark ... - Databricks
Deep learning has become ubiquitous with the abundance of data and the commoditization of compute and storage. Pre-trained models are readily available for many use cases. Distributed inference has many applications, such as pre-computing results offline and backfilling historic data with predictions from state-of-the-art models. Inference on large-scale datasets comes with many of the challenges prevalent in distributed data processing.
Attendees will learn how to efficiently run deep learning prediction on large data sets, leveraging Apache Spark and Apache MXNet (incubating).
In this session, we’ll cover core Deep Learning Concepts such as:
-Types of learning: a) supervised learning, b) unsupervised learning, c) active learning, d) reinforcement learning
-Supervised learning types: classification, regression, image classification
-Types of neural networks: feed-forward networks, CNNs, RNNs, GANs
-The Apache MXNet (incubating) deep learning framework and its concepts, i.e., NDArray, the Symbolic and Module APIs, and the Gluon API
-Distributed inference using Apache MXNet and Apache Spark on Amazon EMR
In this section, I will cover some of the use cases of distributed inference and the challenges associated with running it at scale; a sketch of the Spark-side mechanics follows.
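A minimal sketch of the pattern, assuming a pre-trained model from the Gluon model zoo and worker-accessible image paths; batching and input normalization are simplified for brevity.

    # Sketch: per-partition model loading + inference with PySpark and MXNet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mxnet-inference").getOrCreate()
    paths = spark.sparkContext.textFile("s3://bucket/image_paths.txt")

    def predict_partition(path_iter):
        # Import and load the model on the worker, once per partition.
        import mxnet as mx
        from mxnet.gluon.model_zoo import vision
        net = vision.resnet18_v1(pretrained=True)
        for p in path_iter:
            img = mx.image.imread(p)                  # HWC uint8; local path
            img = mx.image.imresize(img, 224, 224)
            x = mx.nd.transpose(img.astype("float32") / 255.0, axes=(2, 0, 1))
            out = net(x.expand_dims(axis=0))          # NCHW batch of one
            # (Proper ImageNet mean/std normalization omitted for brevity.)
            yield (p, int(out.argmax(axis=1).asscalar()))

    predictions = paths.mapPartitions(predict_partition)
    predictions.saveAsTextFile("s3://bucket/predictions/")

Loading the model inside mapPartitions, rather than per record or on the driver, is the key design choice: it amortizes model initialization and avoids shipping weights through closures.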
Willump: Optimizing Feature Computation in ML Inference - Databricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
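The blurb doesn't spell out Willump's optimizations, but one core idea it relies on is the end-to-end cascade: compute cheap features for every input and fall back to expensive features only when a cheap model is unconfident. A hypothetical sketch, with all models, feature functions, and the confidence threshold as placeholders:

    # Hypothetical feature-computation cascade: a cheap model on cheap
    # features answers confident cases; expensive features are computed
    # only for the uncertain remainder.
    import numpy as np

    def predict_cascade(inputs, cheap_feats, cheap_model,
                        expensive_feats, full_model, confidence=0.9):
        # cheap_feats / expensive_feats: callables returning 2-D arrays.
        X_cheap = cheap_feats(inputs)
        proba = cheap_model.predict_proba(X_cheap)
        confident = proba.max(axis=1) >= confidence
        preds = proba.argmax(axis=1)
        if (~confident).any():
            hard = [x for x, ok in zip(inputs, confident) if not ok]
            X_full = np.hstack([cheap_feats(hard), expensive_feats(hard)])
            preds[~confident] = full_model.predict(X_full)
        return preds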
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ... - Big Data Spain
GPUs in the cloud, offered as Infrastructure as a Service (IaaS), may seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th, Kinépolis Madrid
DeepLearning4J and Spark: Successes and Challenges - François Garillot - sparktc
At the recent sold-out Spark & Machine Learning Meetup in Brussels, François Garillot of Skymind delivered a lightning talk called DeepLearning4J and Spark: Successes and Challenges.
Specifically, François offered a tour of the DeepLearning4J architecture intermingled with applications. He went over the main blocks of this deep learning solution for the JVM, which includes GPU acceleration, a custom n-dimensional array library, a parallelized data-loading Swiss-army tool, and deep learning and reinforcement learning libraries, all behind an easy-access interface.
Along the way, he pointed out the strategic points of parallelization of computation across machines and gave insight on where Spark helps — and where it doesn't.
David Kale and Ruben Fizsel from Skymind talk about deep learning for the JVM and enterprise using deeplearning4j (DL4J). Deep learning (nouveau neural nets) has sparked a renaissance in empirical machine learning, with breakthroughs in computer vision, speech recognition, and natural language processing. However, many popular deep learning frameworks are targeted at researchers and poorly suited to enterprise settings that use Java-centric big data ecosystems. DL4J bridges the gap, bringing high-performance numerical linear algebra libraries and state-of-the-art deep learning functionality to the JVM.
Deep Learning Use Cases - Data Science Pop-up Seattle - Domino Data Lab
Companies like Google, Microsoft, Amazon and Facebook are in fierce competition for teams that can build deep-learning applications. Because of deep learning's general usefulness in pattern recognition, those applications are surprisingly diverse, ranging from image recognition to machine translation. This talk will explore deep learning use cases for the major data types -- image, sound, text and time series -- as they're emerging in the private sector. Presented by Chris Nicholson, Co-Founder and CEO at Skymind.
Slides from Strata+Hadoop Singapore 2016 presenting how deep learning can be scaled both vertically and horizontally, and when to use CPUs versus GPUs.
sudoers: Benchmarking Hadoop with ALOJA - Nicolas Poggi
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open-source benchmarking platform. The presentation was mostly a live demo; these slides are posted for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and TensorFlow for hyperparameter tuning
* Leveraging Spark and TensorFlow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for combining Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
Deep Learning with Apache Spark and GPUs with Pierce Spitler - Databricks
Apache Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This session will cover:
– How to leverage Spark and TensorFlow for hyperparameter tuning and for deploying trained models
– DeepLearning4J, CaffeOnSpark, IBM’s SystemML and Intel’s BigDL
– Sidecar GPU cluster architecture and Spark-GPU data reading patterns
– The pros, cons and performance characteristics of various approaches
You’ll leave the session better informed about the available architectures for Spark and deep learning, and Spark with and without GPUs for deep learning. You’ll also learn about the pros and cons of deep learning software frameworks for various use cases, and discover a practical, applied methodology and technical examples for tackling big data deep learning.
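As a concrete illustration of the Spark-plus-TensorFlow tuning pattern both descriptions mention, the sketch below trains one hyperparameter configuration per Spark task; the toy model, synthetic data, and search grid are assumptions for illustration.

    # Sketch: embarrassingly parallel hyperparameter search on Spark.
    import itertools
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hyperparam-search").getOrCreate()

    grid = [{"lr": lr, "units": u}
            for lr, u in itertools.product([1e-2, 1e-3, 1e-4], [32, 64, 128])]

    def train_one(cfg):
        # TensorFlow is imported on the worker so each task is independent.
        import numpy as np
        import tensorflow as tf
        X = np.random.randn(512, 10).astype("float32")   # placeholder data
        y = (X.sum(axis=1) > 0).astype("float32")
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(cfg["units"], activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(cfg["lr"]),
                      loss="binary_crossentropy")
        hist = model.fit(X, y, epochs=3, verbose=0, validation_split=0.2)
        return (hist.history["val_loss"][-1], cfg)

    # One Spark task per configuration; collect and pick the best.
    best_loss, best_cfg = (spark.sparkContext
                           .parallelize(grid, len(grid))
                           .map(train_one)
                           .min(key=lambda t: t[0]))
    print(best_loss, best_cfg)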
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs, along with some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
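For the multi-GPU part, here is a minimal data-parallel sketch using TensorFlow's MirroredStrategy (one common technique; the tiny model and synthetic data are placeholders).

    # Sketch: single-box multi-GPU data parallelism with MirroredStrategy.
    import numpy as np
    import tensorflow as tf

    print("GPUs visible:", tf.config.list_physical_devices("GPU"))

    strategy = tf.distribute.MirroredStrategy()  # one replica per local GPU
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

    X = np.random.randn(4096, 100).astype("float32")   # placeholder data
    y = np.random.randint(0, 10, size=4096)
    # The global batch is split across replicas automatically.
    model.fit(X, y, batch_size=256, epochs=2)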
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over... - Ilham Amezzane
Support Vector Machines (SVMs) have proven to yield high accuracy and have been widely used in recent years. However, the standard versions of the SVM algorithm are very time-consuming and computationally intensive, which challenges engineers to explore hardware architectures other than the CPU that are capable of performing real-time training and classification while maintaining low power consumption in embedded systems. This paper proposes an overview of works based on the two most popular parallel processing devices, GPU and FPGA, with a focus on the multiclass training process. Since different techniques have been evaluated using different experimentation platforms and methodologies, we only focus on the improvements realized in each study.
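For a rough feel of the CPU-vs-GPU gap the paper surveys, here is a hedged timing sketch comparing scikit-learn's CPU SVC with ThunderSVM's GPU SVC (which exposes a scikit-learn-like API); actual speedups depend entirely on hardware and problem size, and FPGA implementations have no comparable drop-in Python API.

    # Sketch: timing multiclass RBF-SVM training on CPU vs. GPU.
    import time
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC as CpuSVC

    X, y = make_classification(n_samples=5000, n_features=50,
                               n_informative=30, n_classes=4, random_state=0)

    t0 = time.time()
    CpuSVC(kernel="rbf").fit(X, y)       # one-vs-one multiclass on the CPU
    print("CPU SVM:", time.time() - t0, "s")

    try:
        from thundersvm import SVC as GpuSVC  # assumption: ThunderSVM installed
        t0 = time.time()
        GpuSVC(kernel="rbf").fit(X, y)   # same interface, runs on the GPU
        print("GPU SVM:", time.time() - t0, "s")
    except ImportError:
        print("ThunderSVM not installed; skipping the GPU run.")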
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
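As a small illustration of the ingest side of such a pipeline, the sketch below produces Avro-encoded events to Kafka with kafka-python and fastavro; the topic name, schema, and broker address are illustrative assumptions rather than Improve Digital's actual setup.

    # Sketch: producing schema'd (Avro) events to Kafka.
    import io
    import fastavro
    from kafka import KafkaProducer

    schema = fastavro.parse_schema({
        "name": "AdEvent", "type": "record",
        "fields": [
            {"name": "ts", "type": "long"},
            {"name": "server", "type": "string"},
            {"name": "impressions", "type": "int"},
        ],
    })

    def encode(event):
        # Schemaless encoding; a schema registry would normally version this.
        buf = io.BytesIO()
        fastavro.schemaless_writer(buf, schema, event)
        return buf.getvalue()

    producer = KafkaProducer(bootstrap_servers="broker:9092",
                             value_serializer=encode)
    producer.send("ad-events", {"ts": 1700000000000,
                                "server": "edge-01", "impressions": 42})
    producer.flush()

Pinning every message to an explicit Avro schema is what makes the later schema evolution mentioned above tractable: readers and writers can disagree on versions without breaking the stream.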
Deploying signature verification with deep learning - Adam Gibson
The presentation covered building a signature verification system and deploying it to production, including resource usage and how the model was picked.
Meetup held in Tokyo with the Deep Learning Otemachi group.
A recent presentation on Deeplearning4j's new features, as well as some underused parts of the framework such as Arbiter, DataVec's transform process, and libnd4j.
This talk was on deep learning use cases outside of computer vision. It also covered larger-scale patterns of what good deep learning use cases typically look like, ending with an explanation of anomaly detection and various kinds of anomaly use cases.
Distributed deep RL on Spark - Strata Singapore - Adam Gibson
This talk briefly covers deep reinforcement learning on Spark and the benefits of using large-scale commodity compute with GPUs, both for ease of running simulations and for distributed training on use cases that aren't games, such as network intrusion and risk. It also briefly mentions RL4J and our work with OpenAI Gym.
Deep learning in production with the best - Adam Gibson
Getting deep learning adopted at your company, and the current landscape of academia vs. industry. Presented at AI With the Best (online conference):
http://ai.withthebest.com/
Strata Beijing - Deep Learning in Production on Spark - Adam Gibson
A recent talk at Strata Beijing, half English and half Chinese, covering use cases of deep learning, deep learning in production, and the different components of Deeplearning4j.
Gave a talk at:
www.meetup.com/SF-Bayarea-Machine-Learning/events/221739934/
Covers the basic architecture of a scientific computing library, and my take on it with ND4J.
These slides accompanied a demo of Deeplearning4j at the SF Data Mining Meetup hosted by Trulia.
http://www.meetup.com/Data-Mining/events/212445872/
Deep learning is useful for detecting and identifying similarities to augment search and text analytics; predicting customer lifetime value and churn; and recognizing faces and voices.
Deeplearning4j is a highly scalable deep learning architecture suitable for Hadoop and other big data structures. It includes both a distributed deep learning framework and a normal single-machine framework; i.e., it runs on a single thread as well. Training takes place in the cluster, which means it can process massive amounts of data. Nets are trained in parallel via iterative reduce, and they are equally compatible with Java, Scala, and Clojure. The distributed framework is made for data input and neural net training at scale, and its output should be highly accurate predictive models.
The framework's neural nets include restricted Boltzmann machines, deep-belief networks, deep autoencoders, convolutional nets and recursive neural tensor networks.
Finally, Deeplearning4j integrates with GPUs. A stable version was released in October.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
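A minimal single-machine sketch of the levelwise scheme using networkx: condense the graph into its DAG of strongly connected components, then iterate ranks one topological level at a time; the damping factor and tolerance are conventional defaults, not values from the report.

    # Sketch: levelwise PageRank over the SCC condensation of a digraph.
    import networkx as nx

    def levelwise_pagerank(G, d=0.85, tol=1e-6):
        # G: nx.DiGraph; precondition from the report: no dead ends.
        C = nx.condensation(G)          # DAG of strongly connected components
        N = G.number_of_nodes()
        ranks = {v: 1.0 / N for v in G}
        for comp_id in nx.topological_sort(C):
            comp = set(C.nodes[comp_id]["members"])
            # Iterate only within this component; contributions from earlier
            # levels are already final, so no cross-level re-computation.
            while True:
                delta, new = 0.0, {}
                for v in comp:
                    s = sum(ranks[u] / G.out_degree(u)
                            for u in G.predecessors(v))
                    new[v] = (1 - d) / N + d * s
                    delta = max(delta, abs(new[v] - ranks[v]))
                ranks.update(new)
                if delta < tol:
                    break
        return ranks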
The Building Blocks of QuestDB, a Time Series Database - Javier Ramirez
Talk delivered at Valencia Codes Meetup, June 2024.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative data analysis overview:
-Reliability Analysis (Cronbach Alpha; see the sketch after this list)
-Common Method Bias (Harman Single Factor Test)
-Frequency Analysis (Demographic)
-Descriptive Analysis
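As a small illustration of the reliability analysis listed above, here is a sketch of Cronbach's alpha computed from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the survey responses are synthetic placeholders.

    # Sketch: Cronbach's alpha for a k-item scale.
    import numpy as np

    def cronbach_alpha(items):
        """items: (n_respondents, k_items) array of scale responses."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)       # per-item variance
        total_var = items.sum(axis=1).var(ddof=1)   # variance of summed scores
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 1))
    responses = latent + rng.normal(scale=0.5, size=(200, 5))  # 5 correlated items
    print(round(cronbach_alpha(responses), 3))   # high alpha => reliable scale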
Techniques to optimize the PageRank algorithm usually fall into two categories: one tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and could thus reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. A sketch of the convergence-skipping idea follows.
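The sketch below illustrates the first technique, skipping computation on vertices whose ranks have already converged; like the heuristic described, it accepts a small approximation error, since a retired vertex is not revisited even if its in-neighbors later move.

    # Sketch: power-iteration PageRank that retires converged vertices.
    def pagerank_skip_converged(out_edges, d=0.85, tol=1e-6, max_iter=100):
        # out_edges: {vertex: [targets]}, assumed to have no dead ends.
        N = len(out_edges)
        in_edges = {v: [] for v in out_edges}
        for u, vs in out_edges.items():
            for v in vs:
                in_edges[v].append(u)
        ranks = {v: 1.0 / N for v in out_edges}
        active = set(out_edges)          # vertices still being updated
        for _ in range(max_iter):
            if not active:
                break
            new = {}
            for v in active:
                s = sum(ranks[u] / len(out_edges[u]) for u in in_edges[v])
                new[v] = (1 - d) / N + d * s
            # Retire vertices whose rank moved less than the tolerance.
            for v, r in new.items():
                if abs(r - ranks[v]) < tol:
                    active.discard(v)
                ranks[v] = r
        return ranks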
2. Neural net training basics
-Vectorization / different kinds of data
-Parameters: a whole neural net consists of a graph and a parameter vector
-Minibatches: neural net data requires lots of RAM, so minibatch training is needed
4. Parameters / neural net structure
-Computation graph: a neural net is just a DAG of ndarrays/tensors
-The parameters of a neural net can be made into a vector representing all the connections/weights in the graph (see the sketch after this list)
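A minimal sketch of that parameter-vector idea: flattening every weight array of a toy network into one vector and restoring it; the layer shapes are arbitrary placeholders.

    # Sketch: the whole net's weights as a single flat parameter vector.
    import numpy as np

    shapes = [(10, 32), (32,), (32, 2), (2,)]          # weights and biases
    params = [np.random.randn(*s) for s in shapes]

    def flatten(params):
        return np.concatenate([p.ravel() for p in params])

    def unflatten(vec, shapes):
        out, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            out.append(vec[i:i + n].reshape(s))
            i += n
        return out

    vec = flatten(params)                # one vector for the whole net
    restored = unflatten(vec, shapes)
    assert all(np.array_equal(a, b) for a, b in zip(params, restored))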
5. Minibatches
-Data is partitioned into subsamples
-Fits on the GPU
-Trains faster
-Each batch should be as representative a sample as possible (every label present); a sketch follows
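A minimal sketch of a minibatch iterator that keeps every label present in each batch via class-wise shuffling and round-robin interleaving; the data is a balanced synthetic placeholder.

    # Sketch: stratified minibatches so every label appears in each batch.
    import numpy as np

    def stratified_minibatches(X, y, batch_size, seed=0):
        rng = np.random.default_rng(seed)
        by_class = [rng.permutation(np.where(y == c)[0]) for c in np.unique(y)]
        # Round-robin across classes; zip truncates to the smallest class,
        # which is fine for this balanced placeholder.
        interleaved = [i for group in zip(*by_class) for i in group]
        for start in range(0, len(interleaved), batch_size):
            idx = np.array(interleaved[start:start + batch_size])
            yield X[idx], y[idx]

    X = np.random.randn(600, 8).astype("float32")
    y = np.repeat(np.arange(3), 200)     # 3 balanced classes
    for xb, yb in stratified_minibatches(X, y, batch_size=32):
        pass  # feed xb, yb to the training step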
8. Multiple GPUs
-Single box; could be multiple host threads
-RDMA (Remote Direct Memory Access) interconnect
-NVLink, typically used on a data center rack
-Break the problem up
9. Multiple GPUs and multiple computers
-Coordinate the problem over a cluster
-Use GPUs for compute
-Can be done via MPI or Hadoop (host-thread coordination)
-Parameter server: synchronizes parameters through a master and also handles things like the GPU interconnect
11. Lots of different algorithms
-All-reduce
-Iterative reduce
-Pure model parallelism
-Parameter averaging is key here
12. Core ideas
-Partition the problem into chunks; this can apply to the neural net as well as the data
-Use as many CUDA or CPU cores as possible
13. How does parameter averaging work? (see the sketch after this list)
-Replicate the model across the cluster
-Train on different portions of the data with the same model
-Synchronize as minimally as possible while still producing a good model
-Hyperparameters should be more aggressive (higher learning rates)
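A minimal numpy sketch of the procedure: replicas run local SGD on their own shard, and the master averages the parameter vectors once per synchronization round; the linear model stands in for a real neural net.

    # Sketch: parameter averaging across simulated workers.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 10))
    true_w = rng.normal(size=10)
    y = X @ true_w + rng.normal(scale=0.1, size=4000)

    def sgd_steps(w, Xs, ys, lr=0.01, steps=50, batch=32):
        for _ in range(steps):
            idx = rng.integers(0, len(ys), batch)
            grad = 2 * Xs[idx].T @ (Xs[idx] @ w - ys[idx]) / batch
            w = w - lr * grad
        return w

    n_workers = 4
    shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))
    w = np.zeros(10)                     # global parameter vector
    for round_ in range(10):             # one sync per round, not per step
        replicas = [sgd_steps(w.copy(), Xs, ys) for Xs, ys in shards]
        w = np.mean(replicas, axis=0)    # parameter averaging on the master
    print("error:", np.linalg.norm(w - true_w))

Syncing once per round rather than per step is the point of the technique: it trades some statistical efficiency for far less communication, which is why the next slide treats averaging as a regularizer needing more aggressive hyperparameters.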
17. Tuning distributed training
-Averaging acts as a form of regularization
-Needs more aggressive hyperparameters
-Not always going to be faster: account for how many data points you have
-Distributed-systems wisdom applies here: send code to the data, not the other way around
-Reduce communication overhead for maximum performance