Convolutional Neural Networks at scale in Spark MLlib


Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation covers the methods the algorithm uses to automatically generate features that capture nonlinear structure in data, as well as the process by which it is trained. Major aspects include compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. The applications portion looks at how to use convolutional neural networks to model data in computer vision, natural language, and signal processing. Details on optimal preprocessing, the type of structure that can be learned, and managing the model's ability to generalize will inform developers looking to apply nonlinear modeling tools to the problems they face.

Future Work
1. Convolutional Neural Networks
a. Convolutional Layer Type
b. Max Pooling Layer Type
2. Flexible Deep Learning API
3. More Modern Optimizers
a. Adam
b. Adadelta + Nesterov Momentum
4. More Modern Activations
5. Dropout / L2 Regularization
6. Batch Normalization
7. Tensor Support
8. Recurrent Neural Networks (LSTM)

Spark Technology Center

Structure
1. Framing Deep Learning
2. MLlib Deep Learning API
3. Optimization
4. Performance
5. Future Work

Framing Convolutional Neural Networks
1. Structural Assumptions
2. Automated Feature Engineering
3. Learning Representations
4. Applications

Structural Assumptions: Combinatorial Flexibility
- Network depth creates an extraordinary range of possible models.
- That flexibility creates value in large datasets to reduce variance.

Structural Assumption: The Model
X = Normalized Data, W1, W2 = Weights
Forward:
1. Multiply data by first layer weights | (X*W1)
2. Put output through non-linear activation | max(0, X*W1)
3. Multiply output by second layer weights | max(0, X*W1) * W2
4. Return predicted output
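
A minimal sketch of the forward pass described on this slide, in plain Python. The helper names (`matmul`, `relu`, `forward`) and the toy values are assumptions for illustration; they are not part of MLlib or the sparkdl package.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise max(0, v), the non-linear activation from step 2."""
    return [[max(0.0, v) for v in row] for row in m]


def forward(x, w1, w2):
    hidden = matmul(x, w1)        # step 1: X * W1
    activated = relu(hidden)      # step 2: max(0, X * W1)
    return matmul(activated, w2)  # steps 3-4: max(0, X * W1) * W2, returned as the prediction


# Toy values (hypothetical): one sample with two features, two hidden units, one output.
x = [[1.0, 2.0]]
w1 = [[1.0, -1.0], [0.5, -0.5]]
w2 = [[2.0], [3.0]]

prediction = forward(x, w1, w2)
print(prediction)
```

With these values the second hidden unit is negative before the activation, so the ReLU zeroes it and only the first unit contributes to the output.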

Structural Assumptions: Hierarchical Abstraction
- Pixels - Edges - Shapes - Parts - Objects
- Learn features that are optimized for the data
- Makes transfer learning feasible

Structural Assumptions: Location Invariance
- Convolution is a restriction on the features that can be combined.
- Location Invariance leads to strong accuracy in vision, audio, and language.
colah.github.io
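
To illustrate the point about convolution, here is a small sketch in plain Python of a 1-D convolution (computed as a sliding cross-correlation, as most deep learning libraries do). The function name `conv1d`, the kernel, and the signals are hypothetical; the sketch only shows that one shared kernel gives the same response to a pattern wherever it occurs in the input.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel across the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A two-tap kernel that responds to a rising step (hypothetical example).
edge_detector = [-1.0, 1.0]

# The same step pattern placed early and late in the signal.
early = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
late = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

early_response = conv1d(early, edge_detector)
late_response = conv1d(late, edge_detector)

# The peak response has the same strength; only its position shifts with the input.
print(max(early_response), early_response.index(max(early_response)))
print(max(late_response), late_response.index(max(late_response)))
```

Because the kernel weights are shared across positions, the layer has far fewer parameters than a fully connected layer over the same input, which is the restriction the slide refers to.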

Automated Feature Engineering

Learning Representations
Hidden Layer + Nonlinearity
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Applications
1. CNNs - State of the Art
a. Object Recognition
b. Object Localization
c. Image Segmentation
d. Image Restoration
e. Music Recommendation
2. RNNs (LSTM) - State of the Art
a. Speech Recognition
b. Question Answering
c. Machine Translation
d. Text Summarization
e. Named Entity Recognition
f. Natural Language Generation
g. Word Sense Disambiguation
h. Image / Video Captioning
i. Sentiment Analysis

MLlib Flexible Deep Learning API
Flexibility. High level enough to be efficient. Low level enough to be expressive.

MLlib Flexible Deep Learning API
Modularity enables Logistic Regression, Feedforward Networks.

MLlib Convolutional Neural Network
Introducing Convolutional and Max-Pooling Layer types.
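
For readers unfamiliar with the max-pooling layer type named here, the following is a minimal plain-Python sketch of what such a layer computes. The function name `max_pool2d` and the sample input are hypothetical and do not reflect the sparkdl or MLlib API; the sketch assumes non-overlapping square windows (stride equal to the window size).

```python
def max_pool2d(image, size):
    """Keep the largest value in each non-overlapping size x size window."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# A 4x4 feature map (hypothetical values) reduced to 2x2 with 2x2 windows.
feature_map = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [1, 0, 7, 8],
]

pooled = max_pool2d(feature_map, 2)
print(pooled)
```

Pooling shrinks the feature map and keeps only the strongest response in each window, so small shifts of a feature within a window leave the output unchanged.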

Distributed Optimization
Parallel implementation of backpropagation:
1. Each worker gets weights from master node.
2. Each worker computes a gradient on its data.
3. Each worker sends gradient to master.
4. Master averages the gradients and updates the weights.
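
The four steps above can be sketched as a single-process simulation in plain Python. This is an illustrative assumption, not the package's implementation: workers are modeled as a list of data partitions, the model is a one-weight linear regression with squared error rather than a neural network, and the names `local_gradient` and `distributed_step` are hypothetical.

```python
def local_gradient(weights, partition):
    """Gradient of mean squared error for a linear model on one worker's data."""
    grad = [0.0] * len(weights)
    for features, target in partition:
        error = sum(w * f for w, f in zip(weights, features)) - target
        for i, f in enumerate(features):
            grad[i] += 2.0 * error * f / len(partition)
    return grad


def distributed_step(weights, partitions, learning_rate):
    # Steps 1-2: every worker receives the current weights and computes a local gradient.
    gradients = [local_gradient(weights, p) for p in partitions]
    # Step 3: the gradients are sent back; step 4: the master averages them and updates.
    averaged = [
        sum(g[i] for g in gradients) / len(gradients) for i in range(len(weights))
    ]
    return [w - learning_rate * g for w, g in zip(weights, averaged)]


# Two simulated workers, each holding part of a dataset generated by y = 2x.
partitions = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0), ([4.0], 8.0)],
]

weights = [0.0]
for _ in range(200):
    weights = distributed_step(weights, partitions, learning_rate=0.01)

print(weights)  # approaches [2.0] for this toy data
```

Every step requires a full round trip of weights and gradients between the master and all workers, so communication cost grows with the number of workers and the size of the model.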

Performance
● Parallel MLP on Spark with 7 nodes ~= Caffe w/GPU (single node).
● Advantages to parallelism diminish with additional nodes due to communication costs.
● Additional workers are valuable up to ~20 workers.
● See https://github.com/avulanov/ann-benchmark for more details

Access
GitHub: https://github.com/JeremyNixon/sparkdl
Spark Package: https://spark-packages.org/package/JeremyNixon/sparkdl

Future Work
1. GPU Acceleration (External)
2. Keras Integration
3. Residual Layers
4. Hardening
5. Regularization
6. Batch Normalization
7. Tensor Support

Thank you for your attention!
Questions?


Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

DataWorks Summit

Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as: ● Optimizing merchandising execution, in-stocks and sell-thru ● Enhancing operational efficiencies, enable real-time customer engagement ● Enhancing loss prevention capabilities, response time ● Creating frictionless experiences for shoppers Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry. We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey. Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables. We will cover the basics of object detection, then move into the advanced processing of images describing the possible ways that a retail store of the near future could operate. Identifying various storefront situations by having a deep learning system attached to a camera stream. Such things as; identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance. We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing. 
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems. By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.

Computer Vision: Coming to a Store Near You

DataWorks Summit

Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

DataWorks Summit

More from DataWorks Summit (20)

Data Science Crash Course

Floating on a RAFT: HBase Durability with Apache Ratis

Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi

HBase Tales From the Trenches - Short stories about most common HBase operati...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Managing the Dewey Decimal System

Practical NoSQL: Accumulo's dirlist Example

HBase Global Indexing to support large-scale data ingestion at Uber

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Security Framework for Multitenant Architecture

Presto: Optimizing Performance of SQL-on-Anything Engine

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Extending Twitter's Data Platform to Google Cloud

Event-Driven Messaging and Actions using Apache Flink and Apache NiFi

Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger

Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...

Computer Vision: Coming to a Store Near You

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Convolutional Neural Networks at scale in Spark MLlib

1. Convolutional Neural Networks at Scale in MLlib. Jeremy Nixon, Spark Technology Center

2. Jeremy Nixon
   1. Machine Learning Engineer at the Spark Technology Center
   2. Contributor to MLlib, dedicated to scalable deep learning
   3. Previously studied Applied Mathematics to Computer Science and Economics at Harvard

3. Future Work
   1. Convolutional Neural Networks
      a. Convolutional Layer Type
      b. Max Pooling Layer Type
   2. Flexible Deep Learning API
   3. More Modern Optimizers
      a. Adam
      b. Adadelta + Nesterov Momentum
   4. More Modern Activations
   5. Dropout / L2 Regularization
   6. Batch Normalization
   7. Tensor Support
   8. Recurrent Neural Networks (LSTM)

4. Structure
   1. Framing Deep Learning
   2. MLlib Deep Learning API
   3. Optimization
   4. Performance
   5. Future Work

5. Framing Convolutional Neural Networks
   1. Structural Assumptions
   2. Automated Feature Engineering
   3. Learning Representations
   4. Applications

6. Structural Assumptions: Combinatorial Flexibility
   - Network depth creates an extraordinary range of possible models.
   - That flexibility creates value in large datasets to reduce variance.

7. Structural Assumption: The Model
   X = normalized data; W1, W2 = weights
   Forward:
   1. Multiply data by first layer weights | X*W1
   2. Put output through non-linear activation | max(0, X*W1)
   3. Multiply output by second layer weights | max(0, X*W1) * W2
   4. Return predicted output
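
The four forward steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration rather than the MLlib implementation: the helper names `matmul`, `relu`, and `forward` and the toy values of `x`, `w1`, and `w2` are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    """Element-wise max(0, v), the non-linear activation on the slide."""
    return [[max(0.0, v) for v in row] for row in m]


def forward(x, w1, w2):
    """Two-layer forward pass: max(0, X*W1) * W2."""
    hidden = relu(matmul(x, w1))  # steps 1 and 2
    return matmul(hidden, w2)     # steps 3 and 4


# Toy values (hypothetical): one example with two features,
# two hidden units, one output.
x = [[1.0, -2.0]]
w1 = [[1.0, 0.5],
      [1.0, -0.5]]
w2 = [[2.0],
      [3.0]]

prediction = forward(x, w1, w2)
print(prediction)
```

With these values the first hidden unit is negative before the activation and is clipped to zero, so only the second hidden unit contributes to the prediction.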

8. Structural Assumptions: Hierarchical Abstraction
   Pixels - Edges - Shapes - Parts - Objects
   - Learn features that are optimized for the data
   - Makes transfer learning feasible

9. Structural Assumptions: Location Invariance
   - Convolution is a restriction on the features that can be combined.
   - Location invariance leads to strong accuracy in vision, audio, and language.
   colah.github.io
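
One way to see the restriction convolution imposes is that the same small set of weights is applied at every position. The sketch below is an illustrative assumption, not code from the talk: `conv1d`, the edge-detecting `kernel`, and the two toy signals are hypothetical. It shows that a pattern shifted in the input produces the same response, shifted by the same amount.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


# A hypothetical filter that fires on a rising edge (0 followed by 1).
kernel = [-1, 1]

# The same bump placed at two different locations.
signal_a = [0, 0, 1, 1, 0, 0, 0]
signal_b = [0, 0, 0, 0, 1, 1, 0]

resp_a = conv1d(signal_a, kernel)
resp_b = conv1d(signal_b, kernel)

# The strongest response has the same value in both cases;
# only its position moves with the pattern.
print(max(resp_a), resp_a.index(max(resp_a)))
print(max(resp_b), resp_b.index(max(resp_b)))
```

Because the peak value is identical wherever the pattern appears, a later step that keeps only the maximum response no longer depends on the pattern's location.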

10. Automated Feature Engineering

11. Learning Representations: Hidden Layer + Nonlinearity
   http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

12. Applications
   1. CNNs: State of the Art
      a. Object Recognition
      b. Object Localization
      c. Image Segmentation
      d. Image Restoration
      e. Music Recommendation
   2. RNNs (LSTM): State of the Art
      a. Speech Recognition
      b. Question Answering
      c. Machine Translation
      d. Text Summarization
      e. Named Entity Recognition
      f. Natural Language Generation
      g. Word Sense Disambiguation
      h. Image / Video Captioning
      i. Sentiment Analysis

13. MLlib Flexible Deep Learning API
   Flexibility. High level enough to be efficient. Low level enough to be expressive.

14. MLlib Flexible Deep Learning API
   Flexibility. High level enough to be efficient. Low level enough to be expressive.

15. MLlib Flexible Deep Learning API
   Modularity enables Logistic Regression, Feedforward Networks.

16. MLlib Convolutional Neural Network
   Introducing Convolutional and Max-Pooling Layer types.
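
For readers unfamiliar with the max-pooling layer type, the operation itself can be sketched as follows. This is a generic illustration under assumed settings (a 2x2 window with stride 2, a hypothetical helper `max_pool_2x2`, and a toy 4x4 input); it does not reproduce the layer API of the package described in the talk.

```python
def max_pool_2x2(image):
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows, cols = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


# Toy 4x4 feature map (hypothetical values).
image = [[1, 2, 0, 1],
         [3, 4, 1, 0],
         [0, 0, 5, 6],
         [1, 0, 7, 8]]

pooled = max_pool_2x2(image)
print(pooled)
```

Pooling halves each spatial dimension here, so later layers see a smaller input that still records the strongest response in each region.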

17. Optimization

18. Optimization

19. Distributed Optimization
   Parallel implementation of backpropagation:
   1. Each worker gets weights from master node.
   2. Each worker computes a gradient on its data.
   3. Each worker sends gradient to master.
   4. Master averages the gradients and updates the weights.
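
The four steps above can be simulated in a single process to make the data flow concrete. The sketch below is an assumption-laden toy, not the Spark implementation: workers are plain Python lists, the model is a one-parameter regression y ≈ w*x standing in for a network, and the names `worker_gradient` and `train`, the learning rate, and the step count are all hypothetical.

```python
def worker_gradient(w, partition):
    """Gradient of the mean squared error for y ≈ w*x on one worker's data."""
    return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)


def train(partitions, w=0.0, lr=0.05, steps=200):
    """Synchronous gradient descent with gradient averaging on the master."""
    for _ in range(steps):
        # Steps 1-3: every worker receives the current w, computes a
        # gradient on its own partition, and sends it back.
        grads = [worker_gradient(w, p) for p in partitions]
        # Step 4: the master averages the gradients and updates the weight.
        w -= lr * sum(grads) / len(grads)
    return w


# Two hypothetical workers, each holding part of a dataset where y = 2*x.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

w = train(partitions)
print(round(w, 6))
```

Each iteration needs one round trip between the master and every worker, which is the communication cost that limits the benefit of adding more nodes.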

20. Performance
   ● Parallel MLP on Spark with 7 nodes ~= Caffe w/GPU (single node).
   ● Advantages to parallelism diminish with additional nodes due to communication costs.
   ● Additional workers are valuable up to ~20 workers.
   ● See https://github.com/avulanov/ann-benchmark for more details

21. Access
   Github: https://github.com/JeremyNixon/sparkdl
   Spark Package: https://spark-packages.org/package/JeremyNixon/sparkdl

22. Future Work
   1. GPU Acceleration (External)
   2. Keras Integration
   3. Residual Layers
   4. Hardening
   5. Regularization
   6. Batch Normalization
   7. Tensor Support

23. Thank you for your attention! Questions?
