Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

•

0 likes•356 views

This presentation introduces Tune and Fugue, frameworks for intuitive and scalable hyperparameter optimization (HPO). Tune supports both non-iterative and iterative HPO problems. For non-iterative problems, Tune supports grid search, random search, and Bayesian optimization. For iterative problems, Tune generalizes algorithms like Hyperband and Asynchronous Successive Halving. Tune allows tuning models both locally and in a distributed manner without code changes. The presentation demonstrates Tune's capabilities through examples tuning Scikit-Learn and Keras models. The goal of Tune and Fugue is to make HPO development easy, testable, and scalable.

Data & Analytics

Intuitive & Scalable HPO
With Spark+Fugue
Han Wang

Agenda
Introduction
Non-Iterative HPO
Demo
Iterative HPO Demo

pip install tune
https://github.com/fugue-project/tune
pip install fugue
https://github.com/fugue-project/fugue

Questions
● Is parameter tuning a machine learning problem?
● Are there common ways to tune both classical models and deep
learning models?
● Why is it so hard to do distributed parameter tuning?

Tuning Problems In General
General Parameter Tuning
Hyperparameter Tuning (for Machine Learning)
Some Classical
Models
Deep Learning Models
Some Classical
Models
Non-Iterative Problems Iterative Problems

Distributed Parameter Tuning
● Not everything can be parallelized
● The tuning logic is always complex and tedious
● Popular tuning frameworks are not distributed environment
friendly
● Spark is not suitable for iterative tuning problems

Distributed Parameter Tuning
Tune SQL Validation

Our Goals
● For non-iterative problems:
○ Unify grid and random search, make other plugable
● For iterative problems:
○ Generalize SOTA algos such as Hyperband and ASHA
● For both
○ Tune both locally and distributedly without code change
○ Make tuning development iterable and testable
○ Minimize moving parts
○ Minimize interfaces

Grid Search
a: Grid(0,1)
b: Grid(“a”, “b”)
c: 3.14
a:0, b:”a”, c:3.14
a:0, b:”b”, c:3.14
a:1, b:”a”, c:3.14
a:1, b:”b”, c:3.14
Search Space Candidates
Pros: determinism, even coverage, interpretable
Cons: complexity can increase exponentially

Random Search
a: Rand(0,1)
b: Choice(“a”,“b”)
c: 3.14
a:0.12, b:”a”, c:3.14
a:0.66, b:”a”, c:3.14
a:0.32, b:”b”, c:3.14
a:0.94, b:”a”, c:3.14
Search Space Candidates
Pros: complexity and distribution are controlled, good for continuous variables
Cons: by luck, not deterministic, large number of samples are normally needed

Bayesian Optimization
objective: a^2
a: Rand(-1,1)
-0.66 -> 0.76 -> -0.18
-> 0.75 -> 0.90
-> 0.07 -> 0.00
-> 0.41 -> 0.12 -> 0.66
Search Space Candidates
Pros: less compute to guess the optimal parameters
Cons: sequential operations may require more time

Hybrid Search Space
Distributed Hybrid Search
Model 1 Model 2
Grid Random Bayesian

Live Demo
Space Concept & Scikit-Learn Tuning

Challenges
● Realtime asynchronous communication
● The overhead for checkpointing iterations can be significant
● Single iterative problem can’t be parallelized
● A lot of boilerplate code

Successive Halving (SHA)
Rung 1
Rung 2
Rung 3
Rung 4

Fully Customized Successive Halving
8, [(4,6), (2,2), (6,1)]

Summary
Space Monitoring
Dataset
Distributed
Execution
Abstraction
Non-Iterative
Random, Grid, BO
Iterative
SHA, HB, ASHA, PBT ...
Specialization
Scikit-Learn
Specialization
Keras, TF, PyTorch

Let’s Collaborate!
● Create specialized higher level APIs for major tuning cases so
users can do tuning with minimal code and without learning
distributed systems
● Enable advanced users to create fully customized, platform
agnostic and scale agnostic tuning pipelines with tune’s lower
level APIs

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Hyperparameter tuning and optimization is a powerful tool in the area of AutoML, for both traditional statistical learning models as well as for deep learning. There are many existing tools to help drive this process, including both blackbox and whitebox tuning. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, Bayesian optimization, and parzen estimators) and then discuss the open source tools which implement each of these techniques. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze how our search is performing and to productionize the best models. Speaker: Joseph Bradley

MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...

Databricks

At Avast we complete over 17 million phishing detections a day, providing crucial online protection for this type of attacks. In this talk Joao Da Silva and Yury Kasimov will present the MATS stack for productionisation of Machine Learning and their journey into integrating model tracking, storage, cross-system orchestration and model deployments for a complete and modern machine learning pipeline.

Wix.com Back-end Engineering Guild Manifesto

Aviran Mordo

BigQuery ML - Machine learning at scale using SQL

Márton Kodok

With BigQuery ML, you can build machine learning models without leaving the data warehouse environment and training it on massive datasets. We are going to demonstrate how to build, train, eval and predict, your own scalable machine learning models using standard SQL language in Google BigQuery. We will see how can we use CREATE MODEL sql syntax to build different models such as: Linear regression Multiclass logistic regression for classification K-means clustering Import TensorFlow models for prediction in BigQuery We will see how we can apply these models on tabular data in retail and marketing use cases. Models are trained and accessed in BigQuery using SQL — a language data analysts know. This enables business decision making through predictive analytics across the organization without leaving the query editor.

ML Infrastracture @ Dropbox

Tsahi Glik

Productionizing Machine Learning with a Microservices Architecture

Databricks

SymfonyCon 2019: Head first into Symfony Cache, Redis & Redis Cluster

André Rømcke

Symfony Cache has been around for a few releases. But what is happening behind the scenes? Talk focuses on how is it working, down to detail level on Redis for things like datatypes, Redis Cluster sharding logic, how it differs from Memcached and more. Hopefully you’ll learn how you can make sure to get optimal performance, what opportunities exists, and which pitfalls to try to avoid. https://amsterdam2019.symfony.com/speakers NOTE: This talk is recorded and available on SymfonyCasts, in the future it will also be uploaded to youtube.

[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...

DataScienceConferenc1

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, I present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

Feature Engineering - Getting most out of data for predictive models

Gabriel Moreira

How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model? Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects. In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling). Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.

Building an Observability platform with ClickHouse

Altinity Ltd

Introduction to MySQL

Giuseppe Maxia

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Cloudera, Inc.

The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

MLOps Using MLflow

Databricks

MLflow is an MLOps tool that enables data scientist to quickly productionize their Machine Learning projects. To achieve this, MLFlow has four major components which are Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and require minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers such as tracking experiments, reproducibility, deployment tool and model versioning. Ready to get your hands dirty by doing quick ML project using mlflow and release to production to understand the ML-Ops lifecycle.

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...

Databricks

Zipline is Airbnb’s data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks. Zipline reduces this task from months to days. It allows users to define features in an easy-to-use configuration language, then provides access to the following features: resource efficient and point-in-time correct training set backfills and scheduled updates, feature visualizations and automatic data quality monitoring, feature availability in online scoring environment: batch and streaming with batch correction (lambda architecture), collaboration and sharing of features, and data ownership and management. Spark powers many of Zipline’s features, especially offline tasks for efficient training set backfills and feature computation. This talk covers Ziplines architecture and the main problems that Zipline solves. Despite being widespread, there is no open source software to address these problems. As a result, we intend to open source our work.

Simplifying Model Management with MLflow

Databricks

<p>Last summer, Databricks launched MLflow, an open source platform to manage the machine learning lifecycle, including experiment tracking, reproducible runs and model packaging. MLflow has grown quickly since then, with over 120 contributors from dozens of companies, including major contributions from R Studio and Microsoft. It has also gained new capabilities such as automatic logging from TensorFlow and Keras, Kubernetes integrations, and a high-level Java API. In this talk, we’ll cover some of the new features that have come to MLflow, and then focus on a major upcoming feature: model management with the MLflow Model Registry. Many organizations face challenges tracking which models are available in the organization and which ones are in production. The MLflow Model Registry provides a centralized database to keep track of these models, share and describe new model versions, and deploy the latest version of a model through APIs. We’ll demonstrate how these features can simplify common ML lifecycle tasks.</p>

RecSysOps: Best Practices for Operating a Large-Scale Recommender System

Ehsan38

Ensuring the health of a modern large-scale recommendation system is a very challenging problem. To address this, we need to put in place proper logging, sophisticated exploration policies, develop ML-interpretability tools or even train new ML models to predict/detect issues of the main production model. In this talk, we shine a light on this less-discussed but important area and share some of the best practices, called RecSysOps, that we’ve learned while operating our increasingly complex recommender systems at Netflix. RecSysOps is a set of best practices for identifying issues and gaps as well as diagnosing and resolving them in a large-scale machine-learned recommender system. RecSysOps helped us to 1) reduce production issues and 2) increase recommendation quality by identifying areas of improvement and 3) make it possible to bring new innovations faster to our members by enabling us to spend more of our time on new innovations and less on debugging and firefighting issues. https://dl.acm.org/doi/10.1145/3460231.3474620

End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage

Animesh Singh

With the breadth of sheer functionalities which need to be addressed in the Machine Learning world around building, training, serving and managing models, getting it done in a consistent, composable, portable, and scalable manner is hard. The Kubernetes framework is well suited to address these issues, which is why it's a great foundation for deploying ML workloads. Kubeflow is designed to take advantage of these benefits. In this talk, we are going to address how to make it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and support the full lifecycle Machine Learning using open source technologies like Kubeflow, Tensorflow, PyTorch,Tekton, Knative, Istio and others. We are going to discuss how to enable distributed training of models, model serving, canary rollouts, drift detection, model explainability, metadata management, pipelines and others. Additionally we will discuss Watson productization in progress based on Kubeflow Pipelines and Tekton, and point to Kubeflow Dojo materials and follow-on workshops.

Wix's ML Platform

Ran Romano

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

DataWorks Summit

Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.

A Collaborative Data Science Development Workflow

Databricks

Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.

MLConf 2016 SigOpt Talk by Scott Clark

SigOpt

Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016

MLconf

Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications. We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine. We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.

What's hot

Maximizing performance via tuning and optimization

MariaDB plc

MyDUMPER : Faster logical backups and restores

Mydbops

Pythonsevilla2019 - Introduction to MLFlow

Fernando Ortega Gallego

What is MLOps

Henrik Skogström

PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf

Jim Dowling

Automatic Forecasting using Prophet, Databricks, Delta Lake and MLflow

Databricks

Introduction to MLflow

Databricks

Feature Engineering - Getting most out of data for predictive models

Gabriel Moreira

Building an Observability platform with ClickHouse

Altinity Ltd

Introduction to MySQL

Giuseppe Maxia

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Cloudera, Inc.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

MLOps Using MLflow

Databricks

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...

Databricks

Simplifying Model Management with MLflow

Databricks

RecSysOps: Best Practices for Operating a Large-Scale Recommender System

Ehsan38

End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage

Animesh Singh

Wix's ML Platform

Ran Romano

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

DataWorks Summit

A Collaborative Data Science Development Workflow

Databricks

What's hot (20)

Maximizing performance via tuning and optimization

MyDUMPER : Faster logical backups and restores

Pythonsevilla2019 - Introduction to MLFlow

What is MLOps

PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf

Automatic Forecasting using Prophet, Databricks, Delta Lake and MLflow

Introduction to MLflow

Feature Engineering - Getting most out of data for predictive models

Building an Observability platform with ClickHouse

Introduction to MySQL

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

MLOps Using MLflow

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...

Simplifying Model Management with MLflow

RecSysOps: Best Practices for Operating a Large-Scale Recommender System

End to end Machine Learning using Kubeflow - Build, Train, Deploy and Manage

Wix's ML Platform

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

A Collaborative Data Science Development Workflow

Similar to Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

MLConf 2016 SigOpt Talk by Scott Clark

SigOpt

Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016

MLconf

Cutting edge hyperparameter tuning made simple with ray tune

XiaoweiJiang7

Triantafyllia Voulibasi

ISSEL

Argumentation in Artificial Intelligence: From Theory to Practice (Practice)

Mauro Vallati

Using Bayesian Optimization to Tune Machine Learning Models

Scott Clark

Using Bayesian Optimization to Tune Machine Learning Models

SigOpt

01-Introduction_to_Optimization-v2021.2-Sept23-2021.pptx

Tran273185

Topology Optimization Topology optimization is concerned with material distribution and how the members within a structure are connected. It treats the “equivalent density” of each element as a design variable. The solver calculates an equivalent density for each element, where 1 is equivalent to 100% material, while 0 is equivalent to no material in the element. The solver then seeks to assign elements that have a low stress value a lower equivalent density before analyzing the effect on the remaining structure. In this way extraneous elements tend towards a density of 0, with the optimum design tending towards 1. As a designer, you will need to exercise your judgment. For example, you may decide that you will omit material from all (finite) elements whose density is less than 0.3 (or 30%). Using an iso-plot of element densities helps to visualize the “remaining” structure as elements with a density below this threshold can be masked leaving behind the optimum design. Then you will need to take this geometry back to your CAD modeler, smooth it out (that is, use geometrically regular edges or surfaces, etc.) and re-evaluate the design for stresses, displacements, frequencies etc..

It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair

Claire Le Goues

In this talk we present lessons learned, good ideas, and thoughts on the future, with an eye toward informing junior researchers about the realities and opportunities of a long-running project. We highlight some notions from the original paper that stood the test of time, some that were not as prescient, and some that became more relevant as industrial practice advanced. We place the work in context, highlighting perceptions from software engineering and evolutionary computing, then and now, of how program repair could possibly work. We discuss the importance of measurable benchmarks and reproducible research in bringing scientists together and advancing the area. We give our thoughts on the role of quality requirements and properties in program repair. From testing to metrics to scalability to human factors to technology transfer, software repair touches many aspects of software engineering, and we hope a behind-the-scenes exploration of some of our struggles and successes may benefit researchers pursuing new projects.

Validation and-design-in-a-small-team-environmentObsidian Software

Validation and Design in a Small Team EnvironmentDVClub

Hadoop & Spark Performance tuning using Dr. Elephant

Akshay Rai

Dr. Elephant is a tool for the users of Hadoop to help them understand, analyze and tune their Hadoop/Spark applications easily, thus improving their productivity and the cluster’s efficiency. It analyzes the Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights on how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently.

Ssbse10.ppt

Yann-Gaël Guéhéneuc

ICLR 2020 Recap

Sri Ambati

This presentation was made on June 9th, 2020. Video recording of the session can be viewed here: https://youtu.be/OCB9sTUnUug In this meetup with Sanyam Bhutani, Machine Learning Engineer at H2O.ai, he gives a recap of the eight annual ICLR (International Conference on Learning Representations) 2020 - a niche deep learning conference whose focus is to study how to learn representations of data, which is basically what deep learning does. Sanyam goes through a few of his favorite selected papers from this year’s ICLR, note this session may not be able to capture the richness of all papers or allow a detailed discussion. You will be able to find Sanyam in our community slack (https://www.h2o.ai/slack-community/), please feel free to start a discussion with him, if you send a  emoji greeting, you’ll find the answers. Following are the papers we will look into: U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty Your classifier is secretly an energy based model and you should treat it like one ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators ALBERT: A Lite BERT for Self-supervised Learning of Language Representations Reformer: The Efficient Transformer Generative Models for Effective ML on Private, Decentralized Datasets Once for All: Train One Network and Specialize it for Efficient Deployment Thieves on Sesame Street! Model Extraction of BERT-based APIs Plug and Play Language Models: A Simple Approach to Controlled Text Generation BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning Real or Not Real, that is the Question

Toronto meetup 20190917

Bill Liu

Fahroo - Computational Mathematics - Spring Review 2012

The Air Force Office of Scientific Research

ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path

John Holden

From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet

Eric Haibin Lin

Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Badri Narayan Bhaskar

Scaling Machine Learning To Billions Of Parameters

Jen Aman

Similar to Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue (20)

MLConf 2016 SigOpt Talk by Scott Clark

Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016

Cutting edge hyperparameter tuning made simple with ray tune

Triantafyllia Voulibasi

Argumentation in Artificial Intelligence: From Theory to Practice (Practice)

Using Bayesian Optimization to Tune Machine Learning Models

01-Introduction_to_Optimization-v2021.2-Sept23-2021.pptx

It Does What You Say, Not What You Mean: Lessons From A Decade of Program Repair

Validation and-design-in-a-small-team-environment

Validation and Design in a Small Team Environment

Hadoop & Spark Performance tuning using Dr. Elephant

Ssbse10.ppt

ICLR 2020 Recap

Toronto meetup 20190917

Fahroo - Computational Mathematics - Spring Review 2012

ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path

From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet

Scaling Machine Learning to Billions of Parameters - Spark Summit 2016

Scaling Machine Learning To Billions Of Parameters

More from Databricks

DW Migration Webinar-March 2022.pptx

Databricks

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

Democratizing Data Quality Through a Centralized Platform

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Why APM Is Not the Same As ML Monitoring

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

Sawtooth Windows for Feature Aggregations

Databricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Re-imagine Data Monitoring with whylogs and Spark

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Massive Data Processing in Adobe Using Delta Lake

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

Machine Learning CI/CD for Email Attack Detection

Databricks

Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models. In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.

More from Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Machine Learning CI/CD for Email Attack Detection

Recently uploaded

Global Situational Awareness of A.I. and where its headed

vikram sood

You can see the future first in San Francisco. Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum. The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war. Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change. Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride. Let me tell you what we see.

The Building Blocks of QuestDB, a Time Series Database

javier ramirez

Talk Delivered at Valencia Codes Meetup 2024-06. Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds. It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理

dwreak4tg

原版定制【微信:41543339】【(BCU毕业证书)伯明翰城市大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

Machine learning and optimization techniques for electrical drives.pptx

balafet

Malana- Gimlet Market Analysis (Portfolio 2)

TravisMalana

一比一原版(CBU毕业证)卡普顿大学毕业证如何办理

ahzuo

CBU毕业证offer【微信95270640】《卡普顿大学毕业证书》《QQ微信95270640》学位证书电子版：在线制作卡普顿大学毕业证成绩单GPA修改（制作CBU毕业证成绩单CBU文凭证书样本）、卡普顿大学毕业证书与成绩单样本图片、《CBU学历证书学位证书》、卡普顿大学毕业证案例毕业证书制作軟體、在线制作加拿大硕士学历证书真实可查. 如果您是以下情况，我们都能竭诚为您解决实际问题：【公司采用定金+余款的付款流程，以最大化保障您的利益，让您放心无忧】 1、在校期间，因各种原因未能顺利毕业，拿不到官方毕业证+微信95270640 2、面对父母的压力，希望尽快拿到卡普顿大学卡普顿大学毕业证成绩单； 3、不清楚流程以及材料该如何准备卡普顿大学卡普顿大学毕业证成绩单； 4、回国时间很长，忘记办理； 5、回国马上就要找工作，办给用人单位看； 6、企事业单位必须要求办理的；面向美国乔治城大学毕业留学生提供以下服务: 【★卡普顿大学卡普顿大学毕业证成绩单毕业证、成绩单等全套材料，从防伪到印刷，从水印到钢印烫金，与学校100%相同】【★真实使馆认证（留学人员回国证明），使馆存档可通过大使馆查询确认】【★真实教育部认证，教育部存档，教育部留服网站可查】【★真实留信认证，留信网入库存档，可查卡普顿大学卡普顿大学毕业证成绩单】我们从事工作十余年的有着丰富经验的业务顾问，熟悉海外各国大学的学制及教育体系，并且以挂科生解决毕业材料不全问题为基础，为客户量身定制1对1方案，未能毕业的回国留学生成功搭建回国顺利发展所需的桥梁。我们一直努力以高品质的教育为起点，以诚信、专业、高效、创新作为一切的行动宗旨，始终把“诚信为主、质量为本、客户第一”作为我们全部工作的出发点和归宿点。同时为海内外留学生提供大学毕业证购买、补办成绩单及各类分数修改等服务；归国认证方面，提供《留信网入库》申请、《国外学历学位认证》申请以及真实学籍办理等服务，帮助众多莘莘学子实现了一个又一个梦想。专业服务，请勿犹豫联系我如果您真实毕业回国，对于学历认证无从下手，请联系我，我们免费帮您递交诚招代理：本公司诚聘当地代理人员，如果你有业余时间，或者你有同学朋友需要，有兴趣就请联系我你赢我赢，共创双赢你做代理，可以帮助卡普顿大学同学朋友你做代理，可以拯救卡普顿大学失足青年你做代理，可以挽救卡普顿大学一个个人才你做代理，你将是别人人生卡普顿大学的转折点你做代理，可以改变自己，改变他人，给他人和自己一个机会道银边山娃摸索着扯了扯灯绳小屋顿时一片刺眼的亮瞅瞅床头的诺基亚山娃苦笑着摇了摇头连他自己都感到奇怪居然又睡到上午点半掐指算算随父亲进城已一个多星期了山娃几乎天天起得这么迟在乡下老家暑假五点多山娃就醒来在爷爷奶奶嘁嘁喳喳的忙碌声中一骨碌爬起把牛驱到后龙山再从莲塘里采回一蛇皮袋湿漉漉的莲蓬也才点多点半早就吃过早餐玩耍去了山娃的家在闽西山区依山傍水山清水秀门前潺潺流淌的蜿蜒小溪一直都是山娃和小伙伴们盛试

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...

Timothy Spann

一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理

mzpolocfi

原版定制【微信:41543339】【(Dalhousie毕业证书)达尔豪斯大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Subhajit Sahu

For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).

ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake

Walaa Eldin Moustafa

Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines. #SQL #Views #Privacy #Compliance #DataLake

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

sameer shah

Adjusting primitives for graph : SHORT REPORT / NOTES

Subhajit Sahu

Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is Multiply with different modes (map) 1. Performance of sequential execution based vs OpenMP based vector multiply. 2. Comparing various launch configs for CUDA based vector multiply. Sum with different storage types (reduce) 1. Performance of vector element sum using float vs bfloat16 as the storage type. Sum with different modes (reduce) 1. Performance of sequential execution based vs OpenMP based vector element sum. 2. Performance of memcpy vs in-place based CUDA based vector element sum. 3. Comparing various launch configs for CUDA based vector element sum (memcpy). 4. Comparing various launch configs for CUDA based vector element sum (in-place). Sum with in-place strategies of CUDA mode (reduce) 1. Comparing various launch configs for CUDA based vector element sum (in-place).

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理

slg6lamcq

原版定制【微信:41543339】【(Adelaide毕业证书)阿德莱德大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样

u86oixdj

学校原件一模一样【微信：741003700 】《(Deakin毕业证书)迪肯大学毕业证学位证》【微信：741003700 】学位证，留信认证（真实可查，永久存档）原件一模一样纸张工艺/offer、雅思、外壳等材料/诚信可靠,可直接看成品样本，帮您解决无法毕业带来的各种难题！外壳，原版制作，诚信可靠，可直接看成品样本。行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备。十五年致力于帮助留学生解决难题，包您满意。本公司拥有海外各大学样板无数，能完美还原。 1:1完美还原海外各大学毕业材料上的工艺：水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠。文字图案浮雕、激光镭射、紫外荧光、温感、复印防伪等防伪工艺。材料咨询办理、认证咨询办理请加学历顾问Q/微741003700 【主营项目】一.毕业证【q微741003700】成绩单、使馆认证、教育部认证、雅思托福成绩单、学生卡等！二.真实使馆公证(即留学回国人员证明,不成功不收费) 三.真实教育部学历学位认证（教育部存档！教育部留服网站永久可查）四.办理各国各大学文凭(一对一专业服务,可全程监控跟踪进度) 如果您处于以下几种情况： ◇在校期间，因各种原因未能顺利毕业……拿不到官方毕业证【q/微741003700】 ◇面对父母的压力，希望尽快拿到； ◇不清楚认证流程以及材料该如何准备； ◇回国时间很长，忘记办理； ◇回国马上就要找工作，办给用人单位看； ◇企事业单位必须要求办理的 ◇需要报考公务员、购买免税车、落转户口 ◇申请留学生创业基金留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

GetInData

Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots. In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms. Why do we need yet another (open-source ) Copilot? How can we build one? Architecture and evaluation

My burning issue is homelessness K.C.M.O.

rwarrenll

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理

slg6lamcq

原版定制【微信:41543339】【(UniSA毕业证书)南澳大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理

g4dpvqap0

毕业原版【微信:41543339】【(爱大毕业证书)爱丁堡大学毕业证】【微信:41543339】成绩单、外壳、offer、留信学历认证（永久存档真实可查）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【我们承诺采用的是学校原版纸张（纸质、底色、纹路），我们拥有全套进口原装设备，特殊工艺都是采用不同机器制作，仿真度基本可以达到100%，所有工艺效果都可提前给客户展示，不满意可以根据客户要求进行调整，直到满意为止！】【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信41543339】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信41543339】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：客户在留信官方认证查询网站查询到认证通过结果后付款，不成功不收费！

The affect of service quality and online reviews on customer loyalty in the E...

jerlynmaetalle

Analysis insight about a Flyball dog competition team's performance

roli9797

Recently uploaded (20)

Global Situational Awareness of A.I. and where its headed

The Building Blocks of QuestDB, a Time Series Database

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理

Machine learning and optimization techniques for electrical drives.pptx

Malana- Gimlet Market Analysis (Portfolio 2)

一比一原版(CBU毕业证)卡普顿大学毕业证如何办理

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...

一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake

STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...

Adjusting primitives for graph : SHORT REPORT / NOTES

一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

My burning issue is homelessness K.C.M.O.

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理

The affect of service quality and online reviews on customer loyalty in the E...

Analysis insight about a Flyball dog competition team's performance

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

1. Intuitive & Scalable HPO With Spark+Fugue Han Wang

2. Agenda Introduction Non-Iterative HPO Demo Iterative HPO Demo

3. pip install tune https://github.com/fugue-project/tune pip install fugue https://github.com/fugue-project/fugue

4. Introduction

5. Questions ● Is parameter tuning a machine learning problem? ● Are there common ways to tune both classical models and deep learning models? ● Why is it so hard to do distributed parameter tuning?

6. Tuning Problems In General General Parameter Tuning Hyperparameter Tuning (for Machine Learning) Some Classical Models Deep Learning Models Some Classical Models Non-Iterative Problems Iterative Problems

7. Distributed Parameter Tuning ● Not everything can be parallelized ● The tuning logic is always complex and tedious ● Popular tuning frameworks are not distributed environment friendly ● Spark is not suitable for iterative tuning problems

8. Distributed Parameter Tuning Tune SQL Validation

9. Our Goals ● For non-iterative problems: ○ Unify grid and random search, make other plugable ● For iterative problems: ○ Generalize SOTA algos such as Hyperband and ASHA ● For both ○ Tune both locally and distributedly without code change ○ Make tuning development iterable and testable ○ Minimize moving parts ○ Minimize interfaces

10. Non-Iterative Problems

11. Grid Search a: Grid(0,1) b: Grid(“a”, “b”) c: 3.14 a:0, b:”a”, c:3.14 a:0, b:”b”, c:3.14 a:1, b:”a”, c:3.14 a:1, b:”b”, c:3.14 Search Space Candidates Pros: determinism, even coverage, interpretable Cons: complexity can increase exponentially

12. Random Search a: Rand(0,1) b: Choice(“a”,“b”) c: 3.14 a:0.12, b:”a”, c:3.14 a:0.66, b:”a”, c:3.14 a:0.32, b:”b”, c:3.14 a:0.94, b:”a”, c:3.14 Search Space Candidates Pros: complexity and distribution are controlled, good for continuous variables Cons: by luck, not deterministic, large number of samples are normally needed

13. Bayesian Optimization objective: a^2 a: Rand(-1,1) -0.66 -> 0.76 -> -0.18 -> 0.75 -> 0.90 -> 0.07 -> 0.00 -> 0.41 -> 0.12 -> 0.66 Search Space Candidates Pros: less compute to guess the optimal parameters Cons: sequential operations may require more time

14. Hybrid Search Space Distributed Hybrid Search Model 1 Model 2 Grid Random Bayesian

15. Live Demo Space Concept & Scikit-Learn Tuning

16. Iterative Problems

17. Challenges ● Realtime asynchronous communication ● The overhead for checkpointing iterations can be significant ● Single iterative problem can’t be parallelized ● A lot of boilerplate code

18. Successive Halving (SHA) Rung 1 Rung 2 Rung 3 Rung 4

19. Fully Customized Successive Halving 8, [(4,6), (2,2), (6,1)]

20. Hyperband

21. Asynchronous Successive Halving (ASHA)

22. Live Demo Keras Model Tuning

23. Summary Space Monitoring Dataset Distributed Execution Abstraction Non-Iterative Random, Grid, BO Iterative SHA, HB, ASHA, PBT ... Specialization Scikit-Learn Specialization Keras, TF, PyTorch

24. Let’s Collaborate! ● Create specialized higher level APIs for major tuning cases so users can do tuning with minimal code and without learning distributed systems ● Enable advanced users to create fully customized, platform agnostic and scale agnostic tuning pipelines with tune’s lower level APIs

25. pip install tune https://github.com/fugue-project/tune pip install fugue https://github.com/fugue-project/fugue

26. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue

Similar to Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue