Slide deck used by Praveen Devarao for the Apache Spark Machine Learning session organized by the Bangalore Spark Enthusiasts meetup group at the IBM campus on 10th September 2016.
The demo notebook used in the session can be found at https://gist.github.com/praveend/fe9a0c5eacd6b43ee210e88a374eb230
2. Agenda
• What is Machine Learning?
• The machine learning module in Spark
• SparkML pipelines
• Extraction, Selection and Tuning
• Demo
3. What is Machine Learning?
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
• Field of study that gives computers the ability to learn without being explicitly programmed
4. How is it achieved?
• Build mathematical models for given tasks
• Represent the given dataset mathematically
• Apply statistical methods on this mathematical representation
• Tune and derive a model that can perform the needed task
5. Categories of ML
• Supervised learning
  • The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data
  • The goal is to learn a general rule that maps inputs to outputs
• Unsupervised learning
  • No labels are given to the learning algorithm, leaving it on its own to find structure (patterns and relationships) in its input
  • Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning)
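The difference between the two categories can be sketched in a few lines of plain Python. This is a toy illustration only, not Spark code: the function names, the 1-D data and the choice of a nearest-neighbour rule and a two-cluster k-means are all assumptions made for the example.

```python
# Toy illustration of the two categories of ML; all names and data are hypothetical.

def nearest_neighbour_predict(training_examples, x):
    """Supervised: return the label of the closest labelled training example."""
    _, closest_label = min(training_examples, key=lambda pair: abs(pair[0] - x))
    return closest_label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two clusters (k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centres
    near_low, near_high = list(values), []
    for _ in range(iterations):
        # Assign every value to the nearer centre, then recompute the centres.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        if near_high:
            high = sum(near_high) / len(near_high)
    return sorted(near_low), sorted(near_high)


# Supervised: labels are given, and the rule maps a new input to an output.
labelled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
prediction = nearest_neighbour_predict(labelled, 7.2)

# Unsupervised: no labels, the algorithm finds the grouping on its own.
unlabelled = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
clusters = two_means(unlabelled)

print(prediction)
print(clusters)
```

The supervised function needs the labels to answer at all, while the clustering function only ever sees raw values.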
7. SparkML – The Machine learning module of Spark
• APIs based on DataFrames
  • Distributed collection of data organized as columns
• Contains commonly used ML algorithms
  • Classification
  • Regression
  • Clustering
• Featurization - feature extraction, transformation, dimensionality reduction, and selection
• Pipelines - tools for constructing, evaluating, and tuning
• Persistence of models and pipelines
9. SparkML Pipelines
• Transformer: Algorithm to transform one DataFrame to another
• Estimator: Algorithm applied on a DataFrame to produce a Transformer
• Parameters: Factors affecting the Estimators
• Pipeline: Chain of multiple transformers and estimators that forms the ML flow
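How the four concepts fit together can be sketched without Spark. The classes below are a hypothetical, plain-Python imitation of the idea (a list of dicts stands in for a DataFrame); they are not Spark's API, and every class and field name is an assumption made for illustration.

```python
# Plain-Python imitation of the Transformer / Estimator / Pipeline idea.
# A "dataset" here is a list of row dicts; all names are hypothetical.

class Lowercase:
    """Transformer: maps one dataset to another."""

    def transform(self, rows):
        return [{**row, "text": row["text"].lower()} for row in rows]


class VocabularyIndexer:
    """Estimator: learns from a dataset and produces a transformer."""

    def __init__(self, min_count=1):
        self.min_count = min_count  # parameter affecting the estimator

    def fit(self, rows):
        counts = {}
        for row in rows:
            for word in row["text"].split():
                counts[word] = counts.get(word, 0) + 1
        vocabulary = sorted(w for w, c in counts.items() if c >= self.min_count)
        return VocabularyIndexerModel({w: i for i, w in enumerate(vocabulary)})


class VocabularyIndexerModel:
    """Transformer returned by VocabularyIndexer.fit."""

    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        return [
            {**row, "ids": [self.index[w] for w in row["text"].split() if w in self.index]}
            for row in rows
        ]


class Pipeline:
    """Chain of transformers and estimators; fitting yields a chain of transformers."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [{"text": "Spark ML"}, {"text": "spark pipelines"}]
model = Pipeline([Lowercase(), VocabularyIndexer(min_count=1)]).fit(training)
result = model.transform([{"text": "ML Pipelines with Spark"}])
print(result)
```

Fitting the pipeline replaces each estimator with the transformer it produced, so the fitted model can be applied to new data in a single call.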
10. Extractors
• Algorithms to extract features from raw data
  • TermFrequency-InverseDocumentFrequency (TF-IDF)
  • Word2Vec:
    • 2 layer neural network that converts words to vectors
  • CountVectorizer:
    • Converts text documents to vectors of token counts
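As a rough illustration of the first and last extractors, the sketch below computes token counts and TF-IDF weights by hand in plain Python. The sample documents are made up, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen for the example; other variants of the formula exist.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
documents = [
    ["spark", "ml", "spark"],
    ["ml", "pipelines"],
    ["spark", "streaming"],
]


def term_frequency(tokens):
    """Token counts for one document (what a count vectorizer records)."""
    return Counter(tokens)


def inverse_document_frequency(docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1)), an assumed variant."""
    n = len(docs)
    document_frequency = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((n + 1) / (df + 1))
        for term, df in document_frequency.items()
    }


idf = inverse_document_frequency(documents)
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
    for doc in documents
]

for weights in tfidf:
    print(weights)
```

Terms that appear in fewer documents ("pipelines", "streaming") get a larger IDF than terms that appear in most of them ("spark", "ml").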
11. Transformers and Selectors
• Transformers:
  • Algorithms for scaling, modifying or converting features
  • Tokenizer
  • StringIndexer
  • VectorAssembler
  • PCA
• Selectors:
  • Libraries for selecting a subset of a larger set of features
  • VectorSlicer
  • RFormula
  • ChiSqSelector
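What three of these stages do to the data can be imitated in plain Python. The sketch below is hypothetical and does not use Spark: the helper names and sample rows are made up, and ordering labels by descending frequency is an assumption intended to mirror how a string indexer commonly assigns indices.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, columns):
    """Combine several numeric columns into one feature vector."""
    return [row[c] for c in columns]


def slice_vector(vector, indices):
    """Select a subset of the features by position."""
    return [vector[i] for i in indices]


# Hypothetical rows with one categorical and two numeric columns.
rows = [
    {"colour": "red", "width": 2.0, "height": 1.0},
    {"colour": "blue", "width": 3.0, "height": 4.0},
    {"colour": "red", "width": 5.0, "height": 6.0},
]

index = fit_string_index([r["colour"] for r in rows])
indexed = [{**r, "colour_index": index[r["colour"]]} for r in rows]
features = [assemble(r, ["colour_index", "width", "height"]) for r in indexed]
selected = [slice_vector(f, [1, 2]) for f in features]

print(index)
print(features)
print(selected)
```

The first two helpers play the role of transformers (they change the representation of the features), while the last plays the role of a selector (it keeps only some of them).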
13. Model evaluation Techniques
• Evaluation:
  • F1 Score
    Calculate precision and recall from the confusion matrix:
    precision = True Positives / Predicted Positives
    recall = True Positives / Actual Positives
  • ROC

Confusion Matrix

                   Predicted Positive   Predicted Negative
Actual Positive    True Positive        False Negative
Actual Negative    False Positive       True Negative
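The precision and recall formulas on this slide translate directly into code. The sketch below is a minimal plain-Python version; the function name and the sample counts are hypothetical, and the F1 score is taken to be the harmonic mean of precision and recall, which the slide names but does not spell out.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives

    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(precision, recall, f1)
```

True negatives do not appear in any of the three measures, which is why F1 is often preferred over plain accuracy when negatives heavily outnumber positives.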
14. SparkML Evaluators and Tuning
• Evaluators:
  • BinaryClassificationEvaluator
    • areaUnderROC & areaUnderPR
  • MulticlassClassificationEvaluator
    • f1, weightedPrecision, weightedRecall
  • RegressionEvaluator
    • MSE, RMSE
• Model Tuning and Selection:
  • CrossValidator
    • k (train, test) dataset pairs are created, one per fold
    • Trains and evaluates on every pair for each of the different param settings
    • Expensive
  • TrainValidationSplit
    • Only 1 (train, test) dataset pair is created
    • Trains and evaluates each combination of the params only once
    • Less expensive than cross-validation
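The cost difference between the two tuning strategies comes from how many (train, test) pairs each one creates. The plain-Python sketch below shows this; it is not Spark code, the helper names are hypothetical, and the rows are split deterministically for readability where a real implementation would normally split at random.

```python
def k_fold_pairs(rows, k):
    """Create k (train, test) pairs; each fold serves as the test set once."""
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pairs.append((train, test))
    return pairs


def single_split(rows, train_ratio):
    """Create one (train, test) pair by cutting the data at train_ratio."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def model_fits(num_param_settings, num_pairs):
    """Number of models trained: every param setting on every pair."""
    return num_param_settings * num_pairs


rows = list(range(10))
cv_pairs = k_fold_pairs(rows, k=3)
tvs_pair = single_split(rows, train_ratio=0.8)

# With a hypothetical grid of 6 param settings:
print(model_fits(6, len(cv_pairs)))  # cross-validation with 3 folds
print(model_fits(6, 1))              # single train/validation split
```

With the same grid, k-fold cross-validation trains k times as many models as a single split, in exchange for an evaluation that depends less on one particular partition of the data.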