Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal

This document provides an overview of Apache SystemML, an open-source framework for scalable machine learning. It discusses how SystemML lets data scientists implement machine learning algorithms in a declarative language, and how SystemML then compiles and optimizes those algorithms to run efficiently on everything from a single node to large clusters. It also provides examples of DML code used in SystemML and shows how to invoke SystemML through its APIs or from the command line.
3. What is Apache SystemML
• In a nutshell
  • A language for data scientists to implement scalable ML algorithms
  • Two language variants: R-like and Python-like syntax
  • Strong foundation of linear algebra operations and statistical functions
  • Comes with approx. 20+ algorithms pre-implemented
  • Cost-based optimizer to compile execution plans, depending on data characteristics (tall/skinny, short/wide; dense/sparse) and cluster characteristics, ranging from single node to clusters (MapReduce, Spark); hybrid plans
• APIs & Tools
  • Command line: hadoop jar, spark-submit, standalone Java app
  • JMLC: embed as library
  • Spark MLContext: Scala, Python, and Java
  • Tools
    • REPL (Scala Spark and pyspark)
    • Spark ML pipeline
5. SystemML – Declarative ML
• Analytics language for data scientists (“the SQL for analytics”)
  • Algorithms expressed in DML, a declarative, high-level language with R-like syntax
  • Productivity of data scientists
  • Enables solutions development and tools
• Compiler
  • Cost-based optimizer to generate execution plans and to parallelize
    • based on data characteristics
    • based on cluster and machine characteristics
  • Physical operators for in-memory single-node and cluster execution
  • Performance & scalability
10. Sample Code
A = 1.0                                    # A is a scalar
X <- matrix("4 3 2 5 7 8", rows=3, cols=2) # X is a 3x2 matrix; '<-' is assignment
Y = matrix(1, rows=3, cols=2)              # Y is a 3x2 matrix of all 1s
b <- t(X) %*% Y                            # %*% is matrix multiply, t(X) is transpose
S = "hello world"

i = 0
while (i < max_iteration) {
  H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W)) # * is element-wise multiply
  W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
  i = i + 1;                                       # i is an integer
}
print(toString(H))                                 # toString converts a matrix to a string
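The while loop above is the multiplicative update rule for non-negative matrix factorization, approximating V ≈ W %*% H. A minimal NumPy sketch of the same updates (matrix sizes, iteration count, and the divergence helper are illustrative assumptions, not part of SystemML):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4)) + 0.1   # non-negative data matrix to factor as V ~ W @ H
W = rng.random((6, 2)) + 0.1   # factors are initialized strictly positive
H = rng.random((2, 4)) + 0.1

def kl_divergence(V, WH):
    # generalized KL divergence, which these updates drive down
    return float(np.sum(V * np.log(V / WH) - V + WH))

before = kl_divergence(V, W @ H)
for _ in range(50):
    # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W))
    H = H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
    W = W * ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
after = kl_divergence(V, W @ H)
assert after < before   # each update is non-increasing in the divergence
```

Each NumPy line mirrors the DML line in the comment above it; `colSums`/`rowSums` become axis sums broadcast across H's rows and W's columns.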
11. Sample Code
source("nn/layers/affine.dml") as affine   # import a file into the "affine" namespace
[W, b] = affine::init(D, M)                # calls the init function; multiple return values

parfor (i in 1:nrow(X)) {                  # i iterates over the rows of X in parallel
  for (j in 1:ncol(X)) {                   # j iterates over the columns of X
    # Computation ...
  }
}

write(M, fileM, format="text")             # M=matrix, fileM=file; also writes to HDFS
X = read(fileX)                            # fileX=file; also reads from HDFS

if (ncol(A) > 1) {
  # Matrix A is sliced by a given range of columns
  A[,1:(ncol(A) - 1)] = A[,1:(ncol(A) - 1)] - A[,2:ncol(A)];
}
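The slicing pattern above subtracts each column's right neighbor from it, leaving the last column untouched. A NumPy equivalent (the example matrix is an assumption for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0, 4.0],
              [3.0, 3.0, 3.0]])
if A.shape[1] > 1:
    # A[,1:(ncol(A)-1)] = A[,1:(ncol(A)-1)] - A[,2:ncol(A)]
    # the right-hand side is evaluated before assignment, as in DML
    A[:, :-1] = A[:, :-1] - A[:, 1:]
# A is now [[-1, -2, 4], [0, 0, 3]]
```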
12. Sample Code
interpSpline = function(
    double x, matrix[double] X, matrix[double] Y, matrix[double] K) return (double q) {
  i = as.integer(nrow(X) - sum(ppred(X, x, ">=")) + 1)
  # misc computation ...
  q = as.scalar(qm)
}

eigen = externalFunction(Matrix[Double] A)
  return (Matrix[Double] eval, Matrix[Double] evec)
  implemented in (classname="org.apache.sysml.udf.lib.EigenWrapper", exectype="mem")
13. Sample Code (from LinearRegDS.dml*)
A = t(X) %*% X
b = t(X) %*% y
if (intercept_status == 2) {
  A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ])
  A =   diag (scale_X) %*% A + shift_X %*% A [m_ext, ]
  b =   diag (scale_X) %*% b + shift_X %*% b [m_ext, ]
}
A = A + diag (lambda)
print ("Calling the Direct Solver...")
beta_unscaled = solve (A, b)

* https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/LinearRegDS.dml#L133
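The direct-solve path above amounts to forming the normal equations A = XᵀX + λI, b = Xᵀy and solving Aβ = b. A small NumPy sketch of that path (data shapes, the synthetic coefficients, and λ = 0 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 5))               # 100 observations, 5 features
true_beta = np.arange(1.0, 6.0)
y = X @ true_beta                      # noise-free, so the exact solution is recoverable

lam = 0.0                              # regularization; the DML script adds diag(lambda)
A = X.T @ X + lam * np.eye(X.shape[1]) # normal-equations matrix
b = X.T @ y
beta = np.linalg.solve(A, b)           # "Calling the Direct Solver..."

assert np.allclose(beta, true_beta)
```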
14. DML Editor Support
• Very rudimentary editor support
• A bit of shameless self-promotion:
• Atom – hackable text editor
  • Install package: https://atom.io/packages/language-dml
  • From the GUI: http://flight-manual.atom.io/using-atom/sections/atom-packages/
  • Or from the command line: apm install language-dml
  • Rudimentary snippet-based completion of built-in functions
• Vim
  • Install package: https://github.com/nakul02/vim-dml
  • Works with Vundle (the Vim package manager)
• There is an experimental Zeppelin Notebook integration with DML
  • https://issues.apache.org/jira/browse/SYSTEMML-542
  • Available as a Docker image to play with: https://hub.docker.com/r/nakul02/incubator-zeppelin/
• Please send feedback when using these – feature requests, bugs
  • I’ll work on them when I can
15. SystemML Algorithms

Category                         Description
Descriptive Statistics           Univariate; Bivariate; Stratified Bivariate
Classification                   Logistic Regression (multinomial); Multi-Class SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest
Clustering                       k-Means
Regression                       Linear Regression: system of equations, CG (conjugate gradient descent)
Generalized Linear Models (GLM)  Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli. Links for all distributions: identity, log, sq. root, inverse, 1/μ². Links for Binomial/Bernoulli: logit, probit, cloglog, cauchit
Stepwise                         Linear; GLM
Dimension Reduction              PCA
Matrix Factorization             ALS: direct solve, CG (conjugate gradient descent)
Survival Models                  Kaplan–Meier Estimate; Cox Proportional Hazard Regression
Predict                          Algorithm-specific scoring
Transformation (native)          Recoding, dummy coding, binning, scaling, missing value imputation

Documentation: https://apache.github.io/incubator-systemml/algorithms-reference.html
Scripts: /usr/SystemML/systemml-0.10.0-incubating/scripts/algorithms/
17. MLContext API – Example Usage
val ml = new MLContext(sc)
val X_train = sc.textFile("amazon0601.txt")
  .filter(!_.startsWith("#"))
  .map(_.split("\t") match { case Array(prod1, prod2) => (prod1.toInt, prod2.toInt, 1.0) })
  .toDF("prod_i", "prod_j", "x_ij")
  .filter("prod_i < 5000 AND prod_j < 5000") // Change to a smaller number
  .cache()
18. MLContext API – Example Usage
val pnmf =
  """
  # data & args
  X = read($X)
  rank = as.integer($rank)
  # Computation ....
  write(negloglik, $negloglikout)
  write(W, $Wout)
  write(H, $Hout)
  """
19. MLContext API – Example Usage
val pnmf =
  """
  # data & args
  X = read($X)
  rank = as.integer($rank)
  # Computation ....
  write(negloglik, $negloglikout)
  write(W, $Wout)
  write(H, $Hout)
  """

ml.registerInput("X", X_train)
ml.registerOutput("W")
ml.registerOutput("H")
ml.registerOutput("negloglik")
val outputs = ml.executeScript(pnmf, Map("maxiter" -> "100", "rank" -> "10"))
val negloglik = getScalarDouble(outputs, "negloglik")
22. End-to-end on Spark … in Code
import org.apache.spark.sql._
val ctx = new org.apache.spark.sql.SQLContext(sc)
val tweets = ctx.jsonFile("hdfs:/twitter/decahose")
tweets.registerAsTable("tweetTable")
ctx.sql("SELECT text FROM tweetTable LIMIT 5").collect.foreach(println)
ctx.sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt DESC LIMIT 10").collect.foreach(println)
val texts = ctx.sql("SELECT text FROM tweetTable").map(_.head.toString)
def featurize(str: String): Vector = { ... }
val vectors = texts.map(featurize).toDF.cache()
val mcV = new MatrixCharacteristics(vectors.count, vocabSize, 1000, 1000)
val V = RDDConvertUtilsExt(sc, vectors, mcV, false, "_1")
val ml = new com.ibm.bi.dml.api.MLContext(sc)
ml.registerInput("V", V, mcV)
ml.registerOutput("W")
ml.registerOutput("H")
val args = Array(numTopics, numGNMFIter)
val out = ml.execute("GNMF.dml", args)
val W = out.getDF("W")
val H = out.getDF("H")
def getWords(r: Row): Array[(String, Double)] = { ... }
val topics = H.rdd.map(getWords)

(Pipeline diagram: Twitter data → explore data in SQL → data set → training set → topic modeling with SQL + ML → get topics)
23. SystemML Architecture
Language
• R-like syntax
• Linear algebra, statistical functions, control structures, etc.
• User-defined & external functions
• Parsing
• Statement blocks & statements
• Program analysis, type inference, dead code elimination

High-Level Operator (HOP) Component
• Dataflow in DAGs of operations on matrices, frames, and scalars
• Choosing from alternative execution plans based on memory and cost estimates: operator ordering & selection; hybrid plans

Low-Level Operator (LOP) Component
• Low-level physical execution plan (LOP DAGs) over key-value pairs
• “Piggybacking” operations into a minimal number of Map-Reduce jobs

Runtime
• Hybrid runtime
  • CP: single-machine operations & job orchestration
  • MR: generic Map-Reduce jobs & operations
  • SP: Spark jobs
• Numerically stable operators
• Dense / sparse matrix representation
• Multi-level buffer pool (caching) to evict in-memory objects
• Dynamic recompilation for initial unknowns

(Architecture diagram: the Command Line, JMLC, and Spark MLContext APIs feed the Parser/Language layer; the Compiler spans High-Level Operators, Low-Level Operators, the ParFor Optimizer, and the Recompiler with cost-based optimizations; the Runtime spans the Control Program, Runtime Program, Buffer Pool, CP/Spark/MR instructions, generic MR jobs, the single/multi-threaded MatrixBlock library, and Mem/FS and DFS IO.)
24. SystemML Compilation Chain
Example runtime instructions from a compiled plan:
CP + b sb _mVar1
SPARK mapmm X.MATRIX.DOUBLE _mvar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE RIGHT false NONE
CP * y _mVar2 _mVar3
25. Selected Algebraic Simplification Rewrites

Dynamic rewrites:
Name                              Pattern
Remove Unnecessary Indexing       X[a:b,c:d] = Y → X = Y iff dims(X)=dims(Y);  X = Y[, 1] → X = Y iff ncol(Y)=1
Remove Empty Matrix Multiply      X%*%Y → matrix(0,nrow(X),ncol(Y)) iff nnz(X)=0 | nnz(Y)=0
Remove Unnecessary Outer Product  X*(Y%*%matrix(1,...)) → X*Y iff ncol(Y)=1
Simplify Diag Aggregates          sum(diag(X)) → trace(X) iff ncol(X)=1
Simplify Matrix Mult Diag         diag(X)%*%Y → X*Y iff ncol(X)=1 & ncol(Y)=1
Simplify Diag Matrix Mult         diag(X%*%Y) → rowSums(X*t(Y)) iff ncol(Y)>1
Simplify Dot Product Sum          sum(X^2) → t(X)%*%X iff ncol(X)=1

Static rewrites:
Name                              Pattern
Remove Unnecessary Operations     t(t(X)), X/1, X*1, X-0 → X;  matrix(1,)/X → 1/X;  rand(,min=-1,max=1)*7 → rand(,min=-7,max=7)
Binary to Unary                   X+X → 2*X;  X*X → X^2;  X-X*Y → X*(1-Y)
Simplify Diag Aggregates          trace(X%*%Y) → sum(X*t(Y))
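These rewrites are pure algebraic identities, so each can be checked numerically. A NumPy sketch over a few of the patterns (matrix sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((4, 3))
Y = rng.random((3, 4))

# Simplify Diag Aggregates: trace(X %*% Y) -> sum(X * t(Y)),
# which avoids materializing the full product X %*% Y
assert np.isclose(np.trace(X @ Y), np.sum(X * Y.T))

# Simplify Diag Matrix Mult: diag(X %*% Y) -> rowSums(X * t(Y))
assert np.allclose(np.diag(X @ Y), np.sum(X * Y.T, axis=1))

# Simplify Dot Product Sum: sum(X^2) -> t(X) %*% X for a column vector
x = rng.random((5, 1))
assert np.isclose(np.sum(x ** 2), float(x.T @ x))

# Binary to Unary: X - X*Y -> X*(1-Y)
Z = rng.random((4, 3))
assert np.allclose(X - X * Z, X * (1 - Z))
```

Both sides of each assertion compute the same value; the rewritten form is the cheaper one the optimizer emits.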
26. A Data Scientist – Linear Regression
X w ≈ y, where X holds the explanatory/independent variables, w is the model, and y is the predicted/dependent variable.

Optimization problem: w = argmin_w ||Xw − y||² + λ||w||², with A = XᵀX + λI

Conjugate Gradient Method (initialize w and the initial direction, then iterate until convergence, tracking the step size, the update to w, the next direction, and accuracy measures):
• Start off with the (negative) gradient
• For each step:
  1. Move to the optimal point along the chosen direction
  2. Recompute the gradient
  3. Project it onto the subspace conjugate* to all prior directions
  4. Use this as the next direction
(* conjugate = orthogonal given A as the metric)
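The steps above can be sketched as a matrix-free conjugate gradient solver for (XᵀX + λI) w = Xᵀy; the function name and test data below are illustrative assumptions, not SystemML code:

```python
import numpy as np

def linreg_cg(X, y, lam=0.0, max_iter=100, tol=1e-12):
    n = X.shape[1]
    w = np.zeros(n)
    r = X.T @ y                       # residual b - A w, for w = 0
    p = r.copy()                      # initial direction = (negative) gradient
    rs = r @ r
    for _ in range(max_iter):
        Ap = X.T @ (X @ p) + lam * p  # A p without ever forming X^T X
        alpha = rs / (p @ Ap)         # step to the optimal point along p
        w = w + alpha * p             # update w
        r = r - alpha * Ap            # recompute the gradient
        rs_new = r @ r
        if rs_new < tol:              # accuracy measure: squared residual norm
            break
        # project onto the subspace conjugate to all prior directions
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

rng = np.random.default_rng(3)
X = rng.random((50, 4))
y = rng.random(50)
w = linreg_cg(X, y, lam=0.1)
A = X.T @ X + 0.1 * np.eye(4)
assert np.allclose(A @ w, X.T @ y)    # w solves the regularized normal equations
```

Computing `Ap` as `X.T @ (X @ p)` is what makes the method attractive at scale: only matrix-vector products against X are needed, never the dense XᵀX.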
27. SystemML – Run LinReg CG on Spark
(Diagram: the same LinReg CG script runs unchanged on X with 100M rows and 10 / 100 / 1,000 / 10,000 columns (8 GB / 80 GB / 800 GB / 8 TB) against y of 100M x 1, on a 20 GB driver with 16 cores and 6 x 55 GB executors. The optimizer picks: multithreaded single-node execution with the fused tMMp operator for 8 GB; a hybrid plan with RDD caching of X and the fused operator (x.persist(); X.mapValues(tMMp).reduce()) for 80 GB; a hybrid plan with out-of-core RDDs (spilling on the executors) and the fused operator for 800 GB; and a hybrid plan with out-of-core RDDs and different operators – two matrix-vector multiplies with broadcast, mapToPair, and reduceByKey – for 8 TB.)
28. LinReg CG for Varying Data Sizes
Execution time in secs (log scale), X = 100M rows:

            8 GB         80 GB        800 GB      8 TB
            100M x 10    100M x 100   100M x 1K   100M x 10K
CP+Spark    21           92           2,065       40,395
Spark       76           124          2,159       40,130
CP+MR       24           277          2,613       41,006

Note: driver with 20 GB and 16 cores; 6 executors, each 55 GB and 24 cores; convergence in 3-4 iterations; SystemML as of 10/2015.

• 8 GB: single-node multithreading avoids the Spark context & distributed ops (3.6x)
• 80 GB: hybrid plan & RDD caching (3x)
• 800 GB: out of core (1.2x)
• 8 TB: cluster fully utilized

• Cost-based optimization is important
• Hybrid execution plans benefit especially medium-sized data sets
• Aggregated in-memory data sets are the sweet spot for Spark, esp. for iterative algorithms
• Graceful degradation for out-of-core
29. Apache SystemML - Summary
• Cost-based compilation of machine learning algorithms generates execution plans
  • for single-node in-memory, cluster, and hybrid execution
  • for varying data characteristics:
    • varying number of observations (1,000s to 10s of billions)
    • varying number of variables (10s to 10s of millions)
    • dense and sparse data
  • for varying cluster characteristics (memory configurations, degree of parallelism)
• Out-of-the-box, scalable machine learning algorithms
  • e.g. descriptive statistics, regression, clustering, and classification
• "Roll-your-own" algorithms
  • Enables programmer productivity (no worry about scalability, numeric stability, and optimizations)
  • Fast turn-around for new algorithms
• Higher-level language shields algorithm development investment from platform progression
  • YARN for resource negotiation and elasticity
  • Spark for in-memory, iterative processing