
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure

Presenter: Shivakumar Vaithyanathan, IBM Chief Scientist & Sr. Manager, IBM Research

  1. © 2015 IBM Corporation. Declarative Machine Learning: Bring Your Own Algorithm, Data, Syntax and Infrastructure. Shivakumar Vaithyanathan, IBM Fellow, Watson & IBM Research.
  2. Credit Risk Scoring Application at a Large Financial Institution
     • Problem size: 300 million rows, 1,500 features (reduced set: 500 features)
     • Data size on disk: 3.6 TB uncompressed (even for the reduced set: 1.2 TB)
     • Algorithm of interest: regression
     • To execute on one machine (with a hypothetical statistical package/engine): 3.6 TB of RAM required (an underestimate); for the reduced set, 1.2 TB of RAM (also an underestimate). In practice even more RAM is required, since outputs and intermediates must be stored along with the input.
     • Credit risk scoring inputs: payment history, amount owed, length of credit history, new credit, types of credit used.
     Prototypical of problems in other industries, ranging from automotive to insurance to transportation.
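     The RAM figures follow directly from dense double-precision storage; a quick back-of-the-envelope check in Python (the 8-bytes-per-value assumption is ours, not stated on the slide):

         rows = 300_000_000
         for features in (1_500, 500):
             terabytes = rows * features * 8 / 1e12   # 8-byte doubles, decimal TB
             print(f"{features} features: {terabytes:.1f} TB")
         # 1500 features: 3.6 TB
         # 500 features: 1.2 TB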
  3. Big Data Analytics Use Cases (Insurance, DaaS / Retail Finance, Automotive)
     • RISK (insurance): consumer risk modeling on consumer data with ~300 M rows and ~500 attributes.
     • DaaS (retail finance): predict customer monetary loss; multi-million observations, 95 features; evaluate several hundred models to find the optimal subset of features.
     • Automotive: customer satisfaction; multi-million cars with few reacquired cars; feature expansion from ~250 to ~21,800 features.
  4. A Day in the Life of a Data Scientist: starting from a data sample and its characteristics, the data scientist develops a new algorithm or modifies an existing one (Bayesian networks, neural networks, random forests, support vector machines, ...) in custom syntax, and then must run it on the original data.
  5. Bottleneck: moving the algorithm onto big-data infrastructure. Today, the data scientist must hand the algorithm off to a Hadoop programmer, a Spark programmer, or an MPI programmer to reimplement it.
  6. What if... a compiler and optimizer sat between the data scientist and the Hadoop, Spark, and MPI backends, so that no hand translation were needed?
  7. Simplified view of what we want to build:
     • The What: a high-level language with tooling, in which one can write any algorithm.
     • The How: a compiler and optimizer that adapt to different data and program characteristics, and support different backend architectures and configurations.
  8. SystemML: an IBM Research project that will soon be in open source.
     • Project started 6 years ago; more than 10 papers in major conferences; in beta for more than a year and used in multiple applications.
     The What:
     • R-like and Python-like syntax
     • Rich set of statistical functions
     • User-defined and external functions
     The How:
     • Single-node, embeddable, and Hadoop & Spark execution
     • Dense / sparse matrix representations
     • Library of more than 15 algorithms
     Compilation stack: R parser / Python parser → high-level operators (HOPs) → low-level operators (LOPs) → in-memory single node or Hadoop / Spark. Writing the Python-syntax parser took less than 2 man-months.
  9. How should the "What" work? The slide fills with the hand-written Hadoop MapReduce implementation of GNMF (non-negative matrix factorization): a driver class MatrixGNMF plus job classes UpdateWHStep1 through UpdateWHStep5, running to many hundreds of lines of Java. A reformatted excerpt (the original listing is cut off mid-way through UpdateWHStep1.runJob):

         package gnmf;

         import java.io.IOException;
         import java.net.URISyntaxException;
         import org.apache.hadoop.fs.FileSystem;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.mapred.JobConf;

         public class MatrixGNMF {
           public static void main(String[] args) throws IOException, URISyntaxException {
             if (args.length < 10) {
               System.out.println("missing parameters");
               System.out.println("expected parameters: [directory of v] [directory of w] "
                   + "[directory of h] [k] [num mappers] [num reducers] [replication] "
                   + "[working directory] [final directory of w] [final directory of h]");
               System.exit(1);
             }
             String vDirectory = args[0], wDirectory = args[1], hDirectory = args[2];
             int k = Integer.parseInt(args[3]);
             int numMappers = Integer.parseInt(args[4]);
             int numReducers = Integer.parseInt(args[5]);
             int replication = Integer.parseInt(args[6]);
             String outputDir = args[7], wFinalDir = args[8], hFinalDir = args[9];
             JobConf mainJob = new JobConf(MatrixGNMF.class);
             FileSystem.get(mainJob).delete(new Path(outputDir));

             // One iteration of GNMF = six chained MapReduce jobs, each
             // materializing its intermediate result to HDFS:
             System.out.print("calculating X = WT * V... ");
             String workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers,
                 replication, UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory,
                 outputDir, k);
             String resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers,
                 replication, workingDirectory, outputDir);
             FileSystem.get(mainJob).delete(new Path(workingDirectory));
             System.out.println("done");
             // ... analogous job sequences for Y = WT * W * H, H = H .* X ./ Y,
             // X = V * HT, Y = W * H * HT, and W = W .* X ./ Y, plus bookkeeping
             // to delete intermediates and time the run ...
           }
         }

         // UpdateWHStep2: an entire MapReduce job just to sum partial vectors.
         public class UpdateWHStep2 {
           static class UpdateWHStep2Mapper extends MapReduceBase
               implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector> {
             @Override
             public void map(TaggedIndex key, MatrixVector value,
                 OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
                 throws IOException {
               out.collect(key, value);   // identity map; the shuffle groups by index
             }
           }

           static class UpdateWHStep2Reducer extends MapReduceBase
               implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject> {
             @Override
             public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
                 OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter)
                 throws IOException {
               MatrixVector result = null;
               while (values.hasNext()) {   // sum all partial vectors for this index
                 MatrixVector current = values.next();
                 if (result == null) result = current.getCopy();
                 else result.addVector(current);
               }
               if (result != null)
                 out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
                     new MatrixObject(result));
             }
           }

           public static String runJob(int numMappers, int numReducers, int replication,
               String inputDir, String outputDir) throws IOException {
             String workingDirectory = outputDir + System.currentTimeMillis()
                 + "-UpdateWHStep2/";
             JobConf job = new JobConf(UpdateWHStep2.class);
             job.setJobName("MatrixGNMFUpdateWHStep2");
             job.setInputFormat(SequenceFileInputFormat.class);
             FileInputFormat.setInputPaths(job, new Path(inputDir));
             job.setOutputFormat(SequenceFileOutputFormat.class);
             FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
             job.setNumMapTasks(numMappers);
             job.setMapperClass(UpdateWHStep2Mapper.class);
             job.setMapOutputKeyClass(TaggedIndex.class);
             job.setMapOutputValueClass(MatrixVector.class);
             job.setNumReduceTasks(numReducers);
             job.setReducerClass(UpdateWHStep2Reducer.class);
             job.setOutputKeyClass(TaggedIndex.class);
             job.setOutputValueClass(MatrixObject.class);
             JobClient.runJob(job);
             return workingDirectory;
           }
         }

         // ... UpdateWHStep1 (matrix-vector products), UpdateWHStep3,
         // UpdateWHStep4, and UpdateWHStep5 (elementwise update) follow
         // the same pattern ...

     The same algorithm in SystemML: R syntax (10 lines of code); Python syntax (10 lines of code). A factor of 7-10 advantage in man-months over multiple algorithms.
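     The slide does not reproduce the two 10-line scripts themselves. As a rough illustration of why GNMF fits in ~10 lines of a high-level matrix language, here is the same multiplicative-update algorithm in plain NumPy (a sketch, not SystemML's DML; the names V, W, H, k and the update formulas X = WT*V, Y = WT*W*H, etc. come from the Java driver above, while the epsilon guard is this sketch's own addition):

         import numpy as np

         def gnmf(V, k, iterations=100, eps=1e-9):
             """Factor V ~= W @ H with non-negativity, via multiplicative updates."""
             rng = np.random.default_rng(0)
             W = rng.random((V.shape[0], k))
             H = rng.random((k, V.shape[1]))
             for _ in range(iterations):
                 # H = H .* (WT*V) ./ (WT*W*H): jobs "X = WT * V", "Y = WT * W * H"
                 H *= (W.T @ V) / (W.T @ W @ H + eps)
                 # W = W .* (V*HT) ./ (W*H*HT): jobs "X = V * HT", "Y = W * H * HT"
                 W *= (V @ H.T) / (W @ H @ H.T + eps)
             return W, H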
  10. Scalability and Performance: GNMF Example. As the data grows, the compiled plan shifts: all operations execute on a single machine (0 MR jobs); hybrid execution where the majority of operations execute on a single machine (4 MR jobs); hybrid execution where the majority of operations execute in map-reduce (6 MR jobs).
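      A caricature of how such hybrid placement can be decided: estimate each operation's memory footprint and keep it on the single node when it fits the budget, otherwise emit a distributed job. This is an illustrative sketch, not SystemML's actual optimizer code, and every name in it is invented:

          def choose_backend(nrows, ncols, mem_budget_bytes, sparsity=1.0):
              """Place one matrix operation: single-node if its memory
              estimate fits the budget, otherwise a MapReduce job."""
              estimate = nrows * ncols * 8 * sparsity   # bytes, double precision
              return "CP (single node)" if estimate <= mem_budget_bytes else "MR job"

          # With a 2 GB budget, a 10,000 x 1,000 dense matrix stays local,
          # while a 300M x 500 matrix goes to MapReduce:
          print(choose_backend(10_000, 1_000, 2e9))      # CP (single node)
          print(choose_backend(300_000_000, 500, 2e9))   # MR job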
  11. What does the "How" do?
  12. What does the "How" do? It adapts the execution plan to the data and the cluster. For the original data (X: 300M × 500; y: 300M × 1), the plan is a single fused job computing X'X and X'y, followed by a solve. The optimizer recompiles when characteristics change:
      • Change in data characteristics: with 3 times more columns (X: 300M × 1500) or 2 times more rows (X: 600M × 500; y: 600M × 1), the operations no longer fit the same job boundaries and the plan splits into multiple jobs (e.g., X'X job1, X'X job2, X'y job, solve; or X'y job1, X'y job2, X'X job, solve).
      • Change in cluster configuration: e.g., growing the map-task JVM from 2.5 GB to 7 GB (with an in-memory master JVM) lets the optimizer pick a plan that runs 3X faster.
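      All of these plans compute the same quantity, the normal-equations solution b = (X'X)^(-1) X'y. A minimal single-node sketch (X, y and the X'X / X'y / solve steps are from the slide; the row-block loop, standing in for the distributed jobs, is this sketch's own):

          import numpy as np

          def linreg_normal_equations(X, y, block=100_000):
              """Solve b = (X'X)^{-1} X'y, streaming over row blocks the way
              the distributed X'X / X'y jobs accumulate partial results."""
              n, d = X.shape
              XtX = np.zeros((d, d))
              Xty = np.zeros(d)
              for i in range(0, n, block):          # each block ~ one task's share
                  Xb, yb = X[i:i+block], y[i:i+block]
                  XtX += Xb.T @ Xb
                  Xty += Xb.T @ yb
              return np.linalg.solve(XtX, Xty)      # the final "solve" step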
  13. Compilation Chain Overview with Example. The expression Q = y * (X %*% (b + sb)) is parsed into a parse tree (with intermediates b+sb, X %*% (b+sb), and y * (X %*% (b+sb))), translated into a HOPs DAG and then a LOPs DAG, and finally compiled into runtime instructions:
      CP: b+sb → _mvar1
      MR-Job: [map = X %*% _mvar1 → _mvar2]
      CP: y*_mvar2 → _mvar3
      If dimensions are unknown at compile time, validation passes through and additional checks are made at run time.
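      A sketch of that expression and its decomposition into the three runtime instructions, in NumPy (the instruction names _mvar1 through _mvar3 are from the slide; the toy data is made up):

          import numpy as np

          X = np.arange(6.0).reshape(3, 2)            # big matrix: goes to MR
          b, sb = np.array([1.0, 2.0]), np.array([0.5, 0.5])
          y = np.array([1.0, -1.0, 2.0])

          _mvar1 = b + sb      # CP instruction: b+sb -> _mvar1 (small, single node)
          _mvar2 = X @ _mvar1  # MR job: map-side X %*% _mvar1 -> _mvar2
          Q = y * _mvar2       # CP instruction: y*_mvar2 -> _mvar3 (the result Q)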
  14. Some Performance Numbers for Spark / Hadoop

      In-memory data set (160 GB), data fits in aggregated memory: SystemML optimizations give ~10X over Hadoop.

      ML Program   MR Backend        Spark Backend     Spark Backend
                   (All ML optims)   (All ML optims)   (Limited ML optims)
      LinregDS     479s              342s              456s
      LinregCG     954s              188s              243s
      L2SVM        1,517s            237s              531s
      GLM          1,989s            205s              318s

      Large-scale data set (1.6 TB), data larger than aggregated memory: SystemML optimizations give ~2X.

      ML Program   MR Backend        Spark Backend
                   (All ML optims)   (All ML optims)
      LinregDS     5,429s            6,779s
      LinregCG     12,469s           10,014s
      L2SVM        24,360s           12,795s
      GLM          32,521s           17,301s
  15. Thank You
