
AutoML for Data Science Productivity and Toward Better Digital Decisions


With the increased availability of both cloud computing and AI libraries comes the opportunity to automatically search for, and optimize, machine learning algorithms. While this technology has been around for almost twenty years and is seeing renewed interest lately, only recently has computing power become widespread enough for a growing community of data scientists to take full advantage of it across many different types of opportunities. Because machine learning remains a rather challenging discipline for most, I advocate a more “assistive” approach to AutoML: one that helps the data scientist learn about different methods across the entire machine learning pipeline, and that creates a knowledge graph of results which can be further mined and explored to gain knowledge and to connect with other individuals who are also searching for machine learning pipelines. In this talk, I will present an overview of the approach, published recently in IJCAI and AAAI, and provide new unpublished results demonstrating its effectiveness on public data sets.



  1. 1. AutoML Productivity for Data Science, AND … A better way to make Digital Decisions Dr. Steven Gustafson Chief Scientist, Maana (2+ years) previously, GE Research (10+ years) (before previously, PhD AI for “automatic programming”)
  2. 2. What do you take away? Observe my arguments about AutoML and the algorithm. Reason about the evidence and consider past experience. Decide to change your Data Science approach. Learn by experimentation and feedback.
  3. 3. My Argument • Generate new knowledge • Find good model pipelines • Allow your experts and data scientists to understand, learn and improve models that drive business decisions! • We created our AutoML as an archetype for architecting digital decisions!
  4. 4. AutoML • Generate and tune ML pipeline • Auto-WEKA, Auto-SKLEARN, Google NN, Azure ML, …,TPOT • Most Bayesian learning or computation vs. improvement • Black box – helps find solutions, not knowledge or wisdom • Assumes future problems represented by data • Biased by what code and data is available vs. what’s useful • Can be very long running - hours to days
  5. 5. Expert Data Science Observe the Problem, Data, Background Knowledge Reason about data characteristics vs. goals vs. techniques Decide on initial approaches Learn from results and iterate
  6. 6. Expert Data Science vs. AutoML • Massive compute • Optimize many, many parameters • Blind search, etc.? • How do you explain results? • Justify the compute budget • Engage an SME? • Does the Data Scientist learn? • Observe the Problem, Data, Background Knowledge • Reason about data characteristics • Decide on initial approaches • Learn from results and iterate • =?
  7. 7. What if AutoML… • Capture & represent knowledge • Use reasoning to “expertly” choose pipelines • Use reinforcement learning with human input in real-time to guide iteration • Target seconds and minutes for results instead of hours and days, match expert iteration
  8. 8. What is Knowledge Representation? • A surrogate, a substitute for the thing itself. • Enable an entity to determine consequences by thinking rather than acting. • A “language” in which we say things about the world. • A “theory” of intelligent reasoning: the type of reasoning and the applicable reasoning given data • Guidance for organizing information to facilitate inferences to get new expressions from old. • A KR is not a data structure. A KR must be implemented in the machine by some data structure. http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html
  9. 9. Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning. F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu. Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning. In Genetic Programming Theory and Practice XVI. Springer, 2018.
  10. 10. Symbolic planning • Symbolic planning uses a logical formalism to represent dynamic systems and automated algorithms to generate plans • A plan is a sequence of actions that achieves a goal state from an initial state • With a common action description language (such as B, C, C+, or BC), plans can be computed automatically by an ASP solver such as Clingo. Data science can be viewed as a set of actions that transform and fit data.
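To make this concrete, here is a minimal, hypothetical Python sketch (not the paper's BC/Clingo encoding) of pipeline-building actions with preconditions and effects, plus a naive forward search that enumerates valid plans; all action and fluent names are illustrative assumptions:

```python
from itertools import permutations

# Each action: what must already be true ("needs"), what it makes true ("adds").
ACTIONS = {
    "featurize":  {"needs": {"raw_text"}, "adds": {"features"}},
    "preprocess": {"needs": {"features"}, "adds": {"reduced_features"}},
    "classify":   {"needs": {"features"}, "adds": {"model"}},
}

def plans(initial_state, goal, max_len=3):
    """Enumerate action sequences whose accumulated effects reach the goal."""
    found = []
    for length in range(1, max_len + 1):
        for seq in permutations(ACTIONS, length):
            state = set(initial_state)
            valid = True
            for name in seq:
                act = ACTIONS[name]
                if not act["needs"] <= state:   # precondition not satisfied
                    valid = False
                    break
                state |= act["adds"]            # apply the action's effects
            if valid and goal <= state:
                found.append(seq)
    return found

# plans({"raw_text"}, {"model"}) includes ('featurize', 'classify')
# and ('featurize', 'preprocess', 'classify'), among other valid orderings.
```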
  11. 11. AutoML Pipelines
  12. 12. Pipelines • Featurizers – Count / bag-of-words Vectorizer – Tfidf Vectorizer • Preprocessors – matrix decompositions (truncatedSVD, pca, kernelPCA, fastICA) – kernel approximation (rbfsampler, nystroem) – feature selection (selectkbest, selectpercentile) – scaling (minmaxscaler, robustscaler, absscaler) – no preprocessing • Classifiers – logistic regression – gaussian naive Bayes – linear SVM – random forest – multinomial naive Bayes – stochastic gradient descent. Nystroem: approximates a kernel map using a subset of the training data. KernelPCA: kernel principal component analysis (KPCA). fastICA: a fast algorithm for Independent Component Analysis. truncatedSVD: performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Rbfsampler: approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.
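For illustration (my sketch, not code from the talk), one such pipeline assembled in scikit-learn, with one component chosen for each slot, might look like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# One candidate pipeline: featurizer -> preprocessor -> classifier.
# The AutoML search decides which component fills each slot and its settings.
candidate = Pipeline([
    ("featurizer", TfidfVectorizer()),                 # bag-of-words / tf-idf
    ("preprocessor", TruncatedSVD(n_components=50)),   # matrix decomposition
    ("classifier", LogisticRegression(max_iter=1000)),
])
# candidate.fit(train_texts, train_labels); candidate.score(test_texts, test_labels)
```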
  13. 13. Reinforcement Learning • Find a policy, i.e., a mapping from state to action, such that the agent accumulates maximal reward • Learns the policy by trial and error: executing actions in the environment, obtaining reward, and updating its estimate of the value function until value iteration converges • R-learning updates R(s,a) and rho(s), which reflect the long-term undiscounted average reward and the gain reward, targeting finite-horizon problems (a fixed number of steps into the future). Data scientists perform trial and error on different ML pipelines to understand the most effective pipeline and hyper-parameters, much like a reinforcement learning process.
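A minimal sketch of the R-learning update, assuming a tabular setting where states could be partial pipelines and actions the candidate next steps; the epsilon-greedy choice rule and the learning rates are my assumptions, not details from the talk:

```python
import random
from collections import defaultdict

class RLearner:
    """Sketch of tabular R-learning: action values R(s, a) plus a gain
    estimate rho (long-run undiscounted average reward)."""

    def __init__(self, alpha=0.1, beta=0.05, epsilon=0.1):
        self.R = defaultdict(float)   # R(s, a) estimates
        self.rho = 0.0                # average-reward (gain) estimate
        self.alpha, self.beta, self.epsilon = alpha, beta, epsilon

    def choose(self, state, actions):
        # epsilon-greedy over candidate actions (e.g., next pipeline steps)
        if random.random() < self.epsilon:
            return random.choice(list(actions))
        return max(actions, key=lambda a: self.R[(state, a)])

    def update(self, state, action, reward, next_state, actions, next_actions):
        best_here = max(self.R[(state, a)] for a in actions)
        best_next = max(self.R[(next_state, a)] for a in next_actions) if next_actions else 0.0
        was_greedy = self.R[(state, action)] >= best_here
        # R(s,a) <- R(s,a) + alpha * (r - rho + max_a' R(s',a') - R(s,a))
        self.R[(state, action)] += self.alpha * (
            reward - self.rho + best_next - self.R[(state, action)])
        if was_greedy:
            # rho is nudged only on greedy steps, as in standard R-learning
            self.rho += self.beta * (reward + best_next - best_here - self.rho)
```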
  14. 14. PEORL: Planning-Execution-Observation-Reinforcement-Learning
 Define pipeline goals; find all satisfying plans; instantiate the shortest / highest-reward plan; update plan R-values. The planner focuses future trials on plan components and overall pipelines with higher learned rewards until all plans are tried, the target accuracy is achieved, or time runs out. (Tooling: BC action language, ASP solver Clingo, scikit-learn, R-learning.)
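The full system uses the BC action language, Clingo, and R-learning; as a much-simplified, hypothetical stand-in, the outer loop can be sketched as a bandit-style search where each episode scores one candidate pipeline by cross-validation and the running values steer which plan is tried next (the names and hyper-parameters below are assumptions):

```python
import random
from sklearn.model_selection import cross_val_score

def peorl_style_search(candidate_pipelines, X, y, target_accuracy=0.9,
                       max_trials=20, alpha=0.3, epsilon=0.2):
    """candidate_pipelines: dict of name -> scikit-learn Pipeline (the plans)."""
    values = {name: 0.0 for name in candidate_pipelines}  # learned plan values
    best = (None, 0.0)
    for _ in range(max_trials):
        # mostly exploit the highest-valued plan, sometimes explore another
        if random.random() < epsilon:
            name = random.choice(list(candidate_pipelines))
        else:
            name = max(values, key=values.get)
        score = cross_val_score(candidate_pipelines[name], X, y, cv=5).mean()
        values[name] += alpha * (score - values[name])     # episode reward update
        if score > best[1]:
            best = (name, score)
        if best[1] >= target_accuracy:
            break                                          # goal reached
    return best, values
```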
  15. 15. Evidence from Experiments • Best pipeline, IMDB 300-document bag-of-words: fastICA and stochastic gradient descent (SGD) – Hashing vectorizer: ngram range = (1,2), lowercase = False – FastICA: n components = 3 – SGD classifier: loss = log, penalty = l2 • Best pipeline, Polarity Dataset 2.0 (2000 movie reviews), cross-validation accuracy of 0.84 – Hashing vectorizer: ngram range = (1,3), lowercase = True – FastICA: n components = 3 – SGD classifier: loss = modified huber, penalty = elasticnet • Best pipeline, full IMDB dataset, cross-validation score of 0.88 – Hashing vectorizer: ngram range = (1,1), lowercase = False – FastICA: n components = 3 – SGD classifier: loss = log, penalty = None [Figures: 300 IMDB Docs – Top 5; 300 IMDB Docs – Bottom 5]
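A hedged sketch of the reported IMDB-300 configuration in scikit-learn; the hashing dimension and the densify step are my additions so FastICA (which needs dense input) can run, and recent scikit-learn spells the "log" loss as "log_loss":

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import FastICA
from sklearn.linear_model import SGDClassifier

imdb300_pipeline = Pipeline([
    # slide: ngram range = (1, 2), lowercase = False; n_features is my assumption
    ("featurizer", HashingVectorizer(ngram_range=(1, 2), lowercase=False,
                                     n_features=2**12)),
    # FastICA needs a dense array; HashingVectorizer emits a sparse matrix
    ("densify", FunctionTransformer(lambda X: X.toarray())),
    # slide: n components = 3
    ("preprocessor", FastICA(n_components=3)),
    # slide: loss = log ("log_loss" in recent scikit-learn), penalty = l2
    ("classifier", SGDClassifier(loss="log_loss", penalty="l2")),
])
# e.g. sklearn.model_selection.cross_val_score(imdb300_pipeline, texts, labels, cv=5)
```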
  16. 16. Classifiers All have viable options, but pipelines vary significantly.
  17. 17. Rho value evolution • Pipelines A, B, C: components A and B are fixed; C changes (* episodes are sequential, not reflected in the plot) • Each pipeline is evaluated for 1..5 episodes of 5-fold cross-validation on 300 documents, 2 classes; each episode updates the value ρ. [Figure: ρ vs. episode]
  18. 18. PEORL learns to focus on promising pipelines
  19. 19. UCI Data (data set: score, selected classifier)
Absenteeism: 0.912, linear_svc_classifier
Blood Transfusion: 0.792, random_forest_classifier
Breast Cancer Coimbra: 0.75, random_forest_classifier
Breast Cancer Wisconsin: 0.972, sgd_classifier
Breast Tissue: 0.707, logistic_classifier
Cervical Cancer: 0.9685, linear_svc_classifier
Climate: 0.9574, linear_svc_classifier
Connectionist Bench: 0.8269, gradient_boosting_classifier
Ecoli: 0.875, logistic_classifier
Energy Efficiency: A: 0.570, gradient_boosting_classifier; B: 0.501, random_forest_classifier
Glass: 0.780
Haberman's Survival: 0.735, gradient_boosting_classifier
HCC Survival: 0.745, gradient_boosting_classifier
Ionosphere: 0.94, random_forest_classifier
Iris: 0.953, random_forest_classifier
Leaf: 0.793, linear_svc_classifier
Libras Movement: 0.852, logistic_classifier
LSVT Voice Rehabilitation: 0.881, logistic_classifier
Mammographic Mass: 0.85, random_forest_classifier
Musk: 0.823, random_forest_classifier
Optical Interconnection: 0.647, gradient_boosting_classifier
Parkinsons: 0.897, gradient_boosting_classifier
Quality Assessment of Digital Colposcopies: A: 0.796, random_forest_classifier
Seeds: 0.971, linear_svc_classifier
SPECTF Heart: 0.8, random_forest_classifier
Sports articles for objectivity analysis: 0.853, linear_svc_classifier
Vehicle Silhouettes: 0.754, gradient_boosting_classifier
Student Performance: 0.185, random_forest_classifier
Tennis Major Tournament Match Statistics: 0.998, logistic_classifier
Ultrasonic Flowmeter Diagnostics: A: 0.839, gradient_boosting_classifier; B: 1, random_forest_classifier
User Knowledge Modeling: 0.922, logistic_classifier
Vertebral Column: A: 0.842, linear_svc_classifier; B: 0.864, logistic_classifier
Wholesale Customers: 0.920, random_forest_classifier
Wine: 0.983
  20. 20. Azure ML comparison (data set: AutoML score and classifier vs. Azure AutoML score and model)
Student Performance: 0.185, random_forest_classifier vs. 0.1507, LightGBM
Absenteeism: 0.912, linear_svc_classifier vs. 1.0, LogisticRegression
Blood Transfusion: 0.792, random_forest_classifier vs. 1.0, LogisticRegression
Breast Cancer Coimbra: 0.75, random_forest_classifier vs. 1.0, LogisticRegression
Ionosphere: 0.94, random_forest_classifier vs. 0.8785, LightGBM
Optical Interconnection: 0.647, gradient_boosting_classifier vs. 0.4179, LogisticRegression
Wine: 0.983, random_forest_classifier vs. 1.0, LightGBM
  21. 21. References • F. Yang, A. Elkholy, S. Gustafson. Interpretable Automated Machine Learning in Maana Knowledge Platform. Extended abstract, 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal, May 2019. • D. Lyu, F. Yang, B. Liu, S. Gustafson. SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning. 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, 2019. • F. Yang, S. Gustafson, A. Elkholy, D. Lyu, B. Liu. Program Search for Machine Learning Pipelines Leveraging Symbolic Planning and Reinforcement Learning. In Genetic Programming Theory and Practice XVI. Springer, 2018. • F. Yang, D. Lyu, B. Liu, S. Gustafson. PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making. IJCAI, Sweden, 2018.

  22. 22. AutoML • Algorithm closely mirrors expert’s process, reasonable results • Algorithm is naturally “human in the loop” • Includes learning, via human input and reinforcement learning • Anything else?
  23. 23. Digitization / Digital Decisions • AutoML has a knowledge representation of a digital decision • It allows you to think and reason about the decision before making it • I have built AutoML before, but this time I want to do it in a way that aligns with digital decisions in general • AutoML is simply a digital decision for picking an ML pipeline!
  24. 24. Canvas (derived from “to canvass”) • A set of topics and questions that allows you to gather information about your business and strategy, reflect, brainstorm, and refine strategy • We will use a four-section Decision Canvas: 1. Define the problem or opportunity 2. Identify the decision strategy 3. Break down the decision 4. Define the solution as composable functions
  25. 25. Given data with labels, what is the best model to predict label of new data?
  26. 26. • Data and methods that can be combined into a pipeline • Pipeline with good cross-validation accuracy • Shortest pipelines with low variability in accuracy • Iterate over different pipelines • Given data with labels, what is the best model to predict the label of new data?
  27. 27. • What pipeline steps have worked well and gotten closer to the goal (better CV results) • Stop pipeline, set accuracy, constrain options • Select next pipeline to try • Pipeline meets goal, best so far • Labeled data, user preferences on pipeline • CV results • CV results, user action to stop • Data and methods that can be combined into a pipeline • Pipeline with good cross-validation accuracy • Shortest pipelines with low variability in accuracy • Iterate over different pipelines • Given data with labels, what is the best model to predict the label of new data?
  28. 28. model = best( ... ( learn( score( plan( input data, user preferences ) ) ) ) ), where ... is an iteration of learn(score(plan( ))) until all plans are tried or a target accuracy is met • model: given input data and user preferences, what is the best pipeline? • plan: given input data, user preferences, and known pipeline-element performance, what are the possible pipelines, ordered by potential performance and length? • score: given a potential pipeline, what is its accuracy? • learn: given pipeline performance, what is the pipeline-element performance? • best: given the known pipeline accuracies, which is the best one? (The canvas from slides 26–27 is repeated in the background.)
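A simplified, runnable rendering of this composition; the component catalog, the element-performance heuristic, and the budget are assumptions, with score() standing in for cross-validation and learn() for the R-learning step:

```python
from itertools import product
from sklearn.base import clone
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Assumed catalog: one choice per stage (preprocessor, classifier).
CATALOG = [
    [("standard_scaler", StandardScaler()), ("minmax_scaler", MinMaxScaler())],
    [("logistic", LogisticRegression(max_iter=1000)),
     ("random_forest", RandomForestClassifier(n_estimators=100))],
]

def plan(element_perf):
    """Possible pipelines, ordered by the summed learned element performance."""
    combos = list(product(*CATALOG))
    return sorted(combos, key=lambda c: -sum(element_perf.get(n, 0.0) for n, _ in c))

def score(combo, X, y):
    """Cross-validated accuracy of one instantiated pipeline."""
    pipe = make_pipeline(*[clone(est) for _, est in combo])
    return cross_val_score(pipe, X, y, cv=5).mean()

def learn(results):
    """Average accuracy per pipeline element, from all scored pipelines."""
    perf = {}
    for combo, acc in results.items():
        for name, _ in combo:
            perf.setdefault(name, []).append(acc)
    return {name: sum(v) / len(v) for name, v in perf.items()}

def best(results):
    """The highest-accuracy pipeline found so far."""
    return max(results, key=results.get)

def model(X, y, target_accuracy=0.95):
    element_perf, results = {}, {}
    while True:
        untried = [c for c in plan(element_perf) if c not in results]
        if not untried:
            break                              # all plans tried
        combo = untried[0]                     # best-ranked untried plan
        results[combo] = score(combo, X, y)
        element_perf = learn(results)
        if results[combo] >= target_accuracy:
            break                              # target accuracy met
    return best(results)
```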
  29. 29. Example Digitization: Should I bring my umbrella? • Traditionally, I would only observe the weather report, but I can now combine this with my online calendar to decide whether I’ll be outside • It stands to reason that I should bring an umbrella if I’ll be outside for long when it is most likely to rain • Whether I have an important meeting, a long distance to walk, or a lot of other things to carry will also factor into the decision about bringing an umbrella • I want to learn to predict what to bring, get a better estimate of walking times, and learn to manage my daily activities better in general • Optimizing a decision (bring umbrella) extends previous data (weather report) and fills in missing data (walking times), which is useful for other opportunities.
  30. 30. • Given today’s activities, should I bring my umbrella? (main PQ) • Given activities and step-monitor data, when can I assume I am outside? (predict time outside based on step data) • Given time outside and the weather forecast, what is the likelihood of getting wet? (combine outside and weather predictions) • Given the likelihood of getting wet and activities, when do I accept the recommendation to bring an umbrella? (learn the judgement decision to bring an umbrella (Y/N) conditioned on wet likelihood and activities) • Today’s activities, weather predictions • Am I outside when it’s raining? Will being wet matter? Cost of carrying it? • Don’t get caught out in the rain • What activities, and when will I be outside, based on step data • Carry umbrella given the day’s activities? Bring umbrella? • Happy with advice to bring umbrella – sent by text • Day’s activities (locations and times), weather service • Activities (name and time) and activity step-monitor data • Reply to text is Yes or No; a No is used to train the function that decides to send the text • Given today’s activities, should I bring my umbrella?
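A toy, hypothetical sketch of this decomposition as composable functions; every threshold, field name, and signature is an assumption for illustration only:

```python
def time_outside(activities, step_data):
    """Estimate hours outside from the day's activities and step-monitor data."""
    return sum(a["hours"] for a in activities if a.get("outdoor")) \
        + 0.5 * (step_data.get("outdoor_steps", 0) / 6000)

def wet_likelihood(hours_outside, rain_probability):
    """Combine time outside with the forecast into a chance of getting wet."""
    return min(1.0, rain_probability * hours_outside / 8.0)

def bring_umbrella(wet_chance, activities, threshold=0.3):
    """Recommend the umbrella when the wet chance outweighs the carrying cost."""
    important = any(a.get("important") for a in activities)
    return wet_chance > (threshold / 2 if important else threshold)

def decide(activities, step_data, rain_probability):
    hours = time_outside(activities, step_data)
    return bring_umbrella(wet_likelihood(hours, rain_probability), activities)

# decide([{"hours": 2, "outdoor": True, "important": True}],
#        {"outdoor_steps": 3000}, rain_probability=0.6)  # -> True with these toy numbers
```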
  31. 31. What do you take away? Observe my arguments about AutoML and the algorithm. Reason about the evidence and consider past experience. Decide to change your Data Science approach. Learn by experimentation and feedback.
  32. 32. Team Fangkai Yang (NVIDIA) Prof. Bo Liu (Auburn) Daoming Lyu (Auburn) Alexander Elkholy Krishnan Ram (intern) Jeremy Brown Sergey Ilinskiy
  33. 33. Take-away Solve AutoML (and digitization in general) like and with human experts!
  34. 34. www.globalbigdataconference.com Twitter : @bigdataconf #GAIC
