
170330 cognitive systems institute speaker series mark sherman - watson presentation v1


Dr. Mark Sherman, Director of the Cyber Security Foundations group at CERT within CMU’s Software Engineering Institute, presented “Experiences Developing an IBM Watson Cognitive Processing Application to Support Q&A of Application Security Diagnostics” as part of the Cognitive Systems Institute Speaker Series.

Published in: Technology

  1. Experiences Developing an IBM Watson Cognitive Processing Application to Support Q&A of Application Security (Software Assurance) Diagnostics
     © 2016 Carnegie Mellon University. [DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution.
     Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213
     SEI staff: Mark Sherman (PI), Lori Flynn (Tech lead), Chris Alberts (Assurance SME)
     Students: Christine Baek, Anire Bowman, Skye Toor, Myles Blodnick
  2. Distribution Statements
     Copyright 2016 Carnegie Mellon University
     This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
     References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute.
     NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
     [Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.
     This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at permission@sei.cmu.edu.
     Carnegie Mellon® is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University. IBM, BlueMix, IBM Watson, the IBM Watson Logo, and the Watson Avatar are trademarks of International Business Machines Corp. SparkCognition, SparkSecure and their logos are trademarks of SparkCognition Corp. CWE, Common Weakness Enumeration and the CWE logo are trademarks of MITRE Corp. CMU Language Technologies Institute logo is a trademark of Carnegie Mellon University. Apache Lucene, Apache Solr and their respective logos are trademarks of the Apache Software Foundation. “Jeopardy!” and Jeopardy Productions Inc. are trademarks of Sony Corp. Reprint Courtesy of International Business Machines Corporation, © (2013) International Business Machines Corporation.
     DM-0004218
  3. Experiences Developing an IBM Watson Cognitive Processing Application
     • Project Summary
     • Project Setup
     • Application Performance
     • Application Construction
     • Lessons Learned
     • Availability
  4. IBM Watson Made an Impressive Introduction
     Source: https://www.youtube.com/watch?v=P18EdAKuC1U (IBM Research); “Jeopardy!” courtesy Jeopardy Productions Inc.; image reprint courtesy of International Business Machines Corporation, © (2013) International Business Machines Corporation
  5. Can DoD Use IBM Watson to Improve Assurance?
     • Acquisition programs generate voluminous documentation
     • Assurance is based on assembling and reviewing relevant evidence from documents
     • Finding appropriate evidence or explanations can be challenging
     Q: Can typical developers build IBM Watson applications to support an assurance review?
  6. Key Takeaways
     • You do not need a PhD in AI or Natural Language Processing to build IBM Watson applications on BlueMix
     • Significant automation will be required for corpus (knowledge database) preparation, potentially a larger effort than application development
     • A subject matter expert is needed to help craft document structure
     • End user involvement is needed to guide and improve training
     • IBM Watson is one of many tools to bring to bear for cognitive processing applications
  8. Approach: Simulate a Development Process
     Assemble team of assurance experts
     • Determine interesting questions
     • Select appropriate documents
     • Define training (ground truth)
     Assemble team of developers
     • Experienced programmers
     • No specific expertise in artificial intelligence or natural language processing
  9. Preparation: Defining the Problem
     Focus assurance questions around information generated during source code evaluation
     Three internal experts in secure coding and assurance spent ~2 weeks in preparation:
     • Focus on building a corpus of CERT Secure Coding Standards and MITRE’s CWEs
     • Developed sample training questions and answers
     • Developed document fragmentation strategy
  10. Example Original Document: CERT INT33-C Rule
      Corpus included:
      • ~400 CERT rules
      • ~700 CWEs
      Note: only top of rule shown.
  11. Example Original Document: CERT INT33-C Rule - Parts
      Each rule or CWE resulted in about 11 Solr documents
      • Whole rule or CWE is a Solr document
      • Key sections are Solr documents
      Many different formats within document
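The fragmentation described on this slide (one Solr document for the whole rule, plus one per key section) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `SolrDoc` class, the `fragment_rule` helper, and the section texts are hypothetical, and only the `entire_rule`, `risk`, `tools` and `platform` field names come from the slides.

```python
from dataclasses import dataclass


@dataclass
class SolrDoc:
    header: str      # rule or CWE identifier, e.g. "INT33-C"
    title: str
    field_name: str  # schema field that this fragment fills
    text: str


def fragment_rule(header, title, sections):
    """Split one rule into a whole-rule document plus one document per section.

    `sections` maps a schema field name (e.g. "risk") to that section's text.
    Empty sections produce no document.
    """
    whole = "\n\n".join(text for text in sections.values() if text.strip())
    docs = [SolrDoc(header, title, "entire_rule", whole)]
    for field_name, text in sections.items():
        if text.strip():
            docs.append(SolrDoc(header, title, field_name, text.strip()))
    return docs


# Placeholder section texts; a real rule has many more sections.
docs = fragment_rule(
    "INT33-C",
    "placeholder title",
    {"risk": "placeholder risk text", "tools": "placeholder tool list", "platform": ""},
)
print([d.field_name for d in docs])
```

Keeping the whole rule as its own document alongside the fragments lets a query match either the full text or the one section that answers it.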
  13. Example Query of Application
      Query: “what is the risk of INT33-C”
  14. Recall of Documents in Application
      Recall is defined as selecting a set of documents relevant to a query.
      Note the recall: all returned documents are related to INT33-C
  15. Precision of Documents in Application
      Precision refers to the priority ordering of documents answering a query.
      Note the precision: the exact subtext describing risk is produced as the best document (INT33-C – Risk Overview)
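The slides use "recall" for retrieving a relevant set and "precision" for ordering that set well. One rough way to put numbers on both is sketched below. The choice of recall-at-k and reciprocal rank is an assumption made for illustration (both are standard information-retrieval measures), the slides do not say how the project scored results, and the document ids are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, best_id):
    """1 / position of the best document in the ranking, or 0.0 if absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == best_id:
            return 1.0 / position
    return 0.0


# Hypothetical ranking for "what is the risk of INT33-C".
ranked = ["int33c-risk", "int33c-whole", "int33c-tools", "unrelated"]
relevant = {"int33c-risk", "int33c-whole", "int33c-tools"}

print(recall_at_k(ranked, relevant, k=3))      # all relevant fragments retrieved
print(reciprocal_rank(ranked, "int33c-risk"))  # best fragment ranked first
```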
  16. Google Results from Example Query
      • Entire document found, no subcomponents
      • “Recall” is spread across irrelevant documents
        - (No match quality provided)
      • Imprecise excerpt produced
      Example result: INT33-C. Ensure that division and remainder operations do not result … https://www.securecoding.cert.org/…/c/INT33-C. =Ensure+that+dividion+and+remaind…
  18. Corpus Construction
      Two phases: Document Preparation and Training Preparation
      • Scanning or scraping sources
      • Canonical reformatting
      • Subdocument extraction and numbering
      • Determine important key words (schema)
      • Manually define key true Q&A
      • Automate volume training examples
  19. Corpus Training
      Inputs:
      • Config.zip: customized settings for Apache Solr (including schema)
      • Data.json
      • ground_truth.csv
      IBM Watson loads the user configuration and data onto Apache Solr. Watson then trains a machine learning model based on known relevant results, and uses this model to provide improved results to end users in response to their questions.
      NoSQL database used to hold corpus
  20. Question-and-Answer Application Flow
      • Q&A web-based user interface based on Django and Bootstrap
      • Natural language question (generating field name query)
      • Parse and select from ranked list of returned documents
      • IBM Watson Retrieve and Rank
  21. Application Development Timeline
      4 student interns working full time
      • 2 graduate students
      • 2 undergraduate students
      No IBM Watson experience
      Used Python and JSON interfaces
      [Figure: development timeline, 11 weeks]
      *R&R – IBM Watson Retrieve and Rank service interface on BlueMix
  22. Database Schema Definition
      • XML file used by Watson and passed through to Solr
      • Defines the key concepts (keywords) for the domain of discussion
      • Watson extends keyword types (e.g., “Watson text”)
      • 18 fields defined (16 significant, 2 system used)
      Excerpt:
        <field name="risk" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="riskSeverity" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="riskProbability" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="remediationCost" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="attackerExploit" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="tools" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
        <field name="all_compliant" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
      Note: Schema included in larger configuration file needed by Solr and Watson
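Because every corpus document has to use field names that the schema declares, a small pre-upload check can catch mismatches early. The sketch below is an assumption-laden illustration: the `<fields>` wrapper element, the helper names, and the sample document are hypothetical, and only the `<field>` declarations are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Three of the field declarations from the slide, wrapped in a hypothetical
# <fields> root so the fragment parses on its own.
SCHEMA_FRAGMENT = """
<fields>
  <field name="risk" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
  <field name="riskSeverity" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
  <field name="tools" type="watson_text_en" indexed="true" stored="true" multiValued="true"/>
</fields>
"""


def schema_field_names(xml_text):
    """Return the name attribute of every <field> element, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [field.attrib["name"] for field in root.iter("field")]


def unknown_fields(doc, known_names, system_fields=("id",)):
    """List keys in a corpus document that the schema does not declare."""
    allowed = set(known_names) | set(system_fields)
    return sorted(key for key in doc if key not in allowed)


names = schema_field_names(SCHEMA_FRAGMENT)
print(names)
print(unknown_fields({"id": "abc", "risk": "text", "riskLevel": "text"}, names))
```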
  23. Corpus (Knowledge Database)
      • JSON-formatted file used by Watson to define retrievable Solr documents
      • Over 15,000 Solr documents in corpus
      Excerpt:
        "add": { "doc": { "id": "30cb91d64d5d34d134bc0c0ff6763018eba06769d8650bd6fb243bdadede1741", "header": "CWE-102", "title": "Struts: Duplicate Validation Forms", "general_text_definition": "The application uses multiple validation forms ... does not expect. " } },
        "add": { "doc": { "id": "59efe2fd05c46c981cc871137e8e8ff176917f1757299ae823e6191d1e167fb2", "header": "CWE-103", "title": "Struts: Incomplete validate() Method Definition", "attackerExploit": "Other - Technical Impact: OtherDisabling the .... to launch a\n\t\t\t\t\t\t\tbuffer overflow attack." } },
        "add": { "doc": { "id": "7b12c1cdf377f547b486d77454954b0514ffda36ebac3b90948af4d5df4c4735", "header": "CWE-104", "title": "Struts: Form Bean Does Not Extend Validation Class", "entire_rule": "<table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" id=\"MainPane\".... \n</tr>\n</table>" } },
        "add": { "doc": { "id": "6467eb68eb0361826cc06b15d88bc219118cbb44f59f0692dd427e95f0e5b0b4", "header": "CWE-104", "title": "Struts: Form Bean Does Not Extend Validation Class", "platform": "\nJava\n" } }
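With more than 15,000 documents, a corpus file like this is best emitted by a script. The sketch below shows one way to do that under stated assumptions: the slide's ids are 64 hexadecimal characters, which matches the length of a SHA-256 digest, but how the project actually derived them is not stated, so `doc_id` is a guess. Because the file repeats the `"add"` key, it cannot be produced from a single Python dict, so the entries are joined by hand. Helper names and sample text are hypothetical.

```python
import hashlib
import json


def doc_id(header, field_name, text):
    """Derive a stable 64-hex-character id (assumption: SHA-256 of the content)."""
    key = f"{header}|{field_name}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def corpus_json(fragments):
    """Serialize (header, title, field_name, text) tuples as repeated "add" commands."""
    entries = []
    for header, title, field_name, text in fragments:
        doc = {
            "id": doc_id(header, field_name, text),
            "header": header,
            "title": title,
            field_name: text,
        }
        # json.dumps handles escaping of quotes, newlines and tabs in the text.
        entries.append('"add": ' + json.dumps({"doc": doc}))
    return "{\n" + ",\n".join(entries) + "\n}"


corpus_text = corpus_json([
    ("CWE-102", "Struts: Duplicate Validation Forms", "general_text_definition", "placeholder text"),
    ("CWE-104", "Struts: Form Bean Does Not Extend Validation Class", "platform", "\nJava\n"),
])
print(corpus_text)
```

Letting `json.dumps` do the escaping avoids the hand-formatting mistakes that a strict loader would reject.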
  24. Training Materials
      CSV-formatted file used by Watson to establish context and cognitive baseline
      • Collection of questions and “answers” (Solr documents)
      • Each question has 3 to 4 ranked answers
      • Each schema keyword needs to be exercised in 50-62 questions
      • Each original (unfragmented) document needs to be an answer in 100-200 questions
      Excerpt:
        What are the common consequences of CWE-125?,4ab71dc7be9b63aa9fc56e66e75dbeec6a98ec9074b8f427f592dc19fe3a2201,5,448e86d7880eaa27d71b4810d7bfcf776f8ff434f257a7fb4b6dd2c5e47899a8,3,69e7b7e822c017e611d008cb16a9de996535661ca8e9e1fdfdb26a7c27634b01,1,16a76232b606883d19fe71946d72b439676cebeac8293e8d024e59d04f4ad56b,2,,
        What are the common consequences of CWE-127?,874b072810a74d4dc9583449ad2f873d6f5c0f0065e4aaecbed60c1240abe251,5,fab02c56385917ab1808912649abcd9f3d5db63ecd451ac382ef786c51d4440a,3,dea8ae7d527ee0c53741b5a69535332c288b7fe735bb125358aa2198dadff946,1,f1844eace7f0a7f84fa3c16156aff1c16c111def85dc167b63092cadfc21bfb0,2,,
        What is the risk likelihood of CWE-267?,d552947662106e8a191d944f270c9044173528092967f3ce32d18d06c8bffd93,5,855ce52ec9156bc7e272a1a02df53ee850e21f1d665ac97aeb90ef3c1763010a,3,fb6b4baf311808b97333ac39a6cd7949603e75de782a6a84c6aeacec3dd89e97,1,f16964c6b2a4e2dad3d52c07f8432a0a7476a04e886e39cc41aa45db71d723a8,2,,
        Give me the programming languages that apply to CWE-234.,f6c1ed915e63dcd0a425d96fc4b4596e8952996532b835dd3cdcbb3a67465f03,5,ff56bc00960e1a86f14fe36f55e3d667245c58ede48d033aae06943343ab3085,3,9ec316aa9d7670e5787397f1cdc18c26a01217ef3162d21472988f2bff3fb4df,2,,,,
      Over 150,000 questions generated, so generation needs to be mechanical
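Producing this many rows by hand is impractical, so the rows are a natural target for template-based generation. The sketch below illustrates the idea only: the question templates, helper names, and placeholder ids are hypothetical, and the row layout (question, then alternating document id and relevance score) is inferred from the excerpt on this slide. The trailing empty columns seen in the excerpt are not reproduced.

```python
import csv
import io

# Hypothetical question templates, keyed by schema field name.
QUESTION_TEMPLATES = {
    "risk": ["What is the risk of {header}?", "How risky is {header}?"],
    "tools": ["Which tools detect {header}?"],
}


def ground_truth_rows(header, ranked_answers, templates=QUESTION_TEMPLATES):
    """Build one row per (field, template).

    `ranked_answers` maps a field name to a list of (doc_id, relevance) pairs.
    Each row is: question, doc_id, relevance, doc_id, relevance, ...
    """
    rows = []
    for field_name, answers in ranked_answers.items():
        for template in templates.get(field_name, []):
            row = [template.format(header=header)]
            for answer_id, relevance in answers:
                row += [answer_id, str(relevance)]
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows with the csv module so commas in questions are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = ground_truth_rows(
    "INT33-C",
    {"risk": [("id-a", 5), ("id-b", 3)], "tools": [("id-c", 5)]},
)
print(to_csv(rows))
```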
  25. Post Processing the Answer
  27. Lessons Learned From Project
      [Figure labels: Theory, Practice]
  28. Lessons Learned: System Behavior
      Answers must be found in a single document
      Retraining means reloading: feedback is collected externally to the application and then used to modify or augment the original training material
      • Challenging when reload runs overnight
      Reconciling field names when combining multiple corpora requires significant effort in reconciling training
      • Field names become a key architectural consideration when designing a system
      Recall (getting a good set of answers) is relatively easy; precision (getting the set in the best order) is difficult
      • Best received advice was to carry out user studies to help guide training data
      Plan for technology evolution
  29. Watson’s Interfaces for Cognitive Querying Evolved Over Time
      • UIMA (Unstructured Information Management Architecture) [Watson Pathways]
      • Question and Answer (QAAPI) with local infrastructure
      • QAAPI with BlueMix infrastructure
      • Retrieve and Rank (R&R) with BlueMix infrastructure
      • R&R with Natural Language Classifier (Beta) with BlueMix infrastructure
      Organization of technology rapidly evolved
      • Splitting some components into distinct services
      • Combining some services into usable chunks
      • Ease-of-use interfaces delivered in open source (out of product cycle)
      Project focused on using “Retrieve and Rank” on BlueMix
      • Available support from IBM
      • Combined Watson Pathways for Concept Expansion, Concept Insights and Question-and-Answer
  30. Lessons Learned: Corpus Creation
      Corpus and training preparation requires significant automation by the developer
      • Document segmentation, i.e., creating many Solr documents from a single original document
      • Locating Solr documents with matching field names (Solr schema)
      • Generating 50-62 training question and answer lists per field name
        - Generating 100-200 questions per original document (e.g., rule)
      • Retrieve and Rank (Solr) is meticulous about input formats, which are best generated by machine
      • Large supporting Solr document parts, e.g., images, are supported indirectly through references in the corpus
  32. Disposition of Materials
      • Government use rights apply.
      • IBM Watson software (and any dependencies) must be licensed from IBM.
      • SparkCognition is an IBM Watson business partner (independent software vendor) and has licensed the project materials from CMU for use in their products.
  33. We Want to Thank and Acknowledge Collaborators
      • SparkSecure team at SparkCognition
      • IBM Watson team at IBM
      • Prof. Eric Nyberg, Language Technologies Institute, School of Computer Science, CMU
      And our student interns: Christine Baek, Anire Bowman, Skye Toor and Myles Blodnick
  34. Contact Information
      Mark Sherman, Technical Director, Cyber Security Foundations
      Telephone: +1 412.268.9223
      Email: mssherman@sei.cmu.edu
