
Machine Learning and Social Good

AI and Machine Learning can be used for the greater good. Why should AI be applied for social good, and what are the main challenges that have to be tackled when harnessing its power? Find out more here.


  1. © 2020 Splunk Inc. Machine Learning for Social Good – Dr. Greg Ainslie-Malik, Machine Learning Architect
  2. Forward-Looking Statements: During the course of this presentation, we may make forward-looking statements regarding future events or plans of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us, and that actual events or results may differ materially. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements made herein. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release. Splunk, Splunk>, Data-to-Everything, D2E, and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.
  3. Agenda: 1. Introduction to Machine Learning 2. Common challenges with Machine Learning 3. Where have we seen Machine Learning used for social good? Anomaly detection, fraud detection, learning analytics 4. What else are we doing to promote good use of Machine Learning?
  4. Introduction to Machine Learning
  5. What is Machine Learning? Deep Learning sits within Machine Learning, which sits within Artificial Intelligence (AI). • AI means any type of algorithm or programme that allows computers to mimic human behaviour • ML is a subset of AI that allows machines to improve at a task over time • Deep Learning is a type of machine learning that is based on neural networks
  6. What is Machine Learning? Classic programming turns data plus rules into outcomes; machine learning turns data plus outcomes (supervised only) into rules.
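The "data plus outcomes into rules" contrast above can be sketched in a few lines. This is a toy illustration, not anything from the deck: the messages, the "spam" task, and the single-word rule learner are all hypothetical.

```python
# Classic programming vs. machine learning, illustrated with a toy
# spam filter. All data and rules here are hypothetical.

def classic_filter(msg):
    # Classic programming: we hand-write the rule ourselves.
    return "free money" in msg.lower()

def learn_rule(examples):
    # Machine learning (supervised): derive the rule from data + outcomes.
    # Here we "learn" which single word best separates the two classes.
    words = {w for msg, _ in examples for w in msg.lower().split()}
    def accuracy(word):
        return sum((word in msg.lower()) == is_spam for msg, is_spam in examples)
    best = max(words, key=accuracy)
    return lambda msg: best in msg.lower()

examples = [
    ("win free prize now", True),
    ("free vouchers inside", True),
    ("meeting moved to 3pm", False),
    ("lunch on friday?", False),
]
learned_filter = learn_rule(examples)
```

The learned rule keys on whichever word best separates the labelled outcomes, rather than on a condition a programmer wrote by hand.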
  7. Why Use Machine Learning? Observations from Splunk customers: identify anomalies or ‘unknown unknowns’, improve alert accuracy, and highlight weak relationships.
  8. How Machine Learning Fits into Splunk: every search can use machine learning. Data from IT, security, OT/industrial assets, and consumer and mobile devices feeds real-time alerts that can send an email, file a ticket, send a text, flash lights, or trigger a process flow via third-party applications, smartphones and devices.
  9. Common Challenges with Machine Learning
  10. Problem Statement: there is a lack of trust in Machine Learning, largely caused by the limited transparency or explainability of most Machine Learning processes. This makes it difficult to identify negative bias when applying Machine Learning.
  11. Most organizations’ data is still dark – untapped, unanalysed, unowned. 60% of organizations report that the majority of their data is still dark. (Splunk Inc., “State of Dark Data Report”, May 2019)
  12. Our world never stops evolving. How can we handle the half-life of data?
  13. Use of AI: globally, 61%–67% of respondents saw value in AI for their organizations, and 60%–70% believe they will be using AI across IT, operations and talent management in the future. And yet only 10%–15% say their organizations are deploying AI for use cases today, and while only 12% say that AI is currently guiding their business strategy, 61% expect it to do so in the next five years. Organizations admit they’re not ready for AI. Their top four concerns: 1. Lack of trained AI experts (81%) 2. Lack of understanding of AI (80%) 3. Not knowing what can be automated (78%) 4. Difficulty successfully wrangling the data (78%)
  14. Do you know what’s happening? Can you turn data into action? How do you build for the future?
  15. Key Takeaways: 1. Try to gain as much visibility of your data as possible 2. Minimise the delivery time for that data 3. Invest in data skills
  16. Machine Learning for Social Good: example case studies
  17. Finding Potential Cyber Security Incidents: identifying anomalies in massive datasets
  18. Use Case: Proxy Communication Investigation Workflow – https://conf.splunk.com/files/2019/slides/SEC1374.pdf
  19. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m
  20. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | fit DensityFunction count by "HourOfDay" into df_bots_dns | table _time count IsOutlier(count)
  21. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | apply df_bots_dns threshold=0.03 | table _time count IsOutlier(count)
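The fit/apply pattern on the slides above can be sketched outside Splunk: fit a distribution to the event counts for each hour of day, then flag any count whose probability density falls below the threshold. This is a rough stdlib sketch with hypothetical data, using a normal approximation; Splunk's DensityFunction can fit other distributions too.

```python
# Sketch of DensityFunction-style anomaly detection: fit per-group
# distributions (here, by hour of day), then flag low-density counts.
from statistics import NormalDist
from collections import defaultdict

def fit_density(rows):
    """rows: list of (hour_of_day, count). Returns hour -> fitted NormalDist."""
    by_hour = defaultdict(list)
    for hour, count in rows:
        by_hour[hour].append(count)
    return {h: NormalDist.from_samples(c) for h, c in by_hour.items()}

def is_outlier(model, hour, count, threshold=0.03):
    # Mirrors `apply df_bots_dns threshold=...`: low density => outlier.
    return model[hour].pdf(count) < threshold

# Hypothetical training data: hour 02 is normally quiet.
training = [("02", c) for c in [10, 12, 11, 9, 10, 13, 11, 10]]
model = fit_density(training)
```

A typical count for the hour sits in the dense region of the fitted distribution, while an extreme count lands in a near-zero-density tail and is flagged.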
  22. Using the DensityFunction to Find Anomalies: | summary df_bots_dns
  23. Using the DensityFunction to Find Anomalies: | summary df_bots_dns – one hour of day has a much bigger standard deviation and a much higher mean than the other times of day, and none of the times of day have many training points.
  24. Using the DensityFunction to Find Anomalies: | tstats count WHERE (index=botsv2) BY _time span=60m | eval HourOfDay=strftime(_time, "%H") | apply df_bots_dns threshold=0.003 show_density=true | where 'IsOutlier(count)'>0 | join HourOfDay [| summary df_bots_dns | table HourOfDay cardinality mean std] | table _time count ProbabilityDensity(count) cardinality mean std | eval distance_from_mean=abs(count-mean), deviations_from_mean=abs(count-mean)/std – Reduce the threshold and include the probability density in the results.
  25. (Same search) Filter the data to only show the anomalies.
  26. (Same search) Join with the summary data to include the cardinality, mean and standard deviation.
  27. (Same search) Calculate additional fields using the mean and standard deviation that describe how extreme each outlier is.
  28. (Same search) The complete search, putting the steps together.
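The final `eval` in the search expresses each anomaly's severity in standard-deviation units (a z-score-style measure). A stdlib sketch of that enrichment; the counts below are hypothetical stand-ins for the hourly data and `| summary df_bots_dns` statistics.

```python
# Sketch of the distance_from_mean / deviations_from_mean enrichment.
from statistics import mean, stdev

hourly_counts = [120, 130, 125, 118, 122, 900]  # 900 is the flagged outlier
mu, sigma = mean(hourly_counts[:-1]), stdev(hourly_counts[:-1])

def describe_outlier(count, mu, sigma):
    distance_from_mean = abs(count - mu)          # raw distance
    deviations_from_mean = distance_from_mean / sigma  # distance in std devs
    return distance_from_mean, deviations_from_mean

dist, devs = describe_outlier(900, mu, sigma)
```

Reporting "N standard deviations from the mean" gives analysts a scale-free way to rank anomalies across hours with very different typical volumes.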
  29. Identifying Fraud: finding anomalies in credit card transactions, prescriptions and accesses to patient data
  30. Credit Card Fraud Example: common approaches for exploring transactional data – group like with like; data is often “batch” loaded; often proactively searching for unknown unknowns.
  31. Enrich the Transactions: did the region change between card transactions? Did the merchant change? Calculate the time delta between card transactions.
  32. Synthesize More Context: too quickly between regions? Too quickly between merchants? Average merchant/region change by number of transactions, aggregate counts per card, and standard deviation of time delta/amount against averages.
  33. Prep for Clustering and Visualization: 1. StandardScaler – normalize the distribution 2. Principal Component Analysis (PCA) – reduce to 3 dimensions
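The two prep steps can be sketched in stdlib Python. To keep the PCA step closed-form, this hypothetical example uses just two features and projects to one dimension; in practice you would use the MLTK's StandardScaler and PCA (or scikit-learn) across many features and keep three components, as the slide describes.

```python
# Sketch of StandardScaler + PCA on two hypothetical per-card features.
import math
from statistics import mean, pstdev

def standard_scale(column):
    # StandardScaler: shift to zero mean, scale to unit variance.
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

def pca_1d(xs, ys):
    # Leading principal component of two scaled features, projected to 1-D.
    n = len(xs)
    cxx = sum(x * x for x in xs) / n
    cyy = sum(y * y for y in ys) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # angle of top eigenvector
    return [x * math.cos(theta) + y * math.sin(theta) for x, y in zip(xs, ys)]

# Hypothetical features: transaction amount and time delta between txns.
amounts = [12.0, 15.0, 11.0, 14.0, 250.0]
deltas = [3600.0, 3500.0, 3700.0, 3650.0, 12.0]
projected = pca_1d(standard_scale(amounts), standard_scale(deltas))
```

Scaling first matters: without it, the large-magnitude time deltas would dominate the principal component and drown out the amount feature.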
  34. Finally – Cluster with KMeans
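The clustering step itself is Lloyd's algorithm. A minimal stdlib sketch of what KMeans does with the reduced points; the points and k=2 are illustrative, and in Splunk this would be the MLTK's KMeans algorithm rather than hand-rolled code.

```python
# Minimal Lloyd's-algorithm KMeans on 2-D points (illustrative data).
import math
import random

def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=50, seed=42):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels

# Two well-separated groups of transactions and nothing in between.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.2, 4.9)]
centroids, labels = kmeans(points, k=2)
```

In the fraud workflow, small tight clusters (or points far from every centroid) are the candidates worth investigating as unknown unknowns.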
  35. “At a time when overdose deaths are at crisis levels across the country and in New York City, largely due to the opioid epidemic, healthcare providers have a responsibility to safeguard against any potential diversion of drugs. NewYork-Presbyterian is taking a leading role in protecting the public by implementing highly effective controls to avoid the illegitimate use of controlled substances. Ultimately, we hope that other hospitals benefit from this new platform as well.” – Jennings Aske, senior vice president and chief information security officer at NewYork-Presbyterian. See https://medcitynews.com/2019/02/splunk-and-newyork-presbyterian/ and https://www.healthcareitnews.com/news/newyork-presbyterian-working-machine-learning-analytics-combat-opioid-crisis
  38. Together, NewYork-Presbyterian and Splunk are also creating an enhanced data analytics solution that investigates unauthorized access to patient records.
  40. Detect the anomaly…
  41. …drill down into that user…
  42. Predicting Student Outcomes: predicting student grades based on their digital interactions with university IT, and identifying students that are at risk of dropping out
  44. What Data Scientists Really Do: data preparation accounts for about 80% of the work of data scientists. (“Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes, Mar 23, 2016)
  45. Predicting Student Outcomes: index=oulad code_module=AAA
  46. Predicting Student Outcomes: index=oulad code_module=AAA | eval weighted_score=score*(weight/100) | eval student_code=id_student."_".code_module."_".code_presentation | bin _time span=1mon | stats sum(sum_click) as sum_clicks sum(weighted_score) as month_score avg(score) as average_score by student_code _time | streamstats sum(month_score) as cumulative_score last(average_score) as last_average count by student_code | eventstats max(count) as course_length | eval average_score=if(average_score>0,average_score,if(last_average>0,last_average,0)), cumulative_score=if(cumulative_score>0,cumulative_score,0), module_perc_complete=count/course_length | join student_code [| inputlookup student_info.csv | eval student_code=id_student."_".code_module."_".code_presentation | table student_code age_band highest_education imd_band studied_credits final_result] | table _time student_code sum_clicks average_score cumulative_score module_perc_complete studied_credits age_band highest_education imd_band final_result | outputlookup oulad_aaa.csv
  47. (Same search) Calculate a weighted score – eval weighted_score=score*(weight/100) – and create a unique identifier for each student and module combination.
  48. (Same search) Calculate the number of clicks, total score and average score for each student in each month.
  49. (Same search) Calculate the cumulative score over time for each student, carry the previous average score into each month, and create a rolling count.
  50. (Same search) Find the highest rolling count to use as the course length.
  51. (Same search) Fill in empty average and cumulative results, and calculate the module percentage complete.
  52. (Same search) Enrich the data with additional context for each student.
  53. (Same search) Select only the fields we are interested in and save to a lookup.
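The streamstats and fill-forward steps above can be sketched in stdlib Python: accumulate each student's monthly score, and carry the last non-zero average forward into months with no assessments. The rows below are hypothetical, not from the OULAD data.

```python
# Sketch of the streamstats sum/last + fill-forward logic per student.
from collections import defaultdict

def enrich(rows):
    """rows: time-ordered (student_code, month_score, average_score) tuples."""
    cumulative = defaultdict(float)
    last_average = defaultdict(float)
    out = []
    for student, month_score, average in rows:
        cumulative[student] += month_score      # streamstats sum(month_score)
        if average > 0:
            last_average[student] = average     # streamstats last(average_score)
        out.append((student, cumulative[student], last_average[student]))
    return out

rows = [
    ("s1_AAA_2014J", 10.0, 55.0),
    ("s1_AAA_2014J", 0.0, 0.0),    # no assessment this month
    ("s1_AAA_2014J", 20.0, 60.0),
]
```

Without the fill-forward, assessment-free months would feed zeros into the model and make a steady student look like one whose performance collapsed.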
  54. Predicting Student Outcomes
  55. Predicting Student Outcomes: | inputlookup oulad_aaa.csv | search final_result!="Withdrawn" | sample partitions=10 seed=42 | search partition_number<7 | fit RandomForestClassifier final_result from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_oulad_aaa
  56. (Same search) Remove data for withdrawn students.
  57. (Same search) Select a random sample of 70% of the data.
  58. (Same search) Train a random forest classifier on the data.
  59. Predicting Student Outcomes: | inputlookup oulad_aaa.csv | search final_result!="Withdrawn" | sample partitions=10 seed=42 | search partition_number>6 | apply rf_oulad_aaa
  60. (Same search) Apply the random forest classifier to the remaining 30% of the data.
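The 70/30 split implied by `sample partitions=10 seed=42` followed by `partition_number<7` (train) and `partition_number>6` (test) can be sketched as follows; the 1000 integer rows stand in for the lookup's rows.

```python
# Sketch of seeded partitioning into a 70% train / 30% test split.
import random

def assign_partitions(rows, partitions=10, seed=42):
    # Each row gets a partition number 0-9; the seed makes it reproducible.
    rng = random.Random(seed)
    return [(row, rng.randrange(partitions)) for row in rows]

rows = list(range(1000))  # stand-ins for the lookup rows
partitioned = assign_partitions(rows)
train = [row for row, p in partitioned if p < 7]   # partition_number<7
test = [row for row, p in partitioned if p > 6]    # partition_number>6
```

Because the partitions are seeded and disjoint, the train and test searches see reproducible, non-overlapping subsets, so the accuracy measured on the held-out 30% is not inflated by rows the model was trained on.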
  61. Predicting Student Outcomes
  62. Predicting Student Outcomes. Train the model: | inputlookup oulad_aaa.csv | eval withdrawn=if(final_result="Withdrawn","Yes","No") | sample partitions=10 seed=42 | search partition_number<7 | fit RandomForestClassifier withdrawn from average_score cumulative_score module_perc_complete studied_credits sum_clicks age_band highest_education imd_band into rf_withdrawn_oulad_aaa. Test the model: | inputlookup oulad_aaa.csv | eval withdrawn=if(final_result="Withdrawn","Yes","No") | sample partitions=10 seed=42 | search partition_number>6 | apply rf_withdrawn_oulad_aaa
  63. Predicting Student Outcomes
  64. What Can Be Done to Promote Good Use of Machine Learning?
  65. UK Government First to Pilot AI Procurement Guidelines Co-Designed with World Economic Forum (https://www.weforum.org/press/2019/09/uk-government-first-to-pilot-ai-procurement-guidelines-co-designed-with-world-economic-forum/). “Splunk has supported the development of these guidelines and worked closely with the WEF and UK Government. We will help pilot them in the UK and believe the guidance will enable Governments across the world to transform citizen services and deliver ethically sound and beneficial AI-based solutions.” – Lenny Stein, Senior Vice President, Global Affairs, Splunk
  66. Work with the WEF: 1. Intent – provide information to non-specialists so that they can assess the suitability of ML for a given problem/solution 2. Current solution – procurement guidance for ‘unlocking public sector AI’: high-level procurement processes, best practices when evaluating an RFP, and a map for creating AI-related RFPs 3. Unlocking Public Sector AI go-live – expected in the coming months
  67. Thank You!
