Durkheim Project: Social Media Risk & Bayesian Counters


As cited by a 2012 TIME Magazine cover story (“One A Day”), suicide, particularly in the military, is a severe public health problem: veteran suicide rates are nearly double those of adults in the general U.S. population. To date, military efforts to understand and address the suicide crisis have had little success: “No program, outreach or initiative has worked against the surge in Army suicides, and no one knows why nothing works.” (Time) In this talk we describe how we built a real-time risk assessment framework with the US Veterans Administration, and how Hadoop and HBase are being used to build further systems, based on our new Bayesian Counters framework, to predict real-time risk. The Bayesian Counters framework was developed, in part, to predict military mental health risks. We are trying to help solve this complicated puzzle, with the goal of reducing suicidality among those who have served the nation.

Published in: Technology, Business

Slide notes:
  • CPU power has reached its limit (in the end, the speed of light is finite). Combining storage or processing capabilities across a distributed system of machines is non-trivial. RAM is faster than disk (RAM: ns; disk: ms). There are 1,832,160 feet in 347 miles. Disk moves at 50 m/s vs. 300,000,000 m/s. Can we do at least 1,000 feet (300 m)? Network? There is no “virtual memory”.
  • If we had all the time in the world (the universe is projected to last less than 1000 trillion years), we could (probably) get the exact answer. Some analytics companies: http://cetas.net/ (acquired by VMware); http://www.woopra.com (analyzes traffic to a website in real time); http://www.wibidata.com/ (our friends).
  • More recent column families are accessed more often

    1. The Durkheim Project: Social Media Risk & Bayesian Counters
       Hadoop Summit: June 27, 2013
       Chris Poulin: PATTERNS AND PREDICTIONS
       Alex Kozlov: Cloudera
       Disclaimers: This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. It is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number N10PC20221. The opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the Space and Naval Warfare Systems Center Pacific, IARPA, DOI/NBC, or the U.S. Government.
       © 2013 Patterns and Predictions
    2. Speakers
       Chris
       • Principal Investigator, DARPA DCAPS Poulin-Dartmouth Suicide Prediction Team
       • Former Co-Director, Dartmouth Metalearning Working Group (Theoretical Machine Learning)
       • Artificial Intelligence Instructor, US Naval War College
       • Principal, Patterns and Predictions (linguistics and prediction of financial events)
       … and have now read many suicide notes.
       Alex
       Principal Solutions Architect at Cloudera
       Ph.D. from Stanford University
       Data mining and statistical analysis at SGI, Hewlett-Packard
    3. Suicide is a hard societal problem, but why?
       • Stigma: Victims are socially outcast (i.e. disconnected).
       • Negative Topic: Intense negative emotion, and not a 'sexy' research topic by any means.
       • Freedom of Choice: Ultimately you can't stop someone from risky behaviors, or from many other activities that risk self-harm. And suicide is the ultimate act of personal risk.
       • Logistics: Even if you know what to look for, there are not enough clinicians to help the number of people suffering. Data privacy issues are as intense as, or more intense than, in banking.
       • Prediction: Accuracy (proper identification), false positives (stigmatization), false negatives (malpractice).
       • Deeper issues?: Recent growth in suicide may be related to something more systemically wrong, with suicide the symptom of something else going on.
    4. Durkheim
       • The project is named in honor of Emile Durkheim, a founding sociologist whose 1897 publication of Suicide defined early text analysis for suicide risk.
       • The team is a multidisciplinary group of artificial intelligence experts (machine learning and computational linguistics) and medical experts (psychiatrists).
       • www.durkheimproject.org
    5. Our Approach
       Social Problem: Opt-In is critical
         o Clear explanations for consent, no tricky EULAs
       Technical Problem: How to build a system that collects, stores, and analyzes data, and allows clinicians to react, at Internet scale?
       Architecture:
         1) Opt-In Interface Layer
         2) Data Collection Layer
         3) Storage Layer
         4) Machine Learning, Phase I
         5) Machine Learning, Phase II
         6) Automated Intervention
    6. 1) Opt-In Interface Layer
       We can't overemphasize the role of simplified user participation, for consent and privacy control, in our interface/interaction design.
    7. 2) Data Collection Layer
       The social media component is handled by a content aggregator (Gigya), which populates a Cassandra database.
    8. Data Collection Layer, Continued
       The Cassandra instances were built and maintained (by Scale Unlimited) to handle high-throughput storage. However, this is not the final destination of the data.
    9. 3) Storage Layer
       Eventually, the data is moved to the medical center (behind a HIPAA-compliant firewall at Dartmouth). Here it persists for ongoing research.
    10. 4) Machine Learning, Phase I
        In 2011, we initiated a study with the U.S. Department of Veterans Affairs (VA) to study 3 cohorts of 100 subjects each (Non-Psychiatric, Psychiatric, and Suicide Positive).
        • We developed linguistics-driven prediction models to estimate the risk of suicide.
        • These models were generated from unstructured clinical notes.
        • From the clinical notes, we generated datasets of single keywords and multi-word phrases.
        • We were able to predict suicide with 65% accuracy on a small dataset.
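The slide does not say how the keyword and phrase datasets were built. As a generic illustration only, the sketch below counts single words and two-word phrases (n-grams) in a piece of text. The function `extract_features`, its parameters, and the sample sentence are hypothetical and are not taken from the study.

```python
import re
from collections import Counter


def extract_features(note, max_phrase_len=2):
    """Count single keywords and multi-word phrases (n-grams) in one text.

    Hypothetical helper for illustration; not the study's actual pipeline.
    """
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", note.lower())
    features = Counter()
    for n in range(1, max_phrase_len + 1):
        # Slide a window of n consecutive words over the token list.
        for i in range(len(words) - n + 1):
            features[" ".join(words[i:i + n])] += 1
    return features


# Invented sample text, not real clinical data.
features = extract_features("Patient reports poor sleep. Poor sleep continues.")
print(features["poor sleep"], features["sleep"])
```

Note that this simple version ignores sentence boundaries, so a phrase such as "sleep poor" is also counted; a real pipeline would likely split on sentences first.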
    11. 5) Machine Learning, Phase II
        In 2011, we also initiated a study with Cloudera (Alex Kozlov) on a lightweight machine learning framework for detecting real-time risk at scale.
        • We wanted a clean statistical model for distributed inference (prediction).
        • We needed a more lightweight framework than Mahout.
        • We wanted to be able to trade off runtime vs. accuracy.
        • We wanted the prediction library to be eventually open sourced (Apache license) for the community.
        ‘Alpha’ Build @ http://durkheimproject.org/bcount/
        By Alex Kozlov <alexvk@cloudera.com>
    12. What is B-counts today? And Why?
        • Distributed aggregation of user events and correlations to fit into RAM of multiple machines
        • Smart client: Moves a substantial amount of logic to clients
        • Time: An explicit time dimension to support ‘recency analysis’
        • Based on HBase
        • Previous analysis (Poulin) had indicated that words and correlations are a good predictor of the target variable
        • Need a faster processing/response time (response time beats accuracy of the model)
        http://www.slideshare.net/Hadoop_Summit/bayesian-counters
    13. Time to Answer
        Examples
        • Advertising: if you don’t figure out what the user wants in 5 minutes, you have lost him
        • Intrusion detection: the damage may be significantly bigger a few minutes after a break-in
        • Mental health risk: you need to screen before negative actions occur
        [Figure: Value vs. time]
        http://cetas.net
        http://www.woopra.com
        http://www.wibidata.com/
    14. Solution: Time Stamped Hadoop
        • Key: subset of variables with their values + timestamp (variable length)
        • Value: count (8 bytes)
        [Figure: an index over Key 1 to Key 4, each paired with a Value; example query Pr(A|B, last 20 minutes)]
        Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)
        What if we want to access more recent data more often?
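A minimal in-memory sketch of the key/value idea above, assuming each key is a set of variable=value pairs with event timestamps attached, and that a windowed conditional probability is the ratio of two windowed counts. The class `TimedCounter`, its methods, and the sample events are hypothetical; they are not the B-counts API, and the real system keeps its counts in HBase column families rather than in Python lists.

```python
from collections import defaultdict


class TimedCounter:
    """Toy stand-in for time-stamped counters (hypothetical, not the B-counts API)."""

    def __init__(self):
        # key: frozenset of (variable, value) pairs -> list of event timestamps
        self._events = defaultdict(list)

    def increment(self, assignment, timestamp):
        """Record one event under every non-empty subset of its variables."""
        items = sorted(assignment.items())
        # Enumerate subsets so that marginal and joint counts both exist.
        for mask in range(1, 1 << len(items)):
            subset = frozenset(
                items[i] for i in range(len(items)) if (mask >> i) & 1
            )
            self._events[subset].append(timestamp)

    def count(self, assignment, now, window_seconds):
        """Number of matching events in the last window_seconds."""
        key = frozenset(assignment.items())
        start = now - window_seconds
        return sum(1 for t in self._events.get(key, []) if start <= t <= now)

    def conditional(self, a, b, now, window_seconds):
        """Pr(A | B, last window_seconds) = count(A and B) / count(B)."""
        joint = self.count({**a, **b}, now, window_seconds)
        given = self.count(b, now, window_seconds)
        return joint / given if given else None


counter = TimedCounter()
now = 1321040000  # arbitrary Unix timestamp for the example
counter.increment({"A": 1, "B": 1}, now - 60)    # 1 minute ago
counter.increment({"A": 0, "B": 1}, now - 600)   # 10 minutes ago
counter.increment({"A": 1, "B": 1}, now - 7200)  # 2 hours ago
p_recent = counter.conditional({"A": 1}, {"B": 1}, now, 20 * 60)
print(p_recent)
```

Storing one timestamp per event is the simplest way to answer arbitrary windows; bucketing counts into fixed windows (30 min, 2 hours, and so on), as the slide suggests for column families, trades that flexibility for much less storage.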
    15. A Bayesian Counter, in detail
        [Figure: HBase layout of a single counter. Labels: Counter/Table, Region (divide between), Column family, Column qualifier, File, Version, Value (data). Example values: Iris, [sepal_width=2;class=0], 30 mins, 2 hours, …, 1321038671, 1321038998, 15.]
    16. Command Line Implementation
    17. Syntax
        nb iris class=2 sepal_length=5;petal_length=1.4 300
        • class=2: Target Variable
        • sepal_length=5;petal_length=1.4: Predictors
        • 300: Time (seconds from now)
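To make the argument groups concrete, here is a small sketch that splits such a command into named parts. The function `parse_query` and the dictionary layout are hypothetical. Reading `nb` as the classifier name and `iris` as the table name is an assumption based on the neighbouring slides; this is not the actual B-counts parser.

```python
def parse_query(command):
    """Split a command like the one on the slide into named parts.

    Hypothetical helper; field names are assumptions, not the B-counts CLI.
    """
    classifier, table, target, predictors, seconds = command.split()
    target_name, target_value = target.split("=", 1)
    # Predictors are ';'-separated name=value pairs.
    predictor_map = dict(pair.split("=", 1) for pair in predictors.split(";"))
    return {
        "classifier": classifier,          # assumed: 'nb' selects Naive Bayes
        "table": table,                    # assumed: counter table name
        "target": (target_name, target_value),
        "predictors": predictor_map,
        "window_seconds": int(seconds),    # time window, seconds from now
    }


query = parse_query("nb iris class=2 sepal_length=5;petal_length=1.4 300")
print(query)
```

When typing such a command in a real shell, the predictor argument would need quoting, because `;` normally ends a shell command.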
    18. Current Classifier Support (alpha release)
        • Naïve Bayes: Pr(C|F1, F2, ..., FN) = (1/Z) Pr(C) Πi Pr(Fi|C)
        • Association rules: Confidence (A -> B): count(A and B)/count(A); Lift (A -> B): count(A and B)/(count(A) x count(B))
        • Nearest Neighbor: P(C) for k nearest neighbors, count(C|X) = ΣXi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
        • Clique ranking: I(X;Y) = Σx Σy p(x,y) log(p(x,y)/(p(x)p(y))), where x in X and y in Y. Using random projection, this can generalize to two abstract subsets of Z
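The first two formulas above need nothing but counts, which is what makes them a natural fit for a counter store. The sketch below shows that arithmetic on plain dictionaries. All function names and the sample counts are hypothetical and are not from the B-counts library. One assumption to flag: the textbook definition of lift is stated in probabilities, so with raw counts the ratio is multiplied by the total number of transactions `n`; the slide's count-only form differs from this by that constant factor.

```python
def naive_bayes(class_counts, feature_counts, features):
    """Pr(C | F1..FN) = (1/Z) * Pr(C) * prod_i Pr(Fi | C), estimated from counts.

    class_counts:   {class: count}
    feature_counts: {(feature, class): count}
    No smoothing is applied, so an unseen feature zeroes out a class.
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        if n_c == 0:
            continue  # a class with no observations cannot be scored
        score = n_c / total  # Pr(C)
        for f in features:
            score *= feature_counts.get((f, c), 0) / n_c  # Pr(Fi | C)
        scores[c] = score
    z = sum(scores.values())  # normalizer Z
    return {c: s / z for c, s in scores.items()} if z else scores


def confidence(count_ab, count_a):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    return count_ab / count_a


def lift(count_ab, count_a, count_b, n):
    """Lift(A -> B) = Pr(A and B) / (Pr(A) Pr(B)), written with counts and total n."""
    return n * count_ab / (count_a * count_b)


# Invented counts for illustration.
posterior = naive_bayes(
    class_counts={0: 50, 1: 50},
    feature_counts={("sepal_width=2", 0): 15, ("sepal_width=2", 1): 5},
    features=["sepal_width=2"],
)
print(posterior, confidence(20, 40), lift(20, 40, 25, 100))
```

A lift above 1 means A and B occur together more often than independence would predict; the invented counts here give a lift of 2.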
    19. Performance
        retail.dat example – 88K transactions over 14,246 items
          o Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2)
          o 10 ms per pattern on a 5 node cluster
    20. 6) Intervention
        Automated systems are coming online for potential patients and families seeking treatment, as well as passive intervention strategies (‘safety plans’).
    21. What's next?
        In 2013, we plan a variety of initiatives, including the launch of our clinical observation study, deployment of Bayesian Counters on live data, and seeking approval for an automated intervention study.
        • Launch Data Collection Study (CPHS #23781)… very soon
        • Deployment of B-Counts on live data for live monitoring
        • Intervention Research (Clinical Study Approval)
    22. Conclusion
        What is Durkheim? And what is the Bayesian Counters library? A near real-time classification library that, while still under development, you’re free to use. We hope that some help is coming to those in need…
    23. Team
        Chris Poulin, Director & Principal Investigator
        Paul Thompson, Study Co-Principal Investigator
        Thomas W. McAllister, M.D., Key Personnel
        Ben Goertzel, Ph.D., Key Personnel
        Brian Shiner, MD, Key Personnel
        Craig J. Bryan, PsyD, Advisor
        Linas Vepstas – Lead Machine Learning Programmer
        Brian Nauheimer – Technical Project Manager
        Chhean Saur – Lead Web/API Programmer
        Kevin Watters – Principal Programmer, Middleware
        Ken Krugler – Lead Distributed Systems Expert
        Ann Marion – User Experience (UX) Design
        Jane Nisselson – User Interface (UI) Design
        Andrew Chen – Social Media Applications Developer
        Alex Kozlov – Real-time/Distributed Classifier Development
        Vivek Magotra – Cassandra Database Developer
    24. THANK YOU
        Chris Poulin, Managing Partner, Patterns and Predictions, chris@patternsandpredictions.net
        Alex Kozlov, Principal Solutions Architect, Cloudera, alexvk@cloudera.com
        Note: We hope that you have found this talk useful and encouraging. However, if you are having thoughts of harming yourself, please call the Veterans Crisis Line at 1-800-273-8255 or 911.
        © 2013 Patterns and Predictions