Suicide Risk Prediction Using Social Media and Cassandra
#CASSANDRA13
Ken Krugler | President, Scale Unlimited
#CASSANDRA13
What we will discuss today...
* Using Cassandra to store social media content
* Combining Hadoop workflows with Cassandra
* Leveraging Solr search support in DataStax Enterprise
* Doing good with big data

Fine Print! This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific.
#CASSANDRA13
Obligatory Background
* Ken Krugler, Scale Unlimited - Nevada City, CA
* Consulting on big data workflows, machine learning & search
* Training for Hadoop, Cascading, Solr & Cassandra
#CASSANDRA13
Durkheim Project Overview
Including things we didn't work on...
#CASSANDRA13
What's the problem?
* More soldiers die from suicide than combat
* The suicide rate has gone up 80% since 2002
* Civilian suicide rates are also climbing
* More suicides than homicides
* Intervention after an "event" is often too late

[Graph of suicide rates]
#CASSANDRA13
What is The Durkheim Project?
* DARPA-funded initiative to help military physicians
* Uses predictive analytics to estimate suicide risk from what people write online
* Each user is assigned a suicidality risk rating of red, yellow or green

[Photo: Émile Durkheim]
Named after Émile Durkheim, the late-1800s sociologist who first used text analytics to help define suicide risk.
#CASSANDRA13
Current Status of Durkheim
* Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook
* Details at http://www.durkheimproject.org/
* Finished phase I, now being rolled out to a wider audience

Patterns and Predictions has its background expertise in predicting financial market events and trends from news, which led to the development of the predictive models used in Durkheim.
#CASSANDRA13
Predictive Analytics
* Guessing at state of mind from text
- "There are very few people in this world that know the REAL me."
- "I lay down to go to sleep, but all I can do is cry"
* Uses labeled training data from clinical notes
* Phase I results promising, for a small sample set
- An "ensemble" of predictors is a powerful ML technique
#CASSANDRA13
Clinician Dashboard
* Multiple views on patient
* Prediction & confidence
* Backing data (key phrases, etc.)

So this is the goal - give medical staff indications of who they should be most concerned about.
#CASSANDRA13
Data Collection
Where _do_ you put a billion text snippets?

The previous section was the project overview, which was work done by others in the project. Now we get to the part that we worked on, which involves Cassandra.
#CASSANDRA13
Saving Social Media Activity
* System to continuously save new activity
- Scalable data store
* Also needs a scalable, reliable way to access data
- Processed in bulk (workflows)
- Accessed at individual level
- Searched at activity level

For the current size of the project, MySQL would be just fine. But we want an architecture that can scale if/when the project is rolled out to everyone.
#CASSANDRA13
Data Collection
* Pink is what we wrote
* Green is in Cassandra
* Key data path in red

[Diagram: Exciting Social Media Activity → Gigya Service → Gigya Daemon → Durkheim Social API → Durkheim App, with the Users Table and Activity Table in Cassandra]
#CASSANDRA13
Designing the Column Families
* What queries do we need to handle?
- Always by user id (what we assign)
* We want all the data for a user
- Both for Users table, and Activities table
- Sometimes we want a date range of activities
* So one row per user
- And ordered by date in the Activities table
#CASSANDRA13
Users Table (Column Family)
* One row per user - row key is a UUID we assign
* Standard "static" columns
- First name, last name, opt_in status, etc.
* Easy to add more xxx_id columns for new services

row key | first_name | last_name | facebook_id | twitter_id | opt_in
#CASSANDRA13
Activities Table (Column Family)
* One row per user - row key is a UUID we assign
* One composite column per social media event
- Timestamp (long value)
- Source (FB, TW, GP, etc.)
- Type of column (data, activity id, user id, type of activity)

row key | ts_src_data | ts_src_id | ts_src_providerUid | ts_src_type

Remember we wanted to get slices of data by date? So we use the timestamp as the first (primary) ordering for the columns. We can use a regular millisecond timestamp since it's for one user - we assume we don't get multiple entries with the same timestamp.
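The timestamp-first composite column layout can be sketched in plain Python (a hypothetical in-memory model, not actual Cassandra driver code): each column name is a (timestamp, source, field) tuple, so a row's columns naturally sort by timestamp first, which is exactly what gives us date-range slices.

```python
# Hypothetical in-memory sketch of the Activities column family layout.
# Composite columns in Cassandra compare component by component, so
# modeling the column name as a (timestamp, source, field) tuple gives
# the same timestamp-first ordering.

activities = {}  # row key (user UUID) -> {composite column name -> value}

def add_activity(user_uuid, ts, source, fields):
    """Store one social media event as one composite column per field."""
    row = activities.setdefault(user_uuid, {})
    for field, value in fields.items():
        row[(ts, source, field)] = value

def slice_by_date(user_uuid, start_ts, end_ts):
    """Return columns whose timestamp falls in [start_ts, end_ts),
    in column order - the same result a Cassandra column slice gives."""
    row = activities.get(user_uuid, {})
    return sorted(
        (name, value) for name, value in row.items()
        if start_ts <= name[0] < end_ts
    )

add_activity("uuid1", 213, "FB", {
    "data": "I feel tired", "id": "FB post #32",
    "providerUid": "FB user #66", "type": "Status update"})
add_activity("uuid1", 307, "TW", {
    "data": "Where am I?", "id": "Tweet #17",
    "providerUid": "TW user #109", "type": "Tweet"})

# Slice: only the event with timestamp < 300 comes back.
early = slice_by_date("uuid1", 0, 300)
```

The same user row can hold many events; only the tuple ordering matters for slicing.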
#CASSANDRA13
Two Views of Composite Columns
* As a row/column view:

row key | 213_FB_data    | 213_FB_id     | 213_FB_providerUid | 213_FB_type
"uuid1" | "I feel tired" | "FB post #32" | "FB user #66"      | "Status update"

* As a key-value map:

"uuid1" → { 213_FB_data: "I feel tired", 213_FB_id: "FB post #32", 213_FB_providerUid: "FB user #66", 213_FB_type: "Status update" }
#CASSANDRA13
Implementation Details
* API access protected via signature
* Gigya Daemon on both t1.micro servers
- But only active on one of them
* Astyanax client talks to Cassandra
* Cluster uses 3 m1.large servers

[Diagram: Durkheim Social API on EC2 t1.micro servers → AWS Load Balancer → Durkheim App on EC2 m1.large servers]
#CASSANDRA13
Predictive Analytics at Scale
Running workflows against Cassandra data
#CASSANDRA13
How to process all this social media goodness?
* Models are defined elsewhere
* These are "black boxes" to us

213_FB_data    | 213_FB_id     | 213_FB_providerUid | 213_FB_type
"I feel tired" | "FB post #32" | "FB user #66"      | "Status update"
307_TW_data    | 307_TW_id     | 307_TW_providerUid | 307_TW_type
"Where am I?"  | "Tweet #17"   | "TW user #109"     | "Tweet"

Feature Extraction → Model → model | rating | probability | keywords

Models are data used by the PA engine to generate scores. We do not have (or want) access to the data used to generate the models. Generating a model is often NOT something that needs scalability: the amount of labeled data is typically pretty small, and training often works best on a single server.
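The scoring contract above can be sketched as follows (all names and thresholds here are hypothetical toys - the real models are black boxes to us): feature extraction turns activity text into features, and each model maps features to a rating, a probability, and the key terms that drove the score.

```python
# Hypothetical sketch of the scoring contract. The real Durkheim models
# are opaque; we only see their input (features) and output
# (rating, probability, keywords).

RISK_TERMS = {"tired", "cry", "alone"}  # toy stand-in for learned features

def extract_features(text):
    """Toy feature extraction: which risk-associated terms appear."""
    return {w for w in text.lower().split() if w in RISK_TERMS}

def toy_model(features):
    """Maps features to (rating, probability, keywords)."""
    p = min(1.0, 0.3 * len(features))
    rating = "red" if p > 0.5 else "yellow" if p > 0.2 else "green"
    return rating, p, sorted(features)

rating, p, keywords = toy_model(extract_features("I feel tired"))
```

An ensemble would run several such models and combine their outputs; each one still honors the same contract.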
#CASSANDRA13
Why do we need Hadoop?
* Running one model on one user is easy
- And n models on one user is still OK
* But when a model changes...
- all users with the model need processing
- and models can change frequently
* And when a user changes...
- that user needs processing with all models
- adding/removing models is also a change
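The fan-out described above is what makes batch processing worthwhile; a minimal sketch (hypothetical function names) of the reprocessing work that a change generates:

```python
# Hypothetical sketch of the reprocessing fan-out: a changed model forces
# a rescore for every user that uses it, and a changed user forces a
# rescore against every model. At scale, either case is a large batch job.

def work_for_model_change(users, changed_model):
    """All (user, model) pairs to rescore when one model changes."""
    return [(u, changed_model) for u in users]

def work_for_user_change(changed_user, models):
    """All (user, model) pairs to rescore when one user changes."""
    return [(changed_user, m) for m in models]

users = ["uuid1", "uuid2", "uuid3"]
models = ["model_a", "model_b"]

model_change_items = work_for_model_change(users, "model_a")
user_change_items = work_for_user_change("uuid1", models)
```

With millions of users and frequently changing models, these batches are exactly the embarrassingly parallel workload Hadoop handles well.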
#CASSANDRA13
Batch processing is OK
* No strict minimum latency requirements
* So we use Hadoop, for scalability and reliability
#CASSANDRA13
Hadoop Workflow Details
* Implemented using Cascading
* Read Activities Table using Cassandra Tap
* Read models from MySQL via JDBC
#CASSANDRA13
Hadoop Bulk Classification Workflow
1. Read Social Media Activity Table (convert from Cassandra)
2. Read User Profiles Table (convert from Cassandra)
3. CoGroup by user profile ID
4. Run Classifier models
5. Write Classification Result Table (convert to Cassandra)

Separate from this, we've loaded the models into memory and serialized them with the classification step. This is all done using Cascading to define the workflow.
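A minimal Python stand-in for the join-then-classify steps (the real workflow uses Cascading's CoGroup over Cassandra taps; the names and toy model here are hypothetical): group activities by user profile ID, join with the profiles, then run each classifier on the grouped text.

```python
from collections import defaultdict

# Hypothetical stand-in for steps 3-5 of the workflow: CoGroup activities
# with profiles by user ID, run each classifier model, and emit rows for
# the Classification Result table.

def cogroup_by_user(profiles, activities):
    """profiles: {user_id: profile}; activities: [(user_id, text), ...].
    Inner join - only users with both a profile and activities survive."""
    grouped = defaultdict(list)
    for user_id, text in activities:
        grouped[user_id].append(text)
    return {uid: (profiles[uid], texts)
            for uid, texts in grouped.items() if uid in profiles}

def run_models(joined, models):
    """models: {name: fn(texts) -> probability}. Returns result rows."""
    results = []
    for uid, (_profile, texts) in joined.items():
        for name, model in models.items():
            results.append((uid, name, model(texts)))
    return results

profiles = {"uuid1": {"first_name": "Pat"}}
activities = [("uuid1", "I feel tired"), ("uuid1", "Where am I?"),
              ("uuid2", "orphan activity, no profile")]
models = {"toy_model": lambda texts: min(1.0, 0.1 * len(texts))}

rows = run_models(cogroup_by_user(profiles, activities), models)
```

In the Cascading version each of these steps is a pipe assembly, and Hadoop parallelizes the grouping and classification across the cluster.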
#CASSANDRA13
Workflow Issues
* Currently manual operation
- Ultimately needs a daemon to trigger (time, users, models)
* Runs in separate cluster
- Lots of network activity to pull data from Cassandra cluster
- With DSE we could run on same cluster
* Fun with AWS security groups
#CASSANDRA13
Solr Search
* Model results include key terms for the classification result
- "feel angry" (0.732)
* Now you want to check actual usage of these terms

Maybe the actual text was "I don't feel angry when my wifi connection drops".
#CASSANDRA13
Poking at the Data
* Hadoop turns petabytes into pie-charts
* How do you verify results?
* Search works really well here

Maybe before you'd use a spreadsheet printout to argue. But that would be Satan's Spreadsheet, with billions of rows.
#CASSANDRA13
Solr Search
* Want a "narrow" table for search
- Solr dynamic fields are usually not a great idea
- Limit of 1024 dynamic fields per document
* So we'll replicate some of our Activity CF data into a new CF
* Don't be afraid of making copies of data
#CASSANDRA13
The "Search" Column Family
* Row key is derived from Activity CF UUID + target column name
* One column ("data") has content from that row + column in Activity CF

Activity Column Family:
row key | 213_FB_data    | 213_FB_id
"uuid1" | "I feel tired" | "FB post #32"

Search Column Family:
row key        | "data"
"uuid1_213_FB" | "I feel tired"
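Deriving the Search CF rows from the Activity CF is mechanical; a sketch under the assumption (hypothetical helper names) that only the free-text `_data` columns are worth indexing:

```python
# Hypothetical sketch of populating the "Search" column family from the
# Activity column family: one narrow, searchable row per
# (activity row key, text column), keyed as "<uuid>_<ts>_<src>".

def search_rows(activity_row_key, activity_columns):
    """activity_columns: {composite name like '213_FB_data': value}.
    Emit (search row key, {'data': value}) for the text columns only."""
    rows = []
    for name, value in activity_columns.items():
        if name.endswith("_data"):  # only the free-text column is indexed
            ts_src = name[: -len("_data")]  # e.g. '213_FB'
            rows.append((f"{activity_row_key}_{ts_src}", {"data": value}))
    return rows

out = search_rows("uuid1", {
    "213_FB_data": "I feel tired",
    "213_FB_id": "FB post #32",
})
```

Because the search row key still embeds UUID, timestamp, and service, a Solr hit can be traced straight back to the original Activity row and column.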
#CASSANDRA13
Solr Schema
* Very simple (which is how we like it)
* Direct one-to-one mapping with Cassandra columns
* Hits have a key field, which contains UUID/Timestamp/Service

<fields>
  <field name="key" type="string" indexed="true" stored="true" />
  <field name="data" type="text" indexed="true" stored="true" />
</fields>

So once we have a hit, we can access information in the Activity table if needed.
#CASSANDRA13
Combined Cluster
* One Cassandra cluster can allocate nodes for Hadoop & Search
#CASSANDRA13
The Most Important Detail
* We don't have any personal medical data!!!
* We don't have any personal medical data!!!
* We don't have any personal medical data!!!

As soon as you've got personal medical data, it's a whole new ballgame: at least an order of magnitude more work to make it really secure, and likely you couldn't use the AWS cloud. We still care about security, because we're collecting social media activity that isn't necessarily public.
#CASSANDRA13
Three Aspects of Security
* Server-level
- ssh via restricted private key
* API-level
- validate requests using a signature (secure SHA1 hash)
* Services-level
- Restrict open ports using security groups

So even if you knew which server was running OpsCenter, you couldn't just start poking around. Access to Cassandra is only via the t1.micro servers, which are in the same security group. The t1.micro servers only open up ssh and the port needed for external API requests.
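Signature-based API validation of this kind is typically an HMAC over the request with a shared secret; a hedged sketch using Python's standard library (the actual Durkheim signing scheme - what exactly gets signed and how it's encoded - may differ):

```python
import hashlib
import hmac

# Hypothetical sketch of signature-based API validation: the client signs
# the request body with a shared secret, and the server recomputes the
# HMAC-SHA1 and compares. The real Durkheim API's exact scheme may differ.

SECRET = b"shared-secret-key"  # hypothetical; never hardcode in production

def sign(body: bytes) -> str:
    """Compute the request signature as a hex HMAC-SHA1 digest."""
    return hmac.new(SECRET, body, hashlib.sha1).hexdigest()

def is_valid(body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking timing information to an attacker
    return hmac.compare_digest(sign(body), signature)

body = b'{"user": "uuid1", "action": "get_activities"}'
sig = sign(body)
```

A request with a tampered body (or no knowledge of the secret) fails validation, which is what protects the API even though the port itself is reachable.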