#CASSANDRA13
Ken Krugler | President, Scale Unlimited
Suicide Prevention Using Social Media and Cassandra
#CASSANDRA13
What we will discuss today...
*Using Cassandra to store social media content
*Combining Hadoop workflows with Cassandra
*Leveraging Solr search support in DataStax Enterprise
*Doing good with big data
Fine Print!
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA)
and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and
Naval Warfare Systems Center Pacific.
#CASSANDRA13
Obligatory Background
*Ken Krugler, Scale Unlimited - Nevada City, CA
*Consulting on big data workflows, machine learning & search
*Training for Hadoop, Cascading, Solr & Cassandra
#CASSANDRA13
Durkheim Project Overview
Including things we didn't work on...
#CASSANDRA13
What's the problem?
*More soldiers die from suicide than combat
*Suicide rate has gone up 80% since 2002
*Civilian suicide rates are also climbing
*More suicides than homicides
*Intervention after an "event" is often too late
#CASSANDRA13
What is The Durkheim Project?
*DARPA-funded initiative to help military physicians
*Uses predictive analytics to estimate suicide risk from what people write online
*Each user is assigned a suicidality risk rating of red, yellow or green
Émile Durkheim
#CASSANDRA13
Current Status of Durkheim
*Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook
*Details at http://www.durkheimproject.org/
*Finished phase I, now being rolled out to a wider audience
#CASSANDRA13
Predictive Analytics
*Guessing at state of mind from text
-"There are very few people in this world that know the REAL me."
-"I lay down to go to sleep, but all I can do is cry"
*Uses labeled training data from clinical notes
*Phase I results promising, for small sample set
-"ensemble" of predictors is a powerful ML technique
#CASSANDRA13
Clinician Dashboard
*Multiple views on patient
*Prediction & confidence
*Backing data (key phrases, etc.)
#CASSANDRA13
Data Collection
Where _do_ you put a billion text snippets?
#CASSANDRA13
Saving Social Media Activity
*System to continuously save new activity
-Scalable data store
*Also needs a scalable, reliable way to access data
-Processed in bulk (workflows)
-Accessed at individual level
-Searched at activity level
#CASSANDRA13
Data Collection
*Pink is what we wrote
*Green is in Cassandra
*Key data path in red
[Diagram: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim Social API, Durkheim App, Users Table, Activity Table]
#CASSANDRA13
Designing the Column Families
*What queries do we need to handle?
-Always by user id (what we assign)
*We want all the data for a user
-Both for Users table, and Activities table
-Sometimes we want a date range of activities
*So one row per user
-And ordered by date in the Activities table
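A minimal sketch of that access pattern, in plain Python rather than against Cassandra: one row per user id, columns kept sorted by timestamp, and a date-range read done as a slice. The class and function names are hypothetical, and the in-memory dict only stands in for the Activities table.

```python
from bisect import bisect_left, bisect_right


class ActivityRow:
    """Toy stand-in for one wide row: columns kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted column timestamps
        self._values = []      # values, parallel to _timestamps

    def insert(self, ts, value):
        # Keep columns ordered on write so range reads are a cheap slice.
        i = bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def between(self, start_ts, end_ts):
        # Inclusive range over the sorted timestamps.
        lo = bisect_left(self._timestamps, start_ts)
        hi = bisect_right(self._timestamps, end_ts)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


activities = {}  # row key (user id we assign) -> ActivityRow


def save_activity(user_id, ts, text):
    activities.setdefault(user_id, ActivityRow()).insert(ts, text)


def activities_in_range(user_id, start_ts, end_ts):
    row = activities.get(user_id)
    return row.between(start_ts, end_ts) if row else []
```

Every read starts from a user id, so nothing here needs a secondary index; that is the point of designing the rows around the queries.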
#CASSANDRA13
Users Table (Column Family)
*One row per user - row key is a UUID we assign
*Standard "static" columns
-First name, last name, opt_in status, etc.
*Easy to add more xxx_id columns for new services
row key first_name last_name facebook_id twitter_id opt_in
#CASSANDRA13
Activities Table (Column Family)
*One row per user - row key is a UUID we assign
*One composite column per social media event
-Timestamp (long value)
-Source (FB, TW, GP, etc)
-Type of column (data, activity id, user id, type of activity)
row key ts_src_data ts_src_id ts_src_providerUid ts_src_type
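As an illustration only, the naming scheme above can be sketched in Python. The underscore-joined string follows the notation on these slides; a real composite column stores typed components, so the timestamp would sort as a long rather than as text. All names below are hypothetical.

```python
from typing import NamedTuple


class ColumnName(NamedTuple):
    timestamp: int  # event time as a long value
    source: str     # e.g. "FB", "TW", "GP"
    kind: str       # "data", "id", "providerUid" or "type"


def encode(name: ColumnName) -> str:
    """Render a composite name in the slide notation, e.g. 213_FB_data."""
    return f"{name.timestamp}_{name.source}_{name.kind}"


def decode(text: str) -> ColumnName:
    """Split the slide notation back into its three components."""
    ts, source, kind = text.split("_", 2)
    return ColumnName(int(ts), source, kind)


def event_columns(ts, source, data, activity_id, provider_uid, activity_type):
    """Build the four columns written for one social media event."""
    values = {
        "data": data,
        "id": activity_id,
        "providerUid": provider_uid,
        "type": activity_type,
    }
    return {encode(ColumnName(ts, source, kind)): v for kind, v in values.items()}
```

One event therefore becomes four columns that share a timestamp and source prefix, which keeps them adjacent within the user's row.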
#CASSANDRA13
Two Views of Composite Columns
*As a row/column view
row key   213_FB_data     213_FB_id      213_FB_providerUid  213_FB_type
"uuid1"   "I feel tired"  "FB post #32"  "FB user #66"       "Status update"
*As a key-value map
"uuid1"
  213_FB_data         "I feel tired"
  213_FB_id           "FB post #32"
  213_FB_providerUid  "FB user #66"
  213_FB_type         "Status update"
#CASSANDRA13
Implementation Details
*API access protected via signature
*Gigya Daemon on both t1.micro servers
-But only active on one of them
*Astyanax client talks to Cassandra
*Cluster uses 3 m1.large servers
[Diagram: Durkheim App, AWS Load Balancer, Durkheim Social API on two EC2 t1.micro servers, EC2 m1.large servers]
#CASSANDRA13
Predictive Analytics at Scale
Running workflows against Cassandra data
#CASSANDRA13
How to process all this social media goodness?
*Models are defined elsewhere
*These are "black boxes" to us
213_FB_data 213_FB_id 213_FB_providerUid 213_FB_type
"I feel tired" "FB post #32" "FB user #66" "Status update"
307_TW_data 307_TW_id 307_TW_providerUid 307_TW_type
"Where am I?" "Tweet #17" "TW user #109" "Tweet"
[Diagram: Feature Extraction, then Model, producing output columns: model, rating, probability, keywords]
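A rough sketch of how a black-box model might be wrapped, assuming the output columns are model, rating, probability and keywords as listed on this slide. The bigram feature extraction and the `score_fn` callable are placeholders invented for illustration; the real models are defined elsewhere and are not shown in this talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ModelResult:
    model: str
    rating: str         # "red", "yellow" or "green"
    probability: float
    keywords: List[str]


def extract_features(texts: List[str]) -> List[str]:
    """Placeholder feature extraction: lower-cased word bigrams."""
    features = []
    for text in texts:
        words = text.lower().split()
        features.extend(" ".join(pair) for pair in zip(words, words[1:]))
    return features


def run_model(
    name: str,
    score_fn: Callable[[List[str]], Tuple[str, float, List[str]]],
    texts: List[str],
) -> ModelResult:
    """Call the black box and pack its answer into one result record."""
    rating, probability, keywords = score_fn(extract_features(texts))
    return ModelResult(name, rating, probability, keywords)
```

Keeping the model behind a single callable means the surrounding workflow does not change when a model does.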
#CASSANDRA13
Why do we need Hadoop?
*Running one model on one user is easy
-And n models on one user is still OK
*But when a model changes...
-all users with the model need processing
#CASSANDRA13
Batch processing is OK
*No strict latency requirements
*So we use Hadoop, for scalability and reliability
#CASSANDRA13
Hadoop Workflow Details
*Implemented using Cascading
*Read Activities Table using Cassandra Tap
*Read models from MySQL via JDBC
#CASSANDRA13
Hadoop Bulk Classification Workflow
*Read Social Media Activity Table, then convert from Cassandra
*Read User Profiles Table, then convert from Cassandra
*CoGroup by user profile ID
*Run Classifier models
*Convert to Cassandra, then write Classification Result Table
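The shape of this workflow can be sketched without Hadoop or Cascading. The plain-Python version below only illustrates the CoGroup step followed by classification; it is not the actual Cascading flow, and the function names and tuple layouts are assumptions.

```python
from collections import defaultdict


def cogroup_by_user(profiles, activities):
    """Group both inputs by user id.

    profiles:   iterable of (user_id, profile)
    activities: iterable of (user_id, activity)
    Returns {user_id: ([profiles], [activities])}.
    """
    grouped = defaultdict(lambda: ([], []))
    for user_id, profile in profiles:
        grouped[user_id][0].append(profile)
    for user_id, activity in activities:
        grouped[user_id][1].append(activity)
    return grouped


def classify_all(profiles, activities, classifiers):
    """Run every classifier on every user that has a profile."""
    results = []
    grouped = cogroup_by_user(profiles, activities)
    for user_id, (user_profiles, user_activities) in sorted(grouped.items()):
        if not user_profiles:
            continue  # activity with no matching profile is skipped
        for name, classify in classifiers.items():
            result = classify(user_profiles[0], user_activities)
            results.append((user_id, name, result))
    return results
```

In the real workflow the two inputs come from Cassandra and the result tuples are written back to a Cassandra table; here they are ordinary Python iterables and a list.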
#CASSANDRA13
Workflow Issues
*Currently manual operation
-Ultimately needs a daemon to trigger (time, users, models)
*Runs in separate cluster
-Lots of network activity to pull data from Cassandra cluster
-With DSE we could run on same cluster
*Fun with AWS security groups
#CASSANDRA13
Solr Search
Poking at the data
#CASSANDRA13
Solr Search
*Model results include key terms for classification result
-"feel angry" (0.732)
*Now you want to check actual usage of these terms
#CASSANDRA13
Poking at the Data
*Hadoop turns petabytes into pie charts
*How do you verify results?
*Search works really well here
#CASSANDRA13
Solr Search
*Want "narrow" table for search
-Solr dynamic fields are usually not a great idea
-Limit to 1024 dynamic fields per document
*So we'll replicate some of our Activity CF data into a new CF
*Don't be afraid of making copies of data
#CASSANDRA13
The "Search" Column Family
*Row key is derived from Activity CF UUID + target column name
*One column ("data") has content from that row + column in Activity CF
Activity Column Family
row key   213_FB_data     213_FB_id
"uuid1"   "I feel tired"  "FB post #32"
Search Column Family
row key         "data"
"uuid1_213_FB"  "I feel tired"
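A small sketch of that replication step, assuming the underscore notation used on these slides. The function name and the dict-based row layout are hypothetical.

```python
def to_search_rows(user_id, activity_columns, target="data"):
    """Copy each `<ts>_<src>_<target>` column into its own narrow search row.

    Returns {search_row_key: {"data": value}}, with row keys of the
    form `<user_id>_<ts>_<src>`.
    """
    suffix = "_" + target
    rows = {}
    for name, value in activity_columns.items():
        if name.endswith(suffix):
            prefix = name[: -len(suffix)]  # e.g. "213_FB"
            rows[f"{user_id}_{prefix}"] = {"data": value}
    return rows
```

Each search row has exactly one column, so the Solr side never needs dynamic fields.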
#CASSANDRA13
Solr Schema
*Very simple (which is how we like it)
*Direct one-to-one mapping with Cassandra columns
*Hits have key field, which contains UUID/Timestamp/Service
<fields>
<field name="key" type="string" indexed="true" stored="true" />
<field name="data" type="text" indexed="true" stored="true" />
</fields>
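With a schema this small, checking how a key term is actually used comes down to a phrase query on the `data` field. The sketch below only builds the request URL for Solr's standard `/select` handler and splits a hit's key back into its parts. The base URL is a made-up example, and the key layout is assumed to be the underscore form shown for the Search column family.

```python
from urllib.parse import urlencode


def solr_phrase_query_url(base_url, phrase, rows=10):
    """Build a /select URL that searches the `data` field for an exact phrase.

    Note: quotes inside `phrase` are not escaped in this sketch.
    """
    params = {
        "q": f'data:"{phrase}"',
        "fl": "key,data",
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_key(key):
    """Split a hit's key into (user id, timestamp, service)."""
    user_id, timestamp, service = key.rsplit("_", 2)
    return user_id, int(timestamp), service
```

For example, `solr_phrase_query_url("http://localhost:8983/solr/durkheim.search", "feel angry")` would look up the "feel angry" term from a model result, and `parse_key` points each hit back at the user row and column it came from.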
#CASSANDRA13
Combined Cluster
*One Cassandra Cluster can allocate nodes for Hadoop & Search
#CASSANDRA13
Security
Locking things down
#CASSANDRA13
The Most Important Detail
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
*We don't have any personal medical data!!!
#CASSANDRA13
Three Aspects of Security
*Server-level
-ssh via restricted private key
*API-level
-validate requests using signature
-secure SHA1 hash
*Services-level
-Restrict open ports using security groups
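For the API-level check, one common way to validate requests with a signature is an HMAC over a canonical form of the request. The sketch below uses HMAC-SHA1 from Python's standard library; the slides only say "signature" and "SHA1 hash", so the HMAC construction, the canonical string layout and the function names are all assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, method: str, path: str, params: dict) -> str:
    """Sign a request: HMAC-SHA1 over method, path and sorted parameters."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    message = f"{method}\n{path}\n{canonical}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha1).hexdigest()


def is_valid(secret: bytes, method: str, path: str, params: dict, signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign(secret, method, path, params)
    return hmac.compare_digest(expected, signature)
```

Sorting the parameters makes the signature independent of the order in which the client sends them, and `hmac.compare_digest` avoids leaking timing information during the comparison.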
#CASSANDRA13
Summary
Bringing it all home
#CASSANDRA13
Key Points...
*You can effectively use Cassandra as:
-A repository for social media data
-The data source for workflows
-A search index, via Solr integration
#CASSANDRA13
And the Meta-Point
*It is possible to do more with big data than optimize ad yields
#CASSANDRA13
THANK YOU

More Related Content

What's hot

Why data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsWhy data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsImply
 
Big Data and its emergence
Big Data and its emergenceBig Data and its emergence
Big Data and its emergencekoolkalpz
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid Imply
 
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...Imply
 
Реляционные или нереляционные (Josh Berkus)
Реляционные или нереляционные (Josh Berkus)Реляционные или нереляционные (Josh Berkus)
Реляционные или нереляционные (Josh Berkus)Ontico
 
Apache Druid®: A Dance of Distributed Processes
 Apache Druid®: A Dance of Distributed Processes Apache Druid®: A Dance of Distributed Processes
Apache Druid®: A Dance of Distributed ProcessesImply
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooQuerying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooAll Things Open
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Toru Takahashi
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1Donghan Kim
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
 
Webinar: Choosing the Right Shard Key for High Performance and Scale
Webinar: Choosing the Right Shard Key for High Performance and ScaleWebinar: Choosing the Right Shard Key for High Performance and Scale
Webinar: Choosing the Right Shard Key for High Performance and ScaleMongoDB
 
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례대은 유
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)SahilRaina21
 
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsHow TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsImply
 
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQL
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQLMongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQL
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQLMongoDB
 
Splunk: Druid on Kubernetes with Druid-operator
Splunk: Druid on Kubernetes with Druid-operatorSplunk: Druid on Kubernetes with Druid-operator
Splunk: Druid on Kubernetes with Druid-operatorImply
 
Druid in Spot Instances
Druid in Spot InstancesDruid in Spot Instances
Druid in Spot InstancesImply
 

What's hot (20)

Why data warehouses cannot support hot analytics
Why data warehouses cannot support hot analyticsWhy data warehouses cannot support hot analytics
Why data warehouses cannot support hot analytics
 
Big Data and its emergence
Big Data and its emergenceBig Data and its emergence
Big Data and its emergence
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi...
 
Реляционные или нереляционные (Josh Berkus)
Реляционные или нереляционные (Josh Berkus)Реляционные или нереляционные (Josh Berkus)
Реляционные или нереляционные (Josh Berkus)
 
Apache Druid®: A Dance of Distributed Processes
 Apache Druid®: A Dance of Distributed Processes Apache Druid®: A Dance of Distributed Processes
Apache Druid®: A Dance of Distributed Processes
 
Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooQuerying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
 
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
Learn from Case Study; How do people run query on Trino? / Trino japan virtua...
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
Webinar: Choosing the Right Shard Key for High Performance and Scale
Webinar: Choosing the Right Shard Key for High Performance and ScaleWebinar: Choosing the Right Shard Key for High Performance and Scale
Webinar: Choosing the Right Shard Key for High Performance and Scale
 
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례
빅데이터 실시간 데이터 분석 - URL 실시간 UV/PV 집계 사례
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and BotsHow TrafficGuard uses Druid to Fight Ad Fraud and Bots
How TrafficGuard uses Druid to Fight Ad Fraud and Bots
 
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQL
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQLMongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQL
MongoDB .local Munich 2019: Managing a Heterogeneous Stack with MongoDB & SQL
 
Splunk: Druid on Kubernetes with Druid-operator
Splunk: Druid on Kubernetes with Druid-operatorSplunk: Druid on Kubernetes with Druid-operator
Splunk: Druid on Kubernetes with Druid-operator
 
Redis
RedisRedis
Redis
 
Druid in Spot Instances
Druid in Spot InstancesDruid in Spot Instances
Druid in Spot Instances
 

Viewers also liked

Five Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data ApplicationsFive Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data ApplicationsLightbend
 
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive Ecosystem
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive EcosystemThe Power Of Ecosystems - Why 2016 Is The Year Of The Automotive Ecosystem
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive EcosystemCloudMade
 
Programmatic Advertising Solutions for Retailers and Brands
Programmatic Advertising Solutions for Retailers and BrandsProgrammatic Advertising Solutions for Retailers and Brands
Programmatic Advertising Solutions for Retailers and BrandsVeronika Sonsev
 
VI Jornada Interhospitalaria de Oncología Pediátrica
VI Jornada Interhospitalaria de Oncología PediátricaVI Jornada Interhospitalaria de Oncología Pediátrica
VI Jornada Interhospitalaria de Oncología PediátricaJavier González de Dios
 
Purple Goldfish Hall of Famers
Purple Goldfish Hall of FamersPurple Goldfish Hall of Famers
Purple Goldfish Hall of FamersStan Phelps
 
26 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 201826 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 2018Brian Solis
 

Viewers also liked (6)

Five Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data ApplicationsFive Early Challenges Of Building Streaming Fast Data Applications
Five Early Challenges Of Building Streaming Fast Data Applications
 
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive Ecosystem
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive EcosystemThe Power Of Ecosystems - Why 2016 Is The Year Of The Automotive Ecosystem
The Power Of Ecosystems - Why 2016 Is The Year Of The Automotive Ecosystem
 
Programmatic Advertising Solutions for Retailers and Brands
Programmatic Advertising Solutions for Retailers and BrandsProgrammatic Advertising Solutions for Retailers and Brands
Programmatic Advertising Solutions for Retailers and Brands
 
VI Jornada Interhospitalaria de Oncología Pediátrica
VI Jornada Interhospitalaria de Oncología PediátricaVI Jornada Interhospitalaria de Oncología Pediátrica
VI Jornada Interhospitalaria de Oncología Pediátrica
 
Purple Goldfish Hall of Famers
Purple Goldfish Hall of FamersPurple Goldfish Hall of Famers
Purple Goldfish Hall of Famers
 
26 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 201826 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 2018
 

Similar to C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler

Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Redis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingRedis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingDave Nielsen
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revisedMongoDB
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessMongoDB
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...Álvaro Agea Herradón
 
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...DataStax Academy
 
Stratio CrossData: an efficient distributed datahub with batch and streaming ...
Stratio CrossData: an efficient distributed datahub with batch and streaming ...Stratio CrossData: an efficient distributed datahub with batch and streaming ...
Stratio CrossData: an efficient distributed datahub with batch and streaming ...Stratio
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setSoner Altin
 
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...MongoDB
 
Jumpstart: MongoDB BI Connector & Tableau
Jumpstart: MongoDB BI Connector & TableauJumpstart: MongoDB BI Connector & Tableau
Jumpstart: MongoDB BI Connector & TableauMongoDB
 
llr+ cHApTEFt s Database Processing(2) Does this design e.docx
llr+ cHApTEFt s Database Processing(2) Does this design e.docxllr+ cHApTEFt s Database Processing(2) Does this design e.docx
llr+ cHApTEFt s Database Processing(2) Does this design e.docxsmile790243
 
Practical Machine Learning in Information Security
Practical Machine Learning in Information SecurityPractical Machine Learning in Information Security
Practical Machine Learning in Information SecuritySven Krasser
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code EuropeDavid Pilato
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB
 
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...MongoDB
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 

Similar to C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler (20)

Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Redis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured StreamingRedis Streams plus Spark Structured Streaming
Redis Streams plus Spark Structured Streaming
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revised
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational Awareness
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Bids talk 9.18
Bids talk 9.18Bids talk 9.18
Bids talk 9.18
 
Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...Crossdata: an efficient distributed datahub with batch and streaming query ca...
Crossdata: an efficient distributed datahub with batch and streaming query ca...
 
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an...
 
Stratio CrossData: an efficient distributed datahub with batch and streaming ...
Stratio CrossData: an efficient distributed datahub with batch and streaming ...Stratio CrossData: an efficient distributed datahub with batch and streaming ...
Stratio CrossData: an efficient distributed datahub with batch and streaming ...
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
 
Presentation
PresentationPresentation
Presentation
 
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...
MongoDB .local London 2019: Best Practices for Working with IoT and Time-seri...
 
Jumpstart: MongoDB BI Connector & Tableau
Jumpstart: MongoDB BI Connector & TableauJumpstart: MongoDB BI Connector & Tableau
Jumpstart: MongoDB BI Connector & Tableau
 
llr+ cHApTEFt s Database Processing(2) Does this design e.docx
llr+ cHApTEFt s Database Processing(2) Does this design e.docxllr+ cHApTEFt s Database Processing(2) Does this design e.docx
llr+ cHApTEFt s Database Processing(2) Does this design e.docx
 
Practical Machine Learning in Information Security
Practical Machine Learning in Information SecurityPractical Machine Learning in Information Security
Practical Machine Learning in Information Security
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and Implications
 
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...
MongoDB .local Houston 2019: Best Practices for Working with IoT and Time-ser...
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

C* Summit 2013: Suicide Risk Prediction Using Social Media and Cassandra by Ken Krugler

  • 1. #CASSANDRA13 Ken Krugler | President, Scale Unlimited Suicide Prevention Using Social Media and Cassandra
  • 2. #CASSANDRA13 What we will discuss today... *Using Cassandra to store social media content *Combining Hadoop workflows with Cassandra *Leveraging Solr search support in DataStax Enterprise *Doing good with big data Fine Print! This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA), and Space and Naval Warfare Systems Center Pacific under Contract N66001-11-4006. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center Pacific.
  • 3. #CASSANDRA13 Obligatory Background *Ken Krugler, Scale Unlimited - Nevada City, CA *Consulting on big data workflows, machine learning & search *Training for Hadoop, Cascading, Solr & Cassandra
  • 5. #CASSANDRA13 What's the problem? *More soldiers die from suicide than combat *Suicide rate has gone up 80% since 2002 *Civilian suicide rates are also climbing *More suicides than homicides *Intervention after an "event" is often too late
  • 6. #CASSANDRA13 What is The Durkheim Project? *DARPA-funded initiative to help military physicians *Uses predictive analytics to estimate suicide risk from what people write online *Each user is assigned a suicidality risk rating of red, yellow or green. Émile Durkheim
  • 7. #CASSANDRA13 Current Status of Durkheim *Collaborative effort involving Patterns and Predictions, Dartmouth Medical School & Facebook *Details at http://www.durkheimproject.org/ *Finished phase I, now being rolled out to wider audience
  • 8. #CASSANDRA13 Predictive Analytics *Guessing at state of mind from text -"There are very few people in this world that know the REAL me." -"I lay down to go to sleep, but all I can do is cry" *Uses labeled training data from clinical notes *Phase I results promising, for small sample set -"ensemble" of predictors is a powerful ML technique
  • 9. #CASSANDRA13 Clinician Dashboard *Multiple views on patient *Prediction & confidence *Backing data (key phrases, etc)
  • 10. #CASSANDRA13 Data Collection Where _do_ you put a billion text snippets?
  • 11. #CASSANDRA13 Saving Social Media Activity *System to continuously save new activity -Scalable data store *Also needs a scalable, reliable way to access data -Processed in bulk (workflows) -Accessed at individual level -Searched at activity level
  • 12. #CASSANDRA13 Data Collection *Pink is what we wrote *Green is in Cassandra *Key data path in red [Diagram labels: Exciting Social Media Activity, Gigya Service, Gigya Daemon, Durkheim Social API, Durkheim App, Users Table, Activity Table]
  • 13. #CASSANDRA13 Designing the Column Families *What queries do we need to handle? -Always by user id (what we assign) *We want all the data for a user -Both for Users table, and Activities table -Sometimes we want a date range of activities *So one row per user -And ordered by date in the Activities table
  • 14. #CASSANDRA13 Users Table (Column Family) *One row per user - row key is a UUID we assign *Standard "static" columns -First name, last name, opt_in status, etc. *Easy to add more xxx_id columns for new services [Row layout: row key | first_name | last_name | facebook_id | twitter_id | opt_in]
  • 15. #CASSANDRA13 Activities Table (Column Family) *One row per user - row key is a UUID we assign *One composite column per social media event -Timestamp (long value) -Source (FB, TW, GP, etc) -Type of column (data, activity id, user id, type of activity) [Row layout: row key | ts_src_data | ts_src_id | ts_src_providerUid | ts_src_type]
  • 16. #CASSANDRA13 Two Views of Composite Columns *As a row/column view: row key "uuid1" with columns 213_FB_data = "I feel tired", 213_FB_id = "FB post #32", 213_FB_providerUid = "FB user #66", 213_FB_type = "Status update" *As a key-value map: "uuid1" maps to {213_FB_data: "I feel tired", 213_FB_id: "FB post #32", 213_FB_providerUid: "FB user #66", 213_FB_type: "Status update"}
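The composite-column layout on slides 13 to 16 can be modeled in plain Python to show why one row per user, ordered by timestamp, makes date-range reads cheap. This is only an in-memory sketch, not Cassandra or Astyanax code: the function names `make_row` and `slice_by_time` are hypothetical, and each column name is assumed to be a typed tuple `(timestamp, source, kind)` compared component by component.

```python
from bisect import bisect_left


def make_row(events):
    """Flatten events into one wide row of composite columns.

    events: iterable of (timestamp, source, fields) where fields maps a
    column kind ("data", "id", "providerUid", "type") to its value.
    Returns a list of ((timestamp, source, kind), value) sorted by name,
    mimicking how columns are kept in order within a single row.
    """
    columns = []
    for ts, source, fields in events:
        for kind, value in fields.items():
            columns.append(((ts, source, kind), value))
    columns.sort(key=lambda column: column[0])
    return columns


def slice_by_time(columns, start_ts, end_ts):
    """Return the columns with start_ts <= timestamp <= end_ts.

    Because names are sorted with the timestamp first, a date range is a
    contiguous slice that two binary searches can locate.
    """
    names = [name for name, _ in columns]
    lo = bisect_left(names, (start_ts,))
    hi = bisect_left(names, (end_ts + 1,))
    return columns[lo:hi]


events = [
    (307, "TW", {"data": "Where am I?", "id": "Tweet #17",
                 "providerUid": "TW user #109", "type": "Tweet"}),
    (213, "FB", {"data": "I feel tired", "id": "FB post #32",
                 "providerUid": "FB user #66", "type": "Status update"}),
]
row = make_row(events)
facebook_only = slice_by_time(row, 200, 250)
```

Here `facebook_only` holds just the four `213_FB_*` columns, even though the Twitter event was inserted first.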
  • 17. #CASSANDRA13 Implementation Details *API access protected via signature *Gigya Daemon on both t1.micro servers -But only active on one of them *Astyanax client talks to Cassandra *Cluster uses 3 m1.large servers [Diagram labels: Durkheim App, AWS Load Balancer, Durkheim Social API on two EC2 t1.micro servers, EC2 m1.large servers]
  • 18. #CASSANDRA13 Predictive Analytics at Scale Running workflows against Cassandra data
  • 19. #CASSANDRA13 How to process all this social media goodness? *Models are defined elsewhere *These are "black boxes" to us [Diagram: activity columns (213_FB_data = "I feel tired", 213_FB_id = "FB post #32", 213_FB_providerUid = "FB user #66", 213_FB_type = "Status update"; 307_TW_data = "Where am I?", 307_TW_id = "Tweet #17", 307_TW_providerUid = "TW user #109", 307_TW_type = "Tweet") feed Feature Extraction, then Model, which outputs model, rating, probability, keywords]
  • 20. #CASSANDRA13 Why do we need Hadoop? *Running one model on one user is easy -And n models on one user is still OK *But when a model changes... -all users with the model need processing
  • 21. #CASSANDRA13 Batch processing is OK *No strict minimum latency requirements *So we use Hadoop, for scalability and reliability
  • 22. #CASSANDRA13 Hadoop Workflow Details *Implemented using Cascading *Read Activities Table using Cassandra Tap *Read models from MySQL via JDBC
  • 23. #CASSANDRA13 Hadoop Bulk Classification Workflow [Diagram, in data-flow order: Read Social Media Activity Table, Convert from Cassandra; Read User Profiles Table, Convert from Cassandra; CoGroup by user profile ID; Run Classifier models; Convert to Cassandra; Write Classification Result Table]
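The shape of the bulk classification workflow on slide 23 (two inputs, a CoGroup on user ID, then every model run per user) can be illustrated without Hadoop or Cascading. The sketch below is plain Python under stated assumptions: `cogroup`, `classify_all` and `placeholder_model` are hypothetical names, the join is assumed to keep only users present in both inputs, and `placeholder_model` is a trivial stand-in for the real models, which the slides describe as black boxes defined elsewhere. Its output fields mirror the model, rating, probability and keywords shown on slide 19.

```python
from collections import defaultdict


def cogroup(profiles, activities):
    """Group two (user_id, record) streams by user_id."""
    grouped = defaultdict(lambda: ([], []))
    for user_id, profile in profiles:
        grouped[user_id][0].append(profile)
    for user_id, activity in activities:
        grouped[user_id][1].append(activity)
    return dict(grouped)


def placeholder_model(texts):
    """Hypothetical stand-in for a black-box model; not a real predictor."""
    watched = ("tired", "cry")
    keywords = [w for w in watched if any(w in t.lower() for t in texts)]
    probability = len(keywords) / len(watched)
    if probability >= 1.0:
        rating = "red"
    elif probability > 0:
        rating = "yellow"
    else:
        rating = "green"
    return {"model": "placeholder", "rating": rating,
            "probability": probability, "keywords": keywords}


def classify_all(profiles, activities, models):
    """Run every model over each user that has a profile and activity."""
    results = []
    grouped = cogroup(profiles, activities)
    for user_id in sorted(grouped):
        profile_rows, activity_rows = grouped[user_id]
        if not profile_rows or not activity_rows:
            continue  # assumed inner-join behavior: need both sides
        for model in models:
            results.append((user_id, model(activity_rows)))
    return results


profiles = [("uuid1", {"first_name": "A"}), ("uuid2", {"first_name": "B"})]
activities = [("uuid1", "I feel tired"),
              ("uuid1", "all I can do is cry"),
              ("uuid3", "Where am I?")]
results = classify_all(profiles, activities, [placeholder_model])
```

Only `uuid1` appears in `results`, since `uuid2` has no activity and `uuid3` has no profile. When a model changes, the same loop has to be rerun for every affected user, which is the case the slides cite for batch processing.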
  • 24. #CASSANDRA13 Workflow Issues *Currently manual operation -Ultimately needs a daemon to trigger (time, users, models) *Runs in separate cluster -Lots of network activity to pull data from Cassandra cluster -With DSE we could run on same cluster *Fun with AWS security groups
  • 26. #CASSANDRA13 Solr Search *Model results include key terms for classification result -"feel angry" (0.732) *Now you want to check actual usage of these terms
  • 27. #CASSANDRA13 Poking at the Data *Hadoop turns petabytes into pie-charts *How do you verify results? *Search works really well here
  • 28. #CASSANDRA13 Solr Search *Want "narrow" table for search -Solr dynamic fields are usually not a great idea -Limit to 1024 dynamic fields per document *So we'll replicate some of our Activity CF data into a new CF *Don't be afraid of making copies of data
  • 29. #CASSANDRA13 The "Search" Column Family *Row key is derived from Activity CF UUID + target column name *One column ("data") has content from that row + column in Activity CF [Diagram: Activity Column Family row "uuid1" with columns 213_FB_data = "I feel tired" and 213_FB_id = "FB post #32" maps to Search Column Family row key "uuid1_213_FB" with "data" = "I feel tired"]
  • 30. #CASSANDRA13 Solr Schema *Very simple (which is how we like it) *Direct one-to-one mapping with Cassandra columns *Hits have key field, which contains UUID/Timestamp/Service <fields> <field name="key" type="string" indexed="true" stored="true" /> <field name="data" type="text" indexed="true" stored="true" /> </fields>
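The copy from the Activity column family into the narrow Search column family (slides 29 and 30) amounts to a key transformation, sketched below in plain Python. The helper names are hypothetical and the underscore-delimited key format is inferred from the "uuid1_213_FB" example on the slide. The sketch also assumes row UUIDs contain no underscores, so a search hit's key can be split back into UUID, timestamp and service.

```python
def search_row_key(user_uuid, timestamp, source):
    """Build a Search CF row key such as "uuid1_213_FB"."""
    return f"{user_uuid}_{timestamp}_{source}"


def to_search_rows(user_uuid, activity_columns):
    """Copy only the "data" columns of one Activity row into search rows.

    activity_columns maps names like "213_FB_data" to their values.
    Each result row has a single "data" column, matching the two-field
    Solr schema (key, data).
    """
    rows = {}
    for name, value in activity_columns.items():
        timestamp, source, kind = name.split("_", 2)
        if kind == "data":
            rows[search_row_key(user_uuid, timestamp, source)] = {"data": value}
    return rows


def parse_search_key(key):
    """Split a hit's key back into (uuid, timestamp, service)."""
    user_uuid, timestamp, source = key.rsplit("_", 2)
    return user_uuid, int(timestamp), source


activity_row = {"213_FB_data": "I feel tired", "213_FB_id": "FB post #32"}
search_rows = to_search_rows("uuid1", activity_row)
```

With this input, `search_rows` contains one row keyed "uuid1_213_FB", and `parse_search_key` recovers the user and event needed to look up the full activity.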
  • 31. #CASSANDRA13 Combined Cluster *One Cassandra Cluster can allocate nodes for Hadoop & Search
  • 33. #CASSANDRA13 The Most Important Detail *We don't have any personal medical data!!! *We don't have any personal medical data!!! *We don't have any personal medical data!!!
  • 34. #CASSANDRA13 Three Aspects of Security *Server-level -ssh via restricted private key *API-level -validate requests using signature -secure SHA1 hash *Services-level -Restrict open ports using security groups
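The slides mention only that API requests are validated with a signature based on a SHA1 hash. One common way to do that is a keyed HMAC-SHA1 over a canonical form of the request, sketched below with the Python standard library. The canonical string format, the shared secret and the function names are all assumptions for illustration, not the project's actual scheme.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign_request(secret, method, path, params):
    """Return a hex HMAC-SHA1 signature over a canonical request string.

    Parameters are sorted so client and server sign identical bytes
    regardless of the order in which they were supplied.
    """
    canonical = "\n".join(
        [method.upper(), path, urlencode(sorted(params.items()))]
    )
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha1).hexdigest()


def is_valid(secret, method, path, params, signature):
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, params)
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret"  # hypothetical key known to client and server
signature = sign_request(secret, "GET", "/users/uuid1", {"b": "2", "a": "1"})
accepted = is_valid(secret, "get", "/users/uuid1", {"a": "1", "b": "2"}, signature)
rejected = is_valid(secret, "GET", "/users/uuid2", {"a": "1", "b": "2"}, signature)
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters of a forged signature were correct.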
  • 36. #CASSANDRA13 Key Points... *You can effectively use Cassandra as: -A repository for social media data -The data source for workflows -A search index, via Solr integration
  • 37. #CASSANDRA13 And the Meta-Point *It is possible to do more with big data than optimize ad yields