3. DataKind by the Numbers
20,000+ community members worldwide in 98 countries, representing the largest global data science for social good network
5 global chapters
250+ events around the world
Volunteer sign-ups from 174 countries
300+ projects completed, providing the most comprehensive library of data science for social good projects
150+ organizations helped
200,000+ hours donated
$35M+ pro bono services delivered
From evening or weekend events to multi-month projects, our programs are designed to provide social organizations with the pro bono data science innovation team they need to tackle critical humanitarian issues.
5. DataKind’s strategy is grounded in four key principles
Catalyze a thriving Data Science for Good ecosystem through partnerships: A healthy
ecosystem requires forming the right “we” to deliver on a range of data science needs – while we will
continue to deliver data science projects, we will also use partnerships wherever possible to create
an accessible ecosystem of data science resources.
Be the connective tissue between the social sector and private sector data science resources:
We will elevate attention to the Data Science for Good field, with the goal of building nonprofit
demand, data science talent / resources, and philanthropic investment for all.
Identify the brightest opportunities for data science: We are known for our ability to scope how to
apply data science to social sector organizations, ensuring that data science solutions are designed
thoughtfully, implemented ethically, and used effectively.
Build data science projects that advance the field: DataKind will work directly with nonprofits in
targeted issue areas where there are unmet data science needs that stretch beyond the individual
organization and a coalition of interested and committed partners.
6. How might data science help nonprofits?
Expand impact by anticipating future needs
Scale services by providing personalized support
Save staff time by automating processes
Better understand the communities served
Better target efforts and find those in need of services
Use open/external data sources to inform decision making
7. DataCorps Process
1. Problem Exploration: We explore what’s possible, then staff an expert volunteer team.
2. Data Discovery: The team wrangles the data and identifies external data sources to leverage.
3. Prototyping: The team co-creates solutions with the partner, while DataKind oversees their work.
4. Refinement: Based on feedback, the team makes adjustments to meet the partner’s needs.
5. Solution: The team delivers the final version and documentation so the partner can increase its impact.
8. About Hello Sunday Morning
Hello Sunday Morning (HSM) is an Australian-based non-profit focused on helping people form healthier relationships with alcohol.
HSM created the Daybreak app, a professional and community support social network.
9. Daybreak
Members select a
mood they’re feeling.
Share how they’re feeling
by making a post.
Comment, like, and save
other members’ posts.
Set goals and
reminders.
11. Challenge
Moderators read every post and flag those that are potentially problematic:
either posts that indicate potentially harmful behavior or posts in breach
of community guidelines.
Moderators will either provide support or escalate members to a clinical
team.
HSM’s membership is growing, and the moderators’ task is becoming
unmanageable, with hundreds of thousands of community activities (posts,
comments, reactions) to review and flag if necessary.
12. Ask
Moderators need an automated, efficient, and scalable way to flag
and categorize risky or breach activity.
14. Data Provided
HSM provided historical (Jan-Sept 2019), labeled post data
containing raw text (with PII removed), timestamp of post, and
risk/breach category.
The dataset is large but has significant class imbalance (< 0.1% of
the posts were risky/breach).
15. Objective 1: Identify Risky Posts
A model was built to predict the probability of a post being risky.
Steps:
1) Remove weekend posts from the dataset.
2) Calculate lexicon-based sentiment score.
3) Clean text data.
4) Tokenize posts.
5) Create more features.
6) Train model.
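The first four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the tiny lexicon, the function names, and the extra features (token count, hour of day) are hypothetical and are not taken from the DataKind project. The field names share_content and created_at follow the API request examples later in this deck. Model training (step 6) is left out.

```python
import re
from datetime import datetime

# Hypothetical mini-lexicon for illustration; a real lexicon would be far larger.
SENTIMENT_LEXICON = {"good": 1.0, "better": 1.0, "hopeful": 1.0,
                     "bad": -1.0, "awful": -1.0, "hopeless": -1.0}


def sentiment_score(text):
    """Step 2: average lexicon score over the words of the raw post."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(SENTIMENT_LEXICON.get(w, 0.0) for w in words) / len(words)


def clean_text(text):
    """Step 3: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s']", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Step 4: split cleaned text into whitespace-delimited tokens."""
    return text.split()


def build_features(posts):
    """Run steps 1-5 over a list of {'share_content', 'created_at'} dicts."""
    rows = []
    for post in posts:
        ts = datetime.strptime(post["created_at"], "%Y-%m-%d %H:%M:%S")
        if ts.weekday() >= 5:  # Step 1: skip Saturday (5) and Sunday (6)
            continue
        tokens = tokenize(clean_text(post["share_content"]))
        rows.append({
            "tokens": tokens,
            "sentiment": sentiment_score(post["share_content"]),
            "n_tokens": len(tokens),  # Step 5: assumed extra feature
            "hour": ts.hour,          # Step 5: assumed extra feature
        })
    return rows
```

Each returned row could then be turned into a numeric vector and passed to whichever classifier is used for training.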
16. Assessment of model
            Threshold = 0.1   Threshold = 0.3   Threshold = 0.5   Threshold = 0.7
Recall      0.8               0.5               0.3               0.2
Precision   0.8               0.9               0.9               0.9
F1 score    0.8               0.7               0.5               0.4
Table 1: Model performance on test data at varying probability thresholds
The model was tested on a sample of post data unseen by the
model (Nov 2019 – Jan 2020).
HSM is looking to use the threshold of 0.1 so as to minimize the number of
false negatives.
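To make the threshold trade-off concrete, here is a minimal sketch of how precision, recall, and F1 can be computed from predicted probabilities at a chosen threshold. The function name and the toy probabilities and labels are invented for illustration; they are not the project's data and do not reproduce Table 1.

```python
def metrics_at_threshold(probs, labels, threshold):
    """Return (precision, recall, f1) when posts with prob >= threshold are flagged."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        flagged = prob >= threshold
        if flagged and label == 1:
            tp += 1          # risky post correctly flagged
        elif flagged and label == 0:
            fp += 1          # safe post flagged by mistake
        elif not flagged and label == 1:
            fn += 1          # risky post missed (false negative)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example (made-up values): 1 = risky, 0 = not risky.
toy_probs = [0.05, 0.15, 0.35, 0.60, 0.80, 0.02]
toy_labels = [0, 1, 1, 0, 1, 0]

for t in (0.1, 0.3, 0.5, 0.7):
    p, r, f = metrics_at_threshold(toy_probs, toy_labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Lowering the threshold flags more posts, so fewer risky posts are missed (higher recall), usually at the cost of more false alarms for moderators to review.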
17. Objective 1: Identify Risky Posts
A keyword detector was built to flag potentially risky
words/phrases in a post, in the following categories:
• Suicide
• Domestic Violence
• DUI
• Risky Behavior
• Detox Withdrawal
• Mental Health
• Self Harm
• Other
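A keyword detector of this kind could look roughly like the sketch below. The phrase lists are assumptions: "drank too much" and "bad place" are borrowed from the Endpoint #2 example in this deck, "hurt myself" is a made-up entry, and the project's actual keyword lists are not shown in the slides.

```python
import re

# Hypothetical phrase lists keyed by risk category (illustrative only).
RISK_KEYWORDS = {
    "DUI": ["drank too much"],
    "Mental Health": ["bad place"],
    "Self Harm": ["hurt myself"],
}


def find_risk_keywords(text):
    """Return {category: [matched phrases]} for phrases found in the post."""
    lowered = text.lower()
    hits = {}
    for category, phrases in RISK_KEYWORDS.items():
        matched = [
            phrase for phrase in phrases
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        ]
        if matched:
            hits[category] = matched
    return hits


print(find_risk_keywords("I drank too much last night and am now in a bad place."))
```

Categories with no matching phrase are simply omitted from the result, so an empty dictionary means no risk keywords were found.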
18. Objective 2: Identify Breach Posts
Pre-trained models were used to detect posts with PII or profanity.
Steps:
1) Detecting PII by utilizing pre-trained/off-the-shelf models for named
entity recognition and regex-based detection. Detects text related to
people, organizations, locations, dates, times, email addresses, phone
numbers, and street addresses.
2) Detecting Profanity by using an off-the-shelf regex-based model.
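The regex-based part of this step can be sketched as follows. The patterns and the placeholder word list are simplified assumptions written for illustration, not the off-the-shelf models the team used. Named entity recognition (people, organizations, locations) needs a pre-trained model and is not shown.

```python
import re

# Simplified, hypothetical patterns; production patterns would be stricter.
PII_PATTERNS = {
    "Emails": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "Phone Numbers": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Placeholder word list standing in for a real profanity lexicon.
PROFANITY_PATTERN = re.compile(r"\b(?:darn|heck)\b", re.IGNORECASE)


def find_breaches(text):
    """Return {category: [matches]} for regex-detectable PII and profanity."""
    found = {}
    for category, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[category] = matches
    profanity = PROFANITY_PATTERN.findall(text)
    if profanity:
        found["Profanity"] = profanity
    return found


print(find_breaches("Email me at jane@example.com or call 0412 345 678, darn it."))
```

Regexes suit PII with a fixed shape such as email addresses and phone numbers, while names and places vary too much for patterns alone, which is why the slide pairs them with named entity recognition.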
20. Deploying in Production
A REST API was built in Flask to enable usage of the solutions created.
The API handles:
• Pre-processing Data
• Feature Engineering
• Generating Model Predictions/Outputs
[Diagram: HSM (Daybreak) sends a request to the API and receives a response.]
21. Deploying in Production
The API has three endpoints that HSM can utilize.
Outputs of the endpoints:
1. Probability a post is risky on a scale of 0-1.
2. Risk keywords in the post.
3. PII categories and words in the post.
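A Flask app exposing three such endpoints might be laid out as in the sketch below. The route paths, function names, and stub return values are all hypothetical; only the use of Flask, the share_content/created_at request fields, and the 'Prediction Risk' response key come from the slides. The three helper functions are stubs standing in for the real model and detectors.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


# Stubs standing in for the real model and detectors (assumptions).
def predict_risk(text, created_at):
    return 0.0


def find_risk_keywords(text):
    return {}


def find_pii(text):
    return {}


def read_post():
    """Parse the JSON body; return (payload, None) or (None, error response)."""
    payload = request.get_json(silent=True) or {}
    if "share_content" not in payload:
        return None, (jsonify({"error": "share_content is required"}), 400)
    return payload, None


@app.route("/risk-probability", methods=["POST"])  # hypothetical path
def risk_probability():
    payload, error = read_post()
    if error:
        return error
    score = predict_risk(payload["share_content"], payload.get("created_at"))
    return jsonify({"Prediction Risk": score})


@app.route("/risk-keywords", methods=["POST"])  # hypothetical path
def risk_keywords():
    payload, error = read_post()
    if error:
        return error
    return jsonify(find_risk_keywords(payload["share_content"]))


@app.route("/pii", methods=["POST"])  # hypothetical path
def pii():
    payload, error = read_post()
    if error:
        return error
    return jsonify(find_pii(payload["share_content"]))
```

Keeping one endpoint per output lets the Daybreak backend call only the checks it needs for a given post.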
22. Examples of API Output: Endpoint #1
Probability Risk: Probability a post is risky on a scale of 0-1.
Request:
{
"share_content": "I feel like things are starting to turn around for me.",
"created_at": "2020-01-01 09:17:42"
}
Response:
{'Prediction Risk': 0.01}
23. Examples of API Output: Endpoint #2
Risk Keywords: Risk keywords in the post.
Request:
{
"share_content": "I drank too much last night and am now in a bad place.",
"created_at": "2019-11-24 08:18:31"
}
Response:
{
'DUI': ['drank too much'],
'Mental Health': ['bad place']
}
24. Examples of API Output: Endpoint #3
PII/Profanity: PII breaches in the post.
Request:
{
"share_content": "Hey guys my name is John, anyone in Sydney want to meet up at Hyde Park this Saturday at noon?",
"created_at": "2019-10-18 02:11:21"
}
Response:
{
'People': ['John'],
'Locations': ['Sydney', 'Hyde Park'],
'Dates': ['this Saturday'],
'Times': ['noon']
}