A Survey as a Graph
1. A Survey As A Graph
Representing survey data in a natural way
Klaus Blass
Consultant – Development Data Group
The World Bank
klaus.blass@yahoo.com
2. Surveys
• Cross-sectional surveys: snapshots
• Longitudinal surveys: repeated at regular intervals
• Household Budget Surveys (HBS)
• Agricultural Production Surveys (APS)
• Population Census
• . . .
3. Surveys
• Paper forms: subsequent data entry
• CAPI (Computer Assisted Personal Interview): tablets; data already in digital format on the server
4. Survey Solutions
• Survey Solutions is a CAPI system developed & maintained by the World Bank
• used in thousands of surveys and censuses in 175 countries
• free software
https://mysurvey.solutions
6. Survey Data
• Tables of
• Households
• Household assets
• Household members
• Revenues & Expenses
• Agricultural plots
• Crops
• . . .
7. Health and Demographic Surveillance System (HDSS)
• Monitoring the state of a population over time
• Demography (age, sex, ethnicity, etc.)
• Health (diseases, mortality)
• Wealth (assets, planted crops, animals, income)
• Migration (immigration & emigration)
8. The Nouna HDSS in Burkina Faso
• Centre de Recherche en Santé de Nouna (CRSN)
• Heidelberg Institute of Global Health (HIGH)
• Rural area with 59 villages
• 14,000 households
• 115,000 people
11. Survey Data - code labels
. . . .
label define q107 1 `"Moquette / parquet"' 2 `"Bois poli"' 3 `"Carreaux"' 4 `"Vinyle"' 5 `"Ciment"' 6 `"Terre battue / Sable"' 9 `"Autre (à préciser)"'
label values q107 q107
label variable q107 `"De quels principaux matériaux est fait le sol de l’habitation principale du ménage ?"'
label variable q107autre `"Quels autres matériaux?"'
label define q108 1 `"Béton"' 2 `"Tuiles"' 3 `"Tôles"' 4 `"Paille/Feuille"' 5 `"Banco / Terre Battue"' 6 `"Autre (à préciser)"'
label values q108 q108
label variable q108 `"Quels sont les principaux matériaux du toit de l’habitation principale du ménage ?"'
label variable q108autre `"Quels autres matériaux?"'
label variable q109__1 `"Quel type de toilettes utilisez-vous?:WC avec chasse eau"'
12. Survey Data
• Data are arranged not for clarity* but for completeness
* from a human reader's point of view
• A single survey variable may occupy a dozen columns
• Data are just codes, labels are in different (rather cryptic) files
• Just loading these data with a LOAD utility is not an option
Data must be loaded with a custom program!
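As a rough illustration of what such a custom program has to do with the label files, the sketch below turns one Stata `label define` line (the format shown on slide 11) into a code-to-description map. The class and method names are hypothetical and not taken from the original loader; the parser also assumes that descriptions contain no double quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original loader.
final class LabelDefinitions {

    // Matches one code/description pair such as: 5 `"Ciment"'
    private static final Pattern PAIR =
            Pattern.compile("(\\d+)\\s+`\"([^\"]*)\"'");

    /** Returns code -> description for a single "label define <name> ..." line. */
    static Map<Integer, String> parse(String line) {
        Map<Integer, String> labels = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            labels.put(Integer.valueOf(m.group(1)), m.group(2));
        }
        return labels;
    }

    public static void main(String[] args) {
        String line = "label define q107 3 `\"Carreaux\"' 4 `\"Vinyle\"' 5 `\"Ciment\"'";
        // prints {3=Carreaux, 4=Vinyle, 5=Ciment}
        System.out.println(parse(line));
    }
}
```

The digits in the variable name (q107) are skipped because a pair only matches when the number is followed by an opening `" quote.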
13. A custom Java loader
Basic structure:
• Instantiate a Bolt driver
• server port
• credentials
• Write a method which
• opens a session
• begins a transaction
• runs a Cypher query
• commits the transaction
14. Example: Loading all households
private final static String MERGE_HOUSEHOLD =
    "MERGE (a:Household {ivkey: $ivkey, id: $id, hhNum: $hhNum, phoneCHM: $phoneCHM}) " +
    "WITH a " +
    "MATCH (c:Compound {ivkey: $ivkey}) " +
    "WITH a, c " +
    "MERGE (a)-[:IN_COMPOUND]->(c) " +
    "RETURN id(a)";

public void build_households(String surveyFile) throws IOException {
    TabFile tf = new TabFile() {
        // Called once for every line of the tab-delimited survey export
        public void onLine(String line) {
            String[] cell = line.split(tab);
            write(MERGE_HOUSEHOLD, parameters(
                "ivkey", cell[0], "id", cell[6], "hhNum", cell[2], "phoneCHM", cell[9]
            ));
        }
    };
    // (the slide ends here; the call that feeds surveyFile to tf is not shown)
}
15. Load all nodes and create relationships
• Villages
• Compounds
• Households
• Assets
• Members
• Immigrations/emigrations
16. The village of Moinsi
Moinsi, the smallest village in the Nouna HDSS: all compounds, the households they contain, and their household members.
17. Bad data
"Never throw away survey data"
Problem: Empty compounds
Solution:
• Relabel empty compounds as GhostCompounds
• They will no longer show up in compound-related queries
match (c:Compound)
optional match (c)-[r:IN_COMPOUND]-(hh:Household)
with c, count(r) as hhcount
where hhcount = 0
set c:GhostCompound
and then
match (g:GhostCompound)
remove g:Compound
18. Reported Deaths
• Members who died should no longer be considered "members"
• But we want to remember which household they belonged to
Relabel them as "Deceased"
19. Code tables
Example: building characteristics
• Only the selected code is saved in the data.
• We want to be able to query this property by code or by description.
• There is no "lookup table" in Neo4j where the description could be looked up from the code.
20. Code tables
• Store code and description in one string
• How to query by either code or (partial) description?
Roll your own function!
21. The Power of User-Defined Functions
• WHERE klaus.codeOf( habitation.floor, '5' )
• WHERE klaus.includes( habitation.floor, 'cement' )
• normalizes all text to lowercase, diacritics removed
• allows for slight divergence of spelling (Levenshtein distance <= 1)
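The slides do not show how code and description are combined into one string, so the following is only a sketch under an assumed format of "<code>: <description>" (for example "5: Ciment"). The class name and the pack and descriptionOf helpers are hypothetical; codeOf mirrors the name of the function called above but is not the original implementation.

```java
// Sketch only: the storage format "<code>: <description>" is an assumption.
final class CodedValue {

    /** Builds the combined string, e.g. pack(5, "Ciment") gives "5: Ciment". */
    static String pack(int code, String description) {
        return code + ": " + description;
    }

    /** True when the stored value carries exactly the given code. */
    static boolean codeOf(String stored, String code) {
        if (stored == null || code == null) return false;
        int sep = stored.indexOf(':');
        String storedCode = (sep < 0 ? stored : stored.substring(0, sep)).trim();
        return storedCode.equals(code.trim());
    }

    /** Returns the description part, or an empty string if there is none. */
    static String descriptionOf(String stored) {
        if (stored == null) return "";
        int sep = stored.indexOf(':');
        return sep < 0 ? "" : stored.substring(sep + 1).trim();
    }
}
```

Comparing the whole code before the separator, rather than using a prefix test, keeps code '5' from matching a stored value such as "15: ...".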
22. User-Defined Functions
klaus.includes( habitation.floor, 'cement' )
public Boolean includes(
        @Name("String to search") String s1,
        @Name("keyword") String keyword
) {
    s1 = normalize(s1);
    keyword = normalize(keyword);
    // split on whitespace, "/" or "-"
    String[] words = s1.split("\\s+|/|-");
    for (String s : words) {
        if (LevenshteinDistance(s, keyword) <= 1) return true;
    }
    return false;
}
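The function above relies on two helpers, normalize and LevenshteinDistance, whose bodies are not shown on the slides. A minimal sketch of how they might look, using only the Java standard library, is given below; the class name is hypothetical and the original implementations may differ.

```java
import java.text.Normalizer;
import java.util.Locale;

// Sketch of the two helpers; not the original code from the talk.
final class TextMatch {

    /** Lowercases the text and strips diacritics ("Béton" becomes "beton"). */
    static String normalize(String s) {
        if (s == null) return "";
        // NFD splits accented letters into base letter + combining mark,
        // then \p{M} removes the combining marks.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT).trim();
    }

    /** Classic dynamic-programming edit distance, kept to two rows of memory. */
    static int LevenshteinDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With these helpers, the French label "Ciment" normalizes to "ciment", which is one substitution away from the keyword "cement", so the threshold of 1 is what lets the English query term find it.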
23. User-Defined Functions
More user-defined functions:
• Date functions for partial dates
using own assumptions about missing components
• Matching similar text (specify Levenshtein distance)
klaus.similar('solar', 'Solaire', 2) returns true
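The slides do not say which assumptions the partial-date functions make. Purely as an illustration, the sketch below completes a date given as "yyyy", "yyyy-MM" or "yyyy-MM-dd" by falling back to mid-year and mid-month defaults. The input format, the default values, and all names are assumptions, not the author's choices.

```java
import java.time.LocalDate;

// Illustrative sketch; defaults and input format are assumptions.
final class PartialDates {

    static final int DEFAULT_MONTH = 7;  // assumed: mid-year when the month is missing
    static final int DEFAULT_DAY = 15;   // assumed: mid-month when the day is missing

    /** Accepts "yyyy", "yyyy-MM" or "yyyy-MM-dd" and fills in missing components. */
    static LocalDate complete(String partial) {
        String[] parts = partial.trim().split("-");
        int year = Integer.parseInt(parts[0]);
        int month = parts.length > 1 ? Integer.parseInt(parts[1]) : DEFAULT_MONTH;
        int day = parts.length > 2 ? Integer.parseInt(parts[2]) : DEFAULT_DAY;
        return LocalDate.of(year, month, day);
    }
}
```

Day 15 exists in every month, so the default never produces an invalid date; whichever defaults are chosen should be documented, since they affect computed ages and intervals.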
26. Longitudinal Data
Example: pregnancies
• Reported pregnancies differ from one survey round to the next
• Pregnancy events are nodes
• Could later be linked to an outcome (new member, abortion, etc.)
27. Longitudinal Data
Identify each pregnancy by the round in which it was reported, using multiple labels
• Query pregnancies in general
MATCH (m:Member)--(p:Pregnancy)
• Query pregnancies during a specific survey round
MATCH (m:Member)--(p:Pregnancy:Round1)
28. Migration
• Node, property or relationship?
• Migrations as relationships
• Properties:
date
reason
returned?
• Great visualization
• Emigrants & Visitors