We introduce the idea that metadata, including project information, data labels, data characteristics, and indications of valuable use, can be propagated through a data processing lineage graph. Further, finding examples of significant co-occurrence between propagated and original metadata gives us the basis of an interesting kind of search engine, one that recommends data for a given problem statement even in a near cold-start situation.
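As a rough illustration of the propagation idea, the sketch below pushes each dataset's original tags to every dataset derived from it. The function name `propagate_metadata`, the dataset names, and the choice to propagate downstream only are all assumptions made for the example; the source does not specify the direction or the data model.

```python
from collections import defaultdict, deque


def propagate_metadata(edges, original):
    """Copy each dataset's original tags to every dataset derived from it.

    edges:    iterable of (upstream, downstream) pairs in the lineage graph.
    original: dict mapping dataset name -> set of tags attached directly.
    Returns a dict mapping dataset name -> set of tags inherited from ancestors.
    """
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)

    propagated = defaultdict(set)
    for source, tags in original.items():
        # Breadth-first walk over everything reachable from this source.
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for child in children[node]:
                if child not in seen:
                    seen.add(child)
                    propagated[child] |= tags
                    queue.append(child)
    return dict(propagated)


# Hypothetical lineage: two raw tables feed one forecast table.
edges = [
    ("raw_weather", "daily_rainfall"),
    ("daily_rainfall", "sales_forecast"),
    ("raw_sales", "sales_forecast"),
]
original = {"raw_weather": {"rainfall"}, "raw_sales": {"umbrella"}}
propagated = propagate_metadata(edges, original)
```

Here `sales_forecast` ends up carrying both the "rainfall" and "umbrella" tags, which is the kind of pairing that later co-occurrence analysis can pick up.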
12. A NOTE ON IMPLICATIONS
The characteristic indicator matrix is what connects “umbrella” with “rainfall” or “mosquito” with “temperature” + “windspeed”.
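One way such an indicator matrix might be built is to test every (original tag, propagated tag) pair for significant co-occurrence. The sketch below uses the log-likelihood ratio (G²) test on a 2x2 contingency table as the significance test. That choice is an assumption, since the slides do not say which test is used, and the names `g2`, `indicator_matrix`, the threshold of 3.84, and the sample records are all hypothetical.

```python
import math


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def indicator_matrix(records, threshold=3.84):
    """Map each original tag to the propagated tags it significantly co-occurs with.

    records:   list of (original_tags, propagated_tags) pairs, one per dataset.
    threshold: assumed cutoff (chi-square 95% critical value, 1 degree of freedom).
    """
    n = len(records)
    orig_tags = sorted({t for o, _ in records for t in o})
    prop_tags = sorted({t for _, p in records for t in p})
    indicators = {}
    for a in orig_tags:
        for b in prop_tags:
            k11 = sum(1 for o, p in records if a in o and b in p)
            k12 = sum(1 for o, p in records if a in o and b not in p)
            k21 = sum(1 for o, p in records if a not in o and b in p)
            k22 = n - k11 - k12 - k21
            # Keep only pairs that occur together more often than chance.
            positive = k11 * k22 > k12 * k21
            if positive and g2(k11, k12, k21, k22) > threshold:
                indicators.setdefault(a, set()).add(b)
    return indicators


# Hypothetical records: mostly consistent pairings plus a little noise.
records = (
    [({"umbrella"}, {"rainfall"})] * 5
    + [({"umbrella"}, {"temperature"})]
    + [({"sunscreen"}, {"temperature"})] * 5
    + [({"sunscreen"}, {"rainfall"})]
)
indicators = indicator_matrix(records)
```

With this toy data the sparse result links "umbrella" to "rainfall" and "sunscreen" to "temperature", while the two noisy pairings fall below the cutoff.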
15. QUERY PROCESS
The query is expanded based on indicators (when users say “umbrellas” they also mean “rainfall”) as well as on semantic token embeddings computed with BERT.
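The indicator half of this expansion can be sketched in a few lines. The function `expand_query` and the hard-coded `indicators` mapping are hypothetical, and the BERT embedding step is deliberately left out, so this shows only how indicator terms might be appended to a tokenized query.

```python
def expand_query(tokens, indicators):
    """Append indicator terms for each query token, keeping order and skipping duplicates."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        # Sort so the expansion order is deterministic.
        for extra in sorted(indicators.get(token, ())):
            if extra not in seen:
                seen.add(extra)
                expanded.append(extra)
    return expanded


# Hypothetical indicator mapping, mirroring the examples on the slides.
indicators = {
    "umbrella": {"rainfall"},
    "mosquito": {"temperature", "windspeed"},
}
expanded = expand_query(["umbrella", "sales"], indicators)
```

A real pipeline would also need to normalize tokens (for example, mapping "umbrellas" to "umbrella") before the lookup; that step is assumed to happen upstream of this function.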
18. EVALUATION
• Evaluation is difficult due to a lack of public datasets
• Most machine learning examples are truncated to final steps
• Very few non-machine learning pipelines exist outside of toy examples
• Private datasets generally cannot be shared
• Private datasets are still important to use when possible because of their scale
• Evaluation of recommendation engines is a subtle art
• Their purpose is to change behaviors
• Today’s recommendations select tomorrow’s training data
• We aren’t at this point yet; reaching it would be a symptom of success