1. Emerging Role of Social Media in
Data Sharing and Management
Dr. Sudha Ram
McClelland Professor of MIS and Computer Science
Director, INSITE Center for Business Intelligence and
Analytics
Eller College of Management
University of Arizona
Tucson, AZ 85721
Email: ram@eller.arizona.edu
2. Why Share Data
• Requirement from Funding Agency
• Data Hoarding results in wasted
resources
• Slows down scientific progress
10/24/2012
3. Barriers to Data Sharing
• Technical, Social, Legal Barriers
• Privacy
• Competitive nature of R&D process
• Fear of getting “Scooped”
• Protect Intellectual property
• Early sharing may lead to
misunderstanding of results
• Lack of incentives
4. How to Enable Data Sharing
KEY:
Provenance Tracking & Management!
5. What is Provenance?
Lineage, Pedigree, Origin (Kings, Dogs, Aliens)
Enables correct interpretation
Includes:
Who created it
How was it derived
Ownership
Assumptions
…
“Provenance” is an overloaded term
6. Uses of Provenance
• Data Quality
– Evaluate quality of data
– Measure trust in data
• Audit Trail
– Detect data processing errors
– Create a usage log
• Replication Recipe
– Reproduce a dataset
– Repeat to verify/compare
• Attribution/Intellectual Property
– Establish the rights and ownership of data
• Informational
– Discover and re-use datasets
– Browse provenance
7. What exactly is provenance?
• Who
– Creator, publisher, contributor
– Ownership
• When
– Dates (e.g. creation date and modification date)
• Where
– The literature reference where data were first reported
– Current location of storage of the data
• How
– How the data has been derived or transformed
– Experimental procedures or computations that transform data
• Why
– The sequence of ideas leading to an experiment
– Hypotheses an experiment is intended to test
• Which
– Instrument settings
– Parameters of software application
• What
– Creation, transformation, derivation, retirement
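The seven dimensions above can be sketched as a small record type. This is only an illustration, assuming one record per life event; the class name `ProvenanceEvent`, its field names, and the sample values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    """One event in an object's life, described along the seven dimensions."""
    what: str                                      # creation, transformation, derivation, retirement
    who: List[str] = field(default_factory=list)   # creator, publisher, contributor, owner
    when: Optional[date] = None                    # creation or modification date
    where: Optional[str] = None                    # storage location or literature reference
    how: Optional[str] = None                      # procedure or computation applied
    why: Optional[str] = None                      # hypothesis or motivation
    which: Optional[str] = None                    # instrument settings or software parameters

    def missing_dimensions(self) -> List[str]:
        """Return the dimensions that have not been recorded yet."""
        names = ("who", "when", "where", "how", "why", "which")
        return [name for name in names if not getattr(self, name)]


# Hypothetical sample event, for illustration only.
event = ProvenanceEvent(
    what="derivation",
    who=["J. Doe"],
    when=date(2012, 10, 24),
    how="average of two measurements",
)
print(event.missing_dimensions())
```

Listing the unrecorded dimensions is one simple way to see which parts of an event still need manual input.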
8. Structure of Provenance
• Object: Data, Software, Document,
Tweet, Blog….
• Anchor Point: Events in the Object’s Life
– WHAT
• All other elements of Provenance
describe the events
9. Life Events: Birth to Death
[Figure: information lifecycle for a design document or any other object. Life events: Creation, Access, Secure, Review, Approval, Archiving, Storage, Verification, Deletion]
11. How do you track and store
provenance?
• Tracking: Ideally at the source
• Some of it can be automated and some of it
requires manual input
• Store in a database – relational, XML, RDF,
NoSQL
• Provenance is “BIG DATA”
• Provenance accumulates over time and can be
thousands of times larger than the data itself!
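As a rough sketch of the relational option, the following uses Python's built-in `sqlite3` module to record events at the point where they happen and to read back an object's history. The table name `provenance_event`, its columns, the helper functions, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory database for the sketch; a real store would be persistent.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE provenance_event (
        id          INTEGER PRIMARY KEY,
        object_id   TEXT NOT NULL,
        what        TEXT NOT NULL,
        who         TEXT,
        occurred_at TEXT,
        how         TEXT
    )
    """
)


def record_event(conn, object_id, what, who=None, occurred_at=None, how=None):
    """Append one provenance event; events are never updated in place."""
    cur = conn.execute(
        "INSERT INTO provenance_event (object_id, what, who, occurred_at, how) "
        "VALUES (?, ?, ?, ?, ?)",
        (object_id, what, who, occurred_at, how),
    )
    conn.commit()
    return cur.lastrowid


def history(conn, object_id):
    """Return (what, who, occurred_at) for one object, oldest event first."""
    rows = conn.execute(
        "SELECT what, who, occurred_at FROM provenance_event "
        "WHERE object_id = ? ORDER BY occurred_at, id",
        (object_id,),
    )
    return rows.fetchall()


# Hypothetical sample events.
record_event(conn, "dataset-1", "creation", "alice", "2012-01-05")
record_event(conn, "dataset-1", "transformation", "bob", "2012-02-10", "unit conversion")
record_event(conn, "dataset-2", "creation", "alice", "2012-03-01")
print(history(conn, "dataset-1"))
```

Because the log is append-only, it keeps growing even when the underlying data does not, which is how provenance comes to outweigh the data it describes.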
13. Provenance Graph: RayMat
Raytheon Missile Systems
• Data: Cycom381/S2 Uni-Glass 111, Tensile Strength: 759 MPa
• What: Derivation
– When (occurs_at): Jan. 5, 2006
– Where (happens_in): Raytheon, Tucson, AZ
– Who (is_involved_in): Name: John Herold; Role: Creator
– Why (because_of): Project: SM-3 Program
– How (leads_to): Method: Average (exclude outliers)
– Which (is_used_in): Granta Design
– has_input: Test Specimen: S1, Tensile Strength: 762 MPa
– has_input: Test Specimen: S2, Tensile Strength: 756.3 MPa
• What: Creation (Test Specimens S1 and S2)
– How (leads_to): Test specification: SACMA SRM-4; Test temperature: 108 F; Condition of test specimen: Dry
– Who (is_involved_in): Name: AME Material Test Lab; Role: Tester
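A provenance graph like this one can be approximated as a list of (subject, relation, object) triples. The sketch below reuses the relation names and values shown on the slide; the direction of each edge, the node name "derivation", and the helper `objects` are assumptions. It follows the `has_input` edges and recomputes the derived tensile strength as the average of the two specimens.

```python
from statistics import mean

# (subject, relation, object) triples; edge directions are assumed.
triples = [
    ("derivation", "occurs_at", "Jan. 5, 2006"),
    ("derivation", "happens_in", "Raytheon, Tucson, AZ"),
    ("John Herold", "is_involved_in", "derivation"),
    ("derivation", "has_input", "S1"),
    ("derivation", "has_input", "S2"),
]

# Tensile strength of each test specimen in MPa, as given on the slide.
tensile_strength_mpa = {"S1": 762.0, "S2": 756.3}


def objects(subject, relation):
    """Return every object linked to `subject` by `relation`, in insertion order."""
    return [o for s, r, o in triples if s == subject and r == relation]


inputs = objects("derivation", "has_input")
derived = mean(tensile_strength_mpa[name] for name in inputs)
print(inputs, round(derived))
```

Rounded to the nearest MPa, the recomputed average agrees with the 759 MPa value recorded for the derived data.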
14. iPlant Provenance Management
• Provenance can be used for estimating the quality of the data.
– E.g., where the data came from is critical for understanding the quality of data. After a tree file is imported, who modified it and for what purpose (why) is of utmost importance in determining data quality.
A tree file “PDAP.tree.nex” was imported by Nicole
from TreeBASE. It was then modified by Doug. He
changed the name of a species to be consistent
with a naming convention used elsewhere. This
tree file was then modified by Nicole. She
reconciled the tree file with its trait data, and
subsequently removed a species in the tree file.
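The history of the tree file can be written down as an ordered event log and queried for who changed it and why. This is a minimal sketch: the dictionary keys (`action`, `who`, `note`) and the helper names are hypothetical, and the importer is left unset here.

```python
# Ordered history of "PDAP.tree.nex", paraphrased from the narrative above.
tree_history = [
    {"action": "import", "who": None, "note": "imported from TreeBASE"},
    {"action": "modify", "who": "Doug",
     "note": "renamed a species to match a naming convention used elsewhere"},
    {"action": "modify", "who": "Nicole",
     "note": "reconciled the tree file with its trait data"},
    {"action": "modify", "who": "Nicole",
     "note": "removed a species from the tree file"},
]


def modifications(history):
    """Return (who, note) for every change made after the import."""
    return [(e["who"], e["note"]) for e in history if e["action"] == "modify"]


def editors(history):
    """Return the distinct people who modified the file, in first-seen order."""
    seen = []
    for who, _ in modifications(history):
        if who not in seen:
            seen.append(who)
    return seen


for who, note in modifications(tree_history):
    print(f"{who}: {note}")
print(editors(tree_history))
```

A reviewer judging the quality of the tree file could read this list to see that one species was renamed and another removed before deciding whether to trust the data.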
15. iPlant Provenance Management
• Provides a replication recipe for data
• Enables attribution of the creator/owner of data
Provenance helps us understand how the data
was processed and which software tool was
used to manipulate it. We also need
mechanisms to query and browse the “who”
provenance, since attribution of the
creators/owners of datasets and of
researchers’ discoveries relies primarily on
provenance such as who created and modified
the data.
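One way to picture a replication recipe is as an ordered list of named processing steps that can be replayed on the raw input. The sketch below is illustrative only; the step names, the sample values, and the `replay` helper are hypothetical and are not part of iPlant.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[List[float]], List[float]]]

# Recorded recipe: the name of each step and the function it applied.
recipe: List[Step] = [
    ("drop_negative", lambda xs: [x for x in xs if x >= 0]),
    ("round_1dp", lambda xs: [round(x, 1) for x in xs]),
    ("sort", sorted),
]


def replay(raw: List[float], recipe: List[Step]):
    """Re-apply every recorded step in order; return the result and the step names."""
    data = list(raw)
    applied = []
    for name, step in recipe:
        data = step(data)
        applied.append(name)
    return data, applied


raw = [3.14, -1.0, 2.72]
reproduced, applied = replay(raw, recipe)
print(reproduced, applied)
```

Comparing `reproduced` with the published dataset is the "repeat to verify" use of provenance, and the list of applied steps documents which tools touched the data.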
18. Who? – Corresponds to the person who created or updated a page
What? – Details of the change made to the page
When? – Corresponds to the time of creation or update of the page
[Screenshot labels: “Tooltip”; “The page that is created or updated”]
19. Lessons from Social Media/Web2.0
• “Google Analytics” Philosophy for Provenance
Management
• Dashboard for extracting, viewing and drilling
down into provenance for many different
purposes
• Establish Institutional Policies and Reward
Systems
• Mining through provenance to explore patterns
• Crowdsourcing via Wikis, Blogs, Dropbox,
Discussion Forums to enable sharing.
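The dashboard and mining ideas above can be sketched as two small operations on a provenance log: summarizing along one dimension and drilling down with filters. The log entries and the function names `summarize` and `drill_down` are hypothetical.

```python
from collections import Counter

# Hypothetical provenance log, one dictionary per event.
log = [
    {"who": "alice", "what": "creation", "object": "dataset-1"},
    {"who": "bob", "what": "transformation", "object": "dataset-1"},
    {"who": "alice", "what": "transformation", "object": "dataset-2"},
    {"who": "alice", "what": "access", "object": "dataset-1"},
]


def summarize(log, dimension):
    """Count events per value of one dimension (e.g. "who" or "what")."""
    return Counter(event[dimension] for event in log)


def drill_down(log, **filters):
    """Return the events whose fields match every given filter."""
    return [e for e in log if all(e.get(k) == v for k, v in filters.items())]


by_who = summarize(log, "who")
alice_transformations = drill_down(log, who="alice", what="transformation")
print(by_who.most_common())
print(alice_transformations)
```

Counts per contributor of this kind could also feed the reward systems mentioned above, since they make each person's contributions visible.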
20. Good Provenance Management can help
remove barriers
• Data Quality
– Evaluate quality of data
– Measure trust in data
• Audit Trail
– Detect data processing errors
– Create a usage log
• Replication Recipe
– Reproduce a dataset
– Repeat to verify/compare
• Attribution/Intellectual Property
– Establish the rights and ownership of data
• Informational
– Discover and re-use datasets
– Browse provenance