Hasler Stiftung SmartWorld Workshop, June 19, 2014, Thun, Switzerland
Reclaim your Digital Life
Motivation (1/3)
Commoditization of digital equipment
■ Desktops, laptops, netbooks, mobile phones,
tablets, e-book readers, set-top boxes, personal
GPSs, digital cameras, TVs, etc.
Fragmentation of information across devices
Motivation (2/3)
The story of my life...
■ Where are the pictures of my niece’s birthday?
■ How should I consolidate/backup my emails?
Fortunately there’s the cloud, right?
Motivation (3/3)
2014 twist on Personal Information Management:
lifelogging, health-monitoring
■ Everylog, Memoto, Google Glass, Nike's FuelBand,
Fitbit, Samsung Gear Fit & competitors...
➡Urgent need to index & integrate continuous personal
feeds for automated processing
Problem Definition
Personal digital information is today fragmented
and externalized
➡“Each site is a silo, walled off from the others…”
[TBL 10.2010]
■ Data partitioning
■ Loss of governance
How can people automatically reclaim and
meaningfully organize their digital information,
dispersed online and across various devices, to
generate useful digital memories?
MEM0R1ES...
...a highly-available, secure, scalable, and semantically-
rich platform to extract, preserve, integrate and expose
personal information for a smarter world
The MEM0R1ES Team
Prof. Dr. Philippe Cudré-Mauroux
Prof. Dr. Karl Aberer
Prof. Dr. Maria Sokhn
Julien Tscherrig
Joël Dumoulin
Michele Catasta
Dr. Gianluca Demartini
Alberto Tonon
Last Year…
Device & Service Wrappers [EIA-FR]
■ Generic Wrapper Architecture:
SMTP, Gmail, Google Drive, Facebook, DBpedia,
Flickr, LinkedIn
■ Browser wrapper: [EPFL]
Lifelogging rich features (context, user activities
and focus, etc.) from the browser
Storage Infrastructure
■ Multi-purpose, declarative & elastic storage
layer [UNIFR]
Result from the Digital Reclaiming
➡Heterogeneous Graphs of Entities
Information duplication
Sometimes with different facets
Missing information
Today’s Focus
Meaningful information integration from
heterogeneous graphs of entities
1. Entity Search (AOR)
2. Entity Typing (TRank)
3. Entity Clustering (ZenCrowd, MemorySense,
PREDIct)
4. Entity Elicitation (Transactive Search)
Use-case: leveraging digital mem0r1es from a
conference participation (demonstrators)
1. Entity Search [UNIFR]
Main idea: combine unstructured and
structured search to find relevant entities in
the graph
■ Inverted index to locate first candidates
■ Graph queries to refine the results
■ Graph traversals (queries on object properties)
■ Graph neighborhoods (queries on data type properties)
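The two-step idea above (inverted index first, graph queries second) can be sketched as follows. This is a minimal illustration, not the project's implementation: the toy graph, the `hop_weight` parameter and the additive scoring are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy graph: string values are data type properties,
# list values are object properties pointing to other entity ids.
GRAPH = {
    "person:alice": {"name": "Alice Martin", "attended": ["conf:www2014"]},
    "person:bob": {"name": "Bob Keller", "attended": ["conf:www2014"]},
    "conf:www2014": {"name": "WWW 2014 conference", "city": "Seoul"},
}

def build_inverted_index(graph):
    """Map each lowercase token of every literal value to the entities using it."""
    index = defaultdict(set)
    for entity, props in graph.items():
        for value in props.values():
            if isinstance(value, str):
                for token in value.lower().split():
                    index[token].add(entity)
    return index

def search(query, graph, index, hop_weight=0.5):
    """Return (entity, score) pairs, best first."""
    scores = defaultdict(float)
    # Step 1: the inverted index locates the first candidates.
    for token in query.lower().split():
        for entity in index.get(token, ()):
            scores[entity] += 1.0
    # Step 2: one-hop traversal over object properties refines the results by
    # giving entities linked to a candidate a damped share of its score.
    candidates = dict(scores)
    for entity, props in graph.items():
        for value in props.values():
            if isinstance(value, list):
                for target in value:
                    if target in candidates:
                        scores[entity] += hop_weight * candidates[target]
                    if entity in candidates:
                        scores[target] += hop_weight * candidates[entity]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

results = search("www conference", GRAPH, build_inverted_index(GRAPH))
```

Here the conference matches the keywords directly, while both attendees are reached only through the graph step, which is the kind of result a purely keyword-based search would miss.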
1. Entity Search
➡ up to 25% MAP improvement over BM25!
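For readers unfamiliar with the metric, Mean Average Precision (MAP) can be computed as in the sketch below. The rankings and relevance judgments are made-up inputs for illustration only, not data from the evaluation.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical rankings for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
map_score = mean_average_precision(runs)
```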
2. Entity Typing [UNIFR+EPFL]
Entities can have many types (facets)
■ Which fine-grained types are most relevant given
the context?
[Figure: type hierarchy of an example entity, with the types Thing, Agent, Person, Living People, American Billionaires, People from King County, People from Seattle, Windows People, American People of Scottish Descent, Harvard University People, American Computer Programmers, American Philanthropists]
2. Entity Typing
Integrates Big Data types from the Web of Data
■ Tree of 447’260 types
■ Rooted on <owl:Thing>
■ Depth of 19
Ranks relevant types by analyzing the context
■ Textual context
■ Graph context
■ Decision trees
■ Linear regression
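A rough sketch of context-aware type ranking is given below. It is only an illustration under strong assumptions: the small `PARENT` tree, the token-overlap context feature and the hand-set weights `w_depth` and `w_context` are hypothetical, whereas the slide mentions weights learned with decision trees or linear regression.

```python
# Hypothetical slice of a type tree, stored as child -> parent.
PARENT = {
    "Agent": "Thing",
    "Person": "Agent",
    "American Computer Programmers": "Person",
    "American Philanthropists": "Person",
}

def depth(type_name):
    """Number of edges between a type and the root of the tree."""
    d = 0
    while type_name in PARENT:
        type_name = PARENT[type_name]
        d += 1
    return d

def rank_types(candidate_types, context, w_depth=0.5, w_context=1.0):
    """Score each type as a linear combination of two features:
    its depth in the tree (deeper means more specific) and the number
    of tokens it shares with the textual context."""
    context_tokens = set(context.lower().split())
    scored = []
    for t in candidate_types:
        overlap = len(set(t.lower().split()) & context_tokens)
        scored.append((t, w_depth * depth(t) + w_context * overlap))
    return sorted(scored, key=lambda ts: (-ts[1], ts[0]))

ranked = rank_types(
    ["Thing", "Person", "American Computer Programmers", "American Philanthropists"],
    "a charity gala honoring philanthropists",
)
```

With this context, the philanthropy-related type is ranked first and the generic root type last; a different context sentence would reorder the two fine-grained types.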
3. Entity Clustering
Several efforts to cluster entities into
meaningful groups depending on context:
PREDIct [EIA-FR]
■ Extracts Web
information through
wrappers
■ Models topics through
Latent Dirichlet Allocation
■ Predictions based on
topic trends
3. Entity Clustering
MemorySense [EPFL]
■ Clusters mobile data into
macro-activities
■ Leverages location, machine-
learning and an activity ontology
B-hist [UNIFR+EPFL+EIA-FR]
■ Better browser history
clustering through
entity typing and
machine-learning
4. Entity Elicitation [EPFL+UNIFR]
Filling the gaps in mem0r1es entity graphs
■ e.g., ‘who also attended WWW03 last year?’
■ Traditional methods (Web crawling, machine-
learning, micro-task crowdsourcing) are insufficient
■ Errors and lack of discriminative features (➘precision)
■ Lack of public data (➘recall)
4. Entity Elicitation
Adapting the concept of transactive memories
(group memories) from psychology
➡Transactive search methods to elicit
information
■ Social network analysis (to direct the search)
■ Crowdsourcing (to get the information)
■ 46% improvement (F1) over best alternative
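The two ingredients above can be illustrated with a small simulation. Everything in it is hypothetical: shared event attendance stands in for social network analysis, the replies are hard-coded rather than collected from people, and the vote threshold is arbitrary. The `f1` helper shows how precision and recall combine into the reported measure.

```python
from collections import Counter

# Hypothetical record of which events each person attended.
ATTENDED = {
    "asker": {"WWW", "ISWC", "SIGIR"},
    "carol": {"WWW", "ISWC"},
    "dave": {"WWW"},
    "erin": {"VLDB"},
}

def pick_contacts(asker, attended, k):
    """Direct the search: choose up to k people sharing the most events with the asker."""
    mine = attended[asker]
    others = [(len(mine & evs), name) for name, evs in attended.items() if name != asker]
    others.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for shared, name in others[:k] if shared > 0]

def aggregate(answers, min_votes=2):
    """Keep only the names mentioned by at least min_votes respondents."""
    votes = Counter(name for reply in answers.values() for name in set(reply))
    return {name for name, n in votes.items() if n >= min_votes}

def f1(predicted, truth):
    """Harmonic mean of precision and recall."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

contacts = pick_contacts("asker", ATTENDED, k=2)
# Simulated replies to "who also attended?" from the selected contacts.
answers = {"carol": ["frank", "grace"], "dave": ["frank", "heidi"]}
elicited = aggregate(answers)
score = f1(elicited, {"frank", "grace"})
```

Requiring two votes keeps precision high at the cost of recall in this toy run, which mirrors the precision and recall trade-off the previous slide describes.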
Demo
Use-case on scientific conference memories
Based on 4 demonstrators:
■ Visualizing clustered mobile data (MemorySense)
■ Information elicitation through Transactive Search
(Hippocampus)
■ Browsing clustered Web history (B-hist)
■ Clustering and prediction of topics based on
extracted information (PREDIct)
Dissemination (1)
Papers at top research venues:
■ Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and structured search for ad-hoc object retrieval. SIGIR 2012.
■ Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer: TRank, Ranking Entity Types Using the Web of Data. International Semantic Web Conference ISWC 2013.
■ Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, Philippe Cudré-Mauroux: Hippocampus, answering memory queries using transactive search. WWW 2014.
■ Michele Catasta, Alberto Tonon, Vincent Pasquier, Gianluca Demartini, Karl Aberer, Philippe Cudré-Mauroux: B-hist, Better Entity-Centric Search over Personal Web Browsing History. International Semantic Web Conference ISWC 2014.
■ Michele Catasta, Alberto Tonon, Gianluca Demartini, Jean-Eudes Ranvier, Karl Aberer, Philippe Cudré-Mauroux: B-hist, Entity-Centric Search over Personal Web Browsing History. Journal of Web Semantics, 2014 (to appear).
■ Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, Philippe Cudré-Mauroux: TransactiveDB: Tapping into Collective Human Memories. PVLDB, 2014 (in revision).
■ Julien Tscherrig, Philippe Cudré-Mauroux, Elena Mugellini, Omar Abou Khaled, Maria Sokhn: SemantiConverter: A Flexible Framework to Convert Semi-Structured Data into RDF. Submitted for publication.
Dissemination (2)
Android app on Google Play
Open-source release of most components
■ https://github.com/MEM0R1ES
ISWC 2013 Best-Paper Award nominee (TRank)
Semantic Web Challenge 2013 Finalist (B-hist)
Wall Street Journal mention (B-hist, 30.10.2013)
Technology transfer
■ Extracting entities (Google Zurich)
■ MemorySense (Samsung)
■ TRank (Yahoo!)
Start-up (?)
Current Research Directions
Modelling tail-entities
Transactive DB operator
Automatic capture of important memories
■ Google Glass
Software integration
Conclusions
Exciting project
■ Important, timely societal issues
■ Fundamental research questions
■ Data Storage, Data Integration, Data Clustering, Data Elicitation
Stimulating collaboration
■ Involving 3 (4) institutions
➡Thanks to all partners for their contributions!
A number of tangible results already
■ Open-source software components
■ Publications at top research venues
■ Industry transfer
Thanks a lot for your attention,
… and many thanks to the Hasler Stiftung
for funding this project!
Questions?