Indexing and Searching Cross Media Content in a Social Network



  1. Indexing and Searching Cross Media Content in a Social Network
     Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi
     University of Florence, Department of Systems and Informatics, Distributed Systems and Internet Technology Laboratory
     ECLAP Conference, May 7-9, 2012
  2. ECLAP Social Network
       • ECLAP is a Digital Library on Performing Arts connected with Europeana
       • ECLAP is a Social Network (blogs, forums, comments, tagging, voting, …)
  3. Goals/Requirements
     Develop an indexing/searching solution for the ECLAP Social Network allowing:
       • Indexing multilingual cross-media content metadata and data (e.g. documents)
       • Indexing portal blogs, forums, events, group pages, comments, etc.
       • Efficient multilingual search (keyword search and advanced search) supporting:
           • misspelled words (e.g. shespeare)
           • partial word search
       • Sorting and filtering search results
       • Re-indexing the whole data without blocking the system
       • Logging and monitoring user activity
       • …
     Evaluate the indexing/searching service
  4. ECLAP Data Model
     [Figure: UML class diagram relating Group/Channel, TaxonomyTerm, Content and Comment; the metadata classes Performing Arts Metadata, Dublin Core and Technical; and the content classes Blog, WebPage, Forum, Object, Playlist, Document, Collection, Annotation, AVObject, Image, Video and Audio, with 0..n, 1..n and 1..2 multiplicities]
  5. 5. Indexing Indexing & Search system  Based on Apache Solr Multilingual aspects  Translate the metadata or translate the query?  We use metadata translation Indexing schema  Dublin Core + DCTerms (multi language)  Performing Arts  Technical (provider, content type, GPS, IPR, duration, quality, …)  Groups associations (multi language)  Taxonomy associations (multi language)  Comments & multi language tags  FullText of the textual digital resources
  6. Indexing
     Columns: DC (ML), Tech, Perf. Arts, Full Text, Taxonomy/Group (ML), Comment/Tags (ML), Votes
       • Audio/Video/Image: Y Y Y Y Y Y
       • Document (pdf, doc, …): Y Y Y Y Y Y Y
       • CrossMedia (html, MPEG21, …): Y Y Y Y Y Y Y
       • Aggregations (playlist, collection, …): Y Y Y Y Y Y
       • Info text (blog, web pages, forum, events, …): (Y) Y Y
  7. 7. Indexing Multilingual fields  title_en, title_it, title_de, title_fr, title_ca, … Catch-all fields Component fields Boost Weight text pdf_*, doc_*, ppt_*, htm_*, … 1.0 body body_* 0.5 title title_* 3.1 description description_* 2.0 contributor contributor_* 0.8 subject subject_* 1.5 taxonomy taxonomy_* 0.8 PerformingArts PerformingArtsMetadata.# 1.0
  8. 8. Indexing Re-indexing  In case of new indexing schema or index corruption the search system should not be blocked  The re-indexing is done on a separete indexing machine while the production system uses the actual index  During re-index the new uploaded/modified content is marked to be reindexed when the new index is put in production
  9. 9. Searching Full text search  Uses the catch all fields to search for keywords in most important fields in all languages (title, description, text, body, subject,…) Fuzzy search  Allows matching mistyped words Deep search  Allows searching for partial words Relevance & boosting of terms
  10. 10. Searching Faceted search
  11. 11. Searching Advanced search
  12. Search Facility Assessment
      Analysis performed over 3 months
        • 11294 visits (6032 unique visits)
        • 62768 page views (avg 5.76 pages per visit)
        • 7.29 minutes spent on the portal
        • 30502 content accesses (view, play and download)
  13. Search Facility Assessment

      Users                   | # Full Text Query | # Faceted Query | # Last Posted List | # Featured List | # Popular List
      Simple registered       | 323               | 24              | 4                  | 22              | 17
      Partners                | 1094              | 21              | 27                 | 19              | 9
      Anonymous               | 2634              | 147             | 234                | 302             | 213
      Total                   | 4051              | 192             | 265                | 343             | 239
      Clicks after query/list | 1564              | 200             | 318                | 2799            | 231
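The totals row and the relation between requests and follow-up clicks can be checked with a few lines of arithmetic. This sketch only reuses the counts from the table; the ratio "clicks per request" is a derived figure that does not appear on the slide.

```python
# Counts copied from the table, in column order.
COLUMNS = ["full_text", "faceted", "last_posted", "featured", "popular"]
BY_USER_TYPE = {
    "simple registered": [323, 24, 4, 22, 17],
    "partners": [1094, 21, 27, 19, 9],
    "anonymous": [2634, 147, 234, 302, 213],
}
CLICKS_AFTER = [1564, 200, 318, 2799, 231]

# Column-wise sums reproduce the "Total" row.
totals = [sum(column) for column in zip(*BY_USER_TYPE.values())]

# Derived ratio: clicks recorded after each query or list request.
clicks_per_request = {
    name: round(clicks / total, 2)
    for name, clicks, total in zip(COLUMNS, CLICKS_AFTER, totals)
}

print(totals)
print(clicks_per_request)
```

On these numbers the featured list draws about eight clicks per request, while a full text query is followed by a click roughly four times out of ten.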
  14. Search Facility Assessment
      [Figure: click order distribution, first page]
  15. 15. Conclusions Solution allows indexing multilingual metadata and texts Searching & filtering results Search facility assessment show that search is a used feature
  16. Context & Assessment
      Context
        • Social Network
        • User and content items
        • Content distribution portal
        • Video on demand portal
        • Archive, digital library, Performing Arts
      Assessment
        • User behavior
            • Log user actions on the Web portal
        • User happiness
            • Measure the level of user satisfaction about the exposed services
  17. Logging User Profile
      User Profile
        • Registered or anonymous, uid (user id)
        • Timestamp YY-mm-dd hh:mm:ss
        • IP address, Proxy type, etc.
        • Platform (OS, Browser)
        • GeoIP data (Country, Region, City)
        • Friends, connections
        • Betweenness, Eccentricity
        • Joined groups
        • User preferred contents
  18. Understanding User Behavior
      Online survey
        • A simple module on the right side of the portal
        • Presenting 3-4 questions per topic (depending on the portal section currently visited)
      Stat Drupal Modules
        • Custom implemented modules
        • Log user activity
        • Keep track of and depict the main figures about portal activity
        • Can be filtered by date, user, type of content, group, type of activity (content enrichment, social promotion, networking, etc.)
      Google Analytics
  19. Understanding User Behavior
      Top Metrics
        • Avg # Visits/User
        • Avg # Queries/User
        • Avg # Clicks/User
        • Avg Visit duration
        • Avg Query length
        • Query refinement rate
        • Next Page Click Rate
        • Back Page Click Rate
        • Frequency of searching (once/day, week, etc.)
        • Success of searching (assessment...)
        • …
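Several of these metrics can be derived from a flat event log. The sketch below works on a hypothetical log of `(user, event type, payload)` tuples, which is not the ECLAP log format. It also assumes two definitions the slides leave open: query length is counted in words, and a query counts as refined when the same user's next event is another query.

```python
from collections import defaultdict

# Hypothetical log entries: (user id, event type, payload).
LOG = [
    ("u1", "query", "dario fo"),
    ("u1", "query", "dario fo theatre"),  # refinement: no click in between
    ("u1", "click", "doc-17"),
    ("u2", "query", "opera"),
    ("u2", "click", "doc-3"),
    ("u2", "click", "doc-8"),
]


def summarize(log):
    """Compute a few per-user search metrics from the event log."""
    per_user = defaultdict(list)
    for user, kind, payload in log:
        per_user[user].append((kind, payload))

    queries = [payload for _, kind, payload in log if kind == "query"]
    clicks = sum(1 for _, kind, _ in log if kind == "click")
    # Count queries whose immediate successor (same user) is another query.
    refined = sum(
        1
        for events in per_user.values()
        for (kind, _), (next_kind, _) in zip(events, events[1:])
        if kind == "query" and next_kind == "query"
    )
    users = len(per_user)
    return {
        "avg_queries_per_user": len(queries) / users,
        "avg_clicks_per_user": clicks / users,
        "avg_query_length": sum(len(q.split()) for q in queries) / len(queries),
        "query_refinement_rate": refined / len(queries),
    }


stats = summarize(LOG)
print(stats)
```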
  20. Logging User Behavior
      Logging user activities on the portal
        • Downloads/Views
        • Queries
        • Anonymous/Registered portal accesses (login/logout)
        • Adding/Updating/Deleting digital contents
        • Menu clicks
        • Content Upload
        • Content Management
        • Social Promotion & Networking
  21. Logging User Behavior
      Content Accesses (Download/View)
        • Axmedis Content
            • Pdf, Document, Video, Playlist, Slide, Flash, Image, Excel, Archive, Audio, Tool, Collection
        • Drupal Content
            • Page, Blog, Event, Forum, Group, Comment
      Distribution of Content Access per
        • Access Type, Portal, Platform, Section, Locale, Country, Region, City, Axoid, Nid, Content Type, Partner, User, Timestamp
  22. Logging User Behavior
      Queries (Simple, Faceted, Advanced)
        • Distribution of queries per
            • User, Content type, Device, IP, User Agent, Query Type, Country, Region, City, Locale, Filter (faceted)
      Query Cloud
      Keyword Cloud
      IPR Wizard
        • Definition and usage of IPR Models
      Metadata Editor
        • Access and usage
        • Add, Edit metadata
      Video Annotations
        • Personal content
        • Other users' content
  23. Logging User Behavior
      Social Promotion & Networking
        • Analysis of
            • Eccentricity
            • Betweenness
            • Connections
        • Creation, Access of Public/Private Web Pages
        • Activity on Forums, Blogs, Groups or between users
        • New Contents
        • Comments to Objects/Web Pages
        • Invited People
        • Featured Objects
        • Recommendations, suggested content
        • Export/Import of links to/from other SN
        • Private Messages
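Eccentricity, one of the network measures listed above, is the largest shortest-path distance from a user to any other reachable user. A minimal sketch using breadth-first search follows; the friendship graph `FRIENDS` is invented for illustration and does not come from ECLAP data.

```python
from collections import deque

# Hypothetical undirected friendship graph, keyed by user id.
FRIENDS = {
    "ann": {"bob"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"bob"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}


def eccentricity(graph, source):
    """Longest shortest-path distance from source to any reachable user."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        user = queue.popleft()
        for friend in graph[user]:
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    return max(distance.values())


for user in sorted(FRIENDS):
    print(user, eccentricity(FRIENDS, user))
```

Users with low eccentricity (here `bob` and `dan`) sit near the centre of the network, while users with high eccentricity are at its edge.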
  24. Logging User Behavior
      Menu Clicks
        • Distribution of clicks per
            • User, IP, Locale, Timestamp, etc.
        • LAST POSTED, FEATURED, CALENDAR, ADVANCED SEARCH, UPLOAD AND INGEST, POPULAR, MY CONTENT, MY GROUPS, MY COLLEAGUES, GET AFFILIATED, TERMS OF USE, PRIVACY POLICY, TOP RATED, COURSES, LESS POPULAR, UPLOAD NEW CONTENT, etc.
      Ranking/Voting
        • # of ranked items
        • Distribution per
            • User, IP, Locale, Timestamp, etc.
      QR Code
        • Access from Mobile Devices
      Workflow
        • Distribution of Workflow Type
      Content Upload
        • Distribution of uploads per
            • User, Partner, Timestamp
  25. Content Access
      September 1st – November 30th 2011

      Affiliation                         | # View/Play | # Download
      DSI                                 | 46          | 0
      Not partners/Affiliated             | 1292        | 14
      Partners/Affiliated (except DSI)    | 6712        | 119
      Public Users                        | 21418       | 947

      Affiliation                         | # View/Play | # Download
      DSI                                 | 3           | 0
      Not partners/Affiliated             | 100         | 4
      Partners/Affiliated (except DSI)    | 218         | 11
      Public Users                        | 2225        | 869
  27. Search
      September 1st – November 30th 2011

      Affiliation                         | # Simple Queries | # Faceted Queries
      DSI                                 | 13               | 0
      Not partners/Affiliated             | 323              | 24
      Partners/Affiliated (except DSI)    | 1094             | 21
      Public Users                        | 2634             | 147

      Affiliation                         | # Advanced Queries
      DSI                                 | 0
      Not partners/Affiliated             | 18
      Partners/Affiliated (except DSI)    | 4
  28. Drupal Stat Metrics
      September 1st – November 30th 2011
      [Figure: content access per nid]
  29. Drupal Stat Metrics
      September 1st – November 30th 2011
      [Figure: views by query]
  30. Drupal Stat Metrics
      September 1st – November 30th 2011
      [Figure: content access per platform]
  31. Understanding User Behavior
      Drupal Stats (collapsible menus on the right)
  32. Google Analytics vs Drupal Stats
      Google Analytics
        Pros
          • Traffic source data
          • Bounce rate
          • Recency (since when)
          • Loyalty (how often)
          • Session times
        Cons
          • IP approach: each IP is considered a unique visitor
          • Can't deal with specific actions on the portal (e.g. downloads, queries)
      Drupal Stats
        Pros
          • Identity approach
          • Actions
              • Download
              • User Access
              • Queries
          • Content type filtering
        Cons
          • Can't deal with traffic source data and bounce rate
          • Session time is a raw approximation
  33. Sorting Results
      Sorting by
        • Upload time (date the document was first uploaded)
        • Update time (date the document was last updated)
        • Score (document relevance to the search query)
      Combined with faceting and paging
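Sorting, faceting and paging map onto standard Solr request parameters (`sort`, `facet`, `facet.field`, `start`, `rows`). The sketch below shows how they might be combined in one request. The field names `upload_time`, `update_time` and `content_type` are hypothetical, since the slides do not give the schema names.

```python
from urllib.parse import urlencode

# Hypothetical schema field names for the three sort criteria.
SORT_FIELDS = {
    "upload": "upload_time",
    "update": "update_time",
    "score": "score",  # Solr's relevance pseudo-field
}


def search_params(query, sort_by="score", descending=True, page=0,
                  page_size=10, facet_fields=()):
    """Build Solr request parameters combining sorting, faceting and paging."""
    direction = "desc" if descending else "asc"
    params = [
        ("q", query),
        ("sort", f"{SORT_FIELDS[sort_by]} {direction}"),
        ("start", page * page_size),  # offset of the first result on the page
        ("rows", page_size),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return params


request = search_params("theatre", sort_by="upload", page=2,
                        facet_fields=["content_type"])
print(urlencode(request))
```

A list of pairs is used instead of a dictionary because `facet.field` may be repeated once per facet.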
  34. 34. Suggestions REALTIME, while typing a query suggests similar searches  ecl…  eclap  eclap-de-2-1-1-user  eclap-de-2-2-1-usergroup  …
  35. ECLAP Survey
  36. Indexing/Searching Reqs
      Enriching search experience
        • Results sorting
        • Suggestions
      Large # of contents (~10^4-10^6)
        • External Indexing Service
      Hidden/Private contents management
      Monitoring exceptions
        • Email notifications
      Search engine friendly (Google, Bing, Yahoo, etc.)
        • Content site crawling, HTML dumping
  37. External Indexing Service 1/3
      Set up an external service to avoid overloading the server when building the index
        • Taxonomization
        • Indexing (with exceptions monitoring)
        • Index synchronization
        • Old index replacement with the new one
        • Index updating
        • Old contents cleaning (optional)
  38. External Indexing Service 2/3
      Taxonomization
        • Has a cost: pre-computing
        • Digital content
        • Execution Rule (JS)
        • Indexed with object records

      Taxonomy        | Parent
      Performing Arts | -
      Cinema          | Performing Arts
      Music           | Performing Arts
      Documentary     | Cinema
      Historical      | Cinema
      Classical       | Music
      Pop             | Music

      [Figure: taxonomy tree with Performing Arts at the root, Cinema and Music below it, and the leaves Documentary, Historical, Classical and Pop; an Object is shown with the taxonomy Performing Arts, Cinema, Music, Documentary, Classical]
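Taxonomization can be read as expanding the terms assigned to an object with all of their ancestors, so that a search on a broader term also finds it. The sketch below uses the parent table from the slide. It assumes the example object is tagged with Documentary and Classical, which is one reading of the diagram, and the function name `taxonomize` is hypothetical.

```python
# Parent table copied from the slide; None marks the root term.
PARENT = {
    "Performing Arts": None,
    "Cinema": "Performing Arts",
    "Music": "Performing Arts",
    "Documentary": "Cinema",
    "Historical": "Cinema",
    "Classical": "Music",
    "Pop": "Music",
}


def taxonomize(terms):
    """Return the assigned terms plus all of their ancestors, without repeats."""
    expanded = set()
    for term in terms:
        # Walk up the tree until the root or an already collected term.
        while term is not None and term not in expanded:
            expanded.add(term)
            term = PARENT[term]
    return expanded


print(sorted(taxonomize(["Documentary", "Classical"])))
```

Pre-computing this expansion at indexing time moves the cost away from query time, which matches the "pre-computing" note on the slide.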
  39. External Indexing Service 3/3
      Indexing with exceptions monitoring
        • Real-time notification system
            • Event time and type (add, update)
            • Full stack trace info
            • Customizable recipients
        • Object indexing recovery
            • Resource parse error
            • Metadata indexing
      Index synchronization
        • During external indexing, contents may be updated/added/deleted on the original index
        • Need to update these contents on the index (state flag)
      [Table: state flags with the columns Indexed and External Indexed; values 1, 1, 0, 1]
  40. Search Engine Friendly
      HTML dump service
        • Java external service
        • Periodically invoked by an AXCP rule
        • Full metadata exporting
        • Thumbnail
        • Resource link
        • Multi-language
        • Paginated results
  41. 41. Conclusions Drupal integrated solution for user behavior tracking and analysis  Logging  Stat Data Graph  Online Survey External Indexing Service  Avoids server overloading  HA of query service  Error recovering  Detailed event notifying system  Index Optimization Dumping tool for portal contents (SEO)  Full metadata HTML exporting  Scheduled Service
  42. Future Work
      Keep collecting data
      Deeper data analysis
        • User sessions
            • 1st, 2nd, ..., nth click average user behavior
        • Depict a modular view of the system usage
            • Popularity/Usability for each feature & functionality
        • Social Network Analysis (SNA)
            • Huge population
            • User relationships, connections, friendships
  43. 43. References P. Bellini, I. Bruno, D. Cenni, P. Nesi, "Micro grids for scalable media computing and intelligence on distributed scenarious", IEEE Multimedia, 2011 P. Bellini, I. Bruno, D. Cenni, P. Nesi, M. Paolucci, M. Serena, "Semantic Model for Cultural Heritage Social Network and Cross Media Content for Multiple Devices", Conference of the Italian Association of Artificial Intelligence, Workshop for Cultural Heritage, 15-17 September 2011, Palermo, Italy
  44. Q&A
  45. APPENDIX
  46. Architecture (former)
      [Figure: architecture diagram whose labels include Index Rebuilder, Indexing Rule (JS), SolrJ Client, AXCP Grid Node, Rule Scheduler, Solr XML/HTTP, JSP Indexing Service, Drupal Indexing Module and Searching Module, Apache Solr, Apache Tomcat and Apache HTTP]
  47. Drupal
      What is it?
        • Open source content management platform
        • Developed by Dries Buytaert in 2001
        • Written in PHP
        • Users: The Economist, The White House, …
        • Runs on a Web server (e.g. Apache, IIS) and a database (e.g. MySQL, PostgreSQL)
  48. Apache Lucene
      What is it?
        • High-performance, full-featured text search engine library (indexing and searching documents)
        • Developed by Doug Cutting (2000, SourceForge); joined the Apache Software Foundation in 2001
        • Written entirely in Java
        • Users: Wikipedia, Technorati, Nabble, TheServerSide, Akamai, SourceForge
  49. Apache Lucene
      Features
        • Ranked searching (best results returned first)
        • Powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
        • Fielded searching (e.g., title, author, contents)
        • Date-range searching
        • Sorting by any field
        • Multiple-index searching with merged results
        • Allows simultaneous update and searching
  50. Apache Lucene
      Features
        • Documents added via IndexWriter
        • Document = a collection of fields
        • No config files, dynamic field typing
        • Flexible text analysis: tokenizers, filters
        • Search for documents via IndexSearcher
            • Hits = search(Query, Filter, Sort, topN)
        • Scoring: tf * idf * lengthNorm
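The scoring formula on the slide can be made concrete with a small worked example. The sketch below follows the shape of Lucene's classic TF-IDF similarity (square-root term frequency, logarithmic inverse document frequency, inverse square-root length normalization). It is a simplification: query normalization, coordination factors and boosts are left out, and the exact `idf` formula varies between Lucene versions.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields score higher for the same number of matches."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, doc_freq, num_docs, num_terms):
    """Simplified single-term score: tf * idf * lengthNorm."""
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms)


# A term occurring 4 times in a 16-term field, present in 9 of 10 documents:
# tf = 2, idf = 1 + ln(10 / 10) = 1, lengthNorm = 1 / 4.
print(score(freq=4, doc_freq=9, num_docs=10, num_terms=16))
```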
  51. Apache Solr
      What is it?
        • A full text search server based on Lucene (Lucene sub-project)
        • Developed by Yonik Seeley at CNET Networks (2004), donated to the Apache Software Foundation (2006)
        • Written in Java, deployable as a WAR
        • Users: CNET Reviews, CNET Channel, …
  52. Apache Solr
      Features
        • Advanced full-text search capabilities
        • Optimized for high volume Web traffic
        • Standards-based open interfaces (XML, JSON, HTTP)
        • Web administration interface
        • Server statistics exposed over JMX for monitoring
        • Scalability, efficient replication to other Solr search servers
        • Flexible and adaptable with XML configuration
        • Extensible plugin architecture