Real time semantic search engine for social tv streams

Social TV, the use of social networks to comment on TV programs, is a growing phenomenon. TV channels and brands are turning to social networks for real-time insights about their programs. Understanding the global conversation about a program gives broadcasters and brands useful insights. For broadcasters, acquiring insights while a program is on air enables them to produce new content formats that include the social conversation. For brands, it helps prevent reputation crises and increases the reach of their marketing efforts. Viewers, who increasingly use second-screen devices, should benefit from tools that help them understand opinions about the main content and connect with peers during TV programs or live events.

We present a system that combines natural language processing (Textalytics API) and a scalable semi-structured database/search engine (SenseiDB) to provide semantic and faceted search, real-time analytics, and support for visualizations in this kind of application.

In the first part, we present some of the NLP methods that can be used to tame unstructured big data such as Twitter or Facebook comments. We describe tasks such as text categorization, sentiment analysis, and named entity recognition, and show how this data can be related to external data such as Linked Data points. While the description is general, the examples are illustrated using the Textalytics API.

We then present how this data can be ingested and made available for search in real time using a semi-structured database such as SenseiDB. We cover key features of SenseiDB, including high-performance real-time indexing with simultaneous querying, distribution, and support for full-text and faceted search. We also discuss how facets can be stretched beyond filtering to provide real-time analytics and enable semantic search. Finally, we discuss the advantages, problems, and current limitations of SenseiDB.

Takeaway Points.
- Analyzing and searching text in social streams
- Integrating text analytics services (Textalytics) and a semi-structured database (SenseiDB)
- Key features of SenseiDB

Real time semantic search engine for social tv streams Presentation Transcript

  • 1. Textalytics: Meaning-as-a-Service Real Time Semantic Search for Social TV streams César de Pablo Sánchez Daedalus 8/11 2013 Big Data Spain (Madrid)
  • 2. The plot 1. What's Social TV? 2. Monitoring Social TV conversations. A preliminary architecture 3. Understanding the buzz. Textalytics 4. Organizing the mess. SenseiDB 5. Lessons learned
  • 3. Social TV Second Screen Transmedia
  • 4. Not just TV Sports Elections Alerts
  • 5. Big Data? Volume Velocity Variety
  • 6. Users? Viewers Channels Brands
  • 7. Viewers? Keep updated Participate Confirm beliefs Belong to group Influence Vote
  • 8. Viewers? Keep updated Participate Confirm beliefs Influence
  • 9. Channels? Understand React Measure
  • 10. Channels? Understand React
  • 11. Brands? Select programs Reputation Find public
  • 12. Brands? Reputation Find public Example from Bluefin Labs
  • 13. Monitoring Social TV conversations. The architecture
  • 14. Pull gateway pipeline tracker HTTP Stream EPG
  • 15. Understanding the buzz Textalytics API
  • 16. Text Classification Sentiment Analysis Topics Extraction Language identification User Demographics Core API Lemmatization, POS and Parsing Spell, Grammar and Style Semantic Linked Data Viewer Speech Recognition and Speaker Diarization
  • 17. Language identification ● Given a text, identify a list of languages, or just one ● 62 languages ● Uses language n-gram signatures ● Social TV ● Filter – TV hashtags often imply a language ● Sometimes hashtags are multilingual – but not relevant for users
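
The "n-gram signatures" idea on slide 17 can be illustrated with a minimal sketch. This is not the Textalytics implementation: the two toy training sentences, the choice of character trigrams, and the cosine similarity scoring are all assumptions made for illustration. A real system would build its signatures from large corpora.

```python
from collections import Counter
import math


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training text per language (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the show is on tonight and everyone is talking about the chef and the food",
    "es": "el programa es esta noche y todos hablan del cocinero y de la comida que prepara",
}
SIGNATURES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def identify(text: str) -> list[tuple[str, float]]:
    """Return (language, score) pairs, best match first."""
    profile = ngram_profile(text)
    scores = [(lang, cosine(profile, sig)) for lang, sig in SIGNATURES.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(identify("esta noche hablan de la comida"))
```

Returning the full ranked list, rather than only the winner, matches the slide's "identify a language list, or just one".
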
  • 18. Text Classification ● Theme labels – IPTC ● Relevance ● Multiple labels ● Tailored for short text (tweets) ● Define your own models and categories ● Social TV – filter on topic content
  • 19. Sentiment analysis ● Document level classification ● Positive/Negative/Neutral ● Subjective/Objective ● Tailored for short texts ● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies ● Other features ● Entity level sentiment ● Segment level sentiment
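
Handling Twitter jargon usually starts with a normalization pass before any classifier sees the text. The sketch below shows one possible pass. The regular expressions, the tiny emoticon table, and the output field names are assumptions for illustration, not the rules Textalytics applies.

```python
import re

RETWEET = re.compile(r"^RT\s+", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@(\w+):?")   # also swallows the colon in "RT @user:"
HASHTAG = re.compile(r"#(\w+)")

# Hypothetical emoticon lexicon; a real one would be much larger.
EMOTICONS = {
    ":)": "positive",
    ":-)": "positive",
    ":D": "positive",
    ":(": "negative",
    ":-(": "negative",
}


def normalize_tweet(text: str) -> dict:
    """Split a raw tweet into clean text plus the jargon signals it carried."""
    is_retweet = bool(RETWEET.match(text))
    text = RETWEET.sub("", text)
    text = URL.sub(" ", text)          # drop URLs before looking for emoticons
    mentions = MENTION.findall(text)
    hashtags = HASHTAG.findall(text)
    hints = [label for emo, label in EMOTICONS.items() if emo in text]

    clean = MENTION.sub(" ", text)
    clean = HASHTAG.sub(r"\1", clean)  # keep the hashtag word, drop the '#'
    for emo in EMOTICONS:
        clean = clean.replace(emo, " ")
    return {
        "text": " ".join(clean.split()),
        "is_retweet": is_retweet,
        "mentions": mentions,
        "hashtags": hashtags,
        "emoticon_hints": hints,
    }


if __name__ == "__main__":
    print(normalize_tweet("RT @zdepablo: Loving #TopChef tonight :) http://t.co/x"))
```

Keeping the stripped signals as separate fields means they can later be indexed as facets instead of being thrown away.
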
  • 20. Topics Extraction ● 12 main types ● Ontology with > 200 types ● Classes – bank ● Instances – BBVA ● relationship ● fictional/historic ● SocialTV: populate custom dictionaries – programs, celebrities, fictional characters ➔ People: Ben Bernanke, Mariano Rajoy… ➔ Companies, Organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve… ➔ Locations: London, USA, Paris… ➔ Concepts: risk premium, President of the Government, parliamentary address, stock index, economic situation… ➔ Economic entities: Ibex35, Dax Xetra… ➔ Time references: today, yesterday, around 11 in the morning… ➔ Monetary amounts: 104 dollars, 1 euro…
  • 21. Entity Linking ● Linking entities to their 'real' representation ● Linking to several LOD sources
  • 22. API ● NLP and Semantics API ● Multilingual: EN, ES (FR, IT, PT, CA) ● REST service: JSON and XML ● Combine best of all worlds ● Deep language analysis ● Comprehensive resources: linguistics and DBs ● Ontology ● Rule-based methods ● Statistics and machine learning methods
  • 23. API ● High-level semantic API – close to business scenarios: Media Analysis API, Semantic Publishing API, … (each with Configuration and Linguistic Resources) ● Core API – building blocks: Topics, Classif., Sentiment, POS, Linked Data
  • 24. Organizing the mess. SenseiDB
  • 25. SenseiDB ● Open source, distributed, real-time, semi-structured database ● From LinkedIn SNA: powering LinkedIn home and LinkedIn Signals ● Integrates other open source technologies: – Zoie – Lucene-based search engine – Bobo – faceted search – Apache Kafka – pub-sub system ● http://www.senseidb.com/
  • 26. SenseiDB features ● 'Hybrid' Information Retrieval – Database ● Full text search ● Structured and faceted search ● Fast real time updates with low latency and high throughput – pull model ● Single table/collection ● BQL – a SQL-like language ● Eventual consistency ● Distributed – sharding and partitioning ● Hadoop integration
  • 27. Faceted search ● Amazon.com? ● Identify relevant attributes to use as filters ● Predefined facets ● Define a table schema ● Define fields as facets – facet schema ● Efficient – in memory
  • 28. Faceted search in depth ● Field types ● Basic: string, int, short, long, float, double, char ● Complex: date and text (analyzed, term vectors) ● Facet types ● Simple: 1 row – 1 value ● Hierarchical – path c>b>a ● Range – define ranges ● Multi: 1 row – n values ● Histogram – define bins and their size ● TimeRange – for real time data ● Custom
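
To make three of the facet types on slide 28 concrete, here is a small in-memory sketch of what "simple", "multi" and "range" facets count. It is plain Python over a list of dictionaries, not SenseiDB or Bobo code, and the field names (`lang`, `hashtags`, `followers`) are hypothetical.

```python
from collections import Counter


def simple_facet(docs: list[dict], field: str) -> Counter:
    """Simple facet: each row contributes exactly one value."""
    return Counter(doc[field] for doc in docs)


def multi_facet(docs: list[dict], field: str) -> Counter:
    """Multi facet: each row contributes zero or more values."""
    return Counter(value for doc in docs for value in doc[field])


def range_facet(docs: list[dict], field: str,
                ranges: list[tuple[int, int]]) -> dict[str, int]:
    """Range facet: count rows whose value falls in each inclusive range."""
    counts = {f"{low}-{high}": 0 for low, high in ranges}
    for doc in docs:
        for low, high in ranges:
            if low <= doc[field] <= high:
                counts[f"{low}-{high}"] += 1
    return counts


if __name__ == "__main__":
    tweets = [
        {"lang": "es", "hashtags": ["TopChef", "tv"], "followers": 120},
        {"lang": "es", "hashtags": ["TopChef"], "followers": 5400},
        {"lang": "en", "hashtags": [], "followers": 980},
    ]
    print(simple_facet(tweets, "lang"))
    print(multi_facet(tweets, "hashtags"))
    print(range_facet(tweets, "followers", [(0, 999), (1000, 9999)]))
```

The counts returned for each facet value are what a faceted UI shows next to each filter.
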
  • 29. Real time indexing ● Data events – add and delete ● Data streams – succession of data events ● Gateways ● Read data events from data streams ● File ● JDBC ● JMS ● Kafka ● Custom: Twitter
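
The event model on slide 29 (data events, data streams, gateways) can be sketched as follows. The `DataEvent` class, the `list_gateway` function and the `InMemoryIndex` class are hypothetical names invented for this illustration; SenseiDB's real gateways are Java components and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class DataEvent:
    """One change to the index: an 'add' (insert or overwrite) or a 'delete'."""
    kind: str
    uid: int
    doc: Optional[dict] = None


def list_gateway(raw_events: Iterable[dict]) -> Iterator[DataEvent]:
    """Toy gateway: turns raw records from any iterable source into data events."""
    for raw in raw_events:
        yield DataEvent(kind=raw["kind"], uid=raw["uid"], doc=raw.get("doc"))


class InMemoryIndex:
    """Minimal stand-in for an index that consumes a stream of data events."""

    def __init__(self) -> None:
        self.docs: dict[int, dict] = {}

    def consume(self, stream: Iterable[DataEvent]) -> None:
        for event in stream:
            if event.kind == "add":
                self.docs[event.uid] = event.doc
            elif event.kind == "delete":
                self.docs.pop(event.uid, None)
            else:
                raise ValueError(f"unknown event kind: {event.kind!r}")


if __name__ == "__main__":
    index = InMemoryIndex()
    index.consume(list_gateway([
        {"kind": "add", "uid": 1, "doc": {"text": "first tweet"}},
        {"kind": "add", "uid": 2, "doc": {"text": "second tweet"}},
        {"kind": "delete", "uid": 1},
    ]))
    print(index.docs)
```

Because the index pulls from an iterator, swapping the list for a file, a queue or a Twitter stream only changes the gateway, which is the point of the pull model.
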
  • 30. BQL – search, filter and facets ● Search – common boolean and phrase operators ● Filters – WHERE conditions ● Facets – support basic analytics tasks defined on facets ● Relevance ● Default – recency ● Ad hoc – may be defined in the query
  • 31. BQL Query Example on Tweets SELECT * WHERE hashtags IN ("TopChef") BROWSE BY hashtags, user_screen_name, urls
  • 32. Tweet Query example
  • 33. Query examples SELECT * WHERE QUERY IS "relaxing cup of coffee"
  • 34. Query examples SELECT * WHERE QUERY IS "relaxing cup of coffee" BROWSE BY entities, sentiment
  • 35. Query examples SELECT * WHERE QUERY IS "relaxing cup of coffee" AND time IN LAST 2 hours BROWSE BY entities, sentiment
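
The queries on slides 31 to 35 follow a regular shape: an optional full-text clause, optional filters, an optional time window, and a list of facets to browse. The helper below composes such strings. It is a hypothetical convenience function, not part of any SenseiDB client, and it uses only the clauses that appear on the slides. Since BQL's string-escaping rules are not covered in the talk, the sketch rejects text containing double quotes rather than guessing.

```python
from typing import Optional, Sequence


def build_bql(text: Optional[str] = None,
              filters: Sequence[str] = (),
              last: Optional[str] = None,
              browse_by: Sequence[str] = ()) -> str:
    """Compose a BQL statement from the clause types shown on the slides."""
    conditions = []
    if text is not None:
        if '"' in text:
            # Escaping rules are not covered here, so refuse instead of guessing.
            raise ValueError("double quotes are not supported in this sketch")
        conditions.append(f'QUERY IS "{text}"')
    conditions.extend(filters)
    if last is not None:
        conditions.append(f"time IN LAST {last}")

    statement = "SELECT *"
    if conditions:
        statement += " WHERE " + " AND ".join(conditions)
    if browse_by:
        statement += " BROWSE BY " + ", ".join(browse_by)
    return statement


if __name__ == "__main__":
    print(build_bql(text="relaxing cup of coffee",
                    last="2 hours",
                    browse_by=["entities", "sentiment"]))
```

The filters are passed through as ready-made BQL fragments, so the caller stays responsible for their syntax.
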
  • 36. Using facets for semantic search ● Define a facet for: ● entities/concepts → tweets about Chicote – include all variants + user + hashtags for each entity ● types → navigate by type – popular people ● classification/sentiment/emotions → positive tweets about Chicote ● users or hashtags → popular users / popular mentions / correlated hashtags
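
One way to read slide 36 is that each tweet and its NLP annotations are flattened into a single row whose fields become facets. The sketch below shows that flattening. The shapes of the `tweet` and `annotations` dictionaries and every field name are assumptions for illustration; they do not reproduce the Textalytics response format or the schema used in the talk.

```python
def to_facet_document(tweet: dict, annotations: dict) -> dict:
    """Flatten a tweet plus its annotations into one row with facet fields."""
    entities = annotations.get("entities", [])
    return {
        "uid": tweet["id"],
        "text": tweet["text"],
        "user_screen_name": tweet["user"],
        "hashtags": sorted(set(tweet.get("hashtags", []))),
        # Multi-value facet: one canonical name per entity, so spelling
        # variants of the same entity collapse into a single facet value.
        "entities": sorted({entity["canonical"] for entity in entities}),
        "entity_types": sorted({entity["type"] for entity in entities}),
        "sentiment": annotations.get("sentiment", "NONE"),
    }


if __name__ == "__main__":
    tweet = {
        "id": 42,
        "text": "chicote is great tonight, go Chicote! #TopChef",
        "user": "some_viewer",
        "hashtags": ["TopChef"],
    }
    annotations = {
        "sentiment": "P",
        "entities": [
            {"surface": "chicote", "canonical": "Chicote", "type": "Person"},
            {"surface": "Chicote", "canonical": "Chicote", "type": "Person"},
        ],
    }
    print(to_facet_document(tweet, annotations))
```

With rows like this, "positive tweets about Chicote" becomes a filter on two facets, `entities` and `sentiment`, instead of a text match on every variant.
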
  • 37. Architecture
  • 38. Scalability ● ZooKeeper to keep replicas ● Low indexing latency (no batch commit) ● Low search latency – even with indexing bursts ● Horizontally scalable – shards ● Shards may be replicated N times ● Elastic – nodes can be added to accommodate growth
  • 39. Other features ● Batch indexing via Hadoop – ETL ● Simple analytics by batch indexing ● Customized relevance models ● MapReduce functions over facets ● Sum, avg, min, max ● DistinctCount ● Activity values – volatile values – likes
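
The aggregate functions listed on slide 39 (sum, avg, min, max, distinct count) can be pictured as a group-by over a facet. The sketch below computes them in plain Python over a list of rows. It does not use SenseiDB's MapReduce interface, and the field names (`sentiment`, `retweets`, `user`) are hypothetical.

```python
from collections import defaultdict


def aggregate_by_facet(docs: list[dict], facet: str, metric: str) -> dict:
    """Group rows by a facet value and compute sum, avg, min and max of a metric."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[facet]].append(doc[metric])
    return {
        value: {
            "sum": sum(numbers),
            "avg": sum(numbers) / len(numbers),
            "min": min(numbers),
            "max": max(numbers),
        }
        for value, numbers in groups.items()
    }


def distinct_count(docs: list[dict], facet: str, field: str) -> dict:
    """Count the distinct values of a field within each facet value."""
    seen = defaultdict(set)
    for doc in docs:
        seen[doc[facet]].add(doc[field])
    return {value: len(items) for value, items in seen.items()}


if __name__ == "__main__":
    tweets = [
        {"sentiment": "P", "retweets": 10, "user": "a"},
        {"sentiment": "P", "retweets": 2, "user": "a"},
        {"sentiment": "N", "retweets": 4, "user": "b"},
    ]
    print(aggregate_by_facet(tweets, "sentiment", "retweets"))
    print(distinct_count(tweets, "sentiment", "user"))
```

In a sharded deployment each node would compute partial results like these and a broker would merge them, which is why sum, min and max combine easily while avg and distinct count need extra care.
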
  • 40. Lessons learned
  • 41. Conclusions ● SenseiDB is fast at searching/indexing – no variance ● A couple of nodes are enough to handle Spanish SocialTV volume ● Love the query language and time operators – BQL ● Supports real time exploration
  • 42. Limitations ● SenseiDB ● Single table model – flat users and reputation ● Tricks to store complex facets ● Documentation is still scarce ● Manageability ● Social TV Tracker ● Group and disambiguate entity mentions across tweets ● Relevance is tricky – ad hoc
  • 43. Comparison ● Solr ● NearRT updates – soft commits ● Simple facets ● Popular – great tools ● ElasticSearch ● Batch/realtime commits ● Online facets ● Aggregation after facets ● Much better plugin system ● Storm, S4 ?
  • 44. Thanks and QA @zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics