
A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Language Real


Rather than running pre-defined queries embedded in dashboards, business users and data scientists want to explore data in more intuitive ways. Natural language interfaces for data exploration have gained considerable traction in industry. Their success has been triggered by advancements in machine learning and by novel big data technologies that enable processing large amounts of data in real time. However, even though these systems show significant progress, they have not yet reached the maturity level needed to support real users in data exploration scenarios, whether due to a lack of supported functionality or to a narrow application scope, and they remain one of the 'holy grails' of the data analytics community.

In this talk, we will present a Spark-based architecture of an intelligent data assistant, a system that combines real-time data processing and analytics over large amounts of data with user interaction in natural language, and we will argue why Spark is the right platform for next-gen intelligent data assistants.

Our intelligent data assistant
(a) enables a more natural interaction with the user through natural language;
(b) offers active guidance through explanations and suggestions;
(c) constantly learns and improves its performance.

Building an intelligent data assistant poses several challenges. Unlike with search engines, users tend to express sophisticated query logic and expect perfect results. The inherent complexity of natural language complicates things in several ways. Moreover, the intricacies of the data domain require that the system constantly expand its domain knowledge and its ability to interpret new data and user queries, by continuously analyzing both.

Our intelligent data assistant brings together several components, including natural language processing for understanding user queries and generating answers in natural language, automatic knowledge base construction techniques for learning about data sources and how to find the information requested, as well as deep learning methods for query disambiguation and domain understanding.


A Spark-Based Intelligent Assistant: Making Data Exploration in Natural Language Real

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Georgia Koutrika, ATHENA Research Center. A Spark-based Intelligent Assistant: Making Data Exploration in Natural Language Real. #UnifiedDataAnalytics #SparkAISummit
  3. Data, Data, Data. Data growth: more data than humans can process and comprehend. Data democratization: from scientists to the public, increasingly more users are consumers of data. #UnifiedDataAnalytics #SparkAISummit
  4. HOW CAN WE EXPLORE AND LEVERAGE OUR DATA?
  5. SQL queries against a data store return results, e.g., SELECT p.status FROM conference_attendees p WHERE p.conference = 'SPARK+AI2019'. #UnifiedDataAnalytics #SparkAISummit
  6. How are you today? #UnifiedDataAnalytics #SparkAISummit
  7. The phases of data exploration
  8. The "SQL" Age. The programmer writes SELECT * FROM CITIES WHERE 50 < (SELECT AVG(TEMP_F) FROM STATS WHERE CITIES.ID = STATS.ID); while users ask "which cities have year-round average temperature above 50 degrees?". The result: limited access (for most but the privileged), a communication bottleneck (the guru), data starvation (for those that really need it), and limited interaction (query answering). Period characteristics: user type: sophisticated user; knowledge: precise knowledge of data and schema, and of their need; interaction: the user "speaks" fluent SQL, the DBMS "responds" with tables. #UnifiedDataAnalytics #SparkAISummit
  9. The "Baby Talk" Age. The business user asks "cities with year-round average temperature above 50 degrees?". The user pretty much knows what to ask, user queries are relatively simple, and the paradigm is (still) query answering. Period characteristics: user type: domain expert; knowledge: the user understands the data domain and knows precisely what they need; interaction: the user is not familiar with SQL, the DBMS "responds" with tables and graphs. #UnifiedDataAnalytics #SparkAISummit
  10. Chatbots (https://medium.com/swlh/chatbots-of-the-future-86b5bf762bb4). A chatbot mimics conversations with people, uses artificial intelligence techniques, and lives on consumer messaging platforms as a means for consumers to interact with brands. Drawbacks: primarily rule-based text interfaces; encourage canned, linearly driven interactions; handle only simple, unambiguous questions ("what is the weather forecast today"); cannot answer arbitrary or complex queries over data repositories. #UnifiedDataAnalytics #SparkAISummit
  11. Conversational AI. For example, Google Duplex (demo released in May 2018): the technology is directed towards completing specific tasks, such as scheduling certain types of appointments, and for such tasks the system makes the conversational experience as natural as possible. One of the key research insights was to constrain Duplex to closed domains; Duplex can only carry out natural conversations after being deeply trained in such domains. #UnifiedDataAnalytics #SparkAISummit
  12. Human-like Data Exploration. Requirements: user type: from expert to data consumer; knowledge: intuition about the data; info need: not necessarily sure what to ask; interaction: intuitive and natural. #UnifiedDataAnalytics #SparkAISummit
  13. Human-like Data Exploration. An Intelligent Data Assistant converses with the user in a more natural, bilateral interaction; actively guides the user, offering explanations and suggestions; keeps track of the context and can respond and adapt accordingly; and constantly improves its behavior by learning and adapting. #UnifiedDataAnalytics #SparkAISummit
  14. Intelligent Data Assistant: syntax-based (NLP) query understanding, data-based query understanding, NL query explanations, query recommendations. #UnifiedDataAnalytics #SparkAISummit
  15. Step 1: Let the user ask using natural language
  16. Facts: unlike search engines, users tend to express sophisticated query logic to a data assistant and expect perfect results. Translating a natural language query to a structured query is hard! #UnifiedDataAnalytics #SparkAISummit
  17. Challenges from the NL side: #UnifiedDataAnalytics #SparkAISummit
      - Synonymy: multiple words with the same meaning, e.g., "movies" and "films".
      - Polysemy: a single word with multiple meanings, e.g., Paris the city vs. Paris Hilton.
      - Syntactic ambiguity: multiple readings based on syntax, e.g., does "Find all German movie directors" mean "directors that have directed German movies" or "directors from Germany that have directed a movie"?
      - Semantic ambiguity: multiple meanings for a sentence, e.g., "Are Brad and Angelina married?" (to each other or separately?).
      - Paraphrasing: multiple ways to say the same thing, e.g., "how many people live in ..." could be a mention of the "Population" column.
      - Context-dependent terms: in "Return the top product", "top" is a modifier for "product"; does it mean based on popularity or based on number of sales?
      - Elliptical queries: sentences from which one or more words are omitted, e.g., "Return the movies of Clooney".
      - Non-exact matching: mentions that do not map exactly to values or tables/attributes, e.g., in "Who is the best actress in 2011", "actress" should map to the "actor" column.
  18. Challenges from the data side: #UnifiedDataAnalytics #SparkAISummit
      - Complex syntax: SQL is a structured language with a strict grammar and limited expressivity compared to natural language. "Return the movie with the best rating" looks like it should be "SELECT name, MAX(rating) FROM Movie;" but is WAY more complicated.
      - Database structure: for the term "date", a system may need to retrieve three attributes: year, month, day.
      - Multiple relationships: mentions may connect in multiple ways (join disambiguation), e.g., "Woody Allen movies" may need several tables to be joined.
      - Ranking: how to rank multiple answers, e.g., for "Return Woody Allen movies".
  19. Ask a query: "What movies have the same director as the movie 'Revolutionary Road'?" #UnifiedDataAnalytics #SparkAISummit
  20. Understanding Syntax. Step 1 (Syntactic Parser): understand the natural language query linguistically. Generate a dependency parse tree: part-of-speech (POS) tags that describe each word's syntactic function, plus syntactic relationships between the words in the sentence. #UnifiedDataAnalytics #SparkAISummit
  21. Understanding Syntax. Step 2 (Node Mapper): map query elements to data elements (tables, attributes, and values using indexes; commands such as ORDER BY using a dictionary) and keep the best mappings. #UnifiedDataAnalytics #SparkAISummit
  22. Understanding Syntax. Step 3 (Tree Mapper): map the parse tree to the database structure and build a query tree. #UnifiedDataAnalytics #SparkAISummit
  23. Understanding Syntax. Step 4 (SQL Generator): generate the SQL query to execute. #UnifiedDataAnalytics #SparkAISummit
  24. Worked example for "What movies have the same director as the movie 'Revolutionary Road'": #UnifiedAnalytics #SparkAISummit
      Node Mapper keyword-to-schema mappings: movie → Movie; "Revolutionary Road" → Movie.title; movies → Movie; director → Director.
      The Syntactic Parser and Tree Mapper produce the parse tree and query tree (shown as diagrams on the slide).
      Generated SQL:
      Main query: SELECT DISTINCT movie.title FROM movie, block0, block1 WHERE movie.mid = block0.mid AND block0.pk_director = block1.pk_director
      Block0: SELECT director.did AS pk_director, movie.mid FROM movie, director, directed_by WHERE movie.mid = directed_by.msid AND directed_by.did = director.did
      Block1: SELECT director.did AS pk_director, movie.mid FROM movie, director, directed_by WHERE movie.title = 'Revolutionary Road' AND movie.mid = directed_by.msid AND directed_by.did = director.did
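The translation steps above can be sketched in miniature. This is a hedged illustration, not the system described in the talk: the keyword index, the tiny movie schema, and the query are all hypothetical, and the parsing (step 1) and tree-mapping (step 3) stages are reduced to simple token lookup so that only the Node Mapper and SQL Generator are shown.

```python
# Minimal sketch of steps 2 and 4 of the NL-to-SQL pipeline (Node Mapper and
# SQL Generator). All names here are hypothetical illustrations.

# Node Mapper: keyword index from query tokens to schema elements or commands.
KEYWORD_INDEX = {
    "movies": ("table", "movie"),
    "movie": ("table", "movie"),
    "newest": ("command", "ORDER BY year DESC"),
}

def map_nodes(tokens):
    """Step 2: keep only the tokens that map to tables or commands."""
    return [(tok, *KEYWORD_INDEX[tok]) for tok in tokens if tok in KEYWORD_INDEX]

def generate_sql(mappings):
    """Step 4: assemble a SELECT over the mapped tables and commands."""
    tables = list(dict.fromkeys(n for _, kind, n in mappings if kind == "table"))
    commands = [n for _, kind, n in mappings if kind == "command"]
    sql = f"SELECT DISTINCT {tables[0]}.* FROM {', '.join(tables)}"
    return " ".join([sql] + commands)

mappings = map_nodes("show the newest movies".split())
print(generate_sql(mappings))  # SELECT DISTINCT movie.* FROM movie ORDER BY year DESC
```

A real system would score many candidate mappings per token and many candidate join paths per query tree, which is exactly where the ambiguity discussed on the following slides comes from.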
  25. Understanding Syntax. Why is parsing so hard for computers to get right? Human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences to have hundreds, thousands, or even tens of thousands of possible syntactic structures, and a natural language parser must somehow search through all of these alternatives and find the most plausible structure given the context. #UnifiedDataAnalytics #SparkAISummit
  26. Step 1: Let the user ask using natural language
  27. Ask a query: "Show me Italian restaurants". Not much value a parser can add. #UnifiedDataAnalytics #SparkAISummit
  28. Ambiguity: several possible mappings. #UnifiedDataAnalytics #SparkAISummit
  29. Ambiguity: several possible query interpretations, from likely to unlikely: #UnifiedDataAnalytics #SparkAISummit
      1. "business categorized as restaurant and as Italian": "restaurant" = Category(category), "italian" = Category(category)
      2. "business categorized as Italian whose name includes restaurant": "restaurant" = Business(name), "italian" = Category(category)
      3. "business categorized as Italian whose address includes restaurant": "restaurant" = Business(address), "italian" = Category(category)
      4. "business categorized as restaurant that serves Italian": "restaurant" = Category(category), "italian" = Attribute(value)
      5. "business whose address contains restaurant and Italian": "restaurant" = Business(address), "italian" = Business(address)
  30. Ambiguity: too many ways to interpret a query. Which one(s) represent the user intent? How do we rank them? #UnifiedDataAnalytics #SparkAISummit
  31. Analyzing Data. Pipeline: NLQ → Entity Mapper → Interpretation Generator → ML-based Disambiguation → SQL Generator → SQL, with training data coming from expert input and query logs. #UnifiedDataAnalytics #SparkAISummit
  32. Analyzing Data. Example features for attribute mappings: Probability captures the commonality of a keyword in an attribute, where Attribute_WordCount is the number of all words in the attribute; Exclusivity is an adapted version of the Gini index that captures the power of each mapping. #UnifiedDataAnalytics #SparkAISummit
  33. Analyzing Data. Example features for attribute combinations, computed over the possible query interpretations above: Min_Prob takes the minimum of the probabilities inside an attribute combination as a way to represent its mappings. #UnifiedDataAnalytics #SparkAISummit
  34. Analyzing Data. Example features for attribute combinations: IR_Score is a relevance score computed by first computing a single-attribute relevance score for each initial attribute and then combining the single-attribute scores of an attribute combination into a final score. #UnifiedDataAnalytics #SparkAISummit
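As a hedged sketch of the kind of features slides 32-34 describe, the snippet below computes a word-frequency Probability feature for single attribute mappings and Min_Prob for attribute combinations. The column contents are made up, and the talk's exact formulas (including Exclusivity and IR_Score) are not reproduced here.

```python
# Sketch of mapping features for disambiguation. The attribute contents and
# the precise feature definitions are simplified, hypothetical illustrations.
from collections import Counter

# Hypothetical attribute contents: attribute -> bag of words in its values.
ATTRIBUTES = {
    "Category.category": ["restaurant", "italian", "restaurant", "bar"],
    "Business.name": ["luigi", "restaurant", "pizzeria"],
    "Business.address": ["main", "street", "rome"],
}

def probability(keyword, attribute):
    """Commonality of a keyword in an attribute: count / Attribute_WordCount."""
    words = ATTRIBUTES[attribute]
    return Counter(words)[keyword] / len(words)

def min_prob(mapping):
    """Represent an attribute combination by its weakest keyword mapping."""
    return min(probability(kw, attr) for kw, attr in mapping.items())

# Two candidate interpretations of "Italian restaurants" (numbers 1 and 5 above):
likely = {"restaurant": "Category.category", "italian": "Category.category"}
unlikely = {"restaurant": "Business.address", "italian": "Business.address"}
print(min_prob(likely) > min_prob(unlikely))  # True
```

Features like these become the input of the ML-based disambiguation stage, which learns from expert input and query logs how to rank whole interpretations.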
  35. Step 2: Let the system respond in natural language
  36. NL Explanations. "comedies by Woody Allen": director or actor Woody Allen? Director interpretation: select d.name, m.title from MOVIE m, DIRECTED r, DIRECTOR d, GENRE g where m.id=r.mid and r.did=d.id and m.id=g.mid and d.name='Woody Allen' and g.genre='comedy'. Actor interpretation: select a.name, m.title from MOVIE m, CAST c, ACTOR a, GENRE g where m.id=c.mid and c.did=a.id and m.id=g.mid and a.name='Woody Allen' and g.genre='comedy'. #UnifiedDataAnalytics #SparkAISummit
  37. Generating Explanations. Domain-independent graph traversal for efficiently exploring query graphs and composing query descriptions as phrases in natural language. Pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (driven by annotations + templates) → NL explanation. Example annotations: relation ACTOR → "actors"; attribute "fname" → "firstname"; function MAX → "the greatest". #UnifiedDataAnalytics #SparkAISummit
  38. Generating Explanations. Query graph: Actors - Cast - Movies, projecting name, with the condition Year = 2010. Templates: l(actors) + 'that play in' + l(movies); l(movies) + 'in' + l(year). For "select a.name from actors a, movies m, cast c where a.id=c.aid and c.mid=m.mid and year=2010" this synthesizes: "Return the name of the actors, for actors that play in movies in 2010". #UnifiedDataAnalytics #SparkAISummit
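The template-based synthesis on slide 38 can be approximated as follows. The labels and templates below are simplified, hypothetical stand-ins for the annotation and template bases the talk mentions, and the graph traversal is reduced to nesting one template inside another.

```python
# Sketch of template-based NL synthesis over an annotated query graph.
# Labels and templates are hypothetical simplifications.

# Annotations: schema element -> natural language label, l(.) on the slide.
LABELS = {"actors": "the actors", "movies": "movies", "name": "the name"}

def tmpl_projection(attr, table):
    """Template for the SELECT clause."""
    return f"Return {LABELS[attr]} of {LABELS[table]}"

def tmpl_join(table, nested_phrase):
    """Template attached to the Actors-Cast-Movies join edge."""
    return f"{LABELS[table]} that play in {nested_phrase}"

def tmpl_condition(table, year):
    """Template for the Year = <year> selection condition."""
    return f"{LABELS[table]} in {year}"

def explain():
    # Traverse the query graph from the projection outward, nesting templates.
    return (tmpl_projection("name", "actors")
            + " for " + tmpl_join("actors", tmpl_condition("movies", 2010)))

print(explain())
# Return the name of the actors for the actors that play in movies in 2010
```

The point of keeping templates separate from labels is that the traversal logic stays domain independent: swapping in a new schema only requires new annotations, not new synthesis code.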
  39. What about user NL queries? We can use our knowledge of translating past NL queries (NL→SQL logs) to synthesize NL explanations of new queries, through the same pipeline: Structured Query → Annotated Query Graph → Template-based Synthesis (annotations + templates) → NL explanation. #UnifiedDataAnalytics #SparkAISummit
  40. Step 3: Help the user ask the right question
  41. Guiding the user. The user needs help discovering the data in the first place, knowing what questions may be asked, and finding what to do next. #UnifiedDataAnalytics #SparkAISummit
  42. Query Recommendations: two settings. Cold start: with no previous interaction, show a set of starter queries that users can use to get some initial answers from the dataset and start understanding the data better. Warm start: exploring and looking for answers in a new dataset takes time and effort, and at each step the user may not know what to do next; the system can leverage the user's interactions (queries) to show possible next queries. #UnifiedDataAnalytics #SparkAISummit
  43. Query Recommendations. Cold start: starter query generation draws on data statistics, user logs, and example queries, ranked by metrics, to produce starter queries. #UnifiedDataAnalytics #SparkAISummit
  44. Query Recommendations. Warm start: a generative approach applies structural modifications to the user query guided by data statistics, while a log-based approach uses the query log, transition probabilities, and query similarities; both produce candidate next queries. #UnifiedDataAnalytics #SparkAISummit
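A minimal sketch of the log-based warm-start approach: mine transition probabilities from logged query sessions and rank candidate next queries by them. The session log below is made up, and a real system would combine this signal with query similarities and the generative approach.

```python
# Sketch of log-based next-query recommendation via transition probabilities.
# The query log is a hypothetical illustration.
from collections import Counter, defaultdict

# Each session is the sequence of queries one user issued.
SESSIONS = [
    ["top restaurants", "italian restaurants", "italian restaurants in rome"],
    ["top restaurants", "italian restaurants", "restaurant ratings"],
    ["top restaurants", "restaurant ratings"],
]

def transition_probabilities(sessions):
    """Estimate P(next | current) from consecutive query pairs in the log."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return {q: {nxt: c / sum(cnt.values()) for nxt, c in cnt.items()}
            for q, cnt in counts.items()}

def recommend_next(query, probs, k=2):
    """Rank candidate next queries by transition probability."""
    candidates = probs.get(query, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

probs = transition_probabilities(SESSIONS)
print(recommend_next("top restaurants", probs))
# ['italian restaurants', 'restaurant ratings']
```

Because the transition table is just aggregated pair counts, it can be maintained incrementally as new sessions arrive, which fits the "constantly learns and improves" behavior described earlier.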
  45. Step 4: Putting everything together
  46. Intelligent Data Assistant architecture. Components: syntax-based (NLP) query understanding, data-based query understanding, NL query explanations, query recommendations. Stores: knowledge + expert bases; annotation + template bases; query + translation logs; statistics; query similarity graph; query transition graph. Data processing: Spark SQL, Spark CoreNLP, Spark MLlib, GraphX, and TensorFlow on Spark Core. Storage: HDFS, Parquet, Lucene. #UnifiedDataAnalytics #SparkAISummit
  47. Intelligent Data Assistant: What are you looking for today?
  48. DON'T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
