Stanford DeepDive Framework

1. THE DEEPDIVE FRAMEWORK
   A Step-by-Step Illustration
   Leo Zhang
2. Stanford DeepDive, developed by Professor Chris Ré and a team of PhD students, is a powerful data management and preparation platform that lets users build highly sophisticated end-to-end data pipelines.

   This presentation covers the technicalities of the inference and learning engine behind DeepDive: how DeepDive differs from traditional data management systems, how to build an application on DeepDive, and how exactly DeepDive works.

   "We are just an advanced breed of monkeys on a minor planet of a very average star. But we can understand the Universe. That makes us special." - Stephen Hawking
3. THE DEEPDIVE OVERVIEW

   How Is DeepDive Different?
   DeepDive is an end-to-end framework for building KBC (Knowledge Base Construction) systems.
   [Pipeline diagram: Input (unstructured docs, e.g. "B. Obama and his wife M. Obama") → Candidate Generation & Feature Extraction → Supervision → Learning & Inference → Output (structured knowledge base, e.g. HasSpouse). After error analysis, developers add new feature-extraction, supervision, and inference rules to improve quality.]
   Source: deepdive.stanford.edu

   How Does DeepDive Work?
   •  Candidate Generation and Feature Extraction
      - Input data is saved in a relational database
      - Feature extractors: a set of user-defined functions
   •  Supervision
      - The DeepDive language is based on Markov Logic
      - Training data can serve the same role it does in supervised learning
   •  Learning and Inference
      - Performed on a factor graph
   •  Error Analysis
      - Determines whether the user needs to inspect the errors

   DeepDive Design
   Features that make it convenient for non-computer-scientists to use:
   i)   No reference to the underlying machine learning algorithm: probabilistic semantics provide a way to debug the system independently of the algorithm
   ii)  Users can write extra features in Python, SQL, and Scala
   iii) Fits into the familiar SQL stack, so standard tools can be used to inspect and visualize data

   Key features:
   •  Feature Engineering: allows developers to think about features rather than algorithms
   •  High Quality: applications have achieved higher quality than human volunteers
   •  Calibration: computes a calibrated probability for every assertion it makes
   •  Variety of Sources: can extract data from documents, PDFs, web pages, tables, and figures
   •  Domain Knowledge: users can integrate domain knowledge by writing simple rules to improve quality
   •  Distant Supervision: does not require tedious training data for every prediction

   Source: Incremental Knowledge Base Construction Using DeepDive
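Design point iii) above is that everything lives in a relational database, so ordinary SQL can inspect intermediate data at any stage. A minimal sketch of that idea with sqlite3; the table and column names here are illustrative, not DeepDive's actual schema:

```python
import sqlite3

# Store documents the way DeepDive does by default: one sentence per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (doc_id TEXT, sent_id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?)",
    [("doc1", 0, "B. Obama married M. Obama in 1992."),
     ("doc1", 1, "They live in Washington.")],
)

# Any standard SQL tool can now inspect the intermediate data.
rows = conn.execute(
    "SELECT sent_id, text FROM sentences WHERE text LIKE '%Obama%'"
).fetchall()
```

Here `rows` contains only the first sentence, showing how plain SQL filters the stored corpus.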
4. DEVELOPMENT PROCESS OF DEEPDIVE APPLICATIONS

   Start with a basic first version and improve iteratively.

   Writing the Application
   •  Define the data flow in a DDlog schema that describes the input data and the data to be produced
   •  Write user-defined functions (data transformation rules)
   •  Specify a statistical model in DDlog

   Running the Application
   •  The user can compile and run the application incrementally
   •  Actual data is loaded into the database and queried; user-defined functions are executed incrementally
   •  The model's parameters can be learned, or reused to make predictions

   Evaluate / Debug
   •  Formal error analysis supported by interactive tools
   •  DeepDive contains a suite of tools and guides: label data products, browse data, monitor descriptive statistics, calibration, etc.

   Notes on DDlog:
   •  DDlog is a higher-level language for writing DeepDive applications in a succinct, Datalog-like syntax
   •  Variable declarations + scoping and supervision rules + inference rules
   •  A core set of commands supports precise control of execution
   •  Several commands operate on the statistical model: creation, parameter estimation, computation of probabilities, and keeping and reusing parameters
   •  User-defined functions can be written in any standard programming language
   •  Produces calibration plots to evaluate the iterative workflow

   Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction

   "It's okay to have your eggs in one basket as long as you control what happens to that basket" - Elon Musk
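The calibration plots mentioned above compare predicted probabilities against empirical accuracy. A minimal sketch of that underlying idea, not DeepDive's implementation; the function name and bucketing scheme are my own:

```python
# Bucket predictions by predicted probability and compare each bucket's
# mean prediction to the fraction of true labels in it. Well-calibrated
# output has the two roughly equal in every bucket.
def calibration_buckets(predictions, labels, n_buckets=10):
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(predictions, labels):
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append((p, y))
    rows = []
    for i, b in enumerate(buckets):
        if b:
            mean_p = sum(p for p, _ in b) / len(b)   # average predicted probability
            acc = sum(y for _, y in b) / len(b)      # empirical accuracy in bucket
            rows.append((i, round(mean_p, 2), round(acc, 2), len(b)))
    return rows

preds  = [0.05, 0.1, 0.55, 0.6, 0.9, 0.95]
labels = [0,    0,   1,    0,   1,   1]
table = calibration_buckets(preds, labels)
```

Each row of `table` is (bucket index, mean prediction, accuracy, count); plotting mean prediction against accuracy gives a calibration plot.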
5. THE DEEPDIVE FRAMEWORK

   End-to-End Framework for Building KBC Systems
   [Pipeline diagram: Input → Candidate Generation & Feature Extraction → Supervision → Learning & Inference → Output, with new docs, feature-extraction rules, supervision rules, inference rules, and error analysis feeding back into the loop.]

   Knowledge Base Construction Systems
   The input to a KBC system is a heterogeneous collection of unstructured, semi-structured, and structured data. The output is a relational database containing facts extracted from the input and put into the appropriate schema.

   The KBC Model
   The standard KBC model seeks to extract four types of objects from input documents:
   •  Entity: a real person, place, or thing
   •  Relation: associates two (or more) entities
   •  Mention: a span of text in an input document that refers to an entity or relation
   •  Relation Mention: a phrase that connects two mentions that participate in a relation

   Source: Incremental Knowledge Base Construction Using DeepDive
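The four KBC object types can be rendered as simple Python types. This is a conceptual illustration only; DeepDive itself stores these objects as rows in relational tables, and all names here are made up:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str            # a real person, place, or thing

@dataclass(frozen=True)
class Mention:
    doc_id: str
    span: tuple          # (start, end) character offsets in the document
    text: str            # the span of text referring to an entity or relation

@dataclass(frozen=True)
class Relation:
    name: str            # e.g. "has_spouse"
    entities: tuple      # the two (or more) entities it associates

@dataclass(frozen=True)
class RelationMention:
    relation: str        # name of the relation the mentions participate in
    mentions: tuple      # the connected mentions

m1 = Mention("doc1", (0, 8), "B. Obama")
m2 = Mention("doc1", (22, 30), "M. Obama")
rm = RelationMention("has_spouse", (m1, m2))
```

Note the distinction the model draws: `Entity` is the real-world object, while `Mention` is a specific span of text that refers to it.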
6. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP

   Candidate Generation & Feature Extraction
   All data is stored in a relational database. This phase populates the database using a set of SQL queries and user-defined functions (feature extractors).

   By default, DeepDive stores documents in the database one sentence per row, with markup produced by standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing.

   DeepDive then executes two types of queries:
   •  Candidate mappings: SQL queries that produce possible mentions, entities, and relations
   •  Feature extractors: associate features with candidates

   Source: Incremental Knowledge Base Construction Using DeepDive

   "A breakthrough in machine learning would be worth ten Microsofts" - Bill Gates
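A toy sketch of the two query types, with plain Python standing in for the SQL candidate mapping. The capitalization heuristic and the `BETWEEN_*` feature names are illustrative assumptions, not DeepDive's actual extractors:

```python
def candidate_generation(sentences):
    """Candidate mapping (normally a SQL query): pair up capitalized
    tokens in each sentence as possible spouse mentions."""
    candidates = []
    for sent in sentences:
        people = [tok for tok in sent.split() if tok[0].isupper()]
        for i in range(len(people)):
            for j in range(i + 1, len(people)):
                candidates.append((people[i], people[j], sent))
    return candidates

def feature_extraction(candidate):
    """Feature extractor (a user-defined function): map a candidate to
    feature strings, here the words between the two mentions."""
    p1, p2, sent = candidate
    words_between = sent.split(p1)[-1].split(p2)[0].split()
    return ["BETWEEN_" + w.lower() for w in words_between]

sentences = ["Barack and his wife Michelle would attend"]
cands = candidate_generation(sentences)
feats = feature_extraction(cands[0])
```

The extracted features (e.g. the word "wife" between the two mentions) are what the later learning phase attaches weights to.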
7. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP

   Supervision
   Just as in Markov Logic, DeepDive can use training data or evidence about any relation. Each user relation is associated with evidence indicating whether an entry is true or false.

   Two standard techniques generate training data: hand-labeling and distant supervision.

   Distant Supervision
   Traditional machine learning techniques require a set of hand-labeled training data. In distant supervision, DeepDive instead takes an existing database (e.g. a domain-specific database) containing instances of the relations it wants to extract, and uses those instances to automatically generate training data.

   Source: Incremental Knowledge Base Construction Using DeepDive
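A toy sketch of distant supervision, assuming a small hypothetical table of known facts: candidate pairs found in the existing database become positive examples, and unseen pairs become negatives. The labels are heuristic, which is why they are noisy by construction:

```python
# Hypothetical existing database of known spouse pairs (stand-in for a
# real domain-specific database such as a knowledge base).
known_spouses = {("barack obama", "michelle obama")}

def distant_label(pair):
    """Label a candidate pair by looking it up in the existing database,
    checking both orderings of the pair."""
    a, b = pair[0].lower(), pair[1].lower()
    return (a, b) in known_spouses or (b, a) in known_spouses

candidates = [
    ("Barack Obama", "Michelle Obama"),
    ("Barack Obama", "Joe Biden"),
]
labels = [distant_label(c) for c in candidates]
```

This replaces hand-labeling: every mention of a known pair anywhere in the corpus is treated as a (possibly noisy) positive training example.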
8. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP

   Learning & Inference
   In this phase, DeepDive generates a factor graph: a probabilistic graphical model that is the abstraction used for learning. DeepDive relies heavily on factor graphs.

   An example factor graph for the raw sentence "He said that he would come.":
   •  User relation (one row per token): A = "He", B = "Said", C = "That", D = "He"
   •  Correlation relations: F1 (adjacent-token correlation) holds over (A,B), (B,C), and (C,D); F2 (same-word correlation) holds over (A,D)
   •  Example assignment: A = 1, B = 0, C = 0, D = 1
   •  The unnormalized weight of this assignment is the product of the factors in F1 and F2: f1(1,0) × f1(0,0) × f1(0,1) × f2(1,1); the partition function Z sums this product over all possible assignments

   Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction

   "Problems worthy of attack prove their worth by fighting back" - Paul Erdős
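The toy factor graph above is small enough to evaluate exhaustively. A sketch in Python; the numeric factor weights below are made up for illustration, since in DeepDive the weights are learned from training data:

```python
from itertools import product

def f1(x, y):
    """Adjacent-token factor: illustrative weight favoring agreement."""
    return 2.0 if x == y else 1.0

def f2(x, y):
    """Same-word factor: illustrative weight favoring agreement."""
    return 3.0 if x == y else 1.0

def weight(a, b, c, d):
    """Unnormalized weight: product of the three F1 factors over
    (A,B), (B,C), (C,D) and the one F2 factor over (A,D)."""
    return f1(a, b) * f1(b, c) * f1(c, d) * f2(a, d)

# The slide's example assignment A=1, B=0, C=0, D=1:
w = weight(1, 0, 0, 1)   # f1(1,0) * f1(0,0) * f1(0,1) * f2(1,1)

# Partition function: sum the weight over all 2^4 Boolean assignments.
Z = sum(weight(*assign) for assign in product([0, 1], repeat=4))

# Probability of the example assignment under the factor graph.
p = w / Z
```

Exhaustive enumeration is only feasible for toy graphs like this; for real applications, DeepDive estimates such probabilities with approximate inference over the factor graph.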
9. REFERENCES

   Shin, Jaeho, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. "Incremental Knowledge Base Construction Using DeepDive." Proceedings of the VLDB Endowment 8.11 (2015): 1310-1321.

   Ce Zhang. "DeepDive: A Data Management System for Automatic Knowledge Base Construction." Proceedings of the VLDB Endowment 8.13 (2015).
