To update the collection-centric architecture, add an auxiliary index and an annotation store.
Each extraction result is stored with its source document and its associated positions in that document.
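As a rough illustration of the idea, an annotation store can be sketched as a map from document to a list of typed spans. This is a minimal sketch, not the actual system's storage layer: the names `Annotation`, `AnnotationStore`, `for_document`, and `text_of` are hypothetical, and positions are assumed to be character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One extraction result: a labeled span in a source document (hypothetical schema)."""
    doc_id: str
    label: str
    begin: int  # assumed: character offset, inclusive
    end: int    # assumed: character offset, exclusive


class AnnotationStore:
    """Hypothetical in-memory store keyed by source document."""

    def __init__(self):
        self._by_doc = {}

    def add(self, annotation):
        self._by_doc.setdefault(annotation.doc_id, []).append(annotation)

    def for_document(self, doc_id):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._by_doc.get(doc_id, []))

    def text_of(self, annotation, documents):
        # Recover the covered text from the source document.
        return documents[annotation.doc_id][annotation.begin:annotation.end]


documents = {"mail-1": "Call Alice at 555-1234 tomorrow."}
store = AnnotationStore()
store.add(Annotation("mail-1", "Person", 5, 10))
store.add(Annotation("mail-1", "Phone", 14, 22))
covered = [store.text_of(a, documents) for a in store.for_document("mail-1")]
```

Keeping the span offsets next to the document identifier is what lets a later step go back to the original text when it needs to.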
Basically: convert the JAPE rule into a relational calculus expression, which amounts to a big self-join over a table of <word, position> pairs. Generate an efficient join plan using (inverted) index access when possible. Some parts still require going back to the document; we want these high in the operator graph.
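The self-join idea can be sketched on a toy rule: "the word `new` immediately followed by the word `york`". This is only an illustration under stated assumptions, not the optimizer described here: the tokenizer is a plain whitespace split, and `build_inverted_index` and `match_adjacent` are hypothetical helper names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization; position is the token offset.
        for position, word in enumerate(text.split()):
            index[word.lower()].add((doc_id, position))
    return index


def match_adjacent(index, first, second):
    """Self-join of the <word, position> table with itself:
    rows for `first` join rows for `second` on same doc_id and position + 1.
    Both sides are fetched through the inverted index, so no document is rescanned."""
    right = index.get(second, set())
    return sorted(
        (doc_id, pos)
        for doc_id, pos in index.get(first, set())
        if (doc_id, pos + 1) in right
    )


docs = {
    "d1": "IBM opened a lab in New York last year",
    "d2": "A new lab opened near York",
}
index = build_inverted_index(docs)
matches = match_adjacent(index, "new", "york")
```

Only `d1` matches, at token position 5; in `d2` both words occur but are not adjacent, so the join predicate on positions filters it out. A predicate that cannot be answered from the index (for example, one that inspects the raw text) would be applied afterward, on the few candidates that survive the join.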
At a high level, the optimization strategy is very similar to the one in System R, but with a novel access method, novel join algorithms, and a two-dimensional cost model.
The document-centric model enables embedding SystemT in a wide variety of applications. For instance, in Lotus Notes, when a user opens an email, the message is sent at the same time to the SystemT runtime, which generates annotations on the fly. When the email is displayed to the user, the freshly generated annotations are displayed as well. SystemT can also be embedded as a Map job in a MapReduce framework, which allows the system to scale up and process large volumes of documents.
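The two embeddings described above share one shape: a function that takes a single document and returns its annotations. The sketch below illustrates that shape only; it does not use SystemT's actual API. The regular-expression annotator, `annotate_document`, and `map_annotate` are hypothetical stand-ins, and Python's built-in `map` stands in for a MapReduce map phase.

```python
import re

# Hypothetical annotator: a seven-digit phone pattern standing in for real extraction rules.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def annotate_document(text):
    """Document-centric entry point: one document in, its annotations out."""
    return [("Phone", m.start(), m.end()) for m in PHONE.finditer(text)]


def map_annotate(record):
    """Map-style wrapper: (doc_id, text) in, (doc_id, annotations) out."""
    doc_id, text = record
    return doc_id, annotate_document(text)


# On-the-fly use, as when a single email is opened.
single = annotate_document("Call Alice at 555-1234 tomorrow.")

# Batch use: the same function applied independently to every document.
batch = dict(map(map_annotate, [("m1", "Fax: 555-0000"), ("m2", "No number here")]))
```

Because each call depends only on its own document, the same function serves both the interactive case and the batch case without changes.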
Transcript of "Enterprise information extraction: recent developments and open challenges"