DIADEM: domain-centric intelligent automated data extraction methodology

Automatically Learning Gazetteers from the Deep Web

Christian Schallhart
April 19th, 2012 @ WWW in Lyon
Joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang




Friday, May 11, 2012
AMBER: Extraction from Result Pages
<offer>
  <price>
    <currency>GBP</currency>
    <amount>4000000</amount>
  </price>
  <bedrooms>5</bedrooms>
  <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>7</bedrooms>
  <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location>
</offer>

<offer>
  <price>
    <currency>GBP</currency>
    <amount>3950000</amount>
  </price>
  <bedrooms>6</bedrooms>
  <location>Old Boars Hill, Oxford</location>
</offer>
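
Records in this shape can be loaded into plain data structures for downstream use. The following is a minimal sketch, not part of the talk: it assumes the offers are wrapped in a hypothetical `<offers>` root element (the slide shows only individual `<offer>` records), and `parse_offers` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Two of the records from the slide, wrapped in a hypothetical <offers> root
# so that the snippet is a single well-formed XML document.
XML = """
<offers>
  <offer>
    <price><currency>GBP</currency><amount>4000000</amount></price>
    <bedrooms>5</bedrooms>
    <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location>
  </offer>
  <offer>
    <price><currency>GBP</currency><amount>3950000</amount></price>
    <bedrooms>7</bedrooms>
    <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location>
  </offer>
</offers>
"""


def parse_offers(xml_text):
    """Turn each <offer> element into a flat dictionary."""
    root = ET.fromstring(xml_text.strip())
    offers = []
    for offer in root.findall("offer"):
        offers.append({
            "currency": offer.findtext("price/currency"),
            "amount": int(offer.findtext("price/amount")),
            "bedrooms": int(offer.findtext("bedrooms")),
            # Collapse whitespace in case the location spans several lines.
            "location": " ".join(offer.findtext("location").split()),
        })
    return offers


offers = parse_offers(XML)
```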




[Chart: precision and recall for data areas, records, and attributes, on an axis from 97.5% to 100.0%. F1 score: >98.5%.]
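
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. A small sketch of the arithmetic follows; the input values are hypothetical and are not figures from the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only; not results from the talk.
example = f1_score(0.99, 0.98)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a high F1 score requires both precision and recall to be high.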
Domain Knowledge (no per-site training)

• Little Ontology: (mandatory) attribute types. Quite easy.
• Gazetteers: term lists. A lot of work!
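
To make "gazetteers as term lists" concrete, here is a minimal sketch of annotating text by dictionary lookup. It is an illustration under stated assumptions, not AMBER's implementation: `GAZETTEERS` and `annotate` are hypothetical names, and the terms are taken from the example locations on the slides.

```python
import re

# Hypothetical mini gazetteers: attribute type -> set of known terms.
GAZETTEERS = {
    "location": {"oxford", "oxfordshire", "boars hill"},
    "currency": {"gbp"},
}


def annotate(text, gazetteers=GAZETTEERS):
    """Return sorted (attribute_type, matched_term) pairs found in text."""
    lowered = text.lower()
    hits = set()
    for attr_type, terms in gazetteers.items():
        for term in terms:
            # Match whole words only, so "oxford" is not found inside "oxfordshire".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                hits.add((attr_type, term))
    return sorted(hits)


annotations = annotate("Jarn Way, Boars Hill, Oxford, Oxfordshire")
```

In this sketch "Jarn Way" stays unannotated because it is not in the term list, which illustrates why building gazetteers by hand is a lot of work.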
AMBER: From Extraction to Learning



Leverage the repeated structure in result pages to learn new terms.

Gazetteers (term lists): A lot of work!
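
The idea can be sketched as follows: records rendered from the same template tend to place the same attribute at the same position, so if enough values at one position are already known terms, the remaining values at that position are candidates for new terms. This is a simplified illustration, not AMBER's actual algorithm. The function `learn_terms`, the `min_support` threshold, the position keys, and the sample records (including "Abingdon") are all assumptions made for the example.

```python
def learn_terms(records, known_terms, min_support=0.5):
    """Propose new terms from field positions that mostly hold known terms.

    records: list of dicts mapping a field position (e.g. a DOM path) to its text.
    known_terms: lower-cased terms already in the gazetteer.
    min_support: fraction of values at a position that must be known before
                 the position is trusted (an assumed tuning knob).
    """
    learned = set()
    positions = {pos for record in records for pos in record}
    for pos in positions:
        values = [r[pos].strip().lower() for r in records if pos in r]
        known = [v for v in values if v in known_terms]
        if len(known) / len(values) >= min_support:
            learned.update(v for v in values if v not in known_terms)
    return learned


# Hypothetical records: position keys and values are made up for illustration.
records = [
    {"div/span[1]": "Oxford", "div/span[2]": "5 bedrooms"},
    {"div/span[1]": "Abingdon", "div/span[2]": "7 bedrooms"},
    {"div/span[1]": "Oxford", "div/span[2]": "6 bedrooms"},
]
new_terms = learn_terms(records, known_terms={"oxford"})
```

Here two of the three values at the first position are known, so the unknown third value is proposed as a new term, while the second position contributes nothing.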
AMBER: Automatically Learning Gazetteers
AMBER annotates first to integrate semantic information into its repeated structure analysis.

• Page Segmentation
  ‣ clusters attribute instances
  ‣ analyses repeated structures
• Attribute Alignment
  matches knowledge with observations
• Gazetteer Learning
  turns phrases into terms
AMBER: Page Segmentation
 Page Retrieval
   Mozilla via XUL Runner
   GATE Annotations
 Data Area Identification
   Pivot node clustering

 [Figure: DOM tree with root R, two D nodes, six L nodes, and leaves P P P X P P A P A P A P A]

 A data area is a maximal DOM subtree d, which
  • contains ≥2 pivot nodes, which are
  • depth consistent (depth(n)=k±ε)
  • distance consistent (pathlen(n,n')=k±δ)
  • continuous, such that
  • their least common ancestor is d's root.
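
The data area conditions can be sketched in a few lines. This is only an illustration, not AMBER's implementation: nodes are assumed to be given as tuples of child indices from the document root, distance consistency is assumed to be checked between consecutive pivot nodes, and continuity and maximality are not checked. The names `Path`, `common_prefix`, `path_length` and `is_data_area` are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical encoding: a node is the tuple of child indices on the way
# from the document root to that node, so len(path) is its depth.
Path = Tuple[int, ...]


def common_prefix(a: Path, b: Path) -> Path:
    """Least common ancestor of two nodes in the path encoding."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def path_length(a: Path, b: Path) -> int:
    """Number of edges on the tree path between a and b."""
    lca = common_prefix(a, b)
    return (len(a) - len(lca)) + (len(b) - len(lca))


def is_data_area(root: Path, pivots: Sequence[Path],
                 eps: int = 0, delta: int = 0) -> bool:
    """Check the >=2, depth, distance and LCA conditions for a candidate root.

    Continuity and maximality are deliberately left out of this sketch.
    """
    if len(pivots) < 2:
        return False
    # depth(n) = k +/- eps for some k: depths spread over at most 2*eps.
    depths = [len(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:
        return False
    # pathlen(n, n') = k +/- delta, assumed here for consecutive pivots.
    dists = [path_length(a, b) for a, b in zip(pivots, pivots[1:])]
    if max(dists) - min(dists) > 2 * delta:
        return False
    # The least common ancestor of all pivots must be the candidate root.
    lca = pivots[0]
    for p in pivots[1:]:
        lca = common_prefix(lca, p)
    return lca == root
```

For example, three pivots at `(0, 1, 0, 0)`, `(0, 1, 1, 0)` and `(0, 1, 2, 0)` satisfy the conditions for the candidate root `(0, 1)` with `eps = delta = 0`.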
AMBER: Page Segmentation
 Page Retrieval
   Mozilla via XUL Runner
   GATE Annotations
 Data Area Identification
   Pivot node clustering
 Record Segmentation
   head/tail cut-off
   segmentation boundary shifting

 [Figure: DOM tree with root R, two D nodes, six L nodes, and leaves P P P X P P A P A P A P A]

 A result record is a sequence of children of the data area root.
 A result record segmentation divides a data area
  • into non-overlapping records,
  • containing the same number of siblings,
  • each based on a single selected pivot node.
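
A simplified sketch of such a segmentation, under loud assumptions: the children of the data area root are given as a flat list with a flag marking which ones contain a pivot node, the record size is taken from the spacing of the pivots, and the first feasible boundary shift is accepted (the slides do not say how AMBER chooses among shifts). The function name `segment` is hypothetical.

```python
from typing import List, Optional, Sequence


def segment(children: Sequence[str],
            has_pivot: Sequence[bool]) -> Optional[List[List[str]]]:
    """Split children into equally sized records with one pivot each.

    Children outside the returned records are the head/tail cut-off.
    """
    idx = [i for i, p in enumerate(has_pivot) if p]
    if len(idx) < 2:
        return None
    size = idx[1] - idx[0]
    # Assumption: pivots must be evenly spaced for equal-sized records.
    if any(b - a != size for a, b in zip(idx, idx[1:])):
        return None
    # Boundary shifting: move the record start left of the first pivot
    # until all records fit inside the list of children.
    for shift in range(size):
        start = idx[0] - shift
        end = start + size * len(idx)
        if start < 0 or end > len(children):
            continue
        return [list(children[s:s + size]) for s in range(start, end, size)]
    return None
```

With children `h, a1, a2, b1, b2, c1, c2, t` and pivots in `a1`, `b1`, `c1`, the sketch cuts off `h` and `t` and returns three records of two siblings each.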
AMBER: Attribute Alignment
 Attribute Cleanup
   discard attributes with low support
 Attribute Disambiguation
   discard ambiguous attributes with lower support
 Attribute Generalisation
   add new un-annotated attributes with sufficient support

 [Figure: record subtrees with six L nodes and leaves P P P X P A P A P A P A, marked Disambiguation, Cleanup, Generation]

 The tag path of a node n in a record r is the
  • tag sequence occurring on the
  • child/next-sibling path from r's root to n.

 The support of a type/tag path pair (t,p) is the
  • fraction of records having an
  • annotation for t at path p.
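
The support definition and the three alignment steps can be illustrated as follows. This is a sketch under assumptions, not AMBER's code: a record is modelled as a mapping from tag path to the set of attribute types annotated there (an empty set means the node exists but is un-annotated), and the thresholds `min_support` and `generalise_at` are hypothetical values.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical record model: tag path -> attribute types annotated at it.
Record = Dict[str, Set[str]]


def support(records: List[Record]) -> Dict[Tuple[str, str], float]:
    """Fraction of records having an annotation for type t at path p."""
    counts = Counter((t, p) for r in records
                     for p, types in r.items() for t in types)
    return {tp: c / len(records) for tp, c in counts.items()}


def align(records: List[Record], min_support: float = 0.3,
          generalise_at: float = 0.6) -> List[Dict[str, str]]:
    """Return one attribute type per tag path for every record."""
    best: Dict[str, Tuple[str, float]] = {}
    for (t, p), s in support(records).items():
        if s < min_support:
            continue  # cleanup: discard attributes with low support
        if p not in best or s > best[p][1]:
            best[p] = (t, s)  # disambiguation: keep the higher support
    aligned = []
    for r in records:
        out = {}
        for p, types in r.items():
            if p not in best:
                continue
            t, s = best[p]
            # generalisation: fill un-annotated nodes if support suffices
            if t in types or s >= generalise_at:
                out[p] = t
        aligned.append(out)
    return aligned
```

In this model, a record whose `div/p` node carries no annotation still receives a location attribute there when enough other records annotate `div/p` as location.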
AMBER: Gazetteer Learning
 Term Formulation
   split newly generated attributes into terms
   discard terms on black-lists and from non-overlapping attributes
 Term Validation
   track term relevance
   discard irrelevant ones

 Example: Oxford, Walton Street, top-floor apartment
   terms: Oxford | Walton Street | top-floor apartment
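
A minimal sketch of term formulation and validation, with several assumptions stated up front: terms are split on commas and semicolons, "non-overlapping" is read as "shares no term with the current gazetteer", and relevance is the fraction of observations judged relevant. The names `formulate_terms` and `TermValidator` and the threshold `min_relevance` are hypothetical.

```python
import re
from collections import Counter
from typing import List, Set


def formulate_terms(attribute: str, gazetteer: Set[str],
                    blacklist: Set[str]) -> List[str]:
    """Split a newly generated attribute into candidate terms."""
    terms = [t.strip() for t in re.split(r"[,;]", attribute) if t.strip()]
    # Assumed reading of "non-overlapping": skip attributes that share
    # no term with the current (lower-cased) gazetteer.
    if not any(t.lower() in gazetteer for t in terms):
        return []
    return [t for t in terms
            if t.lower() not in gazetteer and t.lower() not in blacklist]


class TermValidator:
    """Tracks how often a learned term turned out to be relevant."""

    def __init__(self, min_relevance: float = 0.5) -> None:
        self.min_relevance = min_relevance
        self.seen: Counter = Counter()
        self.relevant: Counter = Counter()

    def observe(self, term: str, relevant: bool) -> None:
        self.seen[term] += 1
        if relevant:
            self.relevant[term] += 1

    def validated(self) -> Set[str]:
        """Terms whose relevance ratio reaches the threshold."""
        return {t for t, n in self.seen.items()
                if self.relevant[t] / n >= self.min_relevance}
```

With a gazetteer containing "oxford" and a black-list containing "top-floor apartment", the example attribute from the slide yields the single new term "Walton Street".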
AMBER: Evaluation
              Learning Location from 250 pages from 150 sites
              (UK real estate market)
              Starting with a 25% sample of our full gazetteer
              (containing 33,243 terms)
              Initially failed to annotate 328 locations
              After 3 learning rounds, learned 265 of those
              (recall: 80.6%, precision: 95.1%)
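
For comparison with the F1 figure quoted for extraction, the reported recall and precision can be combined into an F1 score. The slide does not state this value; it is derived here with the standard harmonic-mean formula, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide for gazetteer learning.
learning_f1 = f1_score(precision=0.951, recall=0.806)
print(f"F1 = {learning_f1:.3f}")
```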
AMBER: Evaluation
 [Figure: evaluation chart with three data series; labels not recoverable]
AMBER: Evaluation
 [Figure: bar chart of learnt locations vs. correct locations for Round 1, Round 2 and Round 3; y-axis from 0 to 300]
AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)

  • 1. DIADEM domain-centric intelligent automated data extraction methodology Automatically Learning Gazetteers from the Deep Web Christian Schallhart April 19th, 2012 @ WWW in Lyon joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang Friday, May 11, 2012
  • 2. AMBER: Extraction from Result Pages 2 Friday, May 11, 2012
  • 3. AMBER: Extraction from Result Pages <offer> <price> ! <currency>GBP</currency> ! <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer> 2 Friday, May 11, 2012
  • 4. AMBER: Extraction from Result Pages <offer> <price> ! <currency>GBP</currency> ! <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer> 2 Friday, May 11, 2012
  • 5. AMBER: Extraction from Result Pages 100.0% precision recall 99.5% <offer> <price> ! <currency>GBP</currency> 99.0% ! <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, 98.5% Oxford, Oxfordshire</location> </offer> <offer> >98.5% 98.0% <price> ! ! F1 score <currency>GBP</currency> <amount>3950000</amount> </price> 97.5% <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, data areas records attributes Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer> 2 Friday, May 11, 2012
  • 6. AMBER: Extraction from Result Pages <offer> <price> ! <currency>GBP</currency> ! <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> >98.5% <price> ! ! F1 score <currency>GBP</currency> <amount>3950000</amount> </price> <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer> Domain Knowledge (no per-site training) 2 Friday, May 11, 2012
  • 7. AMBER: Extraction from Result Pages <offer> <price> ! <currency>GBP</currency> ! <amount>4000000</amount> </price> <bedrooms>5</bedrooms> <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> >98.5% <price> ! ! F1 score <currency>GBP</currency> <amount>3950000</amount> </price> <bedrooms>7</bedrooms> <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location> </offer> <offer> <price> ! <currency>GBP</currency> ! <amount>3950000</amount> </price> <bedrooms>6</bedrooms> <location>Old Boars Hill, Oxford</location> </offer> Little Ontology Domain (mandatory) Knowledge attribute types (no per-site training) 2 Friday, May 11, 2012
  • 8. AMBER: Extraction from Result Pages >98.5% F1 score. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types. Gazetteers: term lists.
  • 9. AMBER: Extraction from Result Pages >98.5% F1 score. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types. Quite easy. Gazetteers: term lists.
  • 10. AMBER: Extraction from Result Pages >98.5% F1 score. Domain Knowledge (no per-site training). Little Ontology: (mandatory) attribute types. Quite easy. Gazetteers: term lists. A lot of work!
  • 11. AMBER: From Extraction to Learning. Leverage the repeated structure in result pages to learn new terms. Gazetteers: term lists. A lot of work!
  • 12. AMBER: Automatically Learning Gazetteers >98.5% F1 score. Gazetteers: term lists. A lot of work!
  • 13. AMBER: Automatically Learning Gazetteers • Page Segmentation ‣ clusters attribute instances ‣ analyses repeated structures. >98.5% F1 score. Gazetteers: term lists. A lot of work!
  • 14. AMBER: Automatically Learning Gazetteers • Page Segmentation ‣ clusters attribute instances ‣ analyses repeated structures. AMBER annotates first to integrate semantic information into its repeated structure analysis. >98.5% F1 score. Gazetteers: term lists. A lot of work!
  • 15. AMBER: Automatically Learning Gazetteers • Page Segmentation ‣ clusters attribute instances ‣ analyses repeated structures. AMBER annotates first to integrate semantic information into its repeated structure analysis. • Attribute Alignment matches knowledge with observations. >98.5% F1 score. Gazetteers: term lists. A lot of work!
  • 16. AMBER: Automatically Learning Gazetteers • Page Segmentation ‣ clusters attribute instances ‣ analyses repeated structures. AMBER annotates first to integrate semantic information into its repeated structure analysis. • Attribute Alignment matches knowledge with observations. • Gazetteer Learning turns phrases into terms. >98.5% F1 score. Gazetteers: term lists. A lot of work!
  • 18. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner, GATE annotations.
  • 19. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner, GATE annotations. Data Area Identification: pivot node clustering. [Diagram: DOM tree with nodes labelled R, D, L, P, A, X]
  • 20. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner, GATE annotations. Data Area Identification: pivot node clustering. [Diagram: DOM tree with nodes labelled R, D, L, P, A, X] A data area is a maximal DOM subtree d, which • contains ≥2 pivot nodes, which are • depth consistent (depth(n) = k±ε), • distance consistent (pathlen(n,n') = k±δ), • continuous, such that • their least common ancestor is d's root.
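
The data area conditions on slide 20 can be sketched in code. This is a minimal Python sketch under stated assumptions, not AMBER's implementation: nodes are represented by hypothetical child-index paths from the document root, `eps` and `delta` stand in for ε and δ, "k±ε for some k" is read as a spread of at most 2ε, distances are measured between pivots adjacent in document order, and the maximality and continuity conditions are not checked.

```python
from typing import Sequence, Tuple

# Assumption: a node is identified by the child indices on the way down
# from the document root, e.g. (0, 1, 2) = root's 1st child's 2nd child's 3rd child.
Path = Tuple[int, ...]


def common_prefix(a: Path, b: Path) -> Path:
    """Path of the least common ancestor of two nodes."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]


def path_length(a: Path, b: Path) -> int:
    """Number of tree edges between two nodes (up to the LCA, then down)."""
    lca = common_prefix(a, b)
    return (len(a) - len(lca)) + (len(b) - len(lca))


def is_candidate_data_area(root: Path, pivots: Sequence[Path],
                           eps: int = 0, delta: int = 0) -> bool:
    """Check the pivot conditions for the subtree rooted at `root`.

    Maximality and continuity are deliberately left out of this sketch.
    """
    if len(pivots) < 2:
        return False
    # Depth consistency: all depths fit into some window k +/- eps.
    depths = [len(p) for p in pivots]
    if max(depths) - min(depths) > 2 * eps:
        return False
    # Distance consistency: neighbouring pivots are about equally far apart.
    ordered = sorted(pivots)  # lexicographic order on paths = document order
    dists = [path_length(a, b) for a, b in zip(ordered, ordered[1:])]
    if max(dists) - min(dists) > 2 * delta:
        return False
    # The least common ancestor of all pivots must be the subtree root.
    lca = ordered[0]
    for p in ordered[1:]:
        lca = common_prefix(lca, p)
    return lca == root
```

For example, three pivots at paths `(0, 1, 0, 2)`, `(0, 1, 1, 2)`, and `(0, 1, 2, 2)` pass the check for the subtree rooted at `(0, 1)` and fail it for the subtree rooted at `(0,)`, because their least common ancestor is `(0, 1)`.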
  • 21. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner, GATE annotations. Data Area Identification: pivot node clustering. Record Segmentation: head/tail cut-off, segmentation boundary shifting. [Diagram: DOM tree with nodes labelled R, D, L, P, A, X]
  • 23. AMBER: Page Segmentation. Page Retrieval: Mozilla via XULRunner, GATE annotations. Data Area Identification: pivot node clustering. Record Segmentation: head/tail cut-off, segmentation boundary shifting. [Diagram: DOM tree with nodes labelled R, D, L, P, A, X] A result record is a sequence of children of the data area root. A result record segmentation divides a data area • into non-overlapping records, • containing the same number of siblings, • each based on a single selected pivot node.
  • 24. AMBER: Attribute Alignment [Diagram: record trees with nodes labelled L, P, A, X]
  • 25. AMBER: Attribute Alignment [Diagram: record trees with nodes labelled L, P, A, X] The tag path of a node n in a record r is the • tag sequence occurring on the • child/next-sibling path from r's root to n. The support of a type/tag path pair (t,p) is the • fraction of records having an • annotation for t at path p.
  • 26. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. [Diagram: record trees with nodes labelled L, P, A, X; stage: Cleanup] The tag path of a node n in a record r is the • tag sequence occurring on the • child/next-sibling path from r's root to n. The support of a type/tag path pair (t,p) is the • fraction of records having an • annotation for t at path p.
  • 27. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. Attribute Disambiguation: discard ambiguous attributes with lower support. [Diagram: record trees with nodes labelled L, P, A, X; stages: Cleanup, Disambiguation] The tag path of a node n in a record r is the • tag sequence occurring on the • child/next-sibling path from r's root to n. The support of a type/tag path pair (t,p) is the • fraction of records having an • annotation for t at path p.
  • 28. AMBER: Attribute Alignment. Attribute Cleanup: discard attributes with low support. Attribute Disambiguation: discard ambiguous attributes with lower support. Attribute Generalisation: add new un-annotated attributes with sufficient support. [Diagram: record trees with nodes labelled L, P, A, X; stages: Cleanup, Disambiguation, Generation] The tag path of a node n in a record r is the • tag sequence occurring on the • child/next-sibling path from r's root to n. The support of a type/tag path pair (t,p) is the • fraction of records having an • annotation for t at path p.
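
The support definition and the three alignment steps can be illustrated with a short Python sketch. It is an illustration under assumptions rather than AMBER's actual code: each record is modelled as a hypothetical dictionary from tag path to the set of annotation types found there, `min_support` is an invented threshold, and "ambiguous" is read as "one tag path carrying several types".

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumption: a record maps each tag path to the annotation types found
# at the node with that path (an empty set means "node exists, no annotation").
Record = Dict[str, Set[str]]


def support(records: List[Record]) -> Dict[Tuple[str, str], float]:
    """Fraction of records that have an annotation for type t at path p."""
    counts: Counter = Counter()
    for rec in records:
        for path, types in rec.items():
            for t in types:
                counts[(t, path)] += 1
    return {pair: n / len(records) for pair, n in counts.items()}


def align(records: List[Record], min_support: float = 0.5) -> List[Dict[str, str]]:
    """Return, per record, one attribute type per tag path."""
    sup = support(records)
    # Cleanup: discard (type, path) pairs with low support.
    kept = {pair: s for pair, s in sup.items() if s >= min_support}
    # Disambiguation: if a path still carries several types, keep the one
    # with the highest support (ties go to the first one encountered).
    best: Dict[str, str] = {}
    for (t, path), s in kept.items():
        if path not in best or s > kept[(best[path], path)]:
            best[path] = t
    # Generalisation: every record with a node at a kept path receives the
    # attribute, even if that node was not annotated in this record.
    return [{path: best[path] for path in rec if path in best} for rec in records]
```

With three records where `location` is annotated at the same path in two of them and the third has an un-annotated node at that path, the pair has support 2/3, survives cleanup at the assumed threshold of 0.5, and is added to the third record by generalisation. Those newly added attributes are the input to gazetteer learning.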
  • 29. AMBER: Gazetteer Learning. "Oxford, Walton Street, top-floor apartment"
  • 30. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms. "Oxford, Walton Street, top-floor apartment" becomes "Oxford", "Walton Street", "top-floor apartment".
  • 31. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes. "Oxford, Walton Street, top-floor apartment" becomes "Oxford", "Walton Street", "top-floor apartment".
  • 32. AMBER: Gazetteer Learning. Term Formulation: split newly generated attributes into terms; discard terms on black-lists and from non-overlapping attributes. "Oxford, Walton Street, top-floor apartment" becomes "Oxford", "Walton Street", "top-floor apartment". Term Validation: track term relevance, discard irrelevant ones.
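
Term formulation and validation might look roughly like the following Python sketch. The slides do not say how terms are split or how relevance is measured, so everything here is an assumption for illustration: splitting on commas and semicolons, a caller-supplied `blacklist`, and relevance taken as the share of observations in which a term occurred inside an attribute of the expected type. The overlap test for non-overlapping attributes is not modelled.

```python
import re
from collections import defaultdict
from typing import FrozenSet, List


def formulate_terms(attribute_text: str,
                    blacklist: FrozenSet[str] = frozenset()) -> List[str]:
    """Split a newly generated attribute into candidate terms.

    Assumption: terms are separated by commas or semicolons; black-listed
    terms are compared case-insensitively.
    """
    parts = [p.strip() for p in re.split(r"[,;]", attribute_text)]
    return [p for p in parts if p and p.lower() not in blacklist]


class TermTracker:
    """Tracks how often a learned term shows up where it is expected."""

    def __init__(self, min_relevance: float = 0.5) -> None:
        self.hits = defaultdict(int)    # term seen inside the expected attribute
        self.misses = defaultdict(int)  # term seen anywhere else
        self.min_relevance = min_relevance  # hypothetical threshold

    def observe(self, term: str, in_expected_attribute: bool) -> None:
        if in_expected_attribute:
            self.hits[term] += 1
        else:
            self.misses[term] += 1

    def relevant_terms(self) -> List[str]:
        """Terms whose hit ratio reaches the threshold; the rest are discarded."""
        terms = set(self.hits) | set(self.misses)
        return sorted(
            t for t in terms
            if self.hits[t] / (self.hits[t] + self.misses[t]) >= self.min_relevance
        )
```

Calling `formulate_terms("Oxford, Walton Street, top-floor apartment")` yields the three terms from the slide; passing `frozenset({"top-floor apartment"})` as the black-list leaves only the two location terms.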
  • 33. AMBER: Evaluation
  • 34. AMBER: Evaluation. Learning Location from 250 pages from 150 sites (UK real estate market). Starting with a 25% sample of our full gazetteer (containing 33,243 terms).
  • 35. AMBER: Evaluation. Learning Location from 250 pages from 150 sites (UK real estate market). Starting with a 25% sample of our full gazetteer (containing 33,243 terms), AMBER initially failed to annotate 328 locations; after 3 learning rounds, it learned 265 of those (recall: 80.6%, precision: 95.1%).
  • 36. AMBER: Evaluation [Chart]
  • 37. AMBER: Evaluation [Chart]
  • 38. DEMO