The Problem
  If a man, A, who weighs 11 stone leaves from his home at 8:30 in the morning in a car whose consumption is 16.25 mpg at an average speed of 40 m.p.h. to his office which is 12 miles away and he stops for a coffee on the way for 15 minutes and also puts air in one of his tyres which has a slow puncture letting out air at a rate of 2 lbs per square inch per mile travelled when the car is moving at 32 m.p.h. and he picks up a hitch-hiker B who weighs 14 stone plus suitcase But hitch-hiker B who is a political activist distributes leaflets from his suitcase each of which weigh an ounce at the scale of 2 leaflets per person at every bus stop and every vehicle on either side of them at every red traffic light during the journey which includes 20 bus stops with an average of 6 people per stop 5 lorries each with a passenger one of which exchanged a Yorkie Bar weighing an ounce for 12 of the leaflets and 2 coaches each containing 51 people 7 of which from one coach returned the leaflets and 16 people from the other coach who asked for a further leaflet each for a member of one of their families Assuming that man A then had to travel a further 2.86 miles out of his way to drop off hitch-hiker B how late would man A be in arriving at the office by 9:30 a.m.? If he still had 6 miles to travel and his watch was running 23 minutes slow but the clock at the office was running 2 minutes faster than his was in fact 17 minutes and 3 secs ahead of the correct time which was 2:30 in the morning in Caracas If when 5 miles from the office he telephoned his boss to apologize for being late but was told by his boss C to pick up a package 2.63 miles away from his present location and deliver it to client D in Bristol by train, by 4:30 that afternoon and at the same time man D was mistakenly told to come to London to receive same package from man A Now man A's train, train 1, left 30 mins. late but man D's train, train 2, left 5 mins early so when the trains passed each other train 1 was travelling at 75 m.p.h. to make up for lost time and train 2 was travelling at 52 m.p.h. Would man A reach Bristol earlier or later according to his watch which was now running 5 mins. slower than man D's would have been had he not got off the train and checked the correct time at a station between Bristol and London and stopped to phone A's boss, man C to double check A would be there to meet him and discover his mistake catch next train, train 3, back to Bristol which unlike A's train 1 which stopped at 4 stations on the way for 6 mins each stop was an express train D's train caught up with A's train 1 4 miles from Bristol As the trains drew alongside each other A's train was travelling at 12 m.p.h. and D's train was travelling at 13.6 m.p.h. and man A was sat in the front...

  How long would it take to fill the bath?




Thursday, 27 September, 12
The Problem
  • Loading from a 6 GB structured XML file
  • Methods:
        – DOM - not enough memory
        – Each element, text, attribute, allocated separately
        – “Perl is a profligate wastrel when it comes to memory
          use. There is a saying that to estimate memory usage
          of Perl, assume a reasonable algorithm for memory
          allocation, multiply that estimate by 10, and while
          you still may miss the mark, at least you won't be
          quite so astonished” (perldoc perldebguts)


The Problem
  • Methods:
        – SAX - possible but painful
        – Memory is not an issue
        – Event-based, handlers for each element, text
        – Developer needs to build all data structures
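
To make "possible but painful" concrete, here is a minimal sketch of the event-based style. It uses XML::Parser's expat-style Start/Char/End callbacks rather than a SAX2 driver, and the sample document and the primary_gene_names helper are assumptions made for illustration, not code from the talk. Note that the element stack, the capture flag, and the text buffer all have to be maintained by hand.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical miniature document; element names are assumed for illustration.
my $xml = <<'XML';
<uniprot>
  <entry>
    <gene>
      <name type="primary">BRCA2</name>
      <name type="synonym">FANCD1</name>
    </gene>
  </entry>
  <entry>
    <gene>
      <name type="primary">TP53</name>
    </gene>
  </entry>
</uniprot>
XML

sub primary_gene_names {
    my ($xml_text) = @_;
    my @names;          # result structure, built by hand
    my @path;           # stack of currently open element names
    my $capture = 0;    # true while inside <gene><name type="primary">
    my $buffer  = '';   # character data may arrive in several chunks

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                push @path, $element;
                if (   $element eq 'name'
                    && @path >= 2
                    && $path[-2] eq 'gene'
                    && ($attr{type} // '') eq 'primary')
                {
                    $capture = 1;
                    $buffer  = '';
                }
            },
            Char => sub {
                my ($expat, $text) = @_;
                $buffer .= $text if $capture;
            },
            End => sub {
                my ($expat, $element) = @_;
                if ($capture && $element eq 'name') {
                    push @names, $buffer;
                    $capture = 0;
                }
                pop @path;
            },
        },
    );
    $parser->parse($xml_text);
    return @names;
}

print join(', ', primary_gene_names($xml)), "\n";   # prints "BRCA2, TP53"
```

Memory stays flat because nothing is kept unless a handler stores it, but every new field of interest means more state variables like the ones above.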




XML::Twig - a hybrid of DOM and SAX
          sub main {
              ...

              my $twig = XML::Twig->new(
                  # Elements to process; a handler will be called for each
                  twig_handlers => { entry => \&entry },
                  # Elements to ignore; these are not loaded
                  ignore_elts => { reference   => 'discard',
                                   dbReference => 'discard' },
                  do_not_chain_handlers => 1
              );
              $twig->parsefile($data_file);
              # Clean up memory when done
              $twig->purge;
          }
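
A self-contained variation on the setup above, small enough to run as-is. The inline sample document and the collect_entries helper are hypothetical; parse on a string stands in for parsefile on the real data file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample; the input described in the talk is a multi-gigabyte file.
my $xml = <<'XML';
<uniprot>
  <entry><accession>P1</accession><reference>skipped</reference></entry>
  <entry><accession>P2</accession><dbReference id="x">skipped</dbReference></entry>
</uniprot>
XML

sub collect_entries {
    my ($xml_text) = @_;
    my @seen;

    my $twig = XML::Twig->new(
        # Called once per <entry>, as soon as its end tag has been parsed.
        twig_handlers => {
            entry => sub {
                my ($t, $entry) = @_;
                my $accession = $entry->first_child_text('accession');
                # Elements listed in ignore_elts were never built, so they
                # do not show up among the children.
                my $children = join ',', map { $_->tag } $entry->children;
                push @seen, "$accession: $children";
                $t->purge;    # free everything parsed so far
                return 1;
            },
        },
        ignore_elts => { reference   => 'discard',
                         dbReference => 'discard' },
        do_not_chain_handlers => 1,
    );
    $twig->parse($xml_text);
    $twig->purge;
    return @seen;
}

print "$_\n" for collect_entries($xml);
```

Calling purge inside the handler is what keeps memory bounded: each entry is released as soon as it has been processed instead of accumulating until the end of the parse.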


          sub entry {
              my ($twig, $entry) = @_;

              # We can use XPath to find stuff
              my ($organism_name) =
                  $entry->get_xpath('organism/name[@type = "scientific"]');

              return unless ($organism_name &&
                             $organism_name->trimmed_text() eq 'homo sapiens');

              my ($gene) = $entry->get_xpath('gene');
              return unless (defined($gene));

              my ($gene_name) = $gene->get_xpath('name[@type = "primary"]');
              return unless (defined($gene_name));
              # Methods to get element data
              $gene_name = $gene_name->trimmed_text();

              my @synonyms = map {
                  $_->trimmed_text()
              } $gene->get_xpath('name[@type = "synonym"]');

              ...

              # Clean up memory when done
              $entry->purge;
              return 0;
          }
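
The same handler pattern can be tried on a few lines of sample data. This is a sketch under assumptions: the sample XML and the gene_summary helper are invented for illustration, and only get_xpath, trimmed_text, att, and purge from the XML::Twig API are used.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample entry, shaped like the elements the handler queries.
my $xml = <<'XML';
<uniprot>
  <entry>
    <organism><name type="scientific">Homo sapiens</name></organism>
    <gene>
      <name type="primary">  BRCA2 </name>
      <name type="synonym">FANCD1</name>
      <name type="synonym">FACD</name>
    </gene>
  </entry>
</uniprot>
XML

sub gene_summary {
    my ($xml_text) = @_;
    my %summary;

    my $twig = XML::Twig->new(
        twig_handlers => {
            entry => sub {
                my ($t, $entry) = @_;

                # get_xpath returns a list; take the first match.
                my ($primary) =
                    $entry->get_xpath('gene/name[@type="primary"]');
                return 1 unless $primary;

                # trimmed_text strips the surrounding whitespace.
                $summary{name} = $primary->trimmed_text;
                $summary{type} = $primary->att('type');
                $summary{synonyms} = [
                    map { $_->trimmed_text }
                        $entry->get_xpath('gene/name[@type="synonym"]')
                ];

                $t->purge;    # done with this entry
                return 1;
            },
        },
    );
    $twig->parse($xml_text);
    return \%summary;
}

my $summary = gene_summary($xml);
print "$summary->{name}: @{ $summary->{synonyms} }\n";
```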

Assessment
  • Advantages
        – Only elements of interest are loaded when needed
        – Subset of XPath for navigation, among other methods
        – Can both read and rewrite XML
        – Code simpler than SAX, without DOM memory death
  • Disadvantages
        – XML::Twig is... slooooooooow-ish
              • Compared to SAX and DOM
              • Both of which primarily use C
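
As an illustration of the "read and rewrite" point, the sketch below upper-cases gene names while streaming the document back out. The sample XML, the uppercase_names helper, and the use of an in-memory filehandle are assumptions for the example. flush writes out, and then frees, everything parsed so far; the flush after parsing emits whatever remains.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document.
my $xml = '<genes><gene><name type="primary">brca2</name></gene>'
        . '<gene><name type="primary">tp53</name></gene></genes>';

sub uppercase_names {
    my ($xml_text) = @_;
    my $out = '';
    open my $fh, '>', \$out or die "cannot open in-memory handle: $!";

    my $twig = XML::Twig->new(
        twig_handlers => {
            gene => sub {
                my ($t, $gene) = @_;
                # Edit the element in place before it is written out.
                for my $name ($gene->children('name')) {
                    $name->set_text(uc $name->text);
                }
                $t->flush($fh);    # print and release what has been parsed
                return 1;
            },
        },
    );
    $twig->parse($xml_text);
    $twig->flush($fh);             # output the end of the document
    close $fh;
    return $out;
}

print uppercase_names($xml), "\n";
```

Replacing the in-memory handle with a file opened for writing gives a streaming filter whose memory use depends on the size of one gene element rather than on the whole file.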




 

Solving Complex XML Parsing Problem with XML::Twig

  • 1. The Problem If a man, A, who weighs 11 stone leaves from his home at 8:30 in the morning in a car whose consumption is 16.25 mpg at an average speed of 40 m.p.h. to his office which is 12 miles away and he stops for a coffee on the way for 15 minutes and also puts air in one of his tyres which has a slow puncture letting out air at a rate of 2 lbs per square inch per mile travelled when the car is moving at 32 m.p.h. and he picks up a hitch-hiker B who weighs 14 stone plus suitcase But hitch-hiker B who is a political activist distributes leaflets from his suitcase each of which weigh an ounce at the scale of 2 leaflets per person at every bus stop and every vehicle on either side of them at every red traffic light during the journey which includes 20 bus stops with an average of 6 people per stop 5 lorries each with a passenger one of which exchanged a Yorkie Bar weighing an ounce for 12 of the leaflets and 2 coaches each containing 51 people 7 of which from one coach returned the leaflets and 16 people from the other coach who asked for a further leaflet each for a member of one of their families Assuming that man A then had to travel a further 2.86 miles out of his way to drop off hitch-hiker B how late would man A be in arriving at the office by 9:30 a.m.?
If he still had 6 miles to travel and his watch was running 23 minutes slow but the clock at the office was running 2 minutes faster than his was in fact 17 minutes and 3 secs ahead of the correct time which was 2:30 in the morning in Caracas If when 5 miles from the office he telephoned his boss to apologize for being late but was told by his boss C to pick up a package 2.63 miles away from his present location and deliver it to client D in Bristol by train, by 4:30 that afternoon and at the same time man D was mistakenly told to come to London to receive same package from man A Now man A's train, train 1, left 30 mins. late but man D's train, train 2, left 5 mins early so when the trains passed each other train 1 was travelling at 75 m.p.h. to make up for lost time and train 2 was travelling at 52 m.p.h. Would man A reach Bristol earlier or later according to his watch which was now running 5 mins. slower than man D's would have been had he not got off the train and checked the correct time at a station between Bristol and London and stopped to phone A's boss, man C to double check A would be there to meet him and discover his mistake catch next train, train 3, back to Bristol which unlike A's train 1 which stopped at 4 stations on the way for 6 mins each stop was an express train D's train caught up with A's train 1 4 miles from Bristol As the trains drew alongside each other A's train was travelling at 12 m.p.h. and D's train was travelling at 13.6 m.p.h. and man A was sat in the front... How long would it take to fill the bath? Thursday, 27 September, 12
  • 2. The Problem
      • Loading from a 6 GB structured XML file
      • Methods:
          – DOM - not enough memory
          – Each element, text, attribute, allocated separately
          – "Perl is a profligate wastrel when it comes to memory use. There is a saying that to estimate memory usage of Perl, assume a reasonable algorithm for memory allocation, multiply that estimate by 10, and while you still may miss the mark, at least you won't be quite so astonished" (perldoc perldebguts)
  • 3. The Problem
      • Methods:
          – SAX - possible but painful
          – Memory is not an issue
          – Event-based, handlers for each element, text
          – Developer needs to build all data structures
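To make the "developer needs to build all data structures" point concrete, here is a minimal sketch of event-based parsing. It uses XML::Parser's handler interface as a stand-in for SAX (an assumption; the slides do not say which SAX module was considered), and the sample document and variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample data, not taken from the talk.
my $xml = <<'XML';
<entries>
  <entry><gene><name type="primary">BRCA1</name><name type="synonym">RNF53</name></gene></entry>
  <entry><gene><name type="primary">TP53</name></gene></entry>
</entries>
XML

my @genes;          # the result has to be assembled by hand
my @stack;          # open elements, tracked by hand
my $current_type;   # type attribute of the gene <name> being read, if any
my $buffer = '';    # character data may arrive in several chunks

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element, %attr) = @_;
            push @stack, $element;
            # Only <name> elements directly under <gene> are of interest.
            if ($element eq 'name' && @stack >= 2 && $stack[-2] eq 'gene') {
                $current_type = $attr{type};
                $buffer       = '';
            }
        },
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if defined $current_type;
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'name' && defined $current_type) {
                push @genes, $buffer if $current_type eq 'primary';
                undef $current_type;
            }
            pop @stack;
        },
    },
);
$parser->parse($xml);

print join(',', @genes), "\n";
```

Memory stays flat because nothing is kept beyond `@genes`, but the element stack, the text buffer and the "where am I" state all have to be maintained manually, which is the pain the slide refers to.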
  • 4. XML::Twig - a hybrid of DOM and SAX

        sub main {
            ...
            my $twig = XML::Twig->new(
                # Elements to process; a handler will be called for each
                twig_handlers => { entry => \&entry },
                # Elements to ignore; these are not loaded
                ignore_elts => {
                    reference   => 'discard',
                    dbReference => 'discard'
                },
                do_not_chain_handlers => 1
            );
            $twig->parsefile($data_file);
            # Clean up memory when done
            $twig->purge;
        }
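The setup above can be tried on a small in-memory document. The following is a sketch under stated assumptions: the sample XML, the `id` attribute and the `@seen` array are hypothetical, and `parse` on a string replaces `parsefile` so the script is self-contained.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; only the element names echo the slide.
my $xml = <<'XML';
<entries>
  <entry id="e1"><reference>skipped</reference><name>first</name></entry>
  <entry id="e2"><name>second</name><dbReference id="x"/></entry>
</entries>
XML

my @seen;   # one summary string per <entry> handled

my $twig = XML::Twig->new(
    twig_handlers => { entry => \&entry },
    ignore_elts   => { reference => 'discard', dbReference => 'discard' },
);
$twig->parse($xml);
$twig->purge;

sub entry {
    my ($twig, $entry) = @_;
    # List the child tags that were actually built for this entry.
    my @children = map { $_->tag } $entry->children;
    push @seen, $entry->att('id') . ':' . join('+', @children);
    $twig->purge;   # release everything parsed so far
}

print "$_\n" for @seen;
```

Each summary should list only the `name` child, since the ignored elements are never added to the tree that the handler sees.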
  • 8. The entry handler

        sub entry {
            my ($twig, $entry) = @_;

            # We can use XPath to find stuff
            my ($organism_name) = $entry->get_xpath('organism/name[@type = "scientific"]');
            return unless ($organism_name && $organism_name->trimmed_text() eq 'homo sapiens');

            my ($gene) = $entry->get_xpath('gene');
            return unless (defined($gene));

            my ($gene_name) = $gene->get_xpath('name[@type = "primary"]');
            return unless (defined($gene_name));
            # Methods to get element data
            $gene_name = $gene_name->trimmed_text();

            my @synonyms = map { $_->trimmed_text() } $gene->get_xpath('name[@type = "synonym"]');
            ...
            # Clean up memory when done
            $entry->purge;
            return 0;
        }
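A self-contained variant of the handler can be run against a couple of made-up entries. This is a sketch, not the talk's code: the sample XML and the `%synonyms_for` hash are hypothetical, and it calls `$twig->purge` on every path (a deliberate change from the slide) so that skipped entries are released too.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; element and attribute names follow the handler above.
my $xml = <<'XML';
<entries>
  <entry>
    <organism><name type="scientific">homo sapiens</name></organism>
    <gene><name type="primary">GENE1</name><name type="synonym">G1</name><name type="synonym">G-one</name></gene>
  </entry>
  <entry>
    <organism><name type="scientific">mus musculus</name></organism>
    <gene><name type="primary">Gene2</name></gene>
  </entry>
</entries>
XML

my %synonyms_for;   # primary gene name => array ref of synonyms

my $twig = XML::Twig->new(twig_handlers => { entry => \&entry });
$twig->parse($xml);

sub entry {
    my ($twig, $entry) = @_;

    my ($organism) = $entry->get_xpath('organism/name[@type = "scientific"]');
    if ($organism && $organism->trimmed_text() eq 'homo sapiens') {
        my ($primary) = $entry->get_xpath('gene/name[@type = "primary"]');
        if (defined $primary) {
            $synonyms_for{ $primary->trimmed_text() } =
                [ map { $_->trimmed_text() } $entry->get_xpath('gene/name[@type = "synonym"]') ];
        }
    }

    # Purge on every path, so entries that were skipped are released as well.
    $twig->purge;
    return 0;
}

for my $gene (sort keys %synonyms_for) {
    print "$gene: @{ $synonyms_for{$gene} }\n";
}
```

Only the entry whose scientific organism name matches is recorded; the other is parsed, inspected and discarded without being stored.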
  • 12. Assessment
      • Advantages
          – Only elements of interest are loaded when needed
          – Subset of XPath for navigation, among other methods
          – Can both read and rewrite XML
          – Code simpler than SAX, without DOM memory death
      • Disadvantages
          – XML::Twig is... slooooooooow-ish
              • Compared to SAX and DOM
              • Both of which primarily use C