
Big xml

Published in: Technology, News & Politics

  1. The Problem
     If a man, A, who weighs 11 stone leaves from his home at 8:30 in the morning in a car whose consumption is 16.25 mpg at an average speed of 40 m.p.h. to his office which is 12 miles away and he stops for a coffee on the way for 15 minutes and also puts air in one of his tyres which has a slow puncture letting out air at a rate of 2 lbs per square inch per mile travelled when the car is moving at 32 m.p.h. and he picks up a hitch-hiker B who weighs 14 stone plus suitcase But hitch-hiker B who is a political activist distributes leaflets from his suitcase each of which weigh an ounce at the scale of 2 leaflets per person at every bus stop and every vehicle on either side of them at every red traffic light during the journey which includes 20 bus stops with an average of 6 people per stop 5 lorries each with a passenger one of which exchanged a Yorkie Bar weighing an ounce for 12 of the leaflets and 2 coaches each containing 51 people 7 of which from one coach returned the leaflets and 16 people from the other coach who asked for a further leaflet each for a member of one of their families Assuming that man A then had to travel a further 2.86 miles out of his way to drop off hitch-hiker B how late would man A be in arriving at the office by 9:30 a.m.?
     If he still had 6 miles to travel and his watch was running 23 minutes slow but the clock at the office was running 2 minutes faster than his was in fact 17 minutes and 3 secs ahead of the correct time which was 2:30 in the morning in Caracas If when 5 miles from the office he telephoned his boss to apologize for being late but was told by his boss C to pick up a package 2.63 miles away from his present location and deliver it to client D in Bristol by train, by 4:30 that afternoon and at the same time man D was mistakenly told to come to London to receive same package from man A Now man A's train, train 1, left 30 mins. late but man D's train, train 2, left 5 mins early so when the trains passed each other train 1 was travelling at 75 m.p.h. to make up for lost time and train 2 was travelling at 52 m.p.h. Would man A reach Bristol earlier or later according to his watch which was now running 5 mins. slower than man D's would have been had he not got off the train and checked the correct time at a station between Bristol and London and stopped to phone A's boss, man C to double check A would be there to meet him and discover his mistake catch next train, train 3, back to Bristol which unlike A's train 1 which stopped at 4 stations on the way for 6 mins each stop was an express train D's train caught up with A's train 1 4 miles from Bristol As the trains drew alongside each other A's train was travelling at 12 m.p.h. and D's train was travelling at 13.6 m.p.h. and man A was sat in the front... How long would it take to fill the bath?
     Thursday, 27 September, 12
  2. The Problem
     • Loading from a 6 GB structured XML file
     • Methods:
       – DOM - not enough memory
       – Each element, text, attribute, allocated separately
       – “Perl is a profligate wastrel when it comes to memory use. There is a saying that to estimate memory usage of Perl, assume a reasonable algorithm for memory allocation, multiply that estimate by 10, and while you still may miss the mark, at least you won't be quite so astonished” (perldoc perldebguts)
  3. The Problem
     • Methods:
       – SAX - possible but painful
       – Memory is not an issue
       – Event-based, handlers for each element, text
       – Developer needs to build all data structures
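To make "developer needs to build all data structures" concrete, here is a minimal sketch of the event-based style. It is an illustration only: it uses XML::Parser's Start/Char/End handlers rather than a full SAX2 stack, and the sample XML, the `<records>` root, and the gene names are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample data, invented for this sketch.
my $xml = <<'XML';
<records>
  <entry><gene><name type="primary">GENE1</name></gene></entry>
  <entry><gene><name type="primary">GENE2</name></gene></entry>
</records>
XML

my @path;          # stack of open element names, maintained by hand
my $buffer;        # text of the element being captured (undef = not capturing)
my @gene_names;    # result structure, also built by hand

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ( $expat, $name, %attr ) = @_;
            push @path, $name;
            # Start capturing only for <gene><name type="primary">
            $buffer = ''
                if $name eq 'name'
                && @path >= 2
                && $path[-2] eq 'gene'
                && ( $attr{type} // '' ) eq 'primary';
        },
        Char => sub {
            my ( $expat, $text ) = @_;
            $buffer .= $text if defined $buffer;
        },
        End => sub {
            my ( $expat, $name ) = @_;
            if ( defined $buffer && $name eq 'name' ) {
                push @gene_names, $buffer;
                undef $buffer;
            }
            pop @path;
        },
    },
);
$parser->parse($xml);
print join( ',', @gene_names ), "\n";
```

Memory stays flat because nothing is kept except what the handlers store, but every piece of context (the element stack, the text buffer) is the developer's responsibility.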
  4-7. XML::Twig - a hybrid of DOM and SAX (slides 4 to 7 build up one annotated listing)

         sub main {
             ...
             my $twig = XML::Twig->new(
                 # Elements to process; a handler will be called for each
                 twig_handlers => { entry => \&entry },
                 # Elements to ignore; these are not loaded
                 ignore_elts => { reference => 'discard', dbReference => 'discard' },
                 do_not_chain_handlers => 1
             );
             $twig->parsefile($data_file);
             # Clean up memory when done
             $twig->purge;
         }
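The setup above can be tried on a small in-memory document. This is a sketch under stated assumptions: the sample XML, the `<records>` root, and the `<id>` element are hypothetical, and an inline handler stands in for the slides' `entry` sub. It records which child elements each `<entry>` still has when the handler runs, to show the effect of `ignore_elts`.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data, invented for this sketch.
my $xml = <<'XML';
<records>
  <entry><id>1</id><reference>r1</reference></entry>
  <entry><id>2</id><dbReference>x</dbReference></entry>
</records>
XML

my @seen;    # child tags visible to the handler, one string per <entry>

my $twig = XML::Twig->new(
    twig_handlers => {
        entry => sub {
            my ( $t, $entry ) = @_;
            # Ignored elements are never added to the tree
            push @seen, join '+', map { $_->tag } $entry->children;
            $t->purge;    # release everything parsed so far
        },
    },
    ignore_elts => { reference => 'discard', dbReference => 'discard' },
);
$twig->parse($xml);
print "@seen\n";
```

Calling `purge` inside the handler, rather than only once after parsing, is what keeps memory bounded on a large file: each `<entry>` is dropped as soon as it has been processed.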
  8-11. The entry handler (slides 8 to 11 build up one annotated listing)

         sub entry {
             my ( $twig, $entry ) = @_;

             # We can use XPath to find stuff
             my ($organism_name) =
                 $entry->get_xpath('organism/name[@type = "scientific"]');
             return unless ( $organism_name
                 && $organism_name->trimmed_text() eq 'homo sapiens' );

             my ($gene) = $entry->get_xpath('gene');
             return unless ( defined($gene) );

             my ($gene_name) = $gene->get_xpath('name[@type = "primary"]');
             return unless ( defined($gene_name) );

             # Methods to get element data
             $gene_name = $gene_name->trimmed_text();
             my @synonyms = map { $_->trimmed_text() }
                 $gene->get_xpath('name[@type = "synonym"]');
             ...
             # Clean up memory when done
             $entry->purge;
             return 0;
         }
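A self-contained variant of this handler, run against made-up data, might look like the sketch below. The sample records, the `<records>` root, the gene names, and the `%synonyms_for` hash are all hypothetical. Unlike the slide code, this sketch purges through the twig object on every path, which is a choice made for the example and not something the slides prescribe.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample records, invented for this sketch.
my $xml = <<'XML';
<records>
  <entry>
    <organism><name type="scientific">homo sapiens</name></organism>
    <gene>
      <name type="primary">GENE1</name>
      <name type="synonym">G1</name>
      <name type="synonym">G-ONE</name>
    </gene>
  </entry>
  <entry>
    <organism><name type="scientific">mus musculus</name></organism>
    <gene><name type="primary">GENE2</name></gene>
  </entry>
</records>
XML

my %synonyms_for;    # primary gene name => array ref of synonyms

sub entry {
    my ( $twig, $entry ) = @_;

    my ($organism) = $entry->get_xpath('organism/name[@type = "scientific"]');
    if ( $organism && $organism->trimmed_text eq 'homo sapiens' ) {
        my ($primary) = $entry->get_xpath('gene/name[@type = "primary"]');
        if ($primary) {
            my @synonyms = map { $_->trimmed_text }
                $entry->get_xpath('gene/name[@type = "synonym"]');
            $synonyms_for{ $primary->trimmed_text } = \@synonyms;
        }
    }

    $twig->purge;    # free the entry whether or not it matched
    return;
}

XML::Twig->new( twig_handlers => { entry => \&entry } )->parse($xml);

for my $gene ( sort keys %synonyms_for ) {
    print "$gene: @{ $synonyms_for{$gene} }\n";
}
```

Only the first record passes the organism filter, so `%synonyms_for` ends up with a single key.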
  12. Assessment
     • Advantages
       – Only elements of interest are loaded when needed
       – Subset of XPath for navigation, among other methods
       – Can both read and rewrite XML
       – Code simpler than SAX, without DOM memory death
     • Disadvantages
       – XML::Twig is... slooooooooow-ish
         • Compared to SAX and DOM
         • Both of which primarily use C
