
Semi-automated Exploration and Extraction of Data in Scientific Tables


Ron Daniel and Corey Harper of Elsevier Labs present at the Columbia University Data Science Institute.


  1. Sept. 2018. Ron Daniel Jr., Corey Harper, and Tony Scerri. Semi-automated Exploration and Extraction of Data in Scientific Tables
  2. • Something Cool • The Goal • The Problem • Explorations − Preparations − Header Groupings and Value Typing − Table Typing • Next Steps
  3. Something Cool
  4. • The goal of the NeuroElectro Project is to extract information about the electrophysiological properties (e.g. resting membrane potentials and membrane time constants) of diverse neuron types from the existing literature and place it into a centralized database. • Our goal is to facilitate the discovery of neuron-to-neuron relationships and better understand the role of functional diversity across neuron types.
  5. Curated Data • Articles were analyzed • Neurons grouped into types • Values found for ~50 properties • Additional metadata with each measurement – article info and curation status • Data download and error reporting methods provided
  6. Additional Metadata • See the origin of each data point • Also see curation status (automatic or manual)
  7. Properties • ~50 electrophysiological properties − Definitions − Association to neuron types − List of containing articles − … • > 200 neuron types − Different numbers of properties for each
  8. Clustering • Plot based on first 2 principal components for all properties − Covers ~70% of variance • Plot is interactive – hovering over a circle brings it to the front and hides what is behind • Clicking on a circle brings up data for that neuron type
  9. Neuron Type Data • See data on many properties for a chosen neuron type • All of this is available on GitHub • Note that the most recent updates are years old − How do we keep things going when the student leaves?
  10. A Few Questions… • Are some of these differences due to medications? To illnesses? • Can we curate these automatic values? • This seems under-studied. Can we confirm this measurement? What's a good protocol? • Are there other kinds of this neuron to study (mouse, cat, human, …)? • Why such a range of values? Are some errors? Are some medicated? Can we segment? • What is special for this max reading? For this min reading? • Why are some of these neurons studied so much more?
  11. The Goal
  12. The Goal • Help thousands of sub-subdisciplines create such databases − For perspective over the field − For identifying quality results vs. questionable ones − For improvements in methodology − To reduce undesirable duplication of work − To identify areas for new work − To provide a place to monitor new results in your favorite sub-subdisciplines
  13. The Essential Points • How do we pick the data sources? • What data is needed by practitioners in the sub-subdiscipline? • How do we extract the data and organize it? • How do we analyze the data? • What do we do with the results of the analysis?
  14. A Brief Digression: Where is the Data and How is it Commonly Extracted?
  15. Digression: Knowledge Base Construction • Standard Practice − Find named entities (typically we have a big list of important entities) − Extract relations between the entities (typically we have a list of these too) o One frequent limitation is to only look at one sentence at a time − Insert <entity1, relation, entity2> triples into knowledge base − Fancier NLP (e.g. co-reference, smarter parsing, universal schemas) can find more entities and relations, and/or build more complex 'events' • Questions − How much of the information in an article is in the text, vs. the figures and the tables? − How much of the supplemental data with an article is text, figure, or table? − How much data in existing databases is text, figure, or table? Applying Universal Schemas for Domain Specific Ontology Expansion. AKBC Workshop 2016. Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, and Ron Daniel
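The standard practice on this slide can be sketched as a toy pipeline. This is a minimal illustration, not Elsevier's implementation: the entity list, relation cues, and example sentence are all invented, and a real system would use NER, parsing, and far larger gazetteers.

```python
import itertools

# Toy gazetteers (invented examples; real lists are much larger)
ENTITIES = {"pyramidal neuron", "hippocampus"}
RELATION_CUES = {"located in": ["located in", "found in"]}

def extract_triples(sentence):
    """One-sentence-at-a-time extraction: find known entities, then link
    ordered pairs whose surface order matches a relation cue between them."""
    s = sentence.lower()
    found = [e for e in ENTITIES if e in s]
    triples = []
    for e1, e2 in itertools.permutations(found, 2):
        for relation, cues in RELATION_CUES.items():
            if any(s.find(e1) < s.find(cue) < s.find(e2)
                   for cue in cues if cue in s):
                triples.append((e1, relation, e2))
    return triples
```

Each resulting <entity1, relation, entity2> triple would then be inserted into the knowledge base; the single-sentence window is exactly the frequent limitation the slide mentions.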
  16. Digression: Data Locations in Articles • Fractional area measures use total area of paper as the denominator. • How much of the article area is devoted to results, vs. intro, bibliography, …?
  17. The Problem: Tables are a Mess
  18. The Biggest Problem • Any aspect of tables can, and will, be done in more ways than you can possibly imagine. − Names of 'the same kind' of column − Formats of 'the same kind' of data − Sets of headings − Layouts of tables − … • Each aspect has a long-tailed frequency distribution. • Multiply these distributions, get a super long tail. [Chart: Frequency of Variations in Heading "Latitude" in Ocean Plastics Tables]
  19. Heading Name Variability [Chart: Frequency of 'Latitude' Heading Variations in Tables of Oil & Gas Data]
  20. Value Format Variability
  21. Heading Name Confusion • Assume two tables have columns with the same heading. • Safe to assume they mean the same thing? − Depth – below water surface − Depth – below earth surface − Depth – below water + below sea floor − P – Probability − P – Phosphorus Concentration − P – Precision o Values as decimals or percentages? − …
  22. Heading Groups Variability • If you select a set of tables, very few will have exactly the same groups of headings • If you select a set of tables, very few will have identical structure − Multi-level headings − Row v. Column headings − Different variables of interest − …
  23. Table Structure Variability and Complexity
  24. One Bite at a Time? Doesn't Scale • What does all the variability mean? − The code that maps any one table to the target schema will map very few others • Are we doomed to write many one-off mappers, each covering only 1–5 tables? • No. The core of the solution has already been shown: − 'Kinds' of headings (e.g. latitude) − 'Kinds' of values (e.g. DMS, DM., D., …) − Families of co-occurring headings − Substructure in tables • Only need to map a few parts of the tables to the target schema • We already have a bunch of synonyms for 'Latitude'. Just need to maintain such a list for the fields in the target schemas. Those fields are usually popular, so much of the work is already done.
  25. Explorations: Preparations
  26. Early Wireframes • Key Ideas − Articles − Tables − Headings − Values • Make lists of each − Save, or − Update & save • Use each list as input to the next step • Iterative improvement
  27. Running Examples • Recreating NeuroElectro • Extending GeoFacets - Oil & Gas Data • Extending LITTERBASE - Ocean Plastics • Start a Coreference Resolution DB
  28. 1: Select Articles (Rough cut) • Recreate NeuroElectro: the original NeuroElectro provided a list of PubMed IDs for the articles they used; import the list and select those published by Elsevier (150 articles) • Oil & Gas Wells: Elsevier-published articles listed in the GeoFacets product (52,700 articles) • Ocean Plastics (a): (ocean OR sea) AND (plastic OR microplastic OR debris OR litter) (87,360 articles; very many false positives) • Ocean Plastics (b): filter (a) by subject codes for "Environmental Studies" or "Oceanography", then by tables with a caption containing "plastic" (449 articles) • Coreference Resolution: (coreference OR "co-reference" OR anaphora OR anaphoric OR cataphoric) AND (precision OR recall OR f1 OR F OR "F-1" OR accuracy OR "F-measure") (3,651 articles; very many false positives)
  29. Content Analytics Toolbench (CAT) • Spark-enabled infrastructure for processing content − ScienceDirect, Scopus, Patents, other corpora stored on cloud in large files − Pre-computed analyses (sentence breaks, POS tags, parses, Open Extractions, …) also stored same way • CAT provides tools to easily make new data subsets from the old • One tool is a search form to enter queries and get result sets. − Those results (or just the document IDs) can be saved as a corpus for further work. Content Analytics Toolbench (CAT): a flexible single point of access for content enhancement and data analytics across massive corpora. Ron Daniel and Michael Lauruhn
  30. 2: Decide Target Schema • Schemas for Neuroelectro, GeoFacets, and LITTERBASE already exist • Coreference resolution will be a new database − Precision, Recall, and F1 are most common measures − Article title, publication date, … − Algorithm being tested − Spoiler: Iterative analysis showed we will also want: o Table ID o Benchmark corpus being used o Other measures (B3, BLEU, …) o Training/Validation/Test split o New result or previous result provided for comparison [Chart: Common Measures in Information Science]
  31. 3: Extract Tables from Articles • Surprise! It's already done! • All articles in ScienceDirect are XML − <table> elements extracted once from every article − Saved for anyone to use − Just filter by the list of document IDs to get the ones you want − In addition to the table XML, fields available include: o docID, docTitle, journalName, articleType, pubDate, … o tableID, tableCaption(s), #rows, #cols, #groups, … val docIDs = … val tableDataset = … val myTables = tableDataset.join(docIDs, Seq("docID"), "leftsemi").cache
  32. Explorations: Header Groupings and Value Typing
  33. Co-occurring Headers • Headers naturally occur in sets: − Latitude? expect Longitude − Std. Dev.? expect N, Mean (or Med.) − He, Ne, Ar? expect Xe • Abbreviations occur in similar sets − Lat. -> Long. − lat -> lon − Lat(N) -> Long(E) • Ambiguous headings can be distinguished by the groups they are in − N, Min, Max, Mean, SD -> Sample Size − N, O, P, S, C, H, Fe, Si -> Nitrogen
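One way to exploit these groups: score each candidate sense of an ambiguous heading by how many of its expected companions actually appear in the same table. A minimal sketch with invented sense groups (the real groupings would come from the co-occurrence clustering described on the next slide):

```python
# Hypothetical sense groups: for each (heading, sense), the headings
# that typically co-occur with that sense
SENSE_GROUPS = {
    ("N", "sample size"): {"min", "max", "mean", "sd"},
    ("N", "nitrogen"): {"o", "p", "s", "c", "h", "fe", "si"},
}

def disambiguate(heading, other_headings):
    """Return the sense whose expected companions overlap most with the
    other headings actually present in the same table."""
    present = {h.lower() for h in other_headings}
    best_sense, best_score = None, -1
    for (head, sense), companions in SENSE_GROUPS.items():
        score = len(companions & present)
        if head == heading and score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```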
  34. Clustering Co-occurring Headers • Multiple visualizations and clustering methods were tried. • Most successful combination: − D3 "Hierarchical Edge Bundling" − Label Propagation over the Line Graph
  35. Recreating NeuroElectro • Neuron types and property names from the official lists were used • Early plots were both encouraging and discouraging − Rough shape not too far off − Much less data o Only Elsevier articles, so not surprising o Many neuron types were discovered to be classes rather than specific types, so things were missed. • Key Lesson: A high degree of manual curation had been applied by the NeuroElectro creators [Figure: Early Trial Plot of Recreated NeuroElectro]
  36. Typed patterns of values (Latitudes)
  37. Value Normalization (Latitude)
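As a sketch of the value typing and normalization steps, the three latitude shapes mentioned earlier (DMS, decimal minutes, decimal degrees) can be matched and converted to signed decimal degrees. These regexes are illustrative only and far from exhaustive; the real patterns would come from the curated pattern lists.

```python
import re

# Three common latitude shapes (illustrative, not exhaustive):
DMS = re.compile(r"(\d+)°\s*(\d+)'\s*(\d+(?:\.\d+)?)\"?\s*([NS])", re.I)  # 41°24'12" N
DM = re.compile(r"(\d+)°\s*(\d+(?:\.\d+)?)'?\s*([NS])", re.I)             # 36°30.5' S
D = re.compile(r"(-?\d+(?:\.\d+)?)°?\s*([NS]?)", re.I)                    # -12.75

def latitude_to_decimal(raw):
    """Normalize a raw latitude string to signed decimal degrees (south negative)."""
    raw = raw.strip()
    m = DMS.search(raw)
    if m:
        deg = int(m.group(1)) + int(m.group(2)) / 60 + float(m.group(3)) / 3600
        return -deg if m.group(4).upper() == "S" else deg
    m = DM.search(raw)
    if m:
        deg = int(m.group(1)) + float(m.group(2)) / 60
        return -deg if m.group(3).upper() == "S" else deg
    m = D.search(raw)
    if m:
        deg = float(m.group(1))
        return -deg if m.group(2).upper() == "S" else deg
    return None  # unrecognized format: flag for curation
```

Trying the most specific pattern first avoids a decimal-minutes value being half-matched by the plain-degrees pattern; anything left unmatched is a curation candidate rather than a silent error.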
  38. Ocean Plastics Early Demo • Finding the synonyms in the various forms of 'Latitude' and 'Longitude' • Normalizing values to one format and standard orientation • This is enough to plot data on the same axes and allow people to start investigating • Additional work was done to extract the items/km² value at each location, and to provide metadata about the article and table
  39. Merge & map: 529 points, 10 tables, 4 articles
  40. Explorations: Table Typing
  41. Interactive table parser • Chrome extension − Works on HTML tables • View values and headers in a table − Map these to the patterns we use at scale − Save patterns, properties, units, and table addresses to a backend graph • Next step is to use that graph of patterns to further refine the batch processing & extraction routines
  42. Map to a datatype [Screenshot: Ocean Plastics Tables Data, September 17, 2018]
  43. Oil & Gas Table Merging • Mapping code from Ocean Plastics was used as a starting point • More effort on normalization of the headings and values: − Processed latitudes and longitudes − TOC values − Depth values (in water or on land) • Merged those dataframes • Produces a largely normalized view of 844 data points from 21 tables in 20 articles
  44. Which we can map along with metadata
  45. Inferring Table Structure • Tables are 'exploded' – a row is created for each cell − DocID, TableID, Row#, Col#, Value, … • Headers for each cell added − Row and col headers − Can be very difficult due to colspans, table groups, etc. • Values matched against patterns to find types • Exploded table provides the information used in further processes
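The 'exploding' step might look like the following sketch for a simple grid table. The colspans, row groups, and multi-level headers that the slide flags as the hard part are deliberately ignored here; the field names mirror those listed on the slide.

```python
def explode_table(doc_id, table_id, col_headers, rows):
    """Emit one record per data cell (DocID, TableID, Row#, Col#, Value)
    plus the row and column headers attached to that cell."""
    records = []
    for r, row in enumerate(rows):
        row_header = row[0]  # assume the first column holds row headers
        for c, value in enumerate(row[1:], start=1):
            records.append({
                "docID": doc_id, "tableID": table_id,
                "row": r, "col": c,
                "rowHeader": row_header, "colHeader": col_headers[c],
                "value": value,  # kept as a string until normalization
            })
    return records
```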
  46. Units, Measures, and Normalization • Data in the exploded table grows in multiple analysis passes − e.g. Detecting units and measures abbreviations − May appear in a cell, a heading, a caption, the text, or nowhere. − Percentages are similar • Original values are treated as strings. − Must convert to normalized values for sensible plots − Plots can make normalization errors and omissions very obvious: "Hey! Aren't these all supposed to be <= 1.0?"
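A sketch of one such analysis pass, under assumed heuristics rather than the project's actual rules: treat a value as a percentage if a '%' sign appears in the cell or its heading, or if the bare number exceeds 1.0. Failures of the second heuristic are exactly what a plot like the one the slide quotes makes obvious.

```python
def normalize_fraction(raw, heading=""):
    """Convert a raw cell string to a fraction in [0, 1].
    Original values stay strings upstream; conversion happens here.
    The '> 1.0 means percent' rule is a heuristic, not a guarantee."""
    text = raw.strip()
    is_percent = "%" in text or "%" in heading
    value = float(text.rstrip("%").strip())
    if is_percent or value > 1.0:
        value /= 100.0
    return value
```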
  47. Early Plots for Coreference • F-1 values were extracted from a small number of papers and plotted vs. date − The hope was to see scores rising over time − That is not what was seen • Coreference testing is complex − Multiple test collections − Multiple specialized measurements − Many baseline results from earlier systems • Now separating the data by these different factors and trying new plots
  48. Next Steps
  49. Future Work • Integrate components into web app − More efficient curation of data going into a target database − Manage the many different lists used in the curation projects • Share pattern memory − Target types, Synonyms, Patterns, Heading groups, etc. − Want to build on each other's work • Machine Learning of uses of Patterns and Types • Do it again with data from figures! [Wireframe: table mapping target headings such as Sample Size, Mean, Median, Standard Deviation, Latitude, Longitude, and Depth (earth) to synonyms and patterns]
  50. Iterative Process • Early work builds up − Queries, Synonym lists, Heading groups, Value patterns, Data types, Mapping rules, Analysis methods, and Curated Data − Machine Learning methods will be used to apply them on new articles − ML methods will also replace some things, such as hard-coded variant normalization • A long way to go, but we are coming to the end of the beginning [Diagram: pipeline from Queries through Articles, Tables, Cells with Properties, and Mapping Rules to a Database, Visualization, and Analysis, with curated lists (queries, docIDs, tableIDs, headings & types, rules, schemas), suggestions from pattern memory, and user input feeding each stage]
  51. Thank you • Ron Daniel Jr. – @thingsmeta • Corey A. Harper – @chrpr • Tony Scerri • Elsevier Labs – @ElsevierLabs