Structured data and metadata evaluation methodology for organizations looking to improve image findability on the web emily kolvitz_2014

Structured Data and Metadata Evaluation Methodology for
Organizations Looking to Improve Image Findability on the Web
School of Library and Information Studies
LIS 5733 Taught by: Dr. Susan Burke
Research Proposal Written by: Emily Kolvitz
Research Setting: Primarily Geared Towards Online Ecommerce/Business Organizations, but methodology could
easily translate to Galleries, Museums, Archives, Libraries (GLAMs) or any institution looking to evaluate their
structured data and metadata practices on the world wide web in an effort to improve findability of product offerings,
general information or services.

Introduction
The current state of findability on the web for many organizations is incipient. Search
Engine Optimization (SEO) techniques change frequently and remain largely a mystery
to many companies. The one variable in the equation of web findability that remains a
staple is good-quality metadata under the hood of the website.
This research methodology will allow for:
● An assessment of findability maturity on the web from an image-centric viewpoint
● Improved findability on the web, by establishing a baseline for where the
organization stands in terms of structured data content and by visualizing gaps or
areas for improvement from a search-engine-neutral perspective

Introduction
● Most Searches Start with Google now (Holman 2011) (Lippincott 2013)
● Search Algorithms Shape what is most Easily Accessible (Connaway, Dickey &
Radford 2011) and they are subject to frequent change (Kritzinger 2013)
● Search Algorithms Look for Your Structured Data and, in the future, possibly
your embedded metadata (Cazier 2014) (Beall 2010)

Literature Review
Marshall Breeding (2013) assesses the limitations of the major search engine algorithms:
“But even with the most sophisticated relevancy
algorithms, index-based search and retrieval lacks the
ability to lead users to the potential related content.
Semantic web technologies, in conjunction with
repositories of open linked data, promise to deliver
significant new capabilities in exploring and exploiting
information resources on the web.”

Literature Review
● Semantic web is founded on good, high-quality
structured data
● Future technologies could potentially utilize
embedded metadata in search (Cazier 2014)
(Beall 2010), but embedded metadata already
has authenticity, provenance and
“breadcrumbs” value now (Reicks 2013)

Literature Review
● Most users don’t go past the first page of
search results (Paz 2013)
● Structured Data Practices can help your
organization stay relevant (and findable!) in
the age of information overload
● Keeping it Search Engine Neutral is
advisable (Paz 2013)

Topic/Proposed Research
● Methodology for establishing a baseline or benchmark of where an organization
stands in terms of structured data pertaining to image records, which ultimately helps
findability on the web
● By utilizing the proposed methodology for gathering this data for an organization,
data-informed decisions can be made about structured data strategy going forward to
maintain relevancy on the web
● Many structured data elements can affect online findability: file-naming
standards, the presence of alt text in HTML markup, the HTML markup itself, embedded
metadata, schema.org markup and rich snippets, text descriptions at or near images,
and more. IEEE Xplore, for example, offers both metadata search and full-text search
(see next slide)

Full Text Search & Metadata Search

Topic/Proposed Research
● It is also noteworthy that there are additional factors that affect findability on
the web that do not involve structured data, but this research focuses solely on
structured data techniques within the control of individual organizations.
● All of these structured data techniques pertaining to image records will be
evaluated in conjunction with the relevancy of onsite and offsite search results.
● Image search and information retrieval is a more difficult area than text search
and retrieval because accessibility to the image content is largely dependent on
side-car text (or metadata if you will) that describes the aboutness and
(hopefully) the context for the image record.

Questions
Research Questions Addressed in this Study
1. What methods of search are available on the organization’s website?
2. What is the file-naming structure for images on the website?
3. What is the quality of search engine (onsite and offsite) results?
4. What kinds of search results appear in Image Search when searching by the
organization’s name and product description both with onsite search and offsite
search?

Questions
5. What kinds of search results appear in Google Image Search when searching
by images taken from the organization’s website?
6. What kinds of search results come up when looking for specific products
(measure of structured data) through onsite search and offsite search?
7. What are the results when looking for specific products on the offsite search
engine?

Questions
8. What kinds of structured data are near or around the images on the organization’s
website? Alt Text? Other?
9. What file types appear on the organization’s website? (JPEG? TIFF? PNG?)
10. What embedded metadata is available in images on the website?
11. What does the XMP/XML/RDF for these images look like and how robust is it?
What does the graph look like?

Variables
● Quality and number of alt text tags
● Type of page the image was on
● Level of description for the filename
● Quality and number of structured data tags pertaining to the images
● The image file naming convention/filename
● Quality and number of embedded metadata tags
● Quality and number of search results for onsite search utilizing filename or alt text
● Quality and number of relevant search results utilizing offsite image search
These measures are operationalized through Likert scales applied by the human researcher. For
example, when rating the level of description for the filename, a researcher could conclude that the
filename sp_18379847923.jpg is not a very descriptive filename for a human, let alone for a search engine
(unless, of course, this is a product SKU). The researcher would then assign it a low value for
descriptiveness on a 1-5 Likert scale.

Data Collection Methods
Participants
Participants will include a single institution, anonymized for the protection of their business. The sample of image records utilized
in this study will be limited to image assets appearing on the organization’s website domain. Most data collection can take place
from the organization’s website itself. Some procedures will take place on external sites, services, or programs.
Randomization of Sample
The sample of images utilized in this study can be randomized by extracting a site map of the particular organization of interest
using xsitemap.com. After the site map is constructed, the list of URLs should be entered into a spreadsheet program and a record
number should be assigned to each URL. From there, the researcher can use a randomizer program to select the order of pages to
utilize in the study (e.g. Research Randomizer, available at: http://www.randomizer.org/form.htm). This method will be utilized for
taking a random sample of pages from the organization of interest.
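
The randomization step above can also be sketched in a few lines of Python, as an alternative to a spreadsheet plus an online randomizer. This is only an illustrative sketch: the function name sample_pages, the seed value, and the example.com URLs used with it are assumptions and are not part of the proposal.

```python
import random


def sample_pages(urls, sample_size, seed=2014):
    """Assign record numbers to site-map URLs and draw a random sample.

    Returns a list of (record_number, url) pairs in random order.
    """
    # Record numbers start at 1, mirroring rows in a spreadsheet.
    records = list(enumerate(urls, start=1))
    # A fixed seed (an arbitrary assumption here) makes the draw repeatable.
    rng = random.Random(seed)
    # Sampling is without replacement, so no page is selected twice.
    return rng.sample(records, min(sample_size, len(records)))
```

Calling sample_pages(site_map_urls, 25) on the URL list exported from the site map would return 25 numbered pages to visit, and rerunning it with the same seed reproduces the same selection.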
Consent
All data collected in this study are publicly and freely available on the web.

Data Collection Methods
Obtaining Data on the website
● Navigate to the URL
● Right-click the image(s) and choose “Save As”
● Right-click the page, choose “View Source” and
save the source as a .txt file
● Collect raw data from each image either by
opening it in Photoshop and navigating to the
Raw Data column or by using Phil Harvey’s
ExifTool
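
Once the page source has been saved, the images and their alt text can be inventoried programmatically instead of by reading the markup by eye. The following is a minimal sketch using Python's standard html.parser module; the names ImageTagCollector and collect_images are hypothetical, and the quality rating itself would still be applied by the researcher.

```python
from html.parser import HTMLParser


class ImageTagCollector(HTMLParser):
    """Collects the src and alt attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src") or "",
                # None means the alt attribute is missing (or has no value);
                # an empty string means it is present but empty.
                "alt": attributes.get("alt"),
            })


def collect_images(page_source):
    """Return a list of {"src", "alt"} dicts for the saved page source."""
    collector = ImageTagCollector()
    collector.feed(page_source)
    collector.close()
    return collector.images
```

Reading the saved .txt file and passing its contents to collect_images gives one row per image, which can be pasted into the findings spreadsheet alongside the filename rating.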
Obtaining Data through Structured Data Linter
● Navigate to the Linter website
● Enter URL
● Screenshot Structured Data Results -or- save
as webpage
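
As a rough cross-check on the Linter output, the saved page source can also be tallied for common structured data markers. The sketch below counts HTML microdata attributes (itemscope, itemtype, itemprop) and JSON-LD script blocks; it is not how the Structured Data Linter works, it does not validate anything, and the names StructuredDataCounter and count_structured_data are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Microdata attributes commonly used with schema.org vocabularies.
MICRODATA_ATTRIBUTES = ("itemscope", "itemtype", "itemprop")


class StructuredDataCounter(HTMLParser):
    """Tallies microdata attributes and JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        for name in MICRODATA_ATTRIBUTES:
            if name in attributes:
                self.counts[name] += 1
        script_type = (attributes.get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.counts["json-ld"] += 1


def count_structured_data(page_source):
    """Return a dict of marker name -> number of occurrences."""
    counter = StructuredDataCounter()
    counter.feed(page_source)
    counter.close()
    return dict(counter.counts)
```

The counts only indicate how much markup is present on a page; judging its quality remains a researcher rating on the Likert scale.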
Obtaining Data through W3C RDF validator
● Copy raw data xml extracted earlier and input
into RDF Validator
● Select Graph Only on the Options
● Parse RDF
● Save Graph or Screenshot Graph
● Store in Folder with other Data
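
When neither Photoshop nor ExifTool is at hand, the raw XMP packet can often be located by scanning the image file's bytes for the x:xmpmeta element. The sketch below shows that byte-scanning approach; it is a simplification, not a replacement for ExifTool, and it assumes the packet is stored contiguously and encoded as UTF-8. The function name extract_xmp_packet is hypothetical.

```python
def extract_xmp_packet(image_bytes):
    """Return the first <x:xmpmeta>...</x:xmpmeta> block as text, or None.

    Assumes a contiguous, UTF-8 encoded XMP packet; files that split
    their XMP across several segments are not handled by this sketch.
    """
    start = image_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_marker = b"</x:xmpmeta>"
    end = image_bytes.find(end_marker, start)
    if end == -1:
        return None
    packet = image_bytes[start:end + len(end_marker)]
    return packet.decode("utf-8", errors="replace")
```

The image would be opened in binary mode and its contents passed to the function; a result of None simply means no XMP packet was found by this method, which is itself a data point for the embedded metadata question.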
Answer Research Questions
● Systematically go through the collected data
and input findings into spreadsheet

Data Analysis Methods
● Descriptive Statistics
o Bell Curve - measures of central tendency
using Likert scale data
Bell Curve Image By Vierge Marie (Own work) [Public domain], via Wikimedia Commons
http://upload.wikimedia.org/wikipedia/commons/f/f6/Gaussian_Filter.svg
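
The descriptive statistics for one variable's Likert ratings could be computed along the following lines. This is a sketch under stated assumptions: ratings are integers from 1 to 5, and the function name summarize_likert and the choice of summary values are illustrative rather than prescribed by the proposal.

```python
import statistics
from collections import Counter


def summarize_likert(ratings):
    """Summarize a list of 1-5 Likert ratings for a single variable."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(rating not in range(1, 6) for rating in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    counts = Counter(ratings)
    return {
        "n": len(ratings),
        "mean": statistics.mean(ratings),
        "median": statistics.median(ratings),
        # multimode returns every most-frequent value, so ties are kept.
        "mode": statistics.multimode(ratings),
        # Sample standard deviation needs at least two ratings.
        "stdev": statistics.stdev(ratings) if len(ratings) > 1 else 0.0,
        # Frequency of each scale point, including points never used.
        "distribution": {point: counts.get(point, 0) for point in range(1, 6)},
    }
```

The "distribution" entry is what a bar chart of the ratings would plot, which makes it easy to see whether the scores cluster around a central value or pile up at the low end of the scale.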

Data Analysis Methods
● Graphical Analysis
(Charts and Graphs)
● Summary Report
● Discussion of Findings

Visualizing the Results
The Structured Data Linter, utilizing URLs to display structured data around the images.
Available at: http://linter.structured-data.org/
Summary analysis will be crafted utilizing all of these data points to show what we are able
to understand about an image versus what a machine or search engine is able to know about an
image.
W3C RDF Validator Graph Visualization utilizing the raw data markup extracted from the image.
Available at: http://www.w3.org/RDF/Validator/

Structured Data Linter
Shows all structured data tags around the images and in the page markup

RDF Validator
Visualization of embedded data for images and their subsequent relationships to other data

Summary Report
Complete Picture of Structured Data, Metadata and Analysis of Study

Expected Outcomes
The anticipated results of this project include a benchmark for where this specific
organization is at in terms of structured data in the online environment and a
methodology for other organizations looking to assess their structured data maturity in
the digital space. These results will be used to create a roadmap for improving resource
findability both on the web and within websites. Other organizations may also aspire to
reuse this methodology for assessing their own current state of structured data. Future
areas of research could include utilizing metadata/RDF-driven search engines in
conjunction with Vector Space Models to assess findability of image records on the
web and within websites.

References (Slides & Full Paper)
Algebraix Data, Corporation. 0005. "Algebraix Data Launches Industry’s First Cost-Effective Automated Implementation
of Schema.org." Business Wire (English), 5.
Beall, Jeffrey. 2010. "How Google Uses Metadata to Improve Search Results." Serials Librarian 59, no. 1: 40-53.
Breeding, Marshall. 2013. "Linked Data: The Next Big Wave or Another Tech Fad?." Computers In Libraries 33, no. 3:
20-22.
Cafarella, M.J., Halevy, A.Y., Zhang, Y., Wang, D.Z., and Wu, E. Uncovering the relational Web. In Proceedings of the
11th International Workshop on the Web and Databases (Vancouver, B.C., June 13, 2008).
http://web.eecs.umich.edu/~michjc/papers/webtables_webdb08.pdf
Connaway, Lynn Silipigni, Timothy J. Dickey, and Marie L. Radford. 2011. "“If it is too inconvenient I'm not going after it:”
Convenience as a critical factor in information-seeking behaviors." Library & Information Science Research (07408188) 33, no. 3: 179-190.

Cazier, Clay, 2014. PM Digital Marketing Blog “The Future of Exif Image Data” Last accessed November 20, 2014.
http://www.pmdigital.com/blog/2014/04/future-exif-image-data/
Diagram Center: Digital Image and Graphic Resources for Accessible Materials , 2014. “Content Model” Last Accessed
November 23, 2014. http://diagramcenter.org/standards-and-practices/content-model.html
Google. 2014. “Image Publishing Guidelines” Last accessed November 21, 2014.
https://support.google.com/webmasters/answer/114016?hl=en
Holman, Lucy. 2011. "Millennial Students' Mental Models of Search: Implications for Academic Librarians and Database
Developers." Journal Of Academic Librarianship 37, no. 1: 19-27

International Business, Times. 0006. "Bing, Google and Yahoo merge to make search easier with schema.org."
International Business Times, April.
IPTC International Press Telecommunications Council, 2014. “Embedded Metadata Manifesto” Last accessed November
20, 2014. http://www.embeddedmetadata.org/social-media-test-results.php (Embedded Metadata Manifesto 2014).
Kritzinger, W. T. "Search Engine Optimization and Pay-per-Click Marketing Strategies." Journal of Organizational
Computing and Electronic Commerce, no. 3 (2013): 273-86.
Lippincott, Joan K. “Net Generation Students and Libraries,” EDUCAUSE (2005), accessed November 19, 2014,
http://www.educause.edu/research-and-publications/books/educating-net-generation/net-generation-students-and-libraries

Nakanishi, T., "Semantic Context-Dependent Weighting for Vector Space Model," Semantic Computing (ICSC), 2014
IEEE International Conference on , vol., no., pp.262,266, 16-18. June 2014. doi: 10.1109/ICSC.2014.49
Paz, Anita. 2013. "In search of Meaning: The Written Word in the Age of Google." Italian Journal Of Library &
Information Science 4, no. 2: 255-266.
Priebe, T.; Schlager, C.; Pernul, G., "A search engine for RDF metadata," Database and Expert Systems Applications,
2004. Proceedings. 15th International Workshop on , vol., no., pp.168,172, 2004. doi: 10.1109/DEXA.2004.1333468
Reicks, David. 2010. “Why Embedded Metadata Won’t Help Your SEO,” Last Updated December 30, 2013. Last
Accessed November 23, 2014. http://www.controlledvocabulary.com/blog/embedded-metadata-wont-help-seo.html
