HTML	first	Journal	Workflow
Todd	Toler,	John	Wiley	&	Sons
For	NISO,	May	16,	2017
Why	HTML	first?
• HTML is the language of the web, continuing to evolve and expand
– More than just scholarly publishing
– Thriving ecosystem of tools and APIs to author, augment, support, and utilize
– Naturally incorporates the Semantic Web
– The notion of the journal article is changing rapidly – how do we keep up?
– Allows us to potentially reset for a digital age (industry standards get bloated over time; things don't get deprecated)
– Can be the route to a rich online/offline experience (for instance, when EPUB swallows HTML as a portable format)
• HTML-based peer review gives reviewers access to the complete research output, including source data and multimedia that are not well supported in PDF-based workflows
• Enables preservation of data collected before and during the publication process
– e.g., peer review comments are permanently linked to the content
– Can pass through linked data as researchers pipeline their source content (e.g., Jupyter notebooks)
• Simplifies downstream transformation
• Normalized content format simplifies transfer workflows
• Reduces/eliminates errors associated with major format transformations
Melville
• Melville	is	Wiley’s	internal	HTML	standard
– It	is	intended	to	be	a	superset	of	an	eventual	Scholarly	HTML	standard
– Follows	many	of	the	same	principles,	 but	focused	on	the	needs	of	content	
production
• Differences:
– Polyglot	HTML	for	XML	compatibility
– Tools	for	validation	using	established	XML	standards	(rather	than	buried	in	
proprietary	code)
– Favors	established	RDF	vocabularies over	schema.org where	appropriate
– Compatible	with	WileyML,	Wiley’s	existing	XML	standard
– Supports	 conversion	to	JATS	and	other	syndication	formats
• Trade-offs:
– Polyglot markup is needed to use the powerful validation tools available for XML, but it trades off some benefits of HTML (such as iframes and scripting elements)
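The polyglot point can be made concrete: if the HTML is serialized so that it is also well-formed XML (lowercase tags, quoted attributes, every element explicitly closed), then standard XML tooling — parsers, Schematron, XSLT — can validate and transform it directly. A minimal sketch using only Python's standard library; the fragment is an invented Melville-style example, not actual Wiley markup:

```python
import xml.etree.ElementTree as ET

# A hypothetical polyglot fragment: XHTML namespace, lowercase tags,
# quoted attributes, and every element explicitly closed -- exactly the
# properties that let a strict XML parser consume HTML unchanged.
fragment = """\
<article xmlns="http://www.w3.org/1999/xhtml">
  <h1 property="schema:name">An HTML-first Workflow</h1>
  <p>See figure <a href="#fig1">1</a>.<br /></p>
</article>"""

# fromstring() raises ParseError if the markup is not XML-clean,
# so this line *is* the "XML compatibility" check in miniature.
root = ET.fromstring(fragment)
title = root.find("{http://www.w3.org/1999/xhtml}h1").text
```

The same serialization remains valid HTML in a browser, which is the point of the trade-off: one document, two toolchains.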
Output	
• PDF and EPUB generation occurs directly from the Melville HTML
– We license technology from a vendor for this (Vivliostyle in Japan)
– Uses JavaScript enhancements to CSS Paged Media
– Handles math well (MathML + MathJax)
– Creates PDFs, EPUBs, or even paginated HTML
• Some additional meta-tagging is needed before the automation works seamlessly (e.g., image meta-tagging to drive the sizing of images)
• Journal-level standardization to a few templates was a precursor
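The pagination step above is driven by CSS Paged Media rules that a renderer such as Vivliostyle interprets. A minimal illustrative fragment — not Wiley's actual stylesheet; the dimensions and class names are invented:

```css
@page {
  size: 178mm 254mm;          /* trim size; any journal template value */
  margin: 20mm 18mm;
  @top-center { content: string(journal-title); }  /* running head */
  @bottom-center { content: counter(page); }       /* page number */
}
h1.journal-title { string-set: journal-title content(); }
.figure img { max-width: 100%; }  /* image meta-tagging can refine sizing */
```

Rules like these are why journal-level template standardization had to come first: one stylesheet per template paginates every article that conforms to it.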
HTML	ASAP,	Melville	eventually
• Pre-acceptance
– Automated	conversion	to	“dirty”	HTML
– Author	review	&	validation	adds	additional	
structure
– Additional	enrichment	adds	value	that	is	
preserved	throughout	the	workflow
• Post-acceptance
– Final	Melville	prep
– Author's final review & validation (proofing)
Automated	“dirty”	HTML
• We use Aspose libraries to create a basic HTML model of the MS Word document
– Tried the open-source Apache POI, but Aspose better meets our needs
– LaTeX conversions are more straightforward because the documents already have inherent structure; authors use LaTeX templates that we supply
• Then we attempt to interpret semantic structure within the document to tag key elements (title, abstract, authors, figure sets, references, etc.), primarily using heuristics (a complex set of rules). We use the GROBID machine-learning library to parse references.
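The heuristics pass can be pictured as a rule cascade over the converted paragraphs. A deliberately toy sketch — the production rule set is far larger, the role names are invented, and reference strings would be handed to GROBID for fine-grained parsing rather than tagged wholesale:

```python
import re

# Toy heuristic tagger: given plain paragraphs from a converted manuscript,
# guess a coarse role for each one using positional and textual rules.
def tag_paragraphs(paragraphs):
    tagged = []
    in_refs = False
    for i, p in enumerate(paragraphs):
        text = p.strip()
        if i == 0:
            role = "title"                       # first block is usually the title
        elif re.match(r"(?i)^abstract\b", text):
            role = "abstract"
        elif re.match(r"(?i)^references?\s*$", text):
            role = "ref-heading"
            in_refs = True                       # everything after is a reference
        elif in_refs:
            role = "reference"                   # GROBID would parse these further
        else:
            role = "body"
        tagged.append((role, text))
    return tagged
```

Rules like these are brittle — which is exactly the limitation the next slide's computer-vision and CRF experiments are meant to address.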
“Cleaner”	HTML	in	the	future
• The	heuristics	approach	is	
limited	and	we’re	already	
experimenting	with	more	
advanced	methods:
– OpenCV (open-source) library for computer vision. With this, we're working to interpret some structure based on visual features (contours; see image)
– TensorFlow (open-source) Recurrent Neural Network (RNN) tooling and Conditional Random Field (CRF) algorithms for structure identification and entity extraction
Author	review	&	validation	
• The	Author	is	asked	to	review	
and	validate	the	structured	
items
• The	Author	has	simple	tools	to	
correct	any	errors	or	provide	
structure	in	areas	where	the	
algorithm	failed
– Algorithm	quality	improves	over	
time	based	on	Author	feedback
Additional	enrichment
• Author	is	encouraged	to	enrich	
their	submission	with	additional	
detail:
– Source	data
– Computer	code
– Other	external	links/references
– Multimedia
• Staff	validate	(or	fix)	technical	
aspects	of	the	conversion	
including	figure	sets,	tables,	
references,	etc.
• Additional	enrichment	is	
preserved	within	the	HTML	
package	as	it	moves	through	the	
review/decision	workflow
Final	Melville	prep
• After	article	acceptance,	QA	step	fixes	any	
remaining	issues	(expected	to	be	rare)	and	
sends	“final”	Melville	on	to…
• Vendor	copyediting	stage	directly	in	Melville	
(new	tools	being	developed	for	this)
• All editing and peer-review activity is stored as annotations using web annotation standards
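The W3C Web Annotation Data Model gives a concrete shape for storing that feedback against the HTML. A sketch of one copyedit annotation — the URI and text are invented, not real Wiley identifiers:

```python
import json

# One hypothetical copyedit, in W3C Web Annotation (JSON-LD) form.
# The TextQuoteSelector anchors the note to a span of article text,
# which is what makes the annotation "actionable" rather than a loose comment.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "editing",
    "body": {
        "type": "TextualBody",
        "value": "Change 'utilise' to 'utilize' per journal style.",
        "format": "text/plain",
    },
    "target": {
        "source": "https://example.org/articles/12345",  # invented article URI
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "utilise",
        },
    },
}
print(json.dumps(annotation, indent=2))
```

Because the model is a standard, annotations created during copyediting, peer review, and author proofing can all live in one store and survive alongside the article.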
Author's final review/validation
• Author	proofs	an	HTML	(Melville)	
version	of	their	final	article
• No	direct	edits	permitted,	
revisions	are	noted	via	actionable	
annotations	(annotations	are	more	
than	just	comments	in	our	model)
• A	final,	technical	QA	step	follows	
author	proof	approval
Call	to	Action:	Scholarly	HTML
– Scholarly	HTML	community	group was	formed	in	
2016	to	explore	turning	this	into	a	standard;	
chaired	by	Robin	Berjon of	Standard	Analytics	
(Science.ai)
– Greater	participation	is	needed	to	make	this	real
– Need	strong	arguments	to	extend	HTML
– Need	to	extend	open	web	ontology,	schema.org
– If NISO got behind it, we could get built-in scale.
