February 23, 2018
Athens, Greece #eliatogether
Sharing Efforts to Get the Most from MT+PE
Luigi Muzii
sQuid
Introduction
© 2018 Luigi Muzii
Hello
My name is Luigi Muzii
In this industry since 1982
Working with MT since 1991
Working in telecom until 2002
Freelancing since 2002
University teacher until 2011
Business consultant since 2012
Outline
• Clearing the field
• Laying foundations
• Defining requirements
• Arranging the platform
• Running projects
Clearing the field
Target groups
Practical advice
Guiding principles
• To apply whether you are a freelancer, a project manager, or a translation buyer
Three scenarios
• In the making (freelancers)
• Downstream (customers)
• On constraints (LSPs)
Laying foundations
Devising strategies
Method?
SMT
Already in the past?
NMT?
Still SMT, but… a whole different kettle of fish
Not exactly child’s play
• Tools
• Data
• Knowledge
Foresight
• MT will proliferate
• Good translators will still be lacking
Join forces
Everyone’s needed, no one’s indispensable
3 tips for getting started
• Recap goals and expectations
• Check MT readiness
• Plan for assistance
Defining requirements
Simple and straightforward
Scope
Where available data is larger and quality is higher
Goals
• Reduce labor
• Boost productivity
• Keep consistency
Separate the wheat from the chaff
• Familiarize yourself with technology
• Strengthen your expertise
• Tackle security issues
• Scrub your data
• Plan for support
• Revise your pricing model
Building a platform
Selection, set-up, training, testing
Givens
• Not all engines are created equal
• Raw output can vary across systems, and language pairs
• Errors may not follow a consistent pattern
• Engine performances also vary
Set-up
• Data
  ◦ Maintenance
• Customized engine
  ◦ 100,000+ segments
• Tool settings
  ◦ Sub-segment recall
  ◦ Fuzzy match repair
Engine
• Total cost of ownership
• Integration
• Expertise
• Security
Running projects
Best practices
Dos
• Know your data
• Master quality metrics
• Devise a post-editing fee scheme
Don’ts
• Mess with data
• Rely only on yourself (DIY) or only on vendors
• Expect miracles
In any case, remember:
Tell the customer you are using MT
So you won’t get sued
The fuel
Output is only as good as the data used
Good (effective) data
• Few reliable sources
• Single domain
• Current data
• Same encoding
• No empty segments
• No errors
• Terminologically consistent segments
• Same style
• Same-length segments
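A few of the criteria for good data can be checked automatically before training. The following is a minimal sketch, not a reference pipeline: the function names, the length-ratio cut-off, and the reading of "same-length segments" as "source and target of comparable length" are assumptions made for illustration. Only the 100,000-segment figure comes from this deck.

```python
import unicodedata

MIN_SEGMENTS = 100_000   # figure quoted in this deck for a customized engine
MAX_LENGTH_RATIO = 3.0   # hypothetical cut-off; tune per language pair


def clean_pairs(pairs):
    """Keep only (source, target) pairs that pass basic hygiene checks."""
    kept = []
    for source, target in pairs:
        # One Unicode normalization form on both sides
        # (a stand-in for the "same encoding" criterion).
        source = unicodedata.normalize("NFC", source).strip()
        target = unicodedata.normalize("NFC", target).strip()
        # No empty segments.
        if not source or not target:
            continue
        # Assumed reading of "same-length segments": drop pairs whose
        # two sides differ wildly in length.
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > MAX_LENGTH_RATIO:
            continue
        kept.append((source, target))
    return kept


def enough_for_custom_engine(pairs):
    """True when the cleaned corpus reaches the quoted minimum size."""
    return len(pairs) >= MIN_SEGMENTS
```

Criteria such as domain, currency, terminology, and style cannot be verified this mechanically and still need human review.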
The output
Accept that output is unpredictable
Post-editing: expectations
• Fast
• Unchallenging
• Flowing
Post-editing: measures
• Edit time
  ◦ The time required to get a raw MT output to the desired standard
• Post-editing effort
  ◦ Percentage of edits to be applied to raw MT output to attain the desired standard
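Once a segment has been post-edited, the post-editing effort can be approximated after the fact as a word-level edit distance between the raw MT output and the final text. The sketch below is one possible way to compute such a percentage, not a method prescribed by this deck: whitespace tokenization and normalizing by the longer of the two segments are assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def post_editing_effort(raw_mt, post_edited):
    """Word-level edits as a percentage of the longer segment (0 to 100)."""
    raw_tokens = raw_mt.split()
    final_tokens = post_edited.split()
    longest = max(len(raw_tokens), len(final_tokens))
    if longest == 0:
        return 0.0  # two empty segments: nothing to edit
    return 100.0 * edit_distance(raw_tokens, final_tokens) / longest
```

For example, turning "the cat sit on mat" into "the cat sits on the mat" takes one substitution and one insertion over six words, so the function reports an effort of about 33%.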
Edit time
Can only be computed downstream
Post-editing effort
• Probabilistic forecasts
  ◦ Based on automatic metrics
• Depending on
  ◦ Post-editing level
  ◦ Volume
  ◦ Turn-around time
Post-editing levels
• Gisting
  ◦ Volatile content
  ◦ Automatic scripts to fix mechanical/recurring errors
• Light
  ◦ Continuous delivery
  ◦ Fixing capitalization and punctuation, replacing unknown words, removing redundant words, ignoring stylistic issues
• Full
  ◦ Publishing and engine training
  ◦ Fixing meaning distortion, fixing grammar and syntax, translating untranslated terms (possibly new terms), adjusting fluency
Dos
• Test before operating
• Ask for MT samples for negotiation
• Negotiate throughput rates
• Ask for glossary (with DNT words)
• Ask for instructions
• Be open to feedback
Don’ts
• Use MT to sustain price competition
• Process poor MT outputs
• Treat post-editing as fuzzy matches
Post-editing instructions
• Tool selection
• Environment setup
• General references
• Conventions
• Project details
• Pricing model
• Operating instructions
Pricing and compensation
• Upstream
  ◦ Clear-cut predictive scheme
  ◦ No fuzzy match scheme
    Fuzzy matches over 85% are inherently correct, while MT segments may contain errors and inaccuracies
• Downstream
  ◦ Measurement of actual work
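A downstream scheme needs some rule that turns the measured work into a fee. The sketch below shows one hypothetical rule, assuming the share of edited words has already been measured (for instance via edit distance): the from-scratch translation time is estimated from a throughput figure, and the hourly rate is applied to the edited share of that time. The function name, the default throughput of 350 words per hour, and the minimum share are illustrative assumptions, not figures from this deck.

```python
def downstream_fee(word_count, effort_percent, hourly_rate,
                   words_per_hour=350, minimum_share=0.25):
    """Hypothetical fee: hourly rate applied to the edited share of the job.

    word_count      -- words in the post-edited job
    effort_percent  -- measured share of edits, 0 to 100
    hourly_rate     -- agreed hourly rate, in any currency
    words_per_hour  -- assumed from-scratch translation throughput
    minimum_share   -- assumed floor, so reading unedited output is still paid
    """
    if word_count < 0 or hourly_rate < 0 or words_per_hour <= 0:
        raise ValueError("counts and rates must be non-negative, throughput positive")
    # Clamp the measured effort to the 0-100 range before using it.
    share = min(max(effort_percent, 0.0), 100.0) / 100.0
    share = max(share, minimum_share)
    equivalent_hours = word_count / words_per_hour
    return round(equivalent_hours * hourly_rate * share, 2)
```

With these assumptions, a 3,500-word job edited at 40% and billed at 30 per hour comes to 120. Every parameter here is something to agree on during negotiation rather than a fixed value.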
Negotiation grid
• Generals
  ◦ Engine
  ◦ Generic or trained
  ◦ Quality
  ◦ Raw output
  ◦ Expectations
  ◦ Formats and formatting
• Compensation
  ◦ Per-word rate
  ◦ Productivity rate
  ◦ Hourly rate
  ◦ Time tracking
When to say NO
• A considerably low pay rate unrelated to language pair and MT output quality
• MT output quality is lower than a generic free online service
Automatic processing
• Pre-processing
  ◦ Empty, untranslated, duplicated segments
  ◦ Normalization
    Punctuation, diacritics, extra spaces, noise
    Numbers, dates, weights, measures
  ◦ Terminology
  ◦ Spellcheck
• Post-processing
  ◦ Encoding
  ◦ Normalization
  ◦ Terminology
  ◦ Spellcheck
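Part of the pre-processing stage lends itself to a short script. The following is a minimal sketch under stated assumptions: segments are held as (source, raw MT) string pairs, the function names are hypothetical, and the punctuation rule suits languages that put no space before punctuation marks (it would be wrong for French, for example). Terminology checks and spellchecking are left out because they depend on external resources.

```python
import re
import unicodedata


def normalize(text):
    """Basic normalization of diacritics, extra spaces, and punctuation spacing."""
    text = unicodedata.normalize("NFC", text)      # compose diacritics
    text = re.sub(r"\s+", " ", text).strip()       # collapse extra spaces
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)   # language-specific assumption
    return text


def preprocess(segments):
    """Drop empty, untranslated, and duplicated (source, raw_mt) pairs."""
    seen = set()
    cleaned = []
    for source, raw_mt in segments:
        source, raw_mt = normalize(source), normalize(raw_mt)
        if not raw_mt:
            continue  # empty segment
        if raw_mt == source:
            continue  # untranslated; note this also drops legitimate identical segments
        if (source, raw_mt) in seen:
            continue  # duplicate
        seen.add((source, raw_mt))
        cleaned.append((source, raw_mt))
    return cleaned
```

In practice the untranslated check usually needs an exception list for segments that are legitimately identical in both languages, such as product names or numbers.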
Thank you
Don’t forget your download card

Editor's Notes

  • #2 Good morning. Thank you for coming.
  • #3 My name is Luigi Muzii.
  • #4  I have been working in the language business since 1982, as a translator, terminologist, technical writer, and localizer.
  • #5 I have been working with machine translation (MT) since 1991.
  • #6 I worked in telecommunications until the year 2002, managing translation and localization projects, designing, developing, and managing terminology and documentation, and designing and providing training for customers and in-house staff.
  • #8 I taught terminology and localization at LUSPIO (now UNINT) university until 2011.
  • #9 Since 2012, I have been a full-time independent business consultant helping customers choose and implement best-suited technologies and redesign their business processes to get the best in multilingual content production, translation, and localization.
  • #10 We will start by identifying target groups, then continue with laying the foundations for using MT in a production environment. We will also consider the definition of requirements for MT platforms and projects and the setup of an MT platform, and finish with running MT and post-editing projects.
  • #12 This presentation is designed to provide some practical advice about tackling the challenges that freelancers, project managers, and translation buyers face when approaching MT, implementing MT or running MT and post-editing projects.
  • #14 Three target groups can be identified for three kinds of task. In the making: machine translation is used for “suggestions” while a human translator processes a document; this is probably the most common approach today. Downstream: machine translation is used to fully process a document that will possibly be post-edited; this approach is typically adopted by larger clients with established experience in the field. On constraints: machine translation is used by an LSP to finalize a translation job by asking translators to work on suggestions coming from a specialized engine. While scenarios two and three might meet the customer’s need for confidentiality and IP protection through an in-house engine using only the client’s own data, scenario one is becoming more and more common among professional translators, given the astonishing improvement of online engines. At the same time, scenario three is slowly but steadily spreading among LSPs that try to escape price and volume pressures through machine translation and post-editing.
  • #16 The three scenarios just outlined require different strategies to be devised. The first one involves the machine-translation method.
  • #17 Given the circumstances, the premises and the many reservations about it—basically, the hype—a question must be asked: Is PEMT already in the past?
  • #18 In this presentation, the reference method is PB-SMT (Phrase-Based Statistical Machine Translation) because PB-SMT engines are inexpensive and effective, whereas customizable NMT (Neural Machine Translation) engines are still quite pricey and challenging as to technical requirements and operational complexity, and thus out of range for most customers. Translators working on unrestricted documents (scenario one applies here) would generally choose an online NMT engine. For customers requiring confidentiality and IP protection and willing to leverage their own language data (scenarios two and three apply here), a highly customizable PB-SMT engine might be a valid option, especially where no major investment is envisaged in the field, vast and suitable data is available, and/or limited volumes are processed.
  • #19 In general, the main drivers in the adoption of MT are productivity (speed and volumes) and usefulness (consistency and marginal cost), especially for large projects otherwise involving many translators. Unfortunately, MT is not exactly child’s play. MT engines are complex and challenging applications requiring a combination of tools, data, and knowledge. This combination is a rare commodity, especially in a single person, on both sides of a translation project.
  • #20 Also, in the future, while MT will continue to proliferate, the shortage of good translators will increase.
  • #21 Therefore, joining forces is important to explore and vet as many solutions as possible. According to a popular saying, everyone’s needed and no one’s indispensable: anyone can be easily replaced by someone else with a similar profile in the same role. This also means, though, that, to be valuable, everyone’s effort is needed, at the highest level of performance all the time. For quite some time now, translation has no longer been a solitary feat, but a collaborative effort. This is especially true with the current level of technology.
  • #22 In this respect, three steps should be completed before starting any MT project. Recap your goals and requirements: what you expect from MT and how much you are willing to rely on it. Check your MT readiness: realistically analyze your knowledge, tools, and data. Plan for assistance: never venture into unfamiliar territory without a guide.
  • #23 When planning to implement MT, keep scope, goals, and expectations clearly distinct.
  • #24 Identify one or more items within your scope of business for MT, possibly picking those where the amount of data available is larger, and the quality higher.
  • #25 Clearly define and prioritize your goals. Major goals may be reducing labor, boosting productivity and keeping consistency, especially in larger projects.
  • #26 Be realistic in your expectations. Therefore, familiarize yourself with the technology and strengthen your expertise to make the best of it. Tackle any security issues concerning confidentiality, data integrity, and protection of intellectual property. Don’t forget to scrub your data if you plan to train an engine, and to plan for any relevant support. Finally, revise your pricing model to encompass MT-related tasks.
  • #28 When building an MT platform, never forget that MT engines are not all equal: they differ in environment, configuration, and data. Therefore, although the output could be considered somewhat predictable, raw output quality can vary across systems and language combinations, and errors may not follow a consistent pattern. The performance of MT engines also varies.
  • #29 In data-driven MT, data maintenance is crucial, and it is the first task when setting up an MT platform. Data must be organized, cleaned, and fixed for terminology and style. For a customized engine, at least 100,000 paired segments are necessary, and the cleaner and healthier the better. Another important factor in the effectiveness of an MT strategy is the set of tools used for data preparation, pre-translation, and post-editing. Special attention must be given to translation tool settings to allow for sub-segment recall and fuzzy match repair.
  • #30 Finally, when choosing the engine, the items to consider are: the total cost of ownership (TCO); integration; expertise; security (especially as to intellectual property and confidentiality).
  • #31 When running MT projects, best practices may be different depending on whether you are a translation buyer or vendor.
  • #32 In general, knowing your data and mastering quality metrics is a must. As to post-editing, devise your own compensation scheme.
  • #33 A common mistake is to consider all content as equal and then mess with data. In the same way, absolutely avoid relying only on your own capabilities or only on vendors. In the end, everything can be summed up in a simple invitation not to expect any miracles.
  • #34 To avoid being sued, never forget to agree with the customer and the content owner about using MT, especially if you are using a SaaS/online platform.
  • #35 Also remember that data is the fuel to any SMT/NMT engine and that the output is only as good as the data used. In fact, these engines perform statistical predictions by inference, and when the amount and quality of data increase, an engine improves.
  • #36 Collect as much data as possible, but always make sure it comes from few reliable sources in a restricted domain, that it contains correct translations with no errors, it is real and recent, and terminologically and stylistically consistent.
  • #37 At this point, you must accept that MT output is unpredictable. For this reason, MT quality assessment should be run in such a way as to prevent any subjectivity.
  • #38 For the same reason, post-editing is and will remain a critical, integrated part of MT usage, and it is expected to be fast, unchallenging, and flowing.
  • #39 Anyway, the amount of post-editing required can be hard to assess. To plan deadlines and allocate a budget for the job, two different measures can be used: the edit time and the post-editing effort. The former is the time required to get a raw MT output to the desired standard, and the latter is the percentage of edits to be applied to raw MT output to attain the desired standard.
  • #40 The main problem with edit time is that it can only be computed downstream, assuming that the time spent has been entirely devoted to editing.
  • #41 The post-editing effort can be estimated through probabilistic forecasts based on automatic metrics as a reverse projection of the productivity boost. In fact, translation productivity is measured as the throughput expressed in the number of words per hour, and MT is supposed to improve it by reducing the turnaround time and increasing the workable volumes. However, the post-editing effort and the turnaround time are hard to predict, especially for translation of publishable quality and/or data for incremental training of engines. In fact, it depends on diverse factors such as the quality expectations for the finalized output, the volume of content to process, and the allotted time for the task. Also, the effort required depends on the post-editing level.
  • #42 The post-editing level is generally restricted to: Gisting; Light; Full. Gisting consists in fixing recurring errors in raw MT output with automatic scripts. It is used for volatile content, e.g. messaging, conversations, etc. Light post-editing consists in fixing capitalization and punctuation errors, replacing unknown words, removing redundant words or inserting missing ones, and ignoring all stylistic issues. It is generally used for continuous delivery and reprocessing. Full post-editing consists in fixing meaning distortion, grammar and syntax, translating untranslated terms (possibly new terms), and adjusting fluency. It is reserved for publishing and engine training.
  • #43 Finally, always follow a few basic rules before embarking on a post-editing project: Test before operating; Ask for MT samples for negotiation; Negotiate throughput rates; Ask for a glossary with the list of DNT words; Ask for instructions; Be open to feedback.
  • #44 Similarly: never use MT to sustain price competition; never process poor MT outputs; never treat post-editing as fuzzy matches.
  • #45 Remember that different engines, domains, and language pairs produce different outputs, involve different post-editing efforts, and require different post-editing instructions. These should address tool selection criteria and environment setup guidelines, as well as a style guide, and a comprehensive term base. They should also address conventions and the type and number of project details as well as the general pricing model and the actual operating instructions.
  • #46 As to pricing and compensation, for light post-editing of very good output when speed is the major concern and the first requirement, a model should be settled prior to any assessment of the actual MT output, based on a clear-cut predictive scheme. However, do not follow any translation-memory fuzzy-match scheme. In fact, while fuzzy matches over 85% are inherently correct and generally require minor changes, machine-translated segments may contain errors and inaccuracies, and even a supposedly light post-editing may prove challenging. A downstream computation scheme might also be devised in full post-editing for an accurate measurement of the actual work performed. This is usually done by computing the edit distance and then inferring the percentage on the hourly rate.
  • #47 A negotiation grid can be helpful to cross-reference type and nature of engines, quality of raw output, and all the relevant technical requirements with compensation based on productivity, type of performance, bid (hourly rate), and ancillary services (e.g. filling in QA forms for ongoing training of the engine).
  • #48 In any case, a strong and clear “No, thanks!” is more than reasonable when a considerably low pay rate is offered that is unrelated to language pair and MT output quality and/or MT output quality is lower than a generic free online service.
  • #49 Lastly, raw MT output should be processed before a post-editing job for automatic removal of empty and/or untranslated segments and duplications, and for fixing of diacritics, punctuation marks, extra spaces, noise, and numbers; terminology should also be checked for consistency and a spellcheck should be run. A post-processing stage should also be envisaged involving encoding, normalization, formatting (especially tag injection), a terminology check and, obviously, a spellcheck.