Let's call the whole thing off


A short essay on translation quality standards, the new ISO 17100 standard, translation quality assessment, sampling, and translation data quality for statistical machine translation.

Published in: Education

  1. 1. Let’s call the whole thing off Tout ce qui a besoin d'être dit l'a déjà été. Mais puisque personne n'écoutait, tout doit être dit à nouveau. Everything that needs to be said has already been said. But since no one was listening, everything must be said again. André Gide, Le traité du Narcisse, 1891* Although considered a mature and widespread concept, quality is a relative and largely subjective notion. There is no unique conventional set of metrics for translation quality measurement and, as in many other fields of application, translation quality broadly corresponds to the fulfillment of a set of specifications, encompassing the buyer’s requirements. Quality, utility and pricing Utility is defined as the ability of something to satisfy needs or wants. In this sense, it is quite similar to quality, defined as ‘fitness for purpose’. Both refer to customer’s satisfaction for a good or a service. In business, quality has a pragmatic meaning as the non-inferiority or superiority of something, but is always an intuitive, conditional, and subjective attribute and may be interpreted differently by different people. In economics, utility is a representation of preferences over some set of goods and services. As is the case for quality, utility cannot be measured. Nobel laureate Paul Samuelson named ‘revealed preferences’ the choices outlining utility. In economics, the marginal utility of a good or service is the gain from an increase or loss from a decrease in the consumption of that good or service. In other words, the first unit of consumption of a good or service yields more utility than the second and subsequent units, with a continuing reduction for greater amounts. A good or service should then be consumed at a quantity at which the marginal utility equals the change in the cost of producing one more unit of a good (marginal cost). Due to information asymmetry, translation is supplied without qualitative differentiation across markets. 
This makes it a typical commodity. * Many thanks to Kirti Vashee for the quote.
  2. 2. Research claims that the demand for translation has been increasing, although at a slower pace in the last few years. As the rate of commodity acquisition increases, marginal utility decreases, and if commodity consumption continues to rise, marginal utility may at some point fall to zero, where total utility reaches its maximum. Any further increase in consumption makes marginal utility negative, which signifies dissatisfaction. Price is determined by both marginal utility and marginal cost, and this dynamic explains not only why the marginal cost of water is far lower than that of diamonds, but also why quality is an expected feature of a good or service that is not linked to its selling price. Is quality, then, a way to differentiation? Is ‘purity’ the right way to differentiation? Is diversity still richness? Is differentiation really important? Or do we have to wear camouflage to stay alive? Most often, the buyer perceives translation as the only material available for scrutiny. Therefore, particularly in translation, there is no such thing as absolute quality, since different jobs meet different requirements and different quality criteria. To be reliable, translation quality assessment must be indisputable and repeatable; effective metrics must be available that are objective (measurable), unbiased, and able to provide enough resolution (detail) to assess the factors that need improvement. Since there is no common protocol or tool for automated translation quality assessment, guidelines are meant to enable a human team to perform this task while keeping the error margin as low as possible. However, so far, two different people following the same protocol could hardly achieve the same result (or at least a comparable result). 
In fact, the detailed and strict error-based evaluation models used so far have proved costly, ineffectual, and erratic, as they hardly consider content type, end-user requirements, and usability: in a word, fitness for purpose. These models have been developed, rolled out, and implemented by linguists for linguists. They focus on linguistic features instead of cost-effectiveness and functionality, with time and cost growing linearly with volume. Technology and Incomes Google has made machine translation a General Purpose Technology (GPT), thus helping spread the concept of translation as a utility. It cannot, however, be accused of having contributed to the commodification of translation, since the two concepts are distinct and unconnected. If anything, Google Translate has helped raise awareness of the importance of translation for the circulation of information and knowledge, even if only indirectly. The Moses engine has harmed translation more, because it is free of charge and apparently easy and convenient to implement, while, like any other complex technology, no matter how seemingly simple, it requires specific skills, know-how, understanding, and patience. Improvisation does not pay. Neither does gambling, especially if the ultimate goal is to lower costs and increase profits. Just as players in the translation industry would like prospects to see translation as an investment, professional users should see machine translation as a complex technology and refrain from proclaiming themselves experts just because they are able to install and run a DIY piece of software. This applies to any software product. In the last decade, the acceleration in technology has shocked not only the industry, but virtually everyone. Skills and institutions in the translation industry have not been able to keep pace with the rapid changes in technology. 
In the translation industry too, skill-biased technological change (SBTC) increases the incomes of highly skilled workers and reduces the incomes and employment of low-skilled workers.
  3. 3. As Erik Brynjolfsson and Andrew McAfee argue in The Second Machine Age, in the last decade, the fall in demand has been greater for those who find themselves in the middle of the skill distribution. Highly qualified workers have done well, but workers with lower qualifications have been less affected than those with medium qualifications, reflecting a polarization of labor demand and an interesting fact about automation. Activities requiring the coordination of physical movement and sensory perception have proved more resilient to automation than basic data processing, following Moravec's Paradox, which claims that high-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources. In this respect, a recent Economist article helps clarify this point. Lower qualified jobs are and will most probably remain low paid; this makes replacement of highly qualified jobs with machines convenient, especially in the long run, despite Keynes’s opinion (“This long run is a misleading guide to current affairs. In the long run we are all dead.”). As Brynjolfsson and McAfee suggest, the whole translation industry should pursue a strategy of innovating and reshaping organizations, structures, processes, and business models to leverage developing technologies and human skills. This would be easier to achieve than the disruptive technological innovations that, as its history has proved, the industry is incapable of producing and rather undergoes and endures at the hands of outsiders (see also Moore’s Law and Commoditization (of Translation too)). On the other hand, the more technologies are present in an industry, the harsher the competition. The spread between the highest and lowest performers increases, as does the profit margin spread between the companies at the top and at the bottom of the scale. Going back for a moment to information asymmetry, it is worth recalling a study by Robert Jensen of the John F. 
Kennedy School of Government of Harvard University on the digital provide in the fisheries sector in Kerala. In Professor Jensen’s words, “when information is limited or costly, agents are unable to engage in optimal arbitrage. Excess price dispersion across markets can arise, and goods may not be allocated efficiently.” Information technologies and mobile phones in Kerala allowed fishermen to access information on prices and market demand in real time and use this information to make decisions. This resulted in a significant reduction in price dispersion and improved market performance, after an initial drop in prices and subsequent stabilization, with an eventual increase in profits.
  4. 4. A New Standard After a lustrum, a seemingly endless gestation, especially for our fast-paced times, ISO/DIS 17100 (Translation services - Requirements for translation services) has eventually reached quasi-final draft status (voting terminated on November 20, 2013). This draft has been submitted to the ISO member bodies and to the CEN member bodies for a parallel enquiry, which is about to end as well. While waiting for imprimatur, this 20-page draft is available for purchase at CHF 66.00 (€54.18 or US$74.50). Very ambitiously, in its introduction, ISO/DIS 17100 claims to specify “requirements for all aspects of the translation process directly affecting the quality and delivery of translation services.” ‘All’ is a very challenging word, especially when it comes to a typically human task like translation; more realistically, in its scope section, the standard only “provides requirements for the core processes, resources and other aspects necessary for the delivery of a quality translation service that meets applicable specifications.” It is not a good start for a supposedly state-of-the-art standard made by presumably renowned experts. In fact, ISO/DIS 17100 is a reworking of EN 15038 to partly accommodate ASTM F2575-06 and give a nod to the Chinese GB/T 19363.1-2003. ISO/TS 11669 is crucial to the ISO/DIS 17100 framework. ISO/TS 11669 is a technical specification. The shelf life of an ISO technical specification is six years: within this timeframe it is either converted to a full standard or eliminated. ISO/TS 11669 provides a framework for developing structured specifications for translation projects, but it does not cover legally binding contracts between parties involved in a translation project. It addresses quality assurance and provides the basis for qualitative assessment, but it does not provide procedures for a quantitative measurement of the quality of a translation product. 
ISO/TS 11669 describes a decision-making system about how translation projects should be carried out. Those decisions (or project specifications) would then become a resource for both the requester and the translation service provider (TSP) throughout all phases of a translation project. These specifications can be attached to a legally binding contract to define the work to be done. In the absence of a contract, they can be attached to a purchase order or any other document supporting the request. Requesters and TSPs should determine project specifications together. The project specifications can be used to guide assessments made by either the TSP or the end user. The use of the same specifications by all parties makes it possible to avoid assessments based on personal opinions of how source content should be translated. ISO/TS 11669 does not provide any procedures for quantitative measurement of the quality of a translation product. ISO/TS 11669 introduces translation parameters, intended as key factors, activities, elements and attributes of a given project used for creating project specifications. However, the long listing of translation parameters is a surreptitious way to impose vague and blurry translation quality assessment criteria, which are traditionally subjective.
  5. 5. In addition, since quality is defined as the degree to which the translation product conforms to the project specifications, and no guidance is given for qualitative assessment, register should not be a parameter, as its compliance with requirements is highly subjective. Like EN 15038:2006, ISO/DIS 17100 specifies requirements that a provider of translation services must meet, in terms of staff and equipment, project management and processes. Like EN 15038:2006, ISO/DIS 17100 shows the typical conservatism of the translation industry. Although the EN 15038:2006 draft was finalized a good two years before its release, ISO/DIS 17100, like EN 15038:2006, still reflects the typical old business model of the whole translation industry. Like EN 15038:2006, ISO/DIS 17100 contains no commitment towards metrics, and no hints on how to establish that the quality of translations reaches a certain level. Still, in one of the informative annexes, ISO/DIS 17100 contains a timid commitment towards service level agreements (SLAs) that could outline such a framework. Translators’ competences are still a weak point in ISO/DIS 17100. The TSP is required to “have a documented process in place to ensure that the people selected to perform translation projects have the required competences and qualifications,” but no means is envisaged to ensure that this actually happens. A basic requirement for translator qualification is “a recognized graduate qualification in translation” or substantial full-time professional experience in translating. These are the same basic errors as in EN 15038:2006, reflecting a sugar-coated view that is now far from reality, where the newbies churned out by the old-fashioned translation schools crowding the old and the new world unfailingly prove inadequate. Not surprisingly, these schools are under the thumb of the same advocates of EN 15038:2006 and ISO/DIS 17100. 
On the other hand, ISO/DIS 17100 takes translation vendor and project management into consideration, with a view to ensuring that “the people selected to perform translation projects have the required competences and qualifications.” According to the standard, “translation project management competence can be acquired in the course of formal or informal training, e.g. as part of a relevant higher educational course or by means of on-the-job training or by industry experience.” This is somewhat dismissive of the importance admittedly acknowledged to translation project management, and yet it is definitely much more than the attention devoted to translation vendor management, which is in fact crucial. Indeed, the standard does not envisage any requirement in this respect. Here comes the biggest flaw in ISO/DIS 17100, in section 5.3 Translation process. With the typical dirigiste trait of translation scholars, the abundance of details is not accompanied by any specification of requirements as to who should monitor the several tasks in the process and how, ending in an utter manifestation of the typical wishful thinking that permeates the industry. A blatant example is given in section 5.3.3 Revision. Beyond the impractical revival of the typical academic approach based on contrastive analysis, no indication is given about the basis on which to “correct any errors found in the translation output or recommend the corrective measures to be implemented”, leaving any decision entirely to the reviser’s discretion, and thus ample room for the introduction of further errors. ISO/DIS 17100 still contains all the flaws and limitations of EN 15038:2006 and incorporates some from ASTM F2575-06, although both left much room for improvement, and four years of life for both at the start of drafting, plus as many years of drafting, were time enough to do better. Annexes A and G are perfect examples in this respect. 
They seem quite a divertissement in themselves, with the translation workflow outlined in the former still offering a monolithic serial model that is far from agile, and a ‘DOK’ in the latter, with no definition or elaboration, being something that would most probably disturb the sleep of many uninformed readers. For informative annexes, both surely miss their goal. Annex B offers a list of elements to be included in an agreement as project specifications possibly in “the form of statements of work such as a service level agreement (SLA),” but it gives no definition for statements of work (SoW) or SLA.
  6. 6. Standards are all about allowing stakeholders to overcome information asymmetries and make informed decisions; to this end, they must be simple, functional, and end-user oriented. ISO/DIS 17100 is another missed opportunity to gain respect and consideration for the translation industry. Measurability and Metrics The quality process standard par excellence, ISO 9001:2008, is based on the assumption that regulating and systematizing tasks in repeatable processes, with strong audit trails, will eventually lead to controlled production processes and to products/services delivered with repeatable quality (attributes). Over the years, the concept of continuous improvement has been spreading, to be eventually incorporated in this standard. While leading industries developed complementary sets of techniques and tools for process improvement, the translation industry pursued its own standards, which respected its peculiarity and the special nature of its services. The manufacturing industry applied the concept of Kaizen and conceived Total Quality Management (TQM), Six Sigma (6σ) and CMMI to improve the quality of process outputs by identifying and removing the causes of defects (errors) and minimizing variability in manufacturing and business processes. The table below gives a measure of process performance corresponding to Six Sigma levels, roughly expressed in defects per million opportunities (DPMO).
     Sigma level   DPMO      Percentage yield
     1             690,000   31%
     2             310,000   69%
     3             67,000    93.3%
     4             6,200     99.38%
     5             230       99.977%
     6             3.4       99.99966%
     This means that, in a 10,000-word project, the seemingly minute difference between 99.38% and 99.99% means 62 errors compared to 1; 2 errors every three pages compared to only 1 in total. In the language industry, quality is a most debated subject. The most commonly asked question about quality is: how can quality be measured? To measure something, you must know what it is, and then you must develop metrics that measure it. 
Metrics definition is the hardest part for people who have always thought of quality in their deliverables as a questionable subject. The best way to assess quality remains measuring the number and magnitude of defects, and when defects cannot be physically removed, their features and scope must be specified. The first step, then, is to establish a model or definition of quality, and translate it into a set of metrics that measure each of the elements of quality in it. Measuring things just because they can be measured is not useful. If something is not relevant to the quality model established, it is not a good use of time to develop metrics to measure it.
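The Six Sigma arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes that each word counts as one defect opportunity (the essay does not say how units are counted), it reuses the rounded DPMO figures from the table, and the function names are illustrative.

```python
# Rounded DPMO figures taken from the Six Sigma table in the text.
SIGMA_DPMO = {1: 690_000, 2: 310_000, 3: 67_000, 4: 6_200, 5: 230, 6: 3.4}


def yield_percent(dpmo: float) -> float:
    """Percentage of defect-free units for a given DPMO rate."""
    return 100.0 * (1.0 - dpmo / 1_000_000)


def expected_defects(units: int, dpmo: float) -> float:
    """Expected number of defects in `units` opportunities.

    Assumption: one word is one opportunity for a defect.
    """
    return units * dpmo / 1_000_000


if __name__ == "__main__":
    words = 10_000  # project size used in the text
    for level, dpmo in SIGMA_DPMO.items():
        print(level, yield_percent(dpmo), expected_defects(words, dpmo))
```

Under these assumptions, the 4-sigma row (6,200 DPMO) gives 62 expected errors in 10,000 words, while a 99.99% yield (100 DPMO) gives 1.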
  7. 7. Striving for a single, all-encompassing metric is not only troublesome, it can be useless, as a simple metric would not reveal all the problems. Creating multiple metrics that assess the various aspects of what is to be measured can help re-compose the overall framework: knowing which parts of a process work well and which ones do not makes it possible to take measures to correct the problems. A comprehensive set of metrics must measure quality from several perspectives and at several points during the production process, regardless of the quality model. At a minimum, metrics should tell something about:
     • The quality of the finished product or the lack of it;
     • The quality of the process, i.e. how reliably it produces quality products;
     • The likelihood of achieving quality in a deliverable.
     The quality of the finished product corresponds to general customer satisfaction ratings, while the lack of quality can be given by defects such as technical errors, the quality of the process comes from repeatability, and typical predictors of quality are in-process indicators such as editing. Sampling In this perspective, the distinction between quality assurance, quality assessment, and quality inspection and control is important. Quality assurance is a planned and systematic pattern of all actions necessary to provide adequate confidence that the item or product conforms to established technical requirements. Quality assurance covers all activities, in accordance with two basic rules, “fit for purpose” and “do it right the first time”. Quality control and quality assessment contribute to quality assurance. Quality assurance is the full set of procedures applied before, during and after the production process, by all members of an organization, to ensure that quality objectives important to clients are being met. Quality assessment is intended for establishing whether contract conditions have been met. 
Whereas quality control is product-oriented and customer-oriented, quality assessment is business-oriented. Unlike quality control, which always occurs before the final product is delivered to the client, quality assessment may take place after delivery. Assessment is not part of the production process. It consists in identifying — but not correcting — problems in one or more randomly selected samples of a product output to determine the degree to which it meets the agreed standards. In the translation industry, quality control is done with specific software tools, whether standalone or integrated in translation environments. These tools usually detect mechanical errors, spelling errors, omissions, inconsistencies, and oversights, especially when reference material is provided. Nevertheless, since there is no ‘perfect’ translation, the intended purpose of a translation and its suitability remain the only judgment criteria which, for the sake of objectivity, must be accompanied by assessment metrics. The combination of process and output quality assessment of translation work will eventually tell simply whether it is acceptable or defective. Therefore, translation quality assessment (TQA) criteria are to be agreed upon with the client, be subject of requirements and be formalized in a separate document. So far, TQA has been performed on the basis of a strict correspondence between source and target texts and on intensive error detection and analysis. While this could be the best approach from a theoretical — and maybe pedagogical — point of view, it is uneconomic. It requires a considerable investment in human resources and time, and it reduces translation to a matter of trust.
  8. 8. On the other hand, who will go over 100,000 words of translation to check for terminology changes after a translation has been delivered? However, while terminology issues can be approached in a systematic way, style is a matter of personal preferences. The same goes for correctness and meaning with respect to completeness. Any translation can be fully checked, automatically, for completeness against the source text, freedom from mechanical flaws or errors, and even for grammar, intended as correctness in conforming to an approved or conventional standard. In any case, any job done by a professional translator is taken for granted as free from such defects. Today, any large translation project follows the same standards and rules as a production process in common business. In this perspective, defects should be reproducible under the same conditions, so that they can be corrected and then removed. A first step towards improvement in the quality of process outputs consists in preventing the emergence of defects by minimizing variability in processes. To this end, a detailed statement of work and an accurate style guide can be helpful, although time-consuming, in most situations, possibly together with examples of do’s and don’ts. This approach could eventually lead to setting up defect tracking and assessment procedures. Here comes inspection. Just like any other object, to be measurable, a translation, especially when large, should be apportioned in definite allotments that are homogeneous in size and scope, so that the number and significance of defects can be reasonably estimated and a limit set for both. Such apportionment is called sampling. Sampling becomes necessary for any translation project exceeding a typical freelancer’s single-day capacity, which makes 100% inspection unsustainable. 
Sampling will allow for inspection of meaningful, representative batches, and for accepting or rejecting them through the determination of the maximum number of defects, based on simple pass/fail criteria. Acceptance sampling is the middle-of-the-road approach between no inspection and full inspection. Its main purpose is to decide whether a lot is acceptable, not to estimate its quality. To determine acceptability, criteria for inspection by attributes must be specified in advance. Once criteria for inspection are specified, acceptability thresholds must be set. The ISO 2859 series of standards can be used here as a reference. For acceptance sampling to be effective, a lot acceptance sampling plan (LASP) must be implemented indicating the conditions for acceptance or rejection of the lot that is being inspected. These parameters are usually the number of different defectives in a sample and should vary in quantity and severity in direct relation to the importance of the characteristics inspected. Average Outgoing Quality (AOQ) procedures are the best suited for translation projects, since sampling is non-destructive, lots are fully inspected and all defectives in rejected lots are replaced with good units. In this case, all rejected lots are made perfect and the only defects left are those in lots that were accepted. AOQ expresses the average nonconforming fraction that is shipped to clients:
     $$\mathrm{AOQ}(p) = \frac{P_A \, p \, (N - n)}{N - n p - (1 - P_A) \, p \, (N - n)}$$
     where $P_A$ is the probability of accepting the lot, $(N - n) P_A$ is the number of pieces that are shipped without inspection, and $p$ is the nonconforming fraction. The numerator is the number of bad pieces that are shipped, and the denominator is the total pieces shipped. Corrections are made to make rejected lots perfect and allow for identifying and removing the causes of defects, thus preventing their emergence by improving processes and then the quality of outputs.
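As a rough illustration of how AOQ could be computed, the sketch below assumes the common textbook form AOQ(p) = P_A p (N - n) / (N - n p - (1 - P_A) p (N - n)), with N the lot size and n the sample size. It also assumes a single sampling plan in which a lot is accepted when at most c defectives are found in the sample, with the acceptance probability modeled by a binomial distribution. The acceptance number c, the binomial model and the function names are assumptions made for the example, not something the essay specifies.

```python
from math import comb


def prob_accept(p: float, n: int, c: int) -> float:
    """P_A: probability that a sample of n items shows at most c defectives.

    Assumption: binomial model with nonconforming fraction p.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def aoq(p: float, lot_size: int, n: int, c: int) -> float:
    """Average outgoing quality for nonconforming fraction p."""
    pa = prob_accept(p, n, c)
    # Bad pieces shipped: defectives in the uninspected part of accepted lots.
    shipped_bad = pa * p * (lot_size - n)
    # Total shipped: the lot minus the defectives removed during inspection.
    shipped_total = lot_size - n * p - (1 - pa) * p * (lot_size - n)
    return shipped_bad / shipped_total


if __name__ == "__main__":
    # Hypothetical plan: 1,000 segments per lot, 80 sampled, accept on <= 2.
    for fraction in (0.005, 0.01, 0.02, 0.05, 0.10):
        print(fraction, aoq(fraction, lot_size=1000, n=80, c=2))
```

Plotting AOQ against p for a given plan shows the worst average quality a client can expect (the average outgoing quality limit), which is one way to compare candidate plans before agreeing on one.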
  9. 9. To make assessment criteria, methods and tools unambiguous, AQLs (Acceptance Quality Levels) can be used, allowing for tolerance and deviations (errors). AQLs should be agreed upon in an SLA and should specify the maximum percentage of non-conforming items that can be considered a satisfactory process mean. Different AQLs may be designated for different types of defects. An implication of acceptance sampling is that a lot exceeding a given percentage of deviations from the AQL is unsatisfactory and must be rejected. At the same time, a high defect level (Lot Tolerance Percent Defective, LTPD) must be designated that would be unacceptable to the consumer. AQLs imply that a level of non-quality exists in a product where defects remain that ruin a batch, despite being ‘acceptable’. This level represents a compromise between the quality, volume and price negotiated. To set AQLs, a simple defect prediction technique can be implemented to separate the defects found in a translation sample into two groups. Depending on the number of defects found in either of the two groups, but not in both, the defects that have not been found in the sample can then be estimated. This number gives approximately the number of defects in the entire project. Drawing Samples A sample is a subset of a production output used to estimate characteristics of the whole output. The sample drawing process consists of:
     • Defining the production output;
     • Specifying a sampling frame, a set of items to measure;
     • Specifying a sampling method for selecting items from the frame;
     • Determining the sample size;
     • Implementing the sampling plan;
     • Sampling and data collecting;
     • Data that can be selected.
     In most cases, it is inconvenient and uneconomic to sentence a batch of material from production (acceptance sampling by lots) by identifying and measuring every single item in the production output and including any one of them in the sample. 
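The sample drawing steps listed above can be sketched in a few lines of Python. The sketch assumes that the sampling frame is a list of translated segments and that items are picked by simple random selection without replacement, which is only one possible sampling method; the function name, the seed parameter and the sample size are illustrative.

```python
import random


def draw_sample(frame: list, sample_size: int, seed=None) -> list:
    """Draw a simple random sample, without replacement, from the frame.

    `seed` is optional and only makes the draw reproducible for audits.
    """
    if sample_size > len(frame):
        raise ValueError("sample size exceeds the sampling frame")
    rng = random.Random(seed)
    return rng.sample(frame, sample_size)


if __name__ == "__main__":
    # Hypothetical frame: segment identifiers of a delivered project.
    frame = [f"segment-{i}" for i in range(1, 2001)]
    sample = draw_sample(frame, sample_size=125, seed=42)
    print(len(sample), sample[:3])
```

Recording the seed together with the sampling plan lets both the requester and the provider reproduce exactly the same sample when results are disputed.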
Given the variety and variance in projects, the need to use different providers to match (large) volumes with (tight) deadlines, and the consequent unpredictable nature of translation, simple random sampling (SRS) is the most advisable method to minimize bias and simplify the analysis of results. In SRS, the variance between individual results is a good indicator of the variance in the sample, which helps estimate the accuracy of results, even though the randomness of a selection may result in a sample that does not reflect the makeup of the overall output. Assuming the source content of a translation project is homogeneous per se, the size of samples could be determined according to the type of deliverables and AOQ. Purity and Quality In recent years, Statistical Machine Translation (SMT) has become interesting particularly for LSPs, mostly thanks to the availability of the free Moses engine. However, contrary to expectations, creating the corpus that a system needs to run effectively and satisfactorily can be costly. In fact, for quite some time now, a distinction has been made between generic SMT and customized SMT, where the latter leverages domain resources for phraseology, terminology, and style. In this respect, a further distinction has been made between clean data and quality data. In reality, the latter includes the former. The following table should help clarify this concept.
  10. 10.
     Clean Data                                                                Quality Data
     Small number of trusted quality sources                                   Actual data
     Domain relevance (restricted)                                             Standard length sentences
     No less than 1,000 segments                                               Terminologically consistent
     Encoding consistency                                                      Consistent writing style
     No empty segments                                                         No mistakes or errors (syntax, grammar, spelling)
     No mechanical errors (diacritics, punctuation, capitalization, spelling)  Correct translation (exact words, morphology, no loans)
     Cleaning data for training purposes can be performed automatically or semi-automatically with the aid of software tools. These tools can be used to run a series of checks on parallel data, e.g. for no empty segments, unbroken markups, correct numbers, etc. and even for consistent translations and correspondence with approved terminology. Refining data for quality, i.e. to match the intended purpose and target audience with preferred writing style and terminology, is a human task requiring thorough understanding of the data.
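The automatic checks mentioned above (no empty segments, unbroken markup, correct numbers) might look like the following Python sketch. It is a deliberately simplified illustration, not a description of any specific tool: the regular expressions, the issue labels and the function names are assumptions, and the number check compares digit strings literally, so locale-specific formats such as 1,000 versus 1.000 would be flagged.

```python
import re

# Simplified patterns, assumed for illustration only.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def check_pair(source: str, target: str) -> list:
    """Return the list of mechanical issues found in one segment pair."""
    if not source.strip() or not target.strip():
        return ["empty segment"]
    issues = []
    # Numbers should be the same on both sides, in any order.
    if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
        issues.append("number mismatch")
    # Inline markup should survive translation: compare tag counts.
    if len(TAG.findall(source)) != len(TAG.findall(target)):
        issues.append("markup mismatch")
    return issues


def clean(pairs: list) -> list:
    """Keep only the segment pairs that pass every check."""
    return [(s, t) for s, t in pairs if not check_pair(s, t)]


if __name__ == "__main__":
    corpus = [
        ("Press the button 3 times.", "Premere il pulsante 3 volte."),
        ("See page 12.", "Vedere pagina 21."),
        ("<b>Save</b> the file.", "Salvare il file."),
        ("Close the window.", ""),
    ]
    for src, tgt in corpus:
        print(repr(src), check_pair(src, tgt))
    print(len(clean(corpus)))
```

Checks of this kind only cover the clean data side; consistency of style and terminology, as noted above, still calls for human review.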