White Paper
Data Quality meets SOA: Making Data Quality available for all Business Processes
Data quality functions were already being provided as services for Unix, Windows and Linux before the analysts at Gartner coined the term SOA. This architecture was chosen mostly for technical reasons. The implementation of service-oriented architectures now brings new and changed requirements for data quality services, and it also increases the opportunities and benefits which they can create.
All company and product names and logos used in this document are trade names and/or registered trademarks of the respective companies.
Uniserv GmbH • Rastatter Str. 13 • 75179 Pforzheim/Germany
T +49 7231 936-0 • F +49 7231 936-3002 • E info@uniserv.com • www.uniserv.com
© Copyright Uniserv • Pforzheim • All rights reserved.
The starting point for data quality services
To assess the importance of service-oriented architectures for the provision and use of data quality functions, it is useful to look first at the typical data quality functions themselves.
a) A classic application is the validation of a postal address on the basis of reference data, which includes street and place names and the dependencies of the postcodes on places, streets and house numbers. In contrast to simple database access, the correct address should also be found here if the input is incomplete or contains typing or hearing errors. The goal is a high matching accuracy, in order to be able to correct the greatest possible number of incorrect addresses automatically.
b) Another application is searching for duplicates in the in-house database. Here too the goal is to quickly and reliably identify a business object (a business partner, a product or a sales opportunity) in spite of incomplete, divergent or incorrect input, in order, first, to simplify the search and thereby increase the productivity of the users and, second, to prevent the creation of duplicates, i.e. multiple entries which refer to the same object in the real world. Consistent, complete and unambiguous mapping of the real objects in the database is achieved as a result.
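The two functions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reference table, the function names validate_street and is_probable_duplicate and the thresholds are hypothetical, and the generic string similarity from Python's standard difflib module stands in for whatever matching method a production data quality service actually uses.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical reference data: street names per postcode.
# Only "Rastatter Str." appears in this paper; the other entries are invented samples.
REFERENCE_STREETS = {
    "75179": ["Rastatter Str.", "Karlsruher Str.", "Wilferdinger Str."],
}


def validate_street(postcode, street, cutoff=0.6):
    """Return the closest reference street for the postcode, or None if nothing is close."""
    candidates = REFERENCE_STREETS.get(postcode, [])
    matches = get_close_matches(street, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def is_probable_duplicate(record_a, record_b, threshold=0.85):
    """Compare two records after normalizing case and whitespace."""
    a = " ".join(record_a.lower().split())
    b = " ".join(record_b.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A misspelled input such as "Rastater Str" is still mapped to the reference entry, and two records that differ only in case, spacing or punctuation are flagged as probable duplicates.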
An important goal in improving data quality is to prevent incomplete or incorrect data from being stored in the database. Possible problems should be detected at data entry and then cleaned up either automatically or manually by the user after appropriate feedback. Specialized search indices ensure that a search in databases with 1,000,000 to 100,000,000 data records normally requires only a fraction of a second, even with divergent spelling. However, such response times require intelligent caching of the indices. This can be provided very efficiently by implementing the software as a central service which is made available from an in-house server.
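The benefit of keeping a search index cached in a central service can be illustrated with a small in-memory index. The paper does not say how the specialized indices are built, so the trigram index below, the class name NameIndex and the similarity threshold are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Split a lower-cased, padded string into overlapping 3-character grams."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class NameIndex:
    """In-memory trigram index: built once, then reused for every request."""

    def __init__(self, records):
        self.records = list(records)
        self.postings = defaultdict(set)  # trigram -> ids of records containing it
        for record_id, name in enumerate(self.records):
            for gram in trigrams(name):
                self.postings[gram].add(record_id)

    def search(self, query, min_similarity=0.4):
        """Return (similarity, name) pairs, best first, tolerating spelling variants."""
        query_grams = trigrams(query)
        shared_counts = defaultdict(int)
        for gram in query_grams:
            for record_id in self.postings.get(gram, ()):
                shared_counts[record_id] += 1
        results = []
        for record_id, shared in shared_counts.items():
            record_grams = trigrams(self.records[record_id])
            # Jaccard similarity: shared trigrams divided by all distinct trigrams.
            similarity = shared / len(query_grams | record_grams)
            if similarity >= min_similarity:
                results.append((similarity, self.records[record_id]))
        return sorted(results, reverse=True)
```

A central service would build such an index once at start-up and keep it in memory, so that each request pays only for the lookup and not for constructing the index.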
Apart from the response time, integration in a wide range of environments has long played an important role in data quality services. Decoupling the data quality services from the service consumers and using them via a client/server protocol is important for reaching this goal. Data quality functions had therefore already been provided and used as a service, at least for more sophisticated applications, before service-oriented architectures were invented. As a result, it was possible both to meet the high requirements for response times and to guarantee the provision of the functions for a wide range of environments.
From the fat client to the 3-layer architecture
[Figure: fat client compared with layered architecture (client and application server). Both variants show an input form with the fields Role (Customer, Supplier, Reseller), Name, Street and Postcode.]
The typical architecture in the early days of the client/server world consisted of a database which, in addition to the storage
of transaction and master data, enabled asynchronous communication between different system components, thereby allowing
these components to be decoupled. Messages were written in the database by the sender and read there by the receiver. For
this purpose, however, regular polling of the respective table was required, in order to establish whether new unprocessed
messages were available. The business logic was mainly implemented in so-called fat clients. Tasks which were executed via
batch processes were implemented via background processes which accessed the database.
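The message exchange via a database table described above can be sketched as follows. The table layout and the function names are assumptions made for illustration; SQLite from the Python standard library stands in for whatever database product was used at the time.

```python
import sqlite3


def create_queue(conn):
    """Create the hypothetical message table shared by sender and receiver."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )


def send(conn, payload):
    """Sender side: write a message into the shared table."""
    conn.execute("INSERT INTO messages (payload) VALUES (?)", (payload,))
    conn.commit()


def poll_once(conn):
    """Receiver side: fetch all unprocessed messages and mark them as processed."""
    rows = conn.execute(
        "SELECT id, payload FROM messages WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for message_id, _ in rows:
        conn.execute("UPDATE messages SET processed = 1 WHERE id = ?", (message_id,))
    conn.commit()
    return [payload for _, payload in rows]
```

The receiver has to call poll_once repeatedly, for example in a loop with a pause between calls. A short pause loads the database, while a long pause delays message delivery, which is the drawback of polling noted above.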
Interactive functions for the validation of addresses or for the detection of duplicates directly at data entry were normally integrated in the graphical user interface. They were usually called via proprietary interfaces. Specifications such as DCE [1] or CORBA [2], whose goal was to standardize the interfaces for communication between distributed components, never moved beyond a niche existence.
This situation has fundamentally changed in the past few years. The starting point was mainly the establishment of standards within the framework of JEE [3] (Java Enterprise Edition), which resulted in high-performance implementations of these standards becoming available both as commercial products and as open source solutions.
For the Windows world, Microsoft followed with the development of .NET [4] as a language-independent platform and the .NET Enterprise Services. As a result, high-performance infrastructure software was available irrespective of the selected platform (JEE, .NET), making it possible to largely detach the business logic from the presentation layer and implement it in its own layer on the server. This also changed the requirements for data quality services, which were now called mainly from the business logic on the server side.
Simple integration in the application server, with either a JEE or a .NET architecture, now came to the fore.
[1] http://en.wikipedia.org/wiki/Distributed_Computing_Environment
[2] http://en.wikipedia.org/wiki/CORBA
[3] http://en.wikipedia.org/wiki/Java_Platform,_Enterprise_Edition
[4] http://en.wikipedia.org/wiki/.NET_Framework
SOAP as an Enabler for SOA
SOA does not become really effective until a standard protocol for the provision and use of services has been established. This gap was closed by the Web Service protocol SOAP [5]. It is supported by practically all middleware and infrastructure components, thereby enabling interoperability between service providers, middleware and service consumers. As a result, the provision of connectors for the use of proprietary protocols in proprietary middleware is no longer necessary. This lays the basis for the establishment of powerful middleware components. These include the concept of the Enterprise Service Bus [6] (ESB), which enables the loose coupling of different components and plays a significant role in the routing of messages, as well as engines which can directly execute workflows defined in a business process language (BPEL) [7].
SOAP-based Web Services have therefore developed into a central instrument for the provision of interactive data quality services in modern enterprise architectures. These mainly concern services which run according to the request/response pattern, in which the service consumer initiates a request, e.g. validate the specified address, and the service responds directly with a confirmation, a correction suggestion or a selection of possible correct addresses. This procedure corresponds to the interactive character of the validation, which should support the user directly at entry of a business object and offer options for intervention in the event of problems. However, this function is no longer implemented in isolation in the presentation layer but normally takes place in the context of a higher-level business process, e.g. the implementation of the ordering process in an e-business application, the implementation of a process for lead conversion and qualification in a CRM application or a comparable process. The correlation between the implementation of business processes and data quality functions immediately becomes obvious, and therefore also the contribution which they make to the success of the respective business process.
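The request/response pattern described above can be made concrete with a minimal sketch that builds a SOAP 1.1 request and reads a response. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name ValidateAddress and the element names Street, Postcode, City, Status and Suggestion are hypothetical and do not describe the interface of any actual product.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
DQ_NS = "http://example.com/dataquality"  # hypothetical service namespace


def build_validate_request(street, postcode, city):
    """Build a SOAP envelope for a hypothetical ValidateAddress operation."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{DQ_NS}}}ValidateAddress")
    for tag, value in (("Street", street), ("Postcode", postcode), ("City", city)):
        ET.SubElement(request, f"{{{DQ_NS}}}{tag}").text = value
    return ET.tostring(envelope, encoding="unicode")


def parse_validate_response(xml_text):
    """Extract the status and the suggested addresses from a hypothetical response."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_NS}}}Body")
    response = body.find(f"{{{DQ_NS}}}ValidateAddressResponse")
    status = response.findtext(f"{{{DQ_NS}}}Status")
    suggestions = [s.text for s in response.findall(f"{{{DQ_NS}}}Suggestion")]
    return status, suggestions
```

The status would distinguish the three outcomes named in the text: a confirmation, a single correction suggestion or a selection of possible correct addresses.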
[5] http://en.wikipedia.org/wiki/SOAP
[6] http://en.wikipedia.org/wiki/Enterprise_service_bus
[7] http://en.wikipedia.org/wiki/BPEL
Customer Cases
Orange
The customers of the telecommunications company Orange in France can contact the company via various channels. They can visit the Orange web portal on the Internet, call the Orange call center or visit the mobile phone shop of an Orange partner. However, irrespective of which contact channel the customer chooses, it must be ensured that the respective processes are executed with the same quality. An important component of this process quality consists in ensuring that customer addresses are correct.
The technical basis is the open source Java application server JOnAS. This JEE server is the central point for the provision of the services required to implement the business processes of Orange. The Uniserv data quality service for the validation, restructuring and standardization of customer addresses is also provided in this environment. This service-oriented approach makes it possible to use the same services in different processes and different channels. Irrespective of whether a new Orange customer is being created or the address of an existing customer is being changed, and irrespective of the contact channel through which a process is initiated, the underlying service-oriented architecture always ensures that the executed processes are configured consistently and can access the same services. It is thereby possible to guarantee a consistently high quality standard of the address data across the company.
WinGroup AG
The German WinGroup AG, a service network for sales and marketing, offers its customers an extensive range of services in the areas of call center, lettershop, dialog marketing and IT services. In order to guarantee a consistently high quality level of the underlying processes in all service areas and customer-specific applications, its subsidiary WinLogic has developed a service-oriented architecture based on an Enterprise Service Bus. This is the central middleware for connecting all applications in the company to the central services. The Uniserv address validation and duplicate check are integrated here to ensure data quality.
Lightweight REST: when SOAP is too unwieldy
Even if the SOAP-based protocol seems to be the ideal solution for integration in typical enterprise middleware, there are various applications where alternatives such as RESTful [8] Services are advantageous. This is particularly the case when data quality services are to be called directly from a presentation layer which is HTML/AJAX-based. Input aids which automatically complete a partial input by the user or offer suggestions for completing the input based on the partial input are a typical scenario for this case. Input aids are located in the presentation layer by their nature. However, they make considerably higher demands on the response time, since they are called more frequently during the input, usually after each input character.
Although the same services used for address validation lend themselves to address input in this implementation, SOAP-based Web Services are not always suitable for this application scenario on account of the overhead which the SOAP protocol entails. RESTful Web Services are extremely lean in comparison to SOAP-based Web Services. The call is made via the HTTP protocol, with the call arguments encoded in the URL. In the case of a call from JavaScript, the result is ideally output in the JSON format [9]. This yields JavaScript code which can be directly interpreted by the JavaScript interpreter of the browser, so that the result of the call is available directly as JavaScript objects. Services configured in this manner are ideally suited to typical Web 2.0 applications.
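The shape of such a RESTful call can be sketched briefly. The endpoint URL, the parameter names q, country and limit, and the layout of the JSON response are assumptions made for illustration. In a browser the call would be issued from JavaScript; Python is used here only to show how the arguments are encoded in the URL and how the JSON result maps directly onto native objects.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://dq.example.com/suggest"  # hypothetical endpoint


def build_suggest_url(partial_input, country="DE", limit=5):
    """Encode the call arguments as query parameters of the URL."""
    query = urlencode({"q": partial_input, "country": country, "limit": limit})
    return f"{BASE_URL}?{query}"


def parse_suggestions(json_text):
    """Turn the JSON response into a plain list of suggestion labels."""
    return [item["label"] for item in json.loads(json_text)["suggestions"]]
```

Compared with the SOAP envelope, the request is a single URL and the response needs no XML processing, which matters when the service is called after every keystroke.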
SOAP - the fine differences
A learning curve also has to be overcome when adapting proprietary interfaces to SOAP-based communication. The packets exchanged in SOAP-based communication are described in a metaformat, the so-called WSDL (Web Services Description Language) [10].
During the development of the WSDL, it must first of all be ensured without fail that the data types used are actually supported by all the target languages and systems in which the service is to be consumed. This aspect is relatively uncritical in a pure in-house development with a homogeneous software infrastructure, e.g. JEE or .NET. However, this criterion is essential for a data quality service which must be usable in a great variety of environments which may not even be known beforehand.
In addition to this, the SOAP specification offers two basic options for defining in the WSDL the linkage between the packet structure in XML and the constructs of the programming language which provides or interprets the packet. As the acronym suggests, the so-called rpc-style corresponds more to the conventional Remote Procedure Call (RPC) and models operations as method calls which do not differ from local calls. It is ideal when the Web Service is to be called from an object-oriented programming language such as Java or C#. The so-called document-style is more suitable for modelling complex contents as a result which is represented as an XML document with its own XML schema. On the one hand, validation against the XML schema using standard XML means is therefore possible; on the other, the result document can easily be further processed, transformed or prepared for the presentation layer using standard XML means or suitable frameworks. Both variants have their advantages and disadvantages. Both variants may also have to be made available for services which need to be callable from any context.
Web Services are stateless by their nature. In the ideal case, this means no partial results; instead the overall result is output as a result of the call, and two successive calls are totally independent of each other.
Web Services must be scalable. A prerequisite for this is the statelessness just described. In addition to this, the Web Service should access global resources as little as possible or not at all. If such access is necessary, however, the administration and synchronization of the access to these resources should never be implemented ad hoc for the respective Web Service. Resource pools such as are offered by most application servers or exist as open source extensions should be used instead. Effective and configurable pooling is thereby enabled.
The last two points in particular normally require at least a partial redesign if existing functionality is to be made available as a Web Service.
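The idea of routing all access to a global resource through a pool can be illustrated with a minimal sketch. In line with the advice above, a real service would use the pool provided by its application server rather than write its own; the hand-written ResourcePool class, the lookup_street function and the use of a plain set as the pooled "reference data handle" are assumptions that serve only to show the mechanism.

```python
import queue
from contextlib import contextmanager


class ResourcePool:
    """Fixed-size pool: callers wait until one of the shared resources is free."""

    def __init__(self, factory, size):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(factory())

    @contextmanager
    def acquire(self, timeout=None):
        # Raises queue.Empty if no resource becomes free within the timeout.
        resource = self._free.get(timeout=timeout)
        try:
            yield resource
        finally:
            # Always hand the resource back, even if the caller raised an error.
            self._free.put(resource)


def lookup_street(pool, street):
    """Request handler: borrows a resource for the duration of one call only."""
    with pool.acquire() as reference_data:
        return street in reference_data
```

Because every call borrows and returns its resource, the number of concurrent accesses is bounded by the pool size, which can be configured without touching the handler.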
[8] http://en.wikipedia.org/wiki/REST
[9] http://en.wikipedia.org/wiki/JSON
[10] http://en.wikipedia.org/wiki/Web_Services_Description_Language
SOA blurs the distinction between on premise and on demand
The provision of software services and applications via the Internet, referred to by the buzzwords Software on Demand, Software as a Service [11] or Cloud Computing, is a theme which is growing in importance. In many cases, locally installed software (on premise) and software used via the Internet (on demand) are portrayed as fundamentally and profoundly opposed. This is not the case from the SOA perspective, because within the framework of a service-oriented architecture it is not critical how the respective service is provided. The provision of a service via the Internet as an alternative to a locally installed server presents interesting new application possibilities, particularly in the area of data quality services which carry out validations and corrections by matching and merging against reference data.
The reference data required for the service must be regularly updated. This means both regularly recurring manual work and regular subscription charges which have to be paid to the data provider. In the case of country-specific postcode directories, this work and expense are incurred for each country for which addresses are checked. As a result, such solutions are worthwhile only for larger quantities of data. This restriction does not apply if the service is provided as a SaaS offering in which invoicing is based exclusively on the executed transactions. When used in a service-oriented architecture, locally installed services and services used via the Internet can also be combined or exchanged as required. As a result, an optimum solution, which can also be flexibly adapted later to changed conditions, can be found for the respective user.
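That the consumer of a service need not care how the service is provided can be shown with a small sketch. The interface AddressValidator, the two implementations and the function register_customer are hypothetical names; the remote variant receives its transport as a function, so that the sketch stays runnable without a network connection.

```python
from typing import Callable, Iterable, Protocol


class AddressValidator(Protocol):
    """Service contract seen by the business process."""

    def validate(self, address: str) -> bool: ...


class LocalValidator:
    """On-premise variant: reference data is installed and maintained locally."""

    def __init__(self, reference: Iterable[str]):
        self._reference = set(reference)

    def validate(self, address: str) -> bool:
        return address in self._reference


class RemoteValidator:
    """On-demand variant: delegates each call to a hosted service via a transport."""

    def __init__(self, transport: Callable[[str], bool]):
        self._transport = transport

    def validate(self, address: str) -> bool:
        return bool(self._transport(address))


def register_customer(validator: AddressValidator, address: str) -> str:
    """Business process step that depends only on the service contract."""
    return "accepted" if validator.validate(address) else "needs review"
```

Switching from the local to the hosted variant, or combining both, then changes only which object is passed to the process, not the process itself.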
[11] http://en.wikipedia.org/wiki/Software_as_a_Service
Conclusions
The following conclusions can be drawn from the experience described above, in answer to two questions:
1. What are the requirements for data quality functions which are to be used in an SOA environment?
The functions must be accessible via SOAP as Services. Integration in practically any infrastructure for the service-oriented implementation of the business processes is thereby guaranteed. However, care should be taken in the details that the XML elements of the service description can be mapped to the corresponding constructs of the respective server environment.
If use in the presentation layer is foreseeable, it must be checked whether a RESTful Service implementation is necessary.
It should be checked whether an alternative use scenario, in which the service is provided via the Internet, provides commercial or technical advantages, and whether the service provider supports such a scenario.
2. Which aspects have to be considered if functions for the validation, enhancement and processing of data are to be made SOA-capable?
The use scenarios must be clearly defined. The question as to whether the services are applied within the framework of the business logic or the presentation logic is particularly important. In the first case, integration takes place in the application server, the Enterprise Service Bus or a BPEL engine; in the second case, in a graphical user interface, which can in turn make its own demands. The decision as to whether SOAP-based and/or RESTful services are more appropriate depends on this.
The target systems and languages in which the service is to be used must be specified. The decision on the degree of complexity of the modelling with respect to the XML structures used and the XML data types of the call results depends on this.
The service must be designed so that it is stateless, i.e. it must function without storing an internal state between two calls. If a state between the calls is required, it must be transferred with the follow-up call or made persistent in a suitable manner.
The scalability of the service must be ensured: if the Web Service requires global resources, e.g. a database connection, these should be administered by means of a suitable resource pool in the server container. Otherwise, these resources quickly become a bottleneck which prevents genuine scalability of the service.
The service should be Internet-capable, i.e. it should be irrelevant for the functionality of the service whether it is provided in the local network or via the Internet. The possible applications are extended enormously as a result.
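The requirement that any state needed between calls be transferred with the follow-up call can be illustrated with a paged result list. The function name list_candidates, the field names items and next_cursor and the page size are assumptions made for this sketch.

```python
def list_candidates(candidates, cursor=0, page_size=2):
    """Return one page of results plus the cursor for the follow-up call.

    The service keeps no record of earlier calls: the client sends the
    cursor back, so any server instance can answer the next request.
    """
    end = cursor + page_size
    page = candidates[cursor:end]
    next_cursor = end if end < len(candidates) else None
    return {"items": page, "next_cursor": next_cursor}
```

Because the position in the result list travels with the request, two successive calls remain independent of each other from the point of view of the service.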