On an ever smarter, more instrumented and interconnected planet, the volume of information is exploding. There is no quality decision-making without reliable, relevant information reaching the right person at the right time. At the Tendances Logicielles New Intelligence sessions, Dan Benouaisch, IBM, developed these concepts and presented the IBM InfoSphere offering that addresses these imperatives.
4. The IBM InfoSphere solution, end to end [architecture diagram]: operational source systems (structured and unstructured data) feed Information Server (data integration, data quality, data delivery, federated data), InfoSphere MDM Server and InfoSphere Warehouse (cubing services, multidimensional analysis, data mining), accelerated by the IBM Industry Models; common metadata (data glossary, common definitions) spans the stack and is exposed through SOA web services to Cognos, spreadsheets and applications.
9. BUILD OR BUY: THE FINDINGS. Build, provided you can justify it; buy, but by finding the right compromise. The choice has an impact on implementation time and cost, and on openness and flexibility. “It costs 7 to 10 times more to develop a function as custom code than to use its equivalent in a packaged application” (GIGA GROUP). “Our studies show that the cost of ownership of custom software exceeds that of packaged software by 40%” (GARTNER). Surveyed companies split between “packaged software by default”, “packaged software systematically”, “case by case depending on the project”, “depending on cost”, “package adapted to our processes or business lines” and a mixed approach; respondents also rated whether integration is easier, and total cost higher, with a package or with custom development. Source: Forrester (study of 25 large European accounts), AMR and Gartner.
10. A methodology matched to your business challenges: Time to Value.
Discover – Do your data sources contain the information you think they do? Which sources should be used for this project? Does your data mean what you believe it means?
Standardize – How do you bring together records with the same meaning? Can you correct and improve the quality of your data?
Transform & Deliver – Can you give the data a meaning for its end users? Can you synchronize data between systems? Can you deliver and update data in real time? Can your data be delivered based on events or on its content?
Federate – How do you access data from heterogeneous sources in a transparent, efficient and simple way?
11. Your information integration projects: high-performance execution whatever the data volume; a single platform, a single tool: the Information Server; extended connectivity to applications, data and content.
Understand – map, define, discover and model information, and master its quality and structure.
Cleanse – standardize, merge and correct information.
Transform – transform, enrich, move and synchronize information.
Federate – virtualize and simplify access to information.
Deploy integration logic as services; manage all your metadata in a single, simple way.
13. IBM Information Server – delivering information you can trust [architecture diagram]: Understand (Information Analyzer), Cleanse (QualityStage), Transform (DataStage), Federate (Federation Server), with Business Glossary, Information Services Director, Metadata Server and Metadata Workbench, all built on parallel execution and connectivity to applications, data and content.
23. The data cleansing process: 1. Standardize, 2. Match, 3. Consolidate – producing consolidated views of customers, transactions, vendors/suppliers and products/materials from the target source data.
24. An example of “dirty” data. How do you identify and consolidate data when the number of records runs to several million, or billion, records?
90328574 | IBM                   | 187 N.Pk. Str.  | Salem   | NH | 01456 |  8,494.00
90328575 | I.B.M. Inc.           | 187 N.Pk. St.   | Salem   | NH | 01456 |  3,432.00
90238495 | Int. Bus. Machines    | 187 No. Park St | Salem   | NH | 04156 |  2,243.00
90233479 | International Bus. M. | 187 Park Ave    | Salem   | NH | 04156 |  5,900.00
90233489 | Inter-Nation Consults | 15 Main Street  | Andover | MA | 02341 |  6,800.00
90234889 | Int. Bus. Consultants | PO Box 9        | Boston  | MA | 02210 | 10,243.00
90345672 | I.B. Manufacturing    | Park Blvd.      | Bostno  | MA | 04106 | 15,999.00
No common keys; anomalies; translation errors; no standards.
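At that scale an all-pairs comparison is impractical (n records mean on the order of n²/2 comparisons), so matching engines typically narrow the search with blocking keys before scoring candidates. A minimal Python sketch of the idea, using an illustrative and deliberately crude key (normalized name prefix plus zip code):

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Crude blocking key: first three letters of the normalized name
    plus the zip code. (Illustrative only; production engines combine
    several passes with different keys.)"""
    name, zip_code = record
    norm = name.replace(".", "").replace(" ", "").upper()
    return (norm[:3], zip_code)

def candidate_pairs(records):
    """Compare only records that share a blocking key, shrinking the
    O(n^2) comparison space to small per-block comparisons."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    ("IBM", "01456"),
    ("I.B.M. Inc.", "01456"),
    ("Int. Bus. Machines", "04156"),
]
pairs = list(candidate_pairs(records))   # 1 candidate pair instead of 3
```

On this sample only the two IBM variants land in the same block, so one candidate pair is scored instead of all three; real deployments run several passes with different keys so that true duplicates split by one key are caught by another.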
25. Step 1: Standardization (address example). Starting from the free-form input “melle Morognier Françoise, 3 bis, r. de Paris, 72000 Le Mans”:
Lexical analysis – determine the business meaning of each component.
Contextual analysis – identify the variable structure of the data and its meaning: 3 | BIS | R. | DE | PARIS.
Decomposition (FRADDR rule set) – house number 3, repetition index BIS, street type RUE, street name DE PARIS.
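The same parse-then-classify idea can be sketched in a few lines of Python. The abbreviation table and token classes below are invented stand-ins for a rule set such as FRADDR, not its actual contents:

```python
import re

# Illustrative stand-ins for a standardization rule set: an abbreviation
# dictionary and a set of French "repetition index" tokens.
ABBREV = {"R.": "RUE", "R": "RUE", "AV.": "AVENUE", "BD": "BOULEVARD"}
REPETITION = {"BIS", "TER", "B", "T"}

def standardize_street(line):
    """Tokenize a free-form street line and normalize each token,
    e.g. '3 bis, r. de Paris' -> 3 | BIS | RUE | DE | PARIS."""
    tokens = re.findall(r"[^\s,]+", line.upper())
    out = []
    for tok in tokens:
        if tok in ABBREV:
            out.append(ABBREV[tok])        # expand street-type abbreviation
        elif tok in REPETITION:
            out.append("BIS" if tok.startswith("B") else "TER")  # normalize repetition index
        else:
            out.append(tok.rstrip("."))    # keep token, drop trailing dot
    return out

parsed = standardize_street("3 bis, r. de Paris")
# -> ['3', 'BIS', 'RUE', 'DE', 'PARIS']
```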
26. Step 1: Standardization (product example) [table]: free-text tyre descriptions (“Pneu Energy Serie Audi A4 TDI 115ch 2.0”, “Pneu Pilot Primacy 205/55R16 Audi A4 91/H”, “Pneu Exalto Option AudiA4 130ch 2.2”, “Pneu Pilot Sport Serie Audi A4 TDi quattro 2.5”, …) are decomposed into structured attributes: description, car model (Audi A4 TDI variants, 2.0 to 2.5), fitment (standard or option), load/speed index (91/H, 91/V, 91/Y), dimension (195/65R15, 205/55R16, 225/45ZR17) and front/rear pressure.
27. Step 2: Matching. Example: comparing “ALEXANDRE | J | DEMARIA | DG” with “ALEXANDRE | JEAN | DEMARA | DG” – the per-field agreement weights sum to a composite score of 23. The score of a weight is a relative measure of match probability. The CUTOFFs are the scores above and below which a pairing is considered good or not. [Histogram: number of pairs by score, from -50 to +60, separating non-matched from matched pairs.]
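A minimal sketch of weight-based scoring with two cutoffs follows. The field weights and thresholds are invented for illustration; in practice they are derived statistically from the data:

```python
# Invented field weights and cutoffs; real engines derive weights from
# log-likelihood ratios and tune cutoffs on sample match results.
WEIGHTS = {"first_name": 7, "middle_name": 1, "last_name": 10, "title": 5}
UPPER_CUTOFF = 20   # at or above: automatic match
LOWER_CUTOFF = 10   # below: automatic non-match; in between: clerical review

def score(rec_a, rec_b):
    """Sum per-field agreement weights into a composite score."""
    total = 0
    for field, weight in WEIGHTS.items():
        a, b = rec_a.get(field, ""), rec_b.get(field, "")
        if a and b and a == b:
            total += weight          # exact agreement: full weight
        elif a and b and a[0] == b[0]:
            total += weight // 2     # partial agreement (e.g. initial vs full name)
    return total

def classify(total):
    if total >= UPPER_CUTOFF:
        return "match"
    if total >= LOWER_CUTOFF:
        return "clerical review"
    return "non-match"

a = {"first_name": "ALEXANDRE", "middle_name": "J", "last_name": "DEMARIA", "title": "DG"}
b = {"first_name": "ALEXANDRE", "middle_name": "JEAN", "last_name": "DEMARA", "title": "DG"}
verdict = classify(score(a, b))   # score 17 -> "clerical review"
```

With these illustrative weights the DEMARIA/DEMARA pair scores 17 and falls between the two cutoffs, into the clerical-review zone – exactly the kind of borderline case the cutoffs are designed to isolate.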
28. Probabilistic scoring improves quality. The decision tables of the classical (deterministic) method apply the same rules whatever the actual content. The probabilistic method, by contrast, takes the intrinsic difference between values into account: a rare name (“YUSKA”) and longer strings compensate for missing or questionable fields. Illustration in this household detection: the deterministic pattern “ABBCB” says non-match for both pairs (an error on the second!), while the probabilistic algorithm scores 24 > 21 = match.
      | L-Name | Hse# | Street | Apt# | Zip   | Pattern   | Weights
Rec-1 | SMITH  | 123  | BEECH  | 18A  | 02112 |           |
Rec-2 | SMITH  | 132  | BEACH  | 18   | 02111 | A B B C B | 5+2+7+1+4 = 19
Rec-3 | YUSKA  | 5401 | VETCH  | 818A | 02112 |           |
Rec-4 | YUSKA  | 5410 | VEECH  | 81A  | 02111 | A B B C B | 7+3+8+2+4 = 24
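The “rare values count for more” principle can be sketched with a frequency-based agreement weight, in the spirit of probabilistic record linkage. The reference population below is invented for illustration:

```python
import math
from collections import Counter

# Invented reference population: a rare surname like "YUSKA" should earn
# a higher agreement weight than a common one like "SMITH".
surnames = ["SMITH"] * 500 + ["JONES"] * 300 + ["BROWN"] * 198 + ["YUSKA"] * 2
counts = Counter(surnames)
total = len(surnames)

def agreement_weight(value):
    """Rarer value -> higher weight: -log2 of the value's relative frequency."""
    return -math.log2(counts[value] / total)

w_smith = agreement_weight("SMITH")   # common name, low weight (1.0)
w_yuska = agreement_weight("YUSKA")   # rare name, high weight (~9.0)
```

An agreement on “YUSKA” thus contributes roughly nine times the evidence of an agreement on “SMITH”, which is how a rare name can push a pair over the match cutoff despite disagreeing fields elsewhere.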
31. An example of harmonization (products).
INPUT DATA – operation work instructions in a free-text field:
WNG ASSY DRL 3 HOLE USE HEXBOLT ¼ INCH
WING ASSEMBY, HEX BOLT .25” - DRILL FOUR, USE 5J868-A
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EACH ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
STANDARDIZATION – the free text is parsed into Assembly | Instruction | QTY | Part Type | Size | Unit of Measure | SKU:
WING   | DRILL | 3 HOLES | HEXBOLT | .25 | INCH |
WING   | DRILL | 4       | HEXBOLT | .25 | INCH | 5J868A
WING   | DRILL | 4 HOLES | HEXBOLT | .25 |      | 5J868A
RUDDER | TAP   | 6 HOLES | RIVET   | 10  | CM   | KL2301
MATCHING – the three WING rows are recognized as descriptions of the same operation.
CONSOLIDATION – one surviving record per operation:
WING   | DRILL | 4 HOLES | HEXBOLT | .25 | INCH | 5J868A
RUDDER | TAP   | 6 HOLES | RIVET   | 10  | CM   | KL2301
34. Data quality management: performance & scalability
35. More than 500 customers in France [customer logo wall]: banking & insurance, communications & services, industry, the public sector, retail, and major SAP accounts.
43. Thank You – Merci – Grazie – Gracias – Obrigado – Danke [closing slide repeating “thank you” in Japanese, French, Russian, German, Italian, Spanish, Brazilian Portuguese, Arabic, Traditional and Simplified Chinese, Hindi, Tamil, Thai and Korean]
IBM has assembled a portfolio specifically designed to help organizations deal with the challenges of fragmented information. This portfolio, called InfoSphere, accelerates the delivery of trusted information throughout an organization. The portfolio accelerates client value and reduces risk in critical information projects. There are four primary parts to the portfolio. At the foundation is the InfoSphere Information Server, which specializes in integrating data across a heterogeneous landscape and delivering complete and accurate information when and where it is needed. A common target of this data is InfoSphere MDM, which manages a master view of key data elements like customer, product, account, and location over time. InfoSphere Warehouse provides a foundation for enormously scalable data warehouses, with key partitioning, mining, and cubing features to maximize the value of information. And providing acceleration for all of these are the IBM Industry Models, which contain industry-centric domain knowledge to help organizations achieve better results faster. Each part of the portfolio enjoys a market leadership position and stands alone in its value, but IBM is also investing in making the pieces work better together – helping companies who choose multiple parts to leverage deep synergies to further accelerate value.
TDWI – The Data Warehousing Institute – has done some recent studies regarding data quality problems. It’s often easier to understand bad data if you identify the source – how it got into the system in the first place. Based upon 266 respondents who were able to select multiple items, they found that…
IBM recognized this challenge – which is why we’ve created the WebSphere Information Integration Platform. The IBM WebSphere Information Integration platform enables businesses to perform five integration functions:
- Connect to any data or content, wherever it resides
- Understand and analyze that information, including its meanings, relationships, and lineage
- Cleanse it to assure its quality and consistency
- Transform it to provide enriched and tailored information
- Federate it to make it accessible to people, processes, and applications
Underlying these functions is a common metadata and parallel processing infrastructure that provides leverage and automation across the platform. Each product in the portfolio also provides connections to many data and content sources, and the ability to deliver information through a variety of mechanisms. Additionally, these functions can be leveraged in a service-oriented architecture through easily published shared services. The IBM WebSphere Information Integration platform provides:
- access to the broadest range of information sources
- the broadest range of integration functionality, including federation, ETL, in-line transformation, replication, and event publishing
- the most flexibility in how these functions are used, including support for service-oriented architectures, event-driven processing, scheduled batch processing, and even standard APIs like SQL and Java.
The breadth and flexibility of the platform enable it to address many types of business problems and meet the requirements of many types of projects. This optimizes the opportunities for reuse, leading to faster project cycles, better information consistency, and stronger information governance. How does Information Integration fit into an SOA?
Regarding Service-Oriented Architectures, information integration enables information to be made available as a service, publishing consistent, reusable services for information that make it easier for processes to get the information they need from across a heterogeneous landscape.
Cleansing is the process of cleaning up these sorts of problems. Within IBM Information Server, WebSphere QualityStage is a product module that helps to identify and resolve all five of those types of issues, for any type of data. It provides data quality functions on an easy-to-use, design-as-you-think flow diagram. This allows data quality to be embedded in any information integration process. The quality functions include:
- free-form text investigation – allowing you to recognize and parse out individual fields of data from free-form text,
- standardization – allowing individual fields to be made uniform according to your own standards,
- address verification and correction – which uses postal information to standardize, validate, and enrich address data,
- matching – which allows duplicates to be removed from individual sources, and common records across sources to be identified and linked, and
- lastly, survivorship – which allows the best data from across different systems to be merged into a consolidated record.
The true power of QualityStage is in its ability to match data from different records, even when it appears very different. The design of these matching rules is very important, since it determines which records will be brought together. These match rules are designed using a visual, business-centric interface, providing instant feedback on match rule changes to allow the rules to be fine-tuned quickly and easily. Because of this ability to match records, QualityStage is a key enabler of creating a single view of customers or products.
Silver Bullets:
- Provides the most intuitive and productive visual quality design capability on the market, allowing quality logic to be fine-tuned with actual data samples and incorporated as a seamless component of data flows (single engine, single user interface, single meta-model across ETL and Quality)
- Works across any data type (including product and customer data)
- Uses probabilistic matching to ensure a 2-4% better match result
- Allows quality logic to be easily deployed as shared services within an SOA to ensure consistent enterprise reuse of quality logic
- Leverages the scalability of the platform’s parallel processing services
So once records are matched together, what you decide to do with that information is completely up to the business. We discussed clerical review. Some organizations like every potential match to be reviewed (particularly for things like bank accounts). However, in most cases the automated match results can be employed.
When a match is found, records can be linked together, using a cross-reference table that stores the identifiers of each record, and potentially enough additional information to allow that table to act as a matching base for future records. When record linkage is employed, a merged record is not stored anywhere, but is rather assembled from the various sources when needed.
Survivorship can be employed when a complete master record is desired. Survivorship uses business-defined rules to determine how to build a record that merges the best information from each source. For example, you may have a natural preference for one source, since it is typically more reliable, so by default its data should be used, unless it is missing data elements, in which case alternative sources could be used. Survivorship creates a complete, merged, “gold copy” of data across systems – this is often used to load master data management systems like WebSphere Customer Center or WebSphere Product Center.
Whichever mechanism you choose, you may wish to go back and correct source systems with information from other linked records that are more complete, or from the gold copy. In some cases, organizations don’t like to change the original values, so they append this new information in additional fields. All of this is dependent on the business requirements and can be adjusted according to the need.
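A minimal sketch of the source-preference survivorship rule described above, with an assumed trust order and invented source names:

```python
# Assumed trust order and invented source names; the rule sketched here is
# "field by field, take the first non-empty value from the most trusted source".
SOURCE_PRIORITY = ["crm", "billing", "legacy"]

def survive(records):
    """records: {source_name: {field: value}} -> merged 'gold copy' record."""
    fields = {f for rec in records.values() for f in rec}
    gold = {}
    for field in sorted(fields):
        for source in SOURCE_PRIORITY:
            value = records.get(source, {}).get(field)
            if value:                    # first non-empty value wins
                gold[field] = value
                break
    return gold

records = {
    "crm":     {"name": "IBM", "phone": ""},
    "billing": {"name": "I.B.M. Inc.", "phone": "555-0101", "zip": "01456"},
}
gold = survive(records)
# name survives from crm (the preferred source); phone and zip are filled from billing
```

Here the name survives from the preferred crm source while the missing phone and zip are filled from billing; production rules can also prefer the most recent, longest, or most complete value per field.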
Able to alter the number of processors without altering the code
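That idea – the degree of parallelism is a runtime setting, not part of the job logic – can be illustrated with a trivially parallel standardization step, where the worker count changes but the code does not:

```python
from concurrent.futures import ThreadPoolExecutor

def standardize(name):
    """Stand-in for any per-record transformation step."""
    return name.replace(".", "").upper().strip()

def run(records, workers=4):
    # The worker count is a runtime parameter: the same job logic runs
    # unchanged on 1 worker or on many, analogous to configuration-driven
    # partitioned parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(standardize, records))

cleaned = run(["I.B.M. Inc.", " ibm "], workers=2)   # -> ['IBM INC', 'IBM']
```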