Here’s something interesting I thought I would share with you. You can’t talk about big data without talking about the internet. The statistics on what happens online in 60 seconds amaze me, and I’m sure they amaze you as well. Almost 700,000 status updates on Facebook? And half of those 168 million emails are pure junk mail. Look at the volume of data. Who said that the P&L analysis from the transactional database of the financial module is the only thing I need to know as a manager? But how can we make this data relevant to our business? What do we do with all of it? Well, we can analyze it for patterns, understand customer sentiment, build better advertising, and optimize our business operations. And companies are doing just that now.
But how do you get at this exceptional value? … Who said that the data about your customers in CRM systems is the best data about them, or the number of orders that passed through your service bus, or the number of records in your database? What about this other data? Let’s take an example. Insurance companies are starting to collect data on driving habits through a sensor plugged into your car. The device tracks your driving data and shares it with the insurance company. It looks for things like gentle braking, the miles you drive, and the hours during which you drive them. That is a lot of data, so how does it challenge your current view of infrastructure and skills? Another example is understanding whether a positive or negative social media comment comes from one of your best or worst customers.
How do we use data? Let’s forget for a second that we are IT experts. Then someone has to explain to us what a transactional database is and what document and content management are. The key thing in between is data integration. DIS bridges the gap between how we use data and where we store it. Architecturally, the critical component that breaks the divide is the integration layer. It needs to extend across all of the data types and domains, and bridge the gap between the traditional and the new data acquisition and processing frameworks. As we outlined on the previous Integrated Architecture slide, consolidating these integration technologies is key to addressing your big data requirements, where you need to correlate data and organize it using standard tools, as opposed to creating separate islands, one for structured and one for unstructured data.
03/22/12 16:26 Copyright Oracle Corporation 2012. All rights reserved. EA AND BIG DATA: Data Integration Webcast
We need to understand the whole picture of data integration around transactional data for two reasons:
First, we cannot suddenly move away from transactional data. Second, the real value of big data is the correlation between transactional and unstructured data. You need to look for tools that allow you to integrate Hadoop / MapReduce with your data warehouse and transactional data stores in a bi-directional manner. You need to be able to load the “nuggets of valuable information” from big data processing into your data warehouse for further analysis. You also need the ability to access your structured data, such as customer profile information, while you process your big data to look for patterns and detect fraudulent activities.
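To make the correlation idea concrete, here is a minimal Python sketch. It assumes the big data side has already been reduced to one sentiment score per comment and linked to a customer id. The customer records, the scores, and the function names are all hypothetical and only illustrate joining structured profile data with the output of big data processing.

```python
from collections import defaultdict

# Hypothetical structured data, as it might come from a transactional store.
customers = {
    "C001": {"name": "Alice", "annual_spend": 12000},
    "C002": {"name": "Bob", "annual_spend": 300},
}

# Hypothetical output of big data processing: one sentiment score per comment,
# already linked to a customer id (-1.0 is negative, +1.0 is positive).
comment_scores = [
    ("C001", -0.8),
    ("C001", -0.4),
    ("C002", 0.9),
]


def average_sentiment(scores):
    """Aggregate raw comment scores into one average per customer."""
    grouped = defaultdict(list)
    for customer_id, score in scores:
        grouped[customer_id].append(score)
    return {cid: sum(vals) / len(vals) for cid, vals in grouped.items()}


def unhappy_high_value(profiles, scores, min_spend=1000, max_sentiment=0.0):
    """Join both data sets and keep high-spend customers with negative sentiment."""
    sentiment = average_sentiment(scores)
    return [
        (cid, profile["name"], sentiment[cid])
        for cid, profile in profiles.items()
        if cid in sentiment
        and profile["annual_spend"] >= min_spend
        and sentiment[cid] < max_sentiment
    ]


flagged = unhappy_high_value(customers, comment_scores)
```

Only the small aggregated result (the "nuggets") would need to travel to the warehouse, while the profile lookup shows the opposite direction: structured data consulted during big data processing.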
Oracle’s data integration offering takes a complete and best-of-breed approach to your big and small data. What distinguishes us at Oracle is performance, combined with lower cost of ownership, ease of use, and reliability for our customers. For big data projects we also enable faster time to value with our interfaces and design tools.
Very often both ETL and DQ are required (ODI for extracting from complex data sources and for complex non-semantic transformations). It is not the size or complexity of the transformation that differs, but its nature: is it semantic or not? (Do we have to go into the fine details of relations between words and between characters?) Does it need complex matching or not?
Data quality problems arise because of a lack of standards, or different standards across different sources, and because of missing values and typos. Only robust matching features solve duplicate issues. Poor data is quite often not trapped by IT departments that rely on custom DQ and data integration processes, for several reasons. Generally, data is not checked against all business rules and standards, so it passes through to the final data target. It is also very costly to build and maintain a custom DQ framework alongside a custom data integration framework. Most of the time it is a business issue that raises the flag about poor data quality. And sometimes it is too late… From the business point of view, this means wrong financial summaries and reports, doubled expenses, delays in delivery, etc. From the IT point of view, correcting and restoring the original good data takes a long time and requires expensive controls, with resources redirected to solve the issue in a context of urgency.
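The duplicate problem can be illustrated with a small Python sketch. It uses the standard library's difflib.SequenceMatcher as a simple stand-in for a matching engine and is not how Oracle Enterprise Data Quality works internally. The three address strings are taken from the sample customer records on the data quality slide, while the 0.85 threshold and the function names are assumptions chosen for the illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Addresses from three of the sample customer records on the slide.
records = [
    ("OX80306", "14 Oxbridge Way, Milfrod NH"),
    ("RD48107", "14 Oxbridge Wy, Milford NH"),
    ("AD23298", "9407 Main St, Fairfax VA"),
]


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace before comparing."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(cleaned.split())


def similarity(left, right):
    """Return a similarity ratio between 0.0 and 1.0 for two address strings."""
    return SequenceMatcher(None, normalize(left), normalize(right)).ratio()


def likely_duplicates(rows, threshold=0.85):
    """Compare every pair of records and keep those above the threshold."""
    return [
        (id_a, id_b)
        for (id_a, addr_a), (id_b, addr_b) in combinations(rows, 2)
        if similarity(addr_a, addr_b) >= threshold
    ]


pairs = likely_duplicates(records)
```

An exact comparison would miss this pair because of the typo (Milfrod) and the abbreviation (Wy), which is why tolerant matching matters. Pairwise comparison grows quadratically with the number of records, so a production matcher would need a smarter strategy than this sketch.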
Packaged processes include steps for common quality tasks (e.g. providing values for incomplete data, resolving conflicts between duplicate records, specifying custom rules for merging records, profiling, auditing, etc.). Pre-built templates are available for single customer view projects and for customer screening, plus packaged "processors" for Investment Banking, Basel II, and Solvency II.
This is a build slide that illustrates the steps one goes through to take a piece of an address through the entire process. The important items to notice are:
• The incomplete parts of the address are completed: St is added to Berry, Unit to 1210.
• Inconsistent or potentially incorrect pieces are corrected and standardized: California becomes CA.
• Missing pieces are added: the missing zip code / post code is inserted. Loqate also changes 5-digit Zip Codes (US) to Zip+4 format.
• The address is placed in the correct format for the appropriate country, in this case the US.
• The geocode is assigned to the address.
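The steps above can be sketched in Python as follows. The input string and the resulting values come from the address verification slide, but the lookup tables are a toy stand-in for a reference repository, and the function and table names are hypothetical. Nothing here represents the Loqate or EDQ API.

```python
import re

# Toy reference data standing in for a global knowledge repository.
STREET_SUFFIXES = {("Berry", "San Francisco"): "St"}
LOCALITIES = {"SF": "San Francisco"}
AREAS = {"California": "CA"}
REFERENCE = {
    ("300", "Berry St", "San Francisco", "CA"): {
        "PostCode": "94158-1670",
        "Latitude": 37.775837,
        "Longitude": -122.39557,
    }
}

# Step 1: extract the pieces of a "<number> <street> #<unit> <city> <state>" address.
PATTERN = re.compile(
    r"^(?P<premise>\d+)\s+(?P<street>[A-Za-z]+)\s+#(?P<unit>\d+)"
    r"\s+(?P<locality>[A-Za-z]+)\s+(?P<area>[A-Za-z]+)$"
)


def verify_address(raw):
    """Parse, complete, standardize and geocode one address string."""
    match = PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized address layout: {raw!r}")
    parts = match.groupdict()

    # Step 2: check the pieces against reference data and standardize them.
    locality = LOCALITIES.get(parts["locality"], parts["locality"])
    area = AREAS.get(parts["area"], parts["area"])
    suffix = STREET_SUFFIXES.get((parts["street"], locality))
    street = f"{parts['street']} {suffix}" if suffix else parts["street"]

    result = {
        "PremiseNumber": parts["premise"],
        "ThoroughfareName": street,
        "SubPremise": f"Unit {parts['unit']}",
        "Locality": locality,
        "AdministrativeArea": area,
    }
    # Steps 3 and 4: add the missing post code and the geocode when known.
    result.update(REFERENCE.get((parts["premise"], street, locality, area), {}))
    return result


verified = verify_address("300 Berry #1210 SF California")
```

The sketch skips transliteration and handles only one address layout. A real service covers many countries and formats.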
Oracle Data Integrator (ODI) Application Adapter for Hadoop provides native Hadoop integration within ODI. It includes specific ODI Knowledge Modules optimized for Hive and Oracle Loader for Hadoop. The knowledge modules can be used to build Hadoop metadata within ODI, load data into Hadoop, transform data within Hadoop, and load data easily and directly into any Hadoop environment. Hadoop implementations, whether on HDFS or NoSQL, normally require complex Java MapReduce code to be written and executed on the Hadoop cluster. With ODI and the ODI Application Adapter for Hadoop, developers use a graphical user interface to create these programs instead. ODI generates optimized HiveQL, which in turn generates native MapReduce programs that are executed on the Hadoop cluster.
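The idea of generating HiveQL from a higher-level design, rather than hand-writing MapReduce code, can be sketched in a few lines of Python. This is only a toy illustration: the mapping dictionary, the table and column names, and the generate_hiveql function are all hypothetical and do not reflect how ODI or its knowledge modules are implemented.

```python
def generate_hiveql(mapping):
    """Render a declarative mapping as one HiveQL INSERT ... SELECT statement."""
    select_list = ",\n       ".join(
        f"{expression} AS {column}"
        for column, expression in mapping["columns"].items()
    )
    statement = (
        f"INSERT INTO TABLE {mapping['target']}\n"
        f"SELECT {select_list}\n"
        f"FROM {mapping['source']}"
    )
    if mapping.get("filter"):
        statement += f"\nWHERE {mapping['filter']}"
    return statement


# Hypothetical design: what to load and how each target column is derived.
mapping = {
    "source": "weblogs_raw",
    "target": "weblogs_clean",
    "columns": {"customer_id": "cust_id", "page": "lower(url)"},
    "filter": "status = 200",
}

hiveql = generate_hiveql(mapping)
print(hiveql)
```

The designer states what should end up in the target, and the generator decides how the statement is written. Hive then turns that statement into MapReduce jobs on the cluster. The sketch builds the statement by plain string formatting, so it should only ever be fed trusted design metadata.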
In this case, ODI loads the data directly into Oracle Database, using the Oracle Loader for Hadoop support within the Application Adapter for Hadoop. What makes the loading optimized is that Oracle Loader for Hadoop sorts, partitions, and converts data into Oracle Database formats in Hadoop, then loads the converted data into the database. By preprocessing the data to be loaded as a Hadoop job on a Hadoop cluster, Oracle Loader for Hadoop dramatically reduces the CPU and I/O utilization on the database.
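The preprocessing idea (convert, partition, and sort before the data reaches the database) can be shown with a small Python sketch. It is a single-process illustration under assumed input. The row layout, the choice of the region column as partition key, and the function names are hypothetical, and the code does not use or imitate the actual Oracle Loader for Hadoop interfaces.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw text rows: order id, order date, region, amount.
raw_rows = [
    "3,2012-03-22,EMEA,19.90",
    "1,2012-03-20,APAC,5.00",
    "2,2012-03-21,EMEA,7.50",
]


def convert(line):
    """Convert one text row into typed values, as a database would store them."""
    order_id, day, region, amount = line.split(",")
    return (
        int(order_id),
        datetime.strptime(day, "%Y-%m-%d").date(),
        region,
        float(amount),
    )


def prepare_for_load(lines):
    """Convert rows, group them by partition key, and sort each partition."""
    partitions = defaultdict(list)
    for line in lines:
        row = convert(line)
        partitions[row[2]].append(row)  # partition by region
    # Rows are tuples, so sorting orders them by order id first.
    return {region: sorted(rows) for region, rows in sorted(partitions.items())}


prepared = prepare_for_load(raw_rows)
```

Because the parsing, grouping, and sorting happen before the load, the database only has to accept rows that are already typed and ordered. That is the reason the notes give for the lower CPU and I/O utilization on the database side.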
Takeaway: ODI is part of Oracle’s complete data integration strategy, but specifically for big data, ODI is the key element you want to start with. The unique advantages of the offering include:
• Simplifies creation of Hadoop and MapReduce code to boost productivity
• Integrates big data heterogeneously via industry standards: Hadoop, MapReduce, Hive, NoSQL, HDFS
• Unifies integration tooling across unstructured/semi-structured and structured data
• Optimizes loading of big data to Oracle Exadata using Oracle Loader for Hadoop
• Integrated and optimized for running on Oracle Big Data Appliance via Big Data Connectors
Thank you for your time today. I hope that this presentation was informative and useful. Please contact your Oracle data integration sales representative for more information about any of the Oracle data integration products you heard about today.
1. Bridging the Big Data Divide with Oracle Data Integration
Milomir Vojvodic, Business Development Manager, EMEA DIS
2. Diverse Data Sets
Information Architectures Today: Decisions based on transactional data (transactions, applications, structured data)
Information Architectures Today: Decisions based on all your data (video and images, documents, social data, machine-generated data)
3. Architecture Principles and Best Practices: Oracle Data Integration Solutions for Big Data
4. Integrated Architecture
Stages: Capture, Store/Process, Integrate, Organize, Analyze, Govern
Data sources: Master & Ref Data; Transaction Data; Machine-Generated; Social Media; Text, Image, Video, Audio
Components shown: DBMS (OLTP), Hadoop Cluster w MapReduce, Key-Value Data Store, DB Replication, ETL/ELT, CDC, Real-Time Streaming (CEP Engine), Message-Based, ODS, Data Warehouse, Data Marts, Reporting & Dashboards, Alerting, EPM, BI Applications, Text Analytics and Search, In-Database Analytics, Advanced Analytics, Visual Discovery
5. Integrate Big Data with DW and Transactional Data Stores
Platforms: Oracle Big Data Appliance, Oracle Exadata, Oracle Exalytics
Stages: Stream, Acquire, Organize, Analyze & Visualize
• Load from big data processing into your data warehouse for further analysis
• Access your customer information while you process through your big data in order to look for patterns
6. Oracle Data Integration Solutions
• Complete and best-of-breed approach to address enterprise integration
• Maximum performance with lower cost of ownership, ease of use, and reliability
• Certified for leading technologies to deliver fast time to value
• Oracle customers report:
  – 80% lower TCO
  – Five times higher performance
  – 70% reduction in development costs
Diagram labels: Legacy Sources, Application Sources, Relational and Non-Relational; Oracle Enterprise Data Quality, Oracle Data Integrator, Oracle GoldenGate
7. Architecture Principles and Best Practices: DB Replica and CDC within Data Integration Layer
8. What is Oracle GoldenGate?
Diagram: Source DB, OGG, Target DB
9. What is Oracle GoldenGate?
Diagram: Source DB, OGG, Target DB
First OGG Differentiator: Accessing transaction logs directly
Second OGG Differentiator: Moving only committed transactions
10. Oracle DIS Use Cases - OGG
• Migrations & Consolidations (OGG): Zero Downtime Migrations & Upgrades; target: New DB/HW/OS/APP
• Active/Active DB Deployment (OGG); target: Fully Active Distributed DB
• Disaster Recovery, Reporting Database (OGG, ADG); target: Reporting Database and/or DR database
• DW Synchronization (OGG); target: Data Warehouse
11. OGG is Log Based Replica
OR (comparing to around 50% of cases): Scanning of the whole DB (batch window)
[Charts: "Time required for the End Of Day procedure" (hours); "No. of CPUs required for same performance*" (no. of required CPUs); "Estimated costs for server and license**" (estimated cost of purchase in USD)]
Callouts: Daily load time can reach 5 days with the current HW. Required no. of CPUs can be doubled. Costs can be doubled.
Currently the End Of Day procedure utilizes the server CPU by 40-50% and the IO by 90%. Probably the IO is the bottleneck.
12. OGG Moves Only Committed Transactions
OR (comparing to around 30% of cases): HW/Storage replica
Comparing to around 20% of cases: Some other transaction-based solution
[Diagram: Capture, Pump, Delivery, each with a Checkpoint]
Source log records: Begin TX 1; Insert TX 1; Begin TX 2; Update TX 1; Insert TX 2; Commit TX 2; Begin TX 3; Insert TX 3; Begin TX 4; Commit TX 3; Delete TX 4
Records moved downstream: Begin TX 2; Insert TX 2; Commit TX 2; Begin TX 3; Insert TX 3; Commit TX 3
13. Architecture Principles and Best Practices: ETL and Data Quality within Data Integration Layer
14. ODI is centralizing all ETL Development
Top layer: Analytics
Integration tasks: Data Replication, Data Migration, Data Warehousing, Data Silos, Data Marts, Data Federation, Data Hubs, Data Access
Implemented with: Batch Scripts, SQL, Custom Java
Sources: OLTP & ODS Systems; Oracle PeopleSoft, Siebel, SAP, Custom Apps; Files (Excel, XML); Data Warehouse, Data Mart; OLAP
15. ODI is centralizing all ETL Development
Comparing to around 80% of cases: Using manual coding
Layers: Analytics; Oracle Data Integrator
Sources: OLTP & ODS Systems; Oracle PeopleSoft, Siebel, SAP, Custom Apps; Files (Excel, XML); Data Warehouse, Data Mart; OLAP
16. Why is ODI different?
ODI E-LT
First ODI Differentiator: Transformations using the power of the Target Database, with no staging server
Second ODI Differentiator: ODI Declarative Design and ODI Knowledge Modules for reusing already written low-level SQL code
[Diagram: Staging Server, OGG, ODI, Data Warehouse]
ODI Knowledge Modules:
• Reverse: Reverse Engineer Metadata
• Journalize: Read from CDC Source
• Load: From Sources to Staging
• Check: Constraints before Load
• Integrate: Transform and Move to Targets
• Service: Expose Data and Transformation Services
[Diagram: Sources, CDC, Journalize, Load, Staging Tables, Check, Error Tables, Integrate, Target Tables, Services]
ODI Declarative Design: 1. Define What You Want. 2. Automatically Generate Dataflow. Define How: Built-in Templates
Sample out-of-the-box Knowledge Modules: Oracle DBLink, Oracle Merge, Oracle SQL*Loader, Oracle Web Services, JMS Queues, Check MS Excel, Check Sybase, TPump/Multiload, SAP/R3, Siebel, Siebel EIM Schema, Log Miner, Server Triggers, DB2 Journals, DB2 Exp/Imp, DB2 Web Services, Type II SCD
Benefits
17. Oracle DIS Use Cases – ODI and EDQ
• Migrations & Consolidations (EDQ, OGG, ODI): Zero Downtime Migrations & Upgrades; target: New DB/HW/OS/APP
• Active/Active High Availability (OGG); target: Fully Active Distributed DB
• Query Off-Loading and Disaster Recovery (OGG, ADG); target: Reporting Database and/or DR database
• BI & DW Synchronization and Loading (OGG, ODI, EDQ); target: Data Warehouse
18. Why Do We Need Data Quality?
Issues highlighted: Abbreviations (often ambiguous); Attributes non-standard, missing or invalid; Inconsistent formats; Widespread duplication (often hard to spot); Compound Names; Mis-Fielded Data; Embedded Additional Information; Erroneous Data; Mixed Business & Personal Names; International Date Formats; Multiple Names; Default or Dummy Data
Fields (empty fields omitted): Customer ID, Customer Name, Address 1, Address 2, City, State, Zip, Country, Birth Date, Gender
AD23298, Mr Peter Mayhew, 9407 Main St, Fairfax, VA, 22031-4001, USA, 02/23/61, M
VS38611, Dr Ellen Van Der Heijde, 144 E Grove St, Kingston, PA, 18704, US, 07/12/57
DC18223, Jalila Abdul-Alim (Do Not Call), 4548 Pennsylvania Ave, Apt 205, Kansas City, MO, 64111-3349, USA, 02/23/63, F
CO9387A, Tayside Computers Inc., 4912 E 41st N, Idaho Falls, ID, 83401, USA, 31/03/2007, N/A
TZ35019, Mr Zachary P Jahn, 98-1731 Ipuala Loop, Aiea, Hawaii, 96701 1710, United States, 06/12/86, Male
CB27843, Mrs Edith Y Baba Junior Baba Real Est. Corp., 209 Stony Point Trl, Webster, NY, USA, 11/17/1971, M
OX80306, Andrew & Mary Baxter, 14 Oxbridge Way, Milfrod, NH, 03055-4614, US, 05/28/67, F
JP70210, Mr RJ & Mrs FB MacDonald, 57 Hadleigh Close Westlea Swindon SN5 9BZ MA - USA - Y
RD48107, Mr Andy Baxter, 14 Oxbridge Wy, Milford, NH, 3056, USA, 01/01/01, M
© 2011 Oracle Corporation
19. Why Do We Need Data Quality?
Product data is much more variable and unpredictable than other data types.
Descriptions of the same item:
• 10hp motor 115V Yoke mount
• MOT-10,115V, 48YZ,YOKE
• mtr, ac(115) 10 horsepower 115volts
• This 10hp yoke mounted motor is rated for 115V with a 5 year warranty
• 10 Caballos, Motor, 115 Voltios
• TEAO HP = 10.0 1725RPM 115V 48YZ YOKE MTR
• Motor, TEAO, 1725 RPM, 48YZ, 15 Voltios, Montaje de Yugo, hp = 10
Standardized record: Item: Motor; Classification: 26101600; Power: 10 horsepower; Voltage: 115; Mounting: Yoke
20. Oracle Enterprise Data Quality
• Profile, Audit, Transform, Parse, Cleanse, Standardize, Match within One Unified Solution
21. EDQ Address Verification
Input: 300 Berry #1210 SF California
Field (Parse, Validate):
PremiseNumber: 300, 300
ThoroughfareName: Berry, Berry St
SubPremise: #1210, Unit 1210
Locality: SF, San Francisco
AdministrativeArea: California, CA
PostCode: (missing), 94158-1670
Geocode: Latitude 37.775837, Longitude -122.39557
Step 1: Extract pieces of the address
Step 2: Check the pieces against the information in the Global Knowledge Repository to complete and find the correct abbreviations
Step 3: Change character set (transliterate) if necessary
Step 4: Find location
© 2012 Oracle Corporation – Proprietary and Confidential
22. Architecture Principles and Best Practices: Oracle Data Integrator for Big Data
23. ODI for Big Data: Heterogeneous Integration to Hadoop Environments
• Supports Hadoop standards
• Easy to configure UI for generating MapReduce
Diagram labels: Oracle Data Integrator; Loads; Transforms Via MapReduce
24. ODI for Big Data to Oracle: Optimized Integration to Oracle Exadata
Diagram labels: Oracle Data Integrator; Oracle Big Data Connectors; Transforms Via MapReduce; Activates Oracle Loader for Hadoop; Loads; Hadoop Cluster (Oracle Big Data Appliance); Oracle Database (Oracle Exadata)
25. Oracle Data Integrator for Big Data: Putting Together the Unique Advantages