Cloud Data Integration Best Practices


While many enterprises consider cloud computing the savior of their data strategy, there is a process they should follow when looking to leverage database-as-a-service. This includes understanding their own data requirements, selecting the right cloud computing candidate, and then planning for the migration and operations. A huge number of issues and obstacles will inevitably arise, but fortunately best practices are emerging. This presentation will take you through the process of moving data to cloud computing providers.

Published in: Technology, Business

  • First, it’s useful to provide some context. At the highest levels of the company, we see ourselves as running three distinct macro businesses: our Consumer/Retail business, our Seller business, and our Developer business.
  • AWS customer examples:
      - Stanford: Stanford University uses Moonwalk’s Enterprise Data Management System to store and back up its data in Amazon S3. Moonwalk is an ISV partner of AWS.
      - CA: Management solutions from CA provide a business-driven management approach and add comprehensive support for Amazon EC2 with real-time automation, provisioning, application performance, service, and database management.
      - Adobe: Adobe offers its LiveCycle Developer Express program on AWS, giving its enterprise developer community ready access to its document workflow solution for developing solutions. Adobe is also now offering ColdFusion 9, its application development platform, on AWS.
      - Microsoft: Microsoft has used AWS over the years for various tasks, including software delivery, human intelligence tasks, and application hosting.
      - Mailtrust (Rackspace): Mailtrust archives its mail servers in S3.
      - NYTimes: The NYTimes has used AWS extensively over the years for data processing pipelines, data analysis, application hosting, etc. Early projects included the TimesMachine.
      - SanDisk: SanDisk uses S3 as the cloud-based storage mechanism to back up and share data from its flash drive products.
      - NASDAQ: NASDAQ uses S3 to store and deliver its ticker symbol data for the NASDAQ Market Replay application.
      - ESPN: ESPN uses EC2 and S3 to host several of its social networking and mobile properties.
      - Intuit: Intuit used SOASTA’s CloudTest service to run load testing on its tax software, utilizing 2200 EC2 cores. Intuit also hosts some applications on EC2 during tax season, including Intuit TaxCaster.
      - Netflix: Netflix is using AWS for a variety of mission-critical applications and services, and will continue to look for ways to leverage the Amazon cloud to service its customers.
      - Autodesk: Autodesk hosts several applications on EC2, including Autodesk Seek. Autodesk Seek is the online source for architects and building engineers to search and download manufacturer design information (i.e., reusable CAD models that can be dropped directly into design projects).
      - Pfizer: Pfizer has done antibody docking for drug design on EC2, leveraging up to 500 c1.medium instances at a time to do the modeling.
      - New York Life: New York Life has created a financial planning application that will help its employees run “what-if” scenarios for their customers. It will look at things like income, debt, and expenses, and come up with a customized plan.
  • What should you expect from this approach? It is key to adaptability and agility. The investments required are in strategy, not technology: how to use current technology to achieve goals. There is also an investment in re-orienting the thinking of IT staff. How do we provide services the same way FedEx enables overnight versus 2-day delivery? FedEx has built a flexible architecture to enable business-level services.
  • So the cloud is taking off…but it’s also become the #1 driver of Data Fragmentation in the enterprise. As one of our customers said, “a SaaS application without integration is like a beautiful island that nobody can actually get to.” In fact, Forrester Research has shown that 65% of IT managers recognize integration issues as the top barriers to success.
  • That’s why Informatica has put such a focus on the cloud. We’re now delivering deployment options across the different flavors of cloud computing: SaaS, PaaS, and Infrastructure as a Service. Informatica Cloud Services are purpose-built applications designed for non-technical line-of-business users (often the SaaS administrator, for example). We’ve initially focused on a set of specific use cases that we see as the primary requirements today for SaaS application customers: data migration (loading data in), data synchronization (keeping systems and processes unified on a real-time basis), data quality, and data replication (keeping a local copy of cloud data, typically for on-premise business intelligence). Last year we extended our cloud offerings in two important ways. First, we introduced the Informatica Cloud Platform, which allows our customers to build and share more complex mappings and functions as a custom cloud service. Second, we added support for IaaS deployments such as Amazon EC2. This means you can sign up to use Informatica PowerCenter or Data Quality on an hourly basis or deploy your software directly on their servers.
  • By 2010, 76% of US organizations will use at least one SaaS-delivered application for business.

    1. Cloud Data Integration Best Practices
       Kurt Messersmith, Amazon Web Services
       David S. Linthicum, Darren Cunningham, Informatica Cloud
    2. Today’s Agenda
         • Amazon Web Services
         • Data Integration for “Cloud”
         • Informatica Cloud
    3. AMAZON WEB SERVICES
       Kurt Messersmith, Sr Manager, AWS
    4. AMAZON’S THREE BUSINESSES
         • Consumer (Retail) Business
             - Tens of millions of active customer accounts
             - Seven countries: US, UK, Germany, Japan, France, Canada, China
         • Seller Business
             - Sell on Amazon websites
             - Use Amazon technology for your own retail website
             - Leverage Amazon’s massive fulfillment center network
         • Developers & IT Professionals
             - On-demand infrastructure for hosting web-scale solutions
             - Hundreds of thousands of registered customers
    5. AWS USAGE GRAPH
       [Graph: bandwidth usage]
         • 2007: AWS bandwidth usage surpassed global websites
         • Today: AWS bandwidth usage 30% greater than global websites
    6. CLOUD ATTRIBUTES
         • Abstract Resources: Not tied to physical hardware, and can flex as your needs demand.
         • On-Demand Provisioning: Ask for what you need, exactly when you need it. Pay only for what you use.
         • Scalability: Scale up or down depending on usage needs.
         • No Up-Front Costs: No contracts or long-term commitments. Pay only for what you use.
         • Efficiency of Experts: Utilize the skills, knowledge and resources of experts.
    7. OPPORTUNITY COSTS: OWNED VS. CLOUD
         • Owned Infrastructure: The Heavy Lifting
             - Server hosting
             - Contract negotiation
             - Bandwidth management
             - Purchase decisions
             - Moving facilities
             - Scaling and managing physical growth
             - Heterogeneous hardware
             - Legacy software
             - Coordinating large teams
         • Cloud Computing: The 70/30 Switch
       30% of time, energy and dollars on differentiated value creation
    8. PREDICTIONS COST MONEY
       [Chart: infrastructure cost ($) over time. Series: Predicted Demand, Traditional Hardware, Actual Demand, Automated Virtualization. Annotations: “Large Capital Expenditure”, “You just lost customers”]
    9. AGILITY EXAMPLE: COST NEUTRAL EQUATION
       This graphic compares running the same 10,000 jobs on 2 servers versus 1000 servers, assuming each server can process 10 jobs/hour. The cost is the same for either scenario using AWS (and RightScale), but the difference in elapsed time is 499 hours.
         • 2 server cloud: 10,000 jobs in, output data out. Total processing time: 500 hours
         • 1000 server cloud: 10,000 jobs in, output data out. Total processing time: 1 hour
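The arithmetic behind this slide can be checked with a short sketch. It assumes, as the slide does, that each server processes 10 jobs per hour, and it further assumes billing is strictly per server-hour. The hourly rate is a made-up placeholder, not an AWS price.

```python
def elapsed_hours(jobs: int, servers: int, jobs_per_server_hour: int = 10) -> float:
    """Wall-clock hours to finish `jobs` spread evenly across `servers`."""
    return jobs / (servers * jobs_per_server_hour)


def server_hours(jobs: int, servers: int, jobs_per_server_hour: int = 10) -> float:
    """Billable server-hours: number of servers times elapsed hours."""
    return servers * elapsed_hours(jobs, servers, jobs_per_server_hour)


# Hypothetical price per server-hour, for illustration only.
RATE_PER_SERVER_HOUR = 0.10

for servers in (2, 1000):
    hours = elapsed_hours(10_000, servers)
    cost = server_hours(10_000, servers) * RATE_PER_SERVER_HOUR
    print(f"{servers} servers: {hours:.0f} hours elapsed, ${cost:.2f}")
```

Both configurations consume the same number of server-hours, which is why the cost is neutral while the elapsed time drops from 500 hours to 1 hour.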
    10. DIVERSE ENTERPRISE USE CASES
          • Web site hosting
          • Application hosting
          • Internal IT application hosting
          • Quick and effective marketing campaigns
          • Content delivery and media distribution
          • High performance computing, batch data processing, and large scale analytics
          • Storage, backup, and disaster recovery
          • Development and test environments
    12. Data Integration for “Cloud”
        David S. Linthicum [email_address]
    13. So, Why Data Integration and “Cloud”?
          • Improved Adaptability and Agility
              - Respond to business needs in near real-time
          • Functional Reusability
              - Eliminate the need for large scale rip and replace
          • Independent Change Management
              - Focus on configuration rather than programming
          • Interoperability instead of point-to-point integration
              - Loosely-coupled framework, services in network
          • Orchestrate rather than integrate
              - Configuration rather than development to deliver business needs
    14. Understand Cloud Provider Interfaces
        [Diagram labels: New Accounts, Commission Calculation, Data Cleaning, Sales Order Update, Finance/Operations, Sales]
    15. Evolving Migration to Public Cloud Computing Providers
        [Diagram labels: Traditional Data Center, Public Cloud]
    16. Evolving Migration to Public Cloud Computing Providers
        [Diagram labels: Traditional Data Center, Public Cloud]
    17. [Diagram labels: SaaS, IaaS, PaaS, APIs, ERP, Legacy, Data, On Premise, On Demand, Data Integration]
    18. Understanding the Problem
          • Cloud services must integrate with existing enterprise systems to become more valuable.
          • However, internal integration must already be in place to ensure:
              - Production and consumption of structured information
              - Semantic mediation
              - Security mediation
              - Service enablement
              - Firewall management
              - Transactional integrity
              - Holistic management of the complete integration chain
    19. Getting Ready
          • So, how do you prepare yourself? I have a few suggestions:
              - First, accept the notion that it's okay to leverage services that are hosted on the Internet as part of your SOA. Normal security management needs to apply, of course.
              - Second, create a strategy for the consumption and management of outside-in services, including how you'll deal with semantic management, security, transactions, etc.
              - Finally, create a proof of concept now. This does a few things, including getting you through the initial learning process and providing proof points as to the feasibility of leveraging outside-in services.
    20. Remember, there are a few technical issues that you must address…
          • Semantic and metadata management, or the management of the different information representations among the external services and internal systems.
          • Transformation and routing, or accounting for those data differences during run time.
          • Governance across all systems, meaning not giving up the notion of security and control when extending your SOA to the global SOA.
          • Discovery and service management, meaning how to find and leverage services inside or outside of your enterprise, and how to keep track of those services through their maturation.
          • Information consumption, processing, and delivery, or how to effectively move information to and from all interested systems.
          • Connectivity and adapter management, or how to externalize and internalize information and services from very old and proprietary systems.
          • Process orchestration, and service and process abstraction, or the ability to abstract the services and information flows into bound processes, thus creating a solution.
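As a rough illustration of the first two issues on this slide (semantic mediation, then transformation and routing), here is a minimal sketch. The field names, converters, and the routing rule are all hypothetical; they are not taken from the presentation or from any vendor's product.

```python
from typing import Any, Callable, Dict

# Hypothetical mapping from an external SaaS record layout to internal names.
FIELD_MAP: Dict[str, str] = {
    "AccountName": "customer_name",
    "AnnualRevenue": "revenue_usd",
    "BillingCountry": "country",
}

# Hypothetical per-field converters applied at run time.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "revenue_usd": float,
    "country": lambda value: str(value).strip().upper(),
}


def mediate(external: Dict[str, Any]) -> Dict[str, Any]:
    """Rename external fields to internal names and normalize their values."""
    internal: Dict[str, Any] = {}
    for ext_name, value in external.items():
        int_name = FIELD_MAP.get(ext_name)
        if int_name is None:
            continue  # fields with no agreed internal meaning are dropped
        convert = CONVERTERS.get(int_name, lambda v: v)
        internal[int_name] = convert(value)
    return internal


def route(record: Dict[str, Any]) -> str:
    """Pick a target system from record content (an illustrative rule only)."""
    return "finance" if record.get("revenue_usd", 0) >= 1_000_000 else "sales"
```

Keeping the mapping in data rather than in code reflects the earlier point about favoring configuration over programming.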
    21. Core Issues that Architects Must Consider when Integrating with “Clouds”
          • The ability to handle larger data sets.
          • The ability to handle and resolve data inaccuracies and inconsistencies.
          • The ability to do data manipulation efficiently and inexpensively.
          • The ability to provide visibility into the lineage of data.
          • The ability to decouple data access from the implementation.
    22. Limitations of Existing Integration Approaches
          • Inefficient consumption of data by the integration engine from the source systems.
          • Lack of validation and transformation of the data for the correct format and structure.
          • No early detection of data inaccuracies and inconsistencies, leading to error-prone business processes.
          • Inability to handle data quality issues.
          • No tracking of data to ensure data traceability and lineage.
          • Content transformation, on messages and large sets of data.
          • Inefficient provisioning of the data from the integration and processing engine to the target system.
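Two of the gaps listed above, early detection of bad data and lineage tracking, can be sketched together. This is only an illustrative outline under simple assumptions: "invalid" means a required field is missing or empty, and lineage is a plain list of notes carried with each record. The names `TrackedRecord` and `ingest` are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class TrackedRecord:
    """A row plus a running log of where it came from and what happened to it."""
    data: Dict[str, Any]
    lineage: List[str] = field(default_factory=list)


def ingest(
    rows: Iterable[Dict[str, Any]], source: str, required: List[str]
) -> Tuple[List[TrackedRecord], List[TrackedRecord]]:
    """Validate rows as they arrive, recording each step in the lineage log."""
    accepted: List[TrackedRecord] = []
    rejected: List[TrackedRecord] = []
    for row in rows:
        rec = TrackedRecord(dict(row), [f"extracted from {source}"])
        missing = [name for name in required if rec.data.get(name) in (None, "")]
        if missing:
            rec.lineage.append("rejected: missing " + ", ".join(missing))
            rejected.append(rec)
        else:
            rec.lineage.append("validated")
            accepted.append(rec)
    return accepted, rejected
```

Rejecting at ingestion time keeps bad rows out of downstream business processes, and the lineage list gives each row a traceable history.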
    23. Issues You Need to Consider when Selecting Data Integration Technology for Enterprise-to-Cloud
          • Lack of support for complex data transformations.
          • Challenges in handling large data volumes.
          • Lack of support for handling varying data latencies, including batch, trickle-feed and real-time.
          • Difficulty in determining the origin of data or how it’s utilized.
          • Lack of standards-based approaches and limited re-use across projects.
          • Lack of mechanisms to handle data quality issues across sources.
          • No protection against changes to underlying data sources.
          • The requirement for manual handling of diverse data structures, formats, access, etc.
          • Limited support for metadata and impact analysis.
          • Lack of a mechanism to automatically detect changes to the data.
          • Lack of support for batch and trickle-feed (CDC) data movement.
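To make the "automatically detect changes" item concrete, here is a minimal sketch that compares row fingerprints between two snapshots. This snapshot-comparison approach is an assumption chosen for brevity; it is one simple way to approximate change data capture and is not how log-based CDC tools work. The function names and the `id` key are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Tuple


def row_fingerprint(row: Dict[str, Any]) -> str:
    """Stable hash of a row's content, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_changes(
    previous: Dict[Any, str], current_rows: Iterable[Dict[str, Any]], key: str = "id"
) -> Tuple[Dict[str, List[Any]], Dict[Any, str]]:
    """Compare the current rows with the fingerprints saved from the last run."""
    seen: Dict[Any, str] = {}
    inserted: List[Any] = []
    updated: List[Any] = []
    for row in current_rows:
        row_key, fingerprint = row[key], row_fingerprint(row)
        seen[row_key] = fingerprint
        if row_key not in previous:
            inserted.append(row_key)
        elif previous[row_key] != fingerprint:
            updated.append(row_key)
    deleted = [row_key for row_key in previous if row_key not in seen]
    changes = {"inserted": inserted, "updated": updated, "deleted": deleted}
    return changes, seen  # persist `seen` as `previous` for the next run
```

Only the keys reported as inserted, updated, or deleted need to move to the target, which is the idea behind trickle-feed movement.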
    24. Create the Information Model
        [Diagram: steps and their outputs]
          • Understand Ontologies: Ontologies
          • Understand the Data: Data Dictionary & Metadata
          • Catalog the Data: Data Catalog, Legacy Metadata, External Metadata (B2B)
          • Build Information Model: Information Model
    25. Start with the Architecture
        Understand:
          • Business drivers
          • Information under management
          • Existing services under management
          • Core business processes
    26. The Informatica Cloud
        Darren Cunningham, Informatica Cloud Marketing
    27. Primary Cloud Integration Use Cases
        [Diagram: “Your Company” surrounded by four use cases]
          • Load Data
          • Synchronize Data
          • Cleanse Data
          • Replicate Data
    28. Cloud Integration Options
        [Diagram: four numbered options: Hand Code, On-Premise Tools, Outsource, Cloud Services]
        “You need to consider integration for what it is: the mother of all single points of failure.”
        David Linthicum, Author, Cloud Computing and SOA Convergence in Your Enterprise
    29. The Informatica Cloud
        The Industry’s Broadest Cloud Integration Portfolio
          • Informatica Cloud Services (Business Managers): Migrate, Validate, Monitor, Synch, Replicate
          • Informatica Cloud Editions & Options (IT)
          • Informatica Cloud Platform (SIs, ISVs, Developers): Custom
    30. Data Integration as a Service Advantages
          • +500 customers
          • +20K jobs/day
          • +5B rows/month
        [Diagram labels: Migrate, Monitor, Replicate, Synch, Custom]
          • For Customers
              - Rapid Deployment
              - Utility Pricing
              - Minimal Training
              - Fewer IT Resources
              - Seamless Upgrades
              - Usage Tracking
          • For ISVs
              - Reduced Dev Costs
              - Rapid Innovation
              - Best of Breed Tech
              - Greater Scalability
              - Expand Your Market
              - Focus on Your Core
    31. Data Replication as a Cloud Service
        “We’re using Informatica Cloud Services to replicate millions of rows of data from Salesforce to a centralized database running on Amazon EC2.”
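The replication pattern in this quote can be outlined generically. The sketch below is not Informatica's implementation and does not call any Salesforce API. It assumes a hypothetical `fetch_since` callable that returns rows modified after a given timestamp, and it uses an in-process SQLite database as a stand-in for the centralized database. The `accounts` table and its columns are invented for illustration.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# (id, name, last_modified as an ISO-8601 string in one consistent format)
Row = Tuple[str, str, str]


def replicate(conn: sqlite3.Connection, fetch_since: Callable[[str], Iterable[Row]]) -> int:
    """Copy rows changed since the last run into the local replica table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts ("
        "id TEXT PRIMARY KEY, name TEXT, last_modified TEXT)"
    )
    # The newest timestamp already replicated acts as the watermark.
    watermark = conn.execute(
        "SELECT COALESCE(MAX(last_modified), '') FROM accounts"
    ).fetchone()[0]
    rows = list(fetch_since(watermark))
    # Upsert so that changed rows overwrite their earlier copies.
    conn.executemany(
        "INSERT OR REPLACE INTO accounts (id, name, last_modified) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

Using a watermark means each run moves only the rows that changed, which matters when the source holds millions of rows.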
    32. Contacts
          • David S. Linthicum
              - [email_address]
          • Kurt Messersmith
              - Amazon Web Services
              - [email_address]
          • Darren Cunningham
              - [email_address]