8.17.11 big data and hadoop with informatica slideshare



This presentation provides a briefing on Big Data and Hadoop and how Informatica's Big Data Integration plays a role to empower the data-centric enterprise.

Published in: Technology, Business

  • Intro 1. Increase Business Agility (both Responsiveness And Velocity)
  • Informatica 9.1 for Big Data is the industry’s first data integration platform designed to handle the challenges of Big Data. Informatica 9.1 helps businesses stay competitive by taking advantage of big transaction and interaction data. Here is a list of customers who are using the Informatica platform for big data integration. Please note that this list shows:
      - Breadth of business imperatives, from analytics to operational improvement to counter-terrorism
      - Handling of any data, from transaction data to social and interaction data
      - Diversity of vertical coverage
      - Use of a big data processing platform such as Hadoop
    A few examples:
    *Healthnow with Big Transaction Data, moving to incorporate Social Media Data*
    A premier and diversified health benefits and information company (insurer). Supports 16 legacy systems, 30000 data marts, 10M claims. Pioneer in the use of Informatica data services, a data virtualization solution, on large transaction data, with interest in using interaction data such as texts from customer support or social media chatter about quality of care and member satisfaction (see below for media write-up). Beta tester for self-service and data services.
    Vision for Big Data:
      - Go beyond the claim data and incorporate patient records (texts, images, etc.), demographic data and health history to perform predictive analysis for improved quality of care and reduced cost
      - Monitor member feedback, including complaints and churn
      - Possible integration of pharmacy operations with adverse impact discussed in social media
    *US Xpress for Big Data Processing*
    About – a trucking company that has become a pioneer in transportation mobile intelligence, with next-generation onboard communication systems, over 3,000 carefully screened and trained drivers, more than 2,800 state-of-the-art tractors and in excess of 22,000 53’ trailers.
    Challenges – No means of determining where its trucks had stopped, or for how long. Money was wasted on engine idle time, trucks weren’t being used as efficiently as possible, and customer service was suffering. 900 data elements came from trucking systems, but the data wasn’t clean and couldn’t be trusted. Sampling operational data did not help them fully optimize routes. They were trying to move to “No Data Left Behind” cost-effectively: track every piece of data you can imagine, including sensor data for tire and gas usage, geospatial data to track the fleet, and data from blogs by truckers. The business was unable to ask questions in free text and get answers without middlemen (= IT).
    Solution – Informatica PowerCenter, PowerExchange, Web Service / MOM option, Informatica Data Quality, Informatica Data Explorer, extending to Hadoop to perform text mining.
    Benefits – Saving millions per year. Reduced emissions and met environmental commitments (Go Green). No data left behind: extensive use of new data types such as mobile, geospatial, sensors, blogs, social media, etc.
    *StationCasinos for Big Interaction Data*
    A hospitality enterprise in Las Vegas. Their vision is to increase customer loyalty and provide a total entertainment experience. StationCasinos combines enormous volumes of real-time interaction data, such as ATM data, slot machine data and even social data from Facebook, with historical transaction data to deliver complete real-time entertainment services to their customers. Informatica is the critical data integration engine, delivering a unique 360-degree view of the customer enriched with social data and real-time customer interactions captured through slot machines and ATMs, updated and cleansed in real time into customer records.
    Four customer references for media as of 5/26/11:
    Healthnow, New York (see above)
    Linkshare (not on the slide 11 list but an approved media reference)
    Online marketing solutions provider for Affiliate Marketing, Search Marketing, and Lead Generation. Supporting >500 sources, 300 million transactions/day, 300k users in real time.
    Vision for Big Data:
      - Continue maintaining the reliability and consistency of big transaction data: real-time customer data feeds such as lead management systems, campaign search, external third-party search engine interaction and bid extraction
      - Use established presence in social media like Twitter and LinkedIn to extend outreach to affiliate partners and end users
    Westpac (not on the slide 11 list but an approved media reference)
    Global 500 financial institution based in Australia, covering retail banking, insurance, institutional banking, etc. Supporting a wide variety of initiatives including risk management, cross-sell/upsell and customer retention, and enterprise financial reporting/audit.
    Vision for Big Data:
      - Sentiment analysis including Facebook and Twitter feeds (text, location data) combined with golden records/householding
      - Looking at a geographic heat map of who has positive or negative sentiments by location, to understand retail banking performance and determine issues
      - Sentiment trading: Dow Jones sentiment index for institutional banking decisions
      - Become more adaptable in identifying opportunities and mitigating risks: a more data-driven approach versus the traditional top-down approach, which is often not useful for detecting things that were never thought of
    T-Mobile USA (not on the slide 11 list but an approved media reference)
    T-Mobile USA is a national provider of wireless voice, messaging, and data services capable of reaching over 293 million Americans.
    Vision for Big Data:
      - Seeking to analyze data stored in the subscriber data warehouse to design cross-sell and upsell campaigns for their end customers. This involved analysis of data related to handset ordering, customer info, subscriber info, tax, financials and prepay plan information.
      - What-if analysis needed to be performed, including texts in the customer support representative records.
      - Social network analysis: details to be captured on June 3 in an interview with the customer
  • Big Data means all data, including both transaction and interaction data, in sets whose size or complexity exceeds the ability of commonly used technologies to capture, manage and process at a reasonable cost and timeframe. In fact, Big Data is the confluence of three technology trends:
      - Big Transaction Data: massive growth of transaction data volumes
      - Big Interaction Data: explosion of interaction data such as social media, sensor technologies, call detail records, and other sources
      - Big Data Processing: new very large scale processing with Hadoop
    For the last 40 years, the IT industry has been focused on automating business processes by using relational databases to process transaction data. This data has become fragmented and locked within operational and analytical systems, both on premise and in the cloud. Data integration technology integrated these transactional data silos. Over time the volume of this transaction data has grown to outpace the capabilities of IT to effectively manage and process what has become “Big Transaction Data”.
    Today organizations are also confronted with an explosion of a new type of data called “Big Interaction Data”, which poses new challenges and new opportunities. Gaining access to this data is critical to the empowerment of the enterprise to take advantage of new business opportunities. However, IT organizations are not adequately prepared to access, process, integrate and deliver this data. Combining Big Interaction Data with Big Transaction Data will unleash great new opportunities for the data-centric enterprise and drive competitive advantage.
  • Connectivity to Big Transaction Data. Informatica 9.1 provides access to high volumes of transaction data, up to a petabyte in scale, with native connectivity to OLTP and online analytical processing (OLAP) data stores. A new relational/data warehouse appliance package available in Informatica 9.1 extends this connectivity to solutions purpose-built for Big Data.
      - Maximize the availability and performance of large-scale transaction data from any source
      - Reduce the cost and risk of managing connectivity with a single platform supporting all database and processing types
      - Uncover new areas for growth and efficiency by leveraging transaction data in a scalable, cost-effective way
    CLICK – You can see how a customer can uncover new areas for growth and efficiency if they don’t have to rely on sampled data and can use all transaction data.
  • Connectivity to Big Interaction Data. Access new sources such as social media data on Facebook, Twitter, LinkedIn and more with new social media connectors available in Informatica 9.1. Extend your data reach into emerging data sets of value in your industry, including devices and sensors, CDRs, large image files, or healthcare-related information for biotech, pharmaceuticals, and medical companies.
      - Gain new insights into customer relationships and influences enabled by social media data
      - Access and integrate other types of Big Interaction Data and combine them with transaction data to sharpen insights and identify new opportunities
      - Reduce the time, cost, and risk of incorporating new data sets and making them available to enterprise users
    Large Department Store Retailer for Customer Centricity
    One of the leading fashion specialty retailers serves its customers through local department stores, online, and its catalog business. The company is known for its differentiated services for its clientele. After some analysis, the retailer decided to stop giving free makeup services and cosmetic samples because managers realized that customers receiving these freebies were not buying more cosmetics. The retailer expected that cosmetics sales would remain the same once its giveaway program ended, but instead experienced a decline in cosmetics sales.
    Through research, including harvesting social media information from Twitter and Facebook, they started to better understand the influence model for cosmetics. They came to learn that they have two types of valuable customers that must be retained: high spenders and high influencers. Customers receiving a free makeup session weren’t necessarily buying cosmetics, but their word of mouth was prompting purchases by friends and friends of friends. This was a perfect marriage of transaction data and interaction data to come up with non-obvious answers to a business challenge.
    By using Informatica, this retailer enriched its customer master data with social media data, made its services more targeted, and increased profits by treating those high-influence customers with the right offers.
  • Big Data Processing. New connectivity in Informatica 9.1 enables IT to load data from any source into Hadoop, and extract data from Hadoop for delivery to any target. The connectivity also allows the application of Informatica data quality, data profiling, and other techniques to data in Hadoop. These capabilities open new possibilities for enterprises combining transaction and interaction data either inside or outside of Hadoop.
      - Confidently deploy the Hadoop platform for Big Data processing with seamless source-and-target data integration
      - Integrate insights from Hadoop Big Data analytics into traditional enterprise systems to improve business processes and decision-making
      - Leverage petabyte-scale performance to process large data sets of virtually any type and origin
    We are also looking to develop a graphical integrated development environment for Hadoop in a future release.
  • The latest release of the Informatica Platform, Informatica 9.1, was developed with the express purpose of turning Big Data challenges into big opportunities. Informatica’s data integration platform harnesses the power of both Big Transaction Data and Big Interaction Data in a scalable, cost-effective way. Indeed, Informatica 9.1 converts Big Data into trustworthy, actionable, and authoritative data that companies can use to gain competitive insights and improve business operations. In other words, Big Data integration helps unleash the full business potential of Big Data to empower the data-centric enterprise.
  • Julianna:
    Complex data analytics. Data types that are complex and diverse in sources and formats, including videos, images, text, real-time feeds, devices, scientific data, sensors, call detail records (CDR), etc., are good candidates for Hadoop. For this reason, many organizations use Hadoop to perform sentiment analysis or fraud analysis that combines transaction data with textual and other data.
    Store large amounts of data. Hadoop stores data without the need to convert it into a different format, as is common in a traditional data warehouse. In Hadoop, data is not lost in the translation process. When you need to accommodate new data sources and formats, but aren’t ready to finalize which ones, Hadoop is often a good framework, giving a data analyst the flexibility to choose how and when to perform data analysis.
    Scaling through distributed processing. While Hadoop is used for large-scale data processing with petabytes of data, many organizations using Hadoop are performing data processing tasks at terabyte scale. If you adopt Hadoop, a Big Data processing platform, it can scale as your organizational demand changes. Hadoop, with its MapReduce framework, can abstract the complexity of running distributed, shared-nothing data processing functions across multiple nodes in the cluster, making it easier to gain the benefits of scaling.
    Cost advantage. Hadoop is open source software that runs on commodity hardware, and you can add or retire nodes based on your project demand. This cost advantage is driving many organizations to add Hadoop clusters to their data processing investment portfolio, rather than replacing the existing portfolio. The ability to store and process data cost-effectively in Hadoop is helping organizations harness more data for projects that previously did not make “business” sense.
    Power of Open Source community. Hadoop is supported by an active, global, and growing network of developers, and Hadoop’s subprojects are supported by the world’s largest Web-centric companies, like Yahoo and Facebook. Organizations that choose Hadoop also find this sharing of best practices, enhancements, and fixes in software and documentation a crucial factor in deciding to consider or use Hadoop.
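The point about MapReduce abstracting distributed, shared-nothing processing may be easier to picture with a small sketch. The following is not Hadoop code; it is a single-process Python illustration of the map, shuffle, and reduce phases, and every function name in it (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) is an assumption made for this example. Each input split stands in for a block that a separate node would process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run the three phases in one process (hypothetical stand-in for a cluster).

    Mappers only see their own split and share no state, which is what
    lets a real framework run them on different nodes.
    """
    pairs: List[Tuple[str, int]] = []
    for split in splits:
        for line in split:
            pairs.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job([["big data", "big"], ["data hadoop"]]))
```

Because the map and reduce functions touch only their own arguments, adding nodes changes where they run, not how they are written; that is the scaling property the note describes.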
  • Wei:
  • Data Transformation Architecture
    A developer builds a transformation using the DT Studio. This could be a parser (interpreting a file), a serializer (creating a file) or a mapper (moving between XML schemas). The developer then ‘deploys’ the service from the Studio, which moves all of the required project files from the Studio workspace to the DT repository folder. The service is then moved to the server’s DT repository folder via FTP, copy, or script. At this point the service is immediately available to the engine on the server.
    The server engine, when invoked, executes the rules defined in the transformation service on a given input. The engine is built to be embedded and has API hooks in Java, C, C++, .NET, and Web Services. Using these APIs there are several ways DT can be embedded:
      - Via the command line interface provided with the product (or a custom-built one)
      - In any application using the appropriate APIs
      - Inside PowerCenter using the built-in Unstructured Data Transformation, which sits ‘around’ the engine and the APIs
      - Inside other middleware software. In some cases we have GUI agents for them; in others, DT can be embedded programmatically via the APIs
    Data can be passed to DT and retrieved from DT in two ways:
      - Filenames can be sent in and DT will open the file directly. On the output side, DT can write output files directly.
      - Memory buffers of data can be passed into DT, and it can also return transformed data as a memory buffer to the calling application.
    Although only one input and output are shown in the slide, transformations can be built to support multiple separate inputs and generate multiple outputs as well. Additionally, the engine supports the notion of ‘service parameters’, which is effectively variable passing between a calling app and the DT engine, making those values available inside the transformation.
    One important thing to note is that DT is a fully embedded engine and actually runs inside the process space of the calling application. It is also fully re-entrant and thread-safe. Therefore it can inherently take advantage of the calling application’s scalability and multi-threading capabilities. A great example is the use of partitioning in PowerCenter.
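To make the two invocation styles and the ‘service parameters’ idea concrete, here is a minimal Python sketch. It does not use the real DT APIs, which this note does not spell out; `EmbeddedEngine`, `run_buffer`, `run_file`, the `parse_record` service name, and the `delimiter`/`fields` parameters are all hypothetical names chosen for illustration. A plain dictionary stands in for the service repository folder, and the ‘engine’ is an ordinary object living inside the calling process.

```python
import os
import tempfile
from typing import Callable, Dict, Optional

# Hypothetical: a deployed "service" is modeled as a function from
# input bytes plus service parameters to output bytes.
Service = Callable[[bytes, Dict[str, str]], bytes]


def delimited_to_name_value(data: bytes, params: Dict[str, str]) -> bytes:
    """Toy parser: turn one delimited record into name=value lines."""
    delimiter = params.get("delimiter", ",")
    names = params.get("fields", "").split(",")
    values = data.decode("utf-8").strip().split(delimiter)
    return "\n".join(f"{n}={v}" for n, v in zip(names, values)).encode("utf-8")


class EmbeddedEngine:
    """Illustrative in-process engine; not the actual DT engine API."""

    def __init__(self, repository: Dict[str, Service]) -> None:
        self._repository = repository  # stands in for the service repository

    def run_buffer(self, service: str, data: bytes,
                   params: Optional[Dict[str, str]] = None) -> bytes:
        """Buffer mode: the caller passes bytes in and gets bytes back."""
        return self._repository[service](data, params or {})

    def run_file(self, service: str, in_path: str, out_path: str,
                 params: Optional[Dict[str, str]] = None) -> None:
        """File mode: the engine opens the input and writes the output itself."""
        with open(in_path, "rb") as src:
            result = self.run_buffer(service, src.read(), params)
        with open(out_path, "wb") as dst:
            dst.write(result)


if __name__ == "__main__":
    engine = EmbeddedEngine({"parse_record": delimited_to_name_value})
    params = {"delimiter": "|", "fields": "caller,duration"}
    print(engine.run_buffer("parse_record", b"555|42\n", params).decode("utf-8"))
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, "in.txt"), os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(b"555|42\n")
        engine.run_file("parse_record", src, dst, params)
```

In this sketch the same transformation logic serves both modes, which mirrors the note's claim that the transformation is independent of how the calling application hands data to it.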
  • Telco: Example opportunity: Vodafone (Spain) wants to extend analytics beyond 2 months of topology information. The limitation today is due to using a standard DW approach and the high cost of holding big data. This limits the ability to understand seasonality of demand patterns.
    Pharmaceutical: Predictive analytics using ALL information collected in past clinical trials.
    Healthcare (similar use case for utilities and telco): Monitoring devices generating vast amounts of data that need to be collected, aggregated, and analyzed.
  • This is the Informatica Corporate Presentation

    1. 1. Big Data and Hadoop with Informatica August 2011 Julianna DeLua, Enterprise Solution Evangelist
    2. 2. Globalization Operational Efficiency Consolidation Growth Governance Improve Decisions Modernize Business Improve Efficiency & Reduce Costs Mergers Acquisitions & Divestitures Acquire & Retain Customers Outsource Non-core Functions Governance Risk Compliance Increase Partner Network Efficiency Increase Business Agility Cloud Computing Application Database Unstructured Partner Data SWIFT NACHA HIPAA … The Information Economy Lack of Trustworthy Data Impedes Key Business Imperatives Lack of relevant, trustworthy and timely data
    3. 3. Improve Decisions Business & Operational Intelligence Data Warehouse Empowering the Data-Centric Enterprise Modernize Business Improve Efficiency & Reduce Costs Mergers Acquisitions & Divestitures Acquire & Retain Customers Outsource Non-core Functions Governance Risk Compliance Increase Partner Network Efficiency Increase Business Agility Business Imperatives Legacy Retirement Application ILM Application Consolidation Customer, Supplier, Product Hubs BPO SaaS Risk Mitigation & Regulatory Reporting B2B Integration Zero Latency Operations IT Initiatives IT Projects Data Migration Database Archiving Master Data Management Data Synchronization B2B Data Exchange Data Consolidation Complex Event Processing Ultra Messaging
    4. 4. Informatica for Big Data Integration Saved millions annually by improving trucking operations and empowering business with Hadoop-based free-form questions using sensor, mobile and geospatial data Unite operations across 200 brands over 100+ countries through migration of business data from five systems to one Deliver 5x faster & direct access to customer, risk, claims data in variety of sources – DW, 16 legacy, 30000 data marts, 10M claims via data feeds at 1/3 of the cost Business Imperatives Big Data Warehousing & Operational BI Big Data Services Big Data Archiving Social /Big Data Synchronization Big Data Consolidation Complex Event Processing Turned human review into automated alerts in seconds for maritime security – through geospatial and video tracking Deliver cloud access to 177+ million businesses worldwide and 53 million contacts. D&B 360 app updates with linkedin and twitter Increased monthly slot revenues by 4% while expanding target customer segments from 40 to 160 across 500 sources in real-time with social and machine data 25% savings in data center footprint ($1M+) reduce latency by 83 percent to 340 microseconds, enabling a 580 percent increase in throughput over 1B transactions per day and growing Ultra messaging Real-Time Customer View Big Data Collection & Aggregation Reduce Time to Market by 90% by On-Boarding New Data Sources Faster and enabling a wide variety of Data Formats Rationalized application portfolio and saved $1 million with 6 month payback. Reduced age of data by 87% for service monitoring & pattern identification of large scale data Deliver Analytical Insight Improve Business Processes Improve Efficiency & Reduce Costs Mergers Acquisitions & Divestitures Acquire & Retain Customers Outsource Non-core Functions Governance Risk Compliance Increase Partner Network Efficiency Increase Business Agility
    5. 5. Cloud Computing Enterprise Partner Trading Network (B2B) Information Infrastructure Data Infrastructure Business Value through Trustworthy, Actionable, Authoritative Information Assets ILM Enterprise Data Integration B2B Data Exchange EDI NACHA HIPAA Ultra Messaging Ultra Messaging Cloud Data Integration Trust Profile Act Sense Govern Model Complex Event Processing Master Data Management Data Quality Ultra Messaging
    6. 6. Big Data
    7. 7. WHERE PAST FUTURE WHAT HOW Mobile Nexus Of Secular Technology Megatrends Reinventing The Computer Industry On-Premise Transactions Desktops Interactions Cloud
    8. 8. Defining Big Data Definition: Big data is the confluence of the three trends consisting of Big Transaction Data, Big Interaction Data and Big Data Processing Online Transaction Processing (OLTP) Online Analytical Processing (OLAP) & DW Appliances Social Media Data Other Interaction Data Scientific Machine/Device BIG TRANSACTION DATA BIG INTERACTION DATA BIG DATA PROCESSING BIG DATA INTEGRATION
    9. 9. Big Transaction Data Maximize availability and performance of big transaction data All data including OLTP, OLAP and DW appliances Reliable, complete information No data discarded Greater confidence Continuous innovation Database Warehouse Appliances Universal Access Uncover new areas for growth & efficiency Better Actions & Operations Near-Universal Connectivity to Big Transaction Data
    10. 10. Big Interaction Data Achieve a complete view with social and interaction data What influence does she have with her family and friends? How connected is she? What will she do with this merchandise? Any additional services? Turn insights on relationships, influences and behaviors Into opportunities ? Databases Call Detailed Records, Image Files, RFIDs External Data Providers Applications Customer Product … Informatica MDM Connectivity to Big Interaction Data including social data
    11. 11. Weblogs, Mobile Data, Sensor Data Enterprise Applications Semi-structured Unstructured Big Data Processing Unleash the Power of Hadoop Cloud Applications, Social Data Databases, Data Warehouses Hadoop Cluster Sentiment Analysis Fraud Detection Predictive Analytics Portfolio & Risk Analysis Smart Devices Parse & Prepare Data Load Data Read & Deliver Data Transform & Analyze Data Monitor & Manage Orchestrate Workflows
    12. 12. Value of Big Data Integration Unleash the full business potential of Big Data to empower the data-centric enterprise
    13. 13. Hadoop
    14. 14. Big Data Processing: What does Hadoop do?
          • Complex data analytics
          • Store large amounts of data
          • Scaling through distributed processing
          • Cost advantage
          • Power of Open Source community
    15. 15. Hadoop Related Use-Cases (Meta Use Case / Use Case: Description)
          High Volume Analytics
            Customer Churn Analysis: Predictive analytics of weblogs to understand user behavior.
            Risk Analysis: Massive modeling and data generation to understand what-if scenarios and total assets needed to cover various positions.
            ETL on Hadoop: Data processing and transformation prior to loading into data warehouses for analytics.
            Defect Tracking and Device Monitoring: Device log file analysis to find root cause of issues or patterns of defects.
          Sentiment Analysis
            Sentiment Analysis: Social media data mining combined with transaction data to understand customer sentiments.
            Marketing Campaign and Ad Analysis: Mining of clickstream and log data to understand campaign and offer effectiveness.
          Interaction Analysis
            Fraud Analysis: Clickstream, log mining and web scraping to understand fraudulent behaviors.
          Data storage
            Data staging and archive: Archive data for temporary or permanent storage.
    16. 16. Weblogs, Mobile Data, Sensor Data Enterprise Applications Semi-structured Unstructured Big Data Processing Unleash the Power of Hadoop Cloud Applications, Social Data Databases, Data Warehouses Hadoop Cluster PowerExchange for Hadoop B2B Data Transformation for Hadoop Sentiment Analysis Fraud Detection Predictive Analytics Portfolio & Risk Analysis Smart Devices Parse & Prepare Data Load Data Read & Deliver Data Transform & Analyze Data Monitor & Manage Orchestrate Workflows
    17. 17. Tackling Diversity of Big Data
          Diagram labels: Svc Repository; social, device/sensor, scientific; PIG, EDW, MDM; Flat Files & Documents, Interaction data, Industry Standards, XML; Delimited, Positional, Name = Value. Captions: The broadest coverage for Big Data; Productivity (visual parsing environment, predefined translations); Any DI/BI architecture.
          1. Developer uses Studio to develop a transformation.
          2. Developer deploys transformation to local service repository (directory). All files needed for the transformation are moved.
          3. To deploy to the server, this service folder is moved to the server via FTP, copy, script, etc. NOTE: If the server file system is mountable from the developer machine directly, then step 2 would deploy directly to the server.
          4. The DT engine can immediately use this service to process data.
          The DT Engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, web services.
          • For simple integration, a command line interface is available to invoke services.
          • Internal custom applications can embed transformation services using the various APIs.
          • PowerCenter leverages DT via the Unstructured Data Transformation (UDT). This is a GUI transformation widget in PowerCenter which wraps around the DT API and engine.
          • DT can also be embedded in other middleware technologies. For some (WBIMB, WebMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments. For others the API layer can be used directly.
          DT can be invoked in two general ways:
          • Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also directly write to the filesystem.
          • The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers which are returned to the calling application.
          Though not shown below, the engine fully supports multiple input and output files or buffers as needed by the transformation.
          Engine invocation is a shared library. The DT engine runs fully within the process of the calling application. It is not an external engine. This removes any overhead from passing data between processes, across the network, etc. The engine is also dynamically invoked and does not need to be ‘started up’ or maintained externally.
          The DT engine is also thread-safe and re-entrant. This allows the calling application to invoke DT in multiple threads to increase throughput. A good example is DT’s support of PowerCenter partitioning to scale up processing.
          As shown below, the actual transformation logic is completely independent of any calling application. This means you can develop a transformation once, and leverage it in multiple environments simultaneously, resulting in reduced development and maintenance times and lower impact of change.
    18. 18. Device generated data: Telco example
          The Challenge:
          • Support multiple standards, versions, and manufacturer-specific extensions of call detail record and XML topology data
          • Securely and reliably transfer large volumes of data from staging area to the enterprise
          • Manage and monitor data aggregation / collection process to enable analytics using Hadoop
          Diagram labels: HDFS; Map reduce; Data Exchange; Call Detail Record (CDR) analytics; Node topology analytics; Binary ASN.1/XML topology; MFT; Firewall; DT
    19. 19. External data Channel data analytics over time example HDFS Map reduce Future predictive analytic of channel information Channel/Customer Data Analytics over time of very large amount of data via multiple dimensions: POS, Customer, Product Feedback etc The Challenge DT
    20. 20. High-Level Technical Directions
          Universal data access; Metadata management and auditability; Processing in Hadoop; Data quality and data governance; Data parsing and exchange; High throughput data provisioning
          • Easily integrate diversity of Big Data and make sense of it all
          • Govern and audit Big Data
          • Arm business with right data with high performance data processing and provisioning
    21. 21. Key Takeaways
          • Informatica for Big Data uniquely empowers the data-centric enterprise
          • Big Data integration turns Big Data challenges into big opportunities
          • Informatica continues our pioneering efforts in pushing the frontiers of data integration.
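Slide 17 states that the DT engine is thread-safe and re-entrant, so a calling application can invoke it from multiple threads, with PowerCenter partitioning given as the example. The call pattern can be sketched as follows. This is a generic Python illustration, not PowerCenter or DT code: `transform`, `process_partition`, and `process_partitioned` are hypothetical names, and the upper-casing step is a placeholder for a real transformation. Note that in CPython this shows the invocation pattern rather than a guaranteed speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def transform(record: str) -> str:
    """Stand-in for a re-entrant transformation: it reads only its argument
    and keeps no shared mutable state, so concurrent calls cannot interfere."""
    return record.strip().upper()


def process_partition(partition: List[str]) -> List[str]:
    """Apply the transformation to every record in one partition."""
    return [transform(record) for record in partition]


def process_partitioned(partitions: List[List[str]], workers: int = 4) -> List[List[str]]:
    """Hand each partition to its own worker thread (hypothetical caller).

    ThreadPoolExecutor.map returns results in the order of the inputs,
    so output partitions line up with input partitions.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_partition, partitions))


if __name__ == "__main__":
    print(process_partitioned([["a ", "b"], ["c"]]))
```

The design point is that the caller, not the engine, decides the degree of parallelism; the transformation only has to avoid shared state.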