
Track B-1: Building a Next-Generation Intelligent Data Platform

Speaker: 尹寒柏, Chief Technical Consultant, Informatica Greater China

  1. 1. Building a Next-Generation Intelligent Data Platform. 尹寒柏 (Bob Yin), Senior Product Specialist
  2. 2. [Chart] TECHNOLOGY: Mainframe (OS/360), Client-Server, Web, Social, Internet of Things, Cloud. Years shown: 1960s-1970s, 1980s, 1990s, 2011, 2014, 2007. USERS: Few Employees, Many Employees, Customers/Consumers, Business Ecosystems, Communities & Society, Devices & Machines. VALUE: Front Office Productivity, Back Office Automation, E-Commerce, Line-of-Business Self-Service, Social Engagement, Real-Time Optimization. Powers of ten shown: 10^2, 10^4, 10^6, 10^7, 10^9, 10^11. Axis groups: TECHNOLOGIES, SOURCES, BUSINESS.
  3. 3. What are your Business Initiatives related to Big Data? Financial Services: Fraud Detection; Risk & Portfolio Analysis; Investment Recommendations. Retail & Telco: Proactive Customer Engagement; Location Based Services. Manufacturing: Connected Vehicle; Predictive Maintenance. Healthcare & Pharma: Predicting Patient Outcomes; Total Cost of Care; Drug Discovery. Public Sector: Health Insurance Exchanges; Public Safety; Tax Optimization; Fraud Detection. Media & Entertainment: Online & In-Game Behavior; Customer X/Up-Sell.
  4. 4. 80% of the work in big data projects is data integration and data quality “80% of the work in any data project is in cleaning the data” “70% of my value is an ability to pull the data, 20% of my value is using data-science…” “I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”
  5. 5. What Are Your Primary Concerns About Using Big Data Software? Big data expertise is scarce and expensive; Data warehouse appliance platforms are expensive; We aren’t sure how big data analytics will create business opportunities; Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases; Our data’s not accurate; Hadoop and NoSQL technologies are hard to learn. Chart values as extracted: 38%, 33%, 31%, 22%, 21%, 17%. Source: InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012.
  6. 6. Staff Projects with Readily Available Skills •  Informatica Developers are Hadoop Developers Hand-coding A large global bank grew staff from 2 Java developers to 100 Informatica developers after implementing Informatica Big Data Edition Careerbuilder.com found in a survey there were 27,000 requests for Hadoop skills and only 3,000 resumes with Hadoop skills – whereas there are over 100,000 trained Informatica developers globally.
  7. 7. Increase Developer Productivity •  Informatica Developers are up to 5x more productive 4 weeks 4 days! 2X performance! Vs. Hadoop Hand-coders Informatica developers Informatica Developers are 5x more productive based on customer POCs
  8. 8. Why Informatica for Big Data & Hadoop (Informatica on Hadoop: Why Customers Care). Visual development environment: Increase productivity up to 5x over hand-coding. 100K+ trained Informatica developers globally: Use existing & readily available skills for big data. 200+ high-performance connectors (legacy & new): Move all types of customer data into Hadoop faster. 100+ pre-built transforms for ETL & data quality: Provide broadest out-of-box transformations on Hadoop. 100+ pre-built parsers for complex data formats: Analyze and integrate all types of data faster. Vibe “Map Once, Deploy Anywhere” virtual data machine: An insurance policy as new data types and technologies change. Reference architectures to get started: Accelerate customer success with proven solution.
  9. 9. Unleash the Power of Hadoop Informatica Developers are Now Hadoop Developers Archive Profile Parse Cleanse ETL Match Stream Load Load Services Events Replicate Topics Machine Device, Cloud Documents and Emails Relational, Mainframe Social Media, Web Logs Data Warehouse Mobile Apps Analytics & Op Dashboards Alerts Analytics Teams
  10. 10. Maximize Your Return On Big Data. Transactions, OLTP, OLAP Social Media, Web Logs Documents, Email Machine Device, Scientific Data Warehouse MDM Operational Systems Analytical Systems Data Assets Data Products Data Mart ODS OLTP OLTP Access & Ingest Parse & Prepare Discover & Profile Transform & Cleanse Extract & Deliver Manage (i.e. Security, Performance, Governance, Collaboration) & other NoSQL
  11. 11. Hadoop complements your existing infrastructure
  12. 12. Data Warehouse MDM Applications Data Ingestion and Extraction •  Moving terabytes of data per hour Replicate Streaming Batch Load Extract Archive Extract Low Cost Store Transactions, OLTP, OLAP Social Media, Web Logs Documents, Email Industry Standards Machine Device, Scientific
  13. 13. Access All Types of Data • 200+ High Performance Connectors, Pre-built Parsers for Specialized Data Formats WebSphere MQ JMS MSMQ SAP NetWeaver XI JD Edwards Lotus Notes Oracle E-Business PeopleSoft Oracle DB2 UDB DB2/400 SQL Server Sybase ADABAS Datacom DB2 IDMS IMS Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP Informix Teradata Netezza ODBC JDBC VSAM C-ISAM Binary Flat Files Tape Formats… Web Services TIBCO webMethods SAP NetWeaver SAP NetWeaver BI SAS Siebel Flat files ASCII reports HTML RPG ANSI LDAP EDI–X12 EDI-Fact RosettaNet HL7 HIPAA ebXML HL7 v3.0 ACORD (AL3, XML) XML LegalXML IFX cXML AST FIX SWIFT Cargo IMP MVR Salesforce CRM Force.com RightNow NetSuite ADP Hewitt SAP By Design Oracle OnDemand Facebook Twitter LinkedIn Kapow Datasift Pivotal Vertica Netezza Teradata Aster Messaging, and Web Services Relational and Flat Files Mainframe and Midrange Unstructured Data and Files MPP Appliances Packaged Applications Industry Standards XML Standards SaaS/BPO Social Media
  14. 14. Cloud of Connectors
  15. 15. Real-Time Data Collection and Streaming. Ultra Messaging Bus Publish/Subscribe. Leverage High Performance Messaging Infrastructure: Publish with Ultra Messaging for global distribution without additional staging or landing. Sources: Web Servers, Operations Monitors, rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, etc.; Handhelds, Smart Meters, etc.; Discrete Data Messages, MQTT; Internet of Things, Sensor Data. Targets: HDFS; PowerCenter Real-Time Edition, Rulepoint (CEP); NoSQL Databases: Cassandra. Zookeeper Management and Monitoring. Node Node Node Node Node Node. Transformations: Filtering, Timestamp, Static Text, Custom
  16. 16. Informatica Vibe Data Stream for Machine Data • High performance/efficient streaming data collection over LAN/WAN • GUI interface provides ease of configuration, deployment & use • Continuous ingestion of real-time generated data (sensors; logs; etc.). Machine generated & other data sources • Enable real-time interactions & response • Real-time delivery directly to multiple targets (batch/stream processing) • Highly available; efficient; scalable • Available ecosystem of light weight agents (sources & targets)
  17. 17. Streaming Analytics Complex Event Processing
  18. 18. NoSQL Support for HBase. Read from HBase as standard source. Write to HBase as standard target. Complete Mapping with HBase Src/Tgt can execute on Hadoop. Sample HBase column families (Stored in JSON/complex formats)
  19. 19. NoSQL Support for MongoDB Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse) Access, integrate, transform, & ingest data into MongoDB Sampling MongoDB data & flattening it to relational format
  20. 20. Big Data Parser. Graphical representation highlighting data, segments, separators, and missing or invalid data. Easy Deployment of Industry Standards: Import pre-built industry libraries and easily customize for specific needs. Support of Healthcare industry standards and more. Libraries are constantly maintained to ensure continued compliance.
  21. 21. Big Data Parser on Taobao
  22. 22. Hadoop Data Profiling Results. Hadoop Data Profiling results – exposed to anyone in enterprise via browser. CUSTOMER_ID example, COUNTRY CODE example. 1. Profiling Stats: Min/Max Values, NULLs, Inferred Data Types, etc. Stats to identify outliers and anomalies in data. 2. Value & Pattern Analysis of Hadoop Data: Value and Pattern Frequency to isolate inconsistent/dirty data or unexpected patterns. 3. Drilldown Analysis (into Hadoop Data): Drill down into actual data values to inspect results across entire data set, including potential duplicates.
  23. 23. Execute Data Quality on Hadoop • Big Data cleansing, deduplication, parsing. Address Validation: Address Validation and Geocoding enrichment across 260 countries. Matching: Probabilistic or Deterministic Matching. Standardize: Standardization and Reference Data Management. Parsing: Parsing of Unstructured Data/Text Fields. All types of data (customer/product/social/logs). DQ logic pushed down/run natively ON Hadoop
  24. 24. Data Quality Taiwan Address
  25. 25. Cross-language matching. Arabic: Abdulaziz A/Rahman Al Sugair, عبدالعزيز عبدالرحمن الصقير, Abd. A.Rhman Hammed Al-Shuqair, عبدالله عبدالرحمن حمد الشقي, عبدالعزيز عبدالله الشقير, عبدالعزيز بن محمد الصق, Abdulrahman Abdullah A.Alshegri. Japanese: Toyotomi Hideyoshi 豊臣秀吉 トヨトミヒデヨシ とよとみひでよし 上本町207 シャトー上本町303 シャトー上本町303 兵庫県 小野市 上本町207 上本町303 シャトー上本町33 兵庫県 野市
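  26. 26. Cross-language matching example: Traditional Chinese to Simplified Chinese; Simplified Chinese to English; Simplified Chinese to English (Cantonese)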
  27. 27. Data Integration & Quality on Hadoop. 1. Entire Informatica mapping translated to Hive Query Language. 2. Optimized HQL converted to MapReduce & submitted to Hadoop cluster (job tracker). 3. Advanced mapping transformations executed on Hadoop through User Defined Functions using Vibe MapReduce UDF. Hive-QL: FROM ( SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx FROM lineitem GROUP BY L_ORDERKEY ) T1 JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY) JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY) JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY) WHERE nation.N_NAME = 'UNITED STATES' ) T2 INSERT OVERWRITE TABLE TARGET1 SELECT * INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
  28. 28. Configure Mapping for Hadoop Execution No need to redesign mapping logic to execute on either Traditional or Hadoop infrastructure. Configure where the integration logic should run – Hadoop or Native
  29. 29. Mixed Workflow Orchestration. One workflow running tasks on Hadoop and local environments. Tasks: Cmd_Choose LoadPath, MT_Load2Hadoop + Parse, Cmd_Load2 Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_Data Analysis, Notification. List of variables (Name, Type, Default Value, Description): $User.LoadOptionPath, Integer, 2, Load path for workflow, depending on output of cmd task; $User.DataSourceConnection, String, HiveSourceConnection, Source connection object; $User.ProfileResult, Integer, 100, Output from “profiling” command task.
  30. 30. Full traceability from workflow to MapReduce jobs View generated Hive scripts Unified Administration Single Place to Manage & Monitor
  31. 31. Map Once. Deploy Anywhere. ON PREMISE HADOOP 3rd PARTY APPLICATIONS CLOUD
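
The profiling steps named on slide 22 (min/max values, NULL counts, inferred data types, and value and pattern frequency) can be illustrated with a small sketch. This is not Informatica's implementation and does not run on Hadoop; it is plain Python over an in-memory list, and the function names, the pattern alphabet (9 for digits, A for letters) and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def infer_type(values):
    """Return 'integer', 'float', or 'string' for a list of non-null strings."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "float"
    return "string"


def to_pattern(value):
    """Map digits to 9 and letters to A; keep other characters unchanged."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Profile one column given as a list of strings, where None marks a NULL."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        # min/max compare strings lexicographically in this sketch
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "inferred_type": infer_type(non_null),
        "value_frequency": Counter(non_null),
        "pattern_frequency": Counter(to_pattern(v) for v in non_null),
    }


# Hypothetical sample values for a CUSTOMER_ID-style column.
sample = ["C001", "C002", None, "C002", "17", "c-03"]
report = profile_column(sample)
print(report["nulls"], report["inferred_type"])
print(report["pattern_frequency"].most_common())
```

In this sample the dominant pattern is `A999`; the rarer patterns `99` and `A-99` are the kind of inconsistent values that the drill-down step would then inspect, and the repeated `C002` shows up in the value frequency as a potential duplicate.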
