Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Speaker: Informatica Senior Product Consultant | 尹寒柏

Session abstract: In the Big Data era, the contest is not about the quantity of data but about how deeply you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once little more than a buzzword, into a verb: moving from BI to CI, connecting with the pulse of the consumer economy, and gaining insight into customer intent. One mindset matters in the Big Data era, though: in the end the competition is not only about growing data volumes, but about who understands the data more deeply. Informatica is the answer. It relieves the enormous pressure on enterprises to deliver trusted data in a timely way, and as data volume and complexity keep rising, it provides faster data-integration technology that makes data meaningful and usable for improving efficiency, raising quality, ensuring certainty, and playing to the organization's strengths. Informatica offers a faster and more effective way to reach this goal, and is SYSTEX Group's (精誠集團) tool of choice in the Big Data era.

Transcript of "Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution"

  1. Senior Product Specialist, Big Data and Hadoop
  2. The Challenge: data fragmentation becomes the barrier to business success. (Timeline diagram: technology eras from mainframe (OS/360, 1960s-1970s) and client-server (1980s) through web (1990s), e-commerce (2007), social (2011), and the Internet of Things and cloud (2014); users grow from a few employees (about 10^2) to business ecosystems, communities and society, and devices and machines (about 10^11); business value shifts from back-office automation and front-office productivity to e-commerce, line-of-business self-service, social engagement, and real-time optimization.)
  3. Big Data Challenges: Volume, Variety, Velocity, Veracity. (Diagram: source data such as transactions/OLTP/OLAP, the enterprise data warehouse, social media and web logs, machine/device/scientific data, and documents and emails flows through batch ETL into many separate data marts and analytic systems, prompting the questions "Where is the data I need?" and "Can I trust this data?")
  4. 80% of the work in big data projects is data integration and quality.
     "I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."
     "80% of the work in any data project is in cleaning the data."
     "70% of my value is an ability to pull the data, 20% of my value is using data-science…"
  5. Why Informatica for Big Data & Hadoop
  6. PowerCenter Big Data Edition (9.6, built on the Vibe™ virtual data machine).
     Big Transaction Data:
     • Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
     • Online Analytical Processing (OLAP) & DW appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
     • Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …
     Big Interaction Data:
     • Social media & web data: Facebook, Twitter, LinkedIn, YouTube, …
     • Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical devices, sensors/meters, RFID tags, CDR/mobile, …
     • Web applications: blogs, discussion forums, communities, partner portals, …
     Big Data Processing: universal data access, high-speed data ingestion and extraction, ETL on Hadoop, profiling on Hadoop, complex data parsing on Hadoop, entity extraction and data classification on Hadoop, no-code productivity, business-IT collaboration, unified administration.
  7. Get Data Into and Out of Hadoop: PowerExchange for Hadoop, Replication to Hadoop, Streaming to Hadoop, Data Archiving to Hadoop.
  8. Data Ingestion and Extraction: moving terabytes of data per hour. (Diagram: replicate, stream, batch load, extract, and archive data between Hadoop (the low-cost store) and sources/targets such as the data warehouse, MDM, applications, transactions/OLTP/OLAP, social media and web logs, documents and email, industry standards, and machine/device/scientific data.)
  9. PowerExchange Connectors (✔ = accessible in real time and/or via Change Data Capture (CDC)):
     • Enterprise applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, NetSuite, Microsoft Dynamics
     • Databases and data warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW ✔, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle ✔, SQL Server ✔, Sybase, Teradata
     • Messaging systems: JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔
     • Technology standards: Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML
     • Mainframe: Adabas for z/OS ✔, Datacom ✔, DB2 for z/OS and z/Linux ✔, IDMS ✔, IMS DB ✔, Oracle for z/Linux ✔, Teradata, WebSphere MQ for z/Linux ✔, VSAM ✔
     • Big data: Aster Data, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio
     • Social: Facebook, Twitter, LinkedIn, DataSift, Kapow
     • NoSQL and Hadoop: MongoDB, HDFS, Hive, HBase
  10. NoSQL Support for HBase
     • Read from HBase as a standard source
     • Write to HBase as a standard target
     • A complete mapping with an HBase source/target can execute on Hadoop
     • Sample HBase column families (stored in JSON/complex formats)
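To make the "standard source/target" idea concrete, the sketch below reads and writes an HBase column family with the generic happybase Python client. It only illustrates HBase access in general, not the Informatica PowerExchange adapter, and the host, table, and column names are assumptions.

import happybase

# Connect to an HBase Thrift gateway (host and table name are assumptions).
connection = happybase.Connection('hbase-thrift-host', port=9090)
table = connection.table('customer_profile')

# Read: scan one column family and print each cell; values may themselves be JSON.
for row_key, cells in table.scan(columns=[b'profile']):
    print(row_key, {col: val for col, val in cells.items()})

# Write: put a cell back into HBase as a "target" in the same way.
table.put(b'cust-0001', {b'profile:country': b'TW'})

connection.close()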
  11. NoSQL Support for MongoDB
     • Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
     • Access, integrate, transform, and ingest data into MongoDB
     • Sample MongoDB data and flatten it to relational format
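The "flatten to relational format" step can be pictured with plain PyMongo: nested documents are sampled and projected into flat rows before being handed to a relational or Hadoop target. The connection string, collection, field names, and output layout below are assumptions for illustration, not the product's behaviour.

from pymongo import MongoClient
import csv

client = MongoClient('mongodb://localhost:27017')   # connection string is an assumption
orders = client['shop']['orders']                   # hypothetical database and collection

with open('orders_flat.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['order_id', 'customer', 'item', 'qty'])
    # Sample a handful of documents and flatten the nested line-item array into rows.
    for doc in orders.find().limit(100):
        for item in doc.get('items', []):
            writer.writerow([doc['_id'], doc.get('customer'), item.get('sku'), item.get('qty')])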
  12. IDR for Replicating to Hadoop
     Supported distributions:
     • Apache: 0.20.203.x, 0.20.204.x, 0.20.205.x, 0.23.x, 1.0.x, 1.1.x, 2.x.x
     • Cloudera: CDH3, CDH4
     (Diagram: EXTRACT reads from the source system into intermediate files in a Cycle_1.work directory; APPLY loads them into HDFS as one file per table (Table 1 File … Table N File) plus a Schema.ini file.)
  13. Real-Time Data Collection and Streaming
     Leverage a high-performance messaging infrastructure: publish with the Ultra Messaging publish/subscribe bus for global distribution without additional staging or landing.
     • Sources: discrete data messages from web servers, operations monitors, rsyslog, SLF4J, handhelds, smart meters, Internet of Things and sensor data, etc.
     • Targets: HDFS, HBase, real-time analysis and complex event processing, NoSQL databases (Cassandra, Riak, MongoDB)
     • ZooKeeper provides management and monitoring across the nodes.
  14. Informatica Vibe Data Stream for Machine Data
     • High-performance, efficient streaming data collection over LAN/WAN
     • GUI provides ease of configuration, deployment, and use
     • Continuous ingestion of data generated in real time (sensors, logs, etc.), machine-generated and other data sources
     • Enables real-time interactions and response
     • Real-time delivery directly to multiple targets (batch/stream processing)
     • Highly available, efficient, scalable
     • Available ecosystem of lightweight agents (sources and targets)
  15. Predictive Maintenance with Event Processing and Analytics
     United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787. The challenge:
     • 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
     • The "Power by the Hour" leasing model means maintenance costs and service outages fall to UTAS
     • No proactive capability to predict when a safety issue might occur
     • Once-per-day sensor readings are moving to real-time, over-the-air delivery
  16. Archive to Hadoop: compression extends Hadoop cluster capacity.
     • Without Informatica optimized archive: 10 TB of data, replicated 3x = 30 TB on the cluster.
     • With Informatica optimized archive (95% compression): 10 TB compressed to 500 GB, replicated 3x = 1.5 TB.
     • Result: 20x less I/O bandwidth required; 20 min vs. 1 min response time; 8 hours vs. 24 min backup window.
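The capacity arithmetic on this slide is simple enough to check directly; the short Python sketch below reproduces the 30 TB vs. 1.5 TB footprint figures under the stated assumptions (95% compression, 3x HDFS replication).

raw_tb = 10.0            # archived data set, TB
replication = 3          # HDFS replication factor
compression = 0.95       # 95% size reduction claimed for the optimized archive

without_infa = raw_tb * replication                      # 10 TB * 3 = 30 TB on the cluster
with_infa = raw_tb * (1 - compression) * replication     # 0.5 TB * 3 = 1.5 TB

print(f"Without optimized archive: {without_infa:.1f} TB")
print(f"With optimized archive:    {with_infa:.1f} TB")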
  17. Parse and Prepare Data on Hadoop: hParser and XMap.
  18. Parse and Prepare Data on Hadoop (Data Transformation engine)
     1. A developer uses Studio to develop a transformation.
     2. The developer deploys the transformation to a local service repository (a directory). All files needed for the transformation are moved.
     3. To deploy to the server, this service folder is moved to the server via FTP, copy, script, etc. (Note: if the server file system is mountable directly from the developer machine, step 2 can deploy straight to the server.)
     4. The DT engine can immediately use this service to process data.
     The DT engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, and web services. For simple integration, a command-line interface is available to invoke services. Internal custom applications can embed transformation services using the various APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware technologies: for some (WBIMB, WebMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others the API layer can be used directly.
     DT can be invoked in two general ways:
     1. Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also write directly to the filesystem.
     2. The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers that are returned to the calling application.
     Though not shown here, the engine fully supports multiple input and output files or buffers as needed by the transformation. Engine invocation is via a shared library: the DT engine runs fully within the process of the calling application, not as an external engine, which removes the overhead of passing data between processes or across the network. The engine is dynamically invoked and does not need to be started up or maintained externally. The DT engine is also thread-safe and re-entrant, so the calling application can invoke DT in multiple threads to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing. The transformation logic is completely independent of any calling application, so you can develop a transformation once and leverage it in multiple environments simultaneously, reducing development and maintenance time and lowering the impact of change.
     (Diagram: Studio deploys to a service repository read by the DT engine; flat files and documents, interaction data, industry standards, and XML (the broadest coverage for big data: social, device/sensor, scientific) feed any DI/BI architecture such as Pig, EDW, or MDM. Productivity: visual parsing environment, predefined translations.)
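As a purely generic illustration of the two invocation patterns described above (pass file paths vs. pass in-memory buffers), here is a minimal Python sketch around a hypothetical embeddable parser. The class and method names are invented for the example and are not the Informatica DT API.

class EmbeddedParser:
    """Hypothetical in-process parser, standing in for an embeddable engine."""

    def __init__(self, service_name):
        self.service_name = service_name   # e.g. the name of a deployed transformation service

    def process_file(self, in_path, out_path):
        # Pattern 1: the engine opens and writes the files itself.
        with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
            dst.write(self.process_buffer(src.read()))

    def process_buffer(self, data: bytes) -> bytes:
        # Pattern 2: the caller owns the buffers; the engine returns a buffer.
        return data.upper()                # placeholder "transformation"

parser = EmbeddedParser('My_Parser')
parser.process_file('input.txt', 'output.txt')    # file-based invocation
result = parser.process_buffer(b'raw record')     # buffer-based invocation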
  19. Example use cases: Call Detail Records
     • Why Hadoop?
     • CDR means large data sets: every 7 seconds, every mobile phone in the region creates a record
     • Desire to analyze behavior and location to personalize and optimize pricing and marketing
  20. Parse and Prepare Data on Hadoop: how does it work?
     1. Define the parser in the HParser visual studio.
     2. Deploy the parser on the Hadoop Distributed File System (HDFS).
     3. Run HParser to extract data and produce tabular format in Hadoop:
        hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
  21. Profiling and Discovering Data: Informatica Profiling & Data Discovery on Hadoop.
  22. Hadoop Data Profiling Results (CUSTOMER_ID and COUNTRY CODE examples)
     1. Profiling stats: min/max values, NULLs, inferred data types, etc.; statistics to identify outliers and anomalies in the data.
     2. Value & pattern analysis of Hadoop data: value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns.
     3. Drilldown analysis (into Hadoop data): drill down into actual data values to inspect results across the entire data set, including potential duplicates.
     Hadoop data profiling results are exposed to anyone in the enterprise via a browser.
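For a feel of what those profiling statistics are, the snippet below computes a comparable min/max/NULL/inferred-type summary with pandas over a sample extract. It is only a stand-in for the product's profiling; the file name and the COUNTRY_CODE column are assumptions.

import pandas as pd

df = pd.read_csv('customer_sample.csv')   # hypothetical extract pulled from HDFS

profile = pd.DataFrame({
    'inferred_type': df.dtypes.astype(str),
    'nulls': df.isna().sum(),
    'distinct': df.nunique(),
    'min': df.min(numeric_only=True),
    'max': df.max(numeric_only=True),
})
print(profile)

# Value/pattern frequency for one column, e.g. a COUNTRY_CODE field.
print(df['COUNTRY_CODE'].value_counts().head(10))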
  23. Hadoop Data Domain Discovery: finding the functional meaning of data in Hadoop
     • Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data
     • Sensitive data (e.g. SSN, credit card number, etc.), PHI (Protected Health Information), PII (Personally Identifiable Information)
     • Scalable to look for/discover ANY domain type
     • View/share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values
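Conceptually, a data-domain rule is a predicate applied to sampled column values; the sketch below shows the idea with two regex-based rules (SSN-like and credit-card-like patterns) in Python. The patterns and threshold are illustrative assumptions, not Informatica's shipped rules or mapplets.

import re

# Illustrative domain rules: a column is flagged if enough sampled values match the pattern.
DOMAIN_RULES = {
    'SSN': re.compile(r'^\d{3}-\d{2}-\d{4}$'),
    'CREDIT_CARD': re.compile(r'^\d{4}([ -]?\d{4}){3}$'),
}
MATCH_THRESHOLD = 0.8   # assumed: 80% of sampled values must match

def discover_domains(column_values):
    """Return the domain names whose rule matches most of the sampled values."""
    values = [v for v in column_values if v]
    hits = {}
    for name, pattern in DOMAIN_RULES.items():
        matched = sum(1 for v in values if pattern.match(v))
        if values and matched / len(values) >= MATCH_THRESHOLD:
            hits[name] = matched / len(values)
    return hits

print(discover_domains(['123-45-6789', '987-65-4321', '111-22-3333']))   # {'SSN': 1.0}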
  24. Transforming and Cleansing Data: PowerCenter on Hadoop, Data Quality on Hadoop.
  25. No-code visual development environment; preview results at any point in the data flow. PowerCenter developers are now Hadoop developers.
  26. Reuse and Import PC Metadata for Hadoop: import existing PowerCenter artifacts into the Hadoop development environment; validate the import logic before the actual import process to ensure compatibility.
  27. Natural Language Processing: entity extraction & data classification. Train NLP to find and classify entities in unstructured data.
  28. Address Validation & Data Cleansing
  29. Configure Mapping for Hadoop Execution: no need to redesign mapping logic to execute on either traditional or Hadoop infrastructure; configure where the integration logic should run (Hadoop or native).
  30. Data Integration & Quality on Hadoop: Hive-QL
     1. The entire Informatica mapping is translated to Hive Query Language.
     2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
     3. Advanced mapping transformations are executed on Hadoop through user-defined functions using Vibe (MapReduce UDF).
     Generated Hive-QL:
     FROM (
       SELECT
         T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
         customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
       FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx FROM lineitem GROUP BY L_ORDERKEY ) T1
       JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
       JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
       JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
       WHERE nation.N_NAME = 'UNITED STATES'
     ) T2
     INSERT OVERWRITE TABLE TARGET1 SELECT *
     INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
  31. Example Mapping Execution. (Diagram: a mapping reads from an external flat file source, an external relational source, and an HDFS file source; the mapping logic is translated to HQL and submitted to a Hadoop cluster of Linux machines. Relational data is streamed to Hadoop for processing, the local flat file is staged temporarily on HDFS, a lookup file is temp-staged, HDFS file data is read, and the final processed data is loaded into the target HDFS file. An engine repository holds the mapping metadata.)
  32. Orchestrating and Monitoring Hadoop: Informatica Workflow & Monitoring for Hadoop, Metadata Manager for Hadoop, Dynamic Data Masking for Hadoop.
  33. Mixed Workflow Orchestration: one workflow running tasks on Hadoop and local environments.
     Workflow tasks: Cmd_ChooseLoadPath, MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification.
     Workflow variables:
     • $User.LoadOptionPath (Integer, default 2): load path for the workflow, depending on the output of the command task
     • $User.DataSourceConnection (String, default HiveSourceConnection): source connection object
     • $User.ProfileResult (Integer, default 100): output from the "profiling" command task
  34. Unified Administration: a single place to manage and monitor. Full traceability from the workflow to MapReduce jobs; view the generated Hive scripts.
  35. Data Lineage and Business Glossary
  36. Hadoop Architecture Overview
     • PowerCenter on Hadoop
     • Data Quality on Hadoop
     • DT on Hadoop
     • Entity Extraction on Hadoop
     • Profiling on Hadoop
     (Architecture diagram: the PowerCenter SE enterprise grid with PowerCenter services and INFA clients connects through PWX for PC, PWX for HDFS, and PWX for Hive to the Hadoop cluster. The NameNode/Job Tracker and DataNodes 1-3 each carry HDFS, the Infa-Lib and HParser libraries, MapReduce, and RDBMS clients; Hive with its MySQL metastore and the Mercury services handle execution on Hadoop. Sources include transactions/OLTP/OLAP, documents and email, social media and web logs, and machine/device/scientific data.)