Building a Next-Generation Smart Data Platform (建構新世代的智慧數據平台)

尹寒柏 Bob Yin
Senior Product Specialist
Technology evolution of users, value, and data sources:

TECHNOLOGY            ERA           USERS                            BUSINESS VALUE
Mainframe (OS/360)    1960s-1970s   ~10^2  (few employees)           Back-Office Automation
Client-Server         1980s         ~10^4  (many employees)          Front-Office Productivity
Web                   1990s         ~10^6  (customers/consumers)     E-Commerce
Cloud                 2007          ~10^7  (business ecosystems)     Line-of-Business Self-Service
Social                2011          ~10^9  (communities & society)   Social Engagement
Internet of Things    2014          ~10^11 (devices & machines)      Real-Time Optimization
What are your Business Initiatives related to Big Data?
• Financial Services: Fraud Detection; Risk & Portfolio Analysis; Investment Recommendations
• Retail & Telco: Proactive Customer Engagement; Location-Based Services
• Manufacturing: Connected Vehicle; Predictive Maintenance
• Healthcare & Pharma: Predicting Patient Outcomes; Total Cost of Care; Drug Discovery
• Public Sector: Health Insurance Exchanges; Public Safety; Tax Optimization; Fraud Detection
• Media & Entertainment: Online & In-Game Behavior; Customer Cross-/Up-Sell
80% of the work in big data projects is
data integration and data quality
“80% of the work in any data project is in cleaning the data”
“70% of my value is an ability to pull the data, 20% of my value is using data-science…”
“I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”
InformationWeek 2013 Analytics, Business Intelligence and Information
Management Survey of 541 business technology professionals, October 2012
What Are Your Primary Concerns About Using Big Data Software?
38%: Big data expertise is scarce and expensive
33%: Data warehouse appliance platforms are expensive
31%: We aren’t sure how big data analytics will create business opportunities
22%: Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases
21%: Our data’s not accurate
17%: Hadoop and NoSQL technologies are hard to learn
Staff Projects with Readily Available Skills
•  Informatica Developers are Hadoop Developers
Hand-coding
A large global bank grew staff from 2 Java
developers to 100 Informatica developers after
implementing Informatica Big Data Edition
A Careerbuilder.com survey found 27,000 job postings requesting Hadoop skills but only 3,000 resumes listing Hadoop skills, whereas there are over 100,000 trained Informatica developers globally.
Increase Developer Productivity
•  Informatica Developers are up to 5x more productive
4 weeks (Hadoop hand-coders) vs. 4 days (Informatica developers), with 2x performance
Informatica developers are 5x more productive based on customer POCs
Why Informatica for Big Data & Hadoop
Informatica on Hadoop                                   Why Customers Care
Visual development environment                          Increase productivity up to 5x over hand-coding
100K+ trained Informatica developers globally           Use existing & readily available skills for big data
200+ high-performance connectors (legacy & new)         Move all types of customer data into Hadoop faster
100+ pre-built transforms for ETL & data quality        Provide broadest out-of-box transformations on Hadoop
100+ pre-built parsers for complex data formats         Analyze and integrate all types of data faster
Vibe “Map Once, Deploy Anywhere” virtual data machine   An insurance policy as new data types and technologies change
Reference architectures to get started                  Accelerate customer success with proven solution
Unleash the Power of Hadoop
Informatica Developers are Now Hadoop Developers
Functions: Archive · Profile · Parse · Cleanse · ETL · Match · Stream · Load · Replicate
Sources: Relational, Mainframe · Documents and Emails · Social Media, Web Logs · Machine Device, Cloud · Services · Events · Topics
Targets: Data Warehouse · Mobile Apps · Analytics & Op Dashboards · Alerts · Analytics Teams
Maximize Your Return On Big Data
Data Assets (sources): Transactions, OLTP, OLAP · Social Media, Web Logs · Documents, Email · Machine Device, Scientific
Operational Systems: OLTP, ODS — Analytical Systems: Data Warehouse, MDM, Data Mart — Data Products
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver
Manage (i.e. security, performance, governance, collaboration) across Hadoop & other NoSQL
Hadoop complements your existing infrastructure
Data Ingestion and Extraction
•  Moving terabytes of data per hour
Operations: Replicate · Streaming · Batch Load · Extract · Archive / Extract to Low-Cost Store
Sources: Transactions, OLTP, OLAP · Social Media, Web Logs · Documents, Email · Industry Standards · Machine Device, Scientific
Targets: Data Warehouse · MDM · Applications
Access All Types of Data
•  200+ High Performance Connectors, Pre-built Parsers for Specialized Data
Formats
WebSphere MQ
JMS
MSMQ
SAP NetWeaver XI
JD Edwards
Lotus Notes
Oracle E-Business
PeopleSoft
Oracle
DB2 UDB
DB2/400
SQL Server
Sybase
ADABAS
Datacom
DB2
IDMS
IMS
Word, Excel
PDF
StarOffice
WordPerfect
Email (POP, IMAP)
HTTP
Informix
Teradata
Netezza
ODBC
JDBC
VSAM
C-ISAM
Binary Flat Files
Tape Formats…
Web Services
TIBCO
webMethods
SAP NetWeaver
SAP NetWeaver BI
SAS
Siebel
Flat files
ASCII reports
HTML
RPG
ANSI
LDAP
EDI–X12
EDIFACT
RosettaNet
HL7
HIPAA
ebXML
HL7 v3.0
ACORD (AL3, XML)
XML
LegalXML
IFX
cXML
AST
FIX
SWIFT
Cargo IMP
MVR
Salesforce CRM
Force.com
RightNow
NetSuite
ADP
Hewitt
SAP By Design
Oracle OnDemand
Facebook
Twitter
LinkedIn
Kapow
Datasift
Pivotal
Vertica
Netezza
Teradata
Aster
Messaging,
and Web
Services
Relational
and Flat
Files
Mainframe
and Midrange
Unstructured
Data and Files
MPP
Appliances
Packaged
Applications
Industry
Standards
XML
Standards
SaaS/BPO
Social
Media
Cloud of Connectors
Real-Time Data Collection and Streaming
Ultra Messaging Bus (Publish/Subscribe)
Leverage a high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing.
Targets: HDFS
Sources: Web Servers, Operations Monitors (rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, etc.); Handhelds, Smart Meters, etc.; Discrete Data Messages, MQTT; Internet of Things, Sensor Data
Management and Monitoring: ZooKeeper
Processing: PowerCenter Real-Time Edition, RulePoint (CEP)
NoSQL Databases: Cassandra (multi-node cluster)
Transformations: Filtering,
Timestamp, Static Text, Custom
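The in-flight transformations named above (filtering, timestamp, static text) amount to a simple per-record pipeline. The following Python sketch is illustrative only, assuming JSON log lines with a hypothetical "level" field; it is not Vibe Data Stream's actual API:

```python
import json
import time

def filter_records(records, predicate):
    """Filtering: drop records that do not satisfy the predicate."""
    return [r for r in records if predicate(r)]

def add_timestamp(record, now=None):
    """Timestamp: attach an ingestion timestamp to the record."""
    record = dict(record)
    record["ingest_ts"] = now if now is not None else time.time()
    return record

def add_static_text(record, key, text):
    """Static Text: attach a fixed tag, e.g. the source host name."""
    record = dict(record)
    record[key] = text
    return record

def pipeline(lines):
    """Parse JSON log lines, keep only errors, enrich, and emit."""
    records = [json.loads(line) for line in lines]
    records = filter_records(records, lambda r: r.get("level") == "ERROR")
    return [add_static_text(add_timestamp(r, now=0), "source", "web-01")
            for r in records]
```

A custom transformation would slot in as one more function in the chain.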
Informatica Vibe Data Stream for Machine Data
•  High performance/efficient
streaming data collection over
LAN/WAN
•  GUI interface provides ease of
configuration, deployment & use
•  Continuous ingestion of real-time
generated data (sensors; logs;
etc.). Machine generated & other
data sources
•  Enable real-time interactions &
response
•  Real-time delivery directly to
multiple targets (batch/stream
processing)
•  Highly available; efficient;
scalable
•  Available ecosystem of lightweight agents (sources & targets)
Streaming Analytics
Complex Event Processing
NoSQL Support for HBase
Read from HBase as a standard source · Write to HBase as a standard target
A complete mapping with an HBase source/target can execute on Hadoop
Sample HBase column families (stored in JSON/complex formats)
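Conceptually, a row whose column-family cells store JSON can be flattened into relational columns. This is a pure-Python sketch of that idea, with made-up cell names; it does not use the real HBase client API:

```python
import json

def flatten_hbase_row(row_key, cells):
    """Flatten HBase-style cells (b'family:qualifier' -> bytes values)
    into one relational record, expanding JSON-valued cells."""
    record = {"row_key": row_key}
    for key, value in cells.items():
        family, qualifier = key.decode().split(":", 1)
        text = value.decode()
        try:
            parsed = json.loads(text)
        except ValueError:
            parsed = text  # not JSON: keep the raw cell value
        if isinstance(parsed, dict):
            # JSON object: promote each field to its own column
            for k, v in parsed.items():
                record[f"{family}.{qualifier}.{k}"] = v
        else:
            record[f"{family}.{qualifier}"] = parsed
    return record
```

For example, a cell holding `{"city": "Taipei"}` under `cf:address` becomes the column `cf.address.city`.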
NoSQL Support for MongoDB
•  Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
•  Access, integrate, transform & ingest data into MongoDB
•  Sampling MongoDB data & flattening it to relational format
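Flattening a nested MongoDB document into relational rows can be sketched as follows. This is pure Python with no MongoDB driver, and the document fields in the usage example are hypothetical:

```python
def flatten_document(doc, prefix=""):
    """Recursively flatten a nested document into dotted column names;
    arrays are unnested into one row per element."""
    rows = [{}]
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            sub_rows = flatten_document(value, prefix=name + ".")
        elif isinstance(value, list):
            sub_rows = []
            for item in value:
                if isinstance(item, dict):
                    sub_rows.extend(flatten_document(item, prefix=name + "."))
                else:
                    sub_rows.append({name: item})
        else:
            sub_rows = [{name: value}]
        # cross-join accumulated rows with the new column(s)
        rows = [dict(r, **s) for r in rows for s in sub_rows]
    return rows
```

A document with a two-element `orders` array thus yields two relational rows sharing the parent fields.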
Big Data Parser
Graphical representation highlighting data, segments, separators, and missing or invalid data
Easy deployment of industry standards: import pre-built industry libraries and easily customize for specific needs
Support for Healthcare industry standards and more
Libraries are constantly maintained to ensure continued compliance
Big Data Parser on Taobao
Hadoop Data Profiling Results
CUSTOMER_ID example · COUNTRY CODE example
1. Profiling stats (min/max values, NULLs, inferred data types, etc.): identify outliers and anomalies in the data
2. Value & pattern analysis of Hadoop data: use value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns
3. Drilldown analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates
Hadoop data profiling results are exposed to anyone in the enterprise via browser
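The profiling statistics above (NULL counts, min/max, inferred types, value and pattern frequency) reduce to simple aggregations over a column. A minimal pure-Python sketch, not the product's implementation:

```python
import re
from collections import Counter

def profile_column(values):
    """Basic column profile: nulls, min/max, an inferred type,
    and value/pattern frequency tables."""
    non_null = [v for v in values if v not in (None, "")]

    def pattern(v):
        # Pattern analysis: digits -> 9, letters -> X (e.g. 'TW-12' -> 'XX-99')
        return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", str(v)))

    inferred = "integer" if non_null and all(
        str(v).isdigit() for v in non_null) else "string"
    return {
        "nulls": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "inferred_type": inferred,
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern(v) for v in non_null),
    }
```

On a COUNTRY CODE column, an unexpected pattern such as `X9` immediately isolates dirty values like "U1".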
Execute Data Quality on Hadoop
•  Big data cleansing, deduplication, parsing
Address validation and geocoding enrichment across 260 countries
Probabilistic or deterministic matching
Standardization and reference data management
Parsing of unstructured data/text fields for all types of data (customer/product/social/logs)
DQ logic is pushed down and runs natively ON Hadoop
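The deterministic vs. probabilistic distinction can be illustrated with a toy scorer: exact match on a key, or a token-overlap (Jaccard) score on names. Real matching engines use trained models and far richer comparators, so treat the field names and threshold here as illustrative assumptions:

```python
def deterministic_match(a, b, key):
    """Deterministic: exact match on a chosen key (e.g. a national ID)."""
    return a.get(key) is not None and a.get(key) == b.get(key)

def probabilistic_score(a, b):
    """Probabilistic: token-overlap (Jaccard) score over normalized names."""
    ta = set(a.get("name", "").lower().split())
    tb = set(b.get("name", "").lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_duplicate(a, b, threshold=0.5):
    """Flag a pair as a duplicate if either rule fires."""
    return deterministic_match(a, b, "id") or probabilistic_score(a, b) >= threshold
```

The probabilistic path is what lets "Abdulaziz A Rahman Al Sugair" and "abdulaziz al sugair" cluster despite differing spellings.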
Data Quality Taiwan Address
Cross-language matching
Abdulaziz A/Rahman Al Sugair
‫ﺍاﻝلﺹصﻕقﻱيﺭر‬ ‫ﻉعﺏبﺩدﺍاﻝلﺭرﺡحﻡمﻥن‬ ‫ﻉعﺏبﺩدﺍاﻝلﻉعﺯزﻱيﺯز‬
Abd. A.Rhman Hammed Al-Shuqair
‫ﺍاﻝلﺵشﻕقﻱي‬ ‫ﺡحﻡمﺩد‬ ‫ﻉعﺏبﺩدﺍاﻝلﺭرﺡحﻡمﻥن‬ ‫ﻉعﺏبﺩدﺍاﻝلﻝلﻩه‬
‫ﺍاﻝلﺵشﻕقﻱيﺭر‬ ‫ﻉعﺏبﺩدﺍاﻝلﻝلﻩه‬ ‫ﻉعﺏبﺩدﺍاﻝلﻉعﺯزﻱيﺯز‬
‫ﺍاﻝلﺹصﻕق‬ ‫ﻡمﺡحﻡمﺩد‬ ‫ﺏبﻥن‬ ‫ﻉعﺏبﺩدﺍاﻝلﻉعﺯزﻱيﺯز‬
Abdulrahman Abdullah A.Alshegri
Arabic:
Toyotomi Hideyoshi
豊臣秀吉
トヨトミヒデヨシ
とよとみひでよし
上本町207 シャトー上本町303	
シャトー上本町303 兵庫県 小野市	
上本町207 上本町303	
シャトー上本町33 兵庫県 野市	
Japanese:
Cross-language matching example
Traditional ↔ Simplified Chinese
Simplified Chinese ↔ English
Simplified Chinese ↔ English (Cantonese)
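One building block of cross-language matching is script normalization before comparison. For Japanese, the same name may be written in hiragana or katakana (as in the Toyotomi Hideyoshi example above); the hiragana block U+3041–U+3096 maps onto katakana at a fixed +0x60 code-point offset. A sketch of that single normalization step:

```python
def normalize_kana(text):
    """Map hiragana characters onto their katakana equivalents so
    both spellings of a name compare equal; other characters pass through."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x3041 <= code <= 0x3096:      # hiragana range
            out.append(chr(code + 0x60))  # shift into the katakana block
        else:
            out.append(ch)
    return "".join(out)
```

Full cross-language matching layers transliteration and phonetic rules on top of normalizations like this one.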
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
         customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Data Integration & Quality on Hadoop
Hive-QL
1. The entire Informatica mapping is translated to Hive Query Language.
2. Optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions (UDFs) using Vibe.
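Step 3's custom logic runs on the cluster via Hive's TRANSFORM clause (as in the CustomInfaTx call in the query above): Hive streams tab-separated rows to a script's stdin and reads transformed rows back from stdout. The aggregation below is a hypothetical stand-in for such a script, not Informatica's actual UDF:

```python
import sys

def transform(lines):
    """Count line items per order key from tab-separated (orderkey, item)
    rows, emitting 'orderkey<TAB>count' lines in the shape a Hive
    TRANSFORM script streams back to the query."""
    counts = {}
    for line in lines:
        orderkey = line.rstrip("\n").split("\t")[0]
        counts[orderkey] = counts.get(orderkey, 0) + 1
    return [f"{k}\t{v}" for k, v in sorted(counts.items())]

if __name__ == "__main__":
    # Hive invokes the script once per task; rows arrive on stdin.
    for row in transform(sys.stdin):
        print(row)
```

The same streaming contract is how any external logic, not just Java UDFs, can execute inside a Hive job.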
Configure Mapping for Hadoop Execution
There is no need to redesign mapping logic to execute on either traditional or Hadoop infrastructure: simply configure where the integration logic should run, Hadoop or native.
Mixed Workflow Orchestration
One workflow running tasks on Hadoop and local environments
Cmd_Choose
LoadPath
MT_Load2Hadoop
+ Parse
Cmd_Load2
Hadoop
MT_Parse
Cmd_ProfileData MT_Cleanse
MT_Data
Analysis
Notification
List of variables:

Name                         Type     Default Value          Description
$User.LoadOptionPath         Integer  2                      Load path for workflow, depending on output of cmd task
$User.DataSourceConnection   String   HiveSourceConnection   Source connection object
$User.ProfileResult          Integer  100                    Output from “profiling” command task
Full traceability from workflow
to MapReduce jobs
View generated
Hive scripts
Unified Administration
Single Place to Manage & Monitor
Map Once. Deploy Anywhere.
ON PREMISE HADOOP 3rd PARTY
APPLICATIONS
CLOUD
Track B-1: Building a Next-Generation Smart Data Platform (建構新世代的智慧數據平台)
