What are your Business Initiatives related to Big Data?

Financial Services
• Fraud Detection
• Risk & Portfolio Analysis
• Investment Recommendations

Retail & Telco
• Proactive Customer Engagement
• Location-Based Services

Manufacturing
• Connected Vehicle
• Predictive Maintenance

Healthcare & Pharma
• Predicting Patient Outcomes
• Total Cost of Care
• Drug Discovery

Public Sector
• Health Insurance Exchanges
• Public Safety
• Tax Optimization
• Fraud Detection

Media & Entertainment
• Online & In-Game Behavior
• Customer Cross-Sell/Up-Sell
80% of the work in big data projects is data integration and data quality

• “80% of the work in any data project is in cleaning the data”
• “70% of my value is an ability to pull the data, 20% of my value is using data-science…”
• “I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis.”

Source: InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012
What Are Your Primary Concerns About Using Big Data Software?

• Big data expertise is scarce and expensive (38%)
• Data warehouse appliance platforms are expensive (33%)
• We aren’t sure how big data analytics will create business opportunities (31%)
• Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases (22%)
• Our data’s not accurate (21%)
• Hadoop and NoSQL technologies are hard to learn (17%)
Staff Projects with Readily Available Skills
• Informatica developers are Hadoop developers
• A large global bank grew its staff from 2 hand-coding Java developers to 100 Informatica developers after implementing Informatica Big Data Edition
• A Careerbuilder.com survey found 27,000 job postings requesting Hadoop skills but only 3,000 resumes listing them, whereas there are over 100,000 trained Informatica developers globally
Increase Developer Productivity
• Informatica developers are up to 5x more productive than Hadoop hand-coders, based on customer POCs: work that took hand-coders 4 weeks took Informatica developers 4 days, with 2x performance
Why Informatica for Big Data & Hadoop
Informatica on Hadoop, and why customers care:
• Visual development environment: increase productivity up to 5x over hand-coding
• 100K+ trained Informatica developers globally: use existing & readily available skills for big data
• 200+ high-performance connectors (legacy & new): move all types of customer data into Hadoop faster
• 100+ pre-built transforms for ETL & data quality: provide the broadest out-of-box transformations on Hadoop
• 100+ pre-built parsers for complex data formats: analyze and integrate all types of data faster
• Vibe “Map Once, Deploy Anywhere” virtual data machine: an insurance policy as new data types and technologies change
• Reference architectures to get started: accelerate customer success with proven solutions
Unleash the Power of Hadoop
Informatica Developers Are Now Hadoop Developers
[Architecture diagram: sources flow through Informatica operations into Hadoop and on to consumers]
• Sources: relational & mainframe; documents and emails; social media & web logs; machine/device & cloud data
• Operations: archive, replicate, stream, load; profile, parse, cleanse, ETL, match; services, events, topics
• Consumers: data warehouse, analytics & operational dashboards, alerts, mobile apps, analytics teams
Maximize Your Return On Big Data
[Pipeline diagram: from data assets in operational systems to data products in analytical systems]
• Data assets (operational systems): OLTP, ODS, data marts, data warehouse, MDM; source data includes transactions/OLTP/OLAP, social media & web logs, documents & email, machine/device & scientific data
• Pipeline stages: Access & Ingest; Parse & Prepare; Discover & Profile; Transform & Cleanse; Extract & Deliver
• Manage across all stages: security, performance, governance, collaboration
• Data products (analytical systems): data warehouse, MDM, applications, Hadoop & other NoSQL stores
Data Ingestion and Extraction
• Moving terabytes of data per hour
• Capabilities: replicate, streaming, batch load, extract, archive extract to a low-cost store
• Sources: transactions/OLTP/OLAP; social media & web logs; documents & email; industry standards; machine/device & scientific data
Access All Types of Data
• 200+ high-performance connectors and pre-built parsers for specialized data formats

Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC, flat files, ASCII reports
Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, HTML, RPG, ANSI, LDAP
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR, ACORD (AL3, XML)
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift
MPP Appliances: Pivotal, Vertica, Netezza, Teradata, Aster
Real-Time Data Collection and Streaming
• Leverage high-performance messaging infrastructure: publish with Ultra Messaging (publish/subscribe bus) for global distribution without additional staging or landing
[Diagram: sources stream through the Ultra Messaging bus, with transformations applied, into multiple targets; Zookeeper provides management and monitoring]
• Sources: web servers and operations monitors (rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, etc.); handhelds, smart meters, etc.; discrete data messages (MQTT); Internet of Things / sensor data
• Transformations: filtering, timestamp, static text, custom
• Targets: HDFS; PowerCenter Real-Time Edition; RulePoint (CEP); NoSQL databases such as a multi-node Cassandra cluster
Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• GUI provides ease of configuration, deployment & use
• Continuous ingestion of data generated in real time (sensors, logs, etc.) from machine-generated and other data sources
• Enables real-time interactions & response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, scalable
• Available ecosystem of lightweight agents (sources & targets)
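Vibe Data Stream itself is configured through its GUI, but the core idea of a lightweight agent collecting machine-generated records can be sketched in a few lines. The following standalone Python snippet is illustrative only (it is not Informatica code, and the field names are invented): it receives JSON datagrams over UDP, the kind of discrete messages listed above as sources.

```python
import json
import socket

def collect_udp_json(sock: socket.socket, max_records: int) -> list:
    """Receive JSON-encoded UDP datagrams and parse them into records.

    Each datagram is assumed to carry one self-contained JSON object,
    e.g. a sensor reading or a log event.
    """
    records = []
    while len(records) < max_records:
        data, _addr = sock.recvfrom(65535)  # one datagram = one record
        records.append(json.loads(data.decode("utf-8")))
    return records

if __name__ == "__main__":
    # Loopback demo: one socket plays the smart meter, the other the collector.
    collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    collector.bind(("127.0.0.1", 0))
    device = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    device.sendto(json.dumps({"meter": "A1", "kwh": 3.2}).encode(),
                  collector.getsockname())
    print(collect_udp_json(collector, 1))  # [{'meter': 'A1', 'kwh': 3.2}]
```

A production agent would add batching, retries, and delivery to targets such as HDFS; this sketch only shows the ingestion step.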
NoSQL Support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• Complete mappings with HBase sources/targets can execute on Hadoop
• Sample HBase column families (stored in JSON/complex formats)
NoSQL Support for MongoDB
• Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
• Access, integrate, transform & ingest data into MongoDB
• Sampling MongoDB data & flattening it to relational format
• Graphical representation highlighting data, segments, separators, and missing or invalid data
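The flattening step mentioned above can be illustrated without the product. This is a minimal sketch in plain Python (no MongoDB driver; the document fields are invented for the example) that joins nested keys with dots, a common convention for mapping documents onto relational columns:

```python
def flatten(doc: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten a nested dict into a single-level dict with dotted keys,
    suitable for loading into a relational table."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into sub-documents
        else:
            flat[full_key] = value
    return flat

# Example document as it might be sampled from MongoDB:
doc = {"_id": 1, "name": "Acme", "address": {"city": "Austin", "zip": "78701"}}
print(flatten(doc))
# {'_id': 1, 'name': 'Acme', 'address.city': 'Austin', 'address.zip': '78701'}
```

Arrays and type conflicts need additional policy decisions (explode into rows, or serialize), which is where sampling the data first pays off.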
Big Data Parser
Easy Deployment of Industry Standards
• Import pre-built industry libraries and easily customize for specific needs
• Support for healthcare industry standards and more
• Libraries are constantly maintained to ensure continued compliance
Hadoop Data Profiling Results
Profiling results (e.g. for CUSTOMER_ID and COUNTRY CODE columns) are exposed to anyone in the enterprise via a browser:
1. Profiling stats: min/max values, NULLs, inferred data types, etc.; stats to identify outliers and anomalies in the data
2. Value & pattern analysis of Hadoop data: value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns
3. Drill-down analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates
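The statistics behind such a profile are simple to state precisely. A minimal sketch in plain Python, assuming a column arrives as a list of strings with None for NULLs (the sample values are invented; this is not the product's implementation):

```python
from collections import Counter

def profile_column(values: list) -> dict:
    """Compute basic profiling stats: min/max, NULL count, and
    value-pattern frequencies (digits -> '9', letters -> 'A')."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(
        "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in v)
        for v in non_null
    )
    return {
        "min": min(non_null),
        "max": max(non_null),
        "nulls": len(values) - len(non_null),
        "patterns": dict(patterns),
    }

country_codes = ["US", "DE", "U5", None, "FR"]
print(profile_column(country_codes))
# min 'DE', max 'US', 1 null; pattern 'A9' flags the dirty value 'U5'
```

On Hadoop the same aggregations would run distributed over the full data set rather than in memory, but the outputs (min/max, NULL counts, pattern frequencies) are the same kind shown in the profiling results above.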
Execute Data Quality on Hadoop
• Big Data cleansing, deduplication, parsing
• Address validation: address validation and geocoding enrichment across 260 countries
• Standardization: standardization and reference data management
• Parsing: parsing of unstructured data/text fields for all types of data (customer/product/social/logs)
• Matching: probabilistic or deterministic matching
DQ logic is pushed down and runs natively on Hadoop.
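The deterministic versus probabilistic distinction above can be illustrated with a small standalone example. This is a hedged sketch, not the product's matching engine: it uses Python's standard difflib for the fuzzy score, and the threshold and sample names are arbitrary.

```python
from difflib import SequenceMatcher

def deterministic_match(a: str, b: str) -> bool:
    """Exact match after simple standardization (case and whitespace)."""
    return a.strip().lower() == b.strip().lower()

def probabilistic_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: similarity ratio above a tunable threshold,
    so near-duplicates with typos still pair up."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(deterministic_match(" ACME Corp ", "acme corp"))              # True
print(probabilistic_match("Acme Corporation", "Acme Corpration"))   # True
```

Production matching adds blocking (to avoid comparing every pair) and field-weighted scores, but the trade-off is the same: deterministic rules are precise and auditable, probabilistic rules catch typos and variants.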
Data Integration & Quality on Hadoop
1. The entire Informatica mapping is translated to Hive Query Language.
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions using Vibe.

Generated Hive-QL (a multi-insert: the inner subquery runs a custom transformation as a UDF, and the outer query writes two targets):

FROM (
  SELECT
    T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
    customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Configure Mapping for Hadoop Execution
• No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure
• Configure where the integration logic should run: Hadoop or native
Mixed Workflow Orchestration
One workflow running tasks on Hadoop and in local environments.
• Workflow tasks: Cmd_ChooseLoadPath; MT_Load2Hadoop + Parse; Cmd_Load2Hadoop; MT_Parse; Cmd_ProfileData; MT_Cleanse; MT_DataAnalysis; Notification

List of variables:
Name | Type | Default Value | Description
$User.LoadOptionPath | Integer | 2 | Load path for workflow, depending on output of cmd task
$User.DataSourceConnection | String | HiveSourceConnection | Source connection object
$User.ProfileResult | Integer | 100 | Output from “profiling” command task

• Full traceability from workflow to MapReduce jobs
• View generated Hive scripts
Unified Administration
Single Place to Manage & Monitor

Map Once. Deploy Anywhere.
• Deployment targets: on premise, Hadoop, 3rd-party applications, cloud