Designing Data Pipelines for Autonomous and Trusted Analytics
1. Designing data pipelines for autonomous and trusted analytics
Murthy Mathiprakasam, Principal Product Marketing Manager
Sumeet Agrawal, Principal Product Manager
2. Opportunity: New Insights With New Data Sources
In the era of Big Interaction Data, enterprises can drive unprecedented insights by analyzing new sources:
SENSORS, METERS, LOGS, BADGES, WEARABLES, MOBILE
3. Opportunity: Hadoop Is An Efficient, Scalable Platform
Enterprises are adopting Hadoop to augment Data Warehouses and drive more compelling analytical outcomes:
Flexible: DATA MODELING
Scalable: TO LARGE DATASETS
Efficient: BASED ON COMMODITY SERVER/STORAGE
4. Big Data Analytics Hasn’t Kept Up With Pace of Business
Up to 80%: ANALYST TIME SPENT ON DATA PREPARATION
Untimely Delivery: OF DATA INHIBITS AGILE, REAL-TIME DECISIONS
5. Big Data is Hard to Adopt
Can't Re-Use EXISTING SKILLS WHEN PLATFORMS CHANGE
Can't Re-Use EXISTING PROCESSES TO DRIVE SCALABILITY AND REPEATABILITY
6. Big Data Is Difficult to Trust
Incomplete: DATASETS THAT ARE NOT ACCURATE
(e.g., John Smith, 11710 Plaza Drive, Reston, VA ______, (___)-___-____)
Inconsistent: DATASETS THAT ARE NOT STANDARDIZED
(e.g., Jonathan Smith vs. John Smith vs. John H Smith)
Insecure: DATASETS THAT ARE NOT MASKED
(e.g., jsmith@yahoo.com, 703-844-1212 vs. the masked TAYwRG@zcqee.Qew, 194-366-5858)
7. Ultimately, Analysts Are Constrained
Data Silos INHIBIT UNDERSTANDING AND USE OF ENTERPRISE DATA ASSETS
Long Waits FOR ACCESS TO TRUSTED DATA ASSETS
Can't Re-Use DATA ASSETS FOR MORE PERVASIVE ANALYTICS
8. The Need: Speed, Quality, and Agility in Big Data Projects
Insights That Are Timely
Data That Can Be Trusted
Simple, Repeatable, Scalable Delivery
Analyst Productivity and Autonomy
9. Repeatably Deliver Trusted and Timely Data for Big Data Analytics
The Answer: Design for Autonomous Analytics
ACQUIRE FROM DISTRIBUTED SOURCES → Raw Data Swamp → Integrated Data Pool → Governed Data Reservoir → ACCESS TO DISTRIBUTED ANALYSTS
Processing along the way: Profile, Parse, Cleanse, Relate (as Batch or Stream), driven by Data Intelligence
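To make the zone model concrete, here is a purely illustrative Java sketch of records moving from a raw swamp through an integrated pool to a governed reservoir via composable steps. The step functions and the string record type are hypothetical stand-ins, not Informatica's implementation.

import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Conceptual sketch of the zone model: each zone is produced from the
// previous one by small, reusable processing steps.
public class ZonePipeline {
    static Function<String, String> parse   = r -> r.trim();
    static Function<String, String> cleanse = r -> r.replaceAll("\\s+", " ");
    static Function<String, String> relate  = r -> r.toLowerCase(); // e.g., key normalization

    public static void main(String[] args) {
        List<String> rawSwamp = List.of("  John  Smith ", " JOHN   H  SMITH");
        // Raw swamp -> integrated pool: parsed and cleansed records.
        List<String> integratedPool =
                rawSwamp.stream().map(parse).map(cleanse).collect(Collectors.toList());
        // Integrated pool -> governed reservoir: related, standardized records.
        List<String> governedReservoir =
                integratedPool.stream().map(relate).collect(Collectors.toList());
        System.out.println(governedReservoir); // ready for distributed analyst access
    }
}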
10. Data Intelligence
Automate data discovery, preparation, and security
Self-document processes and data relationships
Recommend best actions using machine learning
Combine domain-specific tools with open-source scalability
Rapidly adapt to change without disrupting operations
Automate deployment from insight to action (operationalize)
Business Intelligence : Business Analysts :: Data Intelligence : Data Developers/Architects
11. Put More Data To Use With Near Universal Connectivity
Sample of Compatible Data Types and Sources:
Word, Excel, PDF, StarOffice, Email, LDAP
Oracle, DB2, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC/JDBC, Flat files, HTTP/HTML
RPG, ANSI, AST, FIX, SWIFT, MVR
SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel, JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft
EDI-X12, EDI-Fact, RosettaNet, HL7/HIPAA, XML, LegalXML, IFX, cXML
Salesforce, RightNow, NetSuite, Oracle OnDemand
Facebook, Twitter, LinkedIn, Datasift
ebXML, HL7 v3.0, ACORD
100+ PRE-BUILT PARSERS, 200+ PRE-BUILT CONNECTORS, Out-of-the-Box BUSINESS RULES AND DATA STANDARDIZATION
12. Ensure Highest Data Quality
Example tweet: "Contact Bill.Harison@gmail.com for more information about #AAPL and #GOOG"
Extracted entities: Person: William Harrison; Company: Apple, Inc; Company: Google
EXTRACT ENTITIES WITH NATURAL LANGUAGE PROCESSING
ENRICH DATASETS WITH ADDRESS VALIDATION AND GEOCODING
MATCH AND STANDARDIZE FOR DATA QUALITY AND DATA MASTERING
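As a taste of what such extraction involves, here is a deliberately simplified, rule-based Java sketch. Production NLP engines use trained models and name-variant dictionaries (for example, to resolve "Bill.Harison" to "William Harrison"); the regexes and the ticker-to-company map below are hypothetical illustrations, not Informatica's implementation.

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rule-based entity extraction: pull email addresses and hashtag tickers,
// then resolve tickers to company names via a (hypothetical) lookup table.
public class EntitySketch {
    private static final Pattern EMAIL  = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern TICKER = Pattern.compile("#([A-Z]{1,5})\\b");
    private static final Map<String, String> TICKER_TO_COMPANY =
            Map.of("AAPL", "Apple, Inc", "GOOG", "Google");

    public static void main(String[] args) {
        String text = "Contact Bill.Harison@gmail.com for more information about #AAPL and #GOOG";
        Matcher email = EMAIL.matcher(text);
        while (email.find()) System.out.println("Email: " + email.group());
        Matcher ticker = TICKER.matcher(text);
        while (ticker.find())
            System.out.println("Company: "
                    + TICKER_TO_COMPANY.getOrDefault(ticker.group(1), ticker.group(1)));
    }
}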
13. Discover Data Domains Intelligently
PHI: Protected Health Information; PII: Personally Identifiable Information
Scalable to discover ANY domain type
ANALYZE STRUCTURE OF DATA WITH BUILT-IN DATA PROFILING
ISOLATE BAD DATA QUICKLY WITH PROFILING STATISTICS
UNDERSTAND MEANING AND CONTEXT OF DATA
IDENTIFY SENSITIVE DATA WITH DATA DOMAIN REPORTS
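A minimal Java sketch of what domain discovery boils down to: profile a column's basic statistics, then test values against pattern rules and flag likely sensitive domains. The domain rules and sample data here are hypothetical; real profilers use far richer rule sets and reference data.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Column profiling plus rule-based data-domain discovery.
public class ColumnProfiler {
    private static final Map<String, Pattern> DOMAINS = Map.of(
            "EMAIL (PII)",    Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+"),
            "US_PHONE (PII)", Pattern.compile("\\d{3}-\\d{3}-\\d{4}"),
            "US_SSN (PII)",   Pattern.compile("\\d{3}-\\d{2}-\\d{4}"));

    public static void profile(String column, List<String> values) {
        long nulls = values.stream().filter(v -> v == null || v.isBlank()).count();
        long distinct = values.stream().distinct().count();
        System.out.printf("%s: %d rows, %d null/blank, %d distinct%n",
                column, values.size(), nulls, distinct);
        // Flag any domain whose pattern matches at least one value.
        for (Map.Entry<String, Pattern> d : DOMAINS.entrySet()) {
            long hits = values.stream()
                    .filter(v -> v != null && d.getValue().matcher(v).matches()).count();
            if (hits > 0)
                System.out.printf("  domain %s: %.0f%% match -> flag for review%n",
                        d.getKey(), 100.0 * hits / values.size());
        }
    }

    public static void main(String[] args) {
        profile("contact", Arrays.asList("jsmith@yahoo.com", "703-844-1212", "", "jdoe@corp.com"));
    }
}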
14. Manage Metadata and Data Lineage In-Depth
TRACK DATA LINEAGE FROM DATA SOURCE THROUGH HADOOP TO DATA TARGET
Metadata: Business and Technical
ENSURE GREATER UNDERSTANDING WITH METADATA MANAGEMENT AND BUSINESS GLOSSARY
15. Mask Sensitive Data
MASK IN REAL TIME BASED ON ROLE
MASK FASTER WITH PRE-BUILT RULES
MASK SENSITIVE DATA WITH REPEATABLE TRANSFORMATIONS
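One common way to get repeatable masking is a keyed hash: the same input always yields the same masked token, so joins and counts still line up across masked datasets. The Java sketch below uses HMAC-SHA256 for this; it is an illustration of the general technique, not necessarily how any specific product implements it, and the key and truncation length are hypothetical choices.

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Deterministic ("repeatable") masking via a keyed hash.
public class RepeatableMasker {
    private final Mac mac;

    public RepeatableMasker(byte[] secretKey) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
    }

    // Same input -> same token, so referential integrity survives masking.
    public synchronized String mask(String value) {
        byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
        // Truncated for readability; keep more bytes to reduce collision risk.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(digest).substring(0, 12);
    }

    public static void main(String[] args) throws Exception {
        RepeatableMasker m =
                new RepeatableMasker("demo-key-change-me".getBytes(StandardCharsets.UTF_8));
        System.out.println(m.mask("jsmith@yahoo.com")); // deterministic token
        System.out.println(m.mask("jsmith@yahoo.com")); // identical to the line above
    }
}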
16. Manage Relationships For All Data
Flow: (1) Data Source → (2) MDM → (3) Services
On HDFS, a Fuzzy Index supports Ingest → Transform → Match & Link
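The core of match-and-link is a two-stage comparison: a coarse "fuzzy" blocking key narrows the candidate set, then a finer similarity measure confirms links. The Java sketch below uses a surname-prefix key and Levenshtein edit distance; real MDM engines use far richer comparators and tuned thresholds, and the records here are sample data echoing the earlier John Smith example.

import java.util.*;

// Match & link against a simple fuzzy index: block on a key, confirm by edit distance.
public class FuzzyMatcher {
    // Blocking key: first 4 letters of the lowercased last token (e.g., surname).
    static String blockKey(String name) {
        String[] tokens = name.trim().toLowerCase().split("\\s+");
        String last = tokens[tokens.length - 1];
        return last.substring(0, Math.min(4, last.length()));
    }

    // Classic two-row Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        List<String> records = List.of("Jonathan Smith", "John Smith", "John H Smith", "Jane Doe");
        // Build the fuzzy index: blocking key -> candidate records.
        Map<String, List<String>> index = new HashMap<>();
        for (String r : records) index.computeIfAbsent(blockKey(r), k -> new ArrayList<>()).add(r);
        // Link an incoming record to candidates sharing its key (threshold is illustrative).
        String incoming = "Jon Smith";
        for (String candidate : index.getOrDefault(blockKey(incoming), List.of())) {
            if (levenshtein(incoming.toLowerCase(), candidate.toLowerCase()) <= 5)
                System.out.println(incoming + "  ->  " + candidate);
        }
    }
}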
19. Design for Autonomous Analytics – Best Practices
1) Design for standardization
2) Design for re-usability
3) Design for security
4) Design for auditability
5) Design with your analysts
6) Design for speed
Data is exploding; there is simply too much to tackle with arcane, manually intensive approaches.
Data intelligence is the ability to automate data discovery, preparation, and security; self-document processes and data relationships; recommend best actions using machine learning; combine domain-specific tools with open-source scalability; rapidly adapt to change without disrupting operations; and automate deployment from insight to action (operationalize).
All mappings were built in the Developer tool, including flat-file, Hive, and Netezza sources and targets along with all the transformations, and were then loaded into MM (Metadata Manager). The lineage shows the entire mapping flow.
The lineage will show you, for example, that if I pick a source file I can see all the mappings it is used in and drill in to see all the targets it goes to, down to the field level. As of May 2014, this covers the metadata, not the data values themselves.
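Conceptually, field-level lineage is a directed graph: fields are nodes, mappings contribute edges, and "where does this source field end up?" is a reachability query. The Java sketch below models just that; the node names are hypothetical, and a real metadata manager persists, versions, and visualizes this graph rather than walking it ad hoc.

import java.util.*;

// Field-level lineage as a directed graph with a downstream-impact query.
public class LineageGraph {
    private final Map<String, Set<String>> downstream = new HashMap<>();

    void addEdge(String from, String to) {
        downstream.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    // Every field reachable from 'field': the targets its data flows into.
    Set<String> impactOf(String field) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>(List.of(field));
        while (!stack.isEmpty()) {
            for (String next : downstream.getOrDefault(stack.pop(), Set.of())) {
                if (seen.add(next)) stack.push(next);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageGraph g = new LineageGraph();
        g.addEdge("flatfile.customer.email", "hive.staging.email");
        g.addEdge("hive.staging.email", "netezza.dw.cust_email");
        System.out.println(g.impactOf("flatfile.customer.email"));
        // -> [hive.staging.email, netezza.dw.cust_email]
    }
}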
For example, below is a sketch of the kind of Hadoop Java MapReduce code needed just to join two datasets. Hand-crafting your own mappers and reducers may be time consuming, though it does force you to learn basic Hadoop concepts like MapReduce.
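The slide's original code is not preserved in this transcript, so the following is an illustrative reconstruction of a classic reduce-side join, assuming two comma-delimited inputs keyed on their first column; it is not Informatica's actual example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reduce-side join: each mapper tags records with their source; the reducer
// pairs left and right records that share a join key.
public class ReduceSideJoin {
    public static class LeftMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object k, Text v, Context ctx) throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);         // key,rest
            if (f.length < 2) return;
            ctx.write(new Text(f[0]), new Text("L|" + f[1])); // tag with source
        }
    }
    public static class RightMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object k, Text v, Context ctx) throws IOException, InterruptedException {
            String[] f = v.toString().split(",", 2);
            if (f.length < 2) return;
            ctx.write(new Text(f[0]), new Text("R|" + f[1]));
        }
    }
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>(), right = new ArrayList<>();
            for (Text v : vals) {
                String s = v.toString();
                (s.startsWith("L|") ? left : right).add(s.substring(2));
            }
            for (String l : left)
                for (String r : right) ctx.write(key, new Text(l + "," + r)); // inner join
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}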
But as you can see, this is quite time consuming, and as you further develop and expand your projects, this approach is a recipe for failure.
For example, the pre-built Joiner transformation circled here is how you would join datasets on Hadoop using a visual development tool like Informatica's.
Implementing this entire data pipeline by hand-coding in Java would take at least five times as long.
Another benefit of this approach is that it accelerates deployments, facilitates reuse, and insulates developers from the underlying complexities of the deployment system, so as things change you don't have to rebuild and retest your data pipelines.
This fulfills three of the requirements for a big data platform.