Drive Data Quality at Your Company: Create a Data Lake
- 2. 2 © RedPoint Global Inc. 2014 Confidential
Purpose of MDM
Create correct and consistent data across the enterprise that fosters trust in information and acceleration of growth.
- 3.
Why it matters
“Without data you’re just another person with an opinion.”
W. Edwards Deming
- 4.
Vicious Cycle of Unmanaged Data
1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations
- 5.
A Data Architecture Under Pressure
© Hortonworks Inc. 2014
Applications: Business Analytics, Custom Applications, Packaged Applications
Data System (Repositories): RDBMS, EDW, MPP
Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
Unstructured documents, emails
Transactional data
Server logs
Sentiment, web data
Geolocation
Sensor, machine data
Clickstream
Hierarchical data
OLTP, ERP, CRM
Master data
2.8 ZB in 2013
85% from new data types
15x Machine Data by 2020
40 ZB by 2020
Source: IDC
- 6.
Broad Spectrum of Benefits Across Industries
Financial Services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximize deposit spread
• Insurance underwriting
• Accelerate loan processing
Retail
• 360° view of the customer
• Analyze brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout
Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development
Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly line quality assurance
• Proactive maintenance
• Crowdsourced quality assurance
Healthcare
• Genomic data for medical trials
• Monitor patient vitals
• Reduce re-admittance rates
• Store medical research data
• Recruit cohorts for pharmaceutical trials
Utilities, Oil & Gas
• Smart meter stream analysis
• Slow oil well decline curves
• Optimize lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing
Public Sector
• Analyze public sentiment
• Protect critical networks
• Prevent fraud and waste
• Crowdsource reporting for repairs to infrastructure
• Fulfill open records requests
- 7.
Gartner’s Nexus of Forces Making Things Worse
- 8.
Business Benefits of MDM
Today IT data mgmt. pros focus on:
• Eliminating duplicate/orphaned data
• Standardizing and centralizing data/metadata
• Meeting operational SLAs
• Data enrichment
• Data integration and synchronization
Business leaders really care about:
• Increasing revenue
• Decreasing costs
• Increasing operational efficiencies
• Reducing risk
• Improving customer experiences
Example business-value KPIs:
• Increase in customer self-service for order management, technical support and customer service
• Reduction in customer privacy compliance risk exposure
• Reduction in direct marketing postage costs
• Increase in campaign response rates
• Delivering a consistent cross-channel customer experience
• Reduction in average handle time in call center
Use business-value driven KPIs to evangelize MDM benefits
- 9.
How About MDM on a Data Lake?
Challenges to Data Lake Approach
• Severe shortage of MapReduce-skilled resources
• Inconsistent skills lead to inconsistent results from code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks
Benefits of a Hadoop Data Lake
• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled
- 10.
Key Functions for Master Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Metadata management
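The "fuzzy match" and "create keys" functions listed above can be illustrated with a small sketch. This is not RedPoint's implementation: the class name `MasterKeySketch`, the use of Levenshtein edit distance, and the distance threshold are all assumptions chosen only to show the idea of resolving near-duplicate names to one persistent key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: fuzzy-match incoming names against known master
// records and hand out a stable integer key. Not a vendor algorithm.
class MasterKeySketch {
    private final List<String> masterNames = new ArrayList<>();
    private final int maxDistance; // assumed tolerance, in character edits

    MasterKeySketch(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    // Levenshtein edit distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Light standardization so that case and spacing do not count as edits.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Returns the key of the first master within maxDistance edits,
    // or registers the name as a new master and returns its new key.
    int resolve(String rawName) {
        String name = normalize(rawName);
        for (int key = 0; key < masterNames.size(); key++) {
            if (editDistance(masterNames.get(key), name) <= maxDistance) {
                return key;
            }
        }
        masterNames.add(name);
        return masterNames.size() - 1;
    }
}
```

With a threshold of 2, `resolve("Jon Smith")` and `resolve("  john  smith ")` return the same key, while an unrelated name gets a new one. The linear first-match scan keeps the sketch short; matching at data-lake scale would need grouping (blocking) before comparison, as the slide's "Grouping" bullet suggests.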
- 11.
Data Lake is the Center of Your MDM Strategy
• Ingestion of all data available from any source, format, cadence, structure or non-structure
• ELT and data transformation, refinement, cleansing, completion, validation and standardization
• Geospatial processing and geocoding
• Data profiling, lineage and metadata management
• Identity resolution and persistent keying and entity profile management
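As a rough illustration of the "cleansing, validation and standardization" step, the sketch below normalizes two fields of a raw record. The class name `CleanseSketch`, both method names, and the phone rule (10 digits, optionally preceded by a country code of 1) are assumptions made for the example and do not come from the deck.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical field-level cleansing helpers; the rules are illustrative only.
class CleanseSketch {
    // Standardize a name: trim, collapse runs of whitespace, upper-case.
    static String standardizeName(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
    }

    // Cleanse and validate a phone number: strip every non-digit, drop an
    // assumed leading country code "1", and accept only 10-digit results.
    static Optional<String> standardizePhone(String raw) {
        String digits = raw.replaceAll("\\D", "");
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1);
        }
        return digits.length() == 10 ? Optional.of(digits) : Optional.empty();
    }
}
```

Returning an empty `Optional` for values that fail validation lets a later step decide whether to reject, flag, or complete the record rather than silently passing bad data downstream.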
- 12.
Data Lake Architecture for MDM
Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback
- 13.
How Can That Possibly Work?
More MapReduce! YARN!
- 14.
Overview: What is Hadoop/Hadoop 2.0
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd party applications constrained by the need to generate code
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency
- 15.
RedPoint Data Management on Hadoop
[Diagram. Labels: Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; Parallel Section; Partition; Data server; YARN; MapReduce]
- 16.
Reference Hadoop Architecture
[Architecture diagram. Recoverable labels:]
• Monitoring and Management Tools: AMBARI
• DATA REFINEMENT: HIVE, PIG
• STRUCTURE: HCATALOG (metadata services)
• INTERACTIVE: HIVE Server2
• LOAD: SQOOP, WebHDFS, Flume, NFS
• LOAD: SQOOP/Hive, WebHDFS
• Other labels: MAPREDUCE, REST, HTTP, STREAM, YARN, HDFS
• SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory
• Data Sources: DBs, JMS Queues, Files, RDBMS, EDW
• Query/Visualization/Reporting/Analytical Tools and Apps
RedPoint Functional Footprint
- 17.
Benchmarks – Project Gutenberg
MapReduce
• >150 lines of MR code
• 6 hours of development
• 6 minutes runtime
• Extensive optimization needed
Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script
RedPoint
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required
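Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
// Imports this excerpt relies on:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested mapper class; the enclosing job class and the custom
// WordOffset key type belong to the full code and are not shown here.
public static class MapClass
    extends Mapper<WordOffset, Text, Text, IntWritable> {
  // Characters treated as word separators (the inner double quote is escaped).
  private final static String delimiters =
      "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (token, 1) for every token found in the input line.
  public void map(WordOffset key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line, delimiters);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}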
Sample Pig script without the UDF:
SET pig.maxCombinedSplitSize 67108864;
SET pig.splitCombination true;
A = LOAD '/testdata/pg/*/*/*';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
C = FOREACH B GENERATE UPPER(word) AS word;
D = GROUP C BY word;
E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
F = ORDER E BY occurrences DESC;
STORE F INTO '/user/cleonardi/pg/pig-count';
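For readers who want to see the benchmark's logic without a cluster, the following single-process sketch performs the same steps as the Pig script (tokenize, upper-case, group, count, order by count descending). It is only an illustration: the class name `WordCountSketch` and its delimiter set are assumptions, it is not one of the three benchmarked implementations, and it says nothing about their performance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical in-memory word count mirroring the pipeline steps above.
class WordCountSketch {
    // Assumed separator set, much smaller than the one in MapClass.
    static final String DELIMITERS = " \t\n\r,.;:!?\"'()[]{}";

    // Tokenize each line, upper-case each token, and count occurrences.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order by count descending; ties are broken alphabetically so the
    // result is deterministic.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return entries;
    }
}
```

Calling `WordCountSketch.byFrequency(WordCountSketch.count(lines))` on a list of text lines yields entries sorted from most to least frequent word.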
- 18.
Data Lake Architecture for MDM
Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback