Grab some coffee and enjoy the pre-show banter before the top of the hour!
HOT Technologies of 2014
HOST: Eric Kavanagh
THIS YEAR is…
THE LINE UP
ANALYST: David Loshin, President, Knowledge Integrity
ANALYST: David Raab, Principal, Raab Associates
GUEST: George Corugedo, CTO, RedPoint Global
INTRODUCING David Loshin
© 2014 Knowledge Integrity, Inc. 
www.knowledge-integrity.com 
(301) 754-6350 
Big Data Quality 
David Loshin 
Knowledge Integrity, Inc. 
www.knowledge-integrity.com
A Conventional Approach to “Big Data Quality”
• Immediate thoughts:
  - “lots of data” means “lots of errors” means “lots of cleansing”
  - “big data quality” = prepending “big” onto “data quality”
• However, conventional processes associated with manufacturing quality are less applicable in a big data world:
DQ Directive | Challenge for Repurposing
“Process management” | Business function-based data development intends data for specific purposes
“Customer requirements” | Desire to assert customer requirements at point of creation or acquisition conflicts with expectation to use data in different ways
“Supplier management” | Often, the source of the data is way beyond the organization’s administrative control or is completely unknown
“Control” | There is opacity regarding the data flow processes and lineage, so control is limited to execution at the point of use
“Continuous improvement” | Transitory nature of streamed data
Drivers of Quality of Big Data
• Rampant data repurposing
  - For data scientists, the data sets are demanded in raw form, free from the shackles of dimensional models
  - Simultaneously, analytical results need to be integrated with existing data warehouse and business intelligence architecture
• Content in addition to structure
  - Scanning and parsing text must go beyond validation against known formats to ascertain meaning in context
• Value of discrete pieces of information varies in relation to content type, precision, timeliness, overall volume
• Information Utility
  - The onus of ensuring usability (and “quality”) is on the data consumer, not the data producer
Assessing Big Data Quality and Utility
✔ Temporal Consistency
✔ Completeness
✔ Precision Consistency
✔ Currency
✔ Unique Identifiability
✔ Timeliness
✔ Semantic Consistency
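As a rough illustration of how a few of these dimensions might be scored, the sketch below computes completeness, unique identifiability, and currency over a small in-memory record set. The CustomerRecord type, its fields, the choice of email as the completeness field, and the convention that an empty data set scores 1.0 are all assumptions made for this example; none of them come from the presentation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical record type; the fields are assumptions made for this sketch.
final class CustomerRecord {
    final String id;          // identifier that is expected to be unique
    final String email;       // null or blank when the value is missing
    final Instant updatedAt;  // time of the last update

    CustomerRecord(String id, String email, Instant updatedAt) {
        this.id = id;
        this.email = email;
        this.updatedAt = updatedAt;
    }
}

final class QualityMetrics {
    // Completeness: fraction of records with a non-blank email.
    static double completeness(List<CustomerRecord> records) {
        if (records.isEmpty()) return 1.0; // convention chosen for this sketch
        long filled = records.stream()
                .filter(r -> r.email != null && !r.email.trim().isEmpty())
                .count();
        return (double) filled / records.size();
    }

    // Unique identifiability: distinct ids divided by total records.
    static double uniqueness(List<CustomerRecord> records) {
        if (records.isEmpty()) return 1.0;
        Set<String> ids = new HashSet<>();
        for (CustomerRecord r : records) ids.add(r.id);
        return (double) ids.size() / records.size();
    }

    // Currency: fraction of records updated within maxAge of "now".
    static double currency(List<CustomerRecord> records, Instant now, Duration maxAge) {
        if (records.isEmpty()) return 1.0;
        Instant cutoff = now.minus(maxAge);
        long fresh = records.stream()
                .filter(r -> !r.updatedAt.isBefore(cutoff))
                .count();
        return (double) fresh / records.size();
    }
}
```

Each score is a ratio between 0 and 1, so a data consumer could compare it against a threshold that suits the intended use rather than a single producer-defined standard.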
Characteristics of Applications Suited for Hadoop
• Adapting existing solutions to improve performance
• Algorithms or solutions exhibit one or more of these characteristics:
  - Large data volumes
  - Significant data variety
  - Performance impacted by data latency
  - Computational performance throttled
  - Amenable to parallelization
• Enabling or improving solutions whose requirements exceed existing resource capabilities
Leveraging Performance Computing for Data Quality
• Hadoop’s parallel and distributed computing enables scalability in deploying key data quality tasks
• Each of these activities exhibits the characteristics of applications amenable to Hadoop:
  - Data validation that uses data quality and business rules for validating consistency, completeness, timeliness, etc.
  - Identity resolution that uses advanced techniques for entity recognition and identity resolution from structured and unstructured sources
  - Data cleansing, standardization, enhancement that applies parsing and standardization rules within context of end-use business analyses and applications
  - Inspection, monitoring, and remediation to empower data stewards to monitor quality and take proper actions for ensuring big data utility
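Rule-based validation parallelizes well because each record can be checked independently of the others. The sketch below is a minimal single-machine illustration of that idea: it applies named rules to records and counts violations per rule, using a Java parallel stream as a stand-in for cluster-level parallelism. The RuleValidator class, the map-based record representation, and the rule names in the usage note are hypothetical and are not taken from the presentation or from any Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper: applies named business rules to records and
// reports how many records violate each rule.
final class RuleValidator {
    // A record is modeled as a field-name -> value map (an assumption).
    private final Map<String, Predicate<Map<String, String>>> rules = new LinkedHashMap<>();

    RuleValidator addRule(String name, Predicate<Map<String, String>> rule) {
        rules.put(name, rule);
        return this;
    }

    // Each record is tested independently, so the scan can run in parallel.
    Map<String, Long> countViolations(List<Map<String, String>> records) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> e : rules.entrySet()) {
            long failures = records.parallelStream()
                    .filter(e.getValue().negate())
                    .count();
            result.put(e.getKey(), failures);
        }
        return result;
    }
}
```

A caller might register a completeness rule such as `addRule("name-present", r -> r.get("name") != null && !r.get("name").isEmpty())` and hand the per-rule counts to a data steward for monitoring.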
Check Out These Resources!
• www.knowledge-integrity.com
• www.dataqualitybook.com
• If you have questions, comments, or suggestions, please contact me:
David Loshin
301-754-6350
loshin@knowledge-integrity.com
INTRODUCING David Raab
Big Uses for Big Data
August 20, 2014
David M. Raab
Raab Associates
draab@raabassociates.com
What’s New about Big Data?
• Comprehensive details on all customers
• Near-immediate updates
• High-speed detection, assessment, response
• (Almost) always reachable
What’s It Good For?
• Personalization
• Marketing measurement
• Real time bidding
• Web prospecting
What Are the Challenges? (Technical)
What Are the Challenges? (Business)
INTRODUCING George Corugedo
YARN Changes the Data Quality Game
August 2014
Overview – Challenges to Adoption
Skills Gap
• Severe shortage of MR skilled resources
• Very expensive resources and hard to retain
• Inconsistent skills lead to inconsistent results
• Underutilizes existing resources
• Prevents broad leverage of investments across enterprise
Maturity & Governance
• A nascent technology ecosystem around Hadoop
• Emerging technologies only address narrow slivers of functionality
• New applications are not enterprise class
• Legacy applications have built short term capabilities
Data Into Information
• Data is not useful in its raw state; it must be turned into information
• Benefit of Hadoop is that the same data can be used from many perspectives
• Analysts must now do the structuring of the data based on intended use of the data
© RedPoint Global Inc. 2014 Confidential
Overview - What is Hadoop/Hadoop 2.0
Lower cost scaling
No need for structure
Ease of data capture
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code based solutions
• Highly skilled and expensive resources needed
• 3rd party applications constrained by the need to generate code
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency
RedPoint Data Management on Hadoop
[Diagram. Labels: Parallel Section (UI); Partitioning AM / Tasks; Execution AM / Tasks; Data I/O; Key / Split Analysis; YARN; MapReduce]
Key features of RedPoint Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
All functions can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, actionable data – quickly, reliably and at low cost
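To make the ideas of fuzzy matching and persistent keys more concrete, here is a minimal sketch that assigns a stable integer key to each name and reuses an existing key when a new name is within a small edit distance of a known one. This is not RedPoint's matching method, which the slides do not describe; the FuzzyKeyRegistry class, the use of Levenshtein distance, the first-match policy, and the distance threshold are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical registry that maps similar names to the same persistent key.
final class FuzzyKeyRegistry {
    private final List<String> knownNames = new ArrayList<>(); // list index = assigned key
    private final int maxDistance;

    FuzzyKeyRegistry(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    // Classic dynamic-programming Levenshtein edit distance (two-row variant).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the key of the first known name within maxDistance;
    // otherwise registers the name and returns a newly created key.
    int keyFor(String name) {
        String normalized = name.trim().toLowerCase(Locale.ROOT);
        for (int key = 0; key < knownNames.size(); key++) {
            if (levenshtein(normalized, knownNames.get(key)) <= maxDistance) {
                return key;
            }
        }
        knownNames.add(normalized);
        return knownNames.size() - 1;
    }
}
```

Because keys are never reassigned, a record that matched key 0 yesterday still resolves to key 0 today, which is the sense in which matches are maintained over time. A production system would likely compare against the best candidate rather than the first, and would block or index candidates instead of scanning every known name.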
RedPoint Functional Footprint
[Architecture diagram. Labels: Monitoring and Management Tools; AMBARI; DATA REFINEMENT; PIG; HIVE; MAPREDUCE; STREAM; REST; HTTP; STRUCTURE; HCATALOG (metadata services); INTERACTIVE; HIVE Server2; LOAD; SQOOP; WebHDFS; Flume; SQOOP/Hive; YARN; HDFS (nodes 1 to n); SOURCE DATA (Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory); JMS Queues; Data Sources (DBs, Files, NFS, RDBMS, EDW); Query/Visualization/Reporting/Analytical Tools and Apps]
RedPoint Benchmarks – Project Gutenberg
Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
    // Imports required by this fragment. WordOffset is not shown in this excerpt.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class MapClass
            extends Mapper<WordOffset, Text, Text, IntWritable> {
        // Characters treated as token separators (the inner double quote is escaped).
        private final static String delimiters =
                "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (token, 1) for every token found in the input line.
        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
Sample Pig script without the UDF:
    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';
Map Reduce | Pig | RedPoint
>150 Lines of MR Code | ~50 Lines of Script Code | 0 Lines of Code
6 hours of development | 3 hours of development | 15 min. of development
6 minutes runtime | 15 minutes runtime | 3 minutes runtime
Extensive optimization needed | User Defined Functions required prior to running script | No tuning or optimization required
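For readers who want to see the logic of the benchmark job without a cluster, the sketch below reproduces the same steps as the Pig script (tokenize, upper-case, group, count, order by descending count) in plain single-machine Java. It is only an illustration of what the job computes and says nothing about the benchmark timings; the LocalWordCount class, its delimiter set, and the alphabetical tie-break are assumptions and differ from the code used in the benchmark.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical single-machine equivalent of the word count job.
final class LocalWordCount {
    // Whitespace plus a small punctuation set; an assumption for this sketch,
    // not the delimiter string used in the benchmark.
    private static final String DELIMITERS = " \t\n\r\f.,;:!?\"'()[]{}";

    // Tokenize each line, upper-case the tokens, and count occurrences.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order by descending count, breaking ties alphabetically for determinism.
    static List<Map.Entry<String, Integer>> ranked(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }
}
```

The in-memory map plays the role that the shuffle and reduce phases play in MapReduce, which is why this version only works while the distinct words fit on one machine.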
Attributes of Information
RELEVANT: Information must pertain to a specific problem. General data must be connected to reveal relevance of the information.
COMPLETE: Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all.
ACCURATE: This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information.
CURRENT: As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale.
ECONOMICAL: There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives the use if successful.
Big Data Can Become Big Information
• Ingestion of all data available from any source, format, cadence, structure or non-structure
• ELT and data refinement
• Geospatial processing and geocoding
• Data profiling, lineage and metadata management
• Data parsing, quality, validation and hygiene
• Identity resolution and persistent keying
• Entity profile management
Reference Architecture for Matching in Hadoop
[Diagram. Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback]
Key Points to Cover Today
• Broad functionality across data processing domains
• Validated ease of use, speed, match quality and party data superiority
• Hadoop 2.0/YARN certified – 1 of first 17 companies to do so
• Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications)
• Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in MapReduce and requires none of the coding or Java skills.
• Big functional footprint without touching a line of code
• Design model consistent with data flow paradigm
• RPDM has a “Zero-Footprint” install in the Hadoop cluster
• The same interface and functionality is available for both structured and unstructured databases, so working across both is seamless from a user’s perspective.
• Data quality done completely within the cluster
THANK YOU!
The Archive Trifecta:
• Inside Analysis www.insideanalysis.com
• SlideShare www.slideshare.net/InsideAnalysis
• YouTube www.youtube.com/user/BloorGroup

More Related Content

What's hot

Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
 
Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksHortonworks
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelDataWorks Summit
 
Predicting Customer Behavior with Customer Convsrsation Modeling
Predicting Customer Behavior with Customer Convsrsation ModelingPredicting Customer Behavior with Customer Convsrsation Modeling
Predicting Customer Behavior with Customer Convsrsation ModelingDataWorks Summit
 
How Universities Use Big Data to Transform Education
How Universities Use Big Data to Transform EducationHow Universities Use Big Data to Transform Education
How Universities Use Big Data to Transform EducationHortonworks
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Hortonworks
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...DataWorks Summit
 
Making Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeMaking Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeDataWorks Summit
 
Break Through the Traditional Advertisement Services with Big Data and Apache...
Break Through the Traditional Advertisement Services with Big Data and Apache...Break Through the Traditional Advertisement Services with Big Data and Apache...
Break Through the Traditional Advertisement Services with Big Data and Apache...Hortonworks
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Precisely
 
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalHortonworks
 
Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?Inside Analysis
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyDataWorks Summit
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...DataWorks Summit/Hadoop Summit
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata Hortonworks
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysiszafarali1981
 

What's hot (20)

Hadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data ProcessingHadoop 2.0: YARN to Further Optimize Data Processing
Hadoop 2.0: YARN to Further Optimize Data Processing
 
Transform You Business with Big Data and Hortonworks
Transform You Business with Big Data and HortonworksTransform You Business with Big Data and Hortonworks
Transform You Business with Big Data and Hortonworks
 
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive ModelMoving Health Care Analytics to Hadoop to Build a Better Predictive Model
Moving Health Care Analytics to Hadoop to Build a Better Predictive Model
 
Predicting Customer Behavior with Customer Convsrsation Modeling
Predicting Customer Behavior with Customer Convsrsation ModelingPredicting Customer Behavior with Customer Convsrsation Modeling
Predicting Customer Behavior with Customer Convsrsation Modeling
 
How Universities Use Big Data to Transform Education
How Universities Use Big Data to Transform EducationHow Universities Use Big Data to Transform Education
How Universities Use Big Data to Transform Education
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application code
 
Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09Zementis hortonworks-webinar-2014-09
Zementis hortonworks-webinar-2014-09
 
Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...Global Data Management – a practical framework to rethinking enterprise, oper...
Global Data Management – a practical framework to rethinking enterprise, oper...
 
Making Bank Predictive and Real-Time
Making Bank Predictive and Real-TimeMaking Bank Predictive and Real-Time
Making Bank Predictive and Real-Time
 
Break Through the Traditional Advertisement Services with Big Data and Apache...
Break Through the Traditional Advertisement Services with Big Data and Apache...Break Through the Traditional Advertisement Services with Big Data and Apache...
Break Through the Traditional Advertisement Services with Big Data and Apache...
 
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To...
 
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, Cloudera
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?
 
Harnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case StudyHarnessing Hadoop Distuption: A Telco Case Study
Harnessing Hadoop Distuption: A Telco Case Study
 
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
A Tale of Two Regulations: Cross-Border Data Protection For Big Data Under GD...
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
 
Big Data/Hadoop Option Analysis
Big Data/Hadoop Option AnalysisBig Data/Hadoop Option Analysis
Big Data/Hadoop Option Analysis
 

Viewers also liked

The Cloud Imperative – What, Why, When and How
The Cloud Imperative – What, Why, When and HowThe Cloud Imperative – What, Why, When and How
The Cloud Imperative – What, Why, When and HowInside Analysis
 
The Crown Jewels: Is Enterprise Data Ready for the Cloud?
The Crown Jewels: Is Enterprise Data Ready for the Cloud?The Crown Jewels: Is Enterprise Data Ready for the Cloud?
The Crown Jewels: Is Enterprise Data Ready for the Cloud?Inside Analysis
 
How Data Visualization Enhances the News
How Data Visualization Enhances the NewsHow Data Visualization Enhances the News
How Data Visualization Enhances the NewsInside Analysis
 
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, Education
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, EducationRaising the Bar: Innovative Healthcare Program Fosters Collaboration, Education
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, EducationInside Analysis
 
No Time-Outs: How to Empower Round-the-Clock Analytics
No Time-Outs: How to Empower Round-the-Clock AnalyticsNo Time-Outs: How to Empower Round-the-Clock Analytics
No Time-Outs: How to Empower Round-the-Clock AnalyticsInside Analysis
 
Think Big - How to Design a Big Data Information Architecture
Think Big - How to Design a Big Data Information ArchitectureThink Big - How to Design a Big Data Information Architecture
Think Big - How to Design a Big Data Information ArchitectureInside Analysis
 
Down to Business: Taking Action Quickly with Linked Data Services
Down to Business: Taking Action Quickly with Linked Data ServicesDown to Business: Taking Action Quickly with Linked Data Services
Down to Business: Taking Action Quickly with Linked Data ServicesInside Analysis
 
Thinking Outside the Cube: How In-Memory Bolsters Analytics
Thinking Outside the Cube: How In-Memory Bolsters AnalyticsThinking Outside the Cube: How In-Memory Bolsters Analytics
Thinking Outside the Cube: How In-Memory Bolsters AnalyticsInside Analysis
 
Enabling Flexible Governance for All Data Sources
Enabling Flexible Governance for All Data SourcesEnabling Flexible Governance for All Data Sources
Enabling Flexible Governance for All Data SourcesInside Analysis
 
Continuous Intelligence: Staying Ahead with Streaming Analytics
Continuous Intelligence: Staying Ahead with Streaming AnalyticsContinuous Intelligence: Staying Ahead with Streaming Analytics
Continuous Intelligence: Staying Ahead with Streaming AnalyticsInside Analysis
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudInside Analysis
 
Agents for Agility - The Just-in-Time Enterprise Has Arrived
Agents for Agility - The Just-in-Time Enterprise Has ArrivedAgents for Agility - The Just-in-Time Enterprise Has Arrived
Agents for Agility - The Just-in-Time Enterprise Has ArrivedInside Analysis
 
All Grown Up: Maturation of Analytics in the Cloud
All Grown Up: Maturation of Analytics in the CloudAll Grown Up: Maturation of Analytics in the Cloud
All Grown Up: Maturation of Analytics in the CloudInside Analysis
 
Database Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastDatabase Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastInside Analysis
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...Inside Analysis
 

Viewers also liked (16)

The Cloud Imperative – What, Why, When and How
The Cloud Imperative – What, Why, When and HowThe Cloud Imperative – What, Why, When and How
The Cloud Imperative – What, Why, When and How
 
The Crown Jewels: Is Enterprise Data Ready for the Cloud?
The Crown Jewels: Is Enterprise Data Ready for the Cloud?The Crown Jewels: Is Enterprise Data Ready for the Cloud?
The Crown Jewels: Is Enterprise Data Ready for the Cloud?
 
How Data Visualization Enhances the News
How Data Visualization Enhances the NewsHow Data Visualization Enhances the News
How Data Visualization Enhances the News
 
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, Education
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, EducationRaising the Bar: Innovative Healthcare Program Fosters Collaboration, Education
Raising the Bar: Innovative Healthcare Program Fosters Collaboration, Education
 
No Time-Outs: How to Empower Round-the-Clock Analytics
No Time-Outs: How to Empower Round-the-Clock AnalyticsNo Time-Outs: How to Empower Round-the-Clock Analytics
No Time-Outs: How to Empower Round-the-Clock Analytics
 
Think Big - How to Design a Big Data Information Architecture
Think Big - How to Design a Big Data Information ArchitectureThink Big - How to Design a Big Data Information Architecture
Think Big - How to Design a Big Data Information Architecture
 
Down to Business: Taking Action Quickly with Linked Data Services
Down to Business: Taking Action Quickly with Linked Data ServicesDown to Business: Taking Action Quickly with Linked Data Services
Down to Business: Taking Action Quickly with Linked Data Services
 
Thinking Outside the Cube: How In-Memory Bolsters Analytics
Thinking Outside the Cube: How In-Memory Bolsters AnalyticsThinking Outside the Cube: How In-Memory Bolsters Analytics
Thinking Outside the Cube: How In-Memory Bolsters Analytics
 
Enabling Flexible Governance for All Data Sources
Enabling Flexible Governance for All Data SourcesEnabling Flexible Governance for All Data Sources
Enabling Flexible Governance for All Data Sources
 
Continuous Intelligence: Staying Ahead with Streaming Analytics
Continuous Intelligence: Staying Ahead with Streaming AnalyticsContinuous Intelligence: Staying Ahead with Streaming Analytics
Continuous Intelligence: Staying Ahead with Streaming Analytics
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the Cloud
 
Agents for Agility - The Just-in-Time Enterprise Has Arrived
Agents for Agility - The Just-in-Time Enterprise Has ArrivedAgents for Agility - The Just-in-Time Enterprise Has Arrived
Agents for Agility - The Just-in-Time Enterprise Has Arrived
 
All Grown Up: Maturation of Analytics in the Cloud
All Grown Up: Maturation of Analytics in the CloudAll Grown Up: Maturation of Analytics in the Cloud
All Grown Up: Maturation of Analytics in the Cloud
 
Database Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastDatabase Revolution - Exploratory Webcast
Database Revolution - Exploratory Webcast
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...
At the Tipping Point: Considerations for Cloud BI in a Multi-platform BI Ente...
 

A Tighter Weave – How YARN Changes the Data Quality Game

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 5. THE LINE UP
      - ANALYST: David Loshin, President, Knowledge Integrity
      - ANALYST: David Raab, Principal, Raab Associates
      - GUEST: George Corugedo, CTO, RedPoint Global
  • 7. Big Data Quality. David Loshin, Knowledge Integrity, Inc., www.knowledge-integrity.com. © 2014 Knowledge Integrity, Inc. (301) 754-6350
  • 8. A Conventional Approach to “Big Data Quality”
      - Immediate thoughts:
          - “lots of data” means “lots of errors” means “lots of cleansing”
          - “big data quality” = prepending “big” onto “data quality”
      - However, conventional processes associated with manufacturing quality are less applicable in a big data world.
      - DQ directive and its challenge for repurposing:
          - “Process management”: Business function-based data development intends data for specific purposes
          - “Customer requirements”: Desire to assert customer requirements at point of creation or acquisition conflicts with expectation to use data in different ways
          - “Supplier management”: Often, the source of the data is way beyond the organization’s administrative control or is completely unknown
          - “Control”: There is opacity regarding the data flow processes and lineage, so control is limited to execution at the point of use
          - “Continuous improvement”: Transitory nature of streamed data
  • 9. Drivers of Quality of Big Data
      - Rampant data repurposing
          - For data scientists, the data sets are demanded in raw form, free from the shackles of dimensional models
          - Simultaneously, analytical results need to be integrated with existing data warehouse and business intelligence architecture
      - Content in addition to structure
          - Scanning and parsing text must go beyond validation against known formats to ascertain meaning in context
      - Value of discrete pieces of information varies in relation to content type, precision, timeliness, overall volume
      - Information Utility
          - The onus of ensuring usability (and “quality”) is on the data consumer, not the data producer
  • 10. Assessing Big Data Quality and Utility: Temporal Consistency, Completeness, Precision, Consistency, Currency, Unique Identifiability, Timeliness, Semantic Consistency ✔ ✔ ✔ ✔ ✔ ✔ ✔
  • 11. Characteristics of Applications Suited for Hadoop
      - Adapting existing solutions to improve performance
      - Algorithms or solutions exhibit one or more of these characteristics:
          - Large data volumes
          - Significant data variety
          - Performance impacted by data latency
          - Computational performance throttled
          - Amenable to parallelization
      - Enabling or improving solutions whose requirements exceed existing resource capabilities
  • 12. Leveraging Performance Computing for Data Quality
      - Hadoop’s parallel and distributed computing enables scalability in deploying key data quality tasks
      - Each of these activities exhibits the characteristics of applications amenable to Hadoop:
          - Data validation that uses data quality and business rules for validating consistency, completeness, timeliness, etc.
          - Identity resolution that uses advanced techniques for entity recognition and identity resolution from structured and unstructured sources
          - Data cleansing, standardization, and enhancement that applies parsing and standardization rules within the context of end-use business analyses and applications
          - Inspection, monitoring, and remediation to empower data stewards to monitor quality and take proper actions for ensuring big data utility
  • 13. Check Out These Resources!
      - www.knowledge-integrity.com
      - www.dataqualitybook.com
      - If you have questions, comments, or suggestions, please contact me: David Loshin, 301-754-6350, loshin@knowledge-integrity.com
  • 15. Big Uses for Big Data. August 20, 2014. David M. Raab, Raab Associates, draab@raabassociates.com
  • 16. What’s New about Big Data? • Comprehensive details on all customers • Near-immediate updates • High-speed detection, assessment, response • (Almost) always reachable
  • 17. What’s It Good For? • Personalization • Marketing measurement • Real time bidding • Web prospecting
  • 18. What Are the Challenges? (Technical)
  • 19. What Are the Challenges? (Business)
  • 21. YARN Changes the Data Quality Game August 2014
  • 22. Overview – Challenges to Adoption
      - Skills Gap
          - Severe shortage of MR skilled resources
          - Very expensive resources and hard to retain
          - Inconsistent skills lead to inconsistent results
          - Underutilizes existing resources
          - Prevents broad leverage of investments across enterprise
      - Maturity & Governance
          - A nascent technology ecosystem around Hadoop
          - Emerging technologies only address narrow slivers of functionality
          - New applications are not enterprise class
          - Legacy applications have built short term capabilities
      - Data Into Information
          - Data is not useful in its raw state, it must be turned into information
          - Benefit of Hadoop is that same data can be used from many perspectives
          - Analysts must now do the structuring of the data based on intended use of the data
      © RedPoint Global Inc. 2014 Confidential
  • 23. Overview - What is Hadoop/Hadoop 2.0
      - Lower cost scaling
      - No need for structure
      - Ease of data capture
      - Hadoop 1.0
          - All operations based on Map Reduce
          - Intrinsic inconsistency of code based solutions
          - Highly skilled and expensive resources needed
          - 3rd party applications constrained by the need to generate code
      - Hadoop 2.0
          - Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
          - Mature applications can now operate directly on Hadoop
          - Reduced skill requirements and increased consistency
  • 24. RedPoint Data Management on Hadoop [architecture diagram; labels: Partitioning AM / Tasks, Parallel Section (UI), Execution AM / Tasks, Data I/O, Key / Split Analysis, YARN, MapReduce]
  • 25. Key features of RedPoint Data Management
      - ETL & ELT
          - Profiling, reads/writes, transformations
          - Single project for all jobs
      - Data Quality
          - Cleanse data
          - Parsing, correction
          - Geo-spatial analysis
      - Integration & Matching
          - Grouping
          - Fuzzy match
      - Master Key Management
          - Create keys
          - Track changes
          - Maintain matches over time
      - Web Services Integration
          - Consume and publish
          - HTTP/HTTPS protocols
          - XML/JSON/SOAP formats
      - Process Automation & Operations
          - Job scheduling, monitoring, notifications
          - Central point of control
      All functions can be used on both TRADITIONAL and BIG DATA. Creates clean, integrated, actionable data – quickly, reliably and at low cost.
  • 26. RedPoint Functional Footprint [Architecture diagram. Labels: Monitoring and Management Tools (AMBARI); DATA REFINEMENT (PIG, HIVE, MAPREDUCE); REST, HTTP, STREAM; STRUCTURE: HCATALOG (metadata services); INTERACTIVE: HIVE Server2; LOAD: SQOOP, WebHDFS, Flume; YARN; HDFS (nodes 1 to n); Data Sources: DBs, Files, NFS, JMS Queues, RDBMS, EDW; SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory; Query/Visualization/Reporting/Analytical Tools and Apps]
  • 27. RedPoint Benchmarks – Project Gutenberg • Sample MapReduce (small subset of the entire code, which totals nearly 150 lines): public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> { private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ |«»¡¢£¤¥¦©¬®¯±¶·¿"; private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(WordOffset key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line, delimiters); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } • Sample Pig script without the UDF: SET pig.maxCombinedSplitSize 67108864; SET pig.splitCombination true; A = LOAD '/testdata/pg/*/*/*'; B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word; C = FOREACH B GENERATE UPPER(word) AS word; D = GROUP C BY word; E = FOREACH D GENERATE COUNT(C) AS occurrences, group; F = ORDER E BY occurrences DESC; STORE F INTO '/user/cleonardi/pg/pig-count'; • Map Reduce: >150 lines of MR code; 6 hours of development; 6 minutes runtime; extensive optimization needed • Pig: ~50 lines of script code; 3 hours of development; 15 minutes runtime; User Defined Functions required prior to running script • RedPoint: 0 lines of code; 15 min. of development; 3 minutes runtime; no tuning or optimization required
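For readers who want to see the benchmark's logic without a cluster, the word count that both samples on this slide compute (tokenize, upper-case, group, count, order by descending count) can be sketched in a few lines of single-machine Python. This is only an illustration of the algorithm, not a reproduction of the benchmark: the token rule and the function name `count_words` are assumptions, and the slide's MapReduce and Pig versions each use their own delimiter rules.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters, digits, or apostrophes.
TOKEN = re.compile(r"[A-Za-z0-9']+")


def count_words(lines):
    """Return (word, occurrences) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Upper-case each token so "The" and "the" are counted together,
        # mirroring the UPPER(word) step in the Pig script.
        counts.update(token.upper() for token in TOKEN.findall(line))
    # Order by descending count, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```

For example, `count_words(["the cat", "The dog; the end"])` returns `THE` with a count of 3 first, followed by `CAT`, `DOG` and `END` with one occurrence each.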
  • 28. Attributes of Information • RELEVANT: Information must pertain to a specific problem. General data must be connected to reveal relevance of the information. • COMPLETE: Partial information is often worse than no information. Partial information frequently leads to worse conclusions than if no data had been used at all. • ACCURATE: This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applications of information. • CURRENT: As data ages, it becomes less accurate. Multiple research studies by Google and others show the decay in the accuracy of analytics as data becomes stale. • ECONOMICAL: There has to be a clear cost benefit. This requires work to identify the realizable benefit of information, but this is also what drives the use if successful.
  • 29. Big Data Can Become Big Information
  • 30. Big Data Can Become Big Information • Ingestion of all data available from any source, format, cadence, structure or non-structure • ELT and data refinement • Geospatial processing and geocoding • Data profiling, lineage and metadata management • Data parsing, quality, validation and hygiene • Identity resolution and persistent keying • Entity profile management
  • 31. Reference Architecture for Matching in Hadoop • Data Sources: CRM, ERP, Billing, Subscriber, Product, Network, Weather, Compete, Manuf., Clickstream, Online Chat, Sensor Data, Social Media, Call Detail Records, Fabrication Logs, Sales Feedback, Field Feedback
  • 32. Key Points to Cover Today • Broad functionality across data processing domains • Validated ease of use, speed, match quality and party data superiority • Hadoop 2.0/YARN certified: one of the first 17 companies to do so • Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications) • Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in Map Reduce and requires none of the coding or Java skills • Big functional footprint without touching a line of code • Design model consistent with data flow paradigm • RPDM has a “Zero-Footprint” install in the Hadoop cluster • The same interface and functionality is available for both structured and unstructured databases, so working across both is seamless from a user's perspective • Data quality done completely within the cluster
  • 34. THANK YOU! The Archive Trifecta: • Inside Analysis www.insideanalysis.com • SlideShare www.slideshare.net/InsideAnalysis • YouTube www.youtube.com/user/BloorGroup