IS DATA PREPARATION THE NEXT BIG DATA DISRUPTION?
The	22nd	International	Conference	on	Distributed	Multimedia	Systems
DMS	2016
Grand	Hotel	Salerno,	Salerno,	Italy
November	25	- 26,	2016
Agenda
• SCENARIO
• BIG DATA IN THE DATA DRIVEN ENTERPRISE
• WHAT DATA PREPARATION SHOULD COVER
• CREATING READY DATA USING FRACTALS
• CASE STUDY
Source	Forrester	2016
1. DOES	THE	BUSINESS	ANALYST	UNDERSTAND	THE	DATA	SCIENTIST?
2. WHY ARE DATA-DRIVEN COMPANIES HIRING DATA JOURNALISTS?
3. WHY DOES DARK DATA OUTSIDE DATA LAKES CONTINUE TO GROW?
4. WHY DOES MAKING DATA TAKE SO LONG?
5. DATA	PLAY	AND	NARRATIVES?
HOW MUCH TIME IS AVAILABLE TO EXPLOIT DATA PROCESSING OUTPUT?
77% Data Processing
23% Data Analysis
Source: Bloor 2016
90%	IS	DARK
12%	AVAILABLE	FOR	BUSINESS	INSIGHTS	
88%	IS	JUST	STORED
80%	RECORDINGs,	PDFs AND	TEXTs
source	IDC	2016
+4300%	ANNUAL	DATA	GENERATION
Data	preparation	is	an	iterative	process	for	exploring	and	transforming	raw	data	into	forms	
suitable	for	data	science,	data	discovery,	and	analytics.	
Self-service data preparation (SSDP) tools are user-oriented tools that provide data preparation
capabilities such as data cataloging and inventorying, data discovery, data exploration, data
transformation, data structuring, surfacing of sensitive attributes, and anomaly detection.
These	tools	are	aimed	at	reducing	the	time	and	complexity	of	preparing	data	and	improving	
analyst	productivity.
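As a rough illustration of the exploration step in the definition above, the sketch below profiles a small set of records by counting missing and distinct values per column. The function name `profile`, the sample records, and the choice of metrics are assumptions made for illustration; they do not come from any particular SSDP tool.

```python
def profile(records):
    """Return per-column counts of missing and distinct values.

    A value counts as missing when it is None or an empty string
    (an assumption; real tools usually let the user configure this).
    """
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)  # keep first-seen column order

    report = {}
    for name in columns:
        values = [record.get(name) for record in records]
        present = [v for v in values if v not in (None, "")]
        report[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample data, not taken from the presentation.
sample = [
    {"name": "Aldo", "city": "Miami"},
    {"name": "Sara", "city": ""},
    {"name": "Sara", "city": "NYC"},
    {"name": "Anna"},
]
report = profile(sample)
```

A report like this is typically the first thing an analyst looks at before deciding which transformations a column needs.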
[Figure: pipeline stages Pre-process, Prepare, Discover, Exploit; data states Raw, Technically correct, Ready Data, Patterns; further labels: Formatted, Multimedia domain, Missing, Multimedia]
Depending on how you count them, there are
anywhere from 20 to 50 providers of self-service
data preparation tools. However, they're not all
equal, and users should carefully examine each
offering to make sure they're getting what they
expect.
Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.)
have jumped onto SSDP, even though their capabilities aren't separate
from their core offerings and show limitations in terms of
performance, neutrality, and custom processing.
The key reason why self-service data prep will survive as its own
category is the growing realization that data preparation
needs to be kept separate from analysis and discovery.
The	volumes	and	the	number	of	data	sources	will	not	be	
decreasing,	and	neither	will	the	number	of	BI	tools.	
To	that	end,	it’s	likely	that	self-service	data	prep	will	remain	a	
product	category	unto	itself	for	the	foreseeable	future.
Source: Bloor 2016
Where	we	are
BIG	DATA	IN	THE	DATA	DRIVEN	ENTERPRISE
WE ARE ALL AWARE:
THE I.T. DIVISION IS GOING TO BUILD PLANETS OF DATA,
WHICH ARE WORLDS MADE OF DATABASES, DATA LAKES,
DATA WAREHOUSES, STRUCTURES, AND SCHEMAS.
IT SEEMS THAT THESE WORLDS ARE CALLED "BIG DATA".
BUT WE'RE AFRAID THAT, TO CREATE THEM,
THE LORDS ARE TAKING LONGER THAN 7 DAYS.
AND, UNFORTUNATELY, WORSE…
IT SEEMS THAT HUMANS DON'T HAVE ACCESS TO THOSE WORLDS.
Bottom	line:
Is data preparation the bridge between
the planets of data and the user?
Big Data is not just technology. Responsibility
should be allocated on the basis of the
following critical factors:
1. Will raw data be transferred to the preparation unit (push), or
2. does the preparation unit have to read data from the data lake (pull)?
3. Has the data lake been designed to stage or to store raw data?
4. How variable are the context and the data?
[Figure: responsibility matrix. Dimensions: data communication mode (PULL / PUSH), data lake purpose (STORE / STAGE), variability (low / high). Cells assign responsibility to IT or to the END USER.]
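The push/pull question can be sketched in a few lines. In push mode the owner of the raw data calls the preparation unit; in pull mode the preparation unit reads from the data lake itself. The class and method names below (`PreparationUnit`, `receive`, `pull_from`) and the list standing in for the data lake are hypothetical, chosen only to make the contrast concrete.

```python
class PreparationUnit:
    """Toy preparation unit that collects raw records before preparing them."""

    def __init__(self):
        self.staged = []

    def receive(self, records):
        """Push mode: the data owner sends records to the unit."""
        self.staged.extend(records)

    def pull_from(self, lake, since=0):
        """Pull mode: the unit reads records from the lake, starting at offset `since`.

        Returns the next offset so that a later pull can resume from there.
        """
        new_records = lake[since:]
        self.staged.extend(new_records)
        return since + len(new_records)


lake = ["r1", "r2", "r3"]           # stand-in for a data lake

pushed = PreparationUnit()
pushed.receive(lake[:2])            # the data owner decides what to send, and when

pulled = PreparationUnit()
offset = pulled.pull_from(lake)     # the unit decides what to read, and when
lake.append("r4")
offset = pulled.pull_from(lake, since=offset)
```

In push mode the preparation unit never needs access to the lake, while in pull mode it has to track its own position. This difference is one reason the choice affects who is responsible for what.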
Backgrounds
WHAT	DATA	PREPARATION	SHOULD	COVER
Raw data are cold, analytics are hot
reality
Understanding Comics, 1993
How to connect analytics and details?
A database is required to contextualize languages and realities
Bottom Line:
Using data should be faster and cheaper, with minimal data movement
• Materialize reality and language in a consistent database
• Couple language and reality using keyback features
• Bind external algorithms using Open (Standard?) User Exits
• Foster holistic views of data through Grid Data Unification
blending
Context,	languages	and	facts
CREATING	READY	DATA	USING	FRACTAL	ADC
Data source
name  city   DateBirth
Aldo  Miami  11/1/90
Sara  NYC    12/2/89
Anna  Rome   1.1.68
Sara  NYC    31-1-61

Fractal conversion: Transform DateBirth, Add Geo classification

Data complex / Storage group

Map
rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2

Dictionary
Key   Value  NValue
Name  Aldo   1
Name  Sara   2
Name  Anna   3
City  Miami  1
…     …      …

Luggage
DateBirth  UDateB   Age
11/1/90    1/11/90  26
12/2/89    2/12/89  26
1.1.68     1/1/68   48
31-1-61    1/31/61  56

Hierarchy
Ncity  city   state
1      Miami  Fl
2      NYC    NY
3      Rome   Italy
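The relationship between the data source and the Map, Dictionary, and UDateB values above can be reproduced with plain dictionary encoding. The sketch below is only that: it is not the ADC algorithm, whose internals the slides do not describe. The function names `encode` and `to_us_date` are hypothetical, and the date rule (day-month-year in, month/day/year out) is an assumption read off the four sample rows.

```python
import re


def to_us_date(text):
    """Rewrite a day-month-year date (separated by '/', '.' or '-') as month/day/year."""
    day, month, year = re.split(r"[/.\-]", text)
    return f"{int(month)}/{int(day)}/{year}"


def encode(rows, columns):
    """Dictionary-encode the given columns; ids are assigned in first-seen order."""
    dictionary = {col: {} for col in columns}   # column -> {value: id}
    mapping = []                                # one tuple per row: (rowId, id, id, ...)
    for row_id, row in enumerate(rows, start=1):
        ids = []
        for col in columns:
            values = dictionary[col]
            # Reuse the id of a known value, otherwise assign the next free id.
            ids.append(values.setdefault(row[col], len(values) + 1))
        mapping.append((row_id, *ids))
    return dictionary, mapping


# The four rows of the "Data source" table.
source = [
    {"name": "Aldo", "city": "Miami", "DateBirth": "11/1/90"},
    {"name": "Sara", "city": "NYC", "DateBirth": "12/2/89"},
    {"name": "Anna", "city": "Rome", "DateBirth": "1.1.68"},
    {"name": "Sara", "city": "NYC", "DateBirth": "31-1-61"},
]

dictionary, mapping = encode(source, ["name", "city"])
unified_dates = [to_us_date(row["DateBirth"]) for row in source]
```

The Age column is left out on purpose, because it depends on a reference date that the slides do not state.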
ADC is a fractal-like algorithm that converts raw input data and the related data processing into a set of
chained binary blocks, formulas, and long pointers.
We show that ADC represents an important set of computations… The advantages of ADC are that:
• it is described by a small number of parameters and has a priori known sizes of the views
• the views can be generated independently
• the overhead of combining the generated views is predictable
• the data set can be partitioned into a number of independently generated subsets
• the elements of the data set are pseudo-random
These properties make ADC a strong candidate for a data-intensive grid benchmark. < M. Frumkin, NASA NAS Division >
Using the fractal engine,
performance is extreme
Use	case
MATERIAL	TESTING
• Complex JSON, Oracle, CSV, and WMV data
• Manual data processing executed using MATLAB
• Hours of scientist work to detect outliers
• Impossible to replicate tests with the same results
• Little capitalization of know-how
• Data blending happens only at narrative writing time
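One way to make outlier detection repeatable, so that the same input always yields the same result, is to replace manual inspection with a fixed rule. The sketch below uses the median absolute deviation (MAD) with Python's standard library. The presentation does not say which method the case study adopted, so the rule, the threshold of 3.5, and the sample readings are all assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose MAD-based score exceeds the threshold."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Hypothetical measurements, not taken from the case study.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 25.0]
outliers = mad_outliers(readings)
```

Because the rule has no manual step, rerunning it on the same readings gives the same outliers, which addresses the replicability problem listed above.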
Terabyte	level	staging
Rigid	batch	processing
No	history
Digital	reality Language
Fractal
Data	base
Bottom	Line:	
Every day we hear from entrepreneurs doing their best to turn their big ideas into consistent and
successful online businesses. Here IT is the enabler but, unfortunately, the T part sometimes has a negative
influence on the development of the core idea.
The ideal tool kit is made for those who wish to exploit the I part of IT, so that entrepreneurs with great
ideas can craft their business themselves. And they should!
©2016	datonix	Spa
Thank you

Implementing Data Preparation in Distributed Multimedia System