Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
The Role of Data Wrangling in Driving Hadoop Adoption
Twitter Tag: #briefr The Briefing Room
Welcome
Host:
Eric Kavanagh
eric.kavanagh@bloorgroup.com
@eric_kavanagh
Mission
  Reveal the essential characteristics of enterprise software, good and bad
  Provide a forum for detailed analysis of today's innovative technologies
  Give vendors a chance to explain their product to savvy analysts
  Allow audience members to pose serious questions... and get answers!
Topics
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS
The Great Divide
Ø Close the Gap
Ø Empower Business Users
Ø Shift Focus of IT
Ø Developers are Third
Leg
Analyst: Mark Madsen
Mark Madsen is president of Third Nature, a
technology research and consulting firm
focused on business intelligence, data
integration and data management. Mark is
an award-winning author, architect and
CTO whose work has been featured in
numerous industry publications. Over the
past ten years Mark received awards for his
work from the American Productivity &
Quality Center, TDWI, and the Smithsonian
Institute. He is an international speaker, a
contributor to Forbes Online and on the
O’Reilly Strata program committee. For
more information or to contact Mark, follow
@markmadsen on Twitter or visit http://ThirdNature.net
Trifacta
Trifacta offers a platform for data transformation
and preparation
  The interface is rich in visualization and provides a
productive data wrangling capability
  The platform also includes access to raw data in
Hadoop, providing analysts and data scientists with
secure, governed data
Guests:
Will Davis
Director of Product Marketing, Trifacta
Alon Bartur
Principal Product Manager, Trifacta
Trifacta:
The Role of Data Wrangling
In Driving Hadoop Adoption
Variety = Data is Messy
When Data is Messy… Analysis is More Complicated
Question Analysis Insight
Messy Data Requires Data Wrangling
Question Analyze Insight
Discover Structure Clean Enrich Distill
Data Wrangling
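The five wrangling stages named on this slide (Discover, Structure, Clean, Enrich, Distill) can be sketched as a chain of small functions. This is only an illustration in plain Python, not Trifacta's implementation: the sample lines, the field names, and the `REGION_BY_COUNTRY` lookup are all hypothetical.

```python
from collections import Counter

# Hypothetical messy input; not data from the presentation.
RAW_LINES = [
    "2015-09-01|alice|US|  42.50 ",
    "2015-09-01|bob|us|n/a",
    "2015-09-02|alice|DE|17",
    "malformed line without delimiters",
]

# Hypothetical reference data used by the enrich step.
REGION_BY_COUNTRY = {"US": "Americas", "DE": "EMEA"}


def discover(lines):
    """Profile the raw lines: count how many fields each line splits into."""
    return Counter(len(line.split("|")) for line in lines)


def structure(lines, expected_fields=4):
    """Split lines into named fields, skipping lines that do not fit."""
    names = ("date", "user", "country", "amount")
    rows = []
    for line in lines:
        parts = line.split("|")
        if len(parts) == expected_fields:
            rows.append(dict(zip(names, parts)))
    return rows


def clean(rows):
    """Normalize country codes; turn unparseable amounts into None."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            amount = None
        country = row["country"].strip().upper()
        cleaned.append({**row, "country": country, "amount": amount})
    return cleaned


def enrich(rows):
    """Add a region column by joining against the lookup table."""
    return [
        {**row, "region": REGION_BY_COUNTRY.get(row["country"], "Unknown")}
        for row in rows
    ]


def distill(rows):
    """Reduce to the data product: total amount per region."""
    totals = {}
    for row in rows:
        if row["amount"] is not None:
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


profile = discover(RAW_LINES)
result = distill(enrich(clean(structure(RAW_LINES))))
print(profile)
print(result)
```

Running `discover` first is the point of the sketch: the profile shows that one line does not split into four fields, and that tells the analyst what `structure` and `clean` have to handle.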
The Bottleneck
DATA PRODUCT
Simplicity
DATA SOURCE
Complexity
The Bottleneck on Hadoop
Ingestion Storage Processing IT
ANALYSIS & CONSUMPTION
LOB
Business System Data
Machine Generated Data
Third Party Data
Java
Python
R
Pig
etc…
How do you move from here?
To here?
80%
of the work in any data
project is preparing the
data for analysis
Breakdown of Communication Between IT & LOB
LOB: How can I access the data in Hadoop?
IT: What do you want to analyze?
LOB: I can’t tell you until I see the data – let me see the data first.
IT: I can’t just point you to the raw data – you’ll need to tell me.
Conventional Approaches Inhibit User Empowerment
Hand-Coding
Technical Workflow Mapping
Bringing Hadoop to an Analyst’s Fingertips
“I want direct access to the raw data so I can actually see the content of different datasets to define my analytic requirements.”
JOHN, DATA ANALYST
Wrangle Data Using
This?
Empowering Analysts
Requires a
New User Experience
It’s All About The Experience
Interact
Predict
Preview
Demo
Analyst Workflow on Hadoop
1.  Register Hadoop Data Sets in Trifacta
    HDFS
2.  Visualize, Interact & Define Transformation Script
    HDFS
3.  Execute Script on Entirety of Data Set at Scale in Hadoop
    HDFS
    Execution in Pig or Spark
    Analytic Tools
4.  Select Transformation Output Format & Location
    Hadoop: HDFS, Parquet or Avro, Table in HCatalog
    Analytic Tools: Tableau, R, Etc…
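The four workflow steps follow a "define on a sample, execute on everything" pattern, sketched below in plain Python. Everything here is an assumption for illustration: `read_full_dataset` stands in for a large file in HDFS, `SCRIPT` and `run_script` are hypothetical names rather than Trifacta's script format, and an in-memory CSV stands in for the Parquet or Avro output. In the product the full run would be handed to Pig or Spark rather than a Python loop.

```python
import csv
import io
import itertools


def read_full_dataset():
    """Step 1 stand-in: stream rows of a registered dataset one at a time."""
    for i in range(1000):
        yield {"id": str(i), "name": f" user{i} ", "score": str(i % 7)}


# Step 2: the transformation script is an ordered list of (description, step).
# A step returns the transformed row, or None to drop the row.
SCRIPT = [
    ("trim whitespace from name", lambda row: {**row, "name": row["name"].strip()}),
    ("cast score to int", lambda row: {**row, "score": int(row["score"])}),
    ("keep rows with score >= 5", lambda row: row if row["score"] >= 5 else None),
]


def run_script(rows, script):
    """Apply every step to every row, skipping rows that a step drops."""
    for row in rows:
        for _description, step in script:
            row = step(row)
            if row is None:
                break
        else:
            # Reached only when no step dropped the row.
            yield row


# Step 2 (continued): preview the script on a small sample before committing.
sample = list(itertools.islice(read_full_dataset(), 20))
preview = list(run_script(sample, SCRIPT))

# Steps 3 and 4: run the same script over the entire dataset and write output.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["id", "name", "score"])
writer.writeheader()
full_count = 0
for row in run_script(read_full_dataset(), SCRIPT):
    writer.writerow(row)
    full_count += 1

print(len(preview), "rows in preview;", full_count, "rows in full output")
```

Because the script is data rather than code baked into the loop, the same list of steps drives both the interactive preview and the full run.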
QUESTIONS?
SIGN UP FOR A FREE TRIAL AT
TRIFACTA.COM/TRIAL
THANK YOU!
Perceptions & Questions
Analyst:
Mark Madsen
© Third Nature Inc.
Analyst comments and questions
Ideas about how we make data available are changing
Making data available is not the same as enabling its use
From scarcity to abundance
All the data
Common, typed, tabular data
The bottleneck is us
The old problem was access, the new problem is analysis
Changed design assumption: analysis isn’t read-only
The results of analysis can, and often do, feed back into the system from which they originate.
Much of the data is being read, written and processed in real time.
Our design point in IT was not changing tables and ephemeral patterns.
Schema
In a reporting world data and processing are bounded
No consideration for feedback loops and change
Processing only happens here
Carefully controlled SQL only access
Nobody creates new information
Sources few and well understood
Complex DI is controlled by IT
Schemas are few and designed
Tools are authorized, few in number and kind
One way flow
In an analysis world flow is unbounded and continuous
Feedback loops allowed
End-of-analysis dataset may be start of a BI dataset
Continuous data integration and delivery
Files are back as both input and storage
Minimal barrier of / control on collection
Areas of provisioned data
Any shape in, rectangles out
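"Any shape in, rectangles out" can be illustrated with a small flattening routine: nested records of differing shape go in, and a table with one fixed set of columns comes out. This is a minimal sketch under stated assumptions, not a method from the presentation. The `EVENTS` records and the dotted column naming are hypothetical, and lists inside records are deliberately not handled.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rectangle(records):
    """Return (columns, rows) where every row has a value for every column."""
    flat_rows = [flatten(record) for record in records]
    # The union of all keys gives the rectangle its fixed set of columns.
    columns = sorted({column for row in flat_rows for column in row})
    # Fields a record lacks become None so every row has the same width.
    table = [[row.get(column) for column in columns] for row in flat_rows]
    return columns, table


# Hypothetical event records with differing shapes.
EVENTS = [
    {"user": {"id": 1, "geo": {"country": "US"}}, "action": "click"},
    {"user": {"id": 2}, "action": "view", "device": "mobile"},
]

columns, table = to_rectangle(EVENTS)
print(columns)
for row in table:
    print(row)
```

Taking the union of keys across all records is the design choice that matters: no field is lost, at the cost of `None` cells wherever a record lacked that field.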
The model and reality of ETL: one-way pipes
DI BI
Our methods tell us that data integration and analysis are
separate, and schema comes first as the point of
synchronization between them.
Schema
Schema
Data isn’t just source or target, it’s a continuum
Unusable data that needs engineering: ETL
Data that can be used: BI
Fuzzy areas of data that need engineering and / or composing: exploration, blending & discovery
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
Tools were designed with data model assumptions
Source data model complexity: Simple, Complex
Target data model complexity: Simple, Complex
Blending: Selectively linking and changing data, producing a simpler data model as output
ETL: Multiple complex source models, large complex target model
Application integration: Basic movement of data from one place to another, minimal changes to data
Processing & Analytics: Deriving new data from a relatively simple dataset (like an event stream)
Some questions to start discussion
1.  Who is this product aimed at: end users, analysts or the people who get and manage data for others?
2.  Can you get data from places other than Hadoop?
3.  How do you deal with WYSIWYG data preparation when the dataset is very large?
4.  How well does it handle small datasets?
5.  How do you take something from one-time process to a repeatably executed process in a production environment?
6.  What analysis tool integration is available?
7.  What maintenance features are available?
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
Tokyo forum - http://flickr.com/photos/fukagawa/2004106475/
klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
About the Presenter
Mark Madsen is president of Third Nature, a technology research and
consulting firm focused on business intelligence, data integration and data
management. Mark is an award-winning author, architect and CTO whose work
has been featured in numerous industry publications. Over the past ten years
Mark received awards for his work from the American Productivity & Quality
Center, TDWI, and the Smithsonian Institute. He is an international speaker,
a contributor to Forbes Online and on the O’Reilly Strata program committee.
For more information or to contact Mark, follow @markmadsen on Twitter or
visit http://ThirdNature.net
About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in analytics, business intelligence,
information strategy and data management. If your question is related to
data, analytics, information strategy and technology infrastructure then
you’re in the right place.
Our goal is to help organizations solve problems using data. We offer
education, consulting and research services to support business and IT
organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating technology and how it is
applied rather than vendor market positions.
Upcoming Topics
www.insideanalysis.com
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS
THANK YOU
for your
ATTENTION!
Some images provided courtesy of Wikimedia Commons
and "Grand Canyon view from Pima Point 2010" by Chensiyuan - Own work. Licensed under GFDL via Commons
- https://commons.wikimedia.org/wiki/File:Grand_Canyon_view_from_Pima_Point_2010.jpg#/media/File:Grand_Canyon_view_from_Pima_Point_2010.jpg

More Related Content

What's hot

Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
Mohammed Barakat
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
heba_ahmad
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
Andrew Gardner
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive Framework
Ran Zhang
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
Mahesh Kumar CV
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
Jason Geng
 
Data Science
Data ScienceData Science
Data Science
Prithwis Mukerjee
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
Sri Ambati
 
YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016
Richard Vidgen
 
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New PrecisionAI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
Dr. Haxel Consult
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Ghulam Imaduddin
 
Applications of Machine Learning at USC
Applications of Machine Learning at USCApplications of Machine Learning at USC
Applications of Machine Learning at USC
Sri Ambati
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Dr.Sotarat Thammaboosadee CIMP-Data Governance
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi Periasamy
Peter Kua
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
Buhwan Jeong
 
Datascienceindia article
Datascienceindia articleDatascienceindia article
Datascienceindia article
HimanshuPise1
 
Total Data Industry Report
Total Data Industry ReportTotal Data Industry Report
Total Data Industry Report
Ran Zhang
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
Jason Geng
 

What's hot (20)

Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Stanford DeepDive Framework
Stanford DeepDive FrameworkStanford DeepDive Framework
Stanford DeepDive Framework
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
Data Science
Data ScienceData Science
Data Science
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016YHORG Presentation 23 February 2016
YHORG Presentation 23 February 2016
 
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New PrecisionAI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Applications of Machine Learning at USC
Applications of Machine Learning at USCApplications of Machine Learning at USC
Applications of Machine Learning at USC
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi PeriasamyData science vs. Data scientist by Jothi Periasamy
Data science vs. Data scientist by Jothi Periasamy
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
Datascienceindia article
Datascienceindia articleDatascienceindia article
Datascienceindia article
 
Total Data Industry Report
Total Data Industry ReportTotal Data Industry Report
Total Data Industry Report
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 

Viewers also liked

Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopGwen (Chen) Shapira
 
Real time analytics in Big Data
Real time analytics in Big DataReal time analytics in Big Data
Real time analytics in Big Data
BharathiRaja Chandrasekaran
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
Ashwini Kuntamukkala
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherence
Skillet Tony
 
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
Naoto MATSUMOTO
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
Ritvvij Parrikh
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Gajanand Sharma
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
Golda Margret Sheeba J
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
Dung Nguyen
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & SecondaryPrathamesh Parab
 

Viewers also liked (12)

Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Real time analytics in Big Data
Real time analytics in Big DataReal time analytics in Big Data
Real time analytics in Big Data
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherence
 
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Chapter 10-DATA ANALYSIS & PRESENTATION
Chapter 10-DATA ANALYSIS & PRESENTATIONChapter 10-DATA ANALYSIS & PRESENTATION
Chapter 10-DATA ANALYSIS & PRESENTATION
 
Data Collection-Primary & Secondary
Data Collection-Primary & SecondaryData Collection-Primary & Secondary
Data Collection-Primary & Secondary
 

Similar to The Role of Data Wrangling in Driving Hadoop Adoption

The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
Inside Analysis
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
mark madsen
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
PothyeswariPothyes
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
mark madsen
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
Data Science Council of America
 
Business in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for IntegrationBusiness in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for Integration
Inside Analysis
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
Dataiku
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
AbderrahmanABID2
 
Choosing which big data, nosql or database technology to use
Choosing which big data, nosql or database technology to useChoosing which big data, nosql or database technology to use
Choosing which big data, nosql or database technology to use
mark madsen
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
Prof.Balakrishnan S
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
TJ Stalcup
 
The Analytic Platform: Empowering the Business Now
The Analytic Platform: Empowering the Business NowThe Analytic Platform: Empowering the Business Now
The Analytic Platform: Empowering the Business Now
Inside Analysis
 
Data science in business Administration Nagarajan.pptx
Data science in business Administration Nagarajan.pptxData science in business Administration Nagarajan.pptx
Data science in business Administration Nagarajan.pptx
NagarajanG35
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
Jürgen Ambrosi
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
Inside Analysis
 
Connecting the Dots with Data Mashups
Connecting the Dots with Data MashupsConnecting the Dots with Data Mashups
Connecting the Dots with Data Mashups
Inside Analysis
 
Implementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White PaperImplementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White Paper
shashanksalunkhe12
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
Juuso Parkkinen
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data Science
NyraSehgal
 
Agile data science
Agile data scienceAgile data science
Agile data science
Joel Horwitz
 

Similar to The Role of Data Wrangling in Driving Hadoop Adoption (20)

The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018Architecting a Platform for Enterprise Use - Strata London 2018
Architecting a Platform for Enterprise Use - Strata London 2018
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)Architecting a Data Platform For Enterprise Use (Strata NY 2018)
Architecting a Data Platform For Enterprise Use (Strata NY 2018)
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
 
Business in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for IntegrationBusiness in the Driver’s Seat – An Improved Model for Integration
Business in the Driver’s Seat – An Improved Model for Integration
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
 
Choosing which big data, nosql or database technology to use
Choosing which big data, nosql or database technology to useChoosing which big data, nosql or database technology to use
Choosing which big data, nosql or database technology to use
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
The Analytic Platform: Empowering the Business Now
The Analytic Platform: Empowering the Business NowThe Analytic Platform: Empowering the Business Now
The Analytic Platform: Empowering the Business Now
 
Data science in business Administration Nagarajan.pptx
Data science in business Administration Nagarajan.pptxData science in business Administration Nagarajan.pptx
Data science in business Administration Nagarajan.pptx
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
Connecting the Dots with Data Mashups
Connecting the Dots with Data MashupsConnecting the Dots with Data Mashups
Connecting the Dots with Data Mashups
 
Implementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White PaperImplementing Data Mesh WP LTIMindtree White Paper
Implementing Data Mesh WP LTIMindtree White Paper
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
Welcome to Data Science
Welcome to Data ScienceWelcome to Data Science
Welcome to Data Science
 
Agile data science
Agile data scienceAgile data science
Agile data science
 

More from Inside Analysis

An Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BIAn Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BI
Inside Analysis
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for Success
Inside Analysis
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
Inside Analysis
 
Fit For Purpose: Preventing a Big Data Letdown

The Role of Data Wrangling in Driving Hadoop Adoption

  • 1. Grab some coffee and enjoy the pre-show banter before the top of the hour!
  • 2. The Briefing Room The Role of Data Wrangling in Driving Hadoop Adoption
  • 3. Twitter Tag: #briefr The Briefing Room Welcome Host: Eric Kavanagh eric.kavanagh@bloorgroup.com @eric_kavanagh
  • 4. Mission: Reveal the essential characteristics of enterprise software, good and bad. Provide a forum for detailed analysis of today's innovative technologies. Give vendors a chance to explain their product to savvy analysts. Allow audience members to pose serious questions... and get answers!
  • 5. Topics: September: HADOOP 2.0; October: DATA MANAGEMENT; November: ANALYTICS
  • 6. The Great Divide: Close the Gap; Empower Business Users; Shift Focus of IT; Developers are Third Leg
  • 7. Analyst: Mark Madsen. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 8. Trifacta: Trifacta offers a platform for data transformation and preparation. The interface is rich in visualization and provides a productive data wrangling capability. The platform also includes access to raw data in Hadoop, providing analysts and data scientists with secure, governed data.
  • 9. Guests: Will Davis, Director of Product Marketing, Trifacta; Alon Bartur, Principal Product Manager, Trifacta
  • 10. Trifacta: The Role of Data Wrangling In Driving Hadoop Adoption
  • 11. Variety = Data is Messy
  • 12. When Data is Messy… Analysis is More Complicated Question Analysis Insight
  • 13. Messy Data Requires Data Wrangling. Question, Analyze, Insight. Data Wrangling: Discover, Structure, Clean, Enrich, Distill
  • 15. The Bottleneck on Hadoop Ingestion Storage Processing IT ANALYSIS & CONSUMPTION LOB Business System Data Machine Generated Data Third Party Data Java Python R Pig etc. How do you move from here? To here? 80% of the work in any data project is preparing the data for analysis.
  • 16. Breakdown of Communication Between IT & LOB. LOB: "How can I access the data in Hadoop?" IT: "What do you want to analyze?" LOB: "I can’t tell you until I see the data; let me see the data first." IT: "I can’t just point you to the raw data; you’ll need to tell me."
  • 17. Conventional Approaches Inhibit User Empowerment Hand-Coding Technical Workflow Mapping
  • 18. Bringing Hadoop to an Analyst’s Fingertips. "I want direct access to the raw data so I can actually see the content of different datasets to define my analytic requirements." (John, Data Analyst) Wrangle Data Using This?
  • 20. It’s All About The Experience Interact Predict Preview
  • 22. Analyst Workflow on Hadoop: 1. Register Hadoop Data Sets in Trifacta (HDFS). 2. Visualize, Interact & Define Transformation Script (HDFS). 3. Execute Script on Entirety of Data Set at Scale in Hadoop (HDFS; Execution in Pig or Spark). 4. Select Transformation Output Format & Location (Hadoop: HDFS, Parquet or Avro, Table in HCatalog; Analytic Tools: Tableau, R, etc.)
  • 24. SIGN UP FOR A FREE TRIAL AT TRIFACTA.COM/TRIAL THANK YOU!
  • 25. Perceptions & Questions. Analyst: Mark Madsen
  • 26. © Third Nature Inc. Analyst comments and questions
  • 27. Ideas about how we make data available are changing. Making data available is not the same as enabling its use.
  • 28. From scarcity to abundance. All the data. Common, typed, tabular data. The bottleneck is us.
  • 29. The old problem was access, the new problem is analysis.
  • 30. Changed design assumption: analysis isn’t read-only. The results of analysis can, often do, feed back into the system from which they originate. Much of the data is being read, written and processed in real time. Our design point in IT was not changing tables and ephemeral patterns.
  • 31. In a reporting world data and processing are bounded. Schema. No consideration for feedback loops and change. Processing only happens here. Carefully controlled SQL only access. Nobody creates new information. Sources few and well understood. Complex DI is controlled by IT. Schemas are few and designed. Tools are authorized, few in number and kind. One way flow.
  • 32. In an analysis world flow is unbounded and continuous. Feedback loops allowed. End-of-analysis dataset may be start of a BI dataset. Continuous data integration and delivery. Files are back as both input and storage. Minimal barrier of / control on collection. Areas of provisioned data. Any shape in, rectangles out.
  • 33. The model and reality of ETL: one-way pipes. DI, Schema, BI. Our methods tell us that data integration and analysis are separate, and schema comes first as the point of synchronization between them.
  • 34. Data isn’t just source or target, it’s a continuum. Schema. Unusable data that needs engineering: ETL. Data that can be used: BI. Fuzzy areas of data that need engineering and / or composing: exploration, blending & discovery.
  • 35. Food supply chain: an analogy for data. Multiple contexts of use, differing quality levels.
  • 36. Tools were designed with data model assumptions. Axes: source data model complexity (simple to complex) and target data model complexity (simple to complex). Blending: selectively linking and changing data, producing a simpler data model as output. ETL: multiple complex source models, large complex target model. Application integration: basic movement of data from one place to another, minimal changes to data. Processing & Analytics: deriving new data from a relatively simple dataset (like an event stream).
  • 37. Some questions to start discussion: 1. Who is this product aimed at: end users, analysts or the people who get and manage data for others? 2. Can you get data from places other than Hadoop? 3. How do you deal with WYSIWYG data preparation when the dataset is very large? 4. How well does it handle small datasets? 5. How do you take something from a one-time process to a repeatably executed process in a production environment? 6. What analysis tool integration is available? 7. What maintenance features are available?
  • 38. CC Image Attributions. Thanks to the people who supplied the creative commons licensed images used in this presentation: Tokyo forum - http://flickr.com/photos/fukagawa/2004106475/ klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/ donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
  • 39. About the Presenter. Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 40. About Third Nature. Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure, then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
  • 42. Upcoming Topics (www.insideanalysis.com): September: HADOOP 2.0; October: DATA MANAGEMENT; November: ANALYTICS
  • 43. THANK YOU for your ATTENTION! Some images provided courtesy of Wikimedia Commons and "Grand Canyon view from Pima Point 2010" by Chensiyuan - Own work. Licensed under GFDL via Commons - https://commons.wikimedia.org/wiki/File:Grand_Canyon_view_from_Pima_Point_2010.jpg#/media/File:Grand_Canyon_view_from_Pima_Point_2010.jpg
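
The analyst workflow on slide 22 (define a transformation script interactively on a small sample, then execute the same script on the entire data set) can be sketched in plain Python. This is only an illustration of the idea, not Trifacta's implementation: the sample data, the column names, and the step functions `structure`, `clean` and `distill` are invented for the example, and the standard-library `csv` module stands in for Hadoop, Pig or Spark.

```python
import csv
import io

# Hypothetical raw data; columns and values are invented for illustration.
RAW = """id,amount,region
1, 10.50 ,east
2,,west
3,7.25,EAST
4,abc,north
"""


def structure(text):
    """Structure: parse raw text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: trim whitespace, normalise case, drop rows with a non-numeric amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # missing or malformed amount
        cleaned.append({
            "id": int(row["id"]),
            "amount": amount,
            "region": row["region"].strip().lower(),
        })
    return cleaned


def distill(rows):
    """Distill: total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# The "script" is an ordered list of steps, defined once.
SCRIPT = [structure, clean, distill]


def run(script, data):
    """Apply each step of the script in order."""
    for step in script:
        data = step(data)
    return data


# Preview on a sample (header plus the first two rows), then run on everything.
sample = "\n".join(RAW.splitlines()[:3]) + "\n"
preview = run(SCRIPT, sample)
full = run(SCRIPT, RAW)
print(preview)
print(full)
```

Keeping the script as data (a list of steps) is what lets the same definition be previewed on a sample and then replayed unchanged at scale; in the workflow the slide describes, that replay would run in Pig or Spark rather than in a local loop.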