As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline and is often considered its crux, taking as much as 80% of an analyst's time. In this presentation, we discuss how data wrangling solutions can streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has seven years of experience in analytics, with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Data Wrangling on Hadoop - Olivier de Garrigues, Trifacta
1. Hadoop User Group London: Data Wrangling on Hadoop
September 8, 2016
Olivier de Garrigues, EMEA Solutions Lead
2. Creating radical productivity
for people who analyze data.
JEFFREY HEER
Co-Founder & CXO
VISUALIZATION
JOE HELLERSTEIN
Co-Founder & CSO
BIG DATA
SEAN KANDEL
Co-Founder & CTO
HUMAN-COMPUTER INTERACTION
4. What is Data Wrangling?
QUESTION ANALYZE INSIGHT
DISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH
5. The Bridge Between Raw Data & Analysis
Ingestion Storage Processing
ANALYSIS & VISUALIZATION
LOB
DISCOVERY STRUCTURING CLEANING ENRICHMENT DISTILLATION
End-User Capabilities
IT
GOVERNANCE INTEGRATION AVAILABILITY SCALABILITY SECURITY
Technical Capabilities
9. TRIFACTA
DATA WRANGLING WORKFLOW
Trifacta. Confidential & Proprietary.
Sample Scale Up
Refine
Sample
Results
Identify/Register Data
1.
Predictive Interaction
2.
Consume
Schedulers
Monitor and Adjust
3.
Schedule
Visualization & Analysis
Secure Access
10. Ingestion Processing Storage
ANALYSIS & CONSUMPTION
Discover Structure Clean Enrich Distill
LOB
IT
News
Topics
Time
Trades
Tickers
Date
$
eMails
Recipients
Topics
Phone Logs
Call Details
Recipients
Corporations
Company Relations
Individuals
Financial Services use case: Trader Fraud
11. Data Wrangling Benefits
➔ Empower the people who know the data best
➔ Accelerate time to value
➔ Lower business risk with more accurate data
➔ Unlock innovation using a wider variety of data