Best Practices for Data at Scale
Carolyn Duby
Big Data Architect
Hortonworks
Choosing a Use Case
• Build the business case
– Assess the value (profit minus investment) year over year
– Consult industry experts
• Start small and simple
• Map out a path to future use cases
– One year out
• Don’t oversell
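The value assessment in the business case can be sketched as simple arithmetic. The snippet below is a minimal illustration that reads the bullet as value = profit minus investment for each year. The function names and the figures are hypothetical and do not come from the talk.

```python
from itertools import accumulate


def yearly_net_value(profits, investments):
    """Net value per year, read here as profit minus investment (an assumption)."""
    if len(profits) != len(investments):
        raise ValueError("profits and investments must cover the same years")
    return [p - i for p, i in zip(profits, investments)]


def cumulative_value(net_values):
    """Running total of net value, to see when the use case pays for itself."""
    return list(accumulate(net_values))


# Hypothetical figures for a three-year business case.
profits = [100, 300, 500]
investments = [400, 100, 100]

net = yearly_net_value(profits, investments)
running = cumulative_value(net)
print(net, running)
```

Looking at the running total as well as the yearly figure helps avoid overselling: a use case can lose money in year one and still be worth funding.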
[Figure: chart of example use cases. Section labels: INNOVATE, RENOVATE; EXPLORE, OPTIMIZE, TRANSFORM. Categories: Active Archive, ETL Onboard, Data Enrichment, Data Discovery, Single View, Predictive Analytics. Use cases shown: Payment Tracking, Due Diligence, Social Mapping, Product Design, M&A, Call Analysis, Machine Data, Defect Detecting, Factory Yields, Customer Support, Basket Analysis, Segments, Customer Retention, Sentiment Analysis, Optimize Inventories, Supply Chain, Cross-Sell, Vendor Scorecards, Ad Placement, Cyber Security, Disaster Mitigation, Investment Planning, Risk Modeling, Proactive Repair, Inventory Predictions, Next Product Recs, OPEX Reduction, Historical Records, Mainframe Offloads, Device Data Ingest, Rapid Reporting, Digital Protection, Data as a Service, Fraud Prevention, Public Data Capture.]
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Learn to Communicate with the Business
• Data-driven decisions don’t come naturally
• Don’t dwell on technical details
• A picture is worth a thousand words
• Explain counterintuitive results
Do a Pilot
• Try out your ideas
• Fail fast
– Can you get the data?
– Is the data useful?
– How much will it really cost?
Pilot in the Cloud
• Spinning up a cluster in the cloud is quick
• Focus on the problem you are trying to solve
• Minimize startup time and cost
Setting up a Cluster
• Build in governance and security from the start
• They are harder to add later
• Protect your data from day one
• Aggregated data needs good security
Don’t Skimp
• Train or hire skilled people
• Get the right hardware for the workload
– Cluster size
– Hardware configuration
• Start with a balanced hardware configuration
Data at Scale Solution Components
• Getting the raw data
• Cleaning the data
– The first two steps can be a big job
• Building the model
• Deploying or productizing the model
Improve Iteratively
• Start simply
• Add more data and improve accuracy as needed
• Simpler models are easier to understand
• Don’t trade simplicity for small gains in accuracy
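One way to make the simplicity-versus-accuracy trade-off concrete is to pick the simplest model whose accuracy is close enough to the best one. The sketch below is only an illustration under stated assumptions: the `complexity` score, the `accuracy_pct` values, the model names, and the tolerance are all hypothetical and would come from your own evaluation.

```python
def pick_model(candidates, tolerance_pct=1.0):
    """Return the least complex candidate whose accuracy is within
    `tolerance_pct` percentage points of the best accuracy seen."""
    if not candidates:
        raise ValueError("no candidate models given")
    best = max(c["accuracy_pct"] for c in candidates)
    good_enough = [c for c in candidates if best - c["accuracy_pct"] <= tolerance_pct]
    return min(good_enough, key=lambda c: c["complexity"])


# Hypothetical evaluation results; complexity is a made-up relative score.
candidates = [
    {"name": "logistic_regression", "complexity": 1, "accuracy_pct": 90.0},
    {"name": "gradient_boosting", "complexity": 5, "accuracy_pct": 90.5},
    {"name": "deep_network", "complexity": 9, "accuracy_pct": 93.0},
]

print(pick_model(candidates, tolerance_pct=3.0)["name"])
print(pick_model(candidates, tolerance_pct=1.0)["name"])
```

The tolerance is a business decision: it states how much accuracy you are willing to give up for a model that is easier to explain.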
Scaling Up
• Pat yourself on the back! You did it!
• Go back to the business case and find more value
• Horizontally scale your cluster as needed
• Take on more advanced use cases
Capacity Planning
• Proactively monitor storage and compute
• Stay below 80% utilization
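The 80% guideline can be checked with a small script. This is a minimal sketch, assuming you already collect used and total capacity per resource from your own monitoring; the resource names and numbers below are hypothetical.

```python
def utilization(used, capacity):
    """Fraction of capacity in use, between 0.0 and 1.0 for valid input."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def resources_over_threshold(usage, threshold=0.80):
    """Return the names of resources at or above the utilization threshold.

    `usage` maps a resource name to a (used, capacity) pair.
    """
    return sorted(
        name
        for name, (used, capacity) in usage.items()
        if utilization(used, capacity) >= threshold
    )


# Hypothetical snapshot of cluster storage and compute.
usage = {
    "storage_tb": (850, 1000),
    "memory_gb": (3000, 4096),
    "vcores": (400, 512),
}

flagged = resources_over_threshold(usage)
print(flagged)
```

Running a check like this on a schedule gives early warning, so capacity can be added before a resource becomes a bottleneck.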
Disaster Recovery
• Address disaster recovery early
• It is a requirement for business-critical use cases
• Lack of DR will block higher-value use cases
Questions
• Ask away!
www.globalbigdataconference.com
Twitter : @bigdataconf

Best Practices for Data at Scale - Global Data Science Conference
