Best Practices for Data at Scale - Global Data Science Conference

Best Practices for Data at Scale
Carolyn Duby
Big Data Architect
Hortonworks

Choosing a Use Case
• Build the business case
– Assess the value: profit minus investment, year over year
– Consult industry experts
• Start small, simple
• Map out path to future use cases
– One year out
• Don’t oversell
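
As a rough illustration of assessing value from profit and investment year over year, the sketch below computes the net value for each year and a running total. The function names and all figures are hypothetical, chosen only for the example; a real business case would use the organization's own estimates.

```python
def yearly_net_value(profits, investments):
    """Return profit minus investment for each year, in order."""
    if len(profits) != len(investments):
        raise ValueError("profits and investments must cover the same years")
    return [p - i for p, i in zip(profits, investments)]


def running_total(values):
    """Return the cumulative sum of values, one entry per year."""
    total = 0
    totals = []
    for v in values:
        total += v
        totals.append(total)
    return totals


# Hypothetical three-year estimate (all numbers are made up).
profits = [0, 150_000, 400_000]
investments = [200_000, 100_000, 100_000]

net = yearly_net_value(profits, investments)
cumulative = running_total(net)

for year, (n, c) in enumerate(zip(net, cumulative), start=1):
    print(f"Year {year}: net {n:>10,}  cumulative {c:>10,}")
```

A negative first year followed by a positive cumulative total is the kind of picture that supports starting small without overselling.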

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
[Figure: map of example use cases. Labels: INNOVATE, RENOVATE; EXPLORE, OPTIMIZE, TRANSFORM; ACTIVE ARCHIVE, ETL ONBOARD, DATA ENRICHMENT, DATA DISCOVERY, SINGLE VIEW, PREDICTIVE ANALYTICS]
• Use cases shown: Payment Tracking, Due Diligence, Social Mapping, Product Design, M & A, Call Analysis, Machine Data, Defect Detecting, Factory Yields, Customer Support, Basket Analysis, Segments, Customer Retention, Sentiment Analysis, Optimize Inventories, Supply Chain, Cross-Sell, Vendor Scorecards, Ad Placement, Cyber Security, Disaster Mitigation, Investment Planning, Risk Modeling, Proactive Repair, Inventory Predictions, Next Product Recs, OPEX Reduction, Historical Records, Mainframe Offloads, Device Data Ingest, Rapid Reporting, Digital Protection, Data as a Service, Fraud Prevention, Public Data Capture

Learn to Communicate with the Business
• Data-driven decisions don’t come naturally
• Don’t dwell on technical details
• A picture is worth a thousand words
• Explain counterintuitive results

Do a Pilot
• Try out your ideas
• Fail fast
– Can you get the data?
– Is the data useful?
– How much will it really cost?

Pilot in the Cloud
• Spinning up a cluster in the cloud is quick
• Focus on the problem you are trying to solve
• Minimize startup time and cost

Setting up a Cluster
• Build in governance and security from the start
• Harder to add in later
• Protect your data from day one
• Aggregated data needs good security

Don’t Skimp
• Train or hire skilled people
• Get the right hardware for the workload
– Cluster size
– Hardware configuration
• Start with a balanced hardware configuration

Data at Scale Solution Components
• Getting the raw data
• Cleaning the data
– First two steps can be a big job
• Building the model
• Deploying or productizing the model

Improve Iteratively
• Start simply
• Add more data and improve accuracy as needed
• Simpler models are easier to understand
• Don’t add complexity for small gains in accuracy
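
One way to make the trade-off between complexity and accuracy concrete is to pick the simplest model whose accuracy is within a small tolerance of the best one. The sketch below assumes each candidate has already been evaluated; the model names, complexity scores, accuracies, and the tolerance value are all hypothetical.

```python
def pick_simplest_adequate(candidates, tolerance=0.01):
    """Pick the least complex candidate close enough to the best accuracy.

    candidates: list of (name, complexity, accuracy) tuples, where a lower
    complexity score means a simpler model.
    tolerance: largest accuracy gap to the best model we will accept.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best_accuracy = max(acc for _, _, acc in candidates)
    adequate = [c for c in candidates if best_accuracy - c[2] <= tolerance]
    return min(adequate, key=lambda c: c[1])


# Hypothetical evaluation results (made-up numbers).
candidates = [
    ("logistic regression", 1, 0.910),
    ("gradient boosting", 5, 0.915),
    ("deep ensemble", 20, 0.917),
]

name, complexity, accuracy = pick_simplest_adequate(candidates, tolerance=0.01)
print(f"Chosen: {name} (complexity {complexity}, accuracy {accuracy:.3f})")
```

With these numbers the simplest model is within the tolerance of the best one, so it is chosen; tightening the tolerance is an explicit decision to pay for more complexity.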

Scaling Up
• Pat yourself on the back! You did it!
• Go back to the business case and find more value
• Horizontally scale your cluster as needed
• Take on more advanced use cases

Capacity Planning
• Proactively monitor storage and compute
• Stay below 80% utilization
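
A minimal sketch of the 80% rule, assuming usage figures have already been collected from whatever monitoring is in place. The resource names and numbers are hypothetical, and the function does not call any real cluster API.

```python
def resources_needing_attention(usage, threshold=0.80):
    """Return the names of resources at or above the utilization threshold.

    usage: dict mapping a resource name to a (used, capacity) pair.
    threshold: utilization fraction to stay below (0.80 means 80%).
    """
    flagged = []
    for name, (used, capacity) in usage.items():
        if capacity <= 0:
            raise ValueError(f"capacity for {name} must be positive")
        if used / capacity >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical snapshot of storage and compute usage (made-up numbers).
usage = {
    "storage_tb": (410, 500),     # 82% used
    "memory_gb": (1200, 2048),    # about 59% used
    "cpu_cores": (300, 400),      # 75% used
}

for name in resources_needing_attention(usage):
    print(f"{name} is at or above 80% utilization; plan to add capacity")
```

Running a check like this on a schedule keeps monitoring proactive instead of waiting for jobs to fail.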

Disaster Recovery
• Address disaster recovery early
• Required for business-critical use cases
• Lack of DR will block higher-value use cases

www.globalbigdataconference.com
Twitter : @bigdataconf
