Best Practices for Data at Scale
Carolyn Duby
Big Data Architect
Hortonworks
Choosing a Use Case
• Build the business case
– Assess the value: profit minus investment, year over year (see the sketch after this list)
– Consult industry experts
• Start small, simple
• Map out path to future use cases
– One year out
• Don’t oversell
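As a back-of-the-envelope illustration of the value calculation above, here is a minimal Python sketch; the investment and profit figures are invented placeholders, not benchmarks.

```python
# Hypothetical numbers: replace with figures from your own business case.
investment = {1: 500_000, 2: 150_000, 3: 150_000}   # platform + team cost per year
profit = {1: 200_000, 2: 600_000, 3: 900_000}       # value delivered per year

cumulative = 0
for year in sorted(investment):
    net = profit[year] - investment[year]
    cumulative += net
    print(f"Year {year}: net value {net:+,}, cumulative {cumulative:+,}")
```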
[Slide: use-case landscape diagram. Dozens of example use cases (payment tracking, due diligence, fraud prevention, sentiment analysis, customer retention, basket analysis, cyber security, risk modeling, proactive repair, mainframe offload, device data ingest, and others) are mapped along an Explore → Optimize → Transform journey, across the Renovate and Innovate tracks and the stages Active Archive, ETL Onboard, Data Enrichment, Data Discovery, Single View, and Predictive Analytics.]
Learn to Communicate with the Business
• Data-driven decisions don’t come naturally
• Don’t dwell on technical details
• A picture is worth a thousand words (see the chart sketch after this list)
• Explain counterintuitive results
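To illustrate the "picture is worth a thousand words" point, here is a small matplotlib sketch of the single-message kind of chart that lands with a business audience; the churn numbers and the "retention model" framing are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly churn rates; swap in the metric your audience cares about.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
churn_pct = [5.2, 5.1, 5.0, 4.3, 3.9, 3.6]
x = range(len(months))

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, churn_pct, marker="o")
ax.axvline(x=3, linestyle="--", color="gray", label="retention model launched")
ax.set_xticks(x)
ax.set_xticklabels(months)
ax.set_ylabel("Customer churn (%)")
ax.set_title("Churn fell after the retention model launched")  # lead with the conclusion
ax.legend()
fig.tight_layout()
plt.show()
```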
Do a Pilot
• Try out your ideas
• Fail fast
– Can you get the data?
– Is the data useful?
– How much will it really cost?
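A minimal sketch of the fail-fast checks above, assuming you can pull a small sample extract; the file name pilot_sample.csv and the pandas approach are illustrative, not prescribed by the talk.

```python
import pandas as pd

# Hypothetical sample extract; point this at a small slice of the real source.
sample = pd.read_csv("pilot_sample.csv")

print(f"Rows: {len(sample)}, columns: {len(sample.columns)}")
print("Null rate per column:")
print(sample.isna().mean().sort_values(ascending=False).round(3))
print(f"Duplicate rows: {sample.duplicated().sum()}")
# If key fields are mostly null or heavily duplicated, the use case may not be viable yet.
```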
Pilot in the Cloud
• Spinning up a cluster in the cloud is quick
• Focus on the problem you are trying to solve
• Minimize startup time and cost
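One way to keep startup time and cost down is to script the pilot cluster itself. The talk does not prescribe a provider; the sketch below uses boto3 and Amazon EMR as one example, and the instance types, IAM roles, release label, and bucket name are all placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="data-at-scale-pilot",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-pilot-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                     # small on purpose: this is a pilot
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,          # make it cheap to tear down
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster starting:", response["JobFlowId"])
```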
Setting up a Cluster
• Build in governance and security from the start
• Both are harder to add later
• Protect your data from day one
• Aggregated data needs good security
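As a rough illustration of protecting data from day one, here is a sketch that locks down an HDFS landing zone with permissions, an ACL, and an encryption zone. It assumes a Kerberized cluster with KMS enabled and an existing "analysts" group; in a fuller setup, a policy tool such as Apache Ranger would manage this centrally.

```python
import subprocess

# Minimal day-one hardening of an HDFS landing zone (paths and names are placeholders).
commands = [
    ["hadoop", "key", "create", "raw-zone-key"],                            # encryption key in KMS
    ["hdfs", "dfs", "-mkdir", "-p", "/data/raw"],
    ["hdfs", "dfs", "-chmod", "750", "/data/raw"],                          # no world access
    ["hdfs", "dfs", "-setfacl", "-m", "group:analysts:r-x", "/data/raw"],   # read-only ACL
    ["hdfs", "crypto", "-createZone", "-keyName", "raw-zone-key", "-path", "/data/raw"],
]

for cmd in commands:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```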
Don’t Skimp
• Train or hire skilled people
• Get the right hardware for the workload
– Cluster size
– Hardware configuration
• Start with a balanced hardware configuration
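A back-of-the-envelope sizing sketch for the "right hardware for the workload" point; every input below is an assumption to replace with your own numbers.

```python
import math

# Rough cluster sizing; all figures here are illustrative assumptions.
raw_data_tb = 100          # data expected in year one
growth_factor = 1.25       # headroom for intermediate data and growth
replication = 3            # HDFS default replication factor
usable_tb_per_node = 36    # e.g. 12 x 4 TB drives, minus OS and temp space

required_tb = raw_data_tb * growth_factor * replication
nodes = math.ceil(required_tb / usable_tb_per_node)
print(f"Provision about {required_tb:.0f} TB of raw storage, roughly {nodes} worker nodes")
```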
Data at Scale Solution Components
• Getting the raw data
• Cleaning the data
– The first two steps can be a big job
• Building the model
• Deploying or productizing the model
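A minimal PySpark sketch of those four components end to end, using a hypothetical churn dataset; the HDFS paths and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-pilot").getOrCreate()

# 1. Get the raw data (path and column names are placeholders).
raw = spark.read.csv("hdfs:///data/raw/customers.csv", header=True, inferSchema=True)

# 2. Clean it: deduplicate and drop rows missing the fields the model needs.
clean = (
    raw.dropDuplicates(["customer_id"])
       .dropna(subset=["tenure_months", "monthly_spend", "churned"])
       .withColumn("churned", col("churned").cast("double"))
)

# 3. Build a simple model.
assembler = VectorAssembler(inputCols=["tenure_months", "monthly_spend"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="churned").fit(assembler.transform(clean))

# 4. Deploy / productize: persist the model so a separate scoring job can load it.
model.write().overwrite().save("hdfs:///models/churn/v1")
```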
Improve Iteratively
• Start simply
• Add more data and improve accuracy as needed
• Simpler models are easier to understand
• Don’t trade away simplicity for small gains in accuracy
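One way to keep that trade-off honest is to measure it. The sketch below compares a simple model against a more complex one on synthetic data and keeps the simple one unless the gain clears a threshold; the 0.02 cutoff is a placeholder judgment call.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in the features from your own pipeline.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

simple = LogisticRegression(max_iter=1000)
complex_model = GradientBoostingClassifier()

simple_acc = cross_val_score(simple, X, y, cv=5).mean()
complex_acc = cross_val_score(complex_model, X, y, cv=5).mean()

print(f"Simple model accuracy:  {simple_acc:.3f}")
print(f"Complex model accuracy: {complex_acc:.3f}")

# Only accept the extra complexity if the gain is worth it.
if complex_acc - simple_acc < 0.02:
    print("Keep the simpler model: it is easier to explain and maintain.")
```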
Scaling Up
• Pat yourself on the back! You did it!
• Go back to the business case and find more value
• Horizontally scale your cluster as needed
• Take on more advanced use cases
Capacity Planning
• Proactively monitor storage and compute
• Stay below 80% utilization
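A minimal monitoring sketch for the 80% rule, assuming the NameNode JMX endpoint is reachable; the hostname is a placeholder and the web port differs by Hadoop version.

```python
import requests

# Hostname is a placeholder; the NameNode web port is 50070 on Hadoop 2.x, 9870 on 3.x.
JMX_URL = "http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"

beans = requests.get(JMX_URL, timeout=10).json()["beans"][0]
used_pct = 100.0 * beans["CapacityUsed"] / beans["CapacityTotal"]

print(f"HDFS capacity used: {used_pct:.1f}%")
if used_pct > 80:
    print("WARNING: above the 80% threshold -- plan the expansion now, not later.")
```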
Disaster Recovery
• Address disaster recovery early
• A requirement for business-critical use cases
• Lack of DR will block higher-value use cases
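As one concrete starting point, critical datasets can be replicated to a DR cluster with DistCp on a schedule; the cluster addresses and paths below are placeholders.

```python
import subprocess

# Placeholder cluster addresses and paths; run on a schedule once DR requirements are agreed.
src = "hdfs://prod-nn:8020/data/curated"
dst = "hdfs://dr-nn:8020/data/curated"

# -update copies only changed files; -delete removes target files that no longer exist at the source.
subprocess.run(["hadoop", "distcp", "-update", "-delete", src, dst], check=True)
```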
Questions
• Ask away!
www.globalbigdataconference.com
Twitter: @bigdataconf
