Best Practices for Data at Scale - Global Data Science Conference
1. Best Practices for Data at Scale
Carolyn Duby
Big Data Architect
Hortonworks
2. Choosing a Use Case
• Build the business case
– Assess the value: profit minus investment, year over year
– Consult industry experts
• Start small, simple
• Map out path to future use cases
– One year out
• Don’t oversell
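As a rough illustration of the "assess the value" bullet above, the sketch below computes net value (profit minus investment) for each year and as a running total. The function name and all figures are hypothetical and chosen only for the example; a real business case would use estimates agreed with the business.

```python
def net_value_by_year(profits, investments):
    """Return (yearly_net, cumulative_net) for matching lists of yearly figures."""
    if len(profits) != len(investments):
        raise ValueError("profits and investments must cover the same years")
    # Net value per year: profit minus investment.
    yearly = [p - i for p, i in zip(profits, investments)]
    # Running total shows when the use case pays back its investment.
    cumulative = []
    total = 0
    for net in yearly:
        total += net
        cumulative.append(total)
    return yearly, cumulative


# Hypothetical three-year estimate (illustrative numbers only).
profits = [50_000, 200_000, 400_000]
investments = [250_000, 100_000, 100_000]
yearly, cumulative = net_value_by_year(profits, investments)
```

With these made-up numbers the first year is negative and the cumulative total turns positive in year three, which is the kind of year-over-year picture the business case needs to show.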
4. Learn to Communicate with the Business
• Data-driven decisions don’t come naturally
• Don’t dwell on technical details
• A picture is worth a thousand words
• Explain counterintuitive results
5. Do a Pilot
• Try out your ideas
• Fail fast
– Can you get the data?
– Is the data useful?
– How much will it really cost?
6. Pilot in the Cloud
• Spinning up a cluster in the cloud is quick
• Focus on the problem you are trying to solve
• Minimize startup time and cost
7. Setting up a Cluster
• Build in governance and security from the start
• Harder to add in later
• Protect your data from day one
• Aggregated data needs good security
8. Don’t Skimp
• Train or hire skilled people
• Get the right hardware for the workload
– Cluster size
– Hardware configuration
• Start with a balanced hardware configuration
9. Data at Scale Solution Components
• Getting the raw data
• Cleaning the data
– First two steps can be a big job
• Building the model
• Deploying or productizing the model
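The four components above can be sketched as a tiny end-to-end pipeline. This is a minimal illustration in plain Python, not a recommendation of any tool: the function names, the inline sample rows, and the choice of a least-squares line as the "model" are all assumptions made for the example.

```python
def get_raw_data():
    # Hypothetical raw records; in practice these come from files, logs, or feeds.
    return ["1,2.1", "2,3.9", "bad row", "3,6.2", "4,", "4,8.1"]


def clean(rows):
    """Keep only rows that parse as two numbers; drop everything else."""
    cleaned = []
    for row in rows:
        parts = row.split(",")
        if len(parts) != 2:
            continue
        try:
            cleaned.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # missing or non-numeric field
    return cleaned


def build_model(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def deploy(model):
    """Wrap the fitted parameters in a function other code can call."""
    slope, intercept = model

    def predict(x):
        return slope * x + intercept

    return predict


rows = clean(get_raw_data())
model = build_model(rows)
predict = deploy(model)
```

Even in this toy version, two of the six raw rows are discarded during cleaning, which hints at why the first two steps can be a big job.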
10. Improve Iteratively
• Start simply
• Add more data and improve accuracy as needed
• Simpler models are easier to understand
• Don’t take on extra complexity for small gains in accuracy
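One way to make the last two bullets concrete is to pick the simplest model whose accuracy is within a chosen tolerance of the best one. The sketch below assumes each candidate is described by a hypothetical (name, complexity score, accuracy) tuple; the scores, accuracies, and tolerance are illustrative values, not measurements.

```python
def choose_model(candidates, tolerance=0.01):
    """Return the least complex candidate within `tolerance` of the best accuracy.

    Each candidate is a (name, complexity, accuracy) tuple; lower complexity
    means a simpler, easier-to-explain model.
    """
    best_accuracy = max(acc for _, _, acc in candidates)
    good_enough = [c for c in candidates if best_accuracy - c[2] <= tolerance]
    return min(good_enough, key=lambda c: c[1])


# Hypothetical candidates (illustrative numbers only).
candidates = [
    ("linear", 1, 0.90),
    ("random forest", 3, 0.905),
    ("deep net", 5, 0.91),
]
chosen = choose_model(candidates, tolerance=0.02)
```

With a tolerance of 0.02 the linear model is chosen even though the deep net scores slightly higher; tightening the tolerance shifts the choice toward the more complex model.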
11. Scaling Up
• Pat yourself on the back! You did it!
• Go back to the business case and find more value
• Horizontally scale your cluster as needed
• Take on more advanced use cases