As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choices. But before you can make them, it’s important to understand what not to do when planning your organization’s big data initiatives.
Michael Stonebraker shares the top 10 big data blunders that he has witnessed in the last decade or so. As a pioneer of database research and technology for more than 40 years, Michael understands the mistakes enterprises often make and knows how to correct and avoid them. By learning about the major blunders, you’ll know how best to future-proof your big data management and digital transformation efforts. Common blunders range from not planning to move everything to the cloud, to believing that a data warehouse will solve all your problems, to succumbing to the “innovator’s dilemma.” To illustrate the blunders, he shares a variety of corrective tips, strategies, and real-world examples.
Slides: How to Avoid the 10 Big Data Analytics Blunders — Best Practices for Success in 2021
1. How to Avoid the 10 Big Data Blunders - Best Practices for Success in 2021 1
Speakers
Dr. Michael Stonebraker, Co-Founder, Tamr
Anthony Deighton, Chief Product Officer, Tamr
Blunder #1
Not Planning to Move Most EVERYTHING to the Cloud
It may take a decade, but it is the right thing to do
● Dewitt vignette
● Hamilton vignette
● Elasticity!!!
● Data will move more easily than applications -- decision support first
YABUT...
Security
● Cloud security is likely better than yours
● Misconfiguration, rogue employees
Cost
● Likely that you are cheating
Geographic Restrictions
● Cloud guys respect this
Legal Restrictions
● Hopefully a short-term problem
Other Restrictions
● Your CEO doesn’t approve (see item 11 to come)
Where does App run?
● Decision support: move the app
● Other stuff:
○ Start with local deployment; move to remote data (SLOWLY!!!)
○ Migrate to cloud-native as you have resources, starting with the most costly ones
○ This may be a lot of work and may take a decade or more
○ Issue is legacy code/hardware
Blunder #2
Not Planning for AI/ML to be Disruptive
ML (whether deep or conventional) is getting much better
● Will displace workers with easy-to-explain jobs
● Think autonomous vehicles, automatic checkout, drone delivery, actuarial calculations
Likely to be disruptive
● You can be a disruptor or get disrupted - Your choice
● Think Uber/Lyft or taxis
So what to do?
Pay up to get some AI/ML experts
● They are in short supply and very expensive
● Don’t contract this out (See Blunder #8)
Get going on the coming arms race
● You will be a winner or a loser in a winner-take-all sweepstakes
Blunder #3
Not Solving your REAL Data Science Problem
Typical data scientist spends 90+% of their time on data discovery, data integration and data cleaning
● iRobot vignette
● Merck vignette
Nobody quotes less than 80%!!!
● Without clean data ML is worthless!!!
○ More accurately, without “clean enough” data, ML is worthless
Obvious directive: Get a strategy in place to do this
● Start by giving Chief Data Officer (CDO) read access to ALL enterprise data!
Blunder #4
Belief that Traditional Data Integration Techniques Will Solve Issue #3
Extract, Transform, and Load (ETL)
(Available from a variety of vendors)
Master Data Management (MDM)
(Also available from the usual suspects)
ETL
What’s attempted:
● Decide what data sources to integrate (top down)
● Build a global data model (up front)
● For each data source
○ Send a programmer to interview the data set owner
○ The programmer then builds an extractor and data cleaning routines (in a proprietary scripting language)
○ And loads data into the global schema
Why it doesn’t work:
● I have never seen this technique work for more than 20 data sources
○ Too human intensive
● Building a global schema upfront is way too difficult at scale
○ Remember enterprise wide data models from 15-20 years ago...
● Most enterprises I know have way more than 20 data sources
○ Merck has 4000+/- Oracle databases
○ A data lake
○ Countless files
○ And data from the web is also important
MDM
● Once you have run ETL, you need “match/merge”
● MDM suggests building “golden records” by
○ Implementing match rules (e.g. two entities are the same if they have the same address)
○ Implementing merge rules (e.g. take the most recent value and ignore older ones)
Doesn’t Scale!
● GE classification problem: 20M spend transactions to be classified into a pre-built hierarchy
● 500 rules classified only 10% of the spend transactions
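As a rough sketch of the rule-based match/merge approach described on this slide, the snippet below applies the two example rules literally: same address means same entity, and the most recent value wins. The record fields, values, and function names are hypothetical and chosen only for illustration; this is not how any particular MDM product is implemented.

```python
from datetime import date

# Hypothetical customer records from different source systems.
# All field names and values are invented for illustration.
records = [
    {"name": "Acme Corp", "address": "1 Main St, Boston",
     "phone": "555-0100", "updated": date(2019, 3, 1)},
    {"name": "ACME Corporation", "address": "1 Main St, Boston",
     "phone": "555-0199", "updated": date(2020, 6, 15)},
    {"name": "Acme Corp", "address": "1 Main Street, Boston MA",
     "phone": "555-0199", "updated": date(2021, 1, 10)},
]


def match_key(record):
    """Match rule: two entities are the same if they have the same address."""
    return record["address"].strip().lower()


def merge(group):
    """Merge rule: take the most recent record and ignore older ones."""
    return max(group, key=lambda r: r["updated"])


def golden_records(rows):
    """Group rows by the match rule, then collapse each group with the merge rule."""
    groups = {}
    for row in rows:
        groups.setdefault(match_key(row), []).append(row)
    return [merge(group) for group in groups.values()]


golden = golden_records(records)
for record in golden:
    print(record["name"], "|", record["address"])
```

In this toy data the third record is written as "1 Main Street, Boston MA", so the exact-address rule treats it as a separate entity. Each such variation needs yet another hand-written rule, which is one way to read the "Doesn’t Scale!" point.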
So what to do?
At scale, you need a solution that leverages ML and statistics
● OK to use rules to generate training data
● That’s what Tamr did on the GE problem
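The idea of using rules to generate training data can be sketched as follows. This is a minimal illustration under stated assumptions, not Tamr's actual method: the transactions, categories, and rules are invented, and a small hand-written naive Bayes classifier stands in for whatever model a real system would use. The rules label the transactions they cover, and the model trained on those labels classifies the ones no rule reaches.

```python
import math
from collections import Counter, defaultdict

# Hypothetical spend-transaction descriptions, invented for illustration.
transactions = [
    "dell laptop 15 inch",
    "hp laptop docking station",
    "dell monitor 27 inch",
    "office chair ergonomic",
    "desk chair oak",
    "ergonomic chair mat office",
    "lenovo laptop docking station",  # no rule fires
    "oak office desk",                # no rule fires
]


def rule_label(text):
    """Two hand-written rules; returns None when no rule fires."""
    tokens = set(text.split())
    if tokens & {"dell", "hp"}:
        return "IT hardware"
    if "chair" in tokens:
        return "Furniture"
    return None


def train(labeled):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Step 1: rules produce training labels for the transactions they cover.
labeled = [(t, rule_label(t)) for t in transactions if rule_label(t) is not None]
unlabeled = [t for t in transactions if rule_label(t) is None]

# Step 2: a model trained on rule-labeled data classifies the remainder.
model = train(labeled)
predictions = {t: predict(model, t) for t in unlabeled}
for text, label in predictions.items():
    print(f"{text!r} -> {label}")
```

The design choice to note is that the rules never need to cover everything. They only need to label enough examples for the model to pick up on words such as "laptop" or "desk" that the rules themselves never mention.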
Blunder #5
Belief that Data Warehouses will Solve all your Problems
Data warehouses are good at customer-facing structured data FROM A FEW DATA SOURCES
● But not text, images, video, …
● Use the technology for what it is good for
○ Do not perform unnatural acts!
○ And get rid of the “high price spread”, if you bought into it
○ And remember that your warehouse will move to the cloud (see Blunder #1)
Blunder #6
Belief that Hadoop/Spark will Solve all your Problems
● Hadoop/Spark is not very good at anything
○ E.g. Spark/SQL is not competitive (but getting better)
○ E.g. Spark/Streaming is not competitive (last time I looked)
● Use “best of breed” not “lowest common denominator” -- at least for your “secret sauce”
○ This is a universal blunder -- desire to use only one vendor
○ Hadoop/Spark is not very good at anything
● And…
○ Spark/Hadoop is useless on Blunders #3 and #4 (i.e. data integration)
So what to do with your Hadoop/Spark cluster?
● Repurpose it for a Data Lake
● Repurpose it for Data Integration
● Throw it Away
○ Hardware lifetime is 3 years (maybe)
○ Remember Blunder #1
Blunder #7
Belief that Data Lakes will Solve all your Problems
Conventional Wisdom
Just load all your data into a “data lake” and you will be able to correlate all data sets
Important Fact (Tattoo this on your Brain):
Independently constructed data sets are never “plug compatible”
Why?
● Schemas don’t match
○ You call it salary; I call it wages
● Units don’t match
○ You use Euros; I use $$$
● Semantics don’t match
○ My salaries are gross before taxes; yours are net after taxes with a lunch allowance
● Time granularity doesn't match
○ You have annual data; I have monthly data
● Data is dirty
○ 99 means null (sometimes)
○ Null means “data missing” or “data not allowed” or...
● Duplicates must be removed
○ And there are no keys
○ I am Mike Stonebraker in one data set; M.R. Stonebreaker in a second one
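To make the mismatches above concrete, here is a small sketch of what it takes to line up just two records. Everything in it is an assumption made for illustration: the field names, the values, the fixed exchange rate, the idea that a monthly figure can simply be multiplied by 12, and the use of `difflib.SequenceMatcher` as a stand-in for real entity matching. The gross-versus-net semantic difference is deliberately left unsolved, because no mechanical conversion recovers it.

```python
from difflib import SequenceMatcher

# Two hypothetical records for the same person from independently built data sets.
source_a = {"name": "Mike Stonebraker", "salary": 120000,
            "currency": "USD", "period": "annual"}
source_b = {"name": "M.R. Stonebreaker", "wages": 8500,
            "currency": "EUR", "period": "monthly"}

COLUMN_ALIASES = {"wages": "salary"}  # schema: "wages" here means "salary"
EUR_TO_USD = 1.2                      # assumed fixed rate; real rates change over time
NULL_SENTINELS = {99}                 # dirty data: 99 means null (sometimes)


def harmonize(record):
    """Map one record onto a common schema: annual salary in USD."""
    out = {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}
    if out["salary"] in NULL_SENTINELS:
        out["salary"] = None
    if out["salary"] is not None:
        if out["period"] == "monthly":
            out["salary"] *= 12           # time granularity (a simplifying assumption)
        if out["currency"] == "EUR":
            out["salary"] *= EUR_TO_USD   # units
    out["period"], out["currency"] = "annual", "USD"
    # Semantics (gross vs. net, lunch allowance) cannot be fixed from the data alone.
    return out


def name_similarity(a, b):
    """Similarity in [0, 1]; with no shared key, names must be compared fuzzily."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


clean_a, clean_b = harmonize(source_a), harmonize(source_b)
score = name_similarity(clean_a["name"], clean_b["name"])
print(clean_a)
print(clean_b)
print(f"name similarity: {score:.2f}")
```

Even in this tiny case, every step encodes a decision somebody had to make (which rate, which threshold, which column means what), and those decisions multiply with each new data source.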
The Net Result
● Your analytics will be garbage
○ “GIGO”
● Your ML models will fail
○ I.e. produce garbage
○ Again “GIGO”
So what to do?
● You don’t have a data lake; you have a data swamp
● Need a data curation system
○ Which will solve the aforementioned problems
○ And this will not be trivial!!
● Traditional technology likely to fail (See Blunder #4)
● This is an 800 pound gorilla
○ Make sure you put your best people on it!!!!
○ Chances are your in-house solution is crap
○ Use modern technology (from startups) not your “home brew”
● If you want the best technology, you have to deal with startups!!!!
Blunder #8
Outsourcing your new stuff to Palantir, IBM, Mu Sigma
● Typical enterprise spends 95% of its IT resources keeping current (legacy) code running
○ i.e. Maintenance
○ Most are dug in pretty deep
○ Often have the best people “keeping the lights on”
● “Shiny new stuff” gets outsourced
○ Often because there is no appropriate talent internally
This is a catch 22
● Your maintenance is boring!
○ So creative people quit
○ So there is no good talent to work on the new stuff
○ And you can’t hire great talent (Takes great people to hire great people)
● Your new stuff is your “secret sauce” over the next decade or so…
○ Please don’t outsource it. This is long-term suicide
○ Instead outsource the diddly-crap (e-mail et al.)
○ Software is your secret sauce -- invest in your own people
So what to do?
1. Start by solving Blunder #2 (Not planning for AI/ML to change most everything)
2. Outsource the boring maintenance
3. Cancel the Palantir contract
Blunder #9
Succumbing to the “Innovator’s Dilemma”
● Must-read book by Clayton Christensen
● Steam shovel example
○ Cable steam shovels - big payload
○ Hydraulics - much safer, but low payload
● Used for “small jobs”
○ Payloads increased and hydraulics won
○ Cable guys went out of business
Net-Net
● Have to be willing to give up your current business model
● And reinvent yourself
● Possibly losing some current customers in the process
○ Otherwise, you go out of business in the long run
○ Taxi licenses in Cambridge have gone from $700k to $10k
Blunder #10
Not Paying Up for a Few “Rocket Scientists”
● They will be your guiding light to avoiding these blunders
● They will be “off scale”
○ Your HR folks won’t like what you have to pay
● Chances are they will be weird
○ E.g. no shoes, no socks, no tie, feet on the table, ...
● Please don’t drive them away!
○ As Citibank did to one of my Berkeley students a while ago
Blunder #11 (Bonus)
Working for a Company That is not Trying to do
Something about the “Sins of the Past”
If you work for a company that is succumbing to (even one of) these blunders, then:
1. You should be fixing it
a. Be part of the solution, not part of the problem
2. Or looking for a new employer
a. Tamr is hiring!
Questions?
Thank You!
To learn more about Tamr visit tamr.com
You’ll receive the 10 Big Data Analytics Blunders Infographic via email.