General presentation listing some of the newest tools for data discovery, anomaly detection, visualization and model building. Traditional audit software is fine for answering questions that you have already formulated but poor for discovery and advanced analytics. The audit community needs to close the gap and adopt more current tools.
2. Traditional tools will continue to evolve
3/16/2020 2
Excel, ACL, IDEA, and others will continue to add features
Their orientation is toward “known unknowns” rather than “unknown unknowns”
3. Data science & Complexity
• Why sample?
• Are visualizations good enough?
• Do you have tools of discovery, such as forecasting, clustering, and anomaly detection?
4. Analytics tools advance daily but that’s not the most important thing …
• Profound changes in the philosophy of data science
• Underlying algorithms are moving from simply following instructions to “thinking” (at least a little).
5. This presentation does not include Machine Learning …
Machine learning routines require considerable expertise and smarts.
It also means the surrender of humans to the machines for a narrow section of thought/prediction.
In theory, from the 1930s (Alan Turing) to the early 2000s, you could look at the program code and follow its progression.
Machine learning code is only a wrapper. Its real behavior comes from millions or billions of data points, so it is not possible to understand it by reading the code.
6. Tools & Algorithms
• Clustering – where’s Waldo on steroids
• Prediction – yea, this is way more than Excel
• Gaps & discontinuities
• Credit card fraud
• Sentiment analysis: Like my stuff? Think it stinks?
• Text analytics: Billy Bob (purchasing) next to Sally Mae (vendor) in 10K emails?
• Anomaly detection
• R has more than 12K packages, many of them vertical apps
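To make the anomaly detection bullet concrete, here is a minimal sketch in Python. It is one common approach, not the method used in the presentation: a "modified z-score" built on the median and the median absolute deviation (MAD), which holds up better than mean and standard deviation when the data has heavy tails. The function name `mad_outliers`, the 3.5 cutoff, and the invoice amounts are all assumptions chosen for illustration.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return (index, value, score) for points whose modified z-score
    exceeds the threshold. The 3.5 cutoff is a common rule of thumb,
    assumed here, not a value taken from the slides."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical, so the score is undefined.
        return []
    flagged = []
    for i, v in enumerate(values):
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * (v - med) / mad
        if abs(score) > threshold:
            flagged.append((i, v, round(score, 2)))
    return flagged

# Made-up invoice amounts, for illustration only.
invoices = [120.0, 135.5, 128.0, 131.2, 9800.0, 126.4, 133.9, 129.7]
outliers = mad_outliers(invoices)
print(outliers)
```

With these made-up numbers only the 9800.0 invoice is flagged. A flag is a prompt to go look at the transaction, not proof of anything.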
7. Many R packages are industry-specific. Some are even understandable. Python has many as well.
8. What does Harvey talk about in his emails?
What is not said is as important as what is said. For example, you do not see the phrase “Bill, you ignorant pig ….”
11. Stuff to think about
• Anomalies
• Bias
• Common sense
• Sampling
• Cherry picking
• Long-Term Capital Management hedge fund. Duh! Dear Herr Doktor Ph.D.s: please be aware that not all populations are “normal.” You got the tails wrong and we (taxpayers) paid for your idiot mistake.
14. Kudos to Microsoft/Excel.
Enter dates and values. Highlight them. Click Data, then Forecast Sheet. You can specify the number of periods ahead you want. This is a time series projection (better than a moving average).
Data = New Zealand government core expenditures since 1972
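For readers who want to see why a time series projection beats a moving average on trending data, here is a minimal Python sketch. As I understand it, Excel's Forecast Sheet is built on exponential triple smoothing (the FORECAST.ETS function); the code below swaps in the simpler Holt's linear (double) exponential smoothing, with level and trend but no seasonality, so it is not a reproduction of Excel's result. The function names, the `alpha` and `beta` settings, and the yearly values are assumptions for illustration, not the New Zealand figures.

```python
def holt_forecast(series, periods_ahead, alpha=0.5, beta=0.3):
    """Holt's linear exponential smoothing: track a level and a trend,
    then project the trend forward. alpha and beta are assumed values."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step projection.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, periods_ahead + 1)]

def moving_average_forecast(series, window=3):
    """Naive comparison: the mean of the last `window` observations."""
    return sum(series[-window:]) / window

# Made-up yearly values with an upward trend, for illustration only.
spend = [100.0, 104.0, 109.0, 113.0, 118.0, 124.0]
print(holt_forecast(spend, 3))          # projects above the last value
print(moving_average_forecast(spend))   # lags below the last value
```

On a rising series the moving average lands below the most recent value, while the smoothed projection keeps following the trend. That lag is the practical reason to prefer the time series approach.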
15. Data science thinking
How do you look at a large log of, for example, 20,000 time stamps to determine whether someone turned off recording for a few minutes or hours?
One approach: create an artificial file (via a program or even Excel) with all dates/times for the period. For example, if the log collects data every 5 minutes, show 1/31/20 8:00am, 1/31/20 8:05am, and so on.
Perform an “anti-join” to show records in your artificial file with no match in your log file. This could be done in R with an actual anti-join (for example, dplyr's anti_join) or in Excel using VLOOKUP.
Any other ways?
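Here is a minimal Python sketch of the anti-join idea, plus one possible answer to "any other ways": sort the log and look for consecutive time stamps that are further apart than the collection interval. It uses only the standard library. The function names and the sample log are made up for illustration, and it assumes the log's time stamps fall exactly on the 5-minute grid; a real log would probably need rounding first.

```python
from datetime import datetime, timedelta

def expected_grid(start, end, step):
    """Build the artificial file: every time stamp from start to end, inclusive."""
    stamps = []
    current = start
    while current <= end:
        stamps.append(current)
        current += step
    return stamps

def anti_join(expected, observed):
    """Return expected time stamps that have no match in the observed log."""
    seen = set(observed)
    return [ts for ts in expected if ts not in seen]

def gaps_by_diff(observed, step):
    """Alternative: pairs of consecutive log entries more than one step apart."""
    ordered = sorted(observed)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > step]

step = timedelta(minutes=5)
start = datetime(2020, 1, 31, 8, 0)
end = datetime(2020, 1, 31, 9, 0)

# Made-up log: recording stops after 8:15 and resumes at 8:35.
gap_start = datetime(2020, 1, 31, 8, 20)
gap_end = datetime(2020, 1, 31, 8, 30)
log = [ts for ts in expected_grid(start, end, step)
       if not (gap_start <= ts <= gap_end)]

missing = anti_join(expected_grid(start, end, step), log)
gaps = gaps_by_diff(log, step)
for ts in missing:
    print("missing:", ts.strftime("%m/%d/%y %I:%M%p"))
for a, b in gaps:
    print("gap between", a.time(), "and", b.time())
```

The anti-join lists every missing reading; the difference approach needs no artificial file and reports each outage once, with its start and end.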
21. This stuff is fun … call or email me with questions, ideas to solve audit problems or suggestions for research. Thanks for your time today.
• Bill Yarberry, CPA
• ICCM Consulting
• 713.582.6275
• byarberry@iccmconsulting.net
Editor's Notes
Discuss how audit has fallen behind simply because data science is evolving so quickly
I’m not suggesting in any way that audit throw away the traditional tools, but they are not oriented to discovery.
Early days of statistics: get the most information out of the least possible data. Now, what are the possibilities if I have millions, billions or even trillions of data points?
There is not enough time to cover machine learning, but it is an extremely important topic. Robots in factories, identification of rogue behavior: just about any narrow expertise can be gained. I ran an ML routine to determine breast cell malignancy: 30 characteristics, 98% accuracy.
How would you do it? One variable's distance from another, easy. Millions of transactions per second: is this one fraudulent? Age, income, $ history, sex, location, abrupt shift. Sentiment analysis: a lexicon of good and bad words (various ones exist).
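The lexicon idea for sentiment analysis can be sketched in a few lines of Python. This is a bare-bones illustration under stated assumptions: the two word lists are tiny and made up (published lexicons hold thousands of scored words), and the function name `sentiment_score` is hypothetical.

```python
import re

# Tiny made-up lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "stinks", "hate", "terrible", "angry"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos - neg

print(sentiment_score("I love it, great work"))    # positive score
print(sentiment_score("This stinks"))              # negative score
print(sentiment_score("The invoice is attached"))  # neutral, zero
```

A plain word count like this misses negation ("not good") and sarcasm, so treat the score as a rough screen for which emails to read first.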
Some tools are really easy to use.
Diederik Stapel, a Dutch psychologist, was found to have falsified data in 30 papers. Extreme values are more probable than in normal distributions. Russia defaulted in the late 1990s.
Mathematical skepticism. Think rational, dispassionate, not ape-to-ape social dominance. SSC is a legitimate trading algorithm.
Rattle, plot.ly
Don’t make viewer work too hard.
What is the sweet spot between too much time devoted to analytics and missing valuable information? What are some ways to improve data awareness & opportunities? Where to get objective advice? Nobody pushes tools they can’t make money on.