What Data Science, Artificial Intelligence, and Machine Learning are, how we came about them, and what they are used for in our day-to-day lives... not a technical presentation, but for anyone looking to understand what "AI," "Machine Learning," and "Data Science" are all about.
Data Science for Business
1. DATA SCIENCE
IIT AGNE MARCH 2019
ANURAG WAKHLU, CFA, MBA
IIT BOMBAY ’93
@ANURAGWAKHLU @ANURAG3DS
2. WHAT IS THIS?
Bob: i can i i everything else . . . . . . . . . . . . . .
Alice: balls have zero to me to me to me to me to me to me to me to me to
Bob: you i everything else . . . . . . . . . . . . . .
Alice: balls have a ball to me to me to me to me to me to me to me
Bob: i i can i i i everything else . . . . . . . . . . . . . .
Alice: balls have a ball to me to me to me to me to me to me to me
Bob: i . . . . . . . . . . . . . . . . . . .
Alice: balls have zero to me to me to me to me to me to me to me to me to
Bob: you i i i i i everything else . . . . . . . . . . . . . .
Alice: balls have 0 to me to me to me to me to me to me to me to me to
5. WHAT DO YOU DO IN DATA SCIENCE?
• Classification (e.g., spam or not spam)
• Pattern detection and grouping (clustering, i.e., classification without known classes)
• Anomaly detection (e.g., fraud detection)
• Recognition (image, text, audio, video, facial, …)
• Actionable insights (via dashboards, reports, visualizations, …)
• Automated processes and decision-making (e.g., credit card approval)
• Scoring and ranking (e.g., FICO score)
• Segmentation (e.g., demographic-based marketing)
• Optimization (e.g., risk management)
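The first task in the list above, classification, can be sketched with a toy Naive Bayes spam filter. Everything here (the training phrases, the word choices) is invented for illustration, not taken from the presentation:

```python
from collections import Counter
import math

# Toy training set; every phrase here is invented for illustration.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch with team at noon", "ham"),
]

# Per-class word counts: the "learning" step of a minimal Naive Bayes.
counts = {"spam": Counter(), "ham": Counter()}
labels = Counter()
for text, label in train:
    labels[label] += 1
    counts[label].update(text.split())

vocab = len({w for c in counts.values() for w in c})

def classify(text):
    """Return the class with the highest smoothed log-probability."""
    def score(label):
        total = sum(counts[label].values())
        s = math.log(labels[label] / sum(labels.values()))
        for w in text.split():
            # Add-one smoothing so unseen words don't zero out the score.
            s += math.log((counts[label][w] + 1) / (total + vocab))
        return s
    return max(counts, key=score)

print(classify("free money"))     # spam-like words dominate
print(classify("lunch at noon"))  # ham-like words dominate
```

Real spam filters use far larger vocabularies and training sets, but the mechanism is the same: count, smooth, and pick the most probable class.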
6. WHAT CAN YOU DO WITH DATA SCIENCE
• Recommendations
• Fraud detection
• Customer sentiment analysis
• Churn, Next Best Action, Propensity
• Predictive shipping
• Supply chain optimization
• Price optimization
• Clickstream analytics
12. A BRIEF HISTORY OF … DATA
2.5 quintillion bytes (2.5 exabytes) of data / day (2018)
10 TB = printed Library of Congress
13. 90% of the world's data was created in the last 2 years.
1.7 megabytes of new information will be created / second / person, by 2020.
17. A Boeing 787 aircraft can generate 40 TB per hour of flight.
18. An autonomous vehicle will churn out 4,000 GB of data per hour of driving.
19. In just 10 minutes, 16 players with 6 balls can produce almost 13 million data points! (soccer)
"You're capturing real-time data at every point, on every single food product."
- Walmart, Food Trust blockchain
21. CERN LHC ~ 40 TB/S DURING PARTICLE SMASHING
NASA ~ CREATES 15 TB/DAY
32. AlphaGo first learned from studying 30 million moves of expert human play. AlphaGo Zero just learned the rules and played.
33. DATA SCIENCE IN FINANCIAL SERVICES
• Dataminr analyzes billions of tweets to monitor the entire world – predicting stock movements
• 56% of hedge funds said they used AI/ML for investing
• BlackRock (largest asset manager, $6.5 trillion AUM) is using AI for investing
• JPMorgan Chase (largest bank, $2.6 trillion in assets) is using AI to "deepen customer engagements"
• Risk management
• Algorithmic trading
36. MARKET VOLATILITY PREDICTION - DATA
Financial Conditions: Policy Liquidity, Quantity Liquidity, Domestic Liquidity, Equity Exposures, Bond Exposures, Money Flows, Monetized Savings, Momentum, Ted Spread, OIS spread, 10yr & 2yr CMT, Convexity at 5yr, 5yr inflation + 5 years, Banks' swap spreads, CB Credit Risk Index, Sentiment Index, Dollar Sentiment, Trade-weighted $, 2-10yr Yield Curve, BAA-AAA credit spread, Mkt PE & EPS, VIX, S&P 500, FTSE, EuroStoxx, MSCI EM, USD/GBP, EUR
37. MODEL BUILDING PROCESS
• Input enhancements
• Rule modeling
• Classify output
• Lag/lead the factors
• Remove correlations
• Induction for optimization
• Transform inputs, outputs
• Analyze explanatory power
• Operational Monitoring
• Compute risk & probability
• What-if scenario modeling
• Monitor incoming data
• Expert Analysis
• Analyze rules for purity
• Analyze rules for causality
• Analyze new requirements
• Feedback
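One step on this slide, "Remove correlations", can be illustrated with a minimal greedy filter that drops any factor too correlated with one already kept. The factor names echo slide 36, but the series values and the 0.95 threshold are made up for illustration:

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient, no libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical factor histories (all values invented for illustration).
features = {
    "ted_spread": [1.0, 1.2, 0.9, 1.5, 1.1],
    "ois_spread": [1.1, 1.3, 1.0, 1.6, 1.2],  # near-duplicate of ted_spread
    "vix":        [15, 13, 22, 18, 30],
}

# Greedy filter: keep a factor only if it is not highly
# correlated with anything already kept.
kept = []
for name, series in features.items():
    if all(abs(pearson(series, features[k])) < 0.95 for k in kept):
        kept.append(name)

print(kept)  # ois_spread is dropped as redundant
```

Dropping near-duplicate inputs like this reduces noise and makes the resulting model easier to interpret.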
38. PREDICTIVE RISK MAP VS. MARKETS…
What is happening in global markets?
The RBA surprised … cutting its … rate by 0.25 … a level last seen in late 2009
Heightened global risk in May – Jul 2013
Japan starts QE
The US Fed "Taper" talk starts
40. FINANCIAL PORTFOLIO CONSTRUCTION
Factors for a high outperformance… Insight: Brazilian Real… for real?
41. PERSONA OF A DATA SCIENTIST
Business Domain Knowledge & soft skills
Math, Stats, Data Engineering, Programming
Credit: Stephan Kolassa – Data Science Expert – SAP Switzerland AG
42. WHY BE A DATA SCIENTIST
• If you like data
• Data scientists today are akin to the Wall Street "quants" of the 1980s, 1990s, and 2000s
• Salary: $120–160K+
"Sexiest Job of the 21st Century" – Harvard Business Review
Editor's Notes
Summer 2017 - Facebook's AI research lab. Researchers set out to make chatbots that could negotiate with people. Their thinking: negotiation and cooperation will be necessary for bots to work more closely with humans. First, they fed the computers dialog from thousands of games between humans to give the system a sense of the language of negotiation. Then they allowed bots to use trial and error, in the form of a technique called reinforcement learning, which helped Google's Go bot AlphaGo defeat champion players. When two bots using reinforcement learning played each other, they stopped using recognizable sentences.
Data Science is an interdisciplinary field to extract insights from data.
AI is the science of making machines do intelligent tasks like humans.
To do this, machines have to learn from data – and that process is called machine learning.
Deep learning is a type of ML generally modeled after the human brain – neural networks. DL is more scalable than other ML, enabling improved learning and larger data.
Data science allows AIs to find appropriate and meaningful information from those huge pools faster and more efficiently. Machine learning is the process of learning from data over time.
Artificial intelligence refers to the simulation of a human brain function by machines. This is achieved by creating an artificial neural network that can mimic human intelligence. The primary human functions that an AI machine performs include logical reasoning, learning and self-correction. Machines inherently are not smart, and to make them so, we need a lot of computing power and data to empower them to simulate human thinking.
Artificial intelligence is classified into two parts, general AI and narrow AI. General AI refers to making machines intelligent in a wide array of activities that involve thinking and reasoning. Narrow AI, on the other hand, involves the use of artificial intelligence for a very specific task. For instance, general AI would mean an algorithm that is capable of playing all kinds of board games, while narrow AI will limit the range of machine capabilities to a specific game like chess or Scrabble.
Machine learning is the ability of a computer system to learn from the environment and improve itself from experience without the need for any explicit programming. Machine learning focuses on enabling algorithms to learn from the data provided, gather insights and make predictions on previously unanalyzed data using the information gathered. Machine learning can be performed using multiple approaches. The three basic models of machine learning are supervised, unsupervised and reinforcement learning.
In the case of supervised learning, labeled data is used to help machines recognize characteristics and use them for future data. For instance, if you want to classify pictures of cats and dogs, then you can feed in a few labeled pictures and the machine will classify all the remaining pictures for you. On the other hand, in unsupervised learning, we simply provide unlabeled data and let the machine understand the characteristics and classify it. Reinforcement learning algorithms interact with the environment by producing actions and then analyzing errors or rewards. For example, to understand a game of chess, an RL algorithm will not analyze individual moves but will study the game as a whole.
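The unsupervised case above (no labels, structure discovered from the data alone) can be sketched with a minimal k-means on synthetic 1-D data. The two groups and all the numbers here are invented for illustration:

```python
import random

random.seed(0)

# Two unlabeled groups of 1-D measurements (synthetic, for illustration).
data = ([random.gauss(2.0, 0.3) for _ in range(20)] +
        [random.gauss(8.0, 0.3) for _ in range(20)])

def kmeans(points, k=2, iters=25):
    """Minimal k-means: no labels are given; the grouping
    emerges purely from the structure of the data."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans(data))  # two centers, one near each group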
Solve real world problems or improve things, using data & AI/ML
The Manhattan Population Explorer provides a visual representation of the dynamic population shifts within the borough. In this example it synthesizes a heartbeat of New York
24 BN connected devices in 2018
Atoms in universe 10^80
2.5 quintillion bytes of data created each day at our current pace, but that pace is only accelerating with the growth of the Internet of Things (IoT). Over the last two years alone, 90 percent of the data in the world was generated.
Data is growing at a rapid pace. By 2020, the new information generated per second for every human being will amount to approximately 1.7 megabytes.
By 2020, the accumulated volume of big data will increase from 4.4 zettabytes to roughly 44 zettabytes or 44 trillion GB.
Originally, data scientists maintained that the volume of data would double every two years, thus reaching the 44 ZB point by 2020 with IoT.
The rate at which data is created is increasing exponentially. For instance, 40,000 search queries are performed per second (on Google alone), which makes it 3.46 billion searches per day and 1.2 trillion every year.
Every minute Facebook users send roughly 31.25 million messages and watch 2.77 million videos.
The data gathered is no more text-only. An exponential growth in videos and photos is equally prominent. On YouTube alone, 300 hours of video are uploaded every minute.
IDC estimates that by 2020, business transactions (including both B2B and B2C) via the internet will reach up to 450 billion per day.
Globally, the number of smartphone users will grow to 6.1 billion by 2020.
In just 5 years the number of smart connected devices in the world will be more than 50 billion – all of which will create data that can be shared, collected and analyzed.
A typical human genome contains more than 20,000 genes, with each made up of millions of base pairs. Simply mapping a genome requires a hundred gigabytes of data, and sequencing multiple genomes and tracking gene interactions multiplies that number many times — hundreds of petabytes in some cases.
Physicists use the 17-mile (27 km) LHC tunnel to accelerate particles almost to light speed and smash them together,
at about 30 million collisions per second for 120 billion protons.
One billion collisions per second generates one petabyte per second.
To keep all 30 million events per second we would need about 2,000 petabytes to store a typical 12-hour run.
For a typical running year of 150 days uptime, this would mean almost 400 exabytes per year.
CERN throws away 99.99% of that 400 EB.
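The figures in these notes are rough, but a back-of-envelope check shows they hang together at the order-of-magnitude level, starting from the ~40 TB/s rate quoted on slide 21:

```python
# Back-of-envelope check of the LHC data rates quoted in these notes.
rate_tb_s = 40                       # ~40 TB/s during collisions (slide 21)
run_s = 12 * 3600                    # a typical 12-hour run, in seconds
year_s = 150 * 24 * 3600             # ~150 days of uptime per year

per_run_pb = rate_tb_s * run_s / 1_000        # TB -> PB
per_year_eb = rate_tb_s * year_s / 1_000_000  # TB -> EB

print(f"{per_run_pb:,.0f} PB per 12-hour run")  # ~1,700 PB ("about 2,000 PB")
print(f"{per_year_eb:,.0f} EB per year")        # same order as "almost 400 EB"
```

The per-run figure lands near the quoted 2,000 PB; the yearly figure comes out somewhat above the quoted 400 EB, which is consistent with the notes being rounded estimates.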
The Large Hadron Collider is the world's largest and most powerful particle collider and the largest machine in the world.
CERN has dumped about 300 TB of Large Hadron Collider (LHC) data online. It's completely free.
DATA IS THE NEW OIL
The world's first electronic stock market, NASDAQ OMX owns and operates three clearing houses, five central securities depositories, and 26 markets (including the NASDAQ Stock Market) with a combined value that exceeds US$8 trillion. Its trading engine is used by 80 global marketplaces.
When markets open, the company processes more than 1 million messages per second.
Director of Database Structures at NASDAQ OMX, says, “Just our US Options and Equity data archive handles billions of transactions per day, stores multiple petabytes of online data, and has tables that contain quintillions of records about business transactions.”
the Options and Equity archive measures 2 petabytes (PB)
US Department of Energy’s Oak Ridge National Laboratory announced the top speeds of its Summit supercomputing machine, which nearly laps the previous record-holder, China’s Sunway TaihuLight. The Summit’s theoretical peak speed is 200 petaflops, or 200,000 teraflops. To put that in human terms, approximately 6.3 billion people would all have to make a calculation at the same time, every second, for an entire year, to match what Summit can do in just one second.
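The "6.3 billion people for a year" comparison checks out arithmetically: that many one-per-second calculations over a year is roughly what a 200-petaflop machine does in one second.

```python
# Sanity check on the "6.3 billion people for a year" comparison.
people = 6.3e9
seconds_per_year = 365 * 24 * 3600      # ≈ 3.15e7 seconds
calcs = people * seconds_per_year       # one calculation per person per second
petaflops = calcs / 1e15                # Summit's quoted peak is 200 petaflops

print(round(petaflops))  # ≈ 199, matching the ~200-petaflop claim
```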
In 2015, Google and NASA reported that their new 1097-qubit D-Wave quantum computer had solved an optimization problem in a few seconds. That’s 100 million times faster than a regular computer chip. They claimed that a problem their D-Wave 2X machine processed inside one second would take a classical computer 10,000 years to solve.
Your brain is 10 million times slower than a computer.
Brain ~ 1000 operations /s
Google offers an option to download all of the data it stores about you. I’ve requested to download it and the file is 5.5GB big,
Facebook offers a similar option to download all your information. Mine was roughly 600MB
8x8 pixel photos were fed into a deep learning network, which tried to guess what the original face looked like. As you can see, it was fairly close (the correct answer is under "ground truth", which was the real face originally in the photos).
https://youtu.be/aKed5FHzDTw?t=43
Natural language processing (NLP) deals with building computational algorithms to automatically analyze and represent human language. NLP-based systems have enabled a wide range of applications such as Google’s powerful search engine, and more recently, Amazon’s voice assistant named Alexa. NLP is also useful to teach machines the ability to perform complex natural language related tasks such as machine translation and dialogue generation.
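A first step in how NLP systems "represent human language" is turning text into numbers. A minimal illustration, not any production search or assistant pipeline, is bag-of-words vectors compared with cosine similarity (all sentences below are invented):

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words: a Counter of lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

query = bow("machine translation of language")
doc1 = bow("neural machine translation translates language")
doc2 = bow("stock prices moved higher today")

# The topically related sentence scores higher than the unrelated one.
print(cosine(query, doc1) > cosine(query, doc2))  # True
```

Modern systems replace raw counts with learned embeddings, but the idea of scoring documents by vector similarity carries over directly.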
Gebru et al took 50 million Google Street View images and explored what a Deep Learning network can do - "if the number of sedans encountered during a 15-minute drive through a city is higher than the number of pickup trucks, the city is likely to vote for a Democrat during the next Presidential election (88% chance); otherwise, it is likely to vote Republican (82%)."
Harvard scientists used Deep Learning to teach a computer to perform viscoelastic computations; these are the computations used in predictions of earthquakes.
Deep Learning improved calculation time by 50,000%
The total number of possible games of Go has been estimated at 10^761, compared to 10^120 for chess. Both are very large numbers: the entire universe is estimated to contain "only" about 10^80 atoms.
2017
The original AlphaGo first learned from studying 30 million moves of expert human play.
https://deepmind.com/blog/alphago-zero-learning-scratch/#gif-120
Fifty-six percent of the survey’s respondents said they used AI or machine learning in their investment processes. Just 20 percent had said the same in a BarclayHedge poll last August.
Among current users, slightly more than two-thirds said they relied on these quantitative techniques for idea generation, while 58 percent said they used them for portfolio construction. Other applications of AI and machine learning included risk management
Why liquidity?
Trivia question – what factor was very highly correlated to the S&P in the late 80s and 90s? Bangladesh butter production.
Don't pick on one country – enhance the model by adding another factor to the mix – US cheese production.
Bangladesh sheep population.
We do some intelligent things to eliminate correlations and reduce noise… but let's not dwell on this too much. Offline.
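The butter-production trivia is the classic data-snooping trap: sift through enough unrelated series and one will correlate with the market purely by chance. A toy demonstration with synthetic random walks (nothing here is real market data):

```python
import random, math

random.seed(42)

def walk(n):
    """A random walk: cumulative sum of Gaussian steps."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(x[-1] + random.gauss(0, 1))
    return x

def corr(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

target = walk(60)                            # stand-in for an S&P-like index
candidates = [walk(60) for _ in range(500)]  # 500 series unrelated to it
best = max(abs(corr(target, c)) for c in candidates)

# The best of 500 unrelated series looks strongly "predictive".
print(round(best, 2))
```

This is why correlation screening needs out-of-sample testing and a causal story, not just a high in-sample number.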
Use for macro risk, asset allocation, and portfolio construction and management – whether you are a CRO, CIO, CXO, strategist, asset allocator, etc.