Scaling Financial Time Series Analysis Beyond PCs and Pandas

Junta Nakai, Industry Leader FS at Databricks
Ricardo Portilla, Sr. Solutions Architect at Databricks

Housekeeping
• Your connection will be muted
• Submit questions via the Q&A panel
• Questions will be answered at the end of the webinar
• Any outstanding questions will be answered in the Q&A Blog after the webinar
• Webinar will be recorded and attachments will be made available

Introducing Our Speakers
Junta Nakai
● Industry Leader - Financial Services at Databricks
● 14+ Years at Goldman Sachs
● Northwestern University and London School of Economics
Ricardo Portilla
● Senior Solutions Architect - Financial Services at Databricks
● 6+ years at FINRA - Market Regulation
● Mathematics Ph.D. from University of Michigan

VISION
Accelerate innovation by unifying data science, engineering and business
SOLUTION
Unified Analytics Platform
WHO WE ARE
• Original creators of [project logos not captured]
• 5000+ global companies use our platform across the big data & machine learning lifecycle

Databricks Customers Across Industries

Agenda
• Data in Financial Services
• Time Series Data in Financial Institutions
• Challenges
  • Analysis for Historical Data at Scale
  • Lack of Data Science Tools for Big Data
• Solutions
  • Spark Built-in Windowing for Time Series
  • Koalas for Data Science
• Demo- Action

Segments and Use Cases
Banking/Payments: FASTER INNOVATION & DIGITAL TRANSFORMATION
• Customer 360
• Digital Transformation
• Cost Savings
• Regulatory Compliance
• Predictive Analytics
• Cross-selling
• Chatbots
• Regulatory Calculations
• Unified Banking
Insurance: SUPERIOR UNDERWRITING & INCREASED PROFITS
• Superior Underwriting
• Claim Process Automation
• Actuarial Science
• Risk Monitoring
• Predictive Risk Pricing
• Optimize Broker/Agent Performance
• IoT
• Customer 360
• Reduce CAC
Capital Markets: GENERATE ALPHA & AUTOMATE WORKFLOWS
• Alternative Data
• Portfolio Optimization
• Model Back-testing
• Quantamental Investing
• Trading Cost Analysis
• VaR Calculations
• Automation
• Regulatory Reporting
• Factor Analysis
Public Sector: CUT COSTS & KEEP UP WITH PRIVATE SECTOR INNOVATIONS
• AML
• Sanctions List
• Anomaly Detection
• Economic Forecasting
• Policy-making
• Research
• TCR
• Identify Misconduct
• Regulatory Reporting
Data Vendor

Financial Data is Growing Rapidly
● Financial Services is #2 in volume of data generated globally.
● Expected CAGR of 26%, reaching 10+ zettabytes in 2025 (IDC)

Real-Time Data is Growing Fastest
By 2025, 30% of Data will be Real-Time

Investment is Not Keeping Up
• Financials have the largest share of worldwide market cap (17.2%)
• Zero financial services companies are in the top 20 global R&D spenders

What do FSIs want to be?
Faster, Real-Time, Predictive

Huge Potential Value With Big Data
Banking - $1T Annually
Insurance - $1.1T Annually

A Look at Time Series Challenges

Time Series across the FS Ecosystem
Entities
• Insurance
• Capital Markets
• Public Sector
• Banking / Payments
• Data Vendors
Data Sources
• Liability Claims Data
• Satellite Images, IoT
• Transactional Data, Geospatial Data
• Order Data, Quote Data
• Macroeconomic Data
Use Cases
• Forecast Paid Claim Costs
• Next Best Offer
• Transaction Cost Analysis
• Forecast GDP Growth Rates

Challenges with Time Series Data
Data types
• Structured Data
• Semi-structured Data
• Unstructured Data
Data sources
• Trades, Order Data
• Reference Data (market makers, dividends)
• News and sentiment
• Trades, Orders for Reporting
• Geospatial data
Blockers
• Historical Data locked in databases
• No tools for end-to-end data science
BI / Machine Learning
• Next Best Offer
• Transaction Cost Analysis

Challenge: Analysis for Historical Data at Scale
• Historical market data is too large for traditional data warehouses
• Datasets need to be merged across hundreds of thousands of tickers to generate alpha
• SQL and Python tools make analysis simple and universal but are not designed for petabyte scale
Figure: Kodak crypto crash

Challenge: Lack of Bridge from Big Data to ML
• Pandas is the standard Python framework for time series manipulation but is typically limited to a single node
• Apache Spark™ may have a steep learning curve for teams looking to explore and fit models to time series
• Limited code re-use or reproducibility without cloud workflows

Apache Spark: De-Facto Unified Analytics Engine
Big Data Processing
ETL + SQL + Streaming
Machine Learning
MLlib + SparkR
Uniquely combines Data & AI technologies
R Python Scala Java

Solution: Scale Time Series Analysis with Apache Spark
Portfolio Optimization
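from pyspark.sql import Window
from pyspark.sql.functions import last, lag

# time_window is not defined on the slide. The definition below is an
# assumption; 'symbol' and 'event_ts' are hypothetical column names.
time_window = Window.partitionBy('symbol').orderBy('event_ts')
# Last bid price in the window frame
df.select(last('bid_pr').over(time_window))
# Bid size from the previous row in the window
df.select(lag('bid_sz').over(time_window))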
Anomaly Detection
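To make the window semantics on this slide concrete without a cluster, here is a minimal plain-Python sketch of what `lag` and a null-skipping `last` compute per ticker over a time-ordered window. It is an illustration only, not Spark code: the helper name `lag_and_last`, the `symbol` and `ts` keys, and the sample quotes are all hypothetical. The null-skipping behaviour shown corresponds to `last(col, ignorenulls=True)` in PySpark; the slide's `last('bid_pr')` does not skip nulls by default.

```python
from collections import defaultdict


def lag_and_last(rows, value_key, partition_key="symbol", order_key="ts"):
    """Emulate lag(value) and last(value, ignorenulls=True) over a window
    partitioned by partition_key and ordered by order_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    out = []
    for key in sorted(groups):
        prev_value = None  # value on the previous row (lag by one)
        last_seen = None   # most recent non-null value so far
        for row in sorted(groups[key], key=lambda r: r[order_key]):
            value = row[value_key]
            if value is not None:
                last_seen = value
            out.append({**row, "lag": prev_value, "last_non_null": last_seen})
            prev_value = value
    return out


# Hypothetical quote data: one missing bid price for ticker AAA.
quotes = [
    {"symbol": "AAA", "ts": 1, "bid_pr": 10.0},
    {"symbol": "AAA", "ts": 2, "bid_pr": None},
    {"symbol": "AAA", "ts": 3, "bid_pr": 10.5},
    {"symbol": "BBB", "ts": 1, "bid_pr": 20.0},
]
result = lag_and_last(quotes, "bid_pr")
for row in result:
    print(row)
```

The partition key keeps tickers from leaking into each other, and the ordering key is what makes "previous" and "most recent" well defined. Spark applies the same two ideas, distributed across a cluster.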

Solution: Connect Big Data to ML with Koalas
• Announced April 24, 2019
• Easy way to transition from Pandas to Apache Spark
• Aims at providing the pandas API on top of Spark
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data

Get Started with Koalas using Pandas APIs
Instead of pandas...
import pandas as pd
...
data = pd.read_csv("...", header=0)
… simply say koalas
import databricks.koalas as ks
...
data = ks.read_csv("...", header=0)
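As a rough illustration of why a shared API matters, the sketch below computes a volume-weighted average price per ticker using only ordinary DataFrame operations (column arithmetic, `assign`, `groupby`, `sum`). It runs on pandas as written. The assumption, which should be checked against your Koalas version, is that the same function would accept a Koalas DataFrame unchanged. The function name `vwap_by_symbol`, the column names, and the sample trades are hypothetical.

```python
import pandas as pd


def vwap_by_symbol(trades):
    """Volume-weighted average price per symbol, written against the
    pandas DataFrame API only."""
    with_notional = trades.assign(notional=trades["price"] * trades["size"])
    totals = with_notional.groupby("symbol")[["notional", "size"]].sum()
    return totals["notional"] / totals["size"]


# Hypothetical trade data for two tickers.
trades = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB"],
    "price": [10.0, 12.0, 20.0],
    "size": [100, 300, 50],
})
vwap = vwap_by_symbol(trades)
print(vwap)
```

Keeping the logic in a function that only touches the DataFrame it is given means the choice between pandas and Koalas stays at the point where the data is loaded.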

Databricks Unified Analytics Platform
End-to-end Data and ML platform that unifies people, processes and technologies
• Collaborative workspace brings data scientists and data engineers together
• Runtime for ML provides ready-to-use clusters with built-in ML frameworks
• MLflow standardizes the ML lifecycle with experimentation, reproducibility & deployment
• Databricks Delta makes data lakes ready for unified analytics

End State With Databricks
Data Sources
• Trades, NBBO, Order Data
• Reference Data
• News & Sentiment
• Geospatial Data
Data Processing
• Transformations
• Enrichments
• Feature Eng
• Data engineering
• Data science and Machine Learning
Outputs
• BI Tools
• AI Applications and Reporting

Key Takeaways
• Time series data challenges include both scalability of data and connecting data with ML
• Scale Time Series Analysis with Apache Spark
• Connect Big Data to ML with Koalas
• Use Databricks

Thank you! Questions?
Sign up for a Free Trial: http://databricks.com/trial
Blog post here
Spark + AI Summit Europe almost sold out! Register here.
