Fundamental economic data, financial stock tick data and alternative data sets such as geospatial or transactional data are all indexed by time, often at irregular intervals. Solving business problems in finance such as investment risk, fraud, transaction cost analysis and compliance ultimately rests on being able to analyze millions of time series in parallel. Older RDBMS-based technologies do not scale easily when analyzing trading strategies or conducting regulatory analyses over years of historical data.
In this webinar we:
• Reviewed how to build time series functions on hundreds of thousands of tickers in parallel using Apache Spark™.
• Showed how to modularize functions in a local IDE and create rich time-series feature sets with Databricks Connect.
• Used a market manipulation example to show pandas users how Koalas makes scaling transparent to the typical data science workflow, for example when scaling data preparation that feeds into financial anomaly detection or other statistical analyses.
Scaling Financial Time Series Analysis Beyond PCs and Pandas
1. Scaling Financial Time Series Analysis Beyond PCs and Pandas
Junta Nakai, Industry Leader FS at Databricks
Ricardo Portilla, Sr. Solutions Architect at Databricks
2. Housekeeping
• Your connection will be muted
• Submit questions via the Q&A panel
• Questions will be answered at the end of the webinar
• Any outstanding questions will be answered in the Q&A Blog after the webinar
• Webinar will be recorded and attachments will be made available
3. Introducing Our Speakers
Junta Nakai
● Industry Leader - Financial Services at Databricks
● 14+ Years at Goldman Sachs
● Northwestern University and London School of Economics
Ricardo Portilla
● Senior Solutions Architect - Financial Services at Databricks
● 6+ years at FINRA - Market Regulation
● Mathematics Ph.D from University of Michigan
4. VISION
Accelerate innovation by unifying data science, engineering and business
WHO WE ARE
• Original creators of [logos]
• 5000+ global companies use our platform across big data & machine learning lifecycle
SOLUTION
Unified Analytics Platform
6. Agenda
• Data in Financial Services
• Time Series Data in Financial Institutions
• Challenges
• Analysis for Historical Data at Scale
• Lack of Data Science Tools for Big Data
• Solutions
• Spark Built-in Windowing for Time Series
• Koalas for Data Science
• Demo- Action
8. Segments and Use Cases
Banking/Payments: FASTER INNOVATION & DIGITAL TRANSFORMATION
•Customer 360
•Digital Transformation
•Cost Savings
•Regulatory Compliance
•Predictive Analytics
•Cross-selling
•Chatbots
•Regulatory Calculations
•Unified Banking
Insurance: SUPERIOR UNDERWRITING & INCREASED PROFITS
•Superior Underwriting
•Claim Process Automation
•Actuarial Science
•Risk Monitoring
•Predictive Risk Pricing
•Optimize Broker/Agent Performance
•IoT
•Customer 360
•Reduce CAC
Capital Markets: GENERATE ALPHA & AUTOMATE WORKFLOWS
•Alternative Data
•Portfolio Optimization
•Model Back-testing
•Quantamental Investing
•Trading Cost Analysis
•VaR Calculations
•Automation
•Regulatory Reporting
•Factor Analysis
Public Sector: CUT COSTS & KEEP UP WITH PRIVATE SECTOR INNOVATIONS
•AML
•Sanctions List
•Anomaly Detection
•Economic Forecasting
•Policy-making
•Research
•TCR
•Identify Misconduct
•Regulatory Reporting
Data Vendor
9. Financial Data is Growing Rapidly
● Financial Services is #2 in volume of data generated globally.
● Expected CAGR of 26%, reaching 10+ zettabytes in 2025 (IDC)
10. Real-Time Data is Growing Fastest
By 2025, 30% of Data will be Real-Time
11. Investment is Not Keeping Up
Financials have largest worldwide market cap (17.2%)
No Financial Services Companies
Zero financials in top 20 global R&D spenders
17. Time Series across the FS Ecosystem
Entities: Insurance; Capital Markets; Public Sector; Banking / Payments; Data Vendors
Data Sources: Liability Claims Data; Satellite Images, IoT; Transactional Data, Geospatial Data; Order Data, Quote Data; Macroeconomic Data
Use Cases: Forecast Paid Claim Costs; Next Best Offer; Transaction Cost Analysis; Forecast GDP Growth Rates
18. Challenges with Time Series Data
Structured Data
Trades, Order Data
Reference Data (market makers, dividends)
News and sentiment
Trades, Orders for Reporting
Geospatial data
Unstructured Data
BI / Machine Learning
Historical Data locked in databases
No tools for end-to-end data science
Next Best Offer
Transaction Cost Analysis
Semi-structured Data
19. Challenge: Analysis for Historical Data at Scale
• Historical market data is too large for traditional data warehouses
• Datasets need to be merged across hundreds of thousands of tickers to generate alpha
• SQL and Python tools make analysis simple and universal but are not designed for petabyte scale
Kodak crypto crash
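One common form of the merge this slide refers to is an as-of join, where each trade is matched with the most recent quote for the same ticker. The slide does not name a specific join, so this is an assumption, and the following is only a minimal single-machine sketch in plain Python to illustrate the logic. The field names (`ticker`, `ts`, `price`, `bid_pr`) and the helper `asof_join` are hypothetical and not taken from the webinar.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(trades, quotes):
    """Attach to each trade the latest quote at or before the trade time."""
    # Group quotes by ticker, ordered by timestamp (hypothetical 'ts' field).
    quotes_by_ticker = defaultdict(list)
    for quote in sorted(quotes, key=lambda q: q["ts"]):
        quotes_by_ticker[quote["ticker"]].append(quote)
    # Keep the timestamps separately so they can be binary-searched.
    quote_times = {
        ticker: [q["ts"] for q in qs] for ticker, qs in quotes_by_ticker.items()
    }

    joined = []
    for trade in trades:
        times = quote_times.get(trade["ticker"], [])
        # Number of quotes with ts <= trade ts; 0 means no earlier quote exists.
        i = bisect_right(times, trade["ts"])
        bid = quotes_by_ticker[trade["ticker"]][i - 1]["bid_pr"] if i > 0 else None
        joined.append({**trade, "bid_pr": bid})
    return joined


# Illustrative data only, not real market data.
trades = [
    {"ticker": "AAA", "ts": 5, "price": 10.2},
    {"ticker": "AAA", "ts": 1, "price": 10.0},
    {"ticker": "BBB", "ts": 3, "price": 20.1},
]
quotes = [
    {"ticker": "AAA", "ts": 2, "bid_pr": 10.1},
    {"ticker": "AAA", "ts": 4, "bid_pr": 10.15},
    {"ticker": "BBB", "ts": 3, "bid_pr": 20.0},
]
joined = asof_join(trades, quotes)
```

Each ticker is handled independently here, which is the property that lets a distributed engine partition the same work by ticker.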
20. Challenge: Lack of Bridge from Big Data to ML
• Pandas is the standard Python framework for time series manipulation but is typically limited to a single node
• Apache Spark™ may have a steep learning curve for teams looking to explore and fit models to time series
• Limited code re-use or reproducibility without cloud workflows
22. Apache Spark: De-Facto Unified Analytics Engine
Big Data Processing
ETL + SQL + Streaming
Machine Learning
MLlib + SparkR
Uniquely combines Data & AI technologies
R Python Scala Java
23. Solution: Scale Time Series Analysis with Apache Spark
Portfolio Optimization
from pyspark.sql.functions import last, lag
df.select(last('bid_pr').over(time_window))
df.select(lag('bid_sz').over(time_window))
Anomaly Detection
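To make the semantics of the windowed `last` and `lag` calls on this slide concrete, here is a small plain-Python sketch of what they compute on one machine. It assumes that `time_window` partitions rows by ticker and orders them by event time, and that `last` is meant to skip nulls (in Spark this needs `ignorenulls=True`). The slide shows neither, so both are assumptions. The `ticker` and `ts` field names and the helper `lag_and_last` are hypothetical.

```python
from collections import defaultdict


def lag_and_last(rows, price_col="bid_pr", size_col="bid_sz"):
    """Add a carried-forward price and a lagged size to each quote row."""
    # Partition rows by ticker, mirroring a window partitioned by ticker.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["ticker"]].append(row)

    out = []
    for ticker_rows in partitions.values():
        # Order each partition by event time, mirroring the window ordering.
        ticker_rows.sort(key=lambda r: r["ts"])
        last_price = None  # most recent non-null price seen so far
        prev_size = None   # size from the previous row (lag of 1)
        for row in ticker_rows:
            if row[price_col] is not None:
                last_price = row[price_col]
            out.append({
                **row,
                "last_" + price_col: last_price,
                "lag_" + size_col: prev_size,
            })
            prev_size = row[size_col]
    return out


# Illustrative data only; None marks a missing bid price.
rows = [
    {"ticker": "AAA", "ts": 1, "bid_pr": 10.0, "bid_sz": 100},
    {"ticker": "AAA", "ts": 2, "bid_pr": None, "bid_sz": 200},
    {"ticker": "BBB", "ts": 1, "bid_pr": 20.0, "bid_sz": 50},
    {"ticker": "AAA", "ts": 3, "bid_pr": 10.5, "bid_sz": 300},
]
result = lag_and_last(rows)
```

The first row of each ticker has no previous row, so its lagged size is `None`, which matches how a lag over a partitioned window yields null at the start of each partition.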
24. Solution: Connect Big Data to ML with Koalas
• Announced April 24, 2019
• Easy way to transition from Pandas to Apache Spark
• Aims at providing the pandas API on top of Spark
• Unifies the two ecosystems with a familiar API
• Seamless transition between small and large data
25. Get Started with Koalas using Pandas APIs
Instead of pandas...
import pandas as pd
...
data = pd.read_csv("...", header=0)
… simply say koalas
import databricks.koalas as ks
...
data = ks.read_csv("...", header=0)
26. Databricks Unified Analytics Platform
End-to-end Data and ML platform that unifies people, processes and technologies
• Collaborative workspace brings data scientists and data engineers together
• Runtime for ML provides ready-to-use clusters with built-in ML frameworks
• MLflow standardizes ML Lifecycle with experimentation, reproducibility & deployment
• Databricks Delta Makes Data Lakes Ready for Unified Analytics
27. End State With Databricks
Data Sources: Trades, NBBO, Order Data; Reference Data; News & Sentiment; Geospatial Data
Data Processing: Transformations, Enrichments, Feature Eng
Data engineering; Data science and Machine Learning
BI Tools; AI Applications and Reporting
29. Key Takeaways
• Time series data challenges include both scalability of data and connecting data with ML
• Scale Time Series Analysis with Apache Spark
• Connect Big Data to ML with Koalas
• Use Databricks
30. Thank you! Questions?
Sign up for a Free Trial : http://databricks.com/trial
Blog post here
Spark + AI Summit Europe almost sold out! Register here.