SlideShare a Scribd company logo
1 of 27
SQL + NOSQL + NEWSQL + REALTIME
FOR INVESTMENT BANKS
CHARLES CAI
ASHWANI ROY
8 March 2013
Enterprise Data Problems in Investment Banks
“BigData” History and Trend – Driven by Google
CAP Theorem for Distributed Computer System
Open Source Building Blocks: Hadoop, Solr, Storm..
Hypothetical Solution using Lambda Architecture
Where “BigData” Industry is Going?
3548
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/finance-sql-nosql-newsql
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presenter: Charles Cai
¨  Charles Cai makes a living by designing and
implementing trading and risk systems for investment
banks.
¨  Currently a Chief Front Office Technical Architect in
a global energy trading firm.
¤  Twitter: @caidong
¤  Linkedin: charlescai
Presenter: Ashwani Roy
¨  Ashwani Roy – Masters in Finance Student at London
Business School and VP at a Tier 1 Investment Bank.
¨  Love to mix programming and Applied Mathematics
to solve difficult problems in Investment Banking
¤  Twitter: @Ashwani_Roy
¤  Linkedin: ashwaniroy
3548
Why Finance Industry should care ?
¨  We care because of
¤  Compliance requirements
¤  Risk Management
¤  Pricing
¤  Rise of Machines (Ecommerce)
¤  Cost Cutting
¤  BTW: Twitter is also part of Market Data
Sample Interest Model / Simulations
A quick Monte Carlo Demo
¨  Demo – Computing this is functional
Some Terminology
¨  PV = present value = Cash flows discounted to
current time
¨  Delta = change in price / change in interest rate
¨  Gamma .. Vega .. Rho .. Theta .. Vanna …. And
other Greeks
Monte Carlo Simulations -Results
¨  <results> = func<I,j,k……
¨  Parallelize computation with mappers
¨  Save results and run reducers
¨  [[ trade: 1 curveid: Orig PV:100 Delta:200]{ to OLAP}
..[ trade: 1 curveid: Sim1 PV:100 Delta:200] {big data}
..[ trade: 1 curveid: Sim2 PV:99 Delta:220]{big data} ] ]
Compliance
¤  Dodd-Frank requires >= five years records
¤  Fast Disaster recovery requirements (Tapes backup not acceptable)
¤  All Bloomberg and other chats to be saves in quick reportable form
¤  … Many more in Basel 3 and Dodd Frank Act
You need to
#  get chats for AshwaniRoy@bloomberg.net and ashwaniR@reuters.net
#  from the 5 years Bloomberg and Reuters log of a global investment bank
of 1TB(assume 1MB/Day/Trader * 220 trading days * 1000 traders* 5
years)
#  for all EURUSD swaps only
….. Additional filters and aggregation requirements
Big Data Industry History: Google’s Papers
1
Google’s Big Data Papers: 2003 – 2006
1
GFS – Google File
System
•  2003
•  Distributed file
system
•  3 x copies
•  Commodity
machines
•  Colossus (2012)
MapReduce
•  2004
•  Input à Map à
Partition à
Compare à
Shuffle à Sort à
Reduce à Output
BigTable
•  2006
•  Distributed Key-
Value column-
family based
database
Hadoop Distributed File System (HDFS)
¨  http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png
Google’s MapReduce Programming Model
1
Apache Hbase: Column Family Distributed K-V Store
Google’s Big Data Papers 2: 2010 - now
1
Percolator
• 2010
• Incremental update/
compute
•  built on BigTable
• Adds transactions, locks,
notifications
• SPFs: “Stream Processing
Frameworks” + underlying
database
Dremel
• 2010
• Online analytics and
visualization
• SQL like language for
structured data
• Each row is JSON object –
in protobuf format
• Column based
• Spanner (2012),
BigQuery, F1
Pregel
• 2010
• Scalable graph computing
• Worker threads à nodes
à parallel “superstep” à
messages à nodes à
Aggregator/Combiners
(global statistics)
• PageRank, shortest path,
bipartite matching
Impala
Tez/Stinger
Microsoft
Trinity
Unstructured Data: Index/Search Engine
¨  Github Code Search: 17 TB
Apache Lucene/SOLR
¨  Open Source Indexing
and Search Engine
¨  4,000+ Enterprise users
¤  IBM, HP, Cisco
¤  Apple, Linkedin
¤  Wikipedia
¤  CNet, Sky
¤  Twitter
What’s Next for Hadoop? Real-time!
Nathan Marz
Some more use cases
¤  Save money to save your jobs
¤  Save money to your firm can do more
¤  E Commerce is norm…
¤  Market sentiment analysis cannot be relied on using
“Bloomberg's sentiment analysis” only
¤  .. Add some more
“Lambda Architecture” – Nathan Marz, BackType/Twitter
¨  query = func (data, ...)
2
•  Real-time ticks, events…
•  Historical (all history data
points)
•  Curated/cleansed curves…
•  Derived curves…
•  Back-testing models…
•  …
•  Technical analysis…
•  Alerts…
•  Join across data sources (e.g.
correlation among weather / energy)
•  Curating/cleanse curves…
•  Derive curves, building models…
•  Back-testing models…
•  Visualization of the above!
•  …
•  Excel / VBA
•  Java, C#/F#...
•  MatLab
•  3rd party ETL Tools
•  R
•  …
Batch	
  Layer	
  (Hadoop) Servicing	
  Layer
Speed	
  Layer	
  (Storm)
RDBMS/DW	
  + 	
  Full-­‐text	
  Search	
  + 	
  Graph	
  Database
QFD	
  1
Batch	
  recompute
All	
  data	
  
(HDFS/HBase)
QFD	
  1
Tableau/Spotfire
Excel/Apps
MDX/DW
(T+ 1)
New	
  data	
  stream
Process	
  stream
Precompute	
  Views	
  
(MapReduce)
Realtime	
  increment	
  
QFD	
  2
QFD	
  2
QFD	
  N
Batch	
  views	
  (HDFS/Impala)
QFD	
  N
Realtime	
  views	
  (Apache	
  Hbase)
Metadata	
  /	
  Classification	
  /	
  Curation
Automation	
  /	
  Aggregation	
  /	
  Centralization
RDBMS Graph	
  Database
Merge
Full-­‐text	
  Search
COTS	
  
Reporting	
  Tools
Ad-­‐hock	
  Analysis/Writeback:	
  Java/
C#,R/Clojure,	
  HIVE/PIG,	
  Talend/3rd
	
  
party,	
  ...
Alerts
Visualization
Quality	
  /	
  Access	
  /	
  
Manipulation
Access	
  /	
  Centralization	
  /	
  
Manipulation
Acquisition
Visualization
Lambda	
  Architecture :	
  query	
  = 	
  func	
  (data,	
  ...)
Online resources and alternative stacks
¨  An Introduction to Data Science.PDF – Free e-book on Data Science with R under Creative
Commons Licenses
¨  Berkeley Data Analytics Stack (Open Source: Mesos – cluster management, Spark/Streaming
– cluster computing, Shark-SQL/DW)
¨  Learning Statistics with R, Free Big Data Education: Advanced Data Science
¨  DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…)
¨  An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop
and Splout SQL
¨  Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture
¨  Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)
Distributed Computing System: CAP Theorem
https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview
http://en.wikipedia.org/wiki/CAP_theorem
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
Consistency
•  all nodes see the same data at
the same time
Availability
•  a guarantee that every request
receives a response about
whether it was successful or
failed
Partition tolerance
•  the system continues to operate
despite arbitrary message loss
or failure of part of the system
“Lambda Architecture”: Enterprise Data
•  Quality of data
•  Ways to improve
data quality
•  Discover hidden
business insights
•  Data sources
•  Data formats (./
semi-/non-
structured…)
•  Speed of change
•  Speed of reaction
•  Data size
•  Retention granular
level…
Volume Velocity
ValueVariety
“Lambda Architecture” – Nathan Marz, BackType/Twitter
¨  Design Principle:
¤  Human fault-tolerance
¤  Immutability
¤  Pre-computation
¨  Lambda Architecture:
¤  Batch Layer
¤  Serving Layer
¤  Speed Layer
¨  Technology Stack
¤  Apache Hadoop/HBase/Cloudera Impala
¤  Twitter Storm

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Integrating SQL & NoSQL & NewSQL & Realtime Data Intelligence for the Financial Industry

  • 1. SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS CHARLES CAI ASHWANI ROY 8 March 2013 Enterprise Data Problems in Investment Banks “BigData” History and Trend – Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. Hypothetical Solution using Lambda Architecture Where “BigData” Industry is Going? 3548
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /finance-sql-nosql-newsql
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Presenter: Charles Cai ¨  Charles Cai makes a living by designing and implementing trading and risk systems for investment banks. ¨  Currently a Chief Front Office Technical Architect in a global energy trading firm. ¤  Twitter: @caidong ¤  Linkedin: charlescai
  • 5. Presenter: Ashwani Roy ¨  Ashwani Roy – Masters in Finance Student at London Business School and VP at a Tier 1 Investment Bank. ¨  Love to mix programming and Applied Mathematics to solve difficult problems in Investment Banking ¤  Twitter: @Ashwani_Roy ¤  Linkedin: ashwaniroy
  • 7. Why Finance Industry should care ? ¨  We care because of ¤  Compliance requirements ¤  Risk Management ¤  Pricing ¤  Rise of Machines (Ecommerce) ¤  Cost Cutting ¤  BTW: Twitter is also part of Market Data
  • 8. Sample Interest Model / Simulations
  • 9. A quick Monte Carlo Demo ¨  Demo – Computing this is functional Some Terminology ¨  PV = present value = Cash flows discounted to current time ¨  Delta = change in price / change in interest rate ¨  Gamma .. Vega .. Rho .. Theta .. Vanna …. And other Greeks
  • 10. Monte Carlo Simulations -Results ¨  <results> = func<I,j,k…… ¨  Parallelize computation with mappers ¨  Save results and run reducers ¨  [[ trade: 1 curveid: Orig PV:100 Delta:200]{ to OLAP} ..[ trade: 1 curveid: Sim1 PV:100 Delta:200] {big data} ..[ trade: 1 curveid: Sim2 PV:99 Delta:220]{big data} ] ]
  • 11. Compliance ¤  Dodd-Frank requires >= five years records ¤  Fast Disaster recovery requirements (Tapes backup not acceptable) ¤  All Bloomberg and other chats to be saves in quick reportable form ¤  … Many more in Basel 3 and Dodd Frank Act You need to #  get chats for AshwaniRoy@bloomberg.net and ashwaniR@reuters.net #  from the 5 years Bloomberg and Reuters log of a global investment bank of 1TB(assume 1MB/Day/Trader * 220 trading days * 1000 traders* 5 years) #  for all EURUSD swaps only ….. Additional filters and aggregation requirements
  • 12. Big Data Industry History: Google’s Papers 1
  • 13. Google’s Big Data Papers: 2003 – 2006 1 GFS – Google File System •  2003 •  Distributed file system •  3 x copies •  Commodity machines •  Colossus (2012) MapReduce •  2004 •  Input à Map à Partition à Compare à Shuffle à Sort à Reduce à Output BigTable •  2006 •  Distributed Key- Value column- family based database
  • 14. Hadoop Distributed File System (HDFS) ¨  http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png
  • 16. Apache Hbase: Column Family Distributed K-V Store
  • 17. Google’s Big Data Papers 2: 2010 - now 1 Percolator • 2010 • Incremental update/ compute •  built on BigTable • Adds transactions, locks, notifications • SPFs: “Stream Processing Frameworks” + underlying database Dremel • 2010 • Online analytics and visualization • SQL like language for structured data • Each row is JSON object – in protobuf format • Column based • Spanner (2012), BigQuery, F1 Pregel • 2010 • Scalable graph computing • Worker threads à nodes à parallel “superstep” à messages à nodes à Aggregator/Combiners (global statistics) • PageRank, shortest path, bipartite matching Impala Tez/Stinger Microsoft Trinity
  • 18. Unstructured Data: Index/Search Engine ¨  Github Code Search: 17 TB
  • 19. Apache Lucene/SOLR ¨  Open Source Indexing and Search Engine ¨  4,000+ Enterprise users ¤  IBM, HP, Cisco ¤  Apple, Linkedin ¤  Wikipedia ¤  CNet, Sky ¤  Twitter
  • 20. What’s Next for Hadoop? Real-time! Nathan Marz
  • 21. Some more use cases ¤  Save money to save your jobs ¤  Save money to your firm can do more ¤  E Commerce is norm… ¤  Market sentiment analysis cannot be relied on using “Bloomberg's sentiment analysis” only ¤  .. Add some more
  • 22. “Lambda Architecture” – Nathan Marz, BackType/Twitter ¨  query = func (data, ...) 2 •  Real-time ticks, events… •  Historical (all history data points) •  Curated/cleansed curves… •  Derived curves… •  Back-testing models… •  … •  Technical analysis… •  Alerts… •  Join across data sources (e.g. correlation among weather / energy) •  Curating/cleanse curves… •  Derive curves, building models… •  Back-testing models… •  Visualization of the above! •  … •  Excel / VBA •  Java, C#/F#... •  MatLab •  3rd party ETL Tools •  R •  …
  • 23. Batch  Layer  (Hadoop) Servicing  Layer Speed  Layer  (Storm) RDBMS/DW  +  Full-­‐text  Search  +  Graph  Database QFD  1 Batch  recompute All  data   (HDFS/HBase) QFD  1 Tableau/Spotfire Excel/Apps MDX/DW (T+ 1) New  data  stream Process  stream Precompute  Views   (MapReduce) Realtime  increment   QFD  2 QFD  2 QFD  N Batch  views  (HDFS/Impala) QFD  N Realtime  views  (Apache  Hbase) Metadata  /  Classification  /  Curation Automation  /  Aggregation  /  Centralization RDBMS Graph  Database Merge Full-­‐text  Search COTS   Reporting  Tools Ad-­‐hock  Analysis/Writeback:  Java/ C#,R/Clojure,  HIVE/PIG,  Talend/3rd   party,  ... Alerts Visualization Quality  /  Access  /   Manipulation Access  /  Centralization  /   Manipulation Acquisition Visualization Lambda  Architecture :  query  =  func  (data,  ...)
  • 24. Online resources and alternative stacks ¨  An Introduction to Data Science.PDF – Free e-book on Data Science with R under Creative Commons Licenses ¨  Berkeley Data Analytics Stack (Open Source: Mesos – cluster management, Spark/Streaming – cluster computing, Shark-SQL/DW) ¨  Learning Statistics with R, Free Big Data Education: Advanced Data Science ¨  DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…) ¨  An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL ¨  Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ¨  Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)
  • 25. Distributed Computing System: CAP Theorem https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview http://en.wikipedia.org/wiki/CAP_theorem http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed Consistency •  all nodes see the same data at the same time Availability •  a guarantee that every request receives a response about whether it was successful or failed Partition tolerance •  the system continues to operate despite arbitrary message loss or failure of part of the system
  • 26. “Lambda Architecture”: Enterprise Data •  Quality of data •  Ways to improve data quality •  Discover hidden business insights •  Data sources •  Data formats (./ semi-/non- structured…) •  Speed of change •  Speed of reaction •  Data size •  Retention granular level… Volume Velocity ValueVariety
  • 27. “Lambda Architecture” – Nathan Marz, BackType/Twitter ¨  Design Principle: ¤  Human fault-tolerance ¤  Immutability ¤  Pre-computation ¨  Lambda Architecture: ¤  Batch Layer ¤  Serving Layer ¤  Speed Layer ¨  Technology Stack ¤  Apache Hadoop/HBase/Cloudera Impala ¤  Twitter Storm