SQL + NOSQL + NEWSQL + REALTIMEFOR INVESTMENT BANKSCHARLES CAIASHWANI ROY8 March 2013Enterprise Data Problems in Investmen...
InfoQ.com: News & Community Site• 750,000 unique visitors/month• Published in 4 languages (English, Chinese, Japanese and ...
Presented at QCon Londonwww.qconlondon.comPurpose of QCon- to empower software development by facilitating the spread ofkn...
Presenter: Charles Cai¨  Charles Cai makes a living by designing andimplementing trading and risk systems for investmentb...
Presenter: Ashwani Roy¨  Ashwani Roy – Masters in Finance Student at LondonBusiness School and VP at a Tier 1 Investment ...
3548
Why Finance Industry should care ?¨  We care because of¤  Compliance requirements¤  Risk Management¤  Pricing¤  Rise ...
Sample Interest Model / Simulations
A quick Monte Carlo Demo¨  Demo – Computing this is functionalSome Terminology¨  PV = present value = Cash flows discoun...
Monte Carlo Simulations -Results¨  <results> = func<I,j,k……¨  Parallelize computation with mappers¨  Save results and r...
Compliance¤  Dodd-Frank requires >= five years records¤  Fast Disaster recovery requirements (Tapes backup not acceptabl...
Big Data Industry History: Google’s Papers1
Google’s Big Data Papers: 2003 – 20061GFS – Google FileSystem•  2003•  Distributed filesystem•  3 x copies•  Commoditymach...
Hadoop Distributed File System (HDFS)¨  http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png
Google’s MapReduce Programming Model1
Apache Hbase: Column Family Distributed K-V Store
Google’s Big Data Papers 2: 2010 - now1Percolator• 2010• Incremental update/compute•  built on BigTable• Adds transactions...
Unstructured Data: Index/Search Engine¨  Github Code Search: 17 TB
Apache Lucene/SOLR¨  Open Source Indexingand Search Engine¨  4,000+ Enterprise users¤  IBM, HP, Cisco¤  Apple, Linkedi...
What’s Next for Hadoop? Real-time!Nathan Marz
Some more use cases¤  Save money to save your jobs¤  Save money to your firm can do more¤  E Commerce is norm…¤  Marke...
“Lambda Architecture” – Nathan Marz, BackType/Twitter¨  query = func (data, ...)2•  Real-time ticks, events…•  Historical...
Batch	  Layer	  (Hadoop) Servicing	  LayerSpeed	  Layer	  (Storm)RDBMS/DW	  + 	  Full-­‐text	  Search	  + 	  Graph	  Datab...
Online resources and alternative stacks¨  An Introduction to Data Science.PDF – Free e-book on Data Science with R under ...
Distributed Computing System: CAP Theoremhttps://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overviewhttp://en.wik...
“Lambda Architecture”: Enterprise Data•  Quality of data•  Ways to improvedata quality•  Discover hiddenbusiness insights•...
“Lambda Architecture” – Nathan Marz, BackType/Twitter¨  Design Principle:¤  Human fault-tolerance¤  Immutability¤  Pre...
Upcoming SlideShare
Loading in …5
×

Integrating SQL & NoSQL & NewSQL & Realtime Data Intelligence for the Financial Industry

1,673 views

Published on

Video and slides synchronized, mp3 and slide download available at http://bit.ly/15ibJhX.

Charles Cai, Ashwani Roy discuss a robust, cost effective, hypothetical solution to address extreme challenges in financial institutions, from decision making support to pricing and risk management. Filmed at qconlondon.com.

Charles Cai makes a living by designing Foreign Exchange and energy trading systems and he is currently focusing on architecture and technology innovation for data analytics,cross-asset trading and risk management. Ashwani Roy has 10 years of experience in programming and software architecture out of which 5 years are with various investments in areas like risk computation, pricing.

0 Comments
9 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,673
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
0
Comments
0
Likes
9
Embeds 0
No embeds

No notes for slide

Integrating SQL & NoSQL & NewSQL & Realtime Data Intelligence for the Financial Industry

  1. 1. SQL + NOSQL + NEWSQL + REALTIMEFOR INVESTMENT BANKSCHARLES CAIASHWANI ROY8 March 2013Enterprise Data Problems in Investment Banks“BigData” History and Trend – Driven by GoogleCAP Theorem for Distributed Computer SystemOpen Source Building Blocks: Hadoop, Solr, Storm..Hypothetical Solution using Lambda ArchitectureWhere “BigData” Industry is Going?3548
  2. 2. InfoQ.com: News & Community Site• 750,000 unique visitors/month• Published in 4 languages (English, Chinese, Japanese and BrazilianPortuguese)• Post content from our QCon conferences• News 15-20 / week• Articles 3-4 / week• Presentations (videos) 12-15 / week• Interviews 2-3 / week• Books 1 / monthWatch the video with slidesynchronization on InfoQ.com!http://www.infoq.com/presentations/finance-sql-nosql-newsql
  3. 3. Presented at QCon Londonwww.qconlondon.comPurpose of QCon- to empower software development by facilitating the spread ofknowledge and innovationStrategy- practitioner-driven conference designed for YOU: influencers ofchange and innovation in your teams- speakers and topics driving the evolution and innovation- connecting and catalyzing the influencers and innovatorsHighlights- attended by more than 12,000 delegates since 2007- held in 9 cities worldwide
  4. 4. Presenter: Charles Cai¨  Charles Cai makes a living by designing andimplementing trading and risk systems for investmentbanks.¨  Currently a Chief Front Office Technical Architect ina global energy trading firm.¤  Twitter: @caidong¤  Linkedin: charlescai
  5. 5. Presenter: Ashwani Roy¨  Ashwani Roy – Masters in Finance Student at LondonBusiness School and VP at a Tier 1 Investment Bank.¨  Love to mix programming and Applied Mathematicsto solve difficult problems in Investment Banking¤  Twitter: @Ashwani_Roy¤  Linkedin: ashwaniroy
  6. 6. 3548
  7. 7. Why Finance Industry should care ?¨  We care because of¤  Compliance requirements¤  Risk Management¤  Pricing¤  Rise of Machines (Ecommerce)¤  Cost Cutting¤  BTW: Twitter is also part of Market Data
  8. 8. Sample Interest Model / Simulations
  9. 9. A quick Monte Carlo Demo¨  Demo – Computing this is functionalSome Terminology¨  PV = present value = Cash flows discounted tocurrent time¨  Delta = change in price / change in interest rate¨  Gamma .. Vega .. Rho .. Theta .. Vanna …. Andother Greeks
  10. 10. Monte Carlo Simulations -Results¨  <results> = func<I,j,k……¨  Parallelize computation with mappers¨  Save results and run reducers¨  [[ trade: 1 curveid: Orig PV:100 Delta:200]{ to OLAP}..[ trade: 1 curveid: Sim1 PV:100 Delta:200] {big data}..[ trade: 1 curveid: Sim2 PV:99 Delta:220]{big data} ] ]
  11. 11. Compliance¤  Dodd-Frank requires >= five years records¤  Fast Disaster recovery requirements (Tapes backup not acceptable)¤  All Bloomberg and other chats to be saves in quick reportable form¤  … Many more in Basel 3 and Dodd Frank ActYou need to#  get chats for AshwaniRoy@bloomberg.net and ashwaniR@reuters.net#  from the 5 years Bloomberg and Reuters log of a global investment bankof 1TB(assume 1MB/Day/Trader * 220 trading days * 1000 traders* 5years)#  for all EURUSD swaps only….. Additional filters and aggregation requirements
  12. 12. Big Data Industry History: Google’s Papers1
  13. 13. Google’s Big Data Papers: 2003 – 20061GFS – Google FileSystem•  2003•  Distributed filesystem•  3 x copies•  Commoditymachines•  Colossus (2012)MapReduce•  2004•  Input à Map àPartition àCompare àShuffle à Sort àReduce à OutputBigTable•  2006•  Distributed Key-Value column-family baseddatabase
  14. 14. Hadoop Distributed File System (HDFS)¨  http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png
  15. 15. Google’s MapReduce Programming Model1
  16. 16. Apache Hbase: Column Family Distributed K-V Store
  17. 17. Google’s Big Data Papers 2: 2010 - now1Percolator• 2010• Incremental update/compute•  built on BigTable• Adds transactions, locks,notifications• SPFs: “Stream ProcessingFrameworks” + underlyingdatabaseDremel• 2010• Online analytics andvisualization• SQL like language forstructured data• Each row is JSON object –in protobuf format• Column based• Spanner (2012),BigQuery, F1Pregel• 2010• Scalable graph computing• Worker threads à nodesà parallel “superstep” àmessages à nodes àAggregator/Combiners(global statistics)• PageRank, shortest path,bipartite matchingImpalaTez/StingerMicrosoftTrinity
  18. 18. Unstructured Data: Index/Search Engine¨  Github Code Search: 17 TB
  19. 19. Apache Lucene/SOLR¨  Open Source Indexingand Search Engine¨  4,000+ Enterprise users¤  IBM, HP, Cisco¤  Apple, Linkedin¤  Wikipedia¤  CNet, Sky¤  Twitter
  20. 20. What’s Next for Hadoop? Real-time!Nathan Marz
  21. 21. Some more use cases¤  Save money to save your jobs¤  Save money to your firm can do more¤  E Commerce is norm…¤  Market sentiment analysis cannot be relied on using“Bloombergs sentiment analysis” only¤  .. Add some more
  22. 22. “Lambda Architecture” – Nathan Marz, BackType/Twitter¨  query = func (data, ...)2•  Real-time ticks, events…•  Historical (all history datapoints)•  Curated/cleansed curves…•  Derived curves…•  Back-testing models…•  …•  Technical analysis…•  Alerts…•  Join across data sources (e.g.correlation among weather / energy)•  Curating/cleanse curves…•  Derive curves, building models…•  Back-testing models…•  Visualization of the above!•  …•  Excel / VBA•  Java, C#/F#...•  MatLab•  3rd party ETL Tools•  R•  …
  23. 23. Batch  Layer  (Hadoop) Servicing  LayerSpeed  Layer  (Storm)RDBMS/DW  +  Full-­‐text  Search  +  Graph  DatabaseQFD  1Batch  recomputeAll  data  (HDFS/HBase)QFD  1Tableau/SpotfireExcel/AppsMDX/DW(T+ 1)New  data  streamProcess  streamPrecompute  Views  (MapReduce)Realtime  increment  QFD  2QFD  2QFD  NBatch  views  (HDFS/Impala)QFD  NRealtime  views  (Apache  Hbase)Metadata  /  Classification  /  CurationAutomation  /  Aggregation  /  CentralizationRDBMS Graph  DatabaseMergeFull-­‐text  SearchCOTS  Reporting  ToolsAd-­‐hock  Analysis/Writeback:  Java/C#,R/Clojure,  HIVE/PIG,  Talend/3rd  party,  ...AlertsVisualizationQuality  /  Access  /  ManipulationAccess  /  Centralization  /  ManipulationAcquisitionVisualizationLambda  Architecture :  query  =  func  (data,  ...)
  24. 24. Online resources and alternative stacks¨  An Introduction to Data Science.PDF – Free e-book on Data Science with R under CreativeCommons Licenses¨  Berkeley Data Analytics Stack (Open Source: Mesos – cluster management, Spark/Streaming– cluster computing, Shark-SQL/DW)¨  Learning Statistics with R, Free Big Data Education: Advanced Data Science¨  DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…)¨  An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoopand Splout SQL¨  Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture¨  Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)
  25. 25. Distributed Computing System: CAP Theoremhttps://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overviewhttp://en.wikipedia.org/wiki/CAP_theoremhttp://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changedConsistency•  all nodes see the same data atthe same timeAvailability•  a guarantee that every requestreceives a response aboutwhether it was successful orfailedPartition tolerance•  the system continues to operatedespite arbitrary message lossor failure of part of the system
  26. 26. “Lambda Architecture”: Enterprise Data•  Quality of data•  Ways to improvedata quality•  Discover hiddenbusiness insights•  Data sources•  Data formats (./semi-/non-structured…)•  Speed of change•  Speed of reaction•  Data size•  Retention granularlevel…Volume VelocityValueVariety
  27. 27. “Lambda Architecture” – Nathan Marz, BackType/Twitter¨  Design Principle:¤  Human fault-tolerance¤  Immutability¤  Pre-computation¨  Lambda Architecture:¤  Batch Layer¤  Serving Layer¤  Speed Layer¨  Technology Stack¤  Apache Hadoop/HBase/Cloudera Impala¤  Twitter Storm

×