Who is Adfin?
What special sauce did we build … very large OLAP DB.
Goals:
Have you take a look at CephFS … I might be one of the few people talking about it.
Realize that it’s possible for your organization to develop some expertise in-house… and contribute.
Name implies a combination of Advertising + Finance Markets. Two home town industries (Madison Ave and Wall St)
Using tools and knowledge pioneered by the financial industry.
Most media (by volume) is bought and sold programmatically, à la HFT.
It’s an opaque marketplace.
Bloomberg … Information Platform, S&P… Indices, Market … aggregating market data (CDS)
I am going to keep butchering these analogies.
Pictures of some of the tools we’ve built.
Real time analysis into your own data and market data.
Run a query, get a result… lots of variables.
Forecasting
The advertising market is larger than the financial market… in terms of volume of transactions.
Each impression is worth a tiny fraction of a penny.
When I looked at the number of transactions for an exchange like NASDAQ… it’s around 50 million; NYSE, 100 million.
A lot of duct tape, but also a lot of efficiency.
This number is not getting smaller. All advertising is going to be digitally bought and sold and that day is coming.
Distributed, relational database for running real-time analytics queries on very large time series data. KDB on many, many nodes.
Some fun things. It’s a relational model, but not SQL. 90% of queries are sums or group-bys.
Data is sharded into partitions by time. Spread across many nodes.
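A minimal sketch of that model, not our actual engine (all names here are hypothetical): rows bucketed into daily partitions pinned to nodes, with the group-by-sum workhorse query scattered to each partition and the partial sums merged.

```python
from collections import defaultdict
from datetime import datetime

def partition_key(ts):
    # Shard rows into daily partitions by timestamp.
    return ts.strftime("%Y-%m-%d")

def shard(rows, nodes):
    # Bucket rows into daily partitions, then pin each partition to a node.
    partitions = defaultdict(list)
    for ts, key, value in rows:
        partitions[partition_key(ts)].append((key, value))
    return {day: (hash(day) % nodes, part) for day, part in partitions.items()}

def group_by_sum(partition):
    # The per-partition "90% case": a sum grouped by key.
    sums = defaultdict(float)
    for key, value in partition:
        sums[key] += value
    return sums

def query(sharded):
    # Scatter the aggregation to every partition, then merge the partials.
    total = defaultdict(float)
    for _node, part in sharded.values():
        for key, s in group_by_sum(part).items():
            total[key] += s
    return dict(total)

rows = [
    (datetime(2014, 1, 1, 9),  "adx",  0.0001),  # one impression's worth
    (datetime(2014, 1, 1, 17), "adx",  0.0002),
    (datetime(2014, 1, 2, 9),  "nyse", 0.5),
]
print(query(shard(rows, nodes=4)))
```

Because each partition aggregates independently, the merge step only sees one small partial result per partition, which is what makes the scatter/gather cheap.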
We get pretty amazing single-node performance: 100s of millions of rows a second per partition.
There’s been a lot of research into this stuff. Based on research into compression, indexing, and query execution, all from the last 3 to 4 years.
For large datasets our goal is to answer really large queries in under 10 seconds. In reality, most things we do answer in under 1 second.
Why? Because the dataset is huge.
Also, we’re a bit crazy.
Before, we were storing it all on local disks.
Couple problems:
Redundancy?
Can’t grow computation without storage, and vice versa.
Looked into Ceph:
Scalable storage, just throw more machines at it… don’t worry about topology too much.
We could separate storage from computation.
No SPOF, redundancy everywhere.
Pretty good speed for DFS.
We can leverage the kernel. The kernel client versus doing it directly: page cache, etc… A common theme.
“Beta company, okay using a beta product.” We can get under the hood.
The early start was a bit rough. There were lots of bugs. We found lots of bugs.
Community was great, esp Yan.
Yan fixed our last bug around the end of 2013… haven’t had a single problem since.
We’re not storing multi-PB yet, but we’ve processed multi-PB and haven’t had a problem.
We lost some performance as a result of this. Network latency, overhead, Ceph overhead.
We can also go even cheaper without Ceph nodes / network.
Our access pattern, write once read many (mostly true).
Most recent data is most often used (working set larger than RAM, smaller than the full DFS).
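That access pattern is exactly what a read-through local cache exploits. A toy sketch of the fscache-style idea (hypothetical class, not the real fscache API): files from the slow backing store are copied to fast local disk on first read, with LRU eviction so the local copy tracks the hot working set.

```python
import os
import shutil
from collections import OrderedDict

class ReadThroughCache:
    # Write-once-read-many files from a slow backing store (the DFS) are
    # copied to fast local disk on first read; LRU eviction keeps only
    # the hot working set local.
    def __init__(self, backing_dir, cache_dir, max_files=2):
        self.backing = backing_dir
        self.cache = cache_dir
        self.max_files = max_files
        self.lru = OrderedDict()  # file name -> local cached path

    def read(self, name):
        if name in self.lru:
            self.lru.move_to_end(name)  # cache hit: mark as recently used
        else:
            # Cache miss: fetch from the backing store onto local disk.
            dst = os.path.join(self.cache, name)
            shutil.copyfile(os.path.join(self.backing, name), dst)
            self.lru[name] = dst
            if len(self.lru) > self.max_files:
                _, coldest = self.lru.popitem(last=False)  # evict LRU entry
                os.unlink(coldest)
        with open(self.lru[name], "rb") as f:
            return f.read()
```

The write-once property is what makes this simple: cached copies never go stale, so there is no invalidation problem, only eviction.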
The Linux kernel people have really put hundreds of man-years into scalability.
I don’t want to discourage anybody … we did something not smart, picked the hardest problem.
It required us to know a lot of things about Ceph, kernel, concurrency.
I would pick something simpler next time.
There are bugs in the other parts of the kernel?
So one of the reasons we wanted to do this work in the kernel was concurrency, so our benefit was also our PITA.
We got it into the kernel’s Ceph code base around 3.13.
A bunch of bug fixes from external folks. We’ve exposed issues with the FSCache code.
We’ve fixed a bunch of concurrency bugs that only happen in the error path of FSCache under VMA pressure. A lot of filesystems benefit.
We’re really happy with performance… we’ve made a good bet on the kernel.
We’re able to really drive fscache up to the speed of the disks we have.
So despite the initial learning curve … we want to contribute work.
Where we can leverage our knowledge … performance.
We’ve built a lot of things in our system for improving latency. Learned what to do, what not to do, where to apply lockless algorithms.
Readv2 syscall… Helps all applications that do both IO-bound and CPU-bound work.
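The idea (a first-class non-blocking read, so a thread can try the page cache before dispatching to a worker pool) later landed in mainline as the preadv2() syscall with a per-call RWF_NOWAIT flag. A rough sketch of the pattern via Python’s os.preadv (Python 3.7+, Linux); the helper name read_nowait is mine, not a real API.

```python
import os

def read_nowait(fd, nbytes, offset=0):
    # Hypothetical helper: try a non-blocking read first (RWF_NOWAIT only
    # succeeds if the data is already in the page cache), then fall back
    # to a normal blocking read -- which a real application would instead
    # hand off to a thread pool so the event loop never stalls on disk.
    buf = bytearray(nbytes)
    nowait = getattr(os, "RWF_NOWAIT", None)  # absent on older platforms
    if nowait is not None:
        try:
            n = os.preadv(fd, [buf], offset, nowait)
            return bytes(buf[:n])
        except OSError:
            pass  # would block (or flag unsupported): fall through
    n = os.preadv(fd, [buf], offset)  # blocking fallback
    return bytes(buf[:n])
```

The win for mixed IO/CPU workloads is that the cheap cached case never pays the cost of a thread-pool round trip.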
Thanks for listening to me.
Hopefully it was a good story of what we’re up to… how we’re leveraging Ceph.
Motivating to help and contribute.
It’s nice to have a vendor you can call up and yell at when things aren’t working, but it’s even better to be able to guide the tool to do what you want.
The Ceph community is great, there’s so many people contributing to so many different projects.