Francesc Alted (UberResearch GmbH), “New Trends In Storing And Analyzing Large Data Silos With Python”.
Bio: Teacher, developer and consultant in a wide variety of business applications. Particularly interested in the field of very large databases, with special emphasis on squeezing the last drop of performance out of the computer as a whole, i.e. not only the CPU but also the memory and I/O subsystems.
Memory Efficient Applications - Francesc Alted at Big Data Spain 2012 (Big Data Spain)
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/memory-efficient-applications/francesc-alted
Processing large data requires new approaches to data mining: low (close to linear) complexity and stream processing. While in traditional data mining the practitioner is usually presented with a static dataset, which might have just a timestamp attached to it, from which to infer a model for predicting future observations, in stream processing the problem is often posed as extracting as much information as possible from the current data to convert it into an actionable model within a limited time window. In this talk I present an approach based on HBase counters for mining over streams of data, which allows for massively distributed processing and data mining. I will consider overall design goals as well as HBase schema design dilemmas that speed up the knowledge extraction process. I will also demo efficient implementations of Naive Bayes, Nearest Neighbor and Bayesian Learning on top of Bayesian Counters.
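The counter-based idea in this abstract can be sketched concretely: Naive Bayes training reduces to nothing but increments of class and (class, feature) counters, which is exactly the access pattern distributed atomic counters such as HBase's provide. A minimal, hypothetical Python sketch with plain dicts standing in for counter columns (not the speaker's actual implementation):

```python
from collections import defaultdict
import math

class CounterNaiveBayes:
    """Naive Bayes whose entire training state is increment-only counters,
    the same operation shape that HBase atomic counters offer at scale."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # label -> N(label)
        self.feature_counts = defaultdict(int)  # (label, feature) -> N(feature, label)

    def train(self, features, label):
        # Training is nothing but counter increments: trivially distributable.
        self.class_counts[label] += 1
        for f in features:
            self.feature_counts[(label, f)] += 1

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, float("-inf")
        for label, n in self.class_counts.items():
            logp = math.log(n / total)
            for f in features:
                # Laplace smoothing so unseen features do not zero out a class
                logp += math.log((self.feature_counts[(label, f)] + 1) / (n + 2))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

nb = CounterNaiveBayes()
nb.train({"fast", "cheap"}, "spam")
nb.train({"fast"}, "spam")
nb.train({"meeting"}, "ham")
print(nb.predict({"fast"}))  # → spam
```

Because increments commute, many writers can update the same model concurrently without coordination, which is what makes the approach attractive for stream mining.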
Replication, Durability, and Disaster Recovery - Steven Francia
This session introduces the basic components of high availability before going into a deep dive on MongoDB replication. We'll explore some of the advanced capabilities with MongoDB replication and best practices to ensure data durability and redundancy. We'll also look at various deployment scenarios and disaster recovery configurations.
An Overview of Spanner: Google's Globally Distributed Database - Benjamin Bengfort
Spanner is a globally distributed database that provides external consistency between data centers and stores data in a schema-based, semi-relational data structure. Not only that, Spanner provides a versioned view of the data that allows for instantaneous snapshot isolation across any segment of the data. This versioned isolation allows Spanner to provide globally consistent reads of the database at a particular time, allowing for lock-free read-only transactions (and therefore no communication overhead for consensus during these types of reads). Spanner also provides externally consistent reads and writes with a timestamp-based linear execution of transactions and two-phase commits. Spanner is the first distributed database that provides global sharding and replication with strong consistency semantics.
A brief history of Instagram's adoption cycle of the open source distributed database Apache Cassandra, in addition to details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.
Apache Cassandra operations have the reputation of being simple on single-datacenter deployments and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and/or high throughput: basic Apache Cassandra operations such as repairs, compactions or hint delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster.
In this presentation, Julien will go through Apache Cassandra multi-datacenter concepts first, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and/or datacenters, repair strategy, Java GC tuning, OS tuning, Apache Cassandra configuration and monitoring.
Based on his three years of experience managing a multi-datacenter cluster on Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent/mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
This presentation is on the DRBD product. At eNovance, we have been using it for several years. In these slides, you will find information on how we use it, use cases and ninja tricks.
This document was put together with a lot of feedback, and it draws on the strong knowledge of this technology that eNovance is able to provide.
C* Summit 2013: Cassandra at Instagram by Rick Branson - DataStax Academy
Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012 - Chris Richardson
The database world is undergoing a major upheaval. NoSQL databases such as MongoDB and Cassandra are emerging as a compelling choice for many applications. They can simplify the persistence of complex data models and offer significantly better scalability and performance. But these databases have very different and unfamiliar data models and APIs, as well as a limited transaction model. Moreover, the relational world is fighting back with so-called NewSQL databases such as VoltDB, which, by using a radically different architecture, offer high scalability and performance as well as the familiar relational model and ACID transactions. Sounds great, but unlike with a traditional relational database you can’t use JDBC and must partition your data.
In this presentation you will learn about popular NoSQL databases – MongoDB and Cassandra – as well as VoltDB. We will compare and contrast each database’s data model and Java API using NoSQL and NewSQL versions of a use case from the book POJOs in Action. We will learn about the benefits and drawbacks of using NoSQL and NewSQL databases.
Hanborq Optimizations on Hadoop MapReduce - Hanborq Inc.
A Hanborq-optimized Hadoop distribution, with a particular focus on high-performance MapReduce. It is the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).
Cloud computing: the role of cloud computing in digital libraries and electronic archiving sys... - Essam Obaid
Cloud computing technologies and their role in digital libraries and electronic archiving systems.
Mobile App Development - Project Management Process - Bagaria Swati
Are you looking to build new capabilities, or extend the capabilities of existing business software, to enhance productivity and profitability?
Key performance metrics are:
1. application scope management
2. project status and dependencies
3. prompt action for defect containment and defect resolution
4. schedule variance and budget variance analysis
Follow a well-defined and mature application development process based on business case analysis.
Who Manages a Project?
Highly trained Project Managers at CodeMyMobile manage the complete application development lifecycle with a focus on efficiency. Our experienced project managers lead planning, coordination, communication and control of activities pertaining to technology initiatives, ensuring that project outcomes are in line with our customers’ business objectives and comply with overall time, cost and quality success criteria.
Responsibilities of the Project Manager:
1. Manage the project goals, scope and project teams to ensure overall project success, including customer satisfaction.
2. Develop and proactively manage project plans, including scheduling, identification of risks, contingency plans, issues management, and allocation of available resources.
Project Control & Risk Management:
1. Monitor progress against the overall project plan, leading the team toward successful milestone completion.
2. Identify, communicate and manage project issues and risks, notifying and/or escalating appropriately to the customer or internally.
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a... - Flink Forward
Let’s be honest: running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second while being highly available and correct (in an exactly-once sense) does not work without any planning, configuration and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications. In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective:
- Capacity and resource planning: understand the theoretical limits.
- Memory and CPU configuration: distribute resources according to your needs.
- Setting up high availability: planning for failures.
- Checkpointing and state backends: ensure correctness and fast recovery.
For each of the listed topics, we will introduce the concepts of Flink and provide some best practices we have learned over the past years supporting Flink users in production.
Scylla Summit 2022: Operating at Monstrous Scales: Benchmarking Petabyte Work... - ScyllaDB
ScyllaDB is a distributed database designed to scale horizontally and vertically — in theory. What about in practice? ScyllaDB’s Benny Halevy, Director, Software Engineering, will take you through the process and results of benchmarking our NoSQL database at the petabyte level, showing how you can use advanced features like workload prioritization to control priorities of transactional (read-write) and analytic (read-only) queries on the same cluster with smooth and predictable performance.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Slides from the High Performance Cloud Computing tutorial at Supercomputing 2011 in Seattle. Additional materials available from: cloudsupercomputing.net.
In-memory processing has started to become the norm in large-scale data handling. This is a close-to-the-metal analysis of highly important but often neglected aspects of memory access times and how they impact big data and NoSQL technologies. We cover aspects such as the TLB, Transparent Huge Pages, the QPI link, hyper-threading and the impact of virtualization on high-memory-footprint applications. We present benchmarks of various technologies ranging from Cloudera’s Impala to Couchbase and how they are impacted by the underlying hardware. The key takeaway is a better understanding of how to size a cluster, how to choose a cloud provider and an instance type for big data and NoSQL workloads, and why not every core or GB of RAM is created equal.
MySQL NDB Cluster 8.0: SQL faster than NoSQL - Bernd Ocklin
MySQL NDB Cluster running SQL faster than most NoSQL databases. Benchmark results, comparisons and an introduction to NDB's parallel distributed in-memory query engine. MySQL Day before FOSDEM 2020.
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera - Cloudera, Inc.
Performance is a thing that you can never have too much of. But performance is a nebulous concept in Hadoop. Unlike databases, there is no equivalent in Hadoop to TPC, and different use cases experience performance differently. This talk will discuss advances in how Hadoop performance is measured, and will also cover recent and future performance improvements in different areas of the Hadoop stack.
NYJavaSIG - Big Data Microservices w/ Speedment - Speedment, Inc.
JAVA MICROSERVICES FOR BIG DATA WITH LOW LATENCY - Per-Ake Minborg, CTO Speedment
By leveraging memory-mapped files (e.g. Hazelcast, ChronicleMap, etc.), Speedment supports large Java Maps that can easily exceed the size of your server’s RAM. Because the Java Maps are mapped onto files, these maps can be shared instantly between several microservice JVMs, and new microservice instances can be added, removed or restarted very quickly. Data can be retrieved with predictable ultra-low latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive”. The mapped files can be terabytes in size, which has been done in real-world deployments, and a large number of microservices can share these maps simultaneously.
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability - Ben Stopford
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could use. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence, the ODC departs from the trend set by most caching solutions by holding its data in a normalised form, making it both memory-efficient and easy to change. However, it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop - Databricks
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second-generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bring execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
Have you heard that all in-memory databases are equally fast but unreliable, inconsistent and expensive? This session highlights in-memory technology that busts all those myths.
Redis, the fastest database on the planet, is not simply an in-memory key-value data store, but rather a rich in-memory data-structure engine that serves the world’s most popular apps. Redis Labs’ unique clustering technology enables Redis to be highly reliable, keeping every data byte intact despite hundreds of cloud instance failures and dozens of complete data-center outages. It delivers full CP system characteristics at high performance. And with the latest Redis on Flash technology, Redis Labs achieves close to in-memory performance at 70% lower operational costs. Learn about the best uses of in-memory computing to accelerate everyday applications such as high-volume transactions, real-time analytics, IoT data ingestion and more.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Generating a custom Ruby SDK for your web service or Rails API using Smithy - g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also ran a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for, or limiting to, your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
3. What to take home from this talk?
Answers to four questions:
- Why are memory-based architectures great for cloud computing?
- How predictable is the behavior of an in-memory column database?
- Does virtualization have a negative impact on in-memory databases?
- How do I assign tenants to servers in order to manage fault tolerance and scalability?
4. First question
Why are memory-based architectures great for cloud computing?
5. Numbers everyone should know
- L1 cache reference: 0.5 ns
- Branch mispredict: 5 ns
- L2 cache reference: 7 ns
- Mutex lock/unlock: 25 ns
- Main memory reference: 100 ns (in 2008)
- Compress 1K bytes with Zippy: 3,000 ns
- Send 2K bytes over 1 Gbps network: 20,000 ns
- Read 1 MB sequentially from memory: 250,000 ns
- Round trip within same datacenter: 500,000 ns (in 2008)
- Disk seek: 10,000,000 ns
- Read 1 MB sequentially from network: 10,000,000 ns
- Read 1 MB sequentially from disk: 20,000,000 ns
- Send packet CA → Netherlands → CA: 150,000,000 ns
Source: Jeff Dean
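A quick back-of-the-envelope script with the numbers from this list (hard-coded from the slide) makes the ratios that motivate the rest of the deck explicit:

```python
# Latency numbers hard-coded from the slide, in nanoseconds
NS = {
    "main_memory_ref": 100,
    "read_1mb_memory": 250_000,
    "roundtrip_same_dc": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}

# Reading 1 MB sequentially from disk is 80x slower than from memory
print(NS["read_1mb_disk"] / NS["read_1mb_memory"])  # → 80.0

# One disk seek costs as much as 100,000 main-memory references
print(NS["disk_seek"] // NS["main_memory_ref"])  # → 100000

# A same-datacenter round trip plus a 1 MB memory read on the peer
# is still far cheaper than a single local disk seek
print(NS["roundtrip_same_dc"] + NS["read_1mb_memory"] < NS["disk_seek"])  # → True
```

The last comparison is the core of the argument the following slides make: fetching data from another machine's memory beats touching a local disk.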
6. Memory should be the system of record
- Typically disks have been the system of record
  - Slow → wrap them in complicated caching and distributed file systems to make them perform
  - Memory is used as a cache all over the place, but it can be invalidated when something changes on disk
- Bandwidth:
  - Disk: 120 MB/s/controller
  - DRAM (x86 + FSB): 10.4 GB/s/board
  - DRAM (Nehalem): 25.6 GB/s/socket
- Latency:
  - Disk: 13 milliseconds (up to seconds when queuing)
  - InfiniBand: 1-2 microseconds
  - DRAM: 5 nanoseconds
7. High-end networks vs. disks
Maximum bandwidths:
- Hard disk: 100-120 MB/s
- SSD: 250 MB/s
- Serial ATA II: 600 MB/s
- 10 Gb Ethernet: 1204 MB/s
- InfiniBand: 1250 MB/s (4 channels)
- PCIe flash storage: 1400 MB/s
- PCIe 3.0: 32 GB/s
- DDR3-1600: 25.6 GB/s (dual channel)
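Turning the table into transfer times makes the gap concrete. A small sketch using the table's peak figures, with decimal units assumed throughout (100 GB = 100,000 MB; 25.6 GB/s taken as 25,600 MB/s):

```python
# Peak bandwidths from the table, in MB/s
MB_PER_S = {
    "hard_disk": 120,
    "ssd": 250,
    "sata_ii": 600,
    "ten_gb_ethernet": 1204,
    "infiniband_4x": 1250,
    "pcie_flash": 1400,
    "ddr3_1600_dual": 25_600,
}

def seconds_for_100_gb(mb_per_s):
    # Time to move a 100 GB dataset at the given peak bandwidth
    return 100_000 / mb_per_s

for name, bw in MB_PER_S.items():
    print(f"{name:16s} {seconds_for_100_gb(bw):8.1f} s")
```

Roughly 14 minutes from a hard disk versus about 4 seconds from DRAM, with a 10 Gb network an order of magnitude closer to memory than to disk.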
8. Designing a database for the cloud
- Disks are the limiting factor in contemporary database systems
  - Sharing a high-performance disk on a machine/cluster/cloud is fine/troublesome/miserable
  - While one guy is fetching 100 MB/s, everyone else is waiting
- Claim: two machines + network is better than one machine + disk
  - Log to disk on a single node: > 10,000 µs (not predictable)
  - Transactions only in memory but on two nodes: < 600 µs (more predictable)
- Concept: design to the strengths of the cloud (redundancy) rather than its weaknesses (shared anything)
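The claim can be sanity-checked with the slide's own two figures (reading the mis-encoded unit as microseconds):

```python
DISK_LOG_US = 10_000   # log to disk on a single node: > 10,000 us, not predictable
TWO_NODE_MEM_US = 600  # transaction in memory on two nodes: < 600 us, more predictable

print(f"~{DISK_LOG_US / TWO_NODE_MEM_US:.0f}x faster")  # → ~17x faster

# Per serial commit stream, that is the difference between roughly
# 100 and 1,666 durable transactions per second
print(1_000_000 // DISK_LOG_US, 1_000_000 // TWO_NODE_MEM_US)  # → 100 1666
```

On top of the raw speedup, the network path avoids the long unpredictable tail of disk queuing, which is what the "(more predictable)" qualifier refers to.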
9. Design choices for a cloud database
- No disks (in-memory delta tables + async snapshots)
- Multi-master replication
  - Two copies of the data
  - Load balancing both reads and (monotonic) writes
  - (Eventual) consistency achieved via MVCC (+ Paxos, later)
- High-end hardware
  - Nehalem for high memory bandwidth
  - Fast interconnect
- Virtualization
  - Ease of deployment/administration
  - Consolidation/multi-tenancy
10. Why consolidation?
- In-memory column databases are ideal for mixed-workload processing
- But: in a SaaS environment it seems costly to give everybody their private NewDB box
- How much consolidation is possible?
  - 3 years' worth of sales records from our favorite Fortune 500 retail company
  - 360 million records
  - Less than 3 GB in compressed columns in memory
  - Next door is a machine with 1 TB of DRAM
  - (Beware of overhead)
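The consolidation arithmetic can be made explicit. The 20% allowance below is a hypothetical stand-in for the slide's "beware of overhead" caveat, not a number from the talk:

```python
TENANT_GB = 3            # 360M sales records in compressed columns (slide)
SERVER_GB = 1024         # the 1 TB DRAM machine next door
OVERHEAD_FRACTION = 0.2  # hypothetical allowance for "beware of overhead"

usable_gb = SERVER_GB * (1 - OVERHEAD_FRACTION)
tenants_per_server = int(usable_gb // TENANT_GB)
print(tenants_per_server)  # → 273
```

Even with a generous overhead allowance, hundreds of Fortune-500-sized tenants fit in one box, which is why the next slide asks how to share it.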
11. Multi-tenancy in the database – four different options
- No multi-tenancy – one VM per tenant
  - Ex.: RightNow has 3000 tenants in 200 databases (2007): 3000 vs. 200 Amazon VMs cost $2,628,000 vs. $175,200/year
  - Very strong isolation
- Shared machine – one database process per tenant
  - Scheduler, session manager and transaction manager need to live inside the individual DB processes: IPC for synchronization
  - Good for custom extensions, good isolation
- Shared process – one schema instance per tenant
  - Must support large numbers of tables
  - Must support online schema extension and evolution
- Shared table – use a tenant_id column and partitioning
  - Bad for custom extensions, bad isolation
  - Hard to backup/restore/migrate individual tenants
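The shared-table option is easy to demonstrate end to end. A minimal sketch with SQLite standing in for the real database (table and column names are illustrative, not from the talk):

```python
import sqlite3

# "Shared table": all tenants live in one table, discriminated by tenant_id.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE sales (
        tenant_id INTEGER NOT NULL,
        product   TEXT,
        amount    REAL
    )
""")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "widget", 10.0),
    (1, "gadget", 20.0),
    (2, "widget", 8.5),
])

# Every query must carry the tenant predicate; forgetting it silently mixes
# tenants' rows -- one reason the slide rates this option's isolation "bad".
rows = db.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales WHERE tenant_id = ?", (1,)
).fetchone()
print(rows)  # → (2, 30.0)
```

Backing up or migrating a single tenant means filtering every shared table by `tenant_id`, which is exactly the operational pain the slide points out.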
12. Putting it all together: Rock cluster architecture
[Architecture diagram, garbled in extraction. Legible annotations: extract data from external system; cluster membership and tenant placement; load balancing between replicas; writes forwarded to other replicas.]
13. Second question
How predictable is the behavior of an
in-memory column database?
14. What does “predictable” mean?
" Traditionally, database people are concerned with the questions of type
“how do I make a query faster?”
" In a SaaS environment, the question is
“how do I get a fixed (low) response time as cheap as possible?”
! Look at throughput
! Look at quantiles (e.g. 99-th percentile)
" Example formulation of desired performance:
! Response time goal “1 second in the 99-th percentile”
! Average response time around 200 ms
! Less than 1% of all queries exceed 1,000 ms
! Results in a maximum number of concurrent queries before
response time goal is violated
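The SLO formulation above can be checked numerically. The sketch below uses synthetic exponentially distributed latencies (the distribution is my assumption, chosen only because its mean of 200 ms matches the slide's example):

```python
import random
import statistics

# Synthetic latencies: exponential with mean 200 ms.
random.seed(42)
latencies_ms = [random.expovariate(1 / 200) for _ in range(10_000)]

mean = statistics.fmean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile
slow_fraction = sum(l > 1000 for l in latencies_ms) / len(latencies_ms)

print(f"mean={mean:.0f} ms  p99={p99:.0f} ms  >1s: {slow_fraction:.3%}")
# With mean 200 ms, the exponential's p99 is ~200*ln(100) ~ 921 ms,
# so this workload just meets the "1 second in the 99th percentile" goal.
```

Looking at quantiles rather than averages is the point: the mean alone says nothing about the 1% of queries that violate the goal.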
15. System capacity
" Fixed amount of data split equally among all tenants
" (Setup: each test contained several tenants of a particular tenant
size so that 20% of the available main memory was always used for tenant
data, preloaded into main memory before each run with a six-minute
warm-up; test data was generated by the Star Schema Benchmark data
generator [36].)
[Plot: requests/s ("Measured" points and "Approx. Function") vs. tenant
size in MB.] Figure 4.2 shows the same graph using a logarithmic scale.
This graph is linear, which means the relation can be expressed as:
log(f(tSize)) = m · log(tSize) + n
where n is the intercept with the y-axis and m is the gradient. Both can
be estimated using least-squares regression (see chapter 2.3.2) or
simply by using a slope triangle. The fit gives m ≈ −0.945496 and
n ≈ 3.6174113, which can then be substituted into equation (4.1).
" Capacity " bytes scanned per second
(there is a small overhead when processing more requests)
" In-memory databases behave very linearly!
16. Workload
" Tenants generally have different rates and sizes
" For a given set of T tenants (on one server) define
" When Workload = 1
! System runs at its maximum throughput level
! Further increase of workload will result in violation of
response time goal
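The workload definition itself did not survive extraction. The reconstruction below is my assumption, chosen to be consistent with "capacity ∝ bytes scanned per second": aggregate scan demand (request rate times tenant size, summed over the server's tenants) relative to server capacity, so that 1.0 marks the maximum throughput level:

```python
# Hypothetical reconstruction of the elided workload metric.
def workload(tenants, capacity_mb_per_s):
    """tenants: iterable of (requests_per_s, size_mb) pairs on one server."""
    demand_mb_per_s = sum(rate * size for rate, size in tenants)
    return demand_mb_per_s / capacity_mb_per_s

# Two tenants with different rates and sizes exactly saturating a server:
w = workload([(50, 100), (20, 250)], capacity_mb_per_s=10_000)
print(w)  # 1.0
```

Any definition of this shape captures the slide's point: tenants trade off rate against size, and pushing the sum past 1 violates the response time goal.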
17. Response time
" Different amounts of data and different request rates (“assorted mix”)
" Workload is varied by scaling the request rates
[Plot: 99th-percentile response time in ms (0-2500) vs. workload
(0.2-1.1), for tenant data sets of 1.5 GB, 2.0 GB, 2.6 GB and 3.2 GB,
together with the prediction curve.]
18. Impact of writes
" Added periodic batch writes (fact table grows by 0.5% every 5 minutes)
[Plot: 99th-percentile response time in ms (0-2000) vs. workload
(0.2-1.1), with and without writes, plus the corresponding predictions.]
19. Why is predictability good?
" Ability to plan and perform resource intensive tasks during normal
operations:
! Upgrades
! Merges
! Migrations of tenants in the cluster (e.g. to dynamically
re-balance the load situation in the cluster)
[Figure: cost breakdown for migration of tenants]
20. Definition
Cloud Computing
=
Data Center + API
21. Third question
Does virtualization have a negative impact
on in-memory databases?
22. Impact of virtualization
" Run multi-tenant OLAP benchmark on either:
! one TREX instance directly on the physical host vs.
! one TREX instance inside VM on the physical host
" Overhead is approximately 7% (both in response time and throughput)
[Two plots vs. client threads (0-12): average response time and queries
per second, each comparing virtual and physical deployment.]
23. Impact of virtualization (contd.)
" Virtualization is often used to get “better” system utilization
! What happens when a physical machine is split into multiple VMs?
! Burning CPU cycles does not hurt: memory bandwidth is the
limiting factor
[Plot: response time as a percentage of the response time with 1 active
slot (80-160%) vs. concurrently active VM slots (1-4), for Xeon E5450
and Xeon X5650.]
24. Fourth question
How do I assign tenants to servers
in order to manage fault-tolerance
and scalability?
25. Why is it good to have multiple
copies of the data?
" Scalability beyond a certain number of concurrently active users
" High availability during normal operations
" Alternating execution of resource-intensive operations (e.g. merge)
" Rolling upgrades without downtime
" Data migration without downtime
" Reminder: Two in-memory copies allow faster writes and are more
predictable than one in-memory copy plus disk
" Downsides:
! Response time goal might be violated during recovery
! You need to plan for twice the capacity
26. Tenant placement
[Diagram: conventional mirrored layout vs. interleaved layout. Mirrored:
tenants T1-T4 on one node, fully duplicated on its mirror. Interleaved:
the two copies of each tenant T1-T6 spread across different node pairs.]
If a node fails, all work moves to If a node fails, work moves to
one other node. The system must many other nodes. Allows
be 100% over-provisioned. higher utilization of nodes.
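The failover argument can be illustrated with a toy model. The layouts and the unit-load-per-tenant assumption below are my own simplification, not the talk's simulator:

```python
# Toy failover model: each server holds tenant copies (unit load each);
# on failure, each affected tenant's work moves to its surviving replica.
def peak_load_after_failure(placement, failed):
    """placement: server -> set of tenant ids (each tenant has 2 copies)."""
    load = {s: len(ts) for s, ts in placement.items() if s != failed}
    for tenant in placement[failed]:
        survivor = next(s for s, ts in placement.items()
                        if s != failed and tenant in ts)
        load[survivor] += 1  # survivor absorbs the failed share
    return max(load.values())

mirrored = {0: {1, 2}, 1: {1, 2}, 2: {3, 4}, 3: {3, 4}}
interleaved = {0: {1, 2}, 1: {1, 3}, 2: {2, 4}, 3: {3, 4}}
print(peak_load_after_failure(mirrored, failed=0))     # 4: one node doubles
print(peak_load_after_failure(interleaved, failed=0))  # 3: spread over two
```

With mirroring, one survivor takes 100% extra load; interleaving spreads the same work across several nodes, which is exactly why it allows higher steady-state utilization.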
27. Handcrafted best case
" Perfect placement:
1 1 4 4 7 7 1 4 7 1 2 3
! 100 tenants
2 2 5 5 8 8 2 5 8 4 5 6
! 2 copies/tenant 3 3 6 6 9 9 3 6 9 7 8 9
! All tenants have same size
Mirrored Interleaved
! 10 tenants/server
! Perfect balancing (same load on all tenants):
! 6M rows (204 MB compressed) of data per tenant
! The same (increasing) number of users per tenant
! No writes
                          Mirrored     Interleaved   Improvement
No failures               4218 users   4506 users     7%
Periodic single failures  2265 users   4250 users    88%
(Throughput before violating the response time goal)
28. Requirements for placement algorithm
" An optimal placement algorithm needs to cope with
multiple (conflicting) goals:
! Balance load across servers
! Achieve good interleaving
" Use migrations consciously for online layout improvements
(no big bang cluster re-organization)
" Take usage patterns into account
! Request rates double during last week before end of quarter
! Time-zones, Christmas, etc.
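A minimal greedy heuristic addressing the first two goals (balance and interleaving) can be sketched as follows. This is my illustration, not the placement algorithm from the talk, and it ignores migrations and usage patterns:

```python
import heapq

# Greedy sketch: put each tenant's two copies on the two currently
# least-loaded servers. Placing big tenants first improves balance;
# varying least-loaded pairs yields natural interleaving.
def place(tenant_sizes, n_servers):
    heap = [(0.0, s) for s in range(n_servers)]  # (load, server)
    heapq.heapify(heap)
    placement = {}
    for tenant, size in sorted(tenant_sizes.items(),
                               key=lambda kv: -kv[1]):
        (l1, s1), (l2, s2) = heapq.heappop(heap), heapq.heappop(heap)
        placement[tenant] = (s1, s2)  # two distinct servers by construction
        heapq.heappush(heap, (l1 + size, s1))
        heapq.heappush(heap, (l2 + size, s2))
    return placement

layout = place({"T1": 3.0, "T2": 2.0, "T3": 2.0, "T4": 1.0}, n_servers=3)
print(layout)
```

A real algorithm would additionally weight expected request rates (the end-of-quarter and time-zone effects) and prefer placements reachable via few, cheap migrations.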
29. Conclusion
" Answers to four questions:
! Why are memory based architectures great for cloud computing?
! How predictable is the behavior of an in-memory column database?
! Does virtualization have a negative impact on in-memory databases?
! How do I assign tenants to servers in order to manage fault-tolerance and
scalability?
" Questions?