This is an overview of how distributed data grids enable data sharing across web servers and virtual cloud environments to provide scalability and high availability. It also covers how distributed data grids support running MapReduce analysis across large data sets.
It covers the problems of achieving scalability in server-farm environments and how distributed data grids provide in-memory storage that boosts performance, and includes a summary of ScaleOut Software product offerings, including ScaleOut State Server and Grid Computing Edition.
2. Agenda
• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based Map/Reduce
WSTA Seminar
3. The Need for Memory-Based Storage
Example: Web server farm:
• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
[Diagram: load-balancer feeding a Web server farm and an app. server farm, both sharing a distributed, in-memory data grid; the database server and RAID disk array form the bottleneck.]
4. The Need for Memory-Based Storage
Example: Cloud application:
• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data from cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage.
[Diagram: cloud application VSs backed by a distributed data grid of grid VSs, in front of cloud-based storage.]
5. What is a Distributed Data Grid?
• A new “vertical” storage tier:
– Adds missing layer to boost performance.
– Uses in-memory, out-of-process storage.
– Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
– Allows data sharing among servers.
– Scales performance & capacity.
– Adds high availability.
– Can be used independently of backing storage.
[Diagram: storage hierarchy from processor cache and L2 cache, through in-process application memory, to the out-of-process distributed cache and backing storage.]
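The “vertical” tiering described above can be sketched as a read-through chain: check the fast in-process cache first, then the out-of-process distributed cache, and only then the backing store. This is an illustrative sketch of the idea, not any vendor's API; all class and method names are hypothetical, and plain dicts stand in for the remote tiers.

```python
# Illustrative sketch of a "vertical" storage tier: each read falls
# through from fast, local storage to slower, shared storage.
# All names here are hypothetical, not a real product API.

class TieredStore:
    def __init__(self, backing_store):
        self.in_process = {}          # per-server, in-memory ("near") cache
        self.distributed = {}         # stand-in for the out-of-process data grid
        self.backing_store = backing_store

    def get(self, key):
        if key in self.in_process:                 # fastest: no network hop
            return self.in_process[key]
        if key in self.distributed:                # one network hop to the grid
            value = self.distributed[key]
            self.in_process[key] = value           # promote into the near cache
            return value
        value = self.backing_store[key]            # slowest: disk-backed store
        self.distributed[key] = value              # avoid the next trip to disk
        self.in_process[key] = value
        return value

store = TieredStore(backing_store={"cart:42": ["book", "pen"]})
print(store.get("cart:42"))   # read-through: fills both cache tiers
print(store.get("cart:42"))   # now served from the in-process cache
```

The point of the chain is that repeated reads of the same object never touch backing storage again, which is the “avoids repeated trips” benefit named on the slide.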
6. Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
– Transparent to the application.
– Holds recently accessed data.
• Boosts performance:
– Eliminates repeated network data transfers & deserialization.
– Reduces access times to near “in-process” latency.
– Is automatically updated if the distributed grid changes.
– Supports various coherency models (coherent, polled, event-driven).
[Diagram: in-process application memory and client-side cache in front of the out-of-process distributed cache.]
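One way to realize the “polled” coherency model mentioned above is to timestamp each locally held entry and re-check the grid once the entry is older than a polling interval. A minimal sketch under that assumption (all names hypothetical, with a dict standing in for the distributed grid):

```python
import time

class NearCache:
    """Client-side cache that re-polls the grid after poll_interval seconds."""

    def __init__(self, grid, poll_interval=2.0):
        self.grid = grid                  # stand-in for the distributed grid
        self.poll_interval = poll_interval
        self.entries = {}                 # key -> (value, time_fetched)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None:
            value, fetched = entry
            if time.monotonic() - fetched < self.poll_interval:
                return value              # fresh enough: no network round trip
        value = self.grid[key]            # (re)fetch from the distributed cache
        self.entries[key] = (value, time.monotonic())
        return value

grid = {"price:AAPL": 187.0}
cache = NearCache(grid, poll_interval=0.1)
print(cache.get("price:AAPL"))   # fetched from the grid: 187.0
grid["price:AAPL"] = 188.5       # grid updated elsewhere
print(cache.get("price:AAPL"))   # within the interval: still 187.0 (stale)
time.sleep(0.15)
print(cache.get("price:AAPL"))   # interval elapsed: re-polls, returns 188.5
```

The trade-off shown here is exactly the slide's point: polled coherency caps staleness at the polling interval, while an event-driven model would instead push invalidations from the grid to each near cache.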
7. Performance Benefit of Client-side Cache
• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.
[Chart: average response time in microseconds (0–3500) for 10KB objects at a 20:1 read/update ratio, comparing the DDG to a DBMS.]
8. Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or network link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using scalable, “map/reduce” analysis
[Chart: access latency (msec) vs. throughput (accesses/sec), comparing the grid to a DBMS.]
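Benefit 5, in-grid “map/reduce” analysis, amounts to running the map step on each partition where the data already lives and then merging the partial results. A minimal sketch of that pattern using Python's multiprocessing as a stand-in for the grid's servers (the partition contents and names are hypothetical):

```python
from multiprocessing import Pool

# Each partition holds trade amounts that already live on one grid server.
partitions = [
    [120.0, 75.5, 310.0],      # partition on server 1
    [89.9, 45.0],              # partition on server 2
    [500.0, 12.25, 33.1],      # partition on server 3
]

def map_partition(trades):
    """Map step, run where the data lives: count and sum one partition."""
    return (len(trades), sum(trades))

def reduce_results(partials):
    """Reduce step: merge the per-partition (count, total) pairs."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total

if __name__ == "__main__":
    with Pool(3) as pool:                      # one worker per "grid server"
        partials = pool.map(map_partition, partitions)
    count, total = reduce_results(partials)
    print(count, round(total, 2))              # prints: 8 1185.75
```

Because only the small (count, total) pairs cross the network rather than the raw objects, this layout avoids moving the data set to the analysis, which is what makes the approach scale.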
9. Scaling the Distributed Data Grid
• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
– Avoid centralized scheduling to eliminate hot spots.
– Use data partitioning and maintain load-balance to allow scaling.
– Use fixed vs. full replication to avoid n-fold overhead.
– Use low overhead heart-beating.
• Example of linear throughput scaling:
[Chart: read/write throughput for 10KB objects, scaling from roughly 20,000 to 80,000 accesses/second as the grid grows from 4 to 64 nodes (16,000 to 256,000 objects).]
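The data-partitioning point above is commonly implemented by hashing each key to one of a fixed number of partitions, which are then spread evenly across the grid servers; rebalancing moves whole partitions rather than individual objects. A minimal sketch of one such scheme (partition count, names, and round-robin placement are illustrative assumptions, not a specific product's algorithm):

```python
import hashlib

NUM_PARTITIONS = 16          # fixed partition count, independent of server count

def partition_of(key):
    """Hash a key to a stable partition id (md5 keeps it stable across runs)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def assign_partitions(servers):
    """Spread partitions round-robin across servers. Adding a server moves
    only the partitions rebalanced onto it, not every stored object."""
    return {p: servers[p % len(servers)] for p in range(NUM_PARTITIONS)}

servers = ["grid-1", "grid-2", "grid-3"]
placement = assign_partitions(servers)
key = "portfolio:9001"
p = partition_of(key)
print(f"{key} -> partition {p} on {placement[p]}")
```

Because every client computes the same hash, any server (or client library) can route a request directly to the owning node with no centralized scheduler, which is how the hot-spot bullet above is avoided.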
10. Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot
spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Diagram: a client application's library retrieves a cached copy of an object from the distributed cache; each grid server runs a cache service holding object copies and replicas, connected over Ethernet]
11. Wide Range of Applications
Financial Services:
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations
• News story caching
E-commerce:
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
Other Applications:
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control
• Shopping carts
• Social networking
• Service call tracking
• Online surveys
12. Importance for Cloud Computing
• Cloud computing:
– Makes elastic resources readily available, but…
– Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
– Allow data sharing across a group of virtual servers.
– Elastically scale throughput as needed.
– Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data
analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
13. DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.
[Diagram: a cloud-hosted distributed data grid serving the cloud application's virtual servers, automatically migrating data to and from an on-premise distributed data grid (with a backing store) that serves the user's on-premise applications]
14. DDGs Enable Seamless Global Access
[Diagram: a global distributed data grid linking mirrored data centers and satellite data centers, each running its own distributed data grid of SOSS servers]
15. Introducing Parallel Data Analysis
• The goal:
– Quickly analyze a large set of data for patterns and trends.
– How? Run a method E (“eval”) across a set of objects D in parallel.
– Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
– '80s: “SIMD/SPMD” (Flynn, Hillis)
– '90s: “Domain decomposition” (Intel, IBM)
– '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
– Search, financial services, business intelligence, simulation
[Diagram: eval method E applied in parallel to a grid of data objects D, with merge method M combining the partial outputs into a single result]
16. Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new
trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single
stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…
17. Stage the Data for Analysis
• Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
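Step 1 can be sketched as follows. The `grid` object here is a stand-in for a real data-grid client (a plain dict keyed by ticker illustrates the one-object-per-symbol layout); the `StockHistory` shape is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class StockHistory:
    ticker: str
    prices: list          # daily closing prices, oldest first

def populate_grid(grid, histories):
    """Stage one grid object per ticker symbol so the analysis
    can later run on each history independently and in parallel."""
    for h in histories:
        grid[h.ticker] = h
```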
18. Code the Eval and Merge Methods
• Step 2: Write a method to evaluate a stock history based on parameters:
Results EvalStockHistory(StockHistory history, Parameters params)
{
<analyze trading strategy for this stock history>
return results;
}
• Step 3: Write a method to merge the results of two evaluations:
Results MergeResults(Results results1, Results results2)
{
<merge both results>
return results;
}
• Notes:
– This code can be run as a sequential calculation on in-memory data.
– No explicit accesses to the distributed data grid are used.
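The two methods above can be sketched as runnable Python. The back-test itself is simplified to a hypothetical trailing-mean strategy, and all names and parameters are illustrative; the point is only that E is a plain function over one object and M is a plain reduction over two results:

```python
from dataclasses import dataclass
from collections import namedtuple

StockHistory = namedtuple("StockHistory", ["ticker", "prices"])

@dataclass
class Results:
    trades: int
    profit: float

def eval_stock_history(history, window):
    """Toy method E: 'buy' when the price rises above its trailing
    mean, 'sell' on the next tick; counts trades and sums profit."""
    trades, profit = 0, 0.0
    prices = history.prices
    for i in range(window, len(prices) - 1):
        mean = sum(prices[i - window:i]) / window
        if prices[i] > mean:              # signal: price above trailing mean
            trades += 1
            profit += prices[i + 1] - prices[i]
    return Results(trades, profit)

def merge_results(r1, r2):
    """Method M: combining two partial results is a simple reduction."""
    return Results(r1.trades + r2.trades, r1.profit + r2.profit)
```

As the notes say, nothing here touches the grid; the same functions run unchanged whether invoked sequentially or fanned out across grid servers.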
19. Run the Analysis
• Step 4: Invoke parallel evaluation and merging of results:
Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
[Diagram: Invoke runs EvalStockHistory() in parallel across the grid and combines the outputs with MergeResults()]
20. [Diagram: the client starts the parallel analysis; .eval() runs on each stock history in parallel, and pairwise .merge() steps combine the per-object results level by level until a single result is returned to the client]
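The eval-then-pairwise-merge flow above can be sketched with a thread pool standing in for the grid's parallel scheduler. This is a simplification under an obvious assumption: a real grid runs eval on the server where each object lives, whereas here everything runs in one process:

```python
from concurrent.futures import ThreadPoolExecutor

def invoke(eval_fn, merge_fn, objects):
    """Run eval_fn on every object in parallel, then combine the
    partial results with pairwise merges (a binary merge tree)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(eval_fn, objects))
    # Merge pairs level by level until one result remains.
    while len(results) > 1:
        pairs = zip(results[0::2], results[1::2])
        merged = [merge_fn(a, b) for a, b in pairs]
        if len(results) % 2:          # odd one out carries forward
            merged.append(results[-1])
        results = merged
    return results[0]
```

The pairwise tree matters because merges at each level also run independently, so combining N partial results takes O(log N) merge rounds rather than N sequential merges.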
21. DDG Minimizes Data Motion
• File-based map/reduce must move data to memory for analysis:
[Diagram: each M/R server must pull data D from the file system or database into server memory before running E]
• Memory-based DDG analyzes data in place:
[Diagram: each grid server runs E directly on the data D it already holds in the distributed data grid]
22. [Diagram: the same parallel eval/merge flow, but with file I/O inserted before each .eval() and around the .merge() steps, illustrating the data-motion overhead of file-based map/reduce]
23. Performance Impact of Data Motion
Measured random access to DDG data to simulate file I/O:
24. Comparison of DDGs and File-Based M/R
                       DDG                         File-Based M/R
Data set size          Gigabytes -> terabytes      Terabytes -> petabytes
Data repository        In-memory                   File / database
Data view              Queried object collection   File-based key/value pairs
Development time       Low                         High
Automatic scalability  Yes                         Application dependent
Best use               Quick-turn analysis of      Complex analysis of
                       memory-based data           large datasets
I/O overhead           Low                         High
Cluster mgt.           Simple                      Complex
High availability      Memory-based                File-based
25. Walk-Away Points
• Developers need fast, scalable, highly available and sharable
memory-based storage for scaled out applications.
• Distributed data grids (DDGs) address these needs with:
– Fast access time & scalable throughput
– Highly available data storage
– Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
– Support scalable data access for “elastic” applications.
– Efficiently and easily migrate data across sites.
– Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
– Make it easy to develop applications and configure clusters.
– Avoid file I/O overhead for datasets that fit in memory-based grids.
– Deliver automatic, highly scalable performance.
26. Distributed Data Grids for
Server Farms & High Performance Computing
www.scaleoutsoftware.com