1. Deploying a Highly Performing,
Scalable and Available
Blackboard Solution
Steve Feldman,
sfeldman@blackboard.com
2. performance* The amount of useful work
accomplished by a computer system compared to
the time and resources used.
Alternative Definition: Response time plus latency.
3. scalability* The ability for a distributed
system to expand by accommodating greater levels
of load while maintaining similar levels of
performance.
4. availability* The capability to service a
functional request without issue under conditions of
desired performance and workload scalability.
5. What We’ll Cover
• In the beginning there was Performance…
• Which came first…Scalability or the Egg?
• How much availability do I really need?
• 2010 BbWorld conference theme…deploy for
performance.
• Continuous measurements are absolutely critical.
• Collaborative monitoring solutions with Quest
Software
7. The Online Momentum Shift
• 66% of degree-granting post-secondary institutions in
the US offer online, hybrid/blended online and other
distance education courses.1
• Over 4.6 million students were taking at least one online
course during the fall 2008 term; a 17 percent increase
over the number reported the previous year.2
• The 17 percent growth rate for online enrollments far
exceeds the 1.2 percent growth of the overall higher
education student population.
• By 2020, 50% of high school students will take an online
course.1
8. Communities are Getting Larger
• State and County Initiatives
• Consortium Programs and
strategic alliances between
institutions.
• Content distribution networks
• New sources of revenue to
reach markets and students
that were not historically
accessible
– Non-traditional students are
being marketed to
9. Stakes are Getting Higher
• Competition for government funding
• Competition for student revenue
• Learning modality changing with each
technological innovation
• User expectations and online behavior
changing constantly
• Hours of availability fighting toward
mission critical
– VLEs are often identified as 24x7 mission
critical systems, but the resources to support
them are more like 8 x 5
11. Areas of Consideration Tied to Online
Learning
• Educational Continuity
• High Availability, Scalability and Disaster Recovery
• Expectations and Impatience
• The Cost of Doing Business
– Hidden costs
– Justifiable costs (things you should bring back to work)
– Costs to plan for the future
17. The Blackboard Profile Shift
• Existing customers approached Bb about hybrid
eLearning modalities:
– Creative growth opportunities
– Accessible communities
– Competitive offerings
• New customers approached Bb about 100% online
programs.
– Struggling with competitor systems and/or home-grown offerings.
• Customers wanted “proven” solutions not just from Bb,
but from recognized vendors.
24. What is Performance?
• Simple Definition:
– Performance = Response Time + Latency
• Performance is quantifiable and measurable
• Performance is also perception
• Mostly recognized from a cognitive perspective
– Instantaneous
– Immediate
– Continuous
– Captive
25. Realistic Views of Performance
• Should all my pages respond the same?
• Will my response times vary because of the browser I
use?
• Is it acceptable for the application to respond differently
for the same exact page request at different times of the
day?
• As the administrator, am I responsible only for time to
first byte, or for end-to-end response time?
• Do I understand the expectations of my users? Are they
satisfied with the response times they are receiving?
• Do I understand the patterns of my users?
26. Realistic Approaches to Achieve Performance
• Eliminate interface and resource contention.
– Better to have more capacity than queuing
• Know your user behavior.
• Optimize for saturated and low-bandwidth network
conditions.
– Enable Compression
– Optimize Images
– Cache Static Content
• Large JVM memory allocations are not a bad thing, but
rather something to expect with Java-based applications.
– Large JVM (4GB to 16GB) with aggressive options you understand.
• Two keys to the database
– Continuous maintenance
– Understand the key queries and how the cost-based optimizer (CBO) handles them
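As a rough illustration of the "Enable Compression" item above, the following Python sketch gzips a repetitive, HTML-like payload and reports the fraction of bytes saved. The payload and the function name `compression_savings` are invented for illustration. Real savings depend on the actual page content and on how compression is configured on the web tier.

```python
import gzip


def compression_savings(payload: bytes, level: int = 6) -> float:
    """Return the fraction of bytes saved by gzip-compressing payload."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload, compresslevel=level)
    return 1.0 - len(compressed) / len(payload)


# Hypothetical page body: markup is repetitive, so it compresses well.
sample_page = b"<tr><td class='grade'>Assignment</td><td>100</td></tr>\n" * 500

if __name__ == "__main__":
    saved = compression_savings(sample_page)
    print(f"original: {len(sample_page)} bytes, saved: {saved:.0%}")
```

Already-compressed assets such as JPEG images gain little from gzip, which is why image optimization appears as a separate item in the list.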
30. Flexible and Scalable Application Deployment
• An ideal deployment will contain…
– Availability at every edge of the application environment
• Strategy: Physical distribution of load-balanced systems
• Strategy: Minimum DB recovery, not necessarily 0 downtime
– Consumption of every possible machine resource
• Strategy: Virtualization provisioning
– Techniques for improving user experience
• Strategy: Techniques and tools for achieving page-level SLAs
– Large addressable memory spaces
• Strategy: 64-bit and large OS process space allocations
32. Flexible Deployments
• Emphasis on adoption of virtualization technologies
– Virtualization technology transparent to guest OS and
application.
– Why: Take advantage of CPU and Memory expansion
• Emphasis on fast provisioning
– Provisioning technology such as Dell AIM, VMware
deployment technology and XenServer deployment
technology
– Why: These are solved problems that minimize human error
and speed up deployment.
• Emphasis on diskless systems
– Hardware is just “rented” space for CPU, Memory and
Network.
– Why: Network and storage are now so fast that there is
little reason to depend on “wired” solutions.
34. Reliable Deployments
• Emphasis on distributed computing
– De-emphasize clustering and push heavy load-balancing and
virtualization.
– Clustering was the approach before our 64-bit offering
• Emphasis on active/passive availability solutions at the
DB
– SQL Server cluster and Oracle RAC One
– Availability != Scalability
• Emphasis on diskless hardware systems using
enterprise network boot storage
• Fault tolerance is expensive and has to be strategically
identified…
– Focus on sources of greatest probability of failure
36. Responsive Deployments
• Large 64-bit address space…
– It’s cheaper today than 4 years ago
– Technology is heading this direction
– It’s not a bad thing…
• Plentiful CPU worker threads…
– Use only what you need
– Take advantage of hyperthreading and MT technology
– Partition via virtualization
• Many bigger…distributed environments
• Continuous maintenance
– If you want to make your systems remain fast, you have to
“service” the roads. Lots of litter and potholes out there.
38. Efficient Deployments
• Emphasis on blade and rack mount systems
– Space management
– Power and Conditioning Control
• Emphasis on virtualization
– Efficient utilization of CPU and Memory resources that are
quickly exploding
• Shared enterprise networked storage
– Capable of OS/VM boot partitions
– Application Binary Installation
– Non-Relational File Content
– Relational Content
• Network optimized
– Compression, HTTP optimization
– Be wary of proxies that disable Gzip compression
40. Adaptive Deployments
• Pooled resources via virtualization and consolidated
storage
• Deployment/Provisioning considerations
– Dell AIM (Recent acquisition of Scalent)
– VMware Provisioning Software (vCenter/vMotion)
41. Flexible and Scalable Application Deployment
• An ideal deployment will contain…
– Minimum Storage Recovery Time
• Strategy: Enterprise storage with Snapshot capabilities
– Advanced monitoring for operations and planning
• Strategy: Measurement tools and analytics
– Automation…Automation…Automation
• Strategy: Investment in repeatable, reliable automated processes.
42. Deployment: Resource Utilization
• Moore’s law is in full effect
– CPUs are getting faster with more cores
– Memory is in abundance and cheap
– Storage is grossly abundant
• Massive systems can be obtained at low cost, but
cannot be saturated in stand-alone configurations.
• Virtualization offers the opportunity…
– Deploy with availability in mind
– Saturate system resources
43. Deployment: Large Address Space
• As of Blackboard Learn™ Release 9.1 all supported/
certified configurations include a 64-bit option.
• Pushing more processing to client and DB over the last
few releases, but major memory management technique
is to use more application caches.
– Memory stays persistent longer
– Less wasteful from a creation/destruction perspective, but puts
greater demands on larger spaces.
• Most of our application testing focused on 4GB and 8GB
JVM deployments on 6GB and 10GB OS spaces.
– Limited testing at 16GB and 32GB
47. What is Availability?
• High-availability offerings mask the effects of a
system failure in order to minimize the impact of
access and functional use of a system to a
community of users.
• Simple Definition:
– Percentage of time the system is in its operational state.
• You will often hear the concept of 3x9’s, 4x9’s or
even 5x9’s
– Planned versus Unplanned
• Availability = (Total Units of Time – Downtime) /
Total Units of Time
– 8760 hours in a year
– Downtime = 10 hours
– Availability = (8760 – 10)/8760 ≈ 99.89%
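The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `availability` and the input validation are illustrative choices, and only the formula and the 8760-hour year come from the slide.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, as on the slide


def availability(total_hours: float, downtime_hours: float) -> float:
    """Availability = (total time - downtime) / total time, as a fraction."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return (total_hours - downtime_hours) / total_hours


if __name__ == "__main__":
    # The slide's example: 10 hours of downtime in one year.
    print(f"{availability(HOURS_PER_YEAR, 10):.4%}")
```

Whether planned maintenance counts as downtime is a policy decision. The function only does the arithmetic on whatever downtime figure it is given.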
48. Quick View into Availability Statistics
Availability Percentage Model    Unexpected Downtime per Year
90%                              36.5 days
95%                              18.25 days
98%                              7.30 days
99%                              3.65 days
99.5%                            1.83 days
99.8%                            17.52 hours
99.9%                            8.76 hours
99.95%                           4.38 hours
99.99%                           52.6 minutes
99.999%                          5.26 minutes
99.9999%                         31.5 seconds
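The downtime figures in this table follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year. The helper names `downtime_seconds_per_year` and `humanize` are hypothetical and not part of any monitoring product.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year, matching 8760 hours


def downtime_seconds_per_year(availability_pct: float) -> float:
    """Seconds of downtime per year implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * SECONDS_PER_YEAR


def humanize(seconds: float) -> str:
    """Render a duration in the largest convenient unit."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {humanize(downtime_seconds_per_year(pct))}")
```

This makes it easy to try targets that are not in the table, such as 99.7%, when negotiating an SLA.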
49. Realistic Views of Availability
• If the application is not functioning as expected, but you
can login, is it available?
– Perception versus Reality
– If it’s slow, do my users feel just as bad as if they received an
error?
• How do you plan for unexpected?
– Practice really does make perfect
• Do I treat the calendar from a date and time perspective
differently from an availability perspective?
– Will my users cause problems if I take the site down during low
usage periods/dates?
– Will the users even know that something happened?
– Can I recover fast enough?
50. Realistic Approaches to Achieve Availability
• Strategically picking redundancy in the architecture.
– Servers and storage make sense to a degree
– Monitoring makes sense
– Do advanced clustering architectures really make a difference?
– Do the costs of a dedicated DR facility and site make sense?
• Choosing the right initiatives based on the resources
available to manage
– Don’t set your administrators up to fail.
– If you don’t have the capabilities on-site, don’t be skeptical of
outsourcing the problem.
• Balance costs over goals
– Choose the right places to put your pennies.
– Make the business drive the decision…it’s their money!
51. Deployment: Availability
• VLEs are different beasts today than in the past.
– Communities are bigger
– Sessions last longer
– Content is richer
– Key point: Adoption is greater and users expect their sites up 24 x
7 x 365
• Architecture is designed for many parallel instances of the
product scaled in a horizontal fashion.
– Distributed physical deployments
– Virtualization is a key element
• Database failover more important than horizontal
database scalability.
– Emphasis on vertical database scalability
53. Pros and Cons of SQL Server Clustering
Pros of Clustering
• Reduces overall downtime for both planned and unplanned
situations
• Easy to Configure and Manage
• Simplifies management of patches and upgrades
• Mean time to Recovery is sub-5 seconds in most situations
Cons of Clustering
• Does not account for Active/Active for Monolithic
Applications
• Differences between SQL 2005 and 2008
• More expensive than alternative failover approaches
• Requires more dedicated DBA personnel on-board
54. Pros and Cons of Oracle RAC
Pros of RAC
• Reduces overall downtime for both planned and unplanned
situations
• Can improve overall scalability with increased parallel
nodes able to handle concurrent and competing requests.
• Seamless integration with applications like Blackboard,
making it easier to stand up an enterprise application in a
RAC environment.
• Has option for Active/Active and Active/Passive for
monolithic schemas.
Cons of RAC
• Very pro-Oracle utilities and licensing, which can make RAC
beyond expensive.
• Performance can suffer dramatically due to basic
configuration challenges and the ultimate complexity of RAC.
• Notion that developers do not have to programmatically
account for RAC is not true. Certain SQL operations can be
harmful in a RAC environment.
• Requires more dedicated DBA personnel on-board
55. Deployment: Storage MTTR
• Reference architecture pushes for “diskless” boots in
which an iSCSI or NFS partition resides on an enterprise
storage system.
• Both OS/VM partition and data partition served up from
remote storage deployment designed for performance
and scalability.
– Make your hardware work from a CPU, Memory and Network
perspective…save the Disk for the experts.
• Consider scenarios for reducing “Mean Time to
Recovery or Repair”
– Snapshot technology offering minutes for recovery
57. Deployment: Advanced Monitoring
• Measurement is the secret sauce for successful
deployments.
– Most reliable and scalable deployments measure beyond
the server infrastructure
• Different types of measurements
– System/Environmental measurements
– Business measurements
– Synthetic measurements
• Collecting is only part of the prize
– Need to analyze the data to drive business decisions from
the data.
59. Lifecycle of Measurement
• Define Metrics: Goal Setting
• Implement Instrumentation: Begin Measuring
• Prepare Reporting: Generate Reports
• Convince Stakeholders: Distribute Reports
• Align to KPI/ROI: Share Results with Stakeholders
• Recommend Changes: Show Business Value
• Reset Expectations: New Initiatives
60. Different Types of Monitoring
• Synthetic Monitoring
• Real User Monitoring
• Performance Forensic Monitoring
61. What is Synthetic Monitoring?
• Automated monitoring technique to measure the
functional behavior of a system, sub-system or
component.
• Typically a scheduled activity used to measure the
availability, responsiveness and functional attributes
of a common application scenario.
• Can be executed from any access point to the
system in question, either internal or external.
• Also considered “Active” Monitoring of a system
• Not intended to supply load, but rather perform
sampling of performance and availability
• Two methods:
– HTTP Simulation or Real Browser Emulation
62. Tools for Synthetic Transactions
• You can really use any form of HTTP emulation tool
like JMeter, Grinder, MSTS, LoadRunner,
SilkPerformer, SOASTA, etc…
• Some monitoring software systems like Foglight,
SiteScope, Nagios, CA IntroScope, Argent
Defender
• External services: Keynote, Gomez (Compuware),
WebMetrics, AlertSite, Pingdom, SiteUpTime
• Browser based solution: Selenium
63. Strategies for Synthetic Transactions
• Site and Host Ping Tests should run on a multi-
second basis (15s to 30s)
• Common, yet critical paths targeting functional
systems for availability should run on a continuous
interval (x < 5 minutes).
• Complicated paths focusing on performance and
availability should run every 30 to 60 minutes.
• Repeat tests when the desired SLA or outcome is not
achieved
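The strategies above can be sketched in Python as a probe plus a runner that repeats the test when the SLA is missed. This is an illustrative outline, not a replacement for the tools listed earlier. The names `http_probe`, `run_check`, `CheckResult` and `INTERVALS_S` are hypothetical, the interval values are picked from the ranges on the slide, and a real deployment would hand the scheduling to cron or a monitoring product.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable

# Intervals in seconds, chosen from the ranges on the slide (names are illustrative).
INTERVALS_S = {"ping": 30, "critical_path": 4 * 60, "complex_path": 60 * 60}


@dataclass
class CheckResult:
    ok: bool
    elapsed_s: float
    attempts: int


def http_probe(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def run_check(probe: Callable[[], bool], sla_s: float,
              max_attempts: int = 2) -> CheckResult:
    """Run probe and repeat it when it fails or exceeds the SLA."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        ok = probe()
        elapsed = time.perf_counter() - start
        if ok and elapsed <= sla_s:
            return CheckResult(True, elapsed, attempt)
    # Every attempt failed or was too slow; report the last timing.
    return CheckResult(False, elapsed, max_attempts)
```

A caller might wrap a login-page URL (a placeholder here) as `run_check(lambda: http_probe("https://learn.example.edu/"), sla_s=3.0)` and raise an alert when `ok` is false.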
64. Why Synthetic Transactions are Critical
• Knowing is half the battle…
• Organic growth of transaction data available for
comprehensive analytics.
– Am I meeting my SLAs?
– Are my users experiencing response challenges?
– Do I experience issues everywhere or just specific parts of
my system, sub-system or component?
• I could use Real User Transactions, but is it really
fair to compare?
– Continuous baseline comparison test about a “known” or
expected experience.
• Probably the most important…we monitor to protect
our community!
65. What is Real User Experience Monitoring?
• Passive web monitoring that observes web traffic to
measure the user experience.
• Provides both quality of service and responsiveness
metrics in order to gauge service levels of performance
and availability.
• Typically a continuous activity watching silently in a
parallel channel or as a pass through channel.
• Able to capture characteristics about the entire HTTP
stream to be used for forensics and user incidents.
• Most vendors package as an appliance, but beginning to
see the rise of “virtual” appliances.
• Synthetic monitoring is just not enough…
66. Tools for RUM Monitoring
• Dominated by commercial vendors who have a niche in
web performance and/or application performance
management.
– Quest FxM
– Coradiant TrueSight
– Oracle Real User Experience Insight
– Tealeaf
– CA/NetQoS
• Rise in new tools coming from network equipment
vendors like Cisco, Opnet and Citrix/NetScaler
67. Strategies for RUM Monitoring
• Identify areas of dense usage in order to highlight
performance, availability and functional experience in
most common components of system.
• Start with a wide lens of traffic watching and slowly
narrow the area of focus to minimize the “purge” of data.
• The “purge” of data is going to happen, so be prepared
to move the data out of the system into an alternative
repository.
– Some of the vendors have already solved this problem via an
Enterprise Data Warehouse (eg: Coradiant BI)
• Most of these tools can show
– Time to First Byte, Host Latency, Network Latency and E2E
• Avoid the trap of focusing on Time to First Byte
– You are serving an entire application from client to server
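To see why time to first byte alone can mislead, the sketch below splits one page view into its phases. The `PageTiming` fields and the sample numbers are hypothetical. Each RUM product defines and collects these metrics in its own way.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PageTiming:
    """Timestamps in seconds, relative to the moment the request was sent."""
    first_byte_s: float   # first response byte arrives at the client
    last_byte_s: float    # full response has been downloaded
    render_done_s: float  # browser has finished rendering the page

    @property
    def ttfb_share(self) -> float:
        """Fraction of the end-to-end time spent before the first byte."""
        return self.first_byte_s / self.render_done_s


def summarize(t: PageTiming) -> Dict[str, float]:
    """Break the end-to-end time into host, download and client phases."""
    return {
        "ttfb_s": t.first_byte_s,
        "download_s": t.last_byte_s - t.first_byte_s,
        "client_s": t.render_done_s - t.last_byte_s,
        "e2e_s": t.render_done_s,
    }


if __name__ == "__main__":
    sample = PageTiming(first_byte_s=0.4, last_byte_s=1.5, render_done_s=4.0)
    print(summarize(sample), f"TTFB share: {sample.ttfb_share:.0%}")
```

In this made-up sample the server answers in 0.4 s, yet the user waits 4 s, so tuning the server alone would address only a small part of the experience.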
68. Why RUM Monitoring is Important
• Critical data for use in solving forensics issues.
• Closest data point to informing the implementation team
about the “real” user experience without talking to the
user (passive watching).
• Captures both functional and performance
characteristics about the user’s session experience.
• Provides insight into user’s clickstream, but does not
aggregate clickstream behavior.
• Covers the full pipeline from host to network to client.
69. What is Performance Forensic Monitoring?
• Deliberate instrumentation approach to capture
performance characteristics about an application
deployment.
• Measures resource and interface statistics not typically
visible from the application directly.
• Provides data points about application code execution
that can be tied down to both the user and/or the
application component.
• Can’t measure everything, but can sample consistently.
– Certain data points can be captured on a continuous basis such
as Java/J2EE container statistics
70. Tools for Forensic Monitoring
• Recommended tool sets tie the PFM tool with the RUM
tool.
– Foglight FxM seamless integration with Foglight Application
Cartridges and Database Performance Analysis
– Coradiant TrueSight integration with Dynatrace APM (Coradiant
AV)
– CA NetQoS integration with CA Wily IntroScope
– Oracle RUE Insight with Oracle Enterprise Manager for
Applications and Databases.
• Limited supply of open source tools that can perform a
fraction of the functionality.
– No known integrations with RUM tools
– Point based tools per container (not aggregators)
– Example tools: JConsole, Java VisualVM
71. Strategies for Forensic Monitoring
• Measure the essentials such as container interfaces and
resources.
• Most vendors have rule agents to begin sampling with a
greater degree of instrumentation when certain rules are
broken.
• Retain statistics for extended periods of time (greater than
1 year) for annual, monthly, weekly, daily and hourly
comparison purposes.
• Construct trending thresholds for alert purposes to invoke
a planning exercise in advance of an incident.
– Yes, application forensics can be used for trending purposes for
events in the future as they are based on events in the past as
points of reference.
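One simple way to build such a trending threshold is to fit a straight line to retained samples and estimate when it will cross a limit. The sketch below uses ordinary least squares. The function names and the sample data are hypothetical, and a linear trend is only an assumption that should be checked against the real history.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(threshold: float, xs: Sequence[float],
                  ys: Sequence[float]) -> Optional[float]:
    """Periods after the last sample until the trend reaches threshold.

    Returns None when the metric is flat or falling.
    """
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    x_cross = (threshold - intercept) / slope
    return max(0.0, x_cross - xs[-1])


if __name__ == "__main__":
    weeks = [0, 1, 2, 3]
    heap_used_pct = [50, 55, 60, 65]  # made-up weekly peak heap usage
    print(periods_until(80, weeks, heap_used_pct))
```

An alert that fires when the projected crossing is fewer than some number of weeks away gives time for a planning exercise before an incident.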
72. Why Forensic Monitoring is Important
• Most obvious is for explaining why an incident occurred,
when it occurred and to whom it occurred.
• Unlocks the black box of the application container.
– Provide feedback to the vendor about application design issues.
– Provide guidance for capacity and configuration changes to the
environment.
• Some vendors provide an entire pipeline request model
from the client to the container to the database.
– Great for schools that are leveraging home grown B2 or non-Bb
developed B2s that have not gone through full-fledged
performance and scalability testing.
74. What We’ll Cover
• Monitoring the environment to increase
performance and reduce failovers
• Solutions to help implement an HA/DR environment
• Managing your virtual environment
77. Heterogeneous Systems Management
System Center + Quest (QMX)
• Operating Systems
• Applications: Apache, Blackberry, etc.
• Databases: Oracle, etc.
• Network Devices: Cisco, Juniper, etc.
• Third-Party Frameworks: Connectors
• Mainframes: AS400 & z/OS
• Storage: EMC & NetApp
78. Quest and Microsoft - Managing the Enterprise
Comprehensive Heterogeneous Systems Management
79. Quest System Center Solutions
Product Alignment to Microsoft System Center Capabilities
• Provisioning & Virtualization
– VMM + Virtual Access Suite (from Provision Networks) = Advanced VDI
• Backup & Recovery
– DPM + Recovery Manager for AD, Exchange & SharePoint
– QMX for DPM (Delivered via Prof Services)
• Monitoring & Diagnostics
– QMX – Operations Manager 2005 Edition
– QMX – Operations Manager 2007 Edition
– QMPs: Oracle, .NET, AS400, z/OS, Cisco
– QMCs: Ex.Patrol, NetCool, MOM, Foglight
– QRX – Audit Collection Services
• Change & Configuration
– QMX - SCCM 2007 Edition
– QMX - Configuration Manager 2007 Edition
– QMX for Device Management – SCCM 2007 Edition
– QMX for Device Management – Configuration Manager 2007 Edition
– MSI Studio
• Simplified Management of System Center using Windows
PowerShell and Quest PowerGUI
• Legend
– QMX – Quest Management Xtensions
– QMP – Quest Management Packs
– QMC – Quest Management Connectors
– QRX – Quest Reporting Xtensions
80. SharePlex for Oracle High Availability / Disaster Recovery
• Provides an alternate copy of production data for failover in
the event of maintenance or downtime
• Ensures production databases are available 24x7
• Avoids loss in revenues or end-user satisfaction due to loss
of critical data
81. SharePlex for Oracle High Availability / Disaster Recovery
[Diagram: SharePlex replication flow. Labels: Redo-Logs, Capture,
Capture Queue, Read, Export Queue, Export, Import, Post Queue,
Post, SQL]
82. Quest Virtualization Solutions
• Vizioncore vConverter
• Vizioncore vRanger Pro, vReplicator, vOptimizer
• Quest Foglight for Virtualization
• Quest InTrust
• QMX for SMS
• ScriptLogic Desktop Authority
• Vizioncore vFoglight
• Vintela Access Suite
• Provision Networks Virtual Access Suite
84. vConverter
Used To:
• Help companies begin first steps of virtualization –
Physical to Virtual (P2V) conversions
• Convert physical servers to VMware, Microsoft,
Citrix, or Virtual Iron virtual machines (VMs)
• Assist organizations in deploying multi-vendor
hypervisor solutions – Virtual to Virtual (V2V)
conversions
• Provide Disaster Recovery (DR) protection for old
physical servers – allow quick recovery to as close
to point of failure as possible
7/19/10
85. Image-Based Backup & Recovery
vRanger Pro: Comprehensive Backup and Recovery for Virtual
Infrastructures
• Backup-Once-Recover-Any (files, email objects, OS, patches,
registry, applications, and recovery agents)
• Full VM recovery times are faster than traditional backup
and recovery solutions
• Object-level restore (OLR) for Microsoft Exchange email
objects offers faster, more flexible recovery options
86. Replicate VMs
vReplicator
• DR & BC with shorter RTO/RPO
• VM replication with minimal data and overhead to reduce
impact
• Asynchronous Data Transfer
• Affordable and Easy to Use
87. Manage VM Performance
vFoglight
• Performance Monitoring
• Capacity Planning
• Service Management
• Chargeback
88. Optimize VM Storage
vOptimizer Pro
• Find and reclaim over allocated VM storage
• Prevent VM unavailable storage outages
• Reduce time spent monitoring & managing VM storage
• Automated 64K alignment improves VM performance
89. Self Service Provisioning and VM Management
vControl
• Self-Service Request, Approval & Provisioning
– Self-Service VM Request Portal
– VM Approval & Fulfillment System
• VM Management, Visibility & Control
– VM Management Console
– Extensible Workflow Engine
90. Quest Virtualization Solutions: Business Continuity
Use Cases for vRanger Pro
• Complement existing file-level backups
– Image servers weekly to complement nightly
file level back up solutions for faster restore
• Complete backup solution
– Image servers weekly with nightly
differentials and file level restore for most
complete and cost effective backup solution
• Offsite Disaster Recovery (DR)
– Image complete servers offsite for low cost,
comprehensive DR plan
91. vConverter
• Feature: Powerful and easy-to-use graphical interface
– Benefit: Simplify P2V migrations; manage more conversions
with fewer people and less effort
• Feature: Automate many P2V and V2V pre- and post-conversion
tasks
– Benefit: Speed up P2V & V2V conversions – reduce manual,
time-consuming, and error-prone tasks
• Feature: High-speed file based or block level transfers
– Benefit: Complete conversions faster while using less
critical system resources
• Feature: Synchronized cutover for large conversion projects
– Benefit: Pre-plan and schedule large P2V and V2V
conversions – reduce server outages
• Feature: Generate DR image of physical servers to virtual
machines
– Benefit: Streamline and simplify recovery of old physical
servers that fail
• Feature: Provide continuous DR protection for physical
servers
– Benefit: Ensure recovery of physical servers to as close
to the time of failure as possible
92. Quest Virtualization Solutions: Business Continuity
Use Cases for vReplicator
• Quick Recovery of Servers
– Recover critical servers where data recovery from
the day before is not acceptable – need at least
last 2 hours of data
• Cost-Effective Offsite Solution
– Organizations who do not have the budget or
personnel to support traditional HA / DR solutions
• Complement to SAN Replication
– Organizations who want replication at
departmental or group level or who have selected
virtual machines they need to replicate without
using SAN
96. Please provide feedback for this session by emailing
BbWorldFeedback@blackboard.com.
The subject of the email should be the title of this
session:
Deploying a Highly Performing, Scalable and Available
Blackboard Solution