Designing backup for Microsoft 365 data is completely different when you are dealing with thousands of users as opposed to hundreds. But what if you must handle 40,000 users? As an enterprise-focused VASP architect and an architect at Veeam's largest service provider, we have designed VB365 systems for some of the largest organizations in the world. In this session we'll cover best practices and guidance for designing successful backup solutions for large-scale organizations and multi-tenancy. Topics will include:
- Organization discovery and designing the backup infrastructure.
- Job best practices
- Automation of large-scale setup
- Monitoring and troubleshooting gotchas
VeeamON 2023 Architecting Veeam Backup for Microsoft 365 at Scale
1. Jim Jones
Senior Product Infrastructure Architect
11:11 Systems
@k00laidIT
https://koolaid.info
Architecting Veeam Backup
for Microsoft 365 at Scale
Falko Banaszak
Solution Architect
@Falko_Banaszak
https://virtualhome.blog
3. Why all this?
- Four different M365 workloads with their own “I/O pattern”.
Those are almost impossible to size properly at the beginning and can change.
- Microsoft tends to silently make changes to the M365 platform.
- SP style multi-tenancy has its own challenges.
- In most cases VB365 is being onboarded in parallel to the M365 migration.
- A proper architecture helps maintain
6. Organization size discovery
Multi-geo organization – consider licensing, regulation.
Customer/admin managed groups are good.
- At this point they are only used for sizing, i.e. to discover the size of a particular group.
- Delimit users.
- Delimit geographical boundaries.
Use a good calculator as early as possible:
- https://calculator.veeam.com/vb365
Overestimates proxy design IMHO.
- https://1111systems.com/catalyst/
7. Environment placement
Cloud vs. on-premises:
- Compute: keep close to storage.
Azure: F or D type instances.
AWS: M type.
- Secondary copies and where to keep them.
- Egress, API, retention… OH MY!
Environmental observations:
- Networking.
- Compute.
- Scale out.
Best practices: BP/design/placement
9. Proxy design
- Compute: 8 vCPU, 32 GB RAM.
- Local secondary SSD for cache.
- Be conservative against BP guidance, needs will grow.
BP/design/sizing proxy server.
BP/design/configuration maximums.
- Start at 2,000-2,500 objects per proxy.
- Least taken advice ever.
“For optimum performance, we recommend storing no more than 300,000 files in a single OneDrive or team site library.”
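The starting density of 2,000-2,500 objects per proxy translates directly into a proxy count. The following is a minimal sketch of that arithmetic, not a Veeam tool: the function name is hypothetical, and the 40,000 figure is the user count from the session abstract, treated here as one protected object per user purely for illustration.

```python
import math


def proxies_needed(object_count: int, objects_per_proxy: int = 2000) -> int:
    """Return the number of proxies needed at a given starting density."""
    if object_count < 0 or objects_per_proxy <= 0:
        raise ValueError("object_count must be >= 0 and objects_per_proxy > 0")
    # Round up: a partially filled proxy is still a proxy to deploy.
    return math.ceil(object_count / objects_per_proxy)


# Illustrative input only: 40,000 objects at both ends of the slide's range.
for density in (2000, 2500):
    print(f"{density} objects/proxy -> {proxies_needed(40_000, density)} proxies")
```

Starting at the lower density leaves room on each proxy for the growth the slide warns about.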
10. Proxy to repository relationship
Recommended way of scaling a backup proxy with its backup jobs and repositories.
Repeat this for OneDrive and Archive Mail (user-based objects).
Workload type proxies are getting the same “I/O pattern”.
Highest flexibility.
[Diagram labels: Proxy; Mail Backup Job 1 through Mail Backup Job 10; Meta Data; Meta Data and Backup Data; Local Proxy disks for cache metadata (SSD / NVMe)]
11. Failure domains within proxy to repository
Keep “failure domains” as small as possible.
- A job with 10K users is a big failure domain.
- A job with 1K users is a smaller failure domain.
- RTO & retrying.
Reassigning objects from one job to another.
- Move to another proxy == another repository!
- Another full sync!
- Psst: check out Mike Ressler’s session tomorrow.
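The "small failure domain" idea can be sketched as splitting a user list into fixed-size jobs. This is only an illustration of the partitioning logic, assuming a plain list of user names; the function name and the example addresses are hypothetical, and it does not call any VB365 API.

```python
from typing import Iterator, Sequence


def split_into_jobs(users: Sequence[str], max_per_job: int = 1000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of users, each at most max_per_job long."""
    if max_per_job <= 0:
        raise ValueError("max_per_job must be positive")
    for start in range(0, len(users), max_per_job):
        yield users[start:start + max_per_job]


# Hypothetical tenant with 10,000 users, split into 1K-user jobs.
users = [f"user{i:05d}@example.com" for i in range(10_000)]
jobs = list(split_into_jobs(users))
print(f"{len(jobs)} jobs, largest has {max(len(j) for j in jobs)} users")
```

Because reassigning objects later means another proxy, another repository and another full sync, it pays to settle the chunk size before the first run.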
12. Repository design
Item level vs. snapshot choices
Retention
- Consider tiered retention policies based on governance.
- ONLY KEEP DATA AS LONG AS MANDATED.
- Maximum in named years: 25 years.
- Maximum years in days: 273 years.
Immutable secondary copies are good! (and bad!)
- Good to have but keep retention low.
- Uses true object-lock, choices have consequences.
- Retention policy MUST match primary copy.
13. Repository design
- Object storage cache: (5 MB/ObjectTB)*2.
- 1 job = 1 bucket/repository.
- Use description field to show relationships.
- Best practices:
BP/design/sizing/object storage.
BP/build & configure/repository/object storage.
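The cache formula on this slide can be turned into a small calculation. The sketch below reads "(5 MB/ObjectTB)*2" literally as 5 MB of cache per TB held in object storage, doubled; the slide does not spell out the units, so that reading, the function name and the parameter names are all assumptions.

```python
def object_cache_mb(object_storage_tb: float,
                    mb_per_tb: float = 5.0,
                    multiplier: float = 2.0) -> float:
    """Cache size in MB under the assumed reading of (5 MB/ObjectTB)*2."""
    if object_storage_tb < 0:
        raise ValueError("object_storage_tb must be non-negative")
    return object_storage_tb * mb_per_tb * multiplier


# Hypothetical repository holding 100 TB in object storage.
print(f"{object_cache_mb(100):.0f} MB of cache")
```

Check the result against the BP guide pages referenced on the slide before committing to disk sizes.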
14. Job design
- Design with best practices(ish):
BP/design/job design.
- Consider staff turnover/growth in design:
If high, design for new user growth.
Catch-all jobs aren’t great at scale.
- Always split by workload type as a minimum.
- Dynamic groups are great but $$$$ for pure M365 orgs
- Alphabetical jobs not bad but hard to manage
- Teams: do you really need Teams channels?
- Stagger first runs with no schedule
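One plausible reason alphabetical jobs are hard to manage (an assumption, not something the slide states) is that first letters are unevenly distributed, so per-letter jobs end up very different in size. A minimal sketch for checking that on a user export follows; the function names and sample names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def first_letter_counts(user_names: Iterable[str]) -> Counter:
    """Count users per upper-cased first letter, skipping empty names."""
    return Counter(name[0].upper() for name in user_names if name)


def imbalance(counts: Counter) -> float:
    """Ratio of the largest bucket to the smallest bucket."""
    return max(counts.values()) / min(counts.values())


# Hypothetical sample; in practice, feed in the tenant's real user list.
sample = ["jim", "john", "julia", "mike", "maria", "xavier"]
counts = first_letter_counts(sample)
print(dict(counts), "imbalance:", imbalance(counts))
```

A high ratio suggests grouping several letters into one job or splitting a crowded letter across jobs.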
16. Monitoring
Veeam® ONE™ – dashboards, reports, API!
API options
- Grafana:
https://github.com/jorgedlcruz/veeam-backup-for-microsoft365-grafana
- Roll your own reports:
https://benyoung.blog/tag/vbo/
New with v7
- Prebuilt VB365 reports:
Mailbox Protection Report.
User Protection Report.
17. Plan well: avoid pains later
Job/data movement
- No supported method for object > object:
Veeam KB 3067.
But… forums / “Migrate to a another object storage repo”.
- Block > Object = SLOW and jobs must be disabled:
k00laidIT / Veeam / kb3067.ps1 – modified and improved.
k00laidIT / Veeam / kb3067-validation.ps1 – stats and verification.
Graph API for Teams export
- Only way to get Teams Channel data, paid API:
- https://1111systems.com/tag/microsoft-365-backup/
- Veeam KB 4322 – request access.
- Veeam KB 4340 – enable on server.
18. Plan well: avoid pains later
Authentication
- Set up a separate, well-protected account for this purpose.
- App secrets = long expiration.
- #1 most common reason for VB365 errors today.
Throttling
- Backup groups support murky today
May actually make processing slower with AppOnly.
- Logs to verify.
- BP/operate/M365 throttling.
THIS IS FINE!
19. Log diving - throttling
HTTP Error 500 internal server error.
HTTP Error 429 too many requests.
20. Log diving - high sync times
- Use a direct internet connection whenever possible.
- Try to avoid traffic shaping, next-generation firewalling and all sorts of "logic" on the internet connection uplink.
- "Normal sync times" for a direct connection are expected to be in the range of 50 to 150 milliseconds.
select-string "Sync time: [^0]" *
<JOB NAME>_2022_04_14_08_59_59.log:2732:[14.04.2022 09:00:08] 59 (8132) Sync time: 143.7938527
<JOB NAME>_2022_04_14_08_59_59.log:2737:[14.04.2022 09:00:08] 49 (4600) Sync time: 324.8092703
Yes, those are seconds, not milliseconds - and guess what the next log line says:
14.04.2022 09:00:08 75 (4600) No changes
BP/operate/common issues.
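The select-string search above matches any sync time whose first digit is not 0. The same check can be sketched in Python with an explicit threshold. The sample lines are the ones shown on the slide; the function name and the 1-second threshold are assumptions chosen for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple

# Captures the numeric value that follows "Sync time:" in a log line.
SYNC_RE = re.compile(r"Sync time:\s*(\d+(?:\.\d+)?)")


def slow_syncs(lines: Iterable[str], threshold: float) -> Iterator[Tuple[float, str]]:
    """Yield (sync_time, line) for each line whose sync time exceeds threshold."""
    for line in lines:
        match = SYNC_RE.search(line)
        if match and float(match.group(1)) > threshold:
            yield float(match.group(1)), line.rstrip()


sample = [
    "[14.04.2022 09:00:08] 59 (8132) Sync time: 143.7938527",
    "[14.04.2022 09:00:08] 49 (4600) Sync time: 324.8092703",
    "14.04.2022 09:00:08 75 (4600) No changes",
]
for value, line in slow_syncs(sample, threshold=1.0):
    print(f"{value:.1f}s  {line}")
```

Raising the threshold narrows the output to the worst offenders when a log directory contains many jobs.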
Introduction of the session (Falko)
Introducing Falko (Falko)
Introducing (Jim)
Falko
Planning and Preparation 10
Proxy, Repo and Job Design 15
Monitoring & Troubleshooting 15
Questions 5
Don’t worry about screenshotting or photographing; we’ve got your back with the QR codes and links
Jim and Falko
Why are we doing this session?
Jim and Falko
JIM HAND TO FALKO AFTER MULTI-TENANCY
"Speaking of planning..." (Falko)
Falko transition
Falko hand off to Jim
Just like any backup, replication or other disaster recovery project, the planning and preparation stage is by far the most important and time-consuming portion of the job, and that will be reflected here in this presentation. We will often refer to the Veeam Best Practices Guide for VB365, so take note of it and, when you have a chance, give it a quick read-through before you consider even a small-scale deployment, because there is so much room for confusion with this product.
GIVE FEEDBACK – it’s valuable and Veeam listens! Observations from the field are getting into the BP guide!
- Falko talk about bp guide and how stuff gets into there
- Jim chime in with make sure to give feedback to help center, Veeam teams, community, forums
Jim
Multi-geo challenges around licensing and regulation (GDPR)
Talk about managed groups for “interesting users”
Delimit on users and geographic boundaries
Use calculators
Veeam has somewhat made theirs more lightweight by making it generic numbers
https://success.1111systems.com/catalyst/vbo
JIM HAND TO FALKO FOR ENVIRONMENT PLACEMENT
Falko and Jim Cloud vs On Prem
Falko talk about Azure instance types
Jim talk about AWS instance types
Jim talk about Secondary copy to S3 Compatible VCSP possible but…
Falko addl fees
Falko and Jim talk about whether it makes sense to have data from a SaaS solution on premises, even though regulations or some sort of security frameworks tell you to do so...
Falko Environmental observations
FALKO HAND TO JIM AFTER COMPUTE
Jim talk about designing for scale out early
Jim intro Falko for Proxy Design
Falko
Falko and Jim back and forth on horror stories of poorly designed M365
FALKO TO JIM AFTER BE CONSERVATIVE
Jim prefer 2000 at onboarding per proxy
Falko CRM systems in a single sharepoint site example
Jim talk about OneDrive/SP limitation
Falko
Single Job > Repository
Many Repository to Proxy
Keep workloads types grouped by proxy
This gives you the greatest amount of flexibility
CLICK TRIGGER ON FLEXIBILITY
Falko
Can you hold your RTO if you have a job with 10K users that needs to be retried several times?
A job with 1K users can easily be retried and will not take as long as a job with 10K users, so if something breaks you recover more quickly with “smaller jobs” / “portions”.
Jim mention Ressler session
Falko Now that we have finished proxies, we need to talk about repositories as well, Jim, right? :D
FALKO HAND TO JIM CLICK TRIGGER ON REPOSITORY DESIGN
Jim talk about retention policy types
Item level is based on retaining individual items: if retention is set to 7 days, then the backups contain only emails from the last week
Snapshot is VBR-like, point in time
Back and forth Jim and Falko ?
Jim – Object storage cache
Jim – 1 Job = 1 bucket / repository
Falko Description
Falko Best Practices again – no logic at all, no tiering, let the bucket be a bucket
FALKO HANDS TO JIM FOR JOB DESIGN
Jim with Falko chips to be determined
Jim talk about best practices, they are mostly great but you’ll notice differences based on what we say here
Jim talk about staff turnover in design, ask this question
Falko chip in about splitting by workload always, even for super small organizations
Jim talk about dynamic groups vs splitting by alphabetical, RegEx would sure be awesome here
Jim and Falko talk about M vs J for alphabetical
Jim talk about Teams, do you really want/need public channel backups,
Falko chip in about General channel where everybody just says good morning, is it worth the costs we’ll talk about in a minute
Falko talk about Stagger first runs
FALKO CONTINUE, CLICK TRIGGER ON SPEAKING OF MONITORING
Jim heading into our last section talk about monitoring and troubleshooting
Jim talk about all the options
CLICK TRIGGER ON VEEAMONE
Jim talk about VeeamONE capabilities
CLICK TRIGGER ON JORGE
Falko talk about API
CLICK TRIGGER ON v7
Jim talk about v7 prebuilt reports
Mailbox and User protection
CLICK TRIGGER ON Wrapping up with pain points
Jim talk about data movement
Object to object not CURRENTLY supported
Block to object is possible but super slow
Jim rant about Graph API for Team export
CLICK TRIGGER ON AUTHENTICATION
Jim talk about authentication
Jim hand to Falko about Throttling
Falko talk about Throttling
Falko talk about high sync wait times
CLICK TRIGGER ON LET’S LOOK AT THROTTLING
Falko talk about throttling
Mitigating it: It is possible to disable the EXO throttling via M365 self-service for up to 90 days. The detailed procedure is described in Veeam KB4198.
Do not use “endless” backup accounts or application IDs, as this exhausts the tenant’s resources
You won’t have the chance to overcome it: it’s a SaaS application and thousands of customers are using it in parallel, so there is no exclusivity
CLICK TRIGGER ON HIGH SYNC TIMES
Falko
CLICK TRIGGER AT GUESS WHAT THE NEXT LOG LINE SAYS
CLICK THROUGH TO QUESTION AFTER
Jim that’s our session, any questions either for the room or you can link up with us after
Jim rehash all content available at github repo
Falko with that we are done CLICK TRIGGER to thank you