This is part four of a four-part series on Microsoft SQL Server deployment. Click here to view part one, part two and part three.
Now that you’ve successfully deployed SQL Server, it’s critical to ensure things continue to run smoothly over time. Here are some best practices to ensure you are set up for success.
SQL Server Maintenance
It’s a good idea to configure regular maintenance to keep everything running at peak performance. Some applications deploy their “own” maintenance plans, so make sure you don’t double up when configuring your own maintenance jobs.
I recommend using Ola Hallengren’s “SQL Server Integrity Check” and “SQL Server Index and Statistics Maintenance” scripts. These have a range of parameters that can be adjusted to fit almost any scenario.
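As a sketch of what a scheduled run of those two scripts might look like (the parameter values shown are common starting points, not universal recommendations; tune them to your environment):

```sql
-- Ola Hallengren's integrity check across all user databases
EXECUTE dbo.DatabaseIntegrityCheck
    @Databases = 'USER_DATABASES',
    @CheckCommands = 'CHECKDB';

-- Reorganize or rebuild indexes based on fragmentation, then update statistics
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationLevel1 = 5,    -- reorganize above 5% fragmentation
    @FragmentationLevel2 = 30,   -- rebuild above 30% fragmentation
    @UpdateStatistics = 'ALL';
```

Wrap each of these in a SQL Server Agent job so you get consistent timing data for the baseline mentioned below.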
Make sure you run these before going live, so you have a baseline on how long they will take to complete. This is probably also the perfect opportunity to make sure Database Mail is configured properly, to notify you of DB maintenance failures.
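Once Database Mail is configured, a quick way to confirm it actually works is to send yourself a test message. The profile name and recipient below are placeholders:

```sql
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'DBA Alerts',          -- placeholder profile name
    @recipients  = 'dba-team@example.com', -- placeholder recipient
    @subject     = 'Database Mail test',
    @body        = 'If you can read this, Database Mail is configured correctly.';
```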
If you feel like diving deeper into the topics of index and statistics maintenance, make sure you check out Brent Ozar’s blog.
This should go without mentioning, but performing regular backups is mandatory. It’s not a question of if you ever need them, it’s when.
We strongly recommend starting out by asking the “business” what their Business Continuity requirements are, specifically Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements. To recap, RTO specifies how long the organization can afford to be without access to the data in the database (i.e. the amount of time it takes to restore the service). RPO specifies how much data loss is acceptable if something breaks.
These two parameters essentially dictate the database recovery model, backup frequency and – if RPO and RTO values are very low – high availability architectures.
Select the recovery model based on the RPO definition. To recap again: there are two recovery models that matter in this context, Simple and Full (the third one, Bulk-Logged, is for special use cases).
In the Simple recovery model, committed transactions are written to the database files and the transaction log is automatically truncated at each checkpoint. This means the transaction log file doesn’t really grow and disk space does not need to be managed as diligently. The disadvantage is that your RPO is directly tied to the frequency of your backups. If you back up your database once a day, you can potentially lose up to one full day of data: all the transactions that happened since the last backup.
With the Full recovery model, data is still written to the database files at regular intervals, but the transaction log is not truncated at checkpoints. Instead, SQL Server keeps track of how far the log has been backed up, and the log is only truncated by transaction log backups. This means that if you don’t back up your transaction logs on a regular basis, your log files can grow out of control and fill up the disk. Also keep in mind that truncating the log does not shrink the log file itself. It only creates free space inside the file, just as deleting data does in database files. It is not uncommon for organizations to perform transaction log backups at intervals of less than ten minutes in order to meet RPO requirements and manage disk space.
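The transaction log backup itself is a one-liner, typically wrapped in a scheduled job. Database name and path here are placeholders:

```sql
-- Backs up, and thereby truncates, the committed portion of the log
BACKUP LOG SalesDB
    TO DISK = N'X:\Backups\SalesDB_log.trn';
```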
The advantage of the Full recovery model is that you can perform exact point-in-time restores. For example, if you know that something happened at 3:21pm that corrupted your data, with the Full recovery model you can restore the database to its state as of 3:20pm.
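A point-in-time restore like the 3:20pm example is a full backup restored without recovery, followed by log restores stopped at the target time. A sketch, where the database name, file paths and timestamp are all placeholders:

```sql
-- Restore the most recent full backup, leaving the database in the restoring state
RESTORE DATABASE SalesDB
    FROM DISK = N'X:\Backups\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

-- Roll the transaction log forward, stopping just before the bad transaction
RESTORE LOG SalesDB
    FROM DISK = N'X:\Backups\SalesDB_log.trn'
    WITH STOPAT = '2024-06-01 15:20:00',  -- placeholder timestamp
         RECOVERY;
```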
In summary, if you hear the business defining the RPO for a database as “we can live with losing a day’s worth of data,” then feel free to choose the Simple recovery model. If the statement is something like “data loss must be as little as possible, and not greater than x days/hours,” then you need the Full recovery model with regular transaction log backups.
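Setting the recovery model is a single statement per database (SalesDB is a placeholder name):

```sql
-- RPO of "a day's worth of data is fine", with daily full backups:
ALTER DATABASE SalesDB SET RECOVERY SIMPLE;

-- Tight RPO, with regular transaction log backups scheduled:
ALTER DATABASE SalesDB SET RECOVERY FULL;

-- Verify the current setting across all databases
SELECT name, recovery_model_desc FROM sys.databases;
```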
Note that there may also be other reasons for using the Full recovery model for a database, such as Always On Availability Groups, database mirroring, etc.
Do not keep your backups on the SQL Server itself, other than for staging. Copy or move them off to separate storage, meaning a different SAN or storage array than the one the SQL Server instance is using. In addition, it is highly recommended to keep a copy off-site; cloud storage is a great solution for this. Most modern backup and restore applications (Veeam, Rubrik, Commvault, Microsoft DPM) have cloud storage options built in, typically blob or object storage.
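If your tooling doesn’t handle the off-site copy for you, SQL Server can also back up directly to Azure blob storage. A sketch, assuming SQL Server 2016 or later and that a SAS-based credential for the container has already been created (storage account, container and database names are placeholders):

```sql
BACKUP DATABASE SalesDB
    TO URL = N'https://examplestorage.blob.core.windows.net/backups/SalesDB_full.bak'
    WITH COMPRESSION;
```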
You may also want to consider what is referred to as “immutable” storage, which is essentially read-only storage. Potential use cases for this type of storage are tamper protection (regulatory compliance) or protection from ransomware.
Finally, make sure you perform regular restore tests using different scenarios and different backup sets, to ensure you can comply with all of the organization’s business continuity requirements.
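A restore test can range from a quick file check to a full restore with an integrity check. A sketch, where the database, logical file and path names are all placeholders:

```sql
-- Quick sanity check: is the backup file readable and complete?
RESTORE VERIFYONLY FROM DISK = N'X:\Backups\SalesDB_full.bak';

-- A real test: restore under a throwaway name, then check its integrity
RESTORE DATABASE SalesDB_RestoreTest
    FROM DISK = N'X:\Backups\SalesDB_full.bak'
    WITH MOVE 'SalesDB'     TO N'X:\RestoreTest\SalesDB.mdf',
         MOVE 'SalesDB_log' TO N'X:\RestoreTest\SalesDB_log.ldf',
         RECOVERY;

DBCC CHECKDB ('SalesDB_RestoreTest') WITH NO_INFOMSGS;
```

Only a completed restore plus CHECKDB proves the backup is actually usable; VERIFYONLY alone is not sufficient.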
Once your SQL Server deployment is up and running, it’s time to implement proper monitoring. That term, “proper monitoring,” means implementing what is necessary to gather the intelligence required to make decisions on availability, scalability and performance. Just monitoring CPU and RAM usage is not sufficient. While up/down alerting is also part of monitoring, a good monitoring implementation goes beyond just telling you when something goes down. Ideally, it provides information on how to avoid outages before they even occur, also known as “trend analysis.”
Regardless of the monitoring solution you have deployed, or are planning to deploy, ensure you’re collecting all the necessary metrics from your SQL Server deployment to make informed decisions. Here’s a baseline to get you started. Add on depending on your deployment and requirements.
- System KPIs: These are all pretty self-explanatory.
  - Memory usage.
  - CPU utilization.
  - Physical disk performance.
- SQL Server KPIs: Look here for some explanation of these metrics.
  - Buffer cache hit ratio.
  - Page life expectancy.
  - Batch Requests/sec.
  - SQL Compilations/sec.
  - SQL Re-Compilations/sec.
  - User Connections.
  - Lock Waits/sec.
  - Thread Count.
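Whatever monitoring tool ends up storing them, most of the SQL Server counters above can be read straight from the DMVs. A sketch; note that the per-second counters are cumulative values, so a monitoring tool must sample them twice and take the difference:

```sql
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (
    'Buffer cache hit ratio',
    'Page life expectancy',
    'Batch Requests/sec',        -- cumulative: sample twice and diff
    'SQL Compilations/sec',      -- cumulative
    'SQL Re-Compilations/sec',   -- cumulative
    'User Connections',
    'Lock Waits/sec'             -- cumulative
);
```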
Consider Enabling Remote Access to the Dedicated Administrator Connection
The Dedicated Administrator Connection (DAC), in simplified terms, is an isolated connection method that allows an administrator to run diagnostic queries when the SQL Server is otherwise inaccessible. By default, it’s enabled only for local connections, meaning you have to remote desktop into the SQL Server to use it. Instead, you can enable it for remote troubleshooting.
Keep in mind that the DAC only allows one connection at a time. Also, don’t go crazy here. There is no parallelism for queries executed over the DAC, so don’t be tempted to run your re-indexing jobs through it.
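Enabling remote access is a one-time server configuration change:

```sql
-- Allow the Dedicated Administrator Connection to accept remote connections
EXEC sp_configure 'remote admin connections', 1;
RECONFIGURE;
```

You can then connect from another machine with `sqlcmd -S yourserver -A` or by prefixing the server name with `ADMIN:` in SSMS (the server name here is a placeholder).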
As always, we’d love to hear your questions and comments in the Cireson Community.