Detect Suspect Database Pages with msdb.dbo.suspect_pages

Storage errors do not always surface as immediate failures. When SQL Server encounters an 823, 824, or 825 error reading or writing a database page, it records the offending page in msdb.dbo.suspect_pages — a system table that accumulates page-level I/O failures across service restarts. Querying this table is the fastest first-pass for any corruption investigation, answering whether bad pages exist and which databases are affected before committing to a full DBCC CHECKDB run.

Purpose and Overview

msdb.dbo.suspect_pages is SQL Server's persistent registry of page-level I/O failures. Unlike an error log entry that requires text parsing, this table exposes structured data: one row per bad page, with the database ID, file ID, page number, error type, cumulative error count, and the timestamp of the last failure. Entries survive SQL Server restarts and accumulate over time, so a page flagged months ago but never repaired remains visible until a DBA explicitly removes the entry or the page is restored.

The table captures three classes of physical-integrity errors. Event type 1 covers 823 and 824 errors — OS-level I/O failures and checksum mismatches that do not indicate a torn page specifically. Event types 2 and 3 represent torn pages: rows where the page checksum computed at read time does not match what was written, or where a page header indicates it was only partially flushed to disk. These three types are the active corruption signals; event types 4, 5, and 7 mark pages that have been resolved by a backup restore or a DBCC CHECKDB repair operation.

The error_count column is particularly diagnostic: a value greater than 1 on a page means SQL Server has failed to read or verify that page multiple times since the last update. Repeated failures against the same page indicate a hardware fault that is worsening — a drive sector in decline — rather than a transient I/O blip that self-corrected. For a DBA triaging a production alert, that single column separates the "monitor it" situations from the "restore from backup now" situations.

One important operational boundary: msdb.dbo.suspect_pages is limited to 1,000 rows. When the table reaches capacity, SQL Server begins logging additional errors to the SQL Server error log rather than writing new rows. A full suspect_pages table is itself a serious signal — it means corruption events are outpacing DBA response. Monitoring row count alongside the error entries is part of a complete integrity-monitoring strategy.

Code Breakdown

The detection script queries the table for all active (unresolved) suspect pages, joins to sys.master_files for the physical file path, and decodes the event_type integer into a readable description.

 1USE msdb;
 2GO
 3
 4SELECT
 5    DB_NAME(sp.database_id)             AS database_name,
 6    mf.physical_name                    AS file_path,
 7    sp.file_id,
 8    sp.page_id,
 9    CASE sp.event_type
10        WHEN 1 THEN '823/824 - Hard I/O error or bad checksum'
11        WHEN 2 THEN 'Torn page (read error)'
12        WHEN 3 THEN 'Torn page (bad checksum)'
13        WHEN 4 THEN 'Restored from backup'
14        WHEN 5 THEN 'Repaired by DBCC CHECKDB'
15        WHEN 7 THEN 'Deallocated by DBCC CHECKDB'
16        ELSE        'Unknown event type'
17    END                                 AS event_type_desc,
18    sp.error_count,
19    sp.last_update_date
20FROM msdb.dbo.suspect_pages AS sp
21LEFT JOIN sys.master_files AS mf
22    ON  sp.database_id = mf.database_id
23    AND sp.file_id     = mf.file_id
24WHERE sp.event_type IN (1, 2, 3)
25ORDER BY sp.last_update_date DESC;

To view all entries including resolved pages — useful for a historical audit of prior repair activity:

1SELECT
2    DB_NAME(sp.database_id)             AS database_name,
3    sp.file_id,
4    sp.page_id,
5    sp.event_type,
6    sp.error_count,
7    sp.last_update_date
8FROM msdb.dbo.suspect_pages AS sp
9ORDER BY sp.last_update_date DESC;

Running in the msdb context

The USE msdb statement is required because suspect_pages is a system table in the msdb database, not in each user database. All page failures across every database on the instance funnel into a single msdb table — there is no per-database scoped equivalent. Any user with SELECT permission on msdb.dbo.suspect_pages can read it; the sysadmin fixed server role and the db_datareader role on msdb are the usual access paths for monitoring accounts.

Filtering on event_type IN (1, 2, 3)

The WHERE event_type IN (1, 2, 3) predicate isolates the three active error categories and suppresses rows representing already-resolved pages. Without this filter, the result set mixes current problems with historical entries for pages that DBCC CHECKDB has repaired or that a backup restore has corrected. In a post-incident review, dropping the filter and examining the full history is useful; in a monitoring or alerting context, the filter keeps the result focused on pages still requiring attention.

Joining to sys.master_files for physical paths

msdb.dbo.suspect_pages stores only integer IDs — database_id, file_id, and page_id. The LEFT JOIN to sys.master_files resolves those integers to a physical_name, adding the full path of the affected data or log file. This is practical for storage-team handoff: the output names not just the SQL Server database but the actual file on the storage controller, identifying whether the bad pages originate from a specific LUN, drive, or SAN volume that can be isolated or replaced. The join uses LEFT JOIN rather than INNER JOIN to preserve rows for offline or detached databases whose files no longer appear in sys.master_files.

Decoding event_type

The CASE expression translates the integer event_type into a human-readable description. The six defined values cover the full lifecycle of a suspect page: type 1 (823 or 824 error — the broadest category, covering hard I/O faults and bad checksums when page verification is set to CHECKSUM), types 2 and 3 (torn page detected during a read and flagged by a checksum mismatch respectively), and types 4, 5, and 7 (page restored from a backup, repaired by DBCC CHECKDB WITH REPAIR_ALLOW_DATA_LOSS, and deallocated by DBCC CHECKDB respectively). Types 4, 5, and 7 indicate prior resolution and are suppressed in the active-errors query by the event_type IN (1, 2, 3) filter.

Table capacity monitoring

A companion query that shows how close the table is to its 1,000-row hard limit:

1SELECT
2    COUNT(*)                  AS total_rows,
3    1000 - COUNT(*)           AS rows_remaining,
4    MAX(last_update_date)     AS most_recent_event
5FROM msdb.dbo.suspect_pages;

When total_rows approaches 1,000, the underlying corruption problem is active and widening. At that point new events begin going exclusively to the SQL Server error log, and the structured query-based monitoring this script provides starts losing visibility.

Key Benefits and Use Cases

  • Instant corruption visibility — a single query surfaces all unresolved bad pages across every database on the instance without running a full DBCC CHECKDB.
  • Repeated-failure identificationerror_count distinguishes transient I/O blips from deteriorating hardware by counting how many times SQL Server has failed on the same page.
  • Storage-team handoff — joining to sys.master_files maps the SQL Server file ID to a physical path, enabling storage-layer investigation without ambiguity.
  • Repair triage — the resolved event types (4, 5, 7) confirm that prior DBCC CHECKDB repairs wrote back to the table, providing an auditable record of remediation activity.
  • Capacity alerting — tracking total row count catches the approach to the 1,000-row limit before overflow to the error log removes monitoring visibility.
  • Lightweight baseline — the detection query costs nothing on a clean instance and only produces an alert-worthy result when bad pages actually exist.

Performance Considerations

  • msdb context is mandatory: the query must reference msdb.dbo.suspect_pages explicitly; running it without the three-part name or without USE msdb fails with an object-not-found error.
  • 1,000-row hard limit: the table cannot grow beyond 1,000 rows. When full, overflow events go to the SQL Server error log only — row-count monitoring is essential to avoid missing entries in heavily degraded environments.
  • Resolved rows are not purged automatically: SQL Server adds rows for new failures and updates last_update_date and error_count on repeat failures, but it does not delete resolved rows. Over time, rows with event types 4, 5, and 7 accumulate. Explicit DELETE of resolved rows — typically after confirming repair with DBCC CHECKDB — is required to keep the table manageable and the 1,000-row capacity limit meaningful.
  • Left join behavior on offline databases: sys.master_files does not list files for databases in an offline, restoring, or suspect state; the LEFT JOIN returns NULL for physical_name in those cases. A NULL file path signals a database that may itself be inaccessible, not a join error.
  • No automatic alerting built in: the table accumulates silently. SQL Server does not fire a notification when a new row is inserted. A SQL Server Agent job polling this query and sending an alert on any non-empty result is the standard monitoring implementation.

Practical Tips

  • Schedule a SQL Server Agent job that queries suspect_pages WHERE event_type IN (1, 2, 3) and sends a Database Mail notification when the result set returns any rows — without active polling, new entries accumulate with no visibility.
  • After confirming a page has been repaired by backup restore or DBCC CHECKDB, update the event_type to the appropriate resolved value or delete the row from suspect_pages to keep the active error list accurate and preserve capacity for new entries.
  • Pair the detection script with DBCC PAGE (database_name, file_id, page_id, 3) to inspect the raw contents of a flagged page and determine whether it is a data or index page — this determines whether REPAIR_ALLOW_DATA_LOSS is acceptable or whether a restore is the only sound option.
  • When error_count exceeds 3 on any single page, treat it as a hardware signal rather than a software one: escalate to storage-team review before attempting DBCC repair, because additional writes to a failing sector may widen corruption.
  • Cross-reference flagged database_id values against your backup schedule immediately — a database with active suspect pages and no recent clean backup is at acute risk. Confirm a restore point exists before running any repair operation.

Conclusion

msdb.dbo.suspect_pages is the most direct window into active page-level corruption across a SQL Server instance: a structured, queryable table that records each I/O failure event, survives service restarts, and accumulates repeat-failure counts that distinguish deteriorating hardware from transient errors. Querying it with the detection script above — filtering to active event types and joining to sys.master_files for physical file paths — provides the fast first-pass that every corruption-response workflow should begin with. Combined with a scheduled Agent job that alerts on any new rows, it closes the monitoring gap that allows page corruption to accumulate undetected.

References

Posts in this series