Microsoft: How to Troubleshoot Windows Server 2008 R2 & 2012 Failover Clusters
I was configuring a Windows Server 2012 cluster on Xen Cloud Platform and found this troubleshooting guide on a Microsoft blog. It is really useful and makes it easy to track down an error.
How to get to the root of the problem
In "Troubleshooting Windows Server 2008 R2 Failover Clusters," I discussed the locations and tips for where you can go to get the data you need to troubleshoot a problem. Now I'll discuss some of the improvements made to the troubleshooting tools for Windows Server 2012 failover clusters and show you how to take advantage of those tools.
Introducing the New Event Channels
There are some new event channels for failover clustering to help with troubleshooting. Figure 1 shows all the available channels.
Note that the events are specific to the node you're on.
Knowing the purpose of each event channel can help you find the errors more quickly, which in turn will help you troubleshoot the problem more quickly. Here's an explanation of each channel:
- FailoverClustering
o Diagnostic. This is the main log. It's circular in nature and runs anytime the cluster service starts. Its events can be read in the Event Viewer only while logging is disabled. They can also be converted to text file format.
o Operational. Any informational cluster events are registered in this log, such as groups moving, going online, or going offline.
o Performance-CSV. This channel is used to collect information pertaining to the functionality of Cluster Shared Volumes (CSVs).
- FailoverClustering-Client
o Diagnostic. This channel collects Cluster API trace logging. This log can be useful in troubleshooting the Create Cluster and Add Node Cluster actions.
- FailoverClustering-CsvFlt (new in Server 2012)
o Diagnostic. This channel collects trace logging for the CSV Filter Driver (CsvFlt.sys) that is mounted only on the coordinator node for a CSV. This channel provides information regarding metadata operations and redirected I/O operations.
- FailoverClustering-CsvFs (new in Server 2012)
o Diagnostic. This channel collects trace logging for the CSV File System Driver (CsvFs.sys), which is mounted on all nodes in the cluster. This channel provides information regarding direct I/O operations.
- FailoverClustering-Manager
o Admin. This channel collects errors associated with dialog boxes and pop-up warnings that are displayed in Failover Cluster Manager.
- FailoverClustering-WMIProvider
o Admin. This channel collects events associated with the Failover Cluster WMI provider.
o Diagnostic. This channel collects trace logging associated with the Failover Cluster WMI provider. It can be useful when troubleshooting Windows Management Instrumentation (WMI) scripts or Microsoft System Center applications.
Using the FailoverClustering-Client/Diagnostic Channel
Because administrators often encounter problems when creating clusters and joining nodes, I want to show you how to use the FailoverClustering-Client/Diagnostic channel. This channel is disabled by default, so it won't be collecting any data. To enable it, you need to right-click the channel and choose Enable Log. The Diagnostic channel will then start collecting data relevant to a join or create operation.
For example, suppose you previously enabled the Diagnostic channel and you're having a problem creating a cluster. To view the data collected, you need to right-click the channel and choose Disable Log. In the FailoverClustering-Client/Diagnostic event log, you see the following events:
Event ID: 2
Level: Error
Description: CreateCluster (1883): Create cluster failed
with exception. Error = 8202, msg: Failed to create
cluster name CLUSTER on DC \\DC.CONTOSO.COM. Error 8202.
Event ID: 2
Level: Error
Description: CreateClusterNameCOIfNotExists (6879): Failed
to create computer object CLUSTER on DC \\DC.CONTOSO.COM
with OU ou=Clusters,dc=contoso,dc=com. Error 8202.
Because you have errors, you can use the Net.exe command to see what their status code (8202) means:
NET HELPMSG 8202
The command returns the message: The specified directory service attribute or value does not exist. With the new features of Server 2012 Failover Clustering, the cluster will be created in the same organizational unit (OU) as the nodes. For the cluster name to be created, the logged-on user must have at least Read and Create Computer Objects permissions. If the user doesn't have those rights, the name won't be created and you'll receive this type of error.
Now suppose you're trying to add a node to the existing cluster and the operation fails. You review the events in the FailoverClustering-Client/Diagnostic log, and see the following:
Event ID: 56
Level: Warning
Description: AsyncNotificationCallback (1463): ApiGetNotify
on hNotify=0x0000000021EBCDC0 returns 1 with rpc_error 0
Event ID: 2
Level: Error
Description: SCMStateNotify (837): Repost of
NotifyServiceStatusChange failed for node
'NodeX': status = 1168
Although their wording is a bit on the cryptic side, the descriptions give you the information that you need. The description for the first event tells you that a remote procedure call (RPC) error occurred. The description for the second event gives you a status code of 1168. Once again, you can use the Net.exe command to see what that status code means:
NET HELPMSG 1168
This time, the command returns the message: Element not found. When a node tries to join a cluster, the running cluster node needs to make an RPC connection to the node being added. In this case, it couldn't find the node.
So, from the information returned by the two events, you can deduce that the running cluster node can't make an RPC connection to the node being added because it can't find that node. After further investigation, you discover that the DNS server has an incorrect IP address for the node being added. After you correct the IP address, the node successfully joins the cluster.
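The status codes in these event descriptions are written in a few different ways (Error = 8202, status = 1168). As a minimal sketch in Python, the following pulls the numeric codes out of a description so that each one can be passed to NET HELPMSG. The function name, the regular expression, and the lookup table are assumptions made for illustration. The table holds only the two messages quoted in this article.

```python
import re

# Messages quoted in this article for two status codes (not a complete list).
KNOWN_MESSAGES = {
    8202: "The specified directory service attribute or value does not exist.",
    1168: "Element not found.",
}

# Matches the spellings seen in this article: "Error = 8202", "Error 8202",
# "status = 1168", "status is 1168", "status 87". It skips "rpc_error 0".
STATUS_PATTERN = re.compile(r"\b(?:error|status)\s*(?:=|is)?\s*(\d+)", re.IGNORECASE)


def extract_status_codes(description):
    """Return the distinct numeric status codes in an event description, in order."""
    codes = []
    for match in STATUS_PATTERN.finditer(description):
        code = int(match.group(1))
        if code not in codes:
            codes.append(code)
    return codes


description = ("SCMStateNotify (837): Repost of NotifyServiceStatusChange "
               "failed for node 'NodeX': status = 1168")
for code in extract_status_codes(description):
    # Fall back to a reminder to run NET HELPMSG for codes not in the table.
    print(code, KNOWN_MESSAGES.get(code, "run NET HELPMSG %d" % code))
# prints: 1168 Element not found.
```

The numbers in parentheses after the function names, such as (837), are left alone because they aren't preceded by "Error" or "status".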
Introducing the New Tests in the Validate a Configuration Wizard
Another helpful troubleshooting tool that you can use is the Validate a Configuration Wizard in Failover Cluster Manager. Several new clustering tests have been added in Server 2012. In the original article, the tests that are new in Server 2012 are shown in bold. The full list of tests follows:
- Hyper-V (available only if the Hyper-V Role is installed)
o List Hyper-V Virtual Machine Information
o List Information About Servers Running Hyper-V
o Validate Compatibility of Virtual Fibre Channel SANs for Hyper-V
o Validate Firewall Rules for Hyper-V Replica Are Enabled
o Validate Hyper-V Integration Services Version
o Validate Hyper-V Memory Resource Pool Compatibility
o Validate Hyper-V Network Resource Pool and Virtual Switch Compatibility
o Validate Hyper-V Processor Pool Compatibility
o Validate Hyper-V Role Installed
o Validate Hyper-V Storage Resource Pool Compatibility
o Validate Hyper-V Virtual Machine Network Configuration
o Validate Hyper-V Virtual Machine Storage Configuration
o Validate Matching Processor Manufacturers
o Validate Network Listeners Are Running
o Validate Replica Server Settings
- Cluster Configuration (available only if a cluster is running)
o List Cluster Core Groups
o List Cluster Network Information
o List Cluster Resources
o List Cluster Volumes
o List Clustered Roles
o Validate Quorum Configuration
o Validate Resource Status
o Validate Service Principal Name
o Validate Volume Consistency
- Inventory
o Storage
    - List Fibre Channel Host Bus Adapters
    - List iSCSI Host Bus Adapters
    - List SAS Host Bus Adapters
o System
    - List BIOS Information
    - List Environment Variables
    - List Memory Information
    - List Operating System Information
    - List Plug and Play Devices
    - List Running Processes
    - List Services Information
    - List Software Updates
    - List System Drivers
    - List System Information
    - List Unsigned Drivers
- Network
o List Network Binding Order
o Validate Cluster Network Configuration
o Validate IP Configuration
o Validate Network Communications
o Validate Windows Firewall Configuration
- Storage
o List Disks
o List Potential Cluster Disks
o Validate CSV Network Bindings
o Validate CSV Settings
o Validate Disk Access Latency
o Validate Disk Arbitration
o Validate Disk Failover
o Validate File System
o Validate Microsoft MPIO-Based Disks
o Validate Multiple Arbitration
o Validate SCSI device Vital Product Data (VPD)
o Validate SCSI-3 Persistent Reservation
o Validate Simultaneous Failover
o Validate Storage Spaces Persistent Reservation
- System Configuration
o Validate Active Directory Configuration
o Validate All Drivers Signed
o Validate Memory Dump Settings
o Validate Operating System Edition
o Validate Operating System Installation Option
o Validate Operating System Version
o Validate Required Services
o Validate Same Processor Architecture
o Validate Service Pack Levels
o Validate Software Update Levels
Except for the Storage tests, all the tests can be run at any time because they aren't disruptive to the cluster.
Using the Validate a Configuration Wizard
Let's explore how to take advantage of the Validate a Configuration Wizard. Using the previous example of the problem related to adding a node, let's say that the DNS server had the proper IP address and you can connect between the nodes outside the cluster. In this case, you can run the Validate a Configuration Wizard.
When you run the wizard, the Network/Validate Windows Firewall Configuration test fails. This test specifically looks at the Windows Firewall settings to ensure that port 3343, which is used by the cluster, hasn't been disabled. When this port is disabled, all communications on that port are blocked and you get errors in the Diagnostic channel.
Introducing the New Get-ClusterLog Command Switch
The Windows PowerShell command Get-ClusterLog lets you convert the events in a channel (e.g., FailoverClustering/Diagnostic) to a text file format. PowerShell will name the text file Cluster.log and place it in the C:\Windows\Cluster\Reports folder. If you run the command by itself, each node will have its own Cluster.log file. You can use switches to change this default behavior. Here are the switches, including the new -UseLocalTime switch:
- -Cluster <string>, where <string> is the name of the cluster you want to run the command against. This allows you to specify a remote cluster. If you omit the switch, it will run against the cluster you're currently on.
- -Node <string>, where <string> is the name of the node you want to run the command against. You use this switch when you want to generate the Cluster.log file for a certain node only.
- -Destination <string>, where <string> is the folder to which you want to copy the Cluster.log files. If you include this switch, PowerShell will not only create a Cluster.log in each node's C:\Windows\Cluster\Reports folder but also copy all of the log files to the specified destination folder. This switch will add the node's name as part of the filename (e.g., Node1_Cluster.log, Node2_Cluster.log) for the log files copied to the destination folder. This way, each node's log files are easily identifiable.
- -TimeSpan <uint32>. You use this switch if you just want to get a log file that spans the last specified number of minutes, where <uint32> is that number (e.g., 5). This will give you a much smaller log file to review. You can use this switch if you're trying to reproduce an error. For example, you can reproduce the error you think might be occurring, then generate the log for the last 5 minutes to see if that's the case.
- -UseLocalTime. As mentioned previously, this switch is new in Server 2012. Clusters write all their information in GMT. For example, if you have a cluster that's in the GMT-5 time zone and your local time is 13:00 (1:00 p.m.), Cluster.log will show a time of 18:00 (6:00 p.m.) by default. With this switch, the local time is used, so the log will show a time of 13:00. When you use the -UseLocalTime switch, the times returned by the Get-ClusterLog command can easily be matched with the Event Log times.
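The time arithmetic behind -UseLocalTime can be sketched in a few lines of Python, which is handy when you have a log that was generated without the switch. This is only an illustration: the function name is hypothetical, the timestamp layout is assumed to be the one used in Cluster.log entries (year/month/day-hour:minute:second.milliseconds), and a fixed offset is used, so daylight saving time is not taken into account.

```python
from datetime import datetime, timedelta, timezone


def gmt_to_local(gmt_text, utc_offset_hours):
    """Convert a 'YYYY/MM/DD-HH:MM:SS.mmm' GMT timestamp to a fixed-offset local time."""
    gmt = datetime.strptime(gmt_text, "%Y/%m/%d-%H:%M:%S.%f")
    gmt = gmt.replace(tzinfo=timezone.utc)
    return gmt.astimezone(timezone(timedelta(hours=utc_offset_hours)))


# The example from the text: 18:00 in the log, cluster in the GMT-5 time zone.
local = gmt_to_local("2013/05/15-18:00:00.000", -5)
print(local.strftime("%Y/%m/%d %H:%M"))  # prints: 2013/05/15 13:00
```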
Now that you know how to get Cluster.log files, it's time to learn how to read and search through them.
Reading Cluster.log Files
Reading Cluster.log files takes a long time to master, because they contain a lot of information that can be confusing. However, I'll give you some tips that can help you get started.
The first thing you need to understand is the anatomy of a Cluster.log file. Every entry has the same basic structure. Here's an entry for an IP address resource coming online:
00000bb8.000001d4::2013/05/15-01:13:24.852
INFO [RES] IP Address <IP Address 1.1.1.1>:
Online: Opened object handle for netinterface
353c85ee-7ea7-4b2a-927d-1538dffcdecd
Let's break this entry down into smaller pieces to make better sense of it:
00000bb8. This is the process ID in hexadecimal notation. Typically, the process is the Resource Hosting Subsystem (RHS). You can see what resources the process is using by sorting or searching for the lines that include this process ID. This is useful when debugging an RHS dump if you have multiple files present. Each of these dumps is identified by a process ID, so knowing what the process ID is will ensure that you're working with the correct process dump. If you have a complete memory dump, there will be multiple RHS processes. Each is identified by the ID, so you can get to the correct one.
000001d4. This is the thread ID in hex notation. You can see what the thread is doing by sorting or searching for lines that include this thread ID. When you're debugging an RHS process that has 100 threads, you can jump right to the correct one using this ID.
2013/05/15-01:13:24.852. This is the date and time in GMT (unless the -UseLocalTime switch was used to generate the log). So if you're using GMT-5, the local time in this case is May 14, 2013, at 8:13 p.m. The time goes down to milliseconds.
INFO. This is the level of the entry. The level can be INFO (informational), WARN (warning), ERR (error), or DBG (debug). There are a few others, but these levels are what you'll see the majority of the time. Generally, a line with ERR in it indicates a problem with a resource. When you open a Cluster.log file after a failure, you can search for a specific level to try to get to the problem area quicker.
[RES] IP Address. This is the resource type. A resource will always identify its type in the log. With this information, you can more quickly follow the resource going online when there are multiple types of resources all coming online at the same time.
<IP Address 1.1.1.1>. This is the actual resource, as shown in Failover Cluster Manager.
Online: Opened object handle for netinterface 353c85ee-7ea7-4b2a-927d-1538dffcdecd. This is a description of what's going on with the resource. What's going on here is that the resource is opening a handle to the network card driver in order to bind the IP address to it. If it fails here, it's most likely a problem with the network card driver not accepting anything, which means it's bad. Alternatively, the network card might have died. Your next step would be to review the System event log entries to check for any network type events, such as the network going down or a card failing. With many of the descriptions, the more you see them, the more you'll understand what they mean. A description can be particularly helpful if it's describing the last action that occurred before a failure.
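The breakdown above can be expressed as a small parser. The following Python sketch splits one entry into the pieces just described and converts the hexadecimal IDs to decimal, which is the form most tools display. It assumes that each entry sits on a single line in the actual file (the sample in this article is wrapped for display). The pattern and the names LogEntry and parse_entry are illustrative, not part of any Microsoft tool.

```python
import re
from typing import NamedTuple

# pid.tid::timestamp LEVEL message, as described in the breakdown above.
ENTRY_PATTERN = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    process_id: int   # decimal value of the hex process ID
    thread_id: int    # decimal value of the hex thread ID
    timestamp: str    # GMT unless the log was generated with -UseLocalTime
    level: str        # INFO, WARN, ERR, DBG, ...
    message: str      # resource type, resource name, and description


def parse_entry(line):
    """Parse one Cluster.log line; return None if it doesn't match the layout."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        int(match["pid"], 16),
        int(match["tid"], 16),
        match["timestamp"],
        match["level"],
        match["message"],
    )


entry = parse_entry(
    "00000bb8.000001d4::2013/05/15-01:13:24.852 INFO  [RES] IP Address "
    "<IP Address 1.1.1.1>: Online: Opened object handle for netinterface "
    "353c85ee-7ea7-4b2a-927d-1538dffcdecd"
)
print(entry.process_id, entry.thread_id, entry.level)  # prints: 3000 468 INFO
```

Once entries are parsed this way, filtering by level or by thread ID becomes a simple comparison rather than a text search.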
Searching Cluster.log Files
When reviewing Cluster.log files, it helps to search for keywords. Table 1 provides a list of keywords that I use when searching for resources.
Keyword | Description |
---|---|
-->OnlinePending | This keyword appears in the log the second that Failover Cluster Manager displays Online Pending for a resource. This is where your search should start if you want to follow a resource coming online. |
-->OfflinePending | This keyword appears in the log the second that Failover Cluster Manager displays Offline Pending for a resource. This is where your search should start if you want to follow a resource going offline. |
-->Offline | This keyword appears in the log when Failover Cluster Manager displays Offline for a resource. So if you were following the resource, there's no need to look further. If this resource depends on another resource, that other resource can then start its offline process. |
-->Online | This keyword appears in the log when Failover Cluster Manager displays Online for a resource. So if you were following the resource, there's no need to look further. If another resource depends on this resource, that other resource won't start its online process until this one completes. |
-->ProcessingFailure | This keyword appears in the log when a resource has just failed. If you find this line, start looking at the previous entries to see what triggered the failure. Looking at entries after this event isn't necessary. Anytime a resource fails, the cluster still tries to take it through the normal offline process, even though you'll most likely get errors because the resource is unavailable. |
Note that you should type these keywords exactly as you see them. In other words, include the two hyphens and the greater-than symbol (-->) and don't include any spaces.
You can also use these keywords to quickly determine how long a resource took to go offline or come online. For example, suppose that a group is taking longer than normal to come online. You can use -->OfflinePending as a starting point, then use -->Offline for all resources in the group. Alternatively, you can use -->OnlinePending followed by -->Online. For each resource, add up all the times to see how long it took to come online. After you've done that for all the resources, you can compare the resources' total times to see which resource took the longest amount of time. You can then review the entries in Cluster.log to determine why. For example, if a group took 30 seconds total to come online on one node and 3 minutes total to come online on another node, you should generate Cluster.log files for both nodes and compare them.
You can use the same keywords for groups, except that there must be a space after the greater-than symbol. For example, if a group goes offline, you would first use --> OfflinePending, followed by --> Offline. The only other difference between the resource entry and the group entry is that the group failure is --> Failed, whereas the resource failure is -->ProcessingFailure.
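The timing technique described above can be automated. The Python sketch below measures the time between a resource's -->OnlinePending entry and its next -->Online entry. The sample lines are hypothetical. They are modeled on the TransitionToState entry shown later in this article and assume one entry per line. The function names are illustrative, and the resource is matched by a plain substring test, so a name that is a prefix of another name would need a stricter match.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})")


def entry_time(line):
    """Return the entry's timestamp as a datetime, or None if there isn't one."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")


def transition_seconds(lines, resource, start="OnlinePending", end="Online"):
    """Seconds from the first '-->start' entry to the next '-->end' entry."""
    # The trailing \b keeps "-->Online" from matching "-->OnlinePending".
    start_re = re.compile(r"-->" + re.escape(start) + r"\b")
    end_re = re.compile(r"-->" + re.escape(end) + r"\b")
    started = None
    for line in lines:
        if resource not in line:
            continue
        stamp = entry_time(line)
        if stamp is None:
            continue
        if started is None:
            if start_re.search(line):
                started = stamp
        elif end_re.search(line):
            return (stamp - started).total_seconds()
    return None


# Hypothetical entries, for illustration only.
SAMPLE = [
    "00000bb8.000001d4::2013/05/15-01:13:24.852 INFO  [RCM] "
    "TransitionToState(IP Address 1.1.1.1) Offline-->OnlinePending.",
    "00000bb8.000001d4::2013/05/15-01:13:25.100 INFO  [RES] IP Address "
    "<IP Address 1.1.1.1>: Online: Opened object handle for netinterface",
    "00000bb8.000001d4::2013/05/15-01:13:27.352 INFO  [RCM] "
    "TransitionToState(IP Address 1.1.1.1) OnlinePending-->Online.",
]
print(transition_seconds(SAMPLE, "IP Address 1.1.1.1"))  # prints: 2.5
```

Passing start="OfflinePending" and end="Offline" measures the offline side instead. Group entries would need patterns that allow the space after the greater-than symbol.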
Putting It All Together
To see how all the information presented fits together, let's walk through solving a problem. Suppose that you have a two-node cluster configured with multiple file servers using different networks and a Fibre Channel-connected SAN. Here's the setup for the networks:
- Cluster Network 1 = IP scheme 192.168.0.0/24
- Cluster Network 2 = IP scheme 1.0.0.0/8
- Cluster Network 3 = IP scheme 172.168.0.0/16
In the nodes' network connections, the network adapters are identified as:
- CORPNET = IP scheme 192.168.0.0/24
- MGMT = IP scheme 1.0.0.0/8
- BACKUP = IP scheme 172.168.0.0/16
FILESERVER1 is using Cluster Network 1, which is running on NODE1. FILESERVER2 is using Cluster Network 2, which is running on NODE2.
Let's say that there was a failure overnight and a file server group named FILESERVER2 was moved from NODE2 to NODE1. You need to find out why the failure occurred.
The first thing you do is open Failover Cluster Manager, right-click the FILESERVER2 group, and select Show Critical Events. This brings up the following events:
Event ID: 1069
Description: Cluster Resource 'IP Address 1.1.1.1' of
type 'IP Address' in Clustered Role 'FILESERVER' failed.
Event ID: 1205
Description: The Cluster service failed to bring clustered
service or application 'FILESERVER2' completely online or
offline. One or more resources may be in a failed state.
The first event tells you that IP Address 1.1.1.1 had a failure. So, you right-click this resource in Failover Cluster Manager and choose Show Critical Events. You see these events:
Event ID: 1077
Description: Health check for IP Interface
'IP Address 1.1.1.1' (address 1.1.1.1) failed (status is
1168). Run the Validate a Configuration wizard to ensure
that the network adapter is functioning properly.
Event ID: 1069
Description: Cluster Resource 'IP Address 1.1.1.1' of
type 'IP Address' in Clustered Role 'FILESERVER' failed.
Based on the description in the first event (event 1077), you decide to use the Validate a Configuration Wizard. You want to run only the Network/Validate Network Communication test because that test will check the adapters and all network paths between the nodes.
After you run the Network/Validate Network Communication test, you check the test report. You don't see any errors or warnings, so you put it aside.
There are event channels you can review, so you go into the FailoverClustering/Operational channel, where you see this event:
Event ID: 1153
Description: The Cluster service is attempting to failover
the clustered service or application 'FILESERVER2' from
node 'NODE2' to node 'NODE1'
Because of this description, you go into the FailoverClustering/Diagnostic channel, where you see these events:
Event ID: 2051
Description: [RCM] rcm::RcmResource::HandleFailure:
(IP Address 1.1.1.1)
Event ID: 2051
Description: [RES] IP Address <IP Address 1.1.1.1>:
Failed to query properties of adapter id
F3EDD1C8-6984-82BC-498806B841CA, status 87.
Based on this information, you generate a Cluster.log file for this node. In the log, you search for -->ProcessingFailure and find these entries:
[RES] IP Address <IP Address 1.1.1.1>: IP Interface
3600A8C0 failed LooksAlive check, status 1168.
[RES] IP Address <IP Address 1.1.1.1>: IP Interface
3600A8C0 failed IsAlive check, status 1168.
[RHS] Resource IP Address 1.1.1.1 has indicated failure.
[RCM] Res IP Address 1.1.1.1: Online -> ProcessingFailure
( State Unknown )
[RCM] TransitionToState( IP Address 1.1.1.1)
Online-->ProcessingFailure.
A bit later in Cluster.log, you see the entries documenting that the group was being moved. This is a good indication that the entries found with the -->ProcessingFailure search are related to the problem that caused the group to be moved. Because of the errors seen in those entries, you know for sure that the IP address resource failed. To find out what the errors' status code means, you use the Net.exe command:
NET HELPMSG 1168
The command returns the message: Element not found. After looking more closely at the entries, it appears as though the actual problem might be with the network adapter. So, you run some hardware tests against the adapters and find that one adapter is faulty and not even showing up in Windows anymore. Replacing the faulty adapter is the course of action to fix the problem.
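The -->ProcessingFailure search used in this walkthrough, together with the advice to read the entries that come before the failure, can be sketched in Python as follows. The function name and the default of five preceding lines are assumptions. The sample lines are the entries shown above, joined onto single lines and without their process ID and timestamp prefix.

```python
def failure_context(lines, keyword="-->ProcessingFailure", before=5):
    """Return (failure_line, preceding_lines) for each line containing keyword."""
    results = []
    for index, line in enumerate(lines):
        if keyword in line:
            start = max(0, index - before)
            results.append((line, lines[start:index]))
    return results


LOG_LINES = [
    "[RES] IP Address <IP Address 1.1.1.1>: IP Interface 3600A8C0 failed LooksAlive check, status 1168.",
    "[RES] IP Address <IP Address 1.1.1.1>: IP Interface 3600A8C0 failed IsAlive check, status 1168.",
    "[RHS] Resource IP Address 1.1.1.1 has indicated failure.",
    "[RCM] Res IP Address 1.1.1.1: Online -> ProcessingFailure ( State Unknown )",
    "[RCM] TransitionToState( IP Address 1.1.1.1) Online-->ProcessingFailure.",
]

for failure, context in failure_context(LOG_LINES, before=3):
    # Print the lead-up first, then the failure transition itself.
    for line in context:
        print("   ", line)
    print(">>>", failure)
```

Only the last sample line matches, because the keyword is typed exactly with two hyphens and no spaces. The line containing "Online -> ProcessingFailure" is returned as context, not as a match.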
But there's still the question of why the Network/Validate Network Communication test results didn't show any errors when everything else did. This test checks all network adapters, going from one node to another, whether they're on the same network or not. It does this so that it knows all the routes it can take to get to the other nodes. So, there are some expected failures just because of the way the networks between the nodes are cabled or segmented.
You decide to look more closely at the test report. That's when you spot the output shown in Figure 2.
You notice that NODE1 doesn't have a network adapter defined as MGMT. This is basically saying the same thing as the events, which is that NODE1 has two networks and NODE2 has three networks. So, the lesson here is that you need to do more than just look at the errors or warnings at the top of the report. You also need to look at the test results.
Get to the Root of the Problem
Troubleshooting a cluster is like troubleshooting just about anything. There are different ways to troubleshoot and multiple things to look at in order to get to a problem's root cause. I presented one way to get to the root cause, and I hope you're able to use it when troubleshooting problems in your clusters. For more information pertaining to failover clustering, check out the Ask the Core Team blog site and the Clustering and High Availability blog site.