
Error Loading Direct Access Configuration

Published on Wednesday, October 1, 2014

This morning I wanted to have a quick look at our Direct Access infrastructure, and when I opened the console I was greeted with various errors, all indicating a configuration load error. The first one read:

ICMP settings for entry point cannot be determined.

And the second:

Settings for entry point Load Balanced Cluster cannot be retrieved. The WinRM client cannot process the request. It cannot determine the content type of the HTTP response from the destination computer. The content type is absent or invalid.

Because initially I only stumbled upon the ICMP settings … error, I had to dig a bit deeper. I searched online for a way to enable additional tracing, but couldn’t find anything. From reverse engineering RaMgmtUI.exe I could see that more than enough tracing was available. Here’s how to enable it:


Create a REG_DWORD called DebugFlag below HKLM\SYSTEM\CurrentControlSet\Services\RaMgmtSvc\Parameters. For our purpose we’ll give it a value of 8; other values enable different levels of tracing.
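If you prefer to script it, here’s a minimal PowerShell sketch that creates the value (run it elevated; it assumes the Parameters key already exists):

# Create the DebugFlag value that enables the Remote Access management tracing
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\RaMgmtSvc\Parameters'
New-ItemProperty -Path $params -Name DebugFlag -PropertyType DWord -Value 8 -Force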

I’m not sure whether those trace levels can be combined in some way. After finding this registry key, I was able to find the official article on how to do this: TechNet: Troubleshooting DirectAccess. Perhaps I should have looked a bit harder for that information… After closing the Remote Access Management Console and opening it again, the log file started filling up.


You can find the trace file in c:\Windows\Tracing; it’s called RaMgmtUIMon.txt. After opening the file I stumbled across the following error:

2112, 1: 2014-09-30 11:51:43.116 Instrumentation: [RaGlobalConfiguration.AsyncRefresh()] Exit
2112, 12: 2014-09-30 11:51:43.241 ERROR: The WinRM client cannot process the request. It cannot determine the content type of the HTTP response from the destination computer. The content type is absent or invalid.
2112, 9: 2014-09-30 11:51:43.242 Failed to run Get-CimInstance

I then used PowerShell to try to do the same: connect to the other DA node using WinRM:


The command: winrm get winrm/config -r:HOSTNAME. The error:

WSManFault
    Message = The WinRM client cannot process the request. It cannot determine the content type of the HTTP response from the destination computer. The content type is absent or invalid.

Error number:  -2144108297 0x803380F7
The WinRM client cannot process the request. It cannot determine the content type of the HTTP response from the destination computer. The content type is absent or invalid.
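Incidentally, if you prefer staying in PowerShell, Test-WSMan against the other node is a quick way to run a similar check (HOSTNAME is a placeholder, as above):

# Authenticated WinRM check against the remote DA node; with a bloated Kerberos token
# this should fail in a similar way, while it works fine for accounts with few groups
Test-WSMan -ComputerName HOSTNAME -Authentication Default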

Googling the error number 2144108297 quickly got me to a couple of articles explaining the cause.

Basically I was running into this issue because my AD user account was a member of a large number of groups. MaxTokenSize has been raised in Windows 2012 (R2), so that part is already covered, but the defaults for http.sys, which WinRM depends on, haven’t kept pace. When running into Kerberos token bloat issues on web applications, the MaxRequestBytes and MaxFieldLength values typically have to be tweaked a bit.


There are various ways to configure these: through a GPO or with a manual .reg file that you can simply double-click:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HTTP\Parameters]
"MaxRequestBytes"=dword:0000a000
"MaxFieldLength"=dword:0000ff37

In my environment I’ve set MaxRequestBytes to 40960 and MaxFieldLength to 65335, but I am by no means saying those are the recommended values. It’s better to start with lower values and increase them gradually until you’re good to go.
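For completeness, the same change applied from PowerShell instead of a .reg file (a sketch using the values from my environment; tune them for yours and reboot afterwards so http.sys picks them up):

# Raise the http.sys header limits that large Kerberos tokens run into
$http = 'HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters'
New-ItemProperty -Path $http -Name MaxRequestBytes -PropertyType DWord -Value 40960 -Force
New-ItemProperty -Path $http -Name MaxFieldLength -PropertyType DWord -Value 65335 -Force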

Conclusion: if you run into any of the above errors when using the Direct Access management console, make sure to check whether WinRM is happy. In my case WinRM was in trouble due to the size of my token.


S.DS.AM GetAuthorizationGroups() Fails on Windows 2008 R2/WIN7

Published on Thursday, September 25, 2014

Today I got a call from a colleague asking me to assist with an issue. His customer had a Windows 2008 R2 server with a custom .NET application on it. The application stopped working from time to time; after a reboot it would work again for a while.

The logging showed a stack trace that started at UserPrincipal.GetAuthorizationGroups and gave the message: An error (1301) occurred while enumerating the groups. The group's SID could not be resolved.

Exception information:
Exception type: PrincipalOperationException
Exception message: An error (1301) occurred while enumerating the groups. 
The group's SID could not be resolved.

at System.DirectoryServices.AccountManagement.SidList.TranslateSids(String target, IntPtr[] pSids)
at System.DirectoryServices.AccountManagement.SidList..ctor(SID_AND_ATTR[] sidAndAttr)
at System.DirectoryServices.AccountManagement.AuthZSet..ctor(Byte[] userSid, NetCred credentials, ContextOptions contextOptions, String flatUserAuthority, StoreCtx userStoreCtx, Object userCtxBase)
at System.DirectoryServices.AccountManagement.ADStoreCtx.GetGroupsMemberOfAZ(Principal p)
at System.DirectoryServices.AccountManagement.UserPrincipal.GetAuthorizationGroups()

The first thing that came to mind was that they deleted some groups and that the application wasn’t properly handling that. But they assured me that was not the case. The only thing they had changed that came to mind was adding a Windows 2012 domain controller.

I could easily reproduce the issue using PowerShell:

Function Get-UserPrincipal($cName, $cContainer, $userName){
    $dsam = "System.DirectoryServices.AccountManagement"
    $rtn = [reflection.assembly]::LoadWithPartialName($dsam)
    $cType = "domain" #context type
    $iType = "SamAccountName"
    $dsamUserPrincipal = "$dsam.userPrincipal" -as [type]
    $principalContext =
        new-object "$dsam.PrincipalContext"($cType,$cName,$cContainer)
    $dsamUserPrincipal::FindByIdentity($principalContext,$iType,$userName)
}

[string]$userName = "thomas"
[string]$cName = "contoso"
[string]$cContainer = "dc=contoso,dc=local"

$userPrincipal = Get-UserPrincipal -userName $userName `
    -cName $cName -cContainer $cContainer

$userPrincipal.getGroups()

Source: Hey Scripting Guy
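The function above comes from that Scripting Guy post. To hit the exact code path the customer's application uses, you can call GetAuthorizationGroups() on the returned principal instead of GetGroups(); a minimal sketch:

# GetAuthorizationGroups() is the call that threw the PrincipalOperationException (error 1301)
$userPrincipal.GetAuthorizationGroups() | Select-Object SamAccountName, Sid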

Some googling led me to the relevant articles.

In short, it seems that when a 2012 domain controller was involved, the GetAuthorizationGroups() function would choke on two new groups (SIDs) that are added to a user by default. Patching the server running the application was enough to fix this.

The issue wasn’t really hard to solve, as the solution was easy to find online, but I think it’s a great example of the type of application/code to give special attention when you’re testing your AD upgrade.


Active Directory: Lsass.exe High CPU Usage

Published on Thursday, September 18, 2014

Recently I had an intriguing issue at one of the customers I frequently visit. They saw their domain controllers having a CPU usage way above what we normally saw: typically their domain controllers sit between 0 – 15% CPU usage, whereas now all of them were running at 80 – 90% during business hours. A quick look showed us that the process responsible for all that CPU was lsass.exe. Lsass.exe handles all kinds of requests towards Active Directory. If you want you can skip to the end to find the cause, but I’ll write this rather lengthy post nevertheless so that others can learn from the steps I took before finding the answer.

Some background: the environment consists of approximately 20,000 users. Technologies in the mix: Windows 7, Windows 2008 R2, SCCM, Exchange, … Our domain controllers are virtual, run Windows 2008 R2 x64 SP1, and have 4 vCPUs and 16 GB RAM. The RAM is a bit oversized, but CPU is really the issue here. There are 8 domain controllers.

On a more or less busy domain controller it wasn’t always a full 80%, but it was definitely much more than we were used to seeing, and Task Manager showed without a doubt that lsass.exe was mainly responsible.


Whenever a domain controller (or any server) is busy and you are trying to find the cause, it’s a good idea to start with the built-in tool Perfmon. In articles from the Windows 2003 timeframe you’ll see SPA (Server Performance Advisor) mentioned a lot, but for Windows 2008 R2 and up you don’t need that separate download anymore; its functionality is included in Perfmon. Just go to Start > Run > Perfmon.


Open up Data Collector Sets > System, right-click Active Directory Diagnostics and choose Start. By default it will collect data for 5 minutes and then compile a nice HTML report for you. And that’s where things went wrong for me: the compiling seemed to take a lot of time (20-30 minutes) and after that I ended up with no performance data and no report. I guess the amount of data to be processed was just too much. I found a workaround to get the report though:

While it is compiling, copy the folder mentioned under “Output” to a temporary location. In my example that was C:\PerfLogs\ADDS\20140909-0001.

If the report fails to compile you will see that the folder gets emptied. However, we can try to compile the report from the data we copied by executing the following command:

  • tracerpt *.blg *.etl -df RPT3870.tmp -report report.html -f html

The .tmp file seems to be crucial for the command, and it should be present in the folder you copied. In the resulting report you’ll again see that lsass.exe is using a lot of CPU.


The Active Directory section of the report is pretty cool and has a lot of information. For several categories you can see the top CPU-heavy queries/processes.


One of those views is an overview of the heaviest LDAP queries; I had to erase quite some information there, as I try to avoid sharing customer-specific details.


However in our case, for all of the possible tasks/categories, nothing stood out, so the search went on. We took a network trace on a domain controller using the built-in tracing capabilities (Setspn: Network Tracing Awesomeness):

  • Netsh trace start capture=yes
  • Netsh trace stop

In our environment, taking a trace of 20 minutes or of 1 minute gave the same result: due to the large amount of traffic passing by, only about 1 minute of tracing data was available in the file. The file was approximately 130 MB.
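If you need a longer capture window, netsh trace accepts extra parameters to control the output file; a sketch (the path and size below are arbitrary examples):

  • Netsh trace start capture=yes tracefile=C:\temp\dc01.etl maxsize=1024
  • Netsh trace stop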

Using Microsoft Network Monitor (Microsoft.com: Network Monitor 3.4) the .etl file from the trace can be opened and analyzed. However, 1 minute of tracing contained about 328041 frames (in my case). In order to find a needle in this haystack I used the Top Users expert (a Network Monitor plugin: Codeplex: NMTopUsers). This plugin gives you an overview of all IPs that communicated and how much data/how many packets they exchanged. It’s an easy way to see which IP is communicating more than average. If that’s a server like Exchange it might be legit, but if it’s a client something might be off.

As a starting point I took an IP with many packets:


I filtered the trace to only show traffic involving that IP:

  • IPv4.Address == x.y.z.a


Going through the data of that specific client seemed to reveal a lot of TCP 445 (SMB) traffic. However in Network Monitor this was displayed as all kinds of protocols:

  • SMB2
  • LSAD
  • LSAT
  • MSRPC
  • SAMR

As far as I can tell, TCP 445 (SMB) is being used as a transport for certain protocols to talk to Active Directory (lsass.exe on the domain controller). When a client logs on, you could explain TCP 445 traffic as group policy objects being downloaded from the SYSVOL share, but in our case that definitely didn’t seem to be the case. Browsing a bit through the traffic, the LSAT and SAMR messages seemed more than interesting, so I changed my display filter:

  • IPv4.Address == x.y.z.a AND (ProtocolName == "LSAT" OR ProtocolName == "SAMR")


This resulted in traffic that was pretty easy to read and understand: the client was doing a massive amount of LsarLookupNames3 requests. By opening multiple requests I could see that each one held a different username to be looked up. Now why would a client be performing queries against AD that seemingly involve all (or a large subset of) our AD user accounts? In the 60-second trace I had, my top client was doing 1015 lookups in 25 seconds. That has to count for some load if potentially hundreds of clients are doing this.

As the traffic was identified as SMB, I figured the open files (Computer Management) on the domain controller might show something:


It seems a lot of clients have a connection to \lsarpc.


And \samr is also popular. Now I’ve got to be honest: both lsarpc and samr were completely unknown to me. I always assumed that whatever lookups have to be performed against AD go over some form of LDAP.

After using my favourite search engine I came up with a bit more background information:

It seems LsarOpenPolicy2, LsarLookupNames3, LsarClose are all operations that are performed against a pipe that is available over TCP 445. The LsarLookupNames3 operation seems to resolve usernames to SIDs.

Using the open files view I tried identifying clients that were causing this traffic at that very moment. Keep in mind that there’s also traffic to those pipes that is absolutely valid. In my case I took some random clients and used “netstat -ano | findstr 445” on them to check whether the SMB session to the domain controller remained open for more than a few seconds. Typically that would mean the client was actually hammering our domain controllers.



The netstat output also shows the process ID of the process generating the traffic: 33016 in this case. Using Task Manager or Process Explorer (Sysinternals) you can easily identify the actual process behind this ID.


We could see that the process was WmiPrvSe.exe and that its parent process was svchost.exe (C:\WINDOWS\system32\svchost.exe -k DcomLaunch). If you see multiple WmiPrvSe.exe processes, don’t be alarmed: each namespace (e.g. root\cimv2) has its own process, and those processes only run while actual queries are being handled.


Hovering over the process in Process Explorer shows which namespace the instance is responsible for; in this case that was root\CIMV2.

Opening the properties of the process and switching to the Threads tab, you can clearly see that there’s one thread at the top drawing attention. Open its stack and use the Copy All button to paste the contents into a text file:

RPCRT4.dll!I_RpcTransGetThreadEventThreadOptional+0x2f8
RPCRT4.dll!I_RpcSendReceive+0x28
RPCRT4.dll!NdrSendReceive+0x2b
RPCRT4.dll!NDRCContextBinding+0xec
ADVAPI32.dll!LsaICLookupNames+0x1fc
ADVAPI32.dll!LsaICLookupNames+0xba
ADVAPI32.dll!LsaLookupNames2+0x39
ADVAPI32.dll!SaferCloseLevel+0x2e26
ADVAPI32.dll!SaferCloseLevel+0x2d64
cimwin32.dll!DllGetClassObject+0x5214
framedynos.dll!Provider::ExecuteQuery+0x56
framedynos.dll!CWbemProviderGlue::ExecQueryAsync+0x20b
wmiprvse.exe+0x4ea6
wmiprvse.exe+0x4ceb
..

We can clearly see that this process is the one performing the lookups. Now the question remains: who ordered this WMI query? WmiPrvSe.exe is only the executor, not the one who requested it…

In order to get to the bottom of this I needed to find out who performed the WMI query, so enabling WMI tracing was the way to go. To correlate the events, it’s convenient to know when the WmiPrvSe.exe process was created, so that I’d know I was looking at the correct events; I didn’t want to be looking at one of the many SCCM-initiated WMI queries! To know what time frame to look for, we’ll check the Security event log.

Using calc.exe we can easily convert the PID 33016 to HEX: 80F8. Just set it to Programmer mode, enter the value with Dec selected and then select Hex.
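You can do the same conversion from PowerShell if you prefer:

# Decimal PID to hex (33016 -> 80F8) and back again
'{0:X}' -f 33016
[Convert]::ToInt32('80F8', 16)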


In our environment the Security event log has an entry for each process that is created. If that’s not the case for you, you can play around with auditpol.exe or update your GPOs. We can then use the Find option and enter the HEX code to locate the appropriate entry in the Security event log. You might hit some unrelated events, but in my case searching on the HEX code worked out pretty well.
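If you’d rather query the log than use the Find dialog, here’s a hedged PowerShell sketch (it assumes process creation auditing is enabled so that event ID 4688 is logged; 4688 records the new process ID in hex):

# Look for process creation events mentioning our PID in hex (0x80F8)
Get-WinEvent -FilterHashtable @{ LogName = 'Security'; Id = 4688 } -MaxEvents 2000 |
    Where-Object { $_.Message -match '0x80F8' } |
    Select-Object TimeCreated, Message -First 5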


So now all we need is WMI tracing. Luckily, with Windows 7 a lot of this stuff is available from within the Event Viewer: Applications and Services Logs > View > Show Analytic and Debug Logs.


Microsoft > Windows > WMI-Activity > Trace


One of the problems with this log is that it fills up quite fast, especially in an environment where SCCM is active, as SCCM relies on WMI extensively. The log shows each query that is handled by WMI, and the good part is that it includes the PID (process identifier) of the process requesting the query. The challenge was to have tracing enabled, make sure this specific log wasn’t full (once full it just drops new events) and have the issue, which we couldn’t reproduce at will, occur… With some hardcore patience and a bit of luck I found an instance pretty fast.
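If you prefer the command line over the Event Viewer UI, something like the following should enable and then query the trace log (log name as it appears on Windows 7 / 2008 R2; treat it as a sketch):

wevtutil sl Microsoft-Windows-WMI-Activity/Trace /e:true
wevtutil qe Microsoft-Windows-WMI-Activity/Trace /f:text /c:20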

So I correlated the events with the process creation time of the WmiPrvSe.exe process.


What I saw was basically a pattern: connect to the namespace, execute a query, get some more information, repeat. Some of the queries:

  • SELECT AddressWidth FROM Win32_Processor
  • Select * from __ClassProviderRegistration
  • select __RELPATH, AddressWidth from Win32_Processor
  • select * from msft_providers where HostProcessIdentifier = 38668
  • SELECT Description FROM Win32_TimeZone

And the last ones:


GroupOperationId = 260426; OperationId = 260427; Operation = Start IWbemServices::ExecQuery - SELECT Name FROM Win32_UserAccount; ClientMachine = Computer01; User = NT AUTHORITY\SYSTEM; ClientProcessId = 21760; NamespaceName = \\.\root\CIMV2


Now the SELECT Name FROM Win32_UserAccount query seems to be the winner here; that one definitely looks relevant to our issue. If we open up the MSDN page for Win32_UserAccount, there’s actually a warning: “Note: Because both the Name and Domain are key properties, enumerating Win32_UserAccount on a large network can negatively affect performance. Calling GetObject or querying for a specific instance has less impact.”
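To see the difference for yourself, compare the query we caught in the trace with the more targeted form the MSDN note suggests (the domain and user name below are made-up examples):

# The expensive form: enumerates every account the machine can resolve, including the whole domain
Get-WmiObject -Query "SELECT Name FROM Win32_UserAccount"

# The cheaper form: restrict the enumeration using the key properties (Domain and Name)
Get-WmiObject -Query "SELECT Name FROM Win32_UserAccount WHERE Domain='CONTOSO' AND Name='thomas'"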

Now for the good part: PID 21760 actually leads to something.


The process we found turned out to be a service from our antivirus solution.

The service we found to be the culprit is the McAfee Product Improvement Program service (TelemetryServer, mctelsvc.exe). Some background information from McAfee: McAfee.com: Product Improvement Program

In theory this is the info they are gathering:

  • Data collected from client system
  • BIOS properties
  • Operating System properties
  • Computer model, manufacturer and total physical memory
  • Computer type (laptop, desktop, or tablet)
  • Processor name and architecture, Operating System architecture (32-bit or 64-bit), number of processor cores, and number of logical processors
  • Disk drive properties such as name, type, size, and description
  • List of all third-party applications (including name, version, and installed date)
  • AV DAT date and version and Engine version
  • McAfee product feature usage data
  • Application errors generated by McAfee Processes
  • Detections made by On-Access Scanner from VirusScan Enterprise logs
  • Error details and status information from McAfee Product Logs

I don’t see any reason why they would need to execute that query… but the event log doesn’t lie. To be absolutely sure that this is the query that results in such a massive amount of traffic, we’ll execute the suspect query ourselves using wbemtest.

Start > Run > WbemTest

In wbemtest, connect to the root\CIMV2 namespace, choose Query… and enter the SELECT Name FROM Win32_UserAccount query.

If you have Network Monitor running alongside this test you’ll see a lot of SAMR traffic passing by, so the test seemed conclusive: the McAfee component had to go. After removing the Product Improvement Program component from each PC we could clearly see the load on the domain controllers dropping.


To conclude: I know this is a rather lengthy post and I could also just have said “hey, if you have lsass.exe CPU issues, just check whether you have the McAfee Product Improvement Program component running”, but with this kind of blog entry I want to share my methodology with others. For me troubleshooting is not only about fixing the issue at hand; it’s also about getting an answer (what? why? how?), and it’s always nice if you learn a thing or two on your quest. In my case I learned about WMI tracing and lsarpc/samr. As always, feedback is appreciated!


Quick Tip: Enumerate a User's AD Group Memberships

Published on Thursday, August 28, 2014

Using the two following commands you can easily retrieve all the groups a user is a member of. This also takes into account group membership caused by nested groups. Here’s the first command; it’s a multi-line command that stores all of the groups the user is a member of in the $tokenGroups variable. The groups are represented by their SID.

$tokenGroups = Get-ADUser -SearchScope Base -SearchBase 'CN=thomas,OU=Admin Accounts,DC=contoso,DC=com' `
    -LDAPFilter '(objectClass=user)' -Properties tokenGroups | Select-Object `
    -ExpandProperty tokenGroups | Select-Object -ExpandProperty Value

In order to easily translate them to their AD AccountName you can use the following command I blogged about earlier (Quick Tip: Resolving an SID to a AccountName)

$groups = $tokengroups | % {((New-Object System.Security.Principal.SecurityIdentifier($_)).Translate( [System.Security.Principal.NTAccount])).Value}

Using the “-SearchScope Base -SearchBase …” approach seems to be necessary, as you cannot simply use Get-ADUser username … (tokenGroups is a constructed attribute that is only returned when you query the object itself with a base-scoped search).



Failover Cluster: Generic Applications Fail with OutOfMemoryException

Published on Thursday, August 14, 2014

Recently I helped a customer who was having trouble migrating from a Windows 2003 cluster to a Windows 2012 cluster. The resources they were running on the cluster consisted of many in-house developed applications: about 80 of them, all running as generic applications.

Because Windows 2003 is end of life, they started a phased migration towards Windows 2012 (in a test environment). At first the migration seemed to go smoothly, but at a given moment they were only able to start a limited number of applications. The applications that failed threw an out-of-memory exception (OutOfMemoryException). Typically they could start about 25 applications and from then on they weren’t able to start more; the number wasn’t exact, sometimes it was more, sometimes less.

As I suspected that this wasn’t really a failover clustering problem but more a Windows problem I googled for “windows 2012 running many applications out of memory exception”. I found several links:

HP: Unable to Create More than 140 Generic Application Cluster Resources

IBM: Configuring the Windows registry: Increasing the noninteractive desktop heap size

If the parallel engine is installed on a computer that runs Microsoft Windows Server, Standard or Enterprise edition, increase the noninteractive desktop heap size to ensure that a sufficient number of processes can be created and run concurrently

So it seems you can tweak the desktop heap size in the registry. Here is some background regarding the modification we made to the registry.

The key: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows

SharedSection has three settings (KB184802: User32.dll or Kernel32.dll fails to initialize):

  • The first SharedSection value (1024) is the shared heap size common to all desktops. This includes the global handle table, which holds handles to windows, menus, icons, cursors, and so forth, and shared system settings. It is unlikely that you would ever need to change this value.
  • The second SharedSection value (3072) is the size of the desktop heap for each desktop that is associated with the "interactive" window station WinSta0. User objects like hooks, menus, strings, and windows consume memory in this desktop heap. It is unlikely that you would ever need to change this second SharedSection value.
  • The third SharedSection value (512) is the size of the desktop heap for each desktop that is associated with a "noninteractive" window station. If this value is not present, the size of the desktop heap for noninteractive window stations will be same as the size specified for interactive window stations (the second SharedSection value).

On Windows 2012 the default for this third value seems to be 768.

Raising it to 2048 seems to be a workaround/solution. A reboot is required! After this we were able to start up to 200 generic applications (we didn’t test more). After a while there were still some failures, but at first sight quite limited; this might be due to the actual memory being exhausted. Either way, we definitely saw a huge improvement.
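To see what you’re starting from before changing anything, you can dump the relevant value from PowerShell; only the third number after SharedSection= is the one being raised here (the numbers below are just examples, keep whatever your system already has):

# Show the Windows value that contains the SharedSection settings
$key = 'HKLM:\System\CurrentControlSet\Control\Session Manager\SubSystems'
(Get-ItemProperty -Path $key -Name Windows).Windows

# The interesting part looks something like this before and after the change:
#   ... SharedSection=1024,20480,768 ...
#   ... SharedSection=1024,20480,2048 ...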

Disclaimer: ASKPERF: Sessions, desktops and windows stations

Please do not modify these values on a whim. Changing the second or third value too high can put you in a no-boot situation due to the kernel not being able to allocate memory properly to even get Session 0 set up

Bonus info: why didn’t the customer have any issues running the same workload on Windows 2003? They had configured the generic applications with “allow desktop interaction”, something which was removed from generic applications in Windows 2008. Because they had “allow desktop interaction” configured, the generic applications were running in an interactive session and thus were not limited by the much smaller non-interactive desktop heap.


SCOM 2012 R2: Web Portal: 503 The Service is Unavailable

Published on Wednesday, August 13, 2014

The other day one of my customers mentioned that their SCOM web portal had been broken for a while. As I like digging into web application issues I took a quick look; here’s what I came up with. The portal itself was loading fine, but viewing All Alerts or Active Alerts resulted in a Service Unavailable error (“HTTP Error 503: The service is unavailable”).


One of the things about IIS based errors is that in most cases the Event Log on the web server can help you a great deal. In the System Event Log I found the following:


A process serving application pool 'OperationsManagerMonitoringView' reported a failure trying to read configuration during startup. The process id was '6352'. Please check the Application Event Log for further event messages logged by the worker process on the specific error. The data field contains the error number.

Checking the IIS Management Console I could indeed see that the Application Pool was stopped. Starting it succeeded, but viewing the page made it crash again. Looking a bit further I found the following in the Application Event Log:


The worker process for application pool 'OperationsManagerMonitoringView' encountered an error 'Configuration file is not well-formed XML' trying to read configuration data from file '\\?\C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\web.config', line number '14'. The data field contains the error code.

Now that seems pretty descriptive! Using Notepad I checked the contents of the file and tried to see why the XML was not well-formed. I checked the XML tags and their closings and such, but couldn’t find anything at first sight. Looking a bit longer I saw that the quotes (“) on one line were typographic (“curly”) quotes, different from the straight quotes elsewhere in the file. Simply erase and retype the quotes on that line (line 14, according to the event) and you should be good to go.

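If you don’t feel like eyeballing the file, a quick way to spot such characters is to search for anything non-ASCII; a small sketch (path taken from the event above):

# Find lines containing non-ASCII characters such as curly quotes
Select-String -Path 'C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\web.config' -Pattern '[^\x00-\x7F]' |
    Select-Object LineNumber, Line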

Personally I like taking a backup copy before I perform manual fixes. After saving the file I did an IISReset just to be sure. And after that we were able to successfully view our alerts through the Web Portal again!