This presentation covers an overview of the product, a detailed review of the architecture and components, and an in-depth look at troubleshooting and the tools available.
Implementing and Troubleshooting EdgeSight
Citrix Support Secrets
Webinar Series
Vincent Papoz, Snr Escalation Engineer, Citrix Support
March 21, 2013
In our agenda for today: I'll give a short overview of the architecture, then describe a typical use case dealing with how to obtain more information when troubleshooting a slow user logon. Many administrators are running into performance issues with their EdgeSight deployment, so we're going to talk about this as well. The last part will provide an insight into custom reporting, tools, and where to find the data. Then we'll open the room for questions.
In this first part: an overview of the major components. EdgeSight replaces Resource Manager for all XenApp versions installed on top of Windows Server 2008 R1 and R2. In concept, EdgeSight works the same way as Resource Manager in the sense that each XenApp server is responsible for gathering its own data. The servers then send this data to a central location to make farm-wide historical reporting possible. At the same time, an administrator can connect directly to a XenApp server's local database to obtain real-time information. Resource Manager uses the rmmonitor IMA subsystem to collect data locally on the XenApp servers; EdgeSight uses an agent, which runs as a service and uses different methods of collecting data.
The first components to be installed are the SQL Server database engine along with the Reporting Services server. The Reporting Services component is part of the SQL Server installation, but it can be installed on a separate machine. When this is up and running, you can install the EdgeSight web server. During the install process, the database server is specified and the EdgeSight database is created. At the end of the installation, we also connect to the report server and upload the built-in reports.

When the server-side component installation is complete, we can go ahead and deploy the EdgeSight agents on the devices we want to monitor. This includes XenApp servers, endpoints (such as Windows 7, Windows Vista, and Windows XP workstations), and VDIs. During agent installation, we point to the EdgeSight web server to tell the devices where to send their data.

On monitored devices, the agents collect data (performance data, usage data, configuration data, etc.) and aggregate it into their local database, which is a Firebird database. Twice a day by default, each monitored device creates a payload containing the information and sends it over to the EdgeSight web server. In turn, the EdgeSight web server bulk inserts this information into the database.

The Citrix administrator can run historical reports against the EdgeSight database; approximately 150 built-in reports are provided for this. The administrator can also connect directly to a device where the agent is installed to obtain real-time information.
So what type of data are we actually collecting?
- The agent collects event-driven data: process starts/stops, errors and faults, user logon and logoff. This is done using a DLL injection and interception method.
- The agent obtains Winsock connection and HTTP transaction information by hooking API calls.
- It also uses the callback notification method to obtain Windows event log information as well as system reboot information.
- Regarding performance metrics: by default, EdgeSight polls for standard registry-based performance counters from the operating system, but custom counters are configurable and can be added for third-party products.
- XenApp and ICA specific: the agent collects session connection and disconnection events as well as session metrics and virtual channel statistics. This type of data is obtained from the EUEM service running on the XenApp servers where the agent is installed.
- Scheduled data: information such as drive space calculation or asset history is obtained by running WMI queries on a regular basis.
The EUEM service is installed when installing the EdgeSight agent on XenApp servers. This component collects client start-up and server start-up metrics and provides information about the user experience.
- Client start-up metrics are concerned with timing the operations that occur from the point when the user requests an application (e.g., by clicking an icon) to the point at which an instance of the ICA client has finished opening a connection to the XenApp server.
- Server start-up metrics deal with session creation on a XenApp server, where the user must first be authenticated. After authentication, the session creation process performs client device mapping tasks (printers, drives), loads the user's profile, executes any login scripts, and then starts the user's application.
Slow logons... When a user clicks on an icon to launch a published application, the first change they see on the screen is a dialog box with a description of the processes taking place to launch the requested application. Any problem experienced during the launch process, whether in the XA infrastructure or the supporting infrastructure components, can result in a poor user experience. Diagnosing login problems has traditionally been a difficult, time-consuming, manual process due to the large number of steps involved, many of which are external to the XenApp server infrastructure. EdgeSight for XA enables visibility into the steps involved in the login process and the time required for each step to complete.
As I mentioned earlier, an administrator can connect directly to the server hosting the affected session to obtain real time data. So when a user complains about a very long logon, the admin can login to the Edgesight console and select the user troubleshooter section to find out more about the active user session.
What we need to do is enter the user name and, optionally, the name of the server hosting the session, then click Find Session. In this case, you can see 2 sessions for the user.
Select the topmost session; a number of tabs are available in the lower pane that provide details about the session. The System Summary tab provides system performance counters for the server, including CPU, memory, and network.
The Process Detail tab displays process performance counters, which show resource usage by the various applications running while the session is active. You can also select the Show All Processes checkbox to display a list of all processes running on the server.
The Session Start Detail tab is the one we want to look at in this specific case. The tab provides a breakdown of the logon process, both on the server side and on the client side. We get details such as the time it takes to authenticate the user, the time it takes to load the user profile, and the time it takes to create the printers in the user session or to map drives. On the client side: the time it took to enumerate the applications, the time it took to download the ICA file, and so on. All of these categories are explained in detail in the online help. In this example, we can see that the overall time to create the session was approximately 57 seconds, and of that, loading the profile took 55 seconds. Maybe the file server hosting the user profile is under stress, maybe there is a network latency issue, maybe the profile size is massive... but you know that this is where you need to start your investigation.
This comes up quite a lot in support: EdgeSight has been running for a while and the Citrix admin has been adding new devices, maybe a little too many... Symptoms include:
- Report rendering: when running a report, you might have to wait a long time to get the result.
- Errors in the console saying that the queue in the webload directory is growing.
- Servers not updating the database, possibly because their payloads are not being processed.
- The nightly maintenance job, whose task is to groom data, run some reindexing, and update statistics, might take hours to complete. As a result, grooming errors will occur and show up in the console.
- Typical SQL database symptoms such as gigantic database files filling up the hard drive, a transaction log refusing to shrink and growing out of control, database locks, and so on.
This is usually a problem due to too much data in the database. In the case of SQL Server, you need some temporary space to be able to perform operations such as deleting, inserting, reindexing, and so on. All those operations are recorded in the transaction log, and if the transaction log cannot grow, then no operation can be performed. Even if there is a sufficient amount of disk space left to perform those operations, the SQL engine might still struggle, because there is so much data that every single operation takes longer to complete, and this has a knock-on effect where everything is delayed.
So what can we do when we run into this situation? What we want right now is to have the deployment back up and running in the shortest amount of time. After the situation is resolved, we will have time to review the configuration so that we don't run into this issue anymore; this will include adjusting the grooming schedule and maybe ignoring specific data.
1. At the moment, we need to get rid of some existing data, and for this we need to assess what data we need and what data we don't. This is directly related to our reporting requirements: sometimes there is data we never need to report on and that we can afford to ignore.
2. As a second step, we need to analyze the content of the database. In other words, we want to find out where the bulk of the data is and what it is. This will also help us adjust the collection configuration after we've resolved the issue.
3. The third step deals with removing this data. There are different ways to do that, but we need to do it manually, since the application is struggling to do it on its own.
4. After that, we might want to reclaim disk space, because the data files will not shrink straight away. We only need to do this if we're out of disk space; otherwise, the SQL engine will take care of it automatically at a later stage.
5. Lastly, we want to adjust the collection configuration as a long-term solution. If we don't, we will run into the same issue again sometime in the future.
Analyzing the content of the database: how do we do that? We can look at the size of the data files and find the largest ones. A data file is the physical database file on the hard drive. By default, when creating a database, only one .mdf file is created, which is the primary data file, along with an .ldf file, the transaction log. When installing the EdgeSight server, the database objects are divided into 8 different data files. This was designed primarily for performance reasons, because you can host these data files on different physical hard drives. To find out the size of the data files, you can run this query inside SQL Server Management Studio. Now let's say you get the following output: while most of the files have a manageable size, one of them, data file 6, stands out with a size in excess of 240 GB. This is the one we want to investigate.
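The slide's exact query is not reproduced here, but a roughly equivalent sketch against the SQL Server system catalog looks like this (sys.database_files reports sizes in 8 KB pages):

```sql
-- List the data files of the current database, largest first.
-- size is in 8 KB pages, so divide by 128 to get megabytes.
SELECT name, physical_name, size / 128 AS size_mb
FROM sys.database_files
ORDER BY size DESC;
```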
Now we want to find out which tables are hosted by that data file. We can run this query in SQL Server Management Studio, and we will get the following result: there are 2 tables in filegroup 6, core_net_stat and core_net_trans.
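A sketch of such a query, mapping tables to filegroups through their heap or clustered index (this assumes the oversized filegroup is literally named FG6, as in the example):

```sql
-- Map each table to the filegroup that stores its heap or clustered index.
SELECT fg.name AS filegroup_name, t.name AS table_name
FROM sys.tables AS t
JOIN sys.indexes AS i
  ON i.object_id = t.object_id AND i.index_id IN (0, 1)  -- 0 = heap, 1 = clustered
JOIN sys.filegroups AS fg
  ON fg.data_space_id = i.data_space_id
WHERE fg.name = N'FG6'  -- the filegroup we identified as oversized
ORDER BY t.name;
```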
How many records are held by those tables? Maybe only one of the 2 tables is affected. In this output, we can see that the core_net_trans table is very large, and actually much bigger than the core_net_stat table.
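One way to get these counts quickly is from partition metadata, which avoids scanning the tables themselves:

```sql
-- Approximate row counts from partition metadata (no table scan needed).
SELECT t.name AS table_name, SUM(p.rows) AS approx_rows
FROM sys.tables AS t
JOIN sys.partitions AS p
  ON p.object_id = t.object_id AND p.index_id IN (0, 1)
WHERE t.name IN (N'core_net_stat', N'core_net_trans')
GROUP BY t.name
ORDER BY approx_rows DESC;
```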
Question 1: has the grooming been failing? To find out, we need to know what the grooming schedule is for those tables. This setting is found in the EdgeSight console, but we can also run this query. The output displays the current settings as well as the default settings; we can see here that the settings were not modified.
We're just going to have a look at the core_net_trans table, since it is the largest. We know that the grooming schedule for the core_net_trans table is 10 days since the last maintenance, so if we find records older than 11 days, it means that grooming failed. This query will tell you how many records failed to be deleted. If there are no records older than the cut-off date, we need to have a look at the data inside the table and investigate further.
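A sketch of such a check; note that the timestamp column name used here is an assumption, since the EdgeSight table schema is not documented:

```sql
-- Count records older than the 11-day cut-off.
-- NOTE: the timestamp column name (dtcreated) is an assumption made for
-- illustration; check the actual core_net_trans schema for the real name.
SELECT COUNT(*) AS stale_rows
FROM core_net_trans
WHERE dtcreated < DATEADD(day, -11, GETDATE());
```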
If the grooming has been failing and there is not enough temporary disk space for it to complete successfully, we can use a query similar to this one to delete data in increments of, for example, 100,000 rows. If deleting 100,000 rows also fails because of the lack of disk space, you would need to set the increment to a lower value for a start, something like 1,000 or 10,000. Every now and again, you can reclaim free disk space by shrinking the relevant data file (in this case, FG6). You can do that using SQL Server Management Studio, or you can use a query; you can also integrate that piece into the main block so that it runs automatically after each increment is deleted.
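A minimal sketch of the batched delete, again assuming a hypothetical timestamp column name:

```sql
-- Delete stale rows in batches to keep transaction log growth under control.
-- The timestamp column (dtcreated) is an assumption about the schema.
DECLARE @cutoff datetime = DATEADD(day, -11, GETDATE());
DECLARE @deleted int = 1;

WHILE @deleted > 0
BEGIN
    DELETE TOP (100000) FROM core_net_trans
    WHERE dtcreated < @cutoff;

    SET @deleted = @@ROWCOUNT;
    CHECKPOINT;  -- under the simple recovery model, lets the log truncate between batches
END
```

Lowering the `TOP (100000)` value is the equivalent of the smaller 1,000 or 10,000 increment mentioned above.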
If the grooming is not failing, in other words if all this data is inside the 10-day retention period, we need to look at the data inside the table and attempt to understand why so much data is being recorded. We can run queries against the affected table to find out if a specific process is responsible. It very much depends on the table, of course, but in the case of core_net_trans, which records application network transaction performance, what we can do is check whether one or several applications are generating a large amount of data. This query will group the process names and sort them by number of occurrences in the core_net_trans table. It will give you this type of result, where straight away you can see that one application, the first one in the list, has been very busy. I have just put a dummy application name there, but I have seen instances where third-party monitoring applications were responsible, because they can start processes to gather data. There are other examples, like scripts being run very often in an environment to check for some specific condition. What you can do as a first step, if necessary, is delete the records for this application.
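A sketch of this grouping query; the join through an image/process lookup table is an assumption made for illustration, as the real schema may store the process name differently:

```sql
-- Rank processes by how many rows they contribute to core_net_trans.
-- The [image] lookup table and imgid join column are assumptions; consult
-- the actual schema before running anything like this.
SELECT i.name AS process_name, COUNT(*) AS occurrences
FROM core_net_trans AS nt
JOIN [image] AS i ON i.imgid = nt.imgid
GROUP BY i.name
ORDER BY occurrences DESC;
```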
If you happen to be out of disk space, you will want to manually reclaim it after you have removed the unwanted records. We can use SQL Server Management Studio for this: right-click the database, then select Tasks > Shrink > Files and pick the FG6 data file. You can also use the DBCC SHRINKFILE command with the required parameters.
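For example (the logical file name below is a placeholder; look up the real one in sys.database_files):

```sql
-- Shrink the data file behind FG6 down to a 10 GB target.
-- 'EdgeSight_Data6' is a hypothetical logical file name; find the real one
-- with: SELECT name FROM sys.database_files;
DBCC SHRINKFILE (N'EdgeSight_Data6', 10240);  -- target size in MB
```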
In our action plan, the last step was to find a durable solution so that we don't run into this issue again. Obviously, this solution very much depends on our analysis, but it will mainly include the following:
- Adjusting the grooming schedule; this is done in the console under Data Maintenance.
- Adjusting the upload configuration.
- Ignoring a process; there are 2 ways of doing this.
You can either choose to add the process in the agent properties advanced configuration. In that case, we disable interception for this process and collect minimal information about it. Or you can disable injection altogether using a registry setting, in which case we collect pretty much nothing related to the process. You would mainly use this method if the application has a compatibility issue with the EdgeSight agent. This is the method I am going to describe now.
When we ignore a process, we disable injection of a Citrix DLL (csma_ldr.dll) for this process. This is a per-device registry setting, and since we're dealing with injection, we also need to restart the XenApp server for the setting to take effect.
Ignoring a process means that the EdgeSight agent will collect minimal information about it. It will still collect information about process startup and shutdown events, but it will discard other information such as application network transaction performance, which is precisely what we are concerned about in our scenario. Remember the output we obtained earlier showing too many instances of the someapp.exe process? This is the process we would exclude from monitoring, using either the agent property exclusion or, in case of a compatibility issue, the registry setting.
Since we might not want to do that on every single XenApp server, maybe we can find out whether our process is generating data on all servers or just on a few of them, because it is better to avoid setting the registry key on every single server when this is not necessary. This query will show the number of instances of the process per server. Let's say we get the following result: the numbers look fine on most servers except the first two in the list. Those are the ones we want to set the exclusion for.
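A sketch of this per-server breakdown; the machine and image lookup tables and their join columns are assumptions about the undocumented schema:

```sql
-- Count instances of the noisy process per monitored device.
-- The machine/[image] lookup tables and machid/imgid columns are
-- assumptions made for illustration; verify against the real schema.
SELECT m.name AS machine_name, COUNT(*) AS occurrences
FROM core_net_trans AS nt
JOIN [image] AS i ON i.imgid = nt.imgid
JOIN machine AS m ON m.machid = nt.machid
WHERE i.name = N'someapp.exe'
GROUP BY m.name
ORDER BY occurrences DESC;
```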
When the exclusion has been set and the server (or servers) have been rebooted, it is good practice to check that the exclusion was effective. You can do that using tools such as Process Explorer, which displays all the DLLs a process has loaded. When there is no exclusion in place, you will find these 3 DLLs listed: csma_ldr.dll, rsintcor.dll, and esint.dll. Csma_ldr.dll is injected into a process by our kernel driver (rskcore.sys); it is responsible for loading the interception modules rsintcor.dll and esint.dll.
This is what you'd see for an excluded process: basically, all the DLLs I mentioned are no longer listed.
On the SQL side, there are a few optimizations we can think of.
- You can split the data files across different hard drives; that will greatly improve performance for I/O operations.
- You can also have a look at the recovery model and maybe set it to simple, if possible, so that the transaction log is truncated automatically.
- A third suggestion would be to consider data warehousing; see ETL (extract, transform, and load). Data warehousing is a process where you extract, transform, and move data to a different SQL server that acts as a data repository for reporting.
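Changing the recovery model is a one-line statement (the database name below is a placeholder for your actual EdgeSight database name):

```sql
-- Switch the EdgeSight database to the simple recovery model so the
-- transaction log is truncated automatically at each checkpoint.
-- 'EdgeSight' is a placeholder for your actual database name.
ALTER DATABASE EdgeSight SET RECOVERY SIMPLE;

-- Verify the change:
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'EdgeSight';
```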
Although there are quite a few built-in reports provided with EdgeSight, there will always be a need to format reports in a different way, or to combine different types of information; or it could be as simple as adding or removing columns from existing reports. The good news is that the data is there. It was collected on the monitored devices, aggregated, and imported into a SQL database, and it is now available in that SQL database for any type of reporting your business may require.
So I just want to go back to the architecture overview slide that we looked at at the beginning of this presentation. On the left-hand side, we have the monitored devices, where data is collected by the EdgeSight agent. All this data is sent on a regular basis to the components on the right-hand side for historical reporting. Since custom reporting is only about historical reporting, we can set aside the components on the left-hand side. We're left with 3 major components: the SQL server, where the actual data is stored; the reporting server; and the EdgeSight server.
Now, there are 3 different approaches we can take when considering custom reporting.
- First approach: leverage the existing 3 components and design our custom reports by sticking to the rules imposed by the EdgeSight namespace. This means we use EdgeSight to manage and run our custom reports.
- Second approach: bypass the EdgeSight server and use our own application to connect to the Reporting Services and database components.
- Third approach: connect directly to the database and run ad-hoc queries. You will find quite a few very good resources on the web for this, so I am not going to cover it in this presentation.
Whatever approach you decide to go for, it is a good idea to find your way around and know where the data is located. An EdgeSight 5.4 database contains 283 tables hosting the actual data, and the schema is not documented. The main reason for this is that the schema can change from one version to the next. So when creating your own queries, you can of course query the user tables directly, but then you need to spend some time understanding the schema, and your query might stop working when you upgrade your EdgeSight server to a newer version. Instead, we provide views to expose the data. A view is a virtual table based on the result set of an SQL statement; there are 96 of them in EdgeSight 5.4. The EdgeSight views are fully documented, and most of the built-in reports make use of them. This is what we recommend using for custom reporting.
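Querying a view looks the same as querying a table, with the schema-stability benefit described above. The view name below is hypothetical and used only for illustration; consult the EdgeSight views documentation for the real names:

```sql
-- Query a documented EdgeSight view rather than the undocumented tables.
-- 'vw_user_logon_detail' is a hypothetical view name used for illustration.
SELECT TOP (10) *
FROM dbo.vw_user_logon_detail;
```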
Here are a few facts about custom reporting.
- Custom reporting is for historical reports only; there is no mechanism to customize reports for real-time information.
- On the EdgeSight web server, all the built-in report definition files can be found at this location. They can be copied, modified, and uploaded as custom reports in the EdgeSight console.
- Since EdgeSight integrates with Microsoft Reporting Services, the reports are written in a form of XML called RDL (Report Definition Language), and the queries to extract data from the database are written in Transact-SQL.
- Additionally, since Reporting Services does not support different languages, we have added localization as an EdgeSight feature. This is important to know when creating custom reports, because if you choose to use the built-in parameters, you will actually need to map the labels to your custom reports in the database.
Regarding the tools available to us: there is no proprietary framework for creating EdgeSight reports, so you have to use what's out there already.
- If you wish to modify an existing report, you can open the RDL file in any XML editor and edit it accordingly. You will need to be comfortable with the RDL specification and T-SQL for this.
- Another tool you can use, if you are not too confident with RDL and T-SQL, is Report Builder. You can load existing reports in Report Builder and modify them to your convenience.
- Business Intelligence Development Studio is a little more sophisticated than Report Builder, but it's also more powerful.
- SQL Server Management Studio can be used to verify the queries you create.
To conclude this chapter, I wanted to give you a short list of resources you can use to learn more about custom reporting.
- First off, we have a very good introduction to custom reporting by a Citrix consultant.
- Also by the same author, a description of the anatomy of an EdgeSight report.
- Then you will find a collection of custom reports created by Citrix engineers on this page.
- You can consult the RDL specification on Microsoft's website, and also learn more about Report Builder.
- You will also find some very good resources on the web; knowledgeable people have blogged about this quite a bit.
At the end of the day, custom reporting can be very complex, so if you're new to it, use these resources to get started.
At Citrix Services, we're Citrix consultants, teachers, and support engineers, and we're all about one thing: making sure you succeed. With our help, you'll deploy high-performance, robust virtualization and networking projects faster, with dramatically lower risk and higher return. The best Citrix architects and administrators are the ones who never stop learning, and Citrix Education is here to help you learn those skills. Citrix Consulting gives you direct access to our most experienced virtualization and networking experts. When it's complex, when it's mission-critical, when it's big: that's when Citrix consultants can really help. On your virtualization journey, you'll want always-on support from people who really care about your success. There's no better insurance for your Citrix investment than Citrix Support.
Secrets of the Citrix Support Ninjas is a FREE eBook available next week. The eBook contains 40 insider troubleshooting tips for administrators, and its purpose is to help administrators like you keep your Citrix deployments on track. We've collected some of our engineers' best tips and tricks for running robust Citrix environments and packaged them up into a free eBook. In it, you'll discover some of the little-known tricks that our own support people use every day to tune, tweak, troubleshoot, and test Citrix solutions. You may know a few of these tips, but you probably don't know them all. And, you never know, you might discover just one that will change your life as an administrator. Let me give you a sneak peek now.