Talk about NXP and our mission.
Briefly mention my personal background.
Talk about NXP’s challenges in supporting thousands of hardware and software designers. Some of these challenges led to us building the Splunk End-User Portal.
Mention our future plans.
Mention some lessons learnt.
As you can see, NXP isn’t a small company. Within R&D we have thousands of engineers who run a massive number of simulations. This is only getting bigger as chip designs shrink.
Like most organisations using Splunk, it started with IT. IT requested that we build dashboards to help them monitor and troubleshoot problems reported by engineers (end users). At any given time, we have tens or hundreds of projects running, with millions of simulations happening every week. As you can imagine, with so many projects, we in IT and support get lots of questions daily. Every user and project has different requirements. How do we give users the ability to subscribe to and unsubscribe from their own alerts for storage, compute and tool/license usage? Many of the questions IT gets are similar and come up frequently. Out of these questions, we built Splunk dashboards.
We started with the core applications that our engineers use. This includes the compute farm, tool license usage and storage. Users can view anything from the complete domain down to individual servers. Afterwards we added the ability for users to see their individual information. An additional role was added so that support users or project managers can see information from other users. To assist users, we’ve enabled forecasting that helps them plan their simulations. Feedback: based on the user’s input parameters, we provide feedback to help them optimize in the future. Now you might ask why we don’t use machine learning and models to optimize the jobs when the user submits them. This might happen in the future; however, the current directive is time-to-market, and hence we allow the designers the freedom to control their simulations. This doesn’t mean that we cannot provide feedback. Finally, we provide service availability information to prevent end users from starting simulations only to see them end in error because a critical component is down.
This is a general picture of our on-prem Splunk environment. The main focus of this picture is on the right: our separate end-user portal environment. Raw data is summarized and pushed to the end-user portal environment. Only statistical summary data is pushed there. We created this environment a few years ago, before the introduction of “Workload Management”. This feature could enable us to join the environments in the future. The main reason for the split was to prevent the potential 10,000 users from overloading the main Splunk environment in case of a high-priority issue.
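To make the "only statistical summary data is pushed" idea concrete, here is a minimal Python sketch of the kind of reduction involved. It is an illustration only, not our actual Splunk searches: the event fields (user, runtime_s) and the function name summarize_jobs are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_jobs(events):
    """Reduce raw job events to per-user summary statistics.

    `events` is an iterable of dicts with hypothetical keys
    "user" and "runtime_s"; only the aggregates leave this function.
    """
    runtimes_by_user = defaultdict(list)
    for event in events:
        runtimes_by_user[event["user"]].append(event["runtime_s"])

    # One small summary record per user replaces many raw events.
    return {
        user: {
            "jobs": len(runtimes),
            "avg_runtime_s": mean(runtimes),
            "max_runtime_s": max(runtimes),
        }
        for user, runtimes in runtimes_by_user.items()
    }


raw = [
    {"user": "alice", "runtime_s": 10},
    {"user": "alice", "runtime_s": 30},
    {"user": "bob", "runtime_s": 5},
]
summary = summarize_jobs(raw)
```

The point of the design is that the portal only ever queries small summary records like these, so a large number of end users never touch the raw data.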
When the user logs in, they will see their job/simulation summary. For our users, it all starts with the simulation or test. This requires three resources: compute, tooling licenses and storage. In this view, they can see a general overview of their running jobs, completed jobs and a breakdown by job type. Note: we had to display blank tables, as we got too many questions every time Splunk displayed “No data found”.
These are two table views of the feedback that we provide. Note that we only provide feedback on the simulations. Tooling and storage are used during the simulation process. However, protecting the compute farm ensures that one user doesn’t cause issues for the other 9,999. Our compute farm will “reserve” resources based on the input parameters that the user provides. Reservations can be made at both the CPU and the memory level. If the user overestimates, they will reserve capacity that goes unused and could block other simulations. Likewise, underestimating causes jobs to run slower as the simulation consumes other resources.
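The over/underestimation feedback can be sketched as a comparison of what a job reserved against what it actually used. This is a simplified Python illustration, not the logic behind our dashboards: the JobUsage fields, the function name and the 50% and 100% thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    job_id: str
    reserved_mem_gb: float  # what the user asked the farm to reserve
    peak_mem_gb: float      # what the simulation actually used at peak


def reservation_feedback(job, low=0.5, high=1.0):
    """Classify a job's memory reservation (thresholds are illustrative)."""
    if job.reserved_mem_gb <= 0:
        raise ValueError("reserved_mem_gb must be positive")
    ratio = job.peak_mem_gb / job.reserved_mem_gb
    if ratio < low:
        # Reserved capacity went unused and may have blocked other jobs.
        return "over-reserved"
    if ratio > high:
        # The job used more than it reserved and likely ran slower.
        return "under-reserved"
    return "ok"


jobs = [
    JobUsage("sim-1", reserved_mem_gb=16, peak_mem_gb=4),
    JobUsage("sim-2", reserved_mem_gb=8, peak_mem_gb=12),
    JobUsage("sim-3", reserved_mem_gb=8, peak_mem_gb=6),
]
feedback = {job.job_id: reservation_feedback(job) for job in jobs}
```

The same comparison would apply to CPU reservations by swapping the memory fields for reserved and used cores.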
To help the user determine how many simulations they can run, or whether they can run at all, we break down the capacity information and provide forecasting to aid them. This helps prevent the question: why isn’t my job starting?
Tool licensing is very expensive, and if a tool license isn’t available it can lead to two outcomes: the job fails, or it stays in a waiting/pending state while it polls the license server for licenses. Because of this, we show users their usage history, their current usage and the duration of their tooling checkouts.
Adding to this, if there aren’t any licenses available, they can see who is using them and for how long.