Overview
Kallidus is committed to bringing you tools and services that add value to your people. We have worked hard on the evolution of our products to bring exciting new features and an engaging user experience, while doing this in an increasingly consistent and interconnected way. Over the last three years, Kallidus has made huge strides in ensuring that our SaaS solutions have not only an industry-leading look and feel, but also a stable application platform that lets end users fully experience these benefits.
These improvements and developments are not limited to the applications you and your colleagues see. As with all leading SaaS providers, we have a team focused on developing the technical platform to ensure that all our products run optimally and can support your ambitions. We have done an enormous amount of work and are very proud of the platform we have built. Kallidus products maintain an average uptime that exceeds the annualised SLA (99.5%) and has improved year on year.
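To put the 99.5% figure in context, the short calculation below converts an annualised uptime percentage into a downtime allowance. It is an illustration only: it assumes a 365-day year and that every hour of unavailability counts against the target, which may differ from the terms of an actual service agreement. The function name `downtime_budget_hours` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_budget_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime in hours that still meets the uptime target."""
    return period_hours * (1 - sla_percent / 100)


budget = downtime_budget_hours(99.5)
print(f"{budget:.1f} hours per year")  # prints "43.8 hours per year"
```

Under these assumptions, a 99.5% annualised target allows just under two days of unavailability across a whole year.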
However, we acknowledge that the platform did not perform as well as expected in the first quarter of 2019, and we apologise. Although only some of our customers would have experienced these issues, we take them very seriously.
In this instance, a bottleneck appeared outside the core infrastructure, in the API processing servers within the Kallidus estate. These servers were taking an exceptionally long time to process incoming requests, initially resulting in a poor user experience and ultimately causing timeouts in end users' browsers.
The API servers, given the limited nature of their role, are typically not under heavy load. Although load balanced, they were shared between our Learn and Perform solutions. As a result, problems affecting the Perform APIs had a knock-on effect on the Learn solution. Our monitoring indicated that the API servers were running out of memory while processing the backlog of requests, eventually to the point where the load balancers would route all connections to the secondary node in order to regain service. The process would then repeat itself until traffic was routed back to the primary node.
Actions taken
We recognised the need to act quickly, and so we:
- Increased capacity - Kallidus' infrastructure is highly scalable. We operate semi-dynamic vertical scaling, which allowed us to increase the capacity of these API servers as well as the NoSQL server that provides persistent storage to the Perform solution.
- Increased the memory available for caching - We identified an issue with memory management on the persistent storage servers such that insufficient memory was being reserved for file caching. This was impacting performance, so we reconfigured the servers to ensure adequate memory was reserved for caching.
- Reduced the amount of system logging - We also observed that the logging level on these API servers was too high, generating too much data and consequently impacting performance. This was reset to a lower level.
- Extended monitoring - Kallidus already has a high level of system monitoring in place, taking hundreds of data points every second to proactively detect issues. This has been further extended, and now includes specific per-tenant monitoring.
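As an illustration of what per-tenant monitoring can look like, the sketch below keeps a rolling window of response times for each tenant and flags any tenant whose 95th-percentile latency exceeds a threshold. It is a minimal example under stated assumptions, not a description of the Kallidus monitoring stack: the class name `TenantLatencyMonitor`, the window size, the threshold and the tenant names are all hypothetical.

```python
from collections import defaultdict, deque
from statistics import quantiles


class TenantLatencyMonitor:
    """Hypothetical per-tenant latency tracker over a rolling window of samples."""

    def __init__(self, window=100, threshold_ms=2000.0, min_samples=20):
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples
        # One bounded queue per tenant; the oldest samples drop off automatically.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, tenant, latency_ms):
        self._samples[tenant].append(latency_ms)

    def p95(self, tenant):
        """Return the 95th-percentile latency, or None if there is too little data."""
        samples = self._samples.get(tenant)
        if not samples or len(samples) < self.min_samples:
            return None
        # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
        return quantiles(samples, n=20)[-1]

    def slow_tenants(self):
        """Return the tenants whose p95 latency is above the threshold, sorted by name."""
        slow = []
        for tenant in self._samples:
            value = self.p95(tenant)
            if value is not None and value > self.threshold_ms:
                slow.append(tenant)
        return sorted(slow)


monitor = TenantLatencyMonitor()
for _ in range(30):
    monitor.record("tenant-a", 100.0)   # healthy tenant (made-up data)
    monitor.record("tenant-b", 5000.0)  # degraded tenant (made-up data)

print(monitor.slow_tenants())  # prints ['tenant-b']
```

Tracking latency per tenant rather than only per server makes it possible to see when one customer's workload is degrading while the platform-wide average still looks healthy.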
Kallidus continues to be as ambitious in developing our technical platform as we are in delivering excellent customer experiences, and so we are continuing to work on improving platform uptime and system performance. Our longer-term actions are as follows:
- A new environment for Perform is in the process of being commissioned that should be much faster. The persistent storage hardware/operating system is being switched from Windows to Linux, which should be inherently more scalable for the type of processing we undertake.
- The API processing servers have already been separated between Learn and Perform, so issues on one product's API servers should no longer affect the other product. Overall performance has also been improved, as these new elements are now both horizontally and vertically scalable.
- A stress test on Perform will be undertaken to mimic the behaviour of system users. This will highlight the overall system capacity. The outcome will be shared with customers.
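For readers unfamiliar with stress testing, the sketch below shows the general idea: many simulated users issue requests concurrently while response times and failures are recorded. It is an illustration only and does not describe the test Kallidus will run. The functions `run_stress_test` and `fake_request` are hypothetical, and `fake_request` stands in for a real call to the system under test so that the example runs without any network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stress_test(request_fn, users=10, requests_per_user=5):
    """Run request_fn concurrently for several simulated users and summarise the results."""

    def simulate_user(user_id):
        results = []
        for request_index in range(requests_per_user):
            start = time.perf_counter()
            try:
                request_fn(user_id, request_index)
                ok = True
            except Exception:
                ok = False  # count any failure, such as a timeout, as an error
            results.append((ok, time.perf_counter() - start))
        return results

    # One worker thread per simulated user, so all users are active at the same time.
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(simulate_user, range(users)))

    outcomes = [result for user_results in per_user for result in user_results]
    latencies = sorted(latency for _, latency in outcomes)
    return {
        "requests": len(outcomes),
        "errors": sum(1 for ok, _ in outcomes if not ok),
        "median_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }


def fake_request(user_id, request_index):
    """Stand-in for a real request: sleeps briefly and fails for a few chosen calls."""
    time.sleep(0.001)
    if request_index == 4 and user_id % 5 == 0:
        raise TimeoutError("simulated timeout")


summary = run_stress_test(fake_request, users=10, requests_per_user=5)
print(summary["requests"], summary["errors"])  # prints "50 2"
```

Increasing `users` step by step and watching where latency and error counts start to climb gives a rough picture of system capacity. A real test would typically use a dedicated load-testing tool and realistic user journeys.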
Some of these actions are already complete, and others are at an advanced stage. We hope this reassures you that these initiatives should mitigate risk and improve both scalability and performance going forward.
If you have any further questions about this exciting phase of our growth and development, please feel free to contact your account manager.