I'm a developer on Confluence Data Center, and I mainly work on improving the stability and performance of the product.
In this post I would like to introduce the Sandbox Framework, an internal tool that Confluence DC uses to run unsafe code in a separate OS process for better stability. The framework could potentially be made public for plugins, but that work is not finished yet and the API has not been stabilized.
The main reason I'm writing this post is to get feedback from plugin developers: are you interested in this framework, and do you see any use cases for it? If the framework looks helpful to you and you want to see it made available, please comment here with your use case. The more details we get now, the better we can understand how relevant the framework is to the plugin community. Your feedback can also shape the final API design.
If some parts of the framework internals are not clear, or you would like more detail, please leave a comment and I will expand the post. As a companion to this post, I've implemented a Hello World macro which uses the Sandbox Framework. The code is available in a Bitbucket repo. Please keep in mind that the API is still raw and details will change, so the code is only meant to communicate the general idea and usage pattern.
First, let's understand what problem we were trying to solve. In some parts of the Confluence code we have to invoke 3rd party libraries to do potentially heavy work. A good example is converting between document formats: when you click on a .doc attachment, we convert it to .pdf in order to preview it inside the browser. This is not a trivial operation, because documents can be arbitrarily big and complex, and it is not always possible to estimate how much work a specific document will require. Another example of a heavy operation is the 'Export to PDF' feature: under the hood it renders HTML to a PDF file, which is handled by an external library. Depending on the HTML size and layout, conversion time and resource usage can vary dramatically.
Historically, all these heavy lifting tasks were executed inside the main Confluence process. Because the amount of work done by a 3rd party library is unpredictable, each such operation inside the main process can affect the stability of the whole application. For example, if a user tries to convert a big or complex document, the library might consume a lot of CPU or memory, which affects other Confluence functionality. In the worst case the library can throw an OutOfMemoryError or StackOverflowError, which leaves the JVM unusable. If the customer has many thousands of users in the instance, the probability that someone triggers such a heavy operation at any given time becomes non-negligible. As a result, the customer can have regular outages caused by heavy operations running inside the Confluence JVM. The problem is also not limited to big documents: every complex 3rd party library has bugs which can cause infinite loops, infinite recursion or massive memory allocations for some special inputs.
How we guard Confluence from unsafe code
The problem comes from the inability to control the resources (time, CPU, memory) taken by a heavy operation, which lets the operation affect product stability. As a solution, we implemented the Sandbox Framework in Confluence, which lets us launch unsafe code in separate JVMs on the same physical host. The API and usage of the framework are very similar to working with thread pools; essentially, the framework is a process pool. Confluence starts several external JVM processes which wait for incoming tasks, and Confluence manages the whole lifecycle of those processes. Confluence can delegate work to a Sandbox process, and if something goes wrong, it kills the existing process and starts a new one. Using external processes solves, by design, the stability issues we were facing, because we rely on the guarantees that the operating system provides about process isolation. If unsafe code takes too long to complete or allocates a lot of memory in a Sandbox process, the Confluence process can safely kill that Sandbox process and report an error back to the caller. As a result, all problems inside the managed Sandbox JVMs are contained and do not affect the stability of the main Confluence process. The Confluence process has full ownership of the Sandbox process lifecycle: creating, communicating with and destroying Sandbox processes happens transparently to the user and to the developer.
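To make the thread pool analogy concrete, here is a minimal sketch of a pool that replaces a worker after a failed task. It is not the Confluence API: `SandboxPoolSketch`, `Worker` and `FakeWorker` are hypothetical names, and `FakeWorker` runs in-process only so the example is self-contained. A real worker would wrap a child JVM.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch of a replace-on-failure worker pool; not the Confluence API. */
class SandboxPoolSketch {

    /** Stand-in for one sandbox process. A real worker would wrap a java.lang.Process. */
    interface Worker {
        String execute(String request) throws Exception; // run one task
        void kill();                                      // e.g. Process.destroyForcibly()
    }

    /** In-process fake used only for the demo: fails on the request "boom". */
    static class FakeWorker implements Worker {
        public String execute(String request) {
            if (request.equals("boom")) {
                throw new IllegalStateException("crashed");
            }
            return request.toUpperCase();
        }
        public void kill() { /* nothing to clean up in the fake */ }
    }

    private final BlockingQueue<Worker> idle = new LinkedBlockingQueue<>();
    private final Supplier<Worker> factory;
    private final AtomicInteger replaced = new AtomicInteger();

    SandboxPoolSketch(int size, Supplier<Worker> factory) {
        this.factory = factory;
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // workers are started up front and wait for tasks
        }
    }

    /** Runs one task on an idle worker; a failing worker is killed and replaced. */
    String submit(String request) {
        Worker worker;
        try {
            worker = idle.take(); // blocks until a worker is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "error: interrupted";
        }
        try {
            String result = worker.execute(request);
            idle.add(worker); // healthy worker goes back to the pool
            return result;
        } catch (Exception failure) {
            worker.kill();
            idle.add(factory.get()); // keep the pool at its original size
            replaced.incrementAndGet();
            return "error: " + failure.getMessage();
        }
    }

    int replacedWorkers() {
        return replaced.get();
    }

    public static void main(String[] args) {
        SandboxPoolSketch pool = new SandboxPoolSketch(2, FakeWorker::new);
        System.out.println(pool.submit("convert"));
        System.out.println(pool.submit("boom"));
        System.out.println(pool.submit("convert again"));
    }
}
```

The caller only ever sees a result or an error, which mirrors the idea that process management stays hidden from the developer.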
Obviously, once we introduce separate OS processes, the system gains some limitations, and in some cases we have to redesign the code which does the heavy work in order to use the Sandbox Framework.
Since the unsafe code is executed in a separate process, the most obvious architectural change is that it can no longer directly access the memory of the Confluence process. In our implementation, communication between Confluence and the Sandbox process is done via OS pipes. Among other things, this means that inside the Sandbox process we don't have access to the Spring context with everything set up for us. Because Services and the DB cannot be accessed from the Sandbox process, we should design our features in a self-contained way, so that a Sandbox operation is one unit of work whose input is fully defined before the operation starts (the PDF Export section below shows one way to work around this limitation when fetching images).
To give more details about the framework, I would like to show how Confluence uses it in two real-world examples: document conversion and the 'Export to PDF' feature.
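As an illustration of a self-contained unit of work, the sketch below serializes a request so that everything the sandbox needs travels over the pipe up front. The class `SandboxRequestSketch` and its fields are assumptions made for this example, not the actual Confluence wire format. A `ByteArrayOutputStream` stands in for the OS pipe; with a real child process the stream would come from `Process.getOutputStream()`, which is connected to the child's stdin.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical request format for one sandbox task; field names are illustrative only. */
class SandboxRequestSketch {
    final String inputPath;    // file the sandbox should read
    final String outputPath;   // file the sandbox should write
    final String targetFormat; // e.g. "pdf"

    SandboxRequestSketch(String inputPath, String outputPath, String targetFormat) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.targetFormat = targetFormat;
    }

    /** Host side: writes the whole request to the stream leading to the sandbox. */
    void writeTo(OutputStream pipe) {
        try {
            DataOutputStream out = new DataOutputStream(pipe);
            out.writeUTF(inputPath);
            out.writeUTF(outputPath);
            out.writeUTF(targetFormat);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sandbox side: reads one request, fields in the same order they were written. */
    static SandboxRequestSketch readFrom(InputStream pipe) {
        try {
            DataInputStream in = new DataInputStream(pipe);
            String input = in.readUTF();
            String output = in.readUTF();
            String format = in.readUTF();
            return new SandboxRequestSketch(input, output, format);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for the OS pipe
        new SandboxRequestSketch("/tmp/in.doc", "/tmp/out.pdf", "pdf").writeTo(wire);
        SandboxRequestSketch received = readFrom(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(received.inputPath + " -> " + received.outputPath);
    }
}
```

Passing file paths rather than file contents keeps the message small, which fits the case where both processes run on the same host and share a filesystem.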
Usage in document conversion
This was the first use case for which the Sandbox was implemented. The feature covers thumbnail generation and conversion between various office formats (e.g. doc, ppt) and PDF for generating a browser preview. Initially, document conversion was implemented on the backend by calling the corresponding Aspose library for each document type. Converted documents were cached in the filesystem, so that converting the same document twice would use the cached result and not make the system repeat the work.
This approach suffered from the problems discussed above. Since the conversion was done in the same process, it could impact Confluence stability through heavy usage of CPU or heap.
If we consider the abstract task of converting a document between formats, it has a single initial document as input and the resulting file as output. What is more, the conversion involves only Aspose code and doesn't need any access to Confluence Services. That makes this task a good candidate to be 'sandboxed', and the feature maps directly onto the new Sandbox Framework architecture. The Aspose conversion can be executed in a Sandbox process which takes a file from disk as input and writes the result somewhere on disk, where Confluence can access it.
We can define a timeout for a conversion. For example, if it takes more than 30 seconds, we can kill the JVM doing the conversion and return an error to the browser which initiated it. Also, if a conversion tries to allocate more memory than the Sandbox JVM heap size allows, the failure stays inside the Sandbox process and does not affect Confluence.
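The timeout and heap cap described above can be sketched with the standard `ProcessBuilder` and `Process` APIs. This is a minimal sketch under stated assumptions, not the framework's implementation: `SandboxTimeoutSketch` and `runJvm` are hypothetical names, and the child JVM here only runs `-version` so the example works without any extra classes.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: run a child JVM with its own heap cap and a hard time limit. */
class SandboxTimeoutSketch {

    /** Returns the child's exit code, or an empty value if it was killed after the timeout. */
    static OptionalInt runJvm(List<String> jvmArgs, String maxHeap, Duration timeout) {
        List<String> command = new ArrayList<>();
        // Reuse the java binary of the JVM we are currently running in.
        command.add(System.getProperty("java.home") + File.separator + "bin" + File.separator + "java");
        command.add("-Xmx" + maxHeap); // the heap cap applies to the child only
        command.addAll(jvmArgs);
        try {
            Process child = new ProcessBuilder(command)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on full pipes
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (child.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return OptionalInt.of(child.exitValue()); // finished in time
            }
            child.destroyForcibly().waitFor(); // too slow: kill the child and wait for it to exit
            return OptionalInt.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A real caller would pass the classpath and main class of the conversion task here.
        OptionalInt exit = runJvm(List.of("-version"), "64m", Duration.ofSeconds(30));
        System.out.println(exit.isPresent() ? "exit code " + exit.getAsInt() : "killed after timeout");
    }
}
```

An OutOfMemoryError in the child would surface to the host only as a non-zero exit code or a timeout, so the host JVM's own heap is never involved.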
Usage in PDF Export
Confluence can convert an individual page to a PDF file or export a whole space as a single PDF file. Space export has been a known source of instability (heavy CPU usage and possible OutOfMemoryErrors) because spaces can be arbitrarily large. In version 6.12 my team used the Sandbox Framework to rework PDF Export and make it more stable. This required a bit more effort than the previous document conversion use case.
Let's first consider the conversion of a single page to PDF. On the Confluence backend this consists of several steps:
1. Fetch page source XML from DB
2. Render internal storage format to HTML
3. Convert HTML to PDF using the Flying Saucer library
The most dangerous item in this list is step 3, because this is where we execute an external library to do the conversion. It might seem that converting HTML to PDF is exactly the same as the previous document conversion use case, but that is not quite true. Suppose the page being converted contains an image attachment. This image has to end up rendered inside the PDF. When we move the conversion to a Sandbox process, we must be careful: the Sandbox process no longer has access to Confluence memory, so we need to either prefetch all images referenced by the HTML page or give the Sandbox some way to call back to Confluence and fetch the image.
To solve this, we added a Callback mechanism to the Sandbox Framework. Essentially, it is a way for the Sandbox to execute some code in the main Confluence process and fetch the result. This mechanism can be used to work around the limitation that the Sandbox can't access any Services. It is implemented using interprocess communication (pipes).
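The request/response shape of such a callback can be sketched as follows. This is an assumption-laden illustration, not the framework's protocol: `SandboxCallbackSketch`, `serve`, `fetch` and `demo` are hypothetical names, the message layout is invented for the example, and in-memory `PipedInputStream`/`PipedOutputStream` pairs between two threads stand in for OS pipes between two processes.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a sandbox-to-host callback over a pair of pipes. */
class SandboxCallbackSketch {

    /** Host side: answers resource requests until the sandbox closes its end of the pipe. */
    static void serve(InputStream fromSandbox, OutputStream toSandbox,
                      Function<String, byte[]> lookup) throws IOException {
        DataInputStream in = new DataInputStream(fromSandbox);
        DataOutputStream out = new DataOutputStream(toSandbox);
        while (true) {
            String name;
            try {
                name = in.readUTF();
            } catch (EOFException sandboxDone) {
                return; // the sandbox closed its output: no more requests
            }
            byte[] body = lookup.apply(name);
            out.writeInt(body == null ? -1 : body.length); // -1 means "not found"
            if (body != null) {
                out.write(body);
            }
            out.flush();
        }
    }

    /** Sandbox side: asks the host for one resource and blocks until the reply arrives. */
    static byte[] fetch(DataOutputStream toHost, DataInputStream fromHost, String name)
            throws IOException {
        toHost.writeUTF(name);
        toHost.flush();
        int length = fromHost.readInt();
        if (length < 0) {
            return null;
        }
        byte[] body = new byte[length];
        fromHost.readFully(body);
        return body;
    }

    /** Wires both sides together with in-memory pipes and fetches a single resource. */
    static byte[] demo(Map<String, byte[]> attachments, String name) {
        try {
            PipedOutputStream sandboxOut = new PipedOutputStream();
            PipedInputStream hostIn = new PipedInputStream(sandboxOut);
            PipedOutputStream hostOut = new PipedOutputStream();
            PipedInputStream sandboxIn = new PipedInputStream(hostOut);

            Thread host = new Thread(() -> {
                try {
                    serve(hostIn, hostOut, attachments::get);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            host.start();

            byte[] result;
            try (DataOutputStream out = new DataOutputStream(sandboxOut);
                 DataInputStream in = new DataInputStream(sandboxIn)) {
                result = fetch(out, in, name);
            } // closing the streams tells the host loop to stop
            host.join();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] image = demo(Map.of("logo.png", new byte[] {1, 2, 3}), "logo.png");
        System.out.println("fetched " + image.length + " bytes");
    }
}
```

The sandbox blocks inside `fetch` while the host does the lookup, so from the converter's point of view the image simply appears, even though the data lives in another process.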
Finally, with this approach in place, we set a default timeout of 30 seconds for page conversion. If the page can't be converted within that time, the Sandbox process is killed and an error is shown to the user.
We can build space conversion on top of this 'sandboxed' page conversion. To do so, we process the space page by page and convert each page to its own PDF file. After all pages have been processed this way, we glue the PDF files together into a single PDF file which represents the whole space.
There are several gotchas in this approach, because the resulting PDF file should contain a table of contents with working page numbers and links. Also, links between pages should be correctly converted to PDF bookmarks. We implemented this with one extra pass over the resulting PDF file, which fixes all the links between pages and renders the final page numbers onto the pages.
As a result, all operations on PDF files happen inside the Sandbox process, so if something goes wrong, the performance and stability of the main application are not affected.
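One small piece of that extra pass is working out where each page's PDF starts in the merged file, which is a running sum over the per-page PDF lengths. The sketch below shows only that arithmetic and does not touch any PDF library. `PageOffsetSketch`, `startPages` and the `firstPage` parameter (the page number where the first document begins, for example after a table of contents) are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: where does each per-page PDF start inside the merged PDF? */
class PageOffsetSketch {

    /**
     * pageCounts holds the number of PDF pages produced for each Confluence page, in merge order.
     * Returns the 1-based page number in the merged file at which each document begins.
     */
    static List<Integer> startPages(List<Integer> pageCounts, int firstPage) {
        List<Integer> starts = new ArrayList<>();
        int next = firstPage;
        for (int count : pageCounts) {
            starts.add(next);
            next += count; // the following document begins right after this one
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three Confluence pages rendered to 2, 5 and 1 PDF pages, with no front matter.
        System.out.println(startPages(List.of(2, 5, 1), 1)); // prints [1, 3, 8]
    }
}
```

A link from one Confluence page to another can then be pointed at the start page of the target document, and the same numbers can be used for the table of contents.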
We want to hear from you
Thank you for reading this post! If you're interested in using the framework to make your plugin more stable, please tell us about your use case. I will be happy to answer any questions about the framework.