Bugs when exporting sites to pdf

Lazarus April 22, 2022

Hi,

I wanted to report a bug, but for some reason the link takes me here instead of to Jira, so feel free to move this if you want.
I was looking for a way to export all Confluence pages in my organization to some offline format, e.g. PDF. I found that it's possible to do that from the Confluence settings ... if you have the proper permissions. I didn't have those permissions and I'm not a fan of discussing such things with support, so I decided to write a script that would just do it.

Our Confluence has ~30 spaces with ~12,000 pages in total, so as you may expect I ran into a couple of issues along the way which might be interesting for you. However, before I start pointing them out, please note that I'm here to discuss the merits and I won't get involved in discussions with people who are overly attached to forum rules. If you want to throw my findings in the trash and remove this thread, I don't care; it's your forum and your decision.

Please also note that I don't know which Confluence version we are using and I might not be able to check. I don't know what server it runs on and I can't test on another instance, so advanced reproduction and log gathering are up to the Atlassian team. Anyway, here is the list of my findings:

1. Query time grows exponentially with the start offset when querying pages. We have a limit of 500 per query, so I will use that limit, but even if I set the limit to 1 the result is the same. Please take a look at these 4 queries:

a) <domain>/rest/api/content?type=page&start=0&limit=500
b) <domain>/rest/api/content?type=page&start=500&limit=500
c) <domain>/rest/api/content?type=page&start=3000&limit=500
d) <domain>/rest/api/content?type=page&start=11000&limit=500

Query a) finishes in 1 second. Query b) finishes in 2-3 seconds. Query c) finishes in 1 minute. Query d) finishes in 25 minutes.
At first I thought that maybe the server had noticed unusual behavior from my side and this was some protection mechanism. However, I tried multiple times over 2 days, and the bigger the start offset I gave, the longer it always took to return results.
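
To make the slowdown easy to measure, here is a minimal Python sketch that builds the same paged URLs and times each request. The base URL, the `fetch` callable and both helper names are assumptions for illustration; only the `/rest/api/content?type=page&start=...&limit=...` shape comes from the queries listed above.

```python
import time
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlencode


def page_query_url(base_url: str, start: int, limit: int = 500) -> str:
    """Build one paged content query in the same shape as queries a) to d)."""
    query = urlencode({"type": "page", "start": start, "limit": limit})
    return f"{base_url.rstrip('/')}/rest/api/content?{query}"


def time_queries(
    base_url: str,
    starts: Iterable[int],
    fetch: Callable[[str], object],
    limit: int = 500,
) -> List[Tuple[int, float]]:
    """Call fetch(url) for each start offset and record the elapsed seconds."""
    timings = []
    for start in starts:
        url = page_query_url(base_url, start, limit)
        began = time.monotonic()
        fetch(url)  # hypothetical: any function that performs the HTTP GET
        timings.append((start, time.monotonic() - began))
    return timings
```

Passing an authenticated `fetch` function and the offsets `[0, 500, 3000, 11000]` would reproduce the four measurements in one run.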

2. PDF generation hangs forever if there are inaccessible resources. I noticed that for some page IDs my script hung. It was not able to export the page even after I left it for 2-3 hours. At first I thought it was an issue with my script, but then I tried the export manually from the browser and it also hung forever, so I checked those pages more carefully. It happened for only 5 of the 12,000 pages, but always for the same pages. What those 5 pages had in common was that some of their content was hidden behind permissions. I also found a note on one page saying that in order to see the page fully I would need to be in a certain group. That part doesn't really matter; the real bug is that PDF generation takes forever instead of simply finishing immediately (as it does for pages to which I have no access at all).

3. Huge PDFs are generated when photos are involved. The generated PDFs were usually ~40 KB, so I was surprised to see that one page produced a 450 MB file. I visited that page manually and noticed that it contains only 12 photos. I thought that maybe someone had uploaded the photos in high resolution, so I downloaded them manually, and it turned out that each photo is about 8 MB. That's not small, but the PDF should still be ~100 MB, not 450 MB, so I think PDF generation could be optimized.


/ M

2 comments

James Ponting
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
May 13, 2022

Hi @Lazarus

I just came across this post whilst working on other things. To your points

  1. This is indeed a bug, and by chance, a fix is due out with Confluence 7.18.0: https://jira.atlassian.com/browse/CONFSERVER-57639
  2. Confluence doesn't support limiting specific components behind permissions, so this sounds like the functionality may be provided by a third-party plugin. Our exports will continue running until they complete, though there are internal timeouts on elements inside the PDF conversion process. Without knowing what's affecting this export, I can't comment on why it's not completing. I've never before seen this process hang, so it might be related to third-party code.
  3. This one may be an issue with our export, or it may be an issue with the combination of the images and the PDF file. Unfortunately this isn't quite so simple as it may seem from the outside.

Thanks for letting us know about the issues you found. If you're interested in investigating issues 2 and 3 further, our support team can assist (though I saw your note on this front). 

Hopefully being aware of the fix for issue 1 can help somewhat. I'd also suggest having a timeout on your script to abort a request for a PDF document if it exceeds some reasonable timeout (say 10 minutes) to avoid your whole script hanging.
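
A script-side timeout along those lines could look like the following Python sketch. It is only an illustration under stated assumptions: `download_with_deadline`, `ExportTimedOut` and the `open_stream` callable are hypothetical names, and the 600-second default simply mirrors the 10 minutes mentioned above.

```python
import time
from typing import BinaryIO, Callable


class ExportTimedOut(Exception):
    """Raised when a download exceeds its overall time budget."""


def download_with_deadline(
    open_stream: Callable[[], BinaryIO],
    max_seconds: float = 600.0,  # assumption: the 10 minutes suggested above
    chunk_size: int = 64 * 1024,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Read a response in chunks, aborting once max_seconds have elapsed."""
    deadline = clock() + max_seconds
    chunks = []
    with open_stream() as stream:
        while True:
            # Check the overall budget before every blocking read.
            if clock() > deadline:
                raise ExportTimedOut(f"gave up after {max_seconds} seconds")
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

One way to use it would be `open_stream=lambda: urllib.request.urlopen(export_url, timeout=60)`, where `export_url` is a placeholder for the PDF export address. The `timeout` argument of `urlopen` bounds each individual blocking socket operation, while the deadline check bounds the download as a whole, so a page that never responds fails instead of hanging the entire run.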

Thanks,
James Ponting
Engineering Manager - Confluence Data Center 

Lazarus May 13, 2022

Hi @James Ponting 

thanks for answering. 

1. Good


2. I don't know if it helps, but (apart from some normal text) this is what I see on such a page:
conf.png
If I right-click on the icon and copy the address, it takes me to Jenkins. When I give it my credentials it says that I'm "missing the Overall/Read permission", so as you wrote this might be related to a third-party (Jenkins) plugin. Still, I think that from the user's perspective the export mechanism should somehow time out or return an error.


3. If it's not reproducing on your site then it might be tricky indeed.


If you have specific requests then I can try to help more with 2 and 3, but I'm just a normal user, so I might be limited in many ways.

As for the timeout, I actually wanted to add one (there is the -m parameter in curl), but it doesn't work in my curl version, as described here: curl-max-time-and-connect-timeout-not-working-at-all
I could upgrade curl, but issue #2 only happened for 5 pages, so it was OK for me to blacklist those pages manually.

James Ponting
May 14, 2022

Hey @Lazarus

No worries, we help where we can :)

To your points

2. That looks like we're not able to get the expected response, so it's rendering as a broken image. It's quite likely related to the Jenkins plugin as our code should wait to timeout (which is a different problem in and of itself). I think I'm going to struggle to test this as I don't have a Jenkins instance to hand. If you have someone who has access to those Jenkins elements that can run the export of the page as a PDF to test, that would help isolate it. If the plugin is contributing to the issue, then having someone with the appropriate permissions run the script should allow it to export promptly and accurately on those pages.

3. I would expect that with the exact same files as in your instance, we would end up with the same result. The problem is the interaction between data storage methods. To wildly hypothesise, it sounds like the image is being stored as a bitmap in the PDF document. This may be because we're using a flag to allow the PDF to be rendered in more places/devices (a rasterised pdf), but the trade off is more size. It could also be another feature on the page triggering the rasterisation. I suspect saving the images as jpg files would fix this issue, but there are tradeoffs there too. There's a lot of guesses in there, so generally I'd get support involved so they can investigate.
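
The bitmap hypothesis can be sanity-checked with rough arithmetic. The Python sketch below assumes 24-bit RGB with no compression and a hypothetical 4000 x 3000 pixel photo; the real resolutions are not given in this thread, so the numbers are illustrative only.

```python
def raw_rgb_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Size of an uncompressed image: a fixed number of bytes per pixel."""
    return width_px * height_px * bytes_per_pixel


# Hypothetical resolution; the thread only mentions 12 photos of roughly 8 MB.
per_photo = raw_rgb_bytes(4000, 3000)
total_mb = 12 * per_photo / 1_000_000
print(per_photo, total_mb)  # prints: 36000000 432.0
```

Under those assumptions twelve photos come to about 432 MB, the same order of magnitude as the reported 450 MB. That is consistent with the images being stored uncompressed, though it does not prove it.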

That's quite unfortunate on the curl front. Given it's only a handful of pages, that makes sense to me. Unfortunately neither point is trivial to resolve. I understand you've mostly worked around the problems, but if they end up blocking you, I'd encourage you to open a support ticket along with the site admin so that our support team can try and help. If you're able to work with the current state of play, then that's fine too.

Either way, I appreciate your taking the time to let us know about these issues. Sorry I couldn't help further given the circumstances.

Regards,
James Ponting
Engineering Manager - Confluence Data Center 

Lazarus May 17, 2022

Hi @James Ponting 

2. Unfortunately I work only with people with the same access rights as me.

3. I've checked a couple of pages with significant PDF size and the pictures vary. There are photos of people, buildings, screenshots of apps, and some architecture pictures, so I don't think this problem is due to some special content on our Confluence. My guess is that it should be reproducible with any image. If it's not, then maybe we are using a somewhat older version of Confluence or something like that. Anyway, it might be as you've described: those JPG pictures are converted to BMP in the PDFs and that's why they consume more space.

I've worked around all the problems I had, so I'm not blocked on anything, but thanks for asking and for your replies!
