Hi everyone,
We’re building Retrieval-Augmented Generation (RAG) for our Gen-AI applications on top of our Confluence content, but we’ve hit a few roadblocks and would love some guidance.
Specifically, we’re using frameworks like LangChain and LlamaIndex, which successfully pull context from all pages within a Space. However, we’re running into trouble extracting large documents, and some queries through the Confluence REST API return 500 errors, likely due to undocumented limits or throttling.
Has anyone else encountered these issues? What’s the best approach to efficiently retrieve both pages and large documents from Confluence, for both Cloud and on-prem deployments?
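For reference, here's roughly how we're pulling pages today. This is a minimal sketch using the ConfluenceLoader from langchain-community; the URL, credentials, and space key are placeholders, and it assumes the older load(space_key=...) call signature (newer releases move those arguments to the constructor):

from langchain_community.document_loaders import ConfluenceLoader

# Placeholder credentials: an API token for Cloud,
# or a personal access token for Data Center/Server.
loader = ConfluenceLoader(
    url="https://example.atlassian.net/wiki",
    username="user@example.com",
    api_key="YOUR_API_TOKEN",
)

# Pull every page in one Space; limit is the page size per underlying API call.
docs = loader.load(space_key="SPACEKEY", limit=50)
print(f"Loaded {len(docs)} pages")

This works fine on small spaces; it's these same calls that start failing on large ones.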
Thanks in advance for any insights!
Hi @Ali Erdogan - there's an old but very relevant discussion about this here:
Warning, it's Reddit, so it's a frank discussion. :-}
But yeah, the upshot is that the Confluence API (and especially Confluence Cloud's) is not really designed for this kind of bulk extraction.
If you're on-prem, you'll get much better performance accessing the database directly, although depending on your permissions/restrictions, you might have to concern yourself with who is supposed to see the page data you're pulling.
You'll find details on the database schema here: Confluence Data Model
There's an example query (Postgres) showing how to dump the "source" for a given page in this article:
SELECT bc.body
FROM bodycontent bc
JOIN content c ON bc.contentid = c.contentid
WHERE c.title = '<page_title>'
  AND c.prevver IS NULL
\g /path/to/folder/filename.txt
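If you want to bulk-dump every current page in a space for a RAG pipeline, the same idea scales up like this. A sketch in Python with psycopg2; the connection details are placeholders, and the table/column names are taken from the Data Model page above (double-check them against your Confluence version):

import psycopg2

# Placeholder connection details for the Confluence Postgres database.
conn = psycopg2.connect(host="localhost", dbname="confluence",
                        user="confluence", password="secret")

# Current (non-historical) page bodies for one space. prevver IS NULL
# filters out prior versions; body is storage-format XHTML, not plain text.
sql = """
    SELECT c.contentid, c.title, bc.body
    FROM bodycontent bc
    JOIN content c ON bc.contentid = c.contentid
    JOIN spaces s ON c.spaceid = s.spaceid
    WHERE s.spacekey = %s
      AND c.contenttype = 'PAGE'
      AND c.prevver IS NULL
      AND c.content_status = 'current'
"""

with conn, conn.cursor() as cur:
    cur.execute(sql, ("SPACEKEY",))
    for contentid, title, body in cur:
        print(contentid, title, len(body))

Remember that this bypasses Confluence permissions entirely, so any page or space restrictions have to be enforced by your own pipeline.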
If you're on Cloud, or are committed to using the API... I don't know if I have any good news for you.
I found this issue, which sounds like what you might be running into: CONFCLOUD-71261 - Calling the REST API intermittently results in 500 errors
Some commenters reported that it was an excessive number of attachments in a space (20k) that caused the 500 errors, and one workaround they found was to NOT try to include attachments in the call:
In order to work around the problem, we changed to "expand=space,history,version,ancestors,metadata.properties" (i.e. skip children.attachment) and then loop through the pages individually to get the "children.attachment" information.
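In code, that workaround looks roughly like this. A sketch with the requests library against the endpoints mentioned in the issue; the base URL, credentials, and space key are placeholders:

import requests

BASE = "https://example.atlassian.net/wiki"    # placeholder site
AUTH = ("user@example.com", "YOUR_API_TOKEN")  # placeholder credentials

# First pass: list pages WITHOUT children.attachment in the expand,
# which avoids the 500s reported on attachment-heavy spaces.
params = {
    "expand": "space,history,version,ancestors,metadata.properties",
    "limit": 50,
    "start": 0,
}
pages = []
while True:
    r = requests.get(f"{BASE}/rest/api/space/SPACEKEY/content/page",
                     params=params, auth=AUTH)
    r.raise_for_status()
    data = r.json()
    pages.extend(data["results"])
    if data["size"] < params["limit"]:
        break
    params["start"] += params["limit"]

# Second pass: fetch the attachment info page by page.
for page in pages:
    r = requests.get(f"{BASE}/rest/api/content/{page['id']}/child/attachment",
                     auth=AUTH)
    r.raise_for_status()
    attachments = r.json()["results"]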
I'm guessing that the only true solution is to implement your own throttling and retry logic when you see 500 errors. It's unfortunate that the calls involved (like /rest/api/space/SPACEKEY/content/page from the issue above) don't return 429s, which would let you back off programmatically, per Atlassian's documentation on rate limiting.
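If you roll your own, it can be as simple as treating a 500 from these endpoints the way you'd treat a 429. A sketch (plain requests; the retry count and delays are arbitrary choices on my part, not Atlassian guidance):

import time
import requests

def get_with_backoff(url, max_retries=5, **kwargs):
    # Retry with exponential backoff on 429s, and on the 500s these
    # endpoints have been seen returning under load.
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, **kwargs)
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        # Honor Retry-After if the server ever does send a proper 429.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    return resp  # hand the caller the last failing response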
Hi @Darryl Lee ,
Thank you so much for the detailed explanation. The insights about the Confluence API and direct database access as an alternative are really valuable to me. Thanks again!
Hi @Ali Erdogan - glad I could provide you the information. I'd appreciate it if you could Accept my answer if you found it satisfactory. Thanks!
Hi Ali - Welcome to the Atlassian Community!
Have you considered exporting the page to XML?
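For example, each page's body can be pulled in Confluence's XML-based storage format via the REST API, short of doing a full space XML export. A sketch; the page ID and credentials are placeholders:

import requests

BASE = "https://example.atlassian.net/wiki"    # placeholder site
AUTH = ("user@example.com", "YOUR_API_TOKEN")  # placeholder credentials

# body.storage returns the page in Confluence's XML-based storage format.
r = requests.get(f"{BASE}/rest/api/content/12345",
                 params={"expand": "body.storage"}, auth=AUTH)
r.raise_for_status()
xml_body = r.json()["body"]["storage"]["value"]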