Best Practices for Extracting Large Documents from Confluence for RAG

Hi everyone,

We’re building a Retrieval-Augmented Generation (RAG) pipeline for our Gen-AI applications on top of our Confluence content, but we’ve hit a few roadblocks and would love some guidance.

Specifically, we’re using frameworks like LangChain and LlamaIndex, which successfully pull context from all pages within a Space. However, we’re facing challenges when it comes to extracting large documents. Additionally, some queries through the Confluence REST API are returning 500 errors, likely due to unknown limits or restrictions.

Has anyone else encountered these issues? What’s the best approach to efficiently retrieve both pages and large documents from Confluence, on both Cloud and on-prem?

Thanks in advance for any insights!

3 answers

1 accepted

2 votes

Answer accepted

Hi @Ali Erdogan - there's an old but very relevant discussion about this here:

How can I use `rest/api/content` to download all the 800k pages of my Confluence wiki without timing out?

Warning, it's Reddit, so it's a frank discussion. :-}

But yeah, the upshot is that the Confluence API (and especially the Cloud one) is not really designed for this kind of bulk extraction.

If you're on-prem, you'll get much better performance accessing the database directly. Keep in mind that a direct query bypasses Confluence's page permissions and restrictions, so you might have to work out for yourself who is supposed to see the page data you're pulling.

You'll find data on the database schema here: Confluence Data Model

There's an example query (Postgres) on how to dump the "source" for a given page in this article:

How to Recover Page Content from the Database and Import Using Source Editor

SELECT bc.body
FROM bodycontent bc
JOIN content c ON bc.contentid = c.contentid
WHERE c.title = '<page_title>'
AND c.prevver IS NULL \g /path/to/folder/filename.txt
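For anyone scripting this rather than running it in psql, here is a minimal Python sketch of the same query with the title passed as a bound parameter. It assumes only the tables and columns that appear in the query above. `fetch_current_body` is a hypothetical helper name, and the `placeholder` argument exists because DB-API drivers differ (psycopg2 uses `%s`, sqlite3 uses `?`). The in-memory SQLite tables are just a stand-in so the sketch runs without a Confluence database.

```python
import sqlite3

# Same query as above, minus the psql-only \g redirect.
# {ph} is filled with the driver's placeholder, never with user data.
QUERY = (
    "SELECT bc.body FROM bodycontent bc "
    "JOIN content c ON bc.contentid = c.contentid "
    "WHERE c.title = {ph} AND c.prevver IS NULL"
)


def fetch_current_body(conn, title, placeholder="%s"):
    """Return the bodies of the current versions of pages with this title.

    A list is returned because the same title can exist in several spaces.
    """
    cur = conn.cursor()
    cur.execute(QUERY.format(ph=placeholder), (title,))
    rows = cur.fetchall()
    cur.close()
    return [row[0] for row in rows]


# Stand-in data (assumption): one current version and one historical version.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE content (contentid INTEGER, title TEXT, prevver INTEGER);
    CREATE TABLE bodycontent (contentid INTEGER, body TEXT);
    INSERT INTO content VALUES (1, 'Runbook', NULL), (2, 'Runbook', 1);
    INSERT INTO bodycontent VALUES (1, '<p>current</p>'), (2, '<p>old</p>');
    """
)
bodies = fetch_current_body(demo, "Runbook", placeholder="?")
print(bodies)
```

Against a real Postgres instance you would pass a psycopg2 connection and keep the default `%s` placeholder. The body comes back in Confluence's storage format, so it will still need converting to plain text before chunking for RAG.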

If you're on Cloud, or are committed to using the API... I don't know if I have any good news for you.

I found this issue, which sounds like what you might be running into: CONFCLOUD-71261 - Calling the REST API intermittently results in 500 errors

Some commenters reported that it was an excessive number of attachments in a space (20k) that caused the 500 errors, and one workaround they found was to NOT try to include attachments in the call:

In order to work around the problem, we changed to "expand=space,history,version,ancestors,metadata.properties" (i.e. skip children.attachment) and then loop through the pages individually to get the "children.attachment" information.
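The workaround quoted above could be sketched roughly like this in Python. The page listing path and the `expand` value come from the issue comments. The per-page `/rest/api/content/{id}/child/attachment` path, the `start`/`limit` paging, and the `results`/`limit` response fields are my assumptions about the v1 REST API, so check them against your instance. `fetch` is a hypothetical callable you supply, for example a thin wrapper around `requests.get(base_url + path, params=params, auth=...).json()`, where the base URL on Cloud normally includes `/wiki`.

```python
# Expand list from the workaround: everything except children.attachment.
PAGE_EXPAND = "space,history,version,ancestors,metadata.properties"


def paged(fetch, path, params, limit=25):
    """Yield every item from a start/limit paginated endpoint."""
    start = 0
    while True:
        data = fetch(path, dict(params, start=start, limit=limit))
        results = data.get("results", [])
        yield from results
        # Assumption: the response echoes the limit it actually applied,
        # which may be lower than the one requested.
        effective = data.get("limit", limit)
        if len(results) < effective:
            return
        start += len(results)


def pages_with_attachments(fetch, space_key):
    """List pages without attachments, then look them up page by page."""
    pages_path = f"/rest/api/space/{space_key}/content/page"
    for page in paged(fetch, pages_path, {"expand": PAGE_EXPAND}):
        att_path = f"/rest/api/content/{page['id']}/child/attachment"
        yield page, list(paged(fetch, att_path, {}))


# In-memory stand-in for the API (assumption), so the sketch runs offline.
FAKE_DATA = {
    "/rest/api/space/DOCS/content/page": [{"id": "1"}, {"id": "2"}, {"id": "3"}],
    "/rest/api/content/1/child/attachment": [{"title": "a.pdf"}],
}


def fake_fetch(path, params):
    items = FAKE_DATA.get(path, [])
    start, limit = params["start"], params["limit"]
    return {"results": items[start:start + limit], "limit": limit}


result = list(pages_with_attachments(fake_fetch, "DOCS"))
```

This trades one heavy request for many light ones, so it is worth pairing with some form of throttling.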

I'm guessing the only real solution is to implement your own throttling whenever you see 500 errors. It's unfortunate that the calls involved (in the issue above, /rest/api/space/SPACEKEY/content/page) don't return 429s, which would let you back off programmatically, per Atlassian's documentation:

Rate limiting
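A do-it-yourself throttle for those 500s might look like the sketch below: retry with exponential backoff plus a little jitter. Everything here is illustrative rather than anything Atlassian prescribes. `ServerError`, `call_with_backoff`, and the delay values are hypothetical, and `request` stands for whatever function performs your API call and raises `ServerError` on a 5xx response.

```python
import random
import time


class ServerError(Exception):
    """Raised by the caller-supplied request function on an HTTP 5xx."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request(); on ServerError wait, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Stand-in request (assumption): fails twice with a 500, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ServerError(500)
    return {"results": []}


waits = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=waits.append)
```

Passing `sleep` in as a parameter is only there to make the function easy to test. In real use you would leave the default `time.sleep`.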

Hi @Darryl Lee ,

Thank you so much for the detailed and comprehensive explanations. The insights you provided about the Confluence API and alternative methods for database access are really valuable to me. Thanks again!

Hi @Ali Erdogan - glad I could help. I'd appreciate it if you could Accept my answer if you found it satisfactory. Thanks!

1 vote

Hi Ali - Welcome to the Atlassian Community!

Have you considered exporting the page to XML?

Hi @John Funk

Yes, I did consider that. Thanks for the suggestion!

0 votes

Hi @Ali Erdogan

Were you able to find a solution to this? Are you building your own scraper to extract the space? I am trying to build some RAG applications too and want to extract the whole page tree (data, content, attachments) for this.

I would love to hear about any successful workaround. Thanks!
