Best Practices for Extracting Large Documents from Confluence for RAG

Ali Erdogan August 16, 2024

 

 

Hi everyone,

We’re working on building Retrieval Augmented Generation (RAG) for our Gen-AI applications on top of our Confluence content, but we’ve hit a few roadblocks and would love some guidance.

Specifically, we’re using frameworks like LangChain and LlamaIndex, which successfully pull context from all pages within a Space. However, we’re facing challenges when it comes to extracting large documents. Additionally, some queries through the Confluence REST API are returning 500 errors, likely due to unknown limits or restrictions.

Has anyone else encountered these issues? What’s the best approach to efficiently retrieve both pages and large documents from Confluence, on both Cloud and on-prem?

Thanks in advance for any insights!

 

2 answers

1 accepted

2 votes
Answer accepted
Darryl Lee
Community Leader
Community Leaders are connectors, ambassadors, and mentors. On the online community, they serve as thought leaders, product experts, and moderators.
August 17, 2024

Hi @Ali Erdogan - there's an old but very relevant discussion about this here:

Warning, it's Reddit, so it's a frank discussion. :-}

But yeah, the upshot is that the Confluence API (and especially Confluence Cloud) is not really designed for this kind of thing.

If you're on-prem, you'll get much better performance accessing the database directly, although depending on your permissions/restrictions, you might have to concern yourself with who is supposed to see the page data you're pulling.

You'll find data on the database schema here: Confluence Data Model 

There's an example query (Postgres) on how to dump the "source" for a given page in this article:

SELECT bc.body
FROM bodycontent bc
JOIN content c ON bc.contentid = c.contentid
WHERE c.title = '<page_title>'
  AND c.prevver IS NULL
\g /path/to/folder/filename.txt
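
If you would rather run that query from a script than from psql, here is a minimal Python sketch. It assumes a DB-API 2.0 connection (for example psycopg2 against the Confluence Postgres database). The names fetch_page_bodies and placeholder are my own illustration, not part of any Atlassian tooling. It returns a list because the same title can exist in more than one space, so the query may match several rows.

```python
# Same query as above, with the page title passed as a bound parameter
# instead of being pasted into the SQL string.
PAGE_BODY_SQL = """
SELECT bc.body
FROM bodycontent bc
JOIN content c ON bc.contentid = c.contentid
WHERE c.title = {ph}
  AND c.prevver IS NULL
"""


def fetch_page_bodies(conn, title, placeholder="%s"):
    """Return the stored body of every current page with this title.

    conn is any DB-API 2.0 connection (assumption: psycopg2 for Postgres).
    placeholder is the driver's parameter marker: "%s" for psycopg2,
    "?" for sqlite3 and some other drivers.
    """
    cur = conn.cursor()
    try:
        cur.execute(PAGE_BODY_SQL.format(ph=placeholder), (title,))
        # One row per matching current page; column 0 is bc.body.
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()
```

With psycopg2 you would pass the connection returned by psycopg2.connect(...) and keep the default placeholder. Note that this bypasses Confluence's permission checks entirely, so the point about page restrictions still applies.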

If you're on Cloud, or are committed to using the API... I don't know if I have any good news for you.

I found this issue, which sounds like what you might be running into: CONFCLOUD-71261 - Calling the REST API intermittently results in 500 errors 

Some commenters reported that it was an excessive number of attachments in a space (20k) that caused the 500 errors, and one workaround they found was to NOT try to include attachments in the call:

In order to work around the problem, we changed to "expand=space,history,version,ancestors,metadata.properties" (i.e. skip children.attachment) and then loop through the pages individually to get the "children.attachment" information. 
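
A rough Python sketch of that workaround, in case it helps: list the pages without children.attachment, then ask for each page's attachments separately. The fetch_json callable is a hypothetical hook that you would wire to your own HTTP client (base URL, auth, and on Cloud the /wiki prefix). The function names and page_size default are illustrative. The sketch assumes the usual paged response shape with a "results" list and start/limit parameters.

```python
# Expand list quoted in the workaround above (no children.attachment).
PAGE_EXPAND = "space,history,version,ancestors,metadata.properties"


def iter_results(fetch_json, path, params, page_size=25):
    """Yield every item of a paged collection.

    fetch_json(path, params) is a hypothetical hook that performs the GET
    and returns the decoded JSON body as a dict.
    """
    start = 0
    while True:
        data = fetch_json(path, {**params, "start": start, "limit": page_size})
        results = data.get("results", [])
        if not results:
            return  # an empty batch means we are past the last item
        yield from results
        # Advance by what we actually received; the server may return
        # fewer items than the requested limit.
        start += len(results)


def iter_pages_with_attachments(fetch_json, space_key, page_size=25):
    """Yield (page, attachments) pairs, fetching attachments per page."""
    pages_path = f"/rest/api/space/{space_key}/content/page"
    for page in iter_results(fetch_json, pages_path,
                             {"expand": PAGE_EXPAND}, page_size):
        attachments_path = f"/rest/api/content/{page['id']}/child/attachment"
        attachments = list(iter_results(fetch_json, attachments_path, {}, page_size))
        yield page, attachments
```

This trades one heavy request for many light ones, so it is slower overall, but each individual call asks the server for far less.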

I'm guessing that the only true solution is to implement your own throttling if you see 500 errors. It's unfortunate that whatever calls you are using (like the one in the issue above, /rest/api/space/SPACEKEY/content/page) don't return 429s, which would allow you to back off programmatically, per Atlassian's documentation on that:
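
As a sketch of what "your own throttling" could look like, here is a small Python retry helper that treats 5xx responses the same way you would treat a 429: wait, then try again with an exponentially growing delay plus some jitter. Everything here is an assumption for illustration: the helper name call_with_backoff, the request callable returning a (status, body) tuple, and the default delays are mine and are not taken from Atlassian's documentation.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request() until it stops returning 429 or a 5xx status.

    request is a hypothetical zero-argument callable that performs one
    HTTP call and returns (status_code, body). sleep is injectable so the
    helper can be exercised without actually waiting.
    """
    status = None
    for attempt in range(max_attempts):
        status, body = request()
        if status != 429 and status < 500:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay,
        # plus up to 50% random jitter so parallel workers do not retry in step.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"giving up after {max_attempts} attempts, last status {status}")
```

Wrapping each REST call in a helper like this keeps a long crawl of a space from dying on a single intermittent 500.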

 

Ali Erdogan August 18, 2024

Hi @Darryl Lee ,

Thank you so much for the detailed and comprehensive explanations. The insights you provided about the Confluence API and alternative methods for database access are really valuable to me. Thanks again!

 

Darryl Lee
Community Leader
August 18, 2024

Hi @Ali Erdogan - glad I could provide you the information. I'd appreciate it if you could Accept my answer if you found it satisfactory. Thanks!

1 vote
John Funk
Community Leader
August 17, 2024

Hi Ali - Welcome to the Atlassian Community!

Have you considered exporting the page to XML? 

Ali Erdogan August 18, 2024

Hi @John Funk 

Yes, I did consider that. Thanks for the suggestion!

