Hello - Looking for some insight on this problem statement:
How to retrieve all content from all global spaces in your Confluence instance
Problem Statement:
I want to get all the content from all spaces with type "global".
This currently takes more than 70 API calls, because each call returns at most 25 objects and I first have to collect the "space key" of every space, then loop over the keys and call "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage" for each one.
If there is a way to reduce the number of API calls needed to achieve my goal, I would very much like to know. It now takes about 37 seconds to execute, which is too slow for my use case.
Thanks in advance!
Reference ticket and latest code:
How can I get all text from all spaces and their c... (atlassian.com)
CC: @Bas Hulskamp
Warm Regards
Hi @Humashankar VJ @Aron Gombas _Midori_ @Barbara Szczesniak
Thank you all for your assistance!
I have now solved my issue by using async functions to retrieve the data from the Confluence API. This brought the execution time down from roughly 60 seconds to 3 seconds or less.
This is my code:
import asyncio
import json

import aiohttp

# Document is the document class used by the RAG application; its import is not shown in this post.


class ConfluenceService:
    # base_url, headers and WIKI_DATA_PATH are defined elsewhere in the class (not shown).

    async def get_all_global_space_keys(self) -> list[str]:
        url = self.base_url + "/rest/api/space"
        params = {"type": "global"}
        all_keys = []
        async with aiohttp.ClientSession() as session:
            # Follow the "next" link until the last page of spaces is reached.
            while True:
                async with session.get(url, headers=self.headers, params=params) as response:
                    data = await response.json()
                    for result in data["results"]:
                        all_keys.append(result["key"])
                    next_url = data['_links'].get('next', None)
                    if next_url:
                        url = self.base_url + next_url
                    else:
                        break
        return all_keys

    async def get_content_from_space(self, session, space_key) -> list[Document]:
        documents = []
        url = self.base_url + "/rest/api/space/" + space_key + "/content"
        params = {"expand": "body.storage"}
        async with session.get(url, headers=self.headers, params=params) as response:
            data = await response.json()
            for result in data["page"]["results"]:
                content_url = self.base_url + result["_links"]["webui"]
                # content: str = BeautifulSoup(result["body"]["storage"]["value"], 'html.parser').get_text()
                content: str = result["body"]["storage"]["value"]
                documents.append(Document(content, metadata={"source": content_url}).to_json())
        return documents

    async def get_all_content_from_all_global_spaces(self, space_keys: list[str]):
        documents: list[Document] = []
        async with aiohttp.ClientSession() as session:
            # One task per space; gather runs the requests concurrently.
            tasks = [self.get_content_from_space(session, key) for key in space_keys]
            results = await asyncio.gather(*tasks)
            for result in results:
                documents.extend(result)
        with open(self.WIKI_DATA_PATH, 'w', encoding='utf-8') as file:
            file.write(json.dumps(documents))

    async def get_all_async(self):
        space_keys = await self.get_all_global_space_keys()
        await self.get_all_content_from_all_global_spaces(space_keys)


c = ConfluenceService()
asyncio.run(c.get_all_async())
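One thing worth double-checking in the code above: get_content_from_space reads only the first page of results for each space, so larger spaces may be cut short. Below is a minimal sketch of a reusable pagination helper. It assumes the endpoint returns "results" and "_links.next" at the top level of the JSON, as /rest/api/space does in the code above. The names collect_paginated, fetch_json, fake_fetch_json and the wiki.example.com host are hypothetical and only there for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical signature: takes a URL and optional query params, returns parsed JSON.
FetchJson = Callable[[str, Optional[dict]], Awaitable[dict]]


async def collect_paginated(fetch_json: FetchJson, base_url: str, path: str,
                            params: Optional[dict] = None) -> list:
    """Follow `_links.next` until the server stops returning one."""
    results: list = []
    url: Optional[str] = base_url + path
    while url:
        data = await fetch_json(url, params)
        results.extend(data.get("results", []))
        next_path = data.get("_links", {}).get("next")
        url = base_url + next_path if next_path else None
        # Assumption: the "next" link already carries its own query string.
        params = None
    return results


# Fake two-page response so the sketch runs without a Confluence instance.
FAKE_PAGES = {
    "https://wiki.example.com/rest/api/space": {
        "results": [{"key": "AAA"}, {"key": "BBB"}],
        "_links": {"next": "/rest/api/space?start=2&limit=2"},
    },
    "https://wiki.example.com/rest/api/space?start=2&limit=2": {
        "results": [{"key": "CCC"}],
        "_links": {},
    },
}


async def fake_fetch_json(url: str, params: Optional[dict] = None) -> dict:
    return FAKE_PAGES[url]


def demo_keys() -> list:
    spaces = asyncio.run(collect_paginated(
        fake_fetch_json, "https://wiki.example.com", "/rest/api/space",
        {"type": "global", "limit": 2}))
    return [space["key"] for space in spaces]
```

With aiohttp, fetch_json could be a small wrapper around session.get(...) and response.json(). Passing a larger "limit" in params should also reduce the number of round trips, although the maximum the server honours depends on the instance.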
Great to know, @Bas Hulskamp. Have a great one! Let me mark this as solved for others' future reference.
The reference question is tagged with Data Center, but this question is tagged for Cloud.
I work with Cloud, and I do not use the REST API, so I'm not sure I can provide assistance.
My first questions are:
Hi @Bas Hulskamp - Kindly answer Barbara's questions.
Good afternoon @Barbara Szczesniak ,
I use the REST API to query an on-premises instance of Confluence.
I use all the text of the pages and files as content for my RAG (Retrieval Augmented Generation) application. RAG is an AI technique, in case you didn't know.
It now takes 74 seconds to get all content of all spaces (as explained in the linked post), because I have to call the endpoint "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage" at least once per space. On top of that, I have to page through the spaces to collect the space keys, with a limit of 25 per call, because I couldn't find an endpoint that returns the total number of spaces (see the code in the original post).
I would like to know if there is an alternative way that takes fewer API calls and is faster.
I'm not sure what the "export spaces" feature is that you mentioned... how does that work? Would that help me in this case?
@Barbara Szczesniak Could you please answer my question? Letting me know if you don't know the answer would also be great
@Bas Hulskamp I don't know anything about API calls or on premise Confluence.
According to this page (https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html ), you can export pages or a space to Word, HTML, PDF, or XML formats. I just thought maybe you could use this output for whatever you were using the content for—your RAG application, I guess. I'm not sure if this format would suit your needs.
@Barbara Szczesniak I don't think so... I would have to read the files and then process the text. Having the HTML strings directly is faster and more effective, I think.
Do you know of someone else that could help me? Or is my request not possible with the current API? Note: I use the V1 API, because the Confluence instance isn't on the latest version (and I'm not the administrator).
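On the question of fewer calls with the V1 API: one option that may be worth testing is the content search endpoint with a CQL filter on space type, which would page through content across all matching spaces in one result set instead of one request per space. The sketch below only builds the request URL. It assumes that /rest/api/content/search and the CQL field space.type are available on the Confluence version in use, which should be verified against the instance. build_search_url and the wiki.example.com host are hypothetical.

```python
import urllib.parse


def build_search_url(base_url: str, start: int = 0, limit: int = 100) -> str:
    """Build a content-search URL for pages and blog posts in global spaces.

    Assumption: the instance supports CQL with the `space.type` field and
    honours the requested `limit` (servers may cap it, especially with expands).
    """
    query = {
        "cql": 'space.type = "global" and type in (page, blogpost)',
        "expand": "body.storage",
        "start": start,
        "limit": limit,
    }
    return base_url + "/rest/api/content/search?" + urllib.parse.urlencode(query)


if __name__ == "__main__":
    # Print the first two page URLs for a hypothetical host.
    for offset in (0, 100):
        print(build_search_url("https://wiki.example.com", start=offset))
```

If the search endpoint works on your version, the number of requests would depend on the total number of pages divided by the limit, rather than on the number of spaces.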
Your question is visible to everyone in the community, so someone else might see it and answer.
Hi @Barbara Szczesniak - Thanks for taking your time on this use case.
Hi @Bas Hulskamp - Let me also find someone who has experience with this, as it is a unique problem statement.
Hi @Aron Gombas _Midori_ - Could you take a moment to see whether you can assist on this topic, or route it to the right team who can help?
@Humashankar VJ Thanks very much for your assistance!
I'm looking forward to hearing from someone soon.
There are at least two opportunities for parallelization:
Using these two trivial tricks, you can massively accelerate the data collection here.
Note that you may run into rate limits if you go over-aggressive with parallelization. So it is a bit of experimentation and tuning.
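To make the tuning remark concrete, here is a minimal sketch of capping the number of requests in flight with asyncio.Semaphore, so the degree of parallelism becomes a single number to experiment with. gather_bounded, FakeSpaceFetcher and demo are hypothetical names, and the fake fetcher stands in for the real HTTP call so the example runs without a Confluence instance.

```python
import asyncio


async def gather_bounded(factories, max_concurrency=5):
    """Run coroutine factories concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(factory):
        async with semaphore:
            return await factory()

    # gather preserves the order of the inputs in its result list.
    return await asyncio.gather(*(run_one(f) for f in factories))


class FakeSpaceFetcher:
    """Stand-in for an HTTP client; records the peak number of concurrent calls."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def fetch(self, space_key):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return f"content of {space_key}"


def demo(max_concurrency=3):
    fetcher = FakeSpaceFetcher()
    keys = [f"SPACE{i}" for i in range(10)]
    # Bind each key as a default argument so every lambda keeps its own key.
    factories = [lambda k=k: fetcher.fetch(k) for k in keys]
    results = asyncio.run(gather_bounded(factories, max_concurrency))
    return results, fetcher.peak
```

Starting with a low max_concurrency and raising it until responses slow down or the server starts rejecting requests is one way to find a workable value.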
Thanks @Aron Gombas _Midori_ for the insight
Hi @Bas Hulskamp - Can you take in the points above and decide how you want to proceed?
Hi @Humashankar VJ
I'll see what I can do with this. Thanks for your insight @Aron Gombas _Midori_ !
Seems a bit like an AI answer by the way haha
?
@Humashankar VJ I'm not familiar with the people who are active on this forum. Can you perhaps @ mention someone that would know more about my question?
Sure @Bas Hulskamp - Let me work on that part.