How should I get all text from all spaces and their child content?

Hello - Looking for some insight on this problem statement:

How to retrieve all content from all global spaces in your Confluence instance

Problem Statement:

I want to get all the content from all spaces with type "global".

This currently takes more than 70 API calls. Each call returns at most 25 objects, and I first have to collect the space key of every space and then fetch each space's content in a for loop with calls to "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage".

If there is a way to reduce the number of API calls needed to achieve my goal, I would very much like to know. It currently takes about 37 seconds to execute, which is too slow for my use case.
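
One way to cut the number of round trips is to request larger pages. Below is a minimal sketch, assuming the endpoints accept `start` and `limit` query parameters and that the server may cap `limit` below the requested value. `fetch_page` is a hypothetical callable standing in for the actual HTTP request, so the sketch is not tied to any particular client library:

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int, int], dict], limit: int = 100) -> Iterator[dict]:
    """Yield every result by walking start/limit pages until no "next" link remains.

    fetch_page(start, limit) is a hypothetical helper that performs the request
    and returns the decoded JSON body with "results" and "_links" keys.
    """
    start = 0
    while True:
        page = fetch_page(start, limit)
        results = page.get("results", [])
        yield from results
        # Stop when the page is empty or the server reports no further page.
        if not results or "next" not in page.get("_links", {}):
            break
        # The server may return fewer items than requested, so advance by
        # the number of items actually received rather than by `limit`.
        start += len(results)
```

With a larger page size, the space listing needs fewer calls, although each space's content still requires at least one call of its own.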

Thanks in advance!

Reference ticket and latest code:

How can I get all text from all spaces and their c... (atlassian.com)

CC: @Bas Hulskamp

Warm Regards

4 answers

2 accepted

0 votes

Answer accepted

Hi @Humashankar VJ @Aron Gombas _Midori_ @Barbara Szczesniak

Thank you all for your assistance!

I have now solved my issue by using async functions to retrieve the data with the Confluence API. This cut the execution time from roughly 60 seconds to 3 seconds or less.

This is my code:

import asyncio
import json

import aiohttp

# Document is the document class used by the application. Its import and the
# ConfluenceService constructor (base_url, headers, WIKI_DATA_PATH) are not shown.


class ConfluenceService:

    async def get_all_global_space_keys(self) -> list[str]:
        url = self.base_url + "/rest/api/space"
        params = {"type": "global"}
        all_keys = []
        async with aiohttp.ClientSession() as session:
            # Follow the "next" link until the last page of spaces is reached.
            while True:
                async with session.get(url, headers=self.headers, params=params) as response:
                    data = await response.json()
                    for result in data["results"]:
                        all_keys.append(result["key"])
                    next_url = data['_links'].get('next', None)
                    if next_url:
                        url = self.base_url + next_url
                    else:
                        break
        return all_keys

    async def get_content_from_space(self, session, space_key) -> list[Document]:
        documents = []
        url = self.base_url + "/rest/api/space/" + space_key + "/content"
        params = {"expand": "body.storage"}
        async with session.get(url, headers=self.headers, params=params) as response:
            data = await response.json()
            for result in data["page"]["results"]:
                content_url = self.base_url + result["_links"]["webui"]
                # content: str = BeautifulSoup(result["body"]["storage"]["value"], 'html.parser').get_text()
                content: str = result["body"]["storage"]["value"]
                documents.append(Document(content, metadata={"source": content_url}).to_json())
        return documents

    async def get_all_content_from_all_global_spaces(self, space_keys: list[str]):
        documents: list[Document] = []
        async with aiohttp.ClientSession() as session:
            # One task per space; all requests run concurrently on a shared session.
            tasks = [self.get_content_from_space(session, key) for key in space_keys]
            results = await asyncio.gather(*tasks)
            for result in results:
                documents.extend(result)
        with open(self.WIKI_DATA_PATH, 'w', encoding='utf-8') as file:
            file.write(json.dumps(documents))

    async def get_all_async(self):
        space_keys = await self.get_all_global_space_keys()
        await self.get_all_content_from_all_global_spaces(space_keys)


c = ConfluenceService()
asyncio.run(c.get_all_async())
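
The commented-out BeautifulSoup line in the code above suggests that the end goal is plain text rather than storage-format markup. If an extra dependency is not wanted, a rough sketch using only the standard library could look like the following. `storage_to_text` and `_TextExtractor` are hypothetical names, and macro bodies or other Confluence-specific markup may need extra handling:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a markup string and ignores the tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Skip whitespace-only nodes between tags.
        text = data.strip()
        if text:
            self.parts.append(text)


def storage_to_text(storage_html: str) -> str:
    """Return the visible text of a storage-format body, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(storage_html)
    parser.close()
    return " ".join(parser.parts)
```

This could be applied to `result["body"]["storage"]["value"]` before the `Document` is built, at the cost of losing the page structure.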

Great to know, @Bas Hulskamp. Have a great one; I'll mark this as solved for future reference.

0 votes

Answer accepted

Hi @Barbara Szczesniak

Can you take a look at this use case and help @Bas Hulskamp?

Regards

The reference question is tagged with Data Center, but this question is tagged for Cloud.

I work with Cloud, and I do not use the REST API, so I'm not sure I can provide assistance.

My first questions are:

What are you using the content for, and why do you need all of it? This may affect the format you need the content in.
Why not export the spaces to the format you need?

Hi @Bas Hulskamp - Kindly answer Barbara's questions.

Good afternoon @Barbara Szczesniak ,

I use the REST API to query an on-premise instance of Confluence.

I use all the text of the pages and files as content for my RAG (Retrieval Augmented Generation) application. RAG is an AI technique, in case you didn't know.

It now takes 74 seconds to get all content of all spaces (as explained in the linked post), because I have to call the API at least once per space to get all the content with this endpoint: "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage". On top of that, I have to iterate over the spaces to get the space keys, with a limit of 25 per call, because I couldn't find an endpoint that returns the total number of spaces (see code in original post).

I would like to know if there is an alternative way that takes fewer API calls and is faster.

I'm not sure what the "export spaces" feature is that you mentioned... how does that work? Would that help me in this case?

@Barbara Szczesniak Could you please answer my question? Letting me know if you don't know the answer would also be great.

@Bas Hulskamp I don't know anything about API calls or on-premise Confluence.

According to this page (https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html), you can export pages or a space to Word, HTML, PDF, or XML formats. I just thought maybe you could use this output for whatever you were using the content for (your RAG application, I guess). I'm not sure if this format would suit your needs.

@Barbara Szczesniak I don't think so... I would have to read the files and then process the text. Having the HTML strings directly is faster and more effective, I think.

Do you know of someone else who could help me? Or is my request not possible with the current API? Note: I use the V1 API, because the Confluence instance isn't on the latest version (and I'm not the administrator).

Your question is visible to everyone in the community, so someone else might see it and answer.

Hi @Barbara Szczesniak - Thanks for taking the time to look at this use case.

Hi @Bas Hulskamp - Let me also find someone who has experience with this, as this is a unique problem statement.

Hi @Aron Gombas _Midori_ - Could you take some time to see whether you can assist on this topic, or route it to the right team who can help?

@Humashankar VJ Thanks very much for your assistance!
I'm looking forward to hearing from someone soon.

0 votes

There are at least two opportunities for parallelization:

When you get the first page of the spaces, you can immediately start collecting the contents in those spaces. In other words, you don't need to load all spaces before you can start collecting the pages in those.
You can, obviously, collect the pages in two separate spaces in parallel.

Using these two trivial tricks, you can massively accelerate the data collection here.

Note that you may run into rate limits if you parallelize too aggressively, so it takes a bit of experimentation and tuning.
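
Both tricks, together with a cap on concurrency to stay under rate limits, could be sketched roughly as follows. This is only an illustration under stated assumptions: `collect_all`, `space_key_pages`, and `fetch_space` are hypothetical names, with `space_key_pages` standing in for an async iterator that yields one list of space keys per page and `fetch_space` for a coroutine that returns the documents of one space:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def collect_all(
    space_key_pages: AsyncIterator[list[str]],
    fetch_space: Callable[[str], Awaitable[list]],
    max_concurrent: int = 5,
) -> list:
    """Fetch the content of every space while the space listing is still loading."""
    # The semaphore bounds how many space requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(key: str) -> list:
        async with semaphore:
            return await fetch_space(key)

    tasks = []
    # Schedule the content fetches for each page of keys as soon as it arrives,
    # instead of waiting for the complete list of spaces.
    async for keys in space_key_pages:
        for key in keys:
            tasks.append(asyncio.create_task(worker(key)))

    per_space = await asyncio.gather(*tasks)
    # Flatten the per-space lists into one list, in the order the keys arrived.
    return [doc for docs in per_space for doc in docs]
```

The value of `max_concurrent` is the knob to tune: start low and raise it until the server starts rejecting or throttling requests.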

Thanks @Aron Gombas _Midori_ for the insight

Hi @Bas Hulskamp - Can you take in the points above and see how you want to proceed?

Hi @Humashankar VJ

I'll see what I can do with this. Thanks for your insight @Aron Gombas _Midori_ !
Seems a bit like an AI answer by the way haha

0 votes

Anyone?

@Humashankar VJ I'm not familiar with the people who are active on this forum. Can you perhaps @ mention someone that would know more about my question?

Sure @Bas Hulskamp - Let me work on that part.
