How should I get all text from all spaces and their child content?

Humashankar VJ
Rising Star
Rising Stars are recognized for providing high-quality answers to other users. Rising Stars receive a certificate of achievement and are on the path to becoming Community Leaders.
March 7, 2024

Hello - Looking for some insight on this problem statement:

How to retrieve all content from all global spaces in your Confluence instance

Problem Statement:

I want to get all the content from all spaces with type "global".

This currently takes more than 70 API calls, because each call returns at most 25 objects and I first need to collect the space key of every space, then loop over the keys and call "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage" for each one.

If there is a way to reduce the number of API calls needed to achieve my goal, I would very much like to know. It currently takes about 37 seconds to execute, which is too slow for my use case.
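For readers following along, here is a rough sketch of where a call count like this comes from: one paged listing of the spaces plus one content request per space. The function name and the figure of 68 spaces are hypothetical; the post does not state the real number of spaces.

```python
import math


def estimate_calls(num_spaces: int, page_size: int = 25) -> int:
    """Listing calls (paged) plus one content call per space."""
    listing_calls = math.ceil(num_spaces / page_size)
    content_calls = num_spaces  # one /rest/api/space/{key}/content request each
    return listing_calls + content_calls


# Hypothetical instance with 68 global spaces (not a number from the post).
print(estimate_calls(68))  # 3 listing calls + 68 content calls = 71
```

The per-space term dominates, so a larger page size on the listing alone would save only a few calls.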

Thanks in advance!

 

Reference ticket and latest code:

How can I get all text from all spaces and their c... (atlassian.com)

CC: @Bas Hulskamp 

 

Warm Regards

4 answers

2 accepted

0 votes
Answer accepted
Bas Hulskamp March 27, 2024

Hi @Humashankar VJ @Aron Gombas _Midori_ @Barbara Szczesniak 

Thank you all for your assistance!

I have solved my issue now, by using async functions to retrieve the data with the Confluence API. This took the execution time down to <=3 seconds from the +/-60 seconds that it was before.

This is my code

 

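import asyncio
import json

import aiohttp

# Document is not defined in the post. It has to be imported from whichever
# library provides it (possibly LangChain, which is an assumption).


class ConfluenceService:
    # The constructor was not included in the post. It must set self.base_url,
    # self.headers and self.WIKI_DATA_PATH before the methods below are called.

    async def get_all_global_space_keys(self) -> list[str]:
        """Page through /rest/api/space and collect the key of every global space."""
        url = self.base_url + "/rest/api/space"
        params = {"type": "global"}
        all_keys = []
        async with aiohttp.ClientSession() as session:
            while True:
                async with session.get(url, headers=self.headers, params=params) as response:
                    data = await response.json()
                    for result in data["results"]:
                        all_keys.append(result["key"])
                    # Follow the relative "next" link until there are no more pages.
                    next_url = data['_links'].get('next', None)
                    if next_url:
                        url = self.base_url + next_url
                    else:
                        break
        return all_keys

    async def get_content_from_space(self, session, space_key) -> list[Document]:
        """Fetch the pages of one space, including their storage-format bodies."""
        documents = []
        url = self.base_url + "/rest/api/space/" + space_key + "/content"
        params = {"expand": "body.storage"}
        async with session.get(url, headers=self.headers, params=params) as response:
            data = await response.json()
            # Only the results returned under "page" in this single response are read.
            for result in data["page"]["results"]:
                content_url = self.base_url + result["_links"]["webui"]
                # content: str = BeautifulSoup(result["body"]["storage"]["value"], 'html.parser').get_text()
                content: str = result["body"]["storage"]["value"]
                # Stored in to_json() form so the list can be passed to json.dumps later.
                documents.append(Document(content, metadata={"source": content_url}).to_json())
        return documents

    async def get_all_content_from_all_global_spaces(self, space_keys: list[str]):
        """Fetch all spaces concurrently and write the combined result to disk."""
        documents: list[Document] = []
        async with aiohttp.ClientSession() as session:
            # One coroutine per space; gather runs them concurrently on one session.
            tasks = [self.get_content_from_space(session, key) for key in space_keys]
            results = await asyncio.gather(*tasks)
            for result in results:
                documents.extend(result)
        with open(self.WIKI_DATA_PATH, 'w', encoding='utf-8') as file:
            file.write(json.dumps(documents))

    async def get_all_async(self):
        space_keys = await self.get_all_global_space_keys()
        await self.get_all_content_from_all_global_spaces(space_keys)


c = ConfluenceService()
asyncio.run(c.get_all_async())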
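The code above keeps the raw storage-format markup and leaves the BeautifulSoup call commented out. If plain text is wanted without an extra dependency, a minimal sketch using only the standard library could look like the following. The names `_TextExtractor` and `storage_to_text` are made up for illustration, and the sketch makes no attempt to handle Confluence macros or CDATA sections.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a markup string and ignores the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def storage_to_text(storage_html: str) -> str:
    """Return the visible text of a markup string, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(storage_html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


sample = "<h1>Release notes</h1><p>Version <strong>2.0</strong> is out.</p>"
print(storage_to_text(sample))  # Release notes Version 2.0 is out.
```

Whether this or BeautifulSoup's `get_text()` gives better input for a RAG pipeline depends on the content, so it is worth comparing both on a few real pages.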
Humashankar VJ
March 27, 2024

Great to know, @Bas Hulskamp. Have a great one! Let me mark this as solved for future reference.

0 votes
Answer accepted
Humashankar VJ
March 25, 2024

Hi @Barbara Szczesniak 

Can you take a look at this use case and help @Bas Hulskamp?

Regards

 

Barbara Szczesniak
March 25, 2024

The reference question is tagged with Data Center, but this question is tagged for Cloud.

I work with Cloud, and I do not use the REST API, so I'm not sure I can provide assistance.

My first questions are:

  • What are you using the content for, and why do you need all of it? This may impact the format you need the content in.
  • Why not export the spaces to the format you need?
Humashankar VJ
March 25, 2024

Hi @Bas Hulskamp - Kindly answer Barbara's questions.

Bas Hulskamp March 25, 2024

Good afternoon @Barbara Szczesniak ,

I use the REST API to query an on-premise instance of Confluence.

All the text of pages and files I use as content for my RAG (Retrieval Augmented Generation) application. RAG is an AI principle if you didn't know.

It now takes 74 seconds to get all content of all spaces (as explained in the linked post), because I have to call the endpoint "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage" at least once per space. On top of that, I have to page through the spaces to get the space keys, with a limit of 25 per call, because I couldn't find an endpoint that returns the total number of spaces (see code in original post).
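The paging described here does not need the total count up front: each response carries a `_links.next` entry until the last page, which is the pattern the accepted answer in this thread relies on. Below is a minimal sketch of that loop. `collect_space_keys`, `fake_fetch` and `FAKE_PAGES` are illustrative names, and the canned pages stand in for the real server so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable


async def collect_space_keys(
    fetch_json: Callable[[str], Awaitable[dict]],
    first_path: str = "/rest/api/space?type=global",
) -> list[str]:
    """Follow `_links.next` until it is absent, collecting every space key."""
    keys: list[str] = []
    path = first_path
    while path:
        data = await fetch_json(path)
        keys.extend(item["key"] for item in data["results"])
        path = data.get("_links", {}).get("next")
    return keys


# Canned responses (hypothetical data) standing in for the Confluence server.
FAKE_PAGES = {
    "/rest/api/space?type=global": {
        "results": [{"key": "DOCS"}, {"key": "ENG"}],
        "_links": {"next": "/rest/api/space?type=global&start=2"},
    },
    "/rest/api/space?type=global&start=2": {
        "results": [{"key": "OPS"}],
        "_links": {},
    },
}


async def fake_fetch(path: str) -> dict:
    return FAKE_PAGES[path]


print(asyncio.run(collect_space_keys(fake_fetch)))  # ['DOCS', 'ENG', 'OPS']
```

In real use, `fetch_json` would wrap an HTTP GET against the base URL plus the given path.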

I would like to know if there is an alternative way that takes fewer API calls and is faster.

I'm not sure what the "export spaces" feature is that you mentioned... how does that work? Would that help me in this case?

Bas Hulskamp March 26, 2024

@Barbara Szczesniak Could you please answer my question? Letting me know if you don't know the answer would also be great.

Barbara Szczesniak
March 26, 2024

@Bas Hulskamp I don't know anything about API calls or on premise Confluence. 

According to this page (https://confluence.atlassian.com/doc/export-content-to-word-pdf-html-and-xml-139475.html ), you can export pages or a space to Word, HTML, PDF, or XML formats. I just thought maybe you could use this output for whatever you were using the content for—your RAG application, I guess. I'm not sure if this format would suit your needs.

Bas Hulskamp March 26, 2024

@Barbara Szczesniak I don't think so... I would have to read the files and then process the text... directly having the HTML strings is faster and more effective, I think.

Do you know of someone else that could help me? Or is my request not possible with the current API? Note: I use the V1 API, because the Confluence instance isn't on the latest version (and I'm not the administrator).
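One avenue that may be worth testing is the V1 content search endpoint with a CQL filter on space type, which would return pages from all global spaces through a single paged stream instead of one request per space. This is only a sketch that builds the request path. It assumes the instance's version exposes `/rest/api/content/search` and accepts the `space.type` CQL field; it has not been checked against this server, and the server may cap `limit` when bodies are expanded. `build_search_path` is a made-up helper name.

```python
from urllib.parse import urlencode


def build_search_path(limit: int = 100, start: int = 0) -> str:
    """Build a relative search path for pages in all global spaces."""
    query = {
        # CQL filter: pages only, in spaces of type "global" (assumed supported).
        "cql": 'space.type = "global" and type = page',
        "expand": "body.storage",
        "limit": limit,
        "start": start,
    }
    return "/rest/api/content/search?" + urlencode(query)


print(build_search_path(limit=50))
```

If the endpoint works on the instance, paging through it with `start` and `limit` (or the `next` link in the response) would replace both the space listing and the per-space content calls.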

Barbara Szczesniak
March 26, 2024

Your question is visible to everyone in the community, so someone else might see it and answer.  

Humashankar VJ
March 26, 2024

Hi @Barbara Szczesniak - Thanks for taking your time on this use case.

Hi @Bas Hulskamp - Let me also find someone who has exposure to this, as this is a unique problem statement.

Humashankar VJ
March 26, 2024

Hi @Aron Gombas _Midori_ - Could you take some time to assist with this topic, or route it to the right team who can help?

Bas Hulskamp March 26, 2024

@Humashankar VJ Thanks very much for your assistance! 
I'm looking forward to hearing from someone soon.

0 votes
Aron Gombas _Midori_
Community Leader
Community Leaders are connectors, ambassadors, and mentors. On the online community, they serve as thought leaders, product experts, and moderators.
March 26, 2024

There are at least two opportunities for parallelization:

  1. When you get the first page of the spaces, you can immediately start collecting the contents in those spaces. In other words, you don't need to load all spaces before you can start collecting the pages in those.
  2. You can, obviously, collect the pages in two separate spaces in parallel.

Using these two trivial tricks, you can massively accelerate the data collection here.
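The two points above can be sketched with `asyncio` as follows: content downloads are started as soon as each page of space keys arrives, and all of them run concurrently. This is an illustration under assumptions, not a tested client. `crawl_pipelined` and the two fake helpers are hypothetical names, and the fakes return canned data in place of real HTTP calls.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def crawl_pipelined(
    space_key_pages: AsyncIterator[list[str]],
    fetch_content: Callable[[str], Awaitable[list[str]]],
) -> list[str]:
    """Start per-space downloads while the space listing is still being paged."""
    tasks: list[asyncio.Task] = []
    async for keys in space_key_pages:
        # Schedule downloads for this page of keys immediately, before the
        # next page of the space listing is requested.
        tasks.extend(asyncio.create_task(fetch_content(key)) for key in keys)
    documents: list[str] = []
    for result in await asyncio.gather(*tasks):
        documents.extend(result)
    return documents


async def fake_key_pages() -> AsyncIterator[list[str]]:
    # Stand-in for the paged /rest/api/space listing (hypothetical keys).
    for page in (["DOCS", "ENG"], ["OPS"]):
        await asyncio.sleep(0)
        yield page


async def fake_fetch_content(key: str) -> list[str]:
    # Stand-in for one /rest/api/space/{key}/content request.
    await asyncio.sleep(0)
    return [f"{key}-page-1", f"{key}-page-2"]


print(asyncio.run(crawl_pipelined(fake_key_pages(), fake_fetch_content)))
```

`asyncio.gather` returns results in the order the tasks were created, so the output order follows the order in which the space keys were listed.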

Note that you may run into rate limits if you parallelize too aggressively, so it takes a bit of experimentation and tuning.
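One common way to keep parallel requests under a limit is to cap how many are in flight at once with `asyncio.Semaphore`. The sketch below shows the idea with a fake request function; `gather_limited`, `fake_fetch`, `stats` and the cap of 5 are illustrative choices, and a suitable cap for a given server has to be found by experiment.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_limited(
    keys: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> list[str]:
    """Run fetch(key) for every key with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(key: str) -> str:
        async with semaphore:  # wait here while the cap is reached
            return await fetch(key)

    return await asyncio.gather(*(guarded(key) for key in keys))


# Bookkeeping used only to observe the concurrency of the fake requests.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(key: str) -> str:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for network latency
    stats["in_flight"] -= 1
    return key.lower()


results = asyncio.run(
    gather_limited([f"S{i}" for i in range(20)], fake_fetch, max_concurrent=5)
)
print(len(results), stats["peak"])  # 20 5
```

Raising or lowering `max_concurrent` is then the single knob to tune against the server's limits.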

Humashankar VJ
March 26, 2024

Thanks @Aron Gombas _Midori_ for the insight

Hi @Bas Hulskamp - Can you take in the points above and see how you want to proceed?

Bas Hulskamp March 27, 2024

Hi @Humashankar VJ 

I'll see what I can do with this. Thanks for your insight @Aron Gombas _Midori_ !
Seems a bit like an AI answer by the way haha

Aron Gombas _Midori_
March 27, 2024

?

0 votes
Bas Hulskamp March 20, 2024

Anyone?

Bas Hulskamp March 25, 2024

@Humashankar VJ I'm not familiar with the people who are active on this forum. Can you perhaps @ mention someone that would know more about my question?

Humashankar VJ
March 25, 2024

Sure @Bas Hulskamp - Let me work on that part.
