How can I get all text from all spaces and their child content?

I want to scrape an entire Confluence 7.18.19 instance to analyze every bit of text in every space and its child content with my NLP application.

Is there a way to do this with the API? I am limited to the API unfortunately.

Thanks in advance!

Answer accepted

For future reference, refer to this post --> https://community.atlassian.com/t5/Confluence-questions/Re-How-Should-I-get-all-text-from-all-spaces-and-their-c/qaq-p/2653690/comment-id/303748#M303748

Hi, I hope you are doing well!

Yes, this use case is achievable with the API. Use the 'Get all spaces' endpoint of the Confluence REST API to obtain a list of all spaces in the instance (https://developer.atlassian.com/cloud/confluence/rest/api-group-space/#api-wiki-rest-api-space-get).

Happy to help further!

Thank you very much and have a fantastic day!
Warm regards

Good morning,

Thanks for your reply!

If I execute a GET request to this endpoint "https://wiki-acc.mycompany.com/rest/api/space", I get the following response:
(note that I replaced certain values with mock values)

{
    "results": [
        {
            "id": 8290324,
            "key": "spaceKey",
            "name": "spaceName",
            "type": "global",
            "_links": {
                "webui": "/display/mySpace",
                "self": "https://wiki-acc.mycompany.com/rest/api/space/mySpace"
            },
            "_expandable": {
                "metadata": "",
                "icon": "",
                "description": "",
                "retentionPolicy": "",
                "homepage": "/rest/api/content/7899603"
            }
        },
        .... more spaces ...
    ],
    "start": 0,
    "limit": 25,
    "size": 25,
    "_links": {
        "self": "https://wiki-acc.mycompany.com/rest/api/space",
        "next": "/rest/api/space?limit=25&start=25",
        "base": "https://wiki-acc.mycompany.com",
        "context": ""
    }
}

This does not resemble the example response from https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-space/#api-wiki-rest-api-space-get

The Confluence instance was updated to 7.19.19 today, maybe that has something to do with it?

Secondly, how am I supposed to get every bit of text from all child content of a space? Querying every single content endpoint seems inefficient to me. Is that the only way to achieve this?

Good morning @Humashankar VJ ,

I would really appreciate an answer to my question!

To clarify: I want to get all the content from all spaces with type "global".

This currently takes more than 70 API calls, because there is a limit of 25 objects per call and I first need to collect the "space key" of every space, then fetch each space's content in a for loop with calls to "https://wiki-acc.[mycompany].com/rest/api/space/[space_key]/content?expand=body.storage".

If there is a way to reduce the number of API calls needed to achieve my goal, I would very much like to know. It now takes about 37 seconds to execute, which is too slow for my use case.
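One way to cut the number of round trips is to request larger pages and follow the "next" link that appears in the response shown earlier. The sketch below assumes the server honours a "limit" larger than 25 (instances cap this value, and the cap may be lower when bodies are expanded). The names iter_results and get_json are illustrative and not part of the Confluence API; get_json stands for any callable that fetches a URL and returns the decoded JSON.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def iter_results(get_json: Callable[[str], dict], base_url: str, path: str,
                 params: Optional[dict] = None, limit: int = 100) -> Iterator[dict]:
    """Yield every item from a paginated Confluence-style listing.

    get_json takes a full URL and returns the decoded JSON body, for example
    lambda u: requests.get(u, auth=auth).json()
    """
    query = dict(params or {}, limit=limit)
    url = f"{base_url}{path}?{urlencode(query)}"
    while url:
        data = get_json(url)
        yield from data.get("results", [])
        # The response carries a relative "next" link until the last page.
        next_path = data.get("_links", {}).get("next")
        url = f"{base_url}{next_path}" if next_path else None
```

With 71 spaces and a limit of 100 honoured by the server, the space listing would need one call instead of three; the per-space content calls are unaffected.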

Thanks in advance!

@Humashankar VJ It has been a week... would you please answer my question?

Hi @Bas Hulskamp - Sure thing. I was sidetracked by a few other topics; I will get back to you on your clarification shortly.

Hi @Bas Hulskamp

You can try to optimize the process by using the expand parameter on the /rest/api/space endpoint to retrieve both space information and content in a single API call.

This should minimize the number of API calls needed to fetch content from all global spaces.

Try the expand=homepage,body.storage parameter to retrieve both the space homepage content and the content of child pages in a single API call:

import requests

# Placeholders: replace with your own instance URL and credentials.
base_url = 'https://wiki-acc.mycompany.com/rest/api'
username = 'username'
password = 'password'

# Retrieve the list of global spaces along with their content
global_spaces_url = f'{base_url}/space?type=global&expand=homepage,body.storage'
response = requests.get(global_spaces_url, auth=(username, password))
global_spaces = response.json().get('results', [])

# Iterate over global spaces
for space in global_spaces:
    space_key = space['key']

    # Extract text from space content
    space_content = space.get('homepage', {})  # Homepage content
    space_content_text = space_content.get('title', '') + '\n' + space_content.get('body', {}).get('storage', {}).get('value', '')
    print(space_content_text)

    # Retrieve child content (if any)
    child_pages = space.get('children', {}).get('page', [])
    for page in child_pages:
        page_content = page.get('body', {}).get('storage', {}).get('value', '')
        print(page_content)

Hi @Bas Hulskamp

Try the above and keep me posted; based on the result, we can also look at other options.

For example, asyncio and aiohttp can fetch the content of each space in parallel to speed up the process.

To streamline the process and reduce the number of API calls, the /rest/api/space?type=global endpoint retrieves all global spaces directly.
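As a rough illustration of the parallel idea, the sketch below uses concurrent.futures.ThreadPoolExecutor from the standard library rather than asyncio and aiohttp, so it needs no extra dependencies. fetch_space_content is a hypothetical callable standing in for whatever function performs the per-space request. Note that running requests in parallel shortens wall-clock time but does not reduce the number of API calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all_spaces(space_keys: Iterable[str],
                     fetch_space_content: Callable[[str], dict],
                     max_workers: int = 8) -> Dict[str, dict]:
    """Call fetch_space_content once per space key, several at a time."""
    keys = list(space_keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its result.
        return dict(zip(keys, pool.map(fetch_space_content, keys)))
```

A modest max_workers value is probably wise, since a shared Confluence instance may throttle or slow down under many simultaneous requests.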

Good morning @Humashankar VJ ,

Thanks for your reply!

I think there is a misunderstanding here... as I read your reply, it seems like you replied with almost the exact solution I already found. I can already get all the spaces with "https://wiki-acc.mycompany.com/rest/api/space?type=global", as per my first reply to this thread on February 22nd.

The parameter "expand=body.storage" makes no difference here; I found that it works as intended with at least these endpoints:
1. "https://wiki-acc.mycompany.com/rest/api/content/{contentId}?expand=body.storage",
2. "https://wiki-acc.mycompany.com/rest/api/content?expand=body.storage",
3. "https://wiki-acc.mycompany.com/rest/api/space/{spaceKey}/content?expand=body.storage"

As far as I know, I can't specify the "type=global" parameter for spaces with the above endpoints, so I currently use the third endpoint to get all text from every content object, using the space keys retrieved with https://wiki-acc.mycompany.com/rest/api/space?type=global.
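A possible way to combine the space-type filter with body expansion is the content search endpoint, /rest/api/content/search, which takes a CQL query. The sketch below only builds the request URL. It assumes that CQL on this instance supports the space.type field and that limit=100 is accepted, both of which should be checked against the running version. build_search_url is an illustrative helper name.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, limit: int = 100, start: int = 0) -> str:
    """Build a content-search URL for pages in global spaces, with bodies."""
    params = {
        # Assumption: the instance's CQL understands space.type.
        "cql": "type = page and space.type = global",
        "expand": "body.storage",
        "limit": limit,
        "start": start,
    }
    return f"{base_url}/rest/api/content/search?{urlencode(params)}"


print(build_search_url("https://wiki-acc.mycompany.com"))
```

If this works on the instance, the number of calls would depend on the total number of pages divided by the page size, not on the number of spaces.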

This is my code in Python 3.11.8:

    # Methods of a client class; requires "import requests" at module level.
    def get_all_global_space_keys(self) -> list:
        """Collect the key of every global space, following pagination links."""
        url = self.base_url + "/rest/api/space"
        params = {"type": "global"}
        all_keys = []
        while True:
            response = requests.get(url, headers=self.headers, auth=self.auth, params=params)
            data = response.json()
            for result in data["results"]:
                all_keys.append(result["key"])
            # "next" is a relative link and is absent on the last page.
            next_url = data["_links"].get("next")
            if next_url:
                url = self.base_url + next_url
            else:
                break
        return all_keys

    def get_all_content_from_all_global_spaces(self, space_keys: list):
        """Write the storage-format body of each space's pages to a text file."""
        params = {"expand": "body.storage"}
        all_text: str = ""
        for key in space_keys:
            url = self.base_url + "/rest/api/space/" + key + "/content"
            response = requests.get(url, headers=self.headers, auth=self.auth, params=params)
            data = response.json()
            # Only the first page of results per space is read here.
            for result in data["page"]["results"]:
                content_link = self.base_url + result["_links"]["webui"]
                content = result["body"]["storage"]["value"]
                all_text += content_link + " " + content + "\n"
        with open("../txt/all_text.txt", 'w', encoding='utf-8') as file:
            file.write(all_text)

As far as I can see, my code achieves the same goal as your previous suggestion. This still uses a lot of API calls if there are a lot of spaces (in my case at the moment 71 spaces).
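Since the goal is plain text for an NLP application, it may help to note that the "body.storage" value is markup, not plain text. The sketch below strips tags with the standard library's html.parser; storage_to_text is an illustrative name, and this is a minimal approach that may miss text held in macro bodies (for example CDATA sections).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of a markup string, ignoring all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Skip whitespace-only nodes between tags.
        if data.strip():
            self.chunks.append(data.strip())


def storage_to_text(storage_value: str) -> str:
    """Return the visible text of a storage-format body, joined by spaces."""
    parser = _TextExtractor()
    parser.feed(storage_value)
    parser.close()
    return " ".join(parser.chunks)
```

In the code above, this could be applied to result["body"]["storage"]["value"] before appending it to all_text.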

If I missed anything or if it seems like I misunderstood your suggestion, please clarify it in a reply.

Thanks in advance!

Hi @Bas Hulskamp, good morning!

Thanks for the additional note and the latest code. Let me do some more research on it.

In the meantime, I will also create a new thread on this use case in our community to get the viewpoint of other champions.

Warm Regards
