My goal is to use the Confluence API to get the content of a page, parse it, edit it, and update that same page with the edited content.
At first, I assumed Confluence's storage format was HTML. Based on that, my original plan was to use Python's BeautifulSoup module to parse and edit the content once I retrieved it from the Confluence API. I now know that the Confluence page storage format is "XHTML-based". I've tried to parse it with various BeautifulSoup parsers (lxml, xml, html.parser) but they all get caught on the standard-breaking macro elements like this:
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>
Is there a preferred method for parsing this?
I am having the same issue! If you found a solution, could you share it? Thanks!
I never found a parser that worked perfectly for Confluence. Ultimately, I was forced to edit Confluence pages using regular expressions.
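For anyone taking the regex route described above, here is a minimal sketch of editing the CDATA body of one named macro while leaving the rest of the storage format untouched. The helper name `replace_macro_body` and the pattern are my own assumptions for illustration, not part of any Confluence library. The pattern assumes the macro's plain-text body directly follows the opening tag and that the new body never contains `]]>`.

```python
import re

# Hypothetical pattern (an assumption for this sketch): matches a structured
# macro whose opening tag is immediately followed by a CDATA plain-text body.
MACRO_BODY = re.compile(
    r'(?P<prefix><ac:structured-macro\b[^>]*\bac:name="(?P<name>[^"]+)"[^>]*>'
    r'\s*<ac:plain-text-body><!\[CDATA\[)'
    r'(?P<body>.*?)'
    r'(?P<suffix>\]\]></ac:plain-text-body>)',
    re.DOTALL,
)


def replace_macro_body(storage, macro_name, transform):
    """Apply transform() to the CDATA body of every macro named macro_name."""

    def _sub(match):
        # Leave macros with a different name exactly as they were.
        if match.group("name") != macro_name:
            return match.group(0)
        new_body = transform(match.group("body"))
        return match.group("prefix") + new_body + match.group("suffix")

    return MACRO_BODY.sub(_sub, storage)


if __name__ == "__main__":
    sample = (
        r'<p>Intro</p><ac:structured-macro ac:name="unmigrated-wiki-markup" '
        r'ac:schema-version="1"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]>'
        r'</ac:plain-text-body></ac:structured-macro>'
    )
    print(replace_macro_body(sample, "unmigrated-wiki-markup",
                             lambda body: body.replace("CDRL", "SDRL")))
```

Because only the matched body is rewritten, everything outside the macro is returned byte for byte, which is the main attraction of regex here.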
I have had the same issue. Ultimately, I resorted to using an HTML parser (where that works), an XML parser (where that works), and regex for everything else.
Same issue here. I need to find a way to parse Confluence XHTML, including the macro notation.
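One approach that may help for reading the macro notation: wrap the storage fragment in a root element that declares the `ac:` and `ri:` prefixes, then hand it to a normal XML parser. The sketch below uses only the standard library. The namespace URIs are placeholders I made up for this example (the fragment returned by the API does not declare any), the function names are hypothetical, and only the `&nbsp;` entity is declared, so other HTML entities would still need handling.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions chosen for this sketch only.
NS = {"ac": "urn:example:confluence-ac", "ri": "urn:example:confluence-ri"}


def parse_storage(fragment):
    """Wrap a storage-format fragment so a strict XML parser accepts it."""
    wrapped = (
        # Declare &nbsp; so the parser does not reject it as undefined.
        '<!DOCTYPE root [<!ENTITY nbsp "&#160;">]>'
        f'<root xmlns:ac="{NS["ac"]}" xmlns:ri="{NS["ri"]}">{fragment}</root>'
    )
    return ET.fromstring(wrapped)


def macro_bodies(root):
    """Return (macro name, plain-text body) pairs for every structured macro."""
    pairs = []
    for macro in root.iter(f'{{{NS["ac"]}}}structured-macro'):
        name = macro.get(f'{{{NS["ac"]}}}name')
        body = macro.find("ac:plain-text-body", NS)
        pairs.append((name, body.text if body is not None else None))
    return pairs


if __name__ == "__main__":
    sample = (
        r'<p>Intro&nbsp;text</p>'
        r'<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1">'
        r'<ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body>'
        r'</ac:structured-macro>'
    )
    print(macro_bodies(parse_storage(sample)))
```

Note that ElementTree reads CDATA as ordinary text and will not write the CDATA markers back out when serializing, so I would treat this as a way to inspect a page rather than to round-trip it.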
Hey Bill, thanks for your response.
Alas, while these two libraries are helpful wrappers for the Confluence API, as far as I can tell, neither has the ability to parse the XHTML that the API returns as the content of a page.
Copilot suggested the following, which I can confirm worked:
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

# Replace these variables with your Confluence details
confluence_url = 'https://<my site>.atlassian.net/wiki/rest/api/content/'
page_id = '<my page id>'  # Replace with your page ID
username = '<my username>'
api_token = '<my token>'


def get_confluence_page_text(page_id):
    url = f"{confluence_url}{page_id}?expand=body.view"
    auth = HTTPBasicAuth(username, api_token)
    response = requests.get(url, auth=auth)
    if response.status_code == 200:
        page_data = response.json()
        page_html = page_data['body']['view']['value']
        # Use BeautifulSoup to extract text from HTML
        soup = BeautifulSoup(page_html, 'html.parser')
        page_text = soup.get_text()
        return page_text
    else:
        raise Exception(f"Failed to fetch page: {response.status_code} - {response.text}")


if __name__ == "__main__":
    try:
        page_text = get_confluence_page_text(page_id)
        print(page_text)
    except Exception as e:
        print(e)
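The snippet above reads the rendered `body.view`, which covers extracting text. For the round trip in the original question (fetch, edit, update the same page), a sketch of the update payload follows. It assumes the page was fetched from the same REST content endpoint with `expand=body.storage,version`, and that the update is sent as a PUT to that URL with the version number incremented. The function name `build_update_payload` is hypothetical.

```python
def build_update_payload(page, new_storage):
    """Build the JSON body for updating a page with edited storage-format content.

    `page` is assumed to be the decoded JSON of a GET on the content endpoint
    with expand=body.storage,version.
    """
    return {
        "id": page["id"],
        "type": page["type"],
        "title": page["title"],
        # Each update has to carry the next version number.
        "version": {"number": page["version"]["number"] + 1},
        "body": {
            "storage": {
                "value": new_storage,
                "representation": "storage",
            }
        },
    }


if __name__ == "__main__":
    # Stand-in for a fetched page; the values are made up for illustration.
    fetched = {
        "id": "123",
        "type": "page",
        "title": "Example page",
        "version": {"number": 4},
        "body": {"storage": {"value": "<p>old</p>", "representation": "storage"}},
    }
    print(build_update_payload(fetched, "<p>new</p>"))
```

The payload could then be sent with something like `requests.put(f"{confluence_url}{page_id}", json=payload, auth=auth)`, reusing the variables from the snippet above.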
This might not be the solution you want; however, if you are interested in converting Confluence wiki text to .docx format (while preserving the wiki formatting), you can try the jirawiki2docx Python library.
https://pypi.org/project/jirawiki2docx/
@Woodson Miles I am using the ConfluencePS PowerShell module to achieve this. It's quite simple to fetch and upload page contents using the Get-ConfluencePage and Set-ConfluencePage commands.
You can get this module from the PowerShell Gallery.