Confluence CLI - find pages which contain specific regex

Nils Leger September 18, 2019

Hello,

In our space, some special characters are not displayed properly because we uploaded many HTML files via the REST API and had encoding issues.

Looking at the HTML-Markup, I could see that Confluence saves these characters like that:

&#XXX;

XXX can be any number, such as 133 or 128. In the editor the characters look like this: …
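
For reference, numbers in the 128 to 159 range are typical of text that was originally Windows-1252: in that code page, byte 133 is the ellipsis and byte 128 is the euro sign. The Python sketch below illustrates that mapping. The Windows-1252 origin is an assumption about these particular pages, and the function name is hypothetical.

```python
import html
import re

# Matches decimal numeric character references such as &#133;
NUMERIC_ENTITY = re.compile(r"&#\d+;")


def decode_numeric_entities(text):
    """Replace each &#NNN; with the character an HTML5 parser would display.

    html.unescape follows the HTML5 rules, which map references in the
    128-159 range to the corresponding Windows-1252 characters.
    """
    return NUMERIC_ENTITY.sub(lambda m: html.unescape(m.group(0)), text)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from a real page.
    print(decode_numeric_entities("Wait&#133; that costs 5 &#128;"))
    # The same characters, decoded directly from Windows-1252 bytes.
    print(bytes([133, 128]).decode("cp1252"))
```

Seeing which character each number stands for can help decide what the correct replacement on a page should be.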

 

Now I want to find all pages containing these special characters so I can fix them manually. I tried to do it with the Confluence CLI as described here, but it doesn't work as expected. I guess one reason could be the special characters I use in the regex (.*&#.*;.*):

--action getPageList --space SPACE --regex2 ".*&#.*;.*"

Do you have any idea why it isn't working or any other suggestions to solve my problem?

 

Thank you in advance and best regards,

Nils

2 answers

1 accepted

1 vote
Answer accepted
Deleted user September 20, 2019

Hi @Nils Leger ,

To find pages whose content contains "&#" using regex2, you need to know how &# is represented in the storage format and then use that value in the action.

Please see the action below for reference when the content contains &#:

--action getPageList --space SU --regex2 ".*&#.*" --debug

Please go through the How to Get Confluence Storage Format page and see the screenshot below of the storage format of a page.
[Screenshot: Snag_c50e82a.png]

We have opened a support request in our support portal, https://bobswift.atlassian.net/servicedesk/customer/portal/1/SUPPORT-3008, and made you the reporter. Please let us know if you have any questions.

Regards,
Kishore Kumar Gangavath.

0 votes
Michael Kuhl _Appfire_
September 20, 2019

Hi @Nils Leger - This regex may work better for finding the pages with unwanted HTML entities: "&#\d{3};" (without the quotes).
That will find every three-digit numeric entity, such as &#133; or &#128;. However, you may only want to find certain HTML entities. In that case, you can use alternation.
For example, to filter for only &#1;, &#22; or &#31;, you would use "&#(1|22|31);" (again without the quotes). Square brackets would not work here, because "[1|22|31]" is a character class that matches only a single character.

To avoid fixing the pages manually, you could run the storePage action on the list of pages to replace those entities with a space, like so:

--action runFromPageList --space ASPACEKEY --regex2 "&#(1|22|31);" --common "-a storePage --id @pageId@ --findReplaceRegex \"&#(1|22|31);: \" "

Be sure to test carefully!  Maybe make a test space with copies of some of the problem pages to test on before you do this for real.
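
One way to test without touching Confluence at all is to try the patterns on a copied snippet of page markup first. The sketch below uses Python's re module purely as a stand-in: the CLI evaluates regexes with its own engine, so edge cases may behave differently. The sample text and the function names are made up for illustration. It also shows why alternation needs a group such as (1|22|31): a bracketed class like [1|22|31] matches only a single character.

```python
import re

# Hypothetical snippet of page markup containing numeric entities.
SAMPLE = "Total: 5 &#128; &#133; see &#22; and &#1; and &#31; here"

THREE_DIGIT = re.compile(r"&#\d{3};")      # exactly three digits
ANY_NUMERIC = re.compile(r"&#\d+;")        # one or more digits
ONLY_SOME = re.compile(r"&#(?:1|22|31);")  # alternation inside a group
CHAR_CLASS = re.compile(r"&#[1|22|31];")   # class: one of '1', '2', '3', '|'


def find_entities(pattern, text):
    """Return every substring of text that the pattern matches."""
    return pattern.findall(text)


def replace_entities(pattern, text, replacement=" "):
    """Return text with every match replaced (a space by default)."""
    return pattern.sub(replacement, text)


if __name__ == "__main__":
    for name, pattern in [
        ("three digits", THREE_DIGIT),
        ("any numeric", ANY_NUMERIC),
        ("grouped alternation", ONLY_SOME),
        ("character class", CHAR_CLASS),
    ]:
        print(name, find_entities(pattern, SAMPLE))
    print(replace_entities(ONLY_SOME, SAMPLE))
```

Comparing the match lists before running storePage makes it easier to spot a pattern that matches more, or less, than intended.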
