Uploading HTML via REST API through Python script gives 400 error for some HTML documents

absciexbuild June 8, 2017

Hi,

I have a Python script that uses the Confluence REST API to download a page, make changes to the HTML, and re-upload the page.

My current script works fine if I replace the HTML with valid HTML that I create myself. The code programmatically updates the version number and PUTs the page to the server.

But now I want to edit the existing page programmatically. I download the HTML, patch a new row into an HTML table, and try to re-upload; this fails with a 400 error. I also tried a plain round trip (downloading the page HTML and uploading it unchanged), and this fails as well. Downloading and uploading just <p>Hello World</p> works great: it replaces the page content as expected.

I am looking for guidance on how to make this work. I noticed a lot of <link> tags (wherever there is an @user reference) and tried removing those before re-uploading, but that did not work either.

Best,

Friedrich Brunzema

3 answers

1 accepted

3 votes
Answer accepted
absciexbuild June 9, 2017
# coding: utf-8
import argparse
import getpass
import datetime
import json
import keyring
import requests
import lxml.html

# -----------------------------------------------------------------------------
# Globals

BASE_URL = "http://your-server/rest/api/content"
VIEW_URL = "http://your-server/pages/viewpage.action?pageId="
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"


def pprint(data):
    # Debug helper: dump a JSON-serializable object in readable form
    print(json.dumps(
        data,
        sort_keys=True,
        indent=4,
        separators=(', ', ' : ')))


def get_page_ancestors(auth, pageid):
    # Get basic page information plus the ancestors property

    url = '{base}/{pageid}?expand=ancestors'.format(
        base=BASE_URL,
        pageid=pageid)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()['ancestors']


def get_page_info(auth, page_id):
    url = '{base}/{page_id}'.format(
        base=BASE_URL,
        page_id=page_id)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()


def convert_db_to_view(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/view'

    data2 = {
        'value': html,
        'representation': 'storage'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def convert_view_to_db(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/storage'

    data2 = {
        'value': html,
        'representation': 'editor'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def write_data(auth, html, page_id):
    info = get_page_info(auth, page_id)

    ver = int(info['version']['number']) + 1

    ancestors = get_page_ancestors(auth, page_id)

    anc = ancestors[-1]
    del anc['_links']
    del anc['_expandable']
    del anc['extensions']

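    # NOTE: this hardcodes the page title for the original use case; remove
    # this line to keep the page's existing title.
    info['title'] = "Team City Change Log"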

    data = {
        'id': str(page_id),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage':
                {
                    'representation': 'storage',
                    'value': html,
                }
        }
    }

    data = json.dumps(data)

    url = '{base}/{page_id}'.format(base=BASE_URL, page_id=page_id)

    our_headers = {'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}

    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers=our_headers
    )

    r.raise_for_status()

    print("Wrote '%s' version %d" % (info['title'], ver))
    print("URL: %s%d" % (VIEW_URL, page_id))

    return ""


def read_data(auth, page_id):
    url = '{base}/{page_id}?expand=body.storage'.format(base=BASE_URL, page_id=page_id)
    r = requests.get(
        url,
        auth=auth,
        headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}
    )

    r.raise_for_status()

    return r


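def patch_html(auth, options):
    # Fetch the page and pull out the storage-format body
    page_json = read_data(auth, options.pageid).json()
    html_storage_txt = page_json['body']['storage']['value']

    # Convert storage format to view HTML, which is what gets patched
    html_display_txt = convert_db_to_view(auth, html_storage_txt)['value']

    # PATCH
    # Custom patching of the view HTML goes here; as posted, it is unchanged.
    new_view_string = html_display_txt

    # Convert the patched HTML back to storage format before uploading
    return convert_view_to_db(auth, new_view_string)['value']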


def get_login(username=None):
    if username is None:
        username = getpass.getuser()

    password = keyring.get_password('confluence_script', username)

    if password is None:
        password = getpass.getpass()
        keyring.set_password('confluence_script', username, password)

    return username, password


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-u",
        "--user",
        default=getpass.getuser(),
        help="Specify the username to log into Confluence")

    parser.add_argument(
        "pageid",
        type=int,
        help="Specify the Confluence page id to overwrite")

    options = parser.parse_args()

    auth = get_login(options.user)

    html = patch_html(auth, options)
    html = html.replace('&Acirc;', '')
    write_data(auth, html, options.pageid)
    return

if __name__ == "__main__": main()
Mike Martino April 24, 2019

I hope you get this comment @absciexbuild. This was a HUGE help. This basically just worked "out of the box" for me. Thank you so much for taking the time to post this back.

If any Confluence Python admins are listening here, this is definitely something that should be added to the examples in the Confluence Python API. I was looking to import and export an entire Space programmatically to another instance of Confluence, and this example does approximately 95% of that work.

Thanks again @absciexbuild.

Friedrich Brunzema April 24, 2019

Glad this was useful to you!

Friedrich Brunzema April 24, 2019

I also have a version in C# (basically the Python code transcribed). If anyone needs it, please contact me.

0 votes
William Wong July 13, 2018

I had the same problem, but solved it in a different manner. I found that I wasn't providing a title, which seems to be required to update page content.

I used JavaScript instead of Python so here's some pseudo code:

// Get pageJSON with GET and select the page content from JSON

var values = pageJSON.body.storage.value;

// Manipulate the content

var res = values.replace(div, div + content);

// Increment the version

var nextVersion = Number(AJS.Meta.get('page-version')) + 1;

// Create JSON object to update page content - note title is specified.

var newJSON = { "version": { "number": nextVersion }, "title": getTitle(), "type": "page", "body": { "storage": { "value": res, "representation": "storage" } } }

...

PUT request afterwards

...

Note: the page data includes macros and I didn't have to convert from storage to view
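For readers following the Python code elsewhere in this thread, the same idea can be sketched in Python. This is only an illustration of the approach described above, not William's code: `build_update_payload` is a hypothetical helper, the sample `page` dict is made up, and it assumes the page was fetched with something like `?expand=body.storage,version` so that the title, version, and storage body are all present.

```python
import json


def build_update_payload(page_json, new_storage_value):
    """Build the body for a PUT to /rest/api/content/{id} from a fetched page."""
    return {
        "id": page_json["id"],
        "type": page_json["type"],
        # The title is carried over explicitly; leaving it out was the
        # cause of the failed update described above.
        "title": page_json["title"],
        # The new version must be exactly one higher than the current one.
        "version": {"number": int(page_json["version"]["number"]) + 1},
        "body": {
            "storage": {
                "value": new_storage_value,
                "representation": "storage",
            }
        },
    }


# Hypothetical page data, shaped like the JSON returned by the GET request.
page = {
    "id": "12345",
    "type": "page",
    "title": "Release Notes",
    "version": {"number": 7},
    "body": {"storage": {"value": "<p>old</p>", "representation": "storage"}},
}

payload = build_update_payload(page, page["body"]["storage"]["value"] + "<p>new</p>")
print(json.dumps(payload, indent=2))
```

The resulting dict can be serialized with `json.dumps` and sent as the body of the PUT request, in the same way `write_data` does in the accepted answer.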

Aashish Ranjan Chaubey August 23, 2018

Hi All,

I am required to migrate from SharePoint to Confluence, and to accomplish this I have written a small C# script to convert my .doc files to HTML.

Since I have a huge number of HTML files, I am looking for a way to upload my HTML documents to Confluence programmatically.

Thanks in advance for quick help.

0 votes
Stephen Deutsch
June 9, 2017

Are you working with HTML, or the XHTML Confluence Storage Format? I think that Confluence will only accept its own storage format (returned when using ?expand=body.storage).

The hello world example just happens to be both valid HTML and storage format, but it's not always the case. Tables are the same in both, though, so that shouldn't cause an issue necessarily.

Confluence is usually pretty good about returning information about the error, so have you checked the message that you get back with the 400? You could try using the REST API Browser for debugging.
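A minimal sketch of checking that message from a Python script follows. `describe_error` is a hypothetical helper, and the sample error body is made up; it assumes the server returns a JSON object with a `message` field on failure, and falls back to the raw text when the body is not JSON.

```python
import json


def describe_error(status_code, body_text):
    """Return a one-line summary of a failed REST response."""
    try:
        body = json.loads(body_text)
    except ValueError:
        # Not JSON (for example an HTML error page from a proxy):
        # show the start of the raw text instead.
        return "HTTP %d: %s" % (status_code, body_text.strip()[:200])
    if not isinstance(body, dict):
        return "HTTP %d: %s" % (status_code, body_text.strip()[:200])
    return "HTTP %d: %s" % (status_code, body.get("message", "(no message field)"))


# Hypothetical error body, for illustration only.
sample = '{"statusCode": 400, "message": "Error parsing xhtml"}'
print(describe_error(400, sample))
```

With the `requests` library, this could be called as `describe_error(r.status_code, r.text)` just before `r.raise_for_status()`, so the server's explanation is printed before the exception is raised.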

Friedrich Brunzema June 9, 2017

Hi,

I actually found the answer. So hard. Sigh. Editing a Confluence page programmatically involves a couple of steps:

1. Read the data: [url]?expand=body.storage
2. Load the text JSON response into a JSON object (json.loads())
3. Extract the HTML: json_object['body']['storage']['value']
4. Convert the returned page from storage to 'view' with a POST to /rest/api/contentbody/convert/view, using {'representation': 'storage', 'value': html} as the data. This returns JSON containing HTML sanitized for view.
5. Get the HTML text using display_json['value'].

You can now mess with the HTML
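As one illustration of this patching step (not the original poster's code), the sketch below appends a row to a table using only the standard library. `append_table_row` is a hypothetical helper and the sample HTML is made up; it uses plain string insertion, so it assumes the target is the last table in the page. A real HTML parser such as lxml would be more robust.

```python
from html import escape


def append_table_row(html, cells):
    """Insert a new <tr> before the last </tbody>, or, if there is no
    </tbody> at all, before the last </table>."""
    row = "<tr>" + "".join("<td>%s</td>" % escape(c) for c in cells) + "</tr>"
    for closing_tag in ("</tbody>", "</table>"):
        pos = html.rfind(closing_tag)
        if pos != -1:
            return html[:pos] + row + html[pos:]
    raise ValueError("no table found in HTML")


# Made-up view HTML for illustration.
view_html = "<table><tbody><tr><td>1.0</td><td>Initial build</td></tr></tbody></table>"
patched = append_table_row(view_html, ["1.1", "Fixed <b> handling"])
print(patched)
```

Cell text is escaped so that characters such as `<` in the new content do not break the markup that is sent back for conversion.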

Next you have to convert it back to storage format with a POST to /rest/api/contentbody/convert/storage, using {'representation': 'editor', 'value': html} as the data.

One of the steps caused a stray character, &Acirc;, to sneak in, which I replaced with nothing.

Then you do the upload, making sure to increase the version number of the page.

I will post some code below.

codycuellar September 14, 2017

The character appears to show up when converting back to storage format when there is an &nbsp; space character; it becomes '&Acirc;&nbsp;'. At least that's what I've found, because it doesn't show up on every page.

I also get strange behavior: every time I publish certain pages and fetch them again, there is an additional newline in the HTML between certain blocks. I have to regex-replace these with single newlines; otherwise there will be an absurd amount of space between HTML blocks after a few publishes.
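Both workarounds can be sketched in a few lines of Python. The first part shows one plausible origin of the stray character: a non-breaking space encoded as UTF-8 and then decoded as Latin-1 turns into "Â" followed by a non-breaking space. Whether that is what actually happens inside Confluence is an assumption. `clean_storage_html` is a hypothetical helper, and note that it would also remove a legitimate &Acirc; entity if a page contained one.

```python
import re

# U+00A0 (non-breaking space) is the two bytes C2 A0 in UTF-8. Read back as
# Latin-1, those bytes become U+00C2 ("Â") followed by U+00A0.
mojibake = "\u00a0".encode("utf-8").decode("latin-1")


def clean_storage_html(html):
    """Drop stray &Acirc; entities and collapse runs of blank lines."""
    html = html.replace("&Acirc;", "")
    # Two or more consecutive line endings (each optionally preceded by
    # spaces or tabs) are collapsed into a single newline.
    return re.sub(r"(?:[ \t]*\r?\n){2,}", "\n", html)


# Made-up storage HTML for illustration.
dirty = "<p>a&Acirc;&nbsp;b</p>\n\n\n<p>c</p>"
print(clean_storage_html(dirty))
```

Running the cleanup just before the upload keeps the extra newlines from accumulating across repeated publishes.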
