Uploading HTML via REST API through Python script gives 400 error for some HTML documents

Hi,

I have a Python script that uses the Confluence REST API to download a page, make changes to the HTML, and re-upload the page.

My current script works fine if I replace the HTML with valid HTML that I create myself. The code programmatically increments the version and PUTs the page to the server.

But now I want to edit the page programmatically. I download the html, patch a new row into an HTML table and want to re-upload -- this fails with a 400 error. I also tried just round-tripping -- downloading the page html, uploading it -- this fails as well. Downloading and uploading just a <p>Hello World</p> works great -- it replaces the page content as expected.

I am looking for guidance on how to make this work. I notice a lot of <link> tags (where an @user reference is) - and tried removing those before re-upload but that did not work either.

Best,

Friedrich Brunzema

3 answers

0 vote
Stephen Deutsch Community Champion Jun 09, 2017

Are you working with HTML, or the XHTML-based Confluence Storage Format? I think that Confluence will only accept its own storage format (returned when using ?expand=body.storage).

The hello world example just happens to be both valid HTML and storage format, but it's not always the case. Tables are the same in both, though, so that shouldn't cause an issue necessarily.

Confluence is usually pretty good about returning information about the error, so have you checked the message that you get back with the 400? You could try using the REST API Browser for debugging.
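
To make the error message visible from a script, something like the following could be called before raise_for_status(). This is only a sketch for Python 3: describe_error is a made-up helper name, and it assumes the 400 response body is JSON with a "message" field, which may not hold for every Confluence version or proxy setup.

```python
import json


def describe_error(status_code, body_text):
    """Return a one-line summary of a failed REST response."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Body is not JSON (for example an HTML error page from a proxy)
        return "HTTP %d: %s" % (status_code, body_text.strip()[:200])
    # Assumption: the error payload is a JSON object carrying a "message" field
    message = payload.get("message") if isinstance(payload, dict) else None
    return "HTTP %d: %s" % (status_code, message or body_text.strip()[:200])
```

With requests, this would be used as print(describe_error(r.status_code, r.text)) whenever r.status_code is 400 or above.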

Hi,

I actually found the answer. So hard. Sigh.  So editing a Confluence page programmatically involves a couple of steps:

1.  read the data: [url]?expand=body.storage
2.  load the JSON response text into a JSON object (json.loads())
3.  extract the storage-format html - json_object['body']['storage']['value']
4.  convert the returned page from storage to 'view' with a POST to /rest/api/contentbody/convert/view, using {'representation': 'storage', 'value': html} as the data -- this returns JSON containing HTML sanitized for view
5.  extract the text using display_json['value']

You can now mess with the HTML

Next you have to convert it back to storage format using a POST to /rest/api/contentbody/convert/storage with {'representation': 'editor', 'value': html}

One of the steps caused a stray character, &Acirc;, to sneak in -- which I replaced with nothing.

Then you do the upload, making sure to increase the version number of the page.
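
The conversion steps above can be sketched as one small function. This is an illustration for Python 3, not the script posted in this thread: round_trip, post and edit are hypothetical names, and the HTTP call is passed in as a callable so the flow can be tried without a server.

```python
def round_trip(post, base_url, storage_html, edit):
    """storage -> view -> edit -> storage, using an injected `post` callable.

    `post(url, payload)` is assumed to send `payload` as JSON and return the
    decoded JSON response as a dict. `edit` takes and returns an HTML string.
    """
    convert = base_url.rstrip("/") + "/rest/api/contentbody/convert/"
    # Storage format in, view HTML out
    view = post(convert + "view",
                {"value": storage_html, "representation": "storage"})
    edited = edit(view["value"])
    # Edited HTML is submitted as 'editor' representation to get storage back
    storage = post(convert + "storage",
                   {"value": edited, "representation": "editor"})
    return storage["value"]
```

With requests, post could be a thin wrapper around requests.post(url, json=payload, auth=auth) that calls raise_for_status() and returns r.json().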

I will post some code below.

The character appears to show up when converting back to storage format when there's an &nbsp; (non-breaking space) character: it becomes '&Acirc;&nbsp;'. At least that's what I've found, because it doesn't show up on every page.
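
One plausible explanation (an assumption, nobody in this thread confirmed it) is an encoding mix-up: a non-breaking space is the two bytes C2 A0 in UTF-8, and if those bytes get read as Latin-1 they turn into "Â" followed by a non-breaking space. The Python 3 sketch below shows that effect and a narrower cleanup than deleting every &Acirc;; strip_stray_acirc is a made-up helper name.

```python
NBSP = "\u00a0"

# UTF-8 encodes U+00A0 as bytes C2 A0; decoded as Latin-1 that is "Â" + NBSP
mojibake = NBSP.encode("utf-8").decode("latin-1")


def strip_stray_acirc(html):
    """Drop a stray 'Â' (literal or &Acirc;) only when it directly precedes
    a non-breaking space (literal or &nbsp;)."""
    for bad in ("\u00c2", "&Acirc;"):
        for space in (NBSP, "&nbsp;"):
            html = html.replace(bad + space, space)
    return html
```

Unlike a blanket replace, this leaves a legitimate "Â" in the page text alone.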

I also get strange behavior where every time I publish certain pages and fetch them again, there's an additional newline in the HTML between certain blocks, which I have to regex-replace with single newlines. Otherwise there will be an absurd amount of space between HTML blocks after a few publishes.
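
The newline cleanup could look roughly like this in Python 3. collapse_blank_lines is a hypothetical helper name, and the pattern is only a guess at what "regex replace with single newlines" means here.

```python
import re


def collapse_blank_lines(html):
    """Replace every run of two or more newlines with a single newline."""
    return re.sub(r"\n{2,}", "\n", html)
```

Note that this also collapses blank lines inside preformatted text or code macro bodies, so it may need narrowing for pages that contain those.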

# coding: utf-8
import argparse
import getpass
import datetime
import json
import keyring
import requests
import lxml.html

# -----------------------------------------------------------------------------
# Globals

BASE_URL = "http://your-server/rest/api/content"
VIEW_URL = "http://your-server/pages/viewpage.action?pageId="
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"


def pprint(data):
    # Pretty-print a JSON-serializable object for debugging
    print(json.dumps(
        data,
        sort_keys=True,
        indent=4,
        separators=(', ', ' : ')))


def get_page_ancestors(auth, pageid):
    # Get basic page information plus the ancestors property

    url = '{base}/{pageid}?expand=ancestors'.format(
        base=BASE_URL,
        pageid=pageid)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()['ancestors']


def get_page_info(auth, page_id):
    url = '{base}/{page_id}'.format(
        base=BASE_URL,
        page_id=page_id)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()


def convert_db_to_view(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/view'

    data2 = {
        'value': html,
        'representation': 'storage'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def convert_view_to_db(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/storage'

    data2 = {
        'value': html,
        'representation': 'editor'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def write_data(auth, html, page_id):
    info = get_page_info(auth, page_id)

    ver = int(info['version']['number']) + 1

    ancestors = get_page_ancestors(auth, page_id)

    # Use the direct parent and strip read-only keys; pop() tolerates missing keys
    anc = ancestors[-1]
    for key in ('_links', '_expandable', 'extensions'):
        anc.pop(key, None)

    info['title'] = "Team City Change Log"

    data = {
        'id': str(page_id),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage':
                {
                    'representation': 'storage',
                    # Pass the text through as-is; str() can fail on non-ASCII text
                    'value': html,
                }
        }
    }

    data = json.dumps(data)

    url = '{base}/{page_id}'.format(base=BASE_URL, page_id=page_id)

    our_headers = {'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}

    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers=our_headers
    )

    r.raise_for_status()

    print("Wrote '%s' version %d" % (info['title'], ver))
    print("URL: %s%d" % (VIEW_URL, page_id))

    return ""


def read_data(auth, page_id):
    url = '{base}/{page_id}?expand=body.storage'.format(base=BASE_URL, page_id=page_id)
    r = requests.get(
        url,
        auth=auth,
        headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}
    )

    r.raise_for_status()

    return r


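def patch_html(auth, options):
    json_text = read_data(auth, options.pageid).text
    json2 = json.loads(json_text)
    html_storage_txt = json2['body']['storage']['value']
    html_display_json = convert_db_to_view(auth, html_storage_txt)
    # Keep the view HTML as text rather than encoding it to UTF-8 bytes
    html_display_txt = html_display_json['value']

    # PATCH
    # Replace this assignment with custom patching of the HTML
    new_view_string = html_display_txt

    # Convert the edited view HTML back to storage format before uploading
    html_storage_json = convert_view_to_db(auth, new_view_string)
    return html_storage_json['value']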


def get_login(username=None):
    if username is None:
        username = getpass.getuser()

    password = keyring.get_password('confluence_script', username)

    if password is None:
        password = getpass.getpass()
        keyring.set_password('confluence_script', username, password)

    return username, password


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-u",
        "--user",
        default=getpass.getuser(),
        help="Specify the username to log into Confluence")

    parser.add_argument(
        "pageid",
        type=int,
        help="Specify the Confluence page id to overwrite")

    options = parser.parse_args()

    auth = get_login(options.user)

    html = patch_html(auth, options)
    html = html.replace('&Acirc;', '')
    write_data(auth, html, options.pageid)
    return

if __name__ == "__main__":
    main()
0 vote

Had this same problem as well, but solved it in a different manner.  I found that I wasn't providing a title, which seems to be required to update page content.

 

I used JavaScript instead of Python so here's some pseudo code:

// Get pageJSON with GET and select the page content from JSON

var values = pageJSON.body.storage.value;

// Manipulate the content

var res = values.replace(div, div + content);

// Increment the version

var nextVersion = Number(AJS.Meta.get('page-version')) + 1;

// Create JSON object to update page content - note title is specified.

var newJSON = { "version": { "number": nextVersion }, "title": getTitle(), "type": "page", "body": { "storage": { "value": res, "representation": "storage" } } }

...

PUT request afterwards

...

Note: the page data includes macros and I didn't have to convert from storage to view
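
For anyone following along in Python (the language of the script earlier in this thread), the same request body could be built roughly like this. It is a sketch for Python 3: build_update_payload is a made-up name, and it assumes page_info is the decoded JSON from a GET on the page and that it includes id, type, title and version.

```python
def build_update_payload(page_info, storage_html):
    """Build the JSON body for the PUT that updates a page."""
    return {
        "id": page_info["id"],
        "type": page_info["type"],
        # The title is sent even when it does not change
        "title": page_info["title"],
        # The new version number must be the current one plus one
        "version": {"number": int(page_info["version"]["number"]) + 1},
        "body": {
            "storage": {"value": storage_html, "representation": "storage"}
        },
    }
```

Taking the title and version from the freshly fetched page avoids hard-coding either of them in the script.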
