Uploading HTML via REST api through python script gives 400 error for some HTML documents

Hi,

I have a Python script that uses the Confluence REST API to download a page, make changes to the HTML and re-upload the page.

My current script works fine if I replace the HTML with HTML that I create myself (and that is valid HTML). The code programmatically updates the version and PUTs it to the server.

But now I want to edit the page programmatically. I download the html, patch a new row into an HTML table and want to re-upload -- this fails with a 400 error. I also tried just round-tripping -- downloading the page html, uploading it -- this fails as well. Downloading and uploading just a <p>Hello World</p> works great -- it replaces the page content as expected.

I am looking for guidance on how to make this work. I notice a lot of <link> tags (where an @user reference is) - and tried removing those before re-upload but that did not work either.

Best,

Friedrich Brunzema

3 votes

Answer accepted

# coding: utf-8
import argparse
import getpass
import datetime
import json
import keyring
import requests
import lxml.html

# -----------------------------------------------------------------------------
# Globals

BASE_URL = "http://your-server/rest/api/content"
VIEW_URL = "http://your-server/pages/viewpage.action?pageId="
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"


def pprint(data):
    # Pretty-print a JSON-serializable object for debugging
    print(json.dumps(
        data,
        sort_keys=True,
        indent=4,
        separators=(', ', ' : ')))


def get_page_ancestors(auth, pageid):
    # Get basic page information plus the ancestors property

    url = '{base}/{pageid}?expand=ancestors'.format(
        base=BASE_URL,
        pageid=pageid)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()['ancestors']


def get_page_info(auth, page_id):
    url = '{base}/{page_id}'.format(
        base=BASE_URL,
        page_id=page_id)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()


def convert_db_to_view(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/view'

    data2 = {
        'value': html,
        'representation': 'storage'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def convert_view_to_db(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/storage'

    data2 = {
        'value': html,
        'representation': 'editor'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def write_data(auth, html, page_id):
    # Upload html (storage format) as the next version of the page
    info = get_page_info(auth, page_id)

    # The update must carry the next version number
    ver = int(info['version']['number']) + 1

    ancestors = get_page_ancestors(auth, page_id)

    # Keep the page under its current parent and drop the fields that are
    # not needed in the PUT request
    anc = ancestors[-1]
    for key in ('_links', '_expandable', 'extensions'):
        anc.pop(key, None)

    # NOTE: this renames the page; remove this line to keep the existing title
    info['title'] = "Team City Change Log"

    data = {
        'id': str(page_id),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage':
                {
                    'representation': 'storage',
                    'value': html,
                }
        }
    }

    data = json.dumps(data)

    url = '{base}/{page_id}'.format(base=BASE_URL, page_id=page_id)

    our_headers = {'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}

    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers=our_headers
    )

    r.raise_for_status()

    print("Wrote '%s' version %d" % (info['title'], ver))
    print("URL: %s%d" % (VIEW_URL, page_id))


def read_data(auth, page_id):
    url = '{base}/{page_id}?expand=body.storage'.format(base=BASE_URL, page_id=page_id)
    r = requests.get(
        url,
        auth=auth,
        headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}
    )

    r.raise_for_status()

    return r


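def patch_html(auth, options):
    # Fetch the page and pull out the storage-format body
    page = read_data(auth, options.pageid).json()
    html_storage_txt = page['body']['storage']['value']

    # Convert storage format to rendered "view" HTML, which is easier to patch
    html_display_txt = convert_db_to_view(auth, html_storage_txt)['value']

    # PATCH
    # Replace the next line with your custom patching of the HTML
    new_view_string = html_display_txt

    # Convert the patched HTML back to storage format before uploading
    return convert_view_to_db(auth, new_view_string)['value']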


def get_login(username=None):
    if username is None:
        username = getpass.getuser()

    password = keyring.get_password('confluence_script', username)

    if password is None:
        password = getpass.getpass()
        keyring.set_password('confluence_script', username, password)

    return username, password


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-u",
        "--user",
        default=getpass.getuser(),
        help="Specify the username to log into Confluence")

    parser.add_argument(
        "pageid",
        type=int,
        help="Specify the Confluence page id to overwrite")

    options = parser.parse_args()

    auth = get_login(options.user)

    html = patch_html(auth, options)
    # The conversion back to storage format can leave a stray '&Acirc;'; strip it
    html = html.replace('&Acirc;', '')
    write_data(auth, html, options.pageid)


if __name__ == "__main__":
    main()

I hope you get this comment @absciexbuild. This was a HUGE help. This basically just worked "out of the box" for me. Thank you so much for taking the time to post this back.

If any Confluence Python admins are listening here, this is definitely something that should be added to the examples in the Confluence Python API. I was looking to import and export an entire Space programmatically to another instance of Confluence, and this example does approximately 95% of that work.

Thanks again @absciexbuild.

Glad this was useful to you!

I also have a version in C# (basically the Python code transcribed). Should someone need it, please contact me.

0 votes

Had this same problem as well, but solved it in a different way. I found that I wasn't providing a title, which seems to be required to update page content.

I used JavaScript instead of Python, so here's some pseudocode:

// GET the page JSON and select the page content from it
var values = pageJSON.body.storage.value;

// Manipulate the content
var res = values.replace(div, div + content);

// Increment the version
var nextVersion = Number(AJS.Meta.get('page-version')) + 1;

// Create the JSON object to update the page content; note that the title is specified
var newJSON = {
    "version": { "number": nextVersion },
    "title": getTitle(),
    "type": "page",
    "body": { "storage": { "value": res, "representation": "storage" } }
};

// ... send newJSON in a PUT request afterwards ...

Note: the page data includes macros, and I didn't have to convert from storage to view.
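
For anyone following along in Python rather than JavaScript, the same update body could be sketched roughly as below. The helper name build_update_payload and the sample values are hypothetical; the field names are the ones already used in the snippets in this thread.

```python
import json


def build_update_payload(page_id, title, current_version, storage_html):
    """Build the JSON body for a PUT to /rest/api/content/{page_id}.

    build_update_payload is a hypothetical helper name; the fields mirror
    the ones used in the snippets in this thread.
    """
    return {
        'id': str(page_id),
        'type': 'page',
        # Reported above as required for the update to succeed
        'title': title,
        # The update carries the current version number plus one
        'version': {'number': int(current_version) + 1},
        'body': {
            'storage': {
                'value': storage_html,
                'representation': 'storage',
            }
        },
    }


# Sample values for illustration only
payload = build_update_payload(12345, 'My page', 7, '<p>Hello World</p>')
body = json.dumps(payload)
```

The resulting body string is what you would pass as data= to requests.put(), together with a 'Content-Type': 'application/json' header.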

Hi All,

I need to migrate from SharePoint to Confluence, and to do that I have written a small C# script that converts my .doc files to HTML.

Since I have a huge number of HTML files, I am looking for a way to upload my HTML documents to Confluence programmatically.

Thanks in advance for any quick help.

0 votes

Are you working with HTML, or the XHTML-based Confluence Storage Format? I think that Confluence will only accept its own storage format (the one returned when you use ?expand=body.storage).

The hello world example just happens to be both valid HTML and storage format, but it's not always the case. Tables are the same in both, though, so that shouldn't cause an issue necessarily.

Confluence is usually pretty good about returning information about the error, so have you checked the message that you get back with the 400? You could try using the REST API Browser for debugging.
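
A minimal sketch of how you might surface that message from a script, instead of only seeing the bare 400. The helper name describe_error is hypothetical, and it assumes the error body is JSON with a 'message' field; if it is not, the raw text is shown.

```python
import json


def describe_error(status_code, body_text):
    """Return a one-line summary of a failed REST response.

    Assumes the error body is JSON with a 'message' field and falls back
    to the raw response text when it is not.
    """
    try:
        message = json.loads(body_text).get('message', body_text)
    except (ValueError, TypeError, AttributeError):
        # Not JSON, or JSON that is not an object: show the body as-is
        message = body_text
    return 'HTTP %d: %s' % (status_code, message)


# Made-up sample body for illustration only
print(describe_error(400, '{"statusCode": 400, "message": "Error parsing xhtml"}'))
```

With requests, you could call describe_error(r.status_code, r.text) when r.ok is false, before calling r.raise_for_status().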

Hi,

I actually found the answer. So hard. Sigh. Editing a Confluence page programmatically involves a couple of steps:

1. Read the data: [url]?expand=body.storage
2. Load the JSON response text into a JSON object (json.loads())
3. Extract the HTML: json_object['body']['storage']['value']
4. Convert the returned page from storage to 'view' with a POST to /rest/api/contentbody/convert/view, using {'representation':'storage', 'value': html} as the data. This returns JSON containing HTML that is sanitized for viewing.
5. Get the text using display_json['value'].

You can now mess with the HTML.

Next you have to convert it back to storage format with a POST to /rest/api/contentbody/convert/storage, using {'representation':'editor', 'value': html} as the data.

One of the steps caused a unicode character Â to sneak in, which I replaced with nothing.

Then you do the upload, making sure to increase the version number of the page.

I will post the code in a separate answer.

The character appears to show up when converting back to storage format when there's an &nbsp; space character: it becomes 'Â '. At least that's what I've found, because it doesn't show up on every page.

I also get strange behavior where, every time I publish certain pages and fetch them again, there's an additional newline in the HTML between certain blocks. I have to regex-replace these with single newlines; otherwise there is an absurd amount of space between HTML blocks after a few publishes.
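
Both cleanups can be sketched in one small helper, run on the storage-format HTML just before the upload. The name tidy_storage_html is hypothetical, and treating the character after the stray 'Â' as a non-breaking space is an assumption based on the description above.

```python
import re


def tidy_storage_html(html):
    """Remove stray 'Â' characters and collapse repeated newlines."""
    # Drop the entity form, as the script in the accepted answer does
    html = html.replace('&Acirc;', '')
    # Drop a literal 'Â' (U+00C2) that directly precedes a non-breaking
    # space (assumption: that is the character that follows it)
    html = html.replace(u'\u00c2\u00a0', u'\u00a0')
    # Collapse runs of newlines, with optional spaces or tabs on the
    # lines in between, into a single newline
    html = re.sub(r'\n(?:[ \t]*\n)+', '\n', html)
    return html
```

Note that the newline rule also affects blank lines inside preformatted or code blocks, so it may need narrowing for pages that contain those.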
