Uploading HTML via REST api through python script gives 400 error for some HTML documents

Hi,

I have a python script that uses the confluence REST API to download a page, make changes to the HTML and re-upload the page.

My current script works fine if I replace the html with html that I create (that is valid HTML). The code programmatically updates the version and put's it to the server.

But now I want to edit the page programmatically. I download the html, patch a new row into an HTML table and want to re-upload -- this fails with a 400 error. I also tried just round-tripping -- downloading the page html, uploading it -- this fails as well. Downloading and uploading just a <p>Hello World</p> works great -- it replaces the page content as expected.

I am looking for guidance on how to make this work. I notice a lot of <link> tags (where an @user reference is) - and tried removing those before re-upload but that did not work either.

Best,

Friedrich Brunzema

3 answers

0 votes
Stephen Deutsch Community Champion Jun 09, 2017

Are you working with HTML, or the XHTML Confluence Storage Format? I think that confluence will only accept its own storage format (returned when using ?expand=body.storage).

The hello world example just happens to be both valid HTML and storage format, but it's not always the case. Tables are the same in both, though, so that shouldn't cause an issue necessarily.

Confluence is usually pretty good about returning information about the error, so have you checked the message that you get back with the 400? You could try using the REST API Browser for debugging.

Hi,

I actually found the answer. So hard. Sigh.  So editing a confluence page programmatically involves a couple of steps:

1.  read the data: [url]?expand-body.storage
2.  load text json response to json object (json.loads())
3.  extract html - json_object['body']['storage']['value']
4.  convert the returned page from storage to 'view' using post to /rest/api/contentbody/convert/storage -- returns json html  - use {'representation':'storage', 'value': html} in the data -- sanitized for view
5. Convert to text using display_json['value']. 

You can now mess with the HTML

Next you have to convert it back to storage format using post /rest/api/contentbody/convert/storage, {'representation':'editor'}

One of steps caused a unicode character &Acirc; to sneak in -- which I replaced with nothing.

Then you do the upload, making sure increase the version number of the page.

I will post some code below.

The character appears to show up when converting back to storage format when there's a &nbsp space character, it becomes '&Acirc;&nbsp;' - at least that's what I've found, becuase it doesn't show up on every page.

I also get strange behavior where every time I publish certain pages and fetch it again, there's an additional newline in the HTML in between certain blocks which I have to regex replace with single newlines, otherwise there will be an absurd amount of space between html blocks after a few publishes.

# coding: utf-8
import argparse
import getpass
import datetime
import json
import keyring
import requests
import lxml.html

# -----------------------------------------------------------------------------
# Globals

BASE_URL = "http://your-server/rest/api/content"
VIEW_URL = "http://your-server/pages/viewpage.action?pageId="
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"


def pprint(data):
    print json.dumps(
        data,
        sort_keys=True,
        indent=4,
        separators=(', ', ' : '))


def get_page_ancestors(auth, pageid):
    # Get basic page information plus the ancestors property

    url = '{base}/{pageid}?expand=ancestors'.format(
        base=BASE_URL,
        pageid=pageid)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()['ancestors']


def get_page_info(auth, page_id):
    url = '{base}/{page_id}'.format(
        base=BASE_URL,
        page_id=page_id)

    r = requests.get(url, auth=auth, headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT})

    r.raise_for_status()

    return r.json()


def convert_db_to_view(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/view'

    data2 = {
        'value': html,
        'representation': 'storage'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def convert_view_to_db(auth2, html):
    url = 'http://your-server/rest/api/contentbody/convert/storage'

    data2 = {
        'value': html,
        'representation': 'editor'
    }

    r = requests.post(url,
                      data=json.dumps(data2),
                      auth=auth2,
                      headers={'Content-Type': 'application/json'}
                      )
    r.raise_for_status()
    return r.json()


def write_data(auth, html, page_id):
    info = get_page_info(auth, page_id)

    ver = int(info['version']['number']) + 1

    ancestors = get_page_ancestors(auth, page_id)

    anc = ancestors[-1]
    del anc['_links']
    del anc['_expandable']
    del anc['extensions']

    info['title'] = "Team City Change Log"

    data = {
        'id': str(page_id),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage':
                {
                    'representation': 'storage',
                    'value': str(html),
                }
        }
    }

    data = json.dumps(data)

    url = '{base}/{page_id}'.format(base=BASE_URL, page_id=page_id)

    our_headers = {'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}

    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers=our_headers
    )

    r.raise_for_status()

    print "Wrote '%s' version %d" % (info['title'], ver)
    print "URL: %s%d" % (VIEW_URL, page_id)

    return ""


def read_data(auth, page_id):
    url = '{base}/{page_id}?expand=body.storage'.format(base=BASE_URL, page_id=page_id)
    r = requests.get(
        url,
        auth=auth,
        headers={'Content-Type': 'application/json', 'USER-AGENT': USER_AGENT}
    )

    r.raise_for_status()

    return r


def patch_html(auth, options):
    json_text = read_data(auth, options.pageid).text
    json2 = json.loads(json_text)
    html_storage_txt = json2['body']['storage']['value']
    html_display_json = convert_db_to_view(auth, html_storage_txt)
    html_display_txt = html_display_json['value'].encode('utf-8')

    # PATCH 
    # new_view_string = custom patching of HTML here,
    return new_view_string


def get_login(username=None):
    if username is None:
        username = getpass.getuser()

    password = keyring.get_password('confluence_script', username)

    if password is None:
        password = getpass.getpass()
        keyring.set_password('confluence_script', username, password)

    return username, password


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "-u",
        "--user",
        default=getpass.getuser(),
        help="Specify the username to log into Confluence")

    parser.add_argument(
        "pageid",
        type=int,
        help="Specify the Confluence page id to overwrite")

    options = parser.parse_args()

    auth = get_login(options.user)

    html = patch_html(auth, options)
    html = html.replace('&Acirc;', '')
    write_data(auth, html, options.pageid)
    return

if __name__ == "__main__": main()

Had this same problem as well, but solved it in a different manner.  I found that I wasn't providing a title which seems is required to update page content: 

 

I used JavaScript instead of Python so here's some pseudo code:

// Get pageJSON with GET and select the page content from JSON

var values = pageJSON.body.storage.value;

// Manipulate the content

var res = values.replace(div, div + content);

// Increment the version

var nextVersion = Number(AJS.Meta.get('page-version')) + 1;

// Create JSON object to update page content - note title is specified.

var newJSON = { "version": { "number": nextVersion }, "title": getTitle(), "type": "page", "body": { "storage": { "value": res, "representation": "storage" } } }

...

PUT request afterwards

...

Note: the page data includes macros and I didn't have to convert from storage to view

Hi All,

I am required to migrate from Sharepoint to confluence, and in order to accomplish the same, I have wrote small C# script to convert my .doc files to HTML.

Since, I do have huge number of HTML files, I am looking for option to get my HTML documents uploaded to confluence dynamically.

Thanks in advance for quick help.

Suggest an answer

Log in or Sign up to answer
Community showcase
Published Oct 09, 2018 in Confluence

Introducing Praecipio Consulting, an Atlassian Solution Partner

Hey there Community!  My name is Vannya Vallejo, the Channel Communication Specialist at Atlassian and I want to help Atlassian users like you learn about our Solution Partners and how they c...

409 views 0 9
Read article

Atlassian User Groups

Connect with like-minded Atlassian users at free events near you!

Find a group

Connect with like-minded Atlassian users at free events near you!

Find my local user group

Unfortunately there are no AUG chapters near you at the moment.

Start an AUG

You're one step closer to meeting fellow Atlassian users at your local meet up. Learn more about AUGs

Groups near you