REST API returns some duplicate pages for content?page queries

Daniel Miller January 3, 2019

Via Python and the "requests" package, using the REST API, I've written scripts to read and write the contents and add attachments to Confluence pages in a particular workspace. Before each transaction, my code verifies that a page of the proper title exists within the workspace by retrieving all the page titles, using the syntax:

my_url/rest/api/space/<spacekey>/content?type=page&limit=25&start=0&expand=space,version

I call this iteratively, incrementing the start by 25 until the number of returned items is less than 25. I look for the desired page title within the set of titles returned. A total of 276 page entries are returned.
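A minimal sketch of the loop described above, in Python. `fetch_batch` is a hypothetical callable standing in for the HTTP request (for example a `requests.get` against the URL above that returns the list of page objects from the JSON body); it is injected so the paging logic can be read on its own. The sketch assumes each page object is a dict with a `title` key, and `fake_fetch` is made-up test data, not real Confluence output.

```python
from typing import Callable, Dict, List

Batch = List[Dict[str, str]]


def collect_titles(fetch_batch: Callable[[int, int], Batch], limit: int = 25) -> List[str]:
    """Page through a space with start/limit until a short batch is returned."""
    titles: List[str] = []
    start = 0
    while True:
        batch = fetch_batch(start, limit)
        titles.extend(page["title"] for page in batch)
        if len(batch) < limit:
            # A short (or empty) batch means the end of the space was reached.
            break
        start += limit
    return titles


def fake_fetch(start: int, limit: int) -> Batch:
    """Stand-in for the REST call: a stable, made-up list of 276 pages."""
    pages = [{"id": str(i), "title": f"Page {i}"} for i in range(276)]
    return pages[start:start + limit]


titles = collect_titles(fake_fetch)
print(len(titles), "Page 275" in titles)
```

This loop is only correct if the server returns pages in the same order for every request, which is exactly what the rest of this thread calls into question.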

For a few new pages, the above logic is failing to find the page title within the space. HOWEVER, the page is visible within the Confluence GUI, apparently in the right place, and if my script ignores the error it can indeed read and write contents and add attachments to the page!

I investigated by comparing the list vs. the set of all page titles returned by the above logic. What I found was a number of duplicate page titles, and the number (about 60 or 70) and page titles of the duplicates varied each time I ran the script. It seems as if Confluence is getting confused about the indexing of pages when I'm retrieving them by start number... or I've made some weird error.
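The list-versus-set comparison can be sketched roughly as follows. The titles are made-up sample data, and `duplicate_titles` is a hypothetical helper name, not part of any Confluence client.

```python
from collections import Counter
from typing import Dict, List


def duplicate_titles(titles: List[str]) -> Dict[str, int]:
    """Return each title that appears more than once, with its count."""
    return {title: n for title, n in Counter(titles).items() if n > 1}


titles = ["Home", "Specs", "Home", "Notes", "Specs", "Home"]
dupes = duplicate_titles(titles)

# len(titles) - len(set(titles)) is the number of surplus entries.
print(len(titles), len(set(titles)), dupes)
```

If the space really has no two pages with the same title, every surplus entry corresponds to a page that was returned more than once.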

Has anyone seen a similar issue?

2 answers

2 votes
Nejc Grenc May 30, 2023

Hello,

In our organization, we recently encountered a peculiar issue with Confluence content retrieval. After migrating to Atlassian Confluence 7.19.8, we encountered a discrepancy in the generated reports, specifically the absence of certain pages. In my investigation, I traced the issue to the REST calls responsible for retrieving pages from Confluence. The problematic behavior exhibited similarities to those described by Daniel Miller.

 

Problem description

Previously, we employed consecutive calls, each limited to fetching 100 pages, to retrieve Confluence pages. For the specific space in question, which contained 255 pages, we executed three calls using the following format:
{{confluence-url}}/rest/api/space/{{space-key}}/content (?start=0&limit=100, ?start=100&limit=100, ?start=200&limit=100).

  1. The initial call yielded 100 distinct pages.
  2. The second call provided 100 pages, of which 25 were exact duplicates of those obtained in the first call. These duplicates encompassed all page attributes (e.g., ID, title, and status).
  3. Subsequently, the third call returned 55 pages, out of which 19 were duplicates of those retrieved in the previous two calls.

In total, the correct number of pages (255) was obtained, but there existed 44 duplicates, signifying the omission of the remaining 44 pages within the space.
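The accounting behind those figures, written out as a short Python check. The per-call numbers are taken from the list above; nothing else is assumed.

```python
# (pages returned, duplicates among them) for the three calls described above
calls = [(100, 0), (100, 25), (55, 19)]

returned = sum(n for n, _ in calls)
duplicates = sum(d for _, d in calls)
unique = returned - duplicates

pages_in_space = 255
missing = pages_in_space - unique

print(returned, duplicates, unique, missing)
```

Because the total returned equals the number of pages in the space, every duplicate necessarily takes the slot of a page that never appears.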

 

This peculiar behavior persisted across standard REST calls (executed via browsers or Postman), as well as when utilizing Java's SpaceService component.

The returned pages exhibited no discernible order, although they consistently followed the same sequence. In other words, each call consistently returned the same set of pages in the same order, as long as no modifications were made within Confluence (i.e., no pages created/updated/deleted). This pattern also encompassed the presence and positioning of duplicates.

 

The solution?

I implemented Daniel Miller's proposed solution, which uses the maximum permissible limit value of 200. This approach seems effective, as I encountered no further duplicates in my tests.

Granted, my sample size is small (and I have no time to investigate this much further), but there appears to be a deeper issue in Confluence's content retrieval that warrants further investigation.

 

Regards,
Nejc

Daniel Miller
Contributor
May 31, 2023

Nejc:  It has been a few years, but my recollection is that a larger limit size turned out to reduce the number of duplicates, but not eliminate them completely. We never pursued this issue because, as noted, the "missing" pages were actually there and could be read from and written to by the scripts.

Nejc Grenc June 1, 2023

Hello, Daniel!

I understand your message. The Confluence pages are currently fully accessible within the Confluence platform itself through browser interactions or direct REST calls. However, these pages were inadvertently omitted from our reports generated using the mentioned REST content call.

Oh, well. Currently, this solution appears to be stable; however, we will likely need to address this issue again in the future if duplicates continue to crowd out other pages and cause them to go missing.

I sincerely appreciate your contribution in providing the initial solution and reminding us of the potential for future failures.

Thank you, Daniel!

Heinrich Ulbricht _WikiTraccs_
Contributor
June 18, 2023

Wow, this is just on time! I recently got a support ticket for the WikiTraccs Confluence to SharePoint migration tool that I'm building that reported errors in migration progress reporting. After digging I found that Confluence returned duplicate pages as well, about 10% in a space of about 4000 pages. Page size was 100 in this case. The solution for now: ignore the duplicates.

After reading through the posts I'm just a bit worried that pages might also be missing. Did anybody try to rebuild the content index (as suggested by @Stephen Sifers) to solve this problem, and did it help?

I'll probably increase the page size to 200 as well. Is this a hard limit or could that be changed somewhere in the environment?

Nejc Grenc June 26, 2023

Hello, Heinrich!

I was on holiday last week, so I only saw your message today.

To the best of my recollection, we did indeed follow the suggestion by @Stephen Sifers and rebuilt the Confluence index on both of our environments. Unfortunately, this did not yield the desired results. It is worth noting that we only attempted the rebuilding through the Application UI and did not explore the option of starting from scratch.

Regarding the limit parameter, it is constrained by fixed system limits, set at 200. If a larger number is entered, the system will automatically reduce it to 200. I am not aware of any method to locate and modify this value within the depths of Confluence. Consequently, retrieving all pages in a single comprehensive call is not possible, and we are compelled to use pagination.

Please try using a limit of 200 and provide feedback here if the issue of duplicates persists. I am genuinely interested in obtaining further details about this matter.

Regards,
Nejc

Heinrich Ulbricht _WikiTraccs_
Contributor
May 16, 2024

Update: I upped the limit to 200, but still have the occasional duplicate. I added filters after each content retrieval that take care of removing duplicates. That's the solution for me.
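A filter of that kind might look roughly like this in Python. It is a sketch, not the WikiTraccs implementation: `drop_seen` is a hypothetical helper, and it assumes each result is a dict with an `id` key.

```python
from typing import Dict, Iterable, Iterator, Set


def drop_seen(batch: Iterable[Dict[str, str]], seen_ids: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield only pages whose id has not been seen yet, recording ids as they pass."""
    for page in batch:
        if page["id"] not in seen_ids:
            seen_ids.add(page["id"])
            yield page


# One shared set across all batches, so duplicates between batches are caught too.
seen: Set[str] = set()
first = list(drop_seen([{"id": "1"}, {"id": "2"}], seen))
second = list(drop_seen([{"id": "2"}, {"id": "3"}, {"id": "3"}], seen))
print([p["id"] for p in first], [p["id"] for p in second])
```

Filtering like this removes repeated results, but it cannot bring back pages that the query never returned.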

 

Just as a side note: the Confluence Cloud v2 API seems to have a limit of 250 here and there.

Nejc Grenc May 16, 2024

You are right. About 9 months later, we also started getting tickets from our users related to this, but it slipped my mind to update this thread.

The duplicates are easily handled. The missing pages, that were not retrieved by the query, were the big problem. A duplicate in this case does not extend the result list, but instead overwrites another page result.

We have since changed our approach from REST calls to the Java API in a plugin, so I don't have much more insight specifically for this problem.

 


 

The following is about Java API, in case someone is struggling there as well.

I have observed a similar problem with 
com.atlassian.confluence.api.service.content.SpaceService

PageResponse<Content> pages =
    spaceService
        .findContent(space, expansions)
        .fetchMany(ContentType.PAGE, new SimplePageRequest(iterator * 200, 200));

This code returned similar duplicates and missing pages.

But the problem can be solved by using a different API:
com.atlassian.confluence.api.service.content.ContentService

PageResponse<Content> pages =
    contentService
        .find(expansions)
        .withSpace(space)
        .withType(ContentType.PAGE)
        .fetchMany(ContentType.PAGE, new SimplePageRequest(iterator * 200, 200));

 

Heinrich Ulbricht _WikiTraccs_
Contributor
July 26, 2024

Thanks for reporting back! I am handling yet another case where the Confluence (Server) REST API is just leaving out results (presenting duplicates instead).

For a space of < 2000 pages, sometimes ~500 are missing, sometimes ~200. The number seems to vary depending on the parameters that are set on the query (like expanding certain properties).

One workaround seems to be manually walking the page tree. Starting at the root nodes, each node's children are retrieved, until all levels have been collected. Tedious. But all pages seem to be there.
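The tree walk can be sketched as a breadth-first traversal in Python. `get_children` is a hypothetical callable that returns the child page ids of a page; in a real script it would wrap whatever child-page request your Confluence version offers (which is itself likely paginated). The `tree` dict is made-up sample data.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def walk_page_tree(root_ids: Iterable[str], get_children: Callable[[str], List[str]]) -> List[str]:
    """Breadth-first walk from the root pages, visiting each page id once."""
    seen = set()
    order: List[str] = []
    queue = deque(root_ids)
    while queue:
        page_id = queue.popleft()
        if page_id in seen:
            # Guard against a page being reported under more than one parent.
            continue
        seen.add(page_id)
        order.append(page_id)
        queue.extend(get_children(page_id))
    return order


tree: Dict[str, List[str]] = {"home": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
visited = walk_page_tree(["home"], lambda pid: tree.get(pid, []))
print(visited)
```

The cost is one request (or more) per page instead of one per batch, which is why this is tedious for large spaces.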

Still wondering what's up. Unfortunately I'm bound to using the REST API.

0 votes
Stephen Sifers
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
January 8, 2019

Hello Daniel,

The result you are reporting can be due to the number of pages with multiple versions. You will see multiple pages returned with duplicate names, but note that the version and ID will be different for each duplicate.

Another way to verify the API is not returning duplicate results would be to run a manual comparison.

  1. Simply take your API URL and paste it into a browser; the response should be the JSON results
    1. Save each result page or paste results into a text editor
  2. As you’re doing in your script, keep incrementing the start till you reach the end.
  3. Once you’ve worked through all of the results, use a JSON parsing tool to compare the results
    1. You may have to move your results out of the parsing tool to record all of the results in one place.
  4. If you’re seeing identical duplicate results within your comparison, then you may have a Confluence indexing issue.

If the Confluence index is the issue, you can rebuild the index to resolve this issue. Here are additional resources to help with resolving index issues:

Content Index Administration

How to Rebuild the Content Indexes From Scratch on Confluence Server

Please let us know if you’re still having identical duplicate results within your API call after comparing your results and evaluating if an index issue is in fact present. If your issue persists, we will most likely need to review further to see if there is a defect within the Confluence version you’re on.

Regards,
Stephen Sifers

Daniel Miller January 10, 2019

Stephen: I don't believe we have pages with duplicate names. Note that the original problem was that this routine did not return a page I know to exist. In any case, when repeating the test, I get varying sets of pages returned more than once. What I'm seeing is a varying set of false duplicates "crowding out" other, actual pages.

When looking at the results, I often get a different first page. How does Confluence "order" the pages returned in this query? Should there be a static order (assuming no pages are added or removed from the space)? If not, how does Confluence guarantee that when you request (for example) pages 100-124, you don't get a page returned previously? This seems to be what is happening.
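To make the concern concrete, here is a small Python simulation of start/limit paging over a result set whose order is not stable between requests. It is purely illustrative: the per-request shuffle is a hypothetical stand-in for "no guaranteed order" and says nothing about what Confluence actually does internally.

```python
import random
from typing import List


def paginate(n_pages: int, limit: int, stable: bool) -> List[int]:
    """Collect ids via start/limit slices; reshuffle per request if not stable."""
    ids = list(range(n_pages))
    collected: List[int] = []
    for request_no, start in enumerate(range(0, n_pages, limit)):
        view = ids[:]
        if not stable:
            # Hypothetical: each stateless request sees its own ordering.
            random.Random(request_no).shuffle(view)
        collected.extend(view[start:start + limit])
    return collected


for stable in (True, False):
    got = paginate(276, 25, stable)
    # total returned, distinct pages, pages never returned
    print(stable, len(got), len(set(got)), 276 - len(set(got)))
```

With a stable order every page appears exactly once. Without it, the total count still comes out right, but some pages repeat and an equal number never appear, which matches the "crowding out" pattern described above.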

Also, as a workaround, I tried increasing the number of pages per batch to 200 (the apparent maximum). It takes two queries (0-199 and 200 onwards) to retrieve all the pages and the second query returns 76 pages with NO duplications of the first 200. So I have a way forwards, though I'm still concerned there is an issue here which could cause downstream problems as more pages are added.

Thanks

Stephen Sifers
Atlassian Team
January 10, 2019

Hello again Daniel,

When I tested on a fresh install of Confluence 6.13 I was getting non-random results when using a GET method within a space.

With this said, you may want to explore using the SEARCH call which will allow you to apply CQL which then allows you to set Order By on your results. This should produce a more predictable result in your case. You may find out further information on REST API Search Content along with Advanced Searching using CQL.
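As a rough illustration of that suggestion, the Python sketch below builds a content/search URL with an explicit sort. The base URL, the space key, and the exact CQL string (including sorting by `created`) are assumptions for illustration; check the CQL documentation for the fields and sort options your version supports.

```python
from urllib.parse import urlencode


def search_url(base_url: str, space_key: str, start: int = 0, limit: int = 200) -> str:
    """Build a content/search URL that asks for pages in an explicit order."""
    # The CQL string is an assumption; verify field names and sort support for your version.
    cql = f'space = "{space_key}" and type = page order by created asc'
    query = urlencode({"cql": cql, "start": start, "limit": limit})
    return f"{base_url}/rest/api/content/search?{query}"


url = search_url("https://confluence.example.com", "DOCS", start=200)
print(url)
```

Sorting by a value that can tie (two pages with the same timestamp, for instance) may still leave the order of the tied pages undefined, so it can be worth keeping a duplicate check in place.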

Any additional information you have for your Confluence deploy would be greatly appreciated so we may test and ensure there is not a defect in the product. Please let us know the following:

  • Confluence Version
  • Any add-ons used within Confluence
  • Database application
  • Database Collation

This should allow us to test further and attempt to reproduce the issue.

Regards,
Stephen Sifers

Daniel Miller January 10, 2019

Stephen: I assume you were using a small batch size compared with the overall number of pages in the space - e.g. 25 vs. 275?

Here is the requested info:

Confluence Version 6.6.7.

The database is postgres 9.6.

No special rules for collation.

Add-ons used within Confluence:
· Bob Swift Atlassian Add-ons - Cache
· Comala Workflows
· Multiexcerpt Plugin4
· Page Info for ScriptRunner
· Scroll Word Exporter
· Staffing Timeline
· Adaptavist ScriptRunner for Confluence
· Angular JS integration for AUI (AUI-NG)
· Bob Swift Atlassian Add-ons - Advanced Tables
· Bob Swift Atlassian Add-ons - Table Library
· Confluence Source Editor
· Forms for Confluence
· Keysight Admin Tools for Confluence
· Scroll Exporter Extensions
· Scroll Runtime for Confluence
· Spreadsheets

Stephen Sifers
Atlassian Team
January 11, 2019

Hello again,

Thank you for providing your configuration. I was able to test on your version and configuration to see if I could recreate the random API results. I tested within a space that had over 1000 pages. I tested API calls while pages were already created and while more pages were being created (a simple do-while loop to create 500 pages was used in this case). I found the results to be matching each time I made a request to get content. I tested with page sizes of 25, 50, 75, 100 and 200. All of these returned in the same matching order; no random order was detected.

I checked to see if there is a logic in the way API results are returned within the content endpoint, the answer was there is no ordering on the results. However, if you do use the content/search you may then pass CQL through which will honor Order By requests. You may find out further information on REST API Search Content along with Advanced Searching using CQL.

If you do feel that you’re still experiencing an issue, you are welcome to create a support request to have someone review your instance. You may create a support request at Get Support. Along with this, if you feel what you’re experiencing is a bug within the product, you may create a bug request at Bug/Suggestion Request.

I hope this helps to clarify the ordering results which you’re seeing within the content endpoint.

Regards,
Stephen Sifers

Dan Miller
I'm New Here
February 7, 2019

Stephen: Clearly there must be some "natural" ordering of the results, to prevent stateless requests for subsequent batches from overlapping prior batches.

But my workaround (requesting 200 pages at once) is working, so I don't think it is worth pursuing unless others are seeing this issue.

Thanks for your help.
