I would like to be able to pull a list so that I can see every link (including its location) on my Confluence site. Is this possible?
Oh, I found an add-on that finds Broken Links, and I noticed it shows hosts for all outbound links. But alas, it only lists the full links for broken ones. Maybe you could ask them to add a feature to optionally show unbroken links too?
If you do install this app, it uses entity properties to add a very useful array of links for each page:
So with this app installed, it would be at least a little easier to get those links.
I haven't figured out where they store the index of pages with links. If we can find that, it'd be a lot easier than having to grab the properties for every page on your site.
Oh duh. Entity properties can be queried in CQL. So, yeah, if you installed this add-on, you could then do a query like this:
https://YOURSITE.atlassian.net/wiki/rest/api/search?limit=200&expand=content.metadata.properties.appsplusBrokenLinks&cql=appsplusBrokenLinksIndexed%20%3E%200%20ORDER%20BY%20lastModified%20DESC
And then you could do a bit of JSON parsing (I'm a big fan of jq) to convert that output into a list of page links and lists of links. Hrm... let me see... yup, this should do the trick:
jq '. | ._links.base as $base | .results[].content | select ((.metadata.properties.appsplusBrokenLinks.value.links | length) > 0) | {page: ($base + ._links.webui), title: .title, links: [ .metadata.properties.appsplusBrokenLinks.value.links[].e]}'
So then, we combine all that with curl and an API token, and we get this monstrosity (remember you have to install Broken Links+ for Confluence for this to work):
curl --silent --header 'Authorization: Basic YOURAPITOKEN' 'https://YOURSITE.atlassian.net/wiki/rest/api/search?limit=200&expand=content.metadata.properties.appsplusBrokenLinks&cql=appsplusBrokenLinksIndexed%20%3E%200%20ORDER%20BY%20lastModified%20DESC' | jq '. | ._links.base as $base | .results[].content | select ((.metadata.properties.appsplusBrokenLinks.value.links | length) > 0) | {page: ($base + ._links.webui), title: .title, links: [ .metadata.properties.appsplusBrokenLinks.value.links[].e]}'
Which should give you something nice like this:
{
  "page": "https://YOURSITE.atlassian.net/wiki/spaces/PUBLIC/pages/835289184/foobar",
  "title": "foobar",
  "links": [
    "https://yahoo.com",
    "https://google.com",
    "http://lorempixel.com/640/480/technics/"
  ]
}
{
  "page": "https://YOURSITE.atlassian.net/wiki/spaces/PUBLIC/pages/835551233/synergistic+engage+users",
  "title": "synergistic engage users",
  "links": [
    "http://lorempixel.com/640/480/animals/",
    "http://lorempixel.com/640/480/animals/"
  ]
}
(Tested on a Mac in the Terminal app, which has curl built-in and where you can pretty easily install jq.)
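One gotcha with the curl command above: the Basic auth header isn't the raw API token, it's the base64 encoding of `email:api_token`. A minimal Python sketch (the email and token below are placeholders, not real credentials):

```python
import base64

def basic_auth_header(email: str, api_token: str) -> str:
    """Confluence Cloud Basic auth: base64 of 'email:api_token'."""
    raw = f"{email}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Placeholder credentials, just to show the shape of the header:
print(basic_auth_header("you@example.com", "YOURAPITOKEN"))
```

Paste the resulting string in place of `Basic YOURAPITOKEN` in the curl command.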
(OOOF, my auth header for curl was wrong. AND my query was wrong. AND the app doesn't have a way to query ONLY for pages with links so I had to change to grab the data for EVERY PAGE and then filter it out with jq.)
ALSO, there is (currently) a 200 max limit to how many results are returned by the search query, so this will almost certainly be incomplete for large sites.
The "right" way to do this, I guess, is to use Python or something to basically get the Broken Links+ metadata for every page (which is at least faster than downloading all the body content!), using cursor pagination, and THEN parse that data for pages that contain any links.
Hum. There's probably some slick Python library to deal with cursor-based pagination. That kind of thing always hurts my head, so I leave it as an exercise for the reader.
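Okay, one stab at that exercise: the search API hands back a `_links.next` URL when more results remain, so the loop is just "follow next until it's gone." A stdlib-only sketch, where `SITE`, the auth header, and the commented-out query are placeholders from the curl example above:

```python
import json
import urllib.request

SITE = "https://YOURSITE.atlassian.net/wiki"  # placeholder
AUTH = "Basic YOURAPITOKEN"                   # placeholder, see curl example

def get_json(url):
    """Fetch one page of search results as parsed JSON."""
    req = urllib.request.Request(url, headers={"Authorization": AUTH})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def paginate(get_page, first_url, base=""):
    """Follow _links.next until the API stops returning one,
    collecting the 'results' arrays along the way."""
    results, url = [], first_url
    while url:
        page = get_page(url)
        results.extend(page.get("results", []))
        nxt = page.get("_links", {}).get("next")
        url = base + nxt if nxt else None
    return results

# Usage (not run here): grab every page the app has indexed.
# query = (SITE + "/rest/api/search?limit=200"
#          "&expand=content.metadata.properties.appsplusBrokenLinks"
#          "&cql=appsplusBrokenLinksIndexed%20%3E%200")
# all_results = paginate(get_json, query, base=SITE)
```

From there you'd filter `all_results` the same way the jq expression does.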
Thanks @Darryl Lee that was an awesome reply. Yeah we included those index key names early on just in case the more technical people wanted to do advanced CQL queries. You can see the full list at the bottom of the descriptor: https://broken-links.appsplus.co/atlassian-connect.json
@Sally we recently added a feature in the report page of our app (top navbar >> Apps >> Broken Links+) which lists all the domains/hostnames for both broken and unbroken links.
eg if you want to get all pages which include Google Docs links you can search "docs.google.com" in the top-right search box on that report page.
We probably should have renamed the app after that feature release so I've done that now: "Broken & Outbound Links+ for Confluence" :)
Happy to build-in other features if there's a specific use-case you're looking for. Just shoot us an email: support@appsplus.co
Hey, thanks @Nathan Waters - AppsPlus ! Exciting to get to talk to the developer directly.
Oooh man, looking at the data, I feel like I'm so close to not having to grab your app's metadata for every single page. IF ONLY I could do a CQL query for:
appsplusBrokenLinksHostnames > 0
OR
appsplusBrokenLinksHostnames NOT EMPTY
OR
appsplusBrokenLinksHostnames!=NULL
But none of those work. Nor any other variations. I don't suppose you know any tricks to allow CQL to search for non-empty value in a property key's string object?
appsplusBrokenLinksHostnames is an array of hostnames for both broken and unbroken links. It's not the full link, just a unique array of hostnames (eg docs.google.com). Since CQL is only a search for content I think you'd need to query every page to fetch every link.
Best if Sally specifies the use-case. I'm not sure why someone would want a full list of raw links other than to do some secondary query/filter on that data. If I knew that secondary intent, it'd be easier to build for it.
If the intent is something like "find all pages with Google Docs links" then that feature is already built in.
Oooh, cool question. At first I was thinking there wouldn't be a way to do this without crawling the entire site and looking at the content of every page.
But then I remembered that Page Information for a page has a section called "Outbound Links".
ALAS, poking around while loading the page, I unfortunately couldn't see any API calls we could use to get that information.
There is a Suggestion for an API to provide this info:
It's a bit old though, so I wouldn't hold my breath. :-/
Regardless, I still think you'd have to write a script that grabs the Page Information for every single page on your site. Definitely not very efficient.
Still, faster (hopefully) than having to download all the content for every page and extract that.
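If you did go the brute-force route, the per-page extraction step could look something like this in Python: request each page with `expand=body.storage` and pull the hrefs out of the storage-format HTML. This is just a sketch; `page_links` and `SITE` are made-up names, and the parser only catches plain `<a href>` anchors (Confluence stores internal page links with its own `ac:link` markup, which this would miss):

```python
import json
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags in storage-format HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(storage_html):
    parser = LinkCollector()
    parser.feed(storage_html)
    return parser.links

SITE = "https://YOURSITE.atlassian.net/wiki"  # placeholder

def page_links(page_id, auth_header):
    """Hypothetical helper: fetch one page's storage body, return its links."""
    url = f"{SITE}/rest/api/content/{page_id}?expand=body.storage"
    req = urllib.request.Request(url, headers={"Authorization": auth_header})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return extract_links(data["body"]["storage"]["value"])
```

You'd still have to call `page_links` once per page, which is exactly the inefficiency I was grumbling about.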