I would like to be able to pull a list so that I can see every link (including its location) on my Confluence site. Is this possible?
Oh, I found an add-on that finds Broken Links, and I noticed it shows hosts for all outbound links. But alas, it only lists the full links for broken ones. Maybe you could ask them to add a feature to optionally show unbroken links too?
If you do install this app, it uses entity properties to add a very useful array of links for each page:
So with this app installed, it would be at least a little easier to get those links.
I haven't figured out where they store the index of pages with links. If we can find that, it'd be a lot easier than having to grab the properties for every page on your site.
Oh duh. Entity properties can be queried in CQL. So, yeah, if you installed this add-on, you could then do a query like this:
https://YOURSITE.atlassian.net/wiki/rest/api/search?limit=200&expand=content.metadata.properties.appsplusBrokenLinks&cql=appsplusBrokenLinksIndexed%20%3E%200%20ORDER%20BY%20lastModified%20DESC
And then you could do a bit of JSON parsing (I'm a big fan of jq) to convert that output into a list of page links and lists of links. Hrm... let me see... yup, this should do the trick:
jq '. | ._links.base as $base | .results[].content | select ((.metadata.properties.appsplusBrokenLinks.value.links | length) > 0) | {page: ($base + ._links.webui), title: .title, links: [ .metadata.properties.appsplusBrokenLinks.value.links[].e]}'
So then, we combine all that with curl and an API token, and we get this monstrosity (remember you have to install Broken Links+ for Confluence for this to work):
curl --silent --header 'Authorization: Basic YOURAPITOKEN' 'https://YOURSITE.atlassian.net/wiki/rest/api/search?limit=200&expand=content.metadata.properties.appsplusBrokenLinks&cql=appsplusBrokenLinksIndexed%20%3E%200%20ORDER%20BY%20lastModified%20DESC' | jq '. | ._links.base as $base | .results[].content | select ((.metadata.properties.appsplusBrokenLinks.value.links | length) > 0) | {page: ($base + ._links.webui), title: .title, links: [ .metadata.properties.appsplusBrokenLinks.value.links[].e]}'
Which should give you something nice like this:
{
  "page": "https://YOURSITE.atlassian.net/wiki/spaces/PUBLIC/pages/835289184/foobar",
  "title": "foobar",
  "links": [
    "https://yahoo.com",
    "https://google.com",
    "http://lorempixel.com/640/480/technics/"
  ]
}
{
  "page": "https://YOURSITE.atlassian.net/wiki/spaces/PUBLIC/pages/835551233/synergistic+engage+users",
  "title": "synergistic engage users",
  "links": [
    "http://lorempixel.com/640/480/animals/",
    "http://lorempixel.com/640/480/animals/"
  ]
}
(Tested on a Mac in the Terminal app, which has curl built-in and where you can pretty easily install jq.)
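One gotcha with the curl command above: the Basic auth header isn't the raw API token, it's the base64 encoding of `email:api_token`. A minimal Python sketch (the email and token below are placeholders, not real credentials):

```python
import base64

def basic_auth_header(email: str, api_token: str) -> str:
    """Confluence Cloud Basic auth: base64 of 'email:api_token'."""
    raw = f"{email}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Placeholder credentials, just to show the shape of the header:
print(basic_auth_header("you@example.com", "YOURAPITOKEN"))
```

Paste the resulting string in place of `Basic YOURAPITOKEN` in the curl command.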
(OOOF, my auth header for curl was wrong. AND my query was wrong. AND the app doesn't have a way to query ONLY for pages with links so I had to change to grab the data for EVERY PAGE and then filter it out with jq.)
ALSO, there is (currently) a 200 max limit to how many results are returned by the search query, so this will almost certainly be incomplete for large sites.
The "right" way to do this, I guess, is to use Python or something to basically get the Broken Links+ metadata for every page (which is at least faster than downloading all the body content!), using cursor pagination, and THEN parse that data for pages that contain any links.
Hum. There's probably some slick Python library to deal with cursor-based pagination. That kind of thing always hurts my head, so I leave it as an exercise for the reader.
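Okay, one stab at that exercise: the search API hands back a `_links.next` URL when more results remain, so the loop is just "follow next until it's gone." A stdlib-only sketch, where `SITE`, the auth header, and the commented-out query are placeholders from the curl example above:

```python
import json
import urllib.request

SITE = "https://YOURSITE.atlassian.net/wiki"  # placeholder
AUTH = "Basic YOURAPITOKEN"                   # placeholder, see curl example

def get_json(url):
    """Fetch one page of search results as parsed JSON."""
    req = urllib.request.Request(url, headers={"Authorization": AUTH})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def paginate(get_page, first_url, base=""):
    """Follow _links.next until the API stops returning one,
    collecting the 'results' arrays along the way."""
    results, url = [], first_url
    while url:
        page = get_page(url)
        results.extend(page.get("results", []))
        nxt = page.get("_links", {}).get("next")
        url = base + nxt if nxt else None
    return results

# Usage (not run here): grab every page the app has indexed.
# query = (SITE + "/rest/api/search?limit=200"
#          "&expand=content.metadata.properties.appsplusBrokenLinks"
#          "&cql=appsplusBrokenLinksIndexed%20%3E%200")
# all_results = paginate(get_json, query, base=SITE)
```

From there you'd filter `all_results` the same way the jq expression does.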
Thanks @Darryl Lee that was an awesome reply. Yeah we included those index key names early on just in case the more technical people wanted to do advanced CQL queries. You can see the full list at the bottom of the descriptor: https://broken-links.appsplus.co/atlassian-connect.json
@Sally we recently added a feature in the report page of our app (top navbar >> Apps >> Broken Links+) which lists all the domains/hostnames for both broken and unbroken links.
eg if you want to get all pages which include Google Docs links you can search "docs.google.com" in the top-right search box on that report page.
We probably should have renamed the app after that feature release so I've done that now: "Broken & Outbound Links+ for Confluence" :)
Happy to build-in other features if there's a specific use-case you're looking for. Just shoot us an email: support@appsplus.co
Hey, thanks @Nathan Waters - AppsPlus ! Exciting to get to talk to the developer directly.
Oooh man, looking at the data, I feel like I'm so close to not having to grab your app's metadata for every single page. IF ONLY I could do a CQL query for:
appsplusBrokenLinksHostnames > 0
OR
appsplusBrokenLinksHostnames NOT EMPTY
OR
appsplusBrokenLinksHostnames!=NULL
But none of those work. Nor any other variations. I don't suppose you know any tricks to allow CQL to search for non-empty value in a property key's string object?
appsplusBrokenLinksHostnames is an array of hostnames for both broken and unbroken links. It's not the full link, just a unique array of hostnames (eg docs.google.com). Since CQL is only a search for content I think you'd need to query every page to fetch every link.
Best if Sally specifies the use-case. I'm not sure why someone would want a full list of raw links other than to do some secondary query/filter on that data. If I knew that secondary intent, it'd be easier to build for it.
If the intent is something like "find all pages with Google Docs links" then that feature is already built in.
Oooh, cool question. At first I was thinking there wouldn't be a way to do this without crawling the entire site and looking at the content of every page.
But then I remembered that Page Information for a page has a section called "Outbound Links".
ALAS, poking around while loading the page, I unfortunately couldn't see any API calls we could use to get that information.
There is a Suggestion for an API to provide this info:
It's a bit old though, so I wouldn't hold my breath. :-/
Regardless, I still think you'd have to write a script that grabs the Page Information for every single page on your site. Definitely not very efficient.
Still, faster (hopefully) than having to download all the content for every page and extract that.
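If you did go the brute-force route, the per-page extraction step could look something like this in Python: request each page with `expand=body.storage` and pull the hrefs out of the storage-format HTML. This is just a sketch; `page_links` and `SITE` are made-up names, and the parser only catches plain `<a href>` anchors (Confluence stores internal page links with its own `ac:link` markup, which this would miss):

```python
import json
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags in storage-format HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(storage_html):
    parser = LinkCollector()
    parser.feed(storage_html)
    return parser.links

SITE = "https://YOURSITE.atlassian.net/wiki"  # placeholder

def page_links(page_id, auth_header):
    """Hypothetical helper: fetch one page's storage body, return its links."""
    url = f"{SITE}/rest/api/content/{page_id}?expand=body.storage"
    req = urllib.request.Request(url, headers={"Authorization": auth_header})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return extract_links(data["body"]["storage"]["value"])
```

You'd still have to call `page_links` once per page, which is exactly the inefficiency I was grumbling about.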