How to check if the links on a page are valid?

Dmitri Bukovski December 13, 2022

Hi, I've got a space that hasn't been updated for a long time and I need to migrate some of its contents to another space. The problem is that since the content hasn't been updated for years, I strongly suspect that many of its links are already dead.

Can anybody suggest how I could check the links?

 

So far my approach is to get body.view from all pages in a given space and extract all links from it. The problem is that there are far too many links to check manually, so I want to use a Python script to iterate over them.

 

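import requests
from requests.auth import HTTPDigestAuth

# Placeholders: substitute a real page URL and your own credentials.
link = "https://example.com/some/page"
user = "user_id"
password = "secret"

r = requests.get(link, auth=HTTPDigestAuth(user, password))
code = r.status_code
print(code)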

 

My first attempt was simply to use the requests library to get the status codes, but no matter what link I pass to it, I get a "200" status, even when I request a non-existent page.
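For the extraction step mentioned above, here is a minimal sketch using only the standard library. It assumes the body.view HTML of a page is already available as a string; `LinkCollector`, `extract_links` and the base URL are hypothetical names for illustration, not part of any Confluence API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links become absolute so they can be requested later.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the unique http(s) links found in html, in order of appearance."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    unique = []
    for link in parser.links:
        # Skip mailto:, javascript: and similar schemes, and drop duplicates.
        if link.startswith(("http://", "https://")) and link not in unique:
            unique.append(link)
    return unique
```

Removing duplicates before checking keeps the number of requests down when the same link appears on many pages.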

1 answer

Answer accepted
Dmitri Bukovski March 12, 2024

Later I realized that every request returned 200 because my requests were being redirected to the authentication page. I updated the code with proper authentication and now it works as intended. Maybe someone will find it useful for a similar task too.

 

import getpass
import time
import urllib3

# Browser-like User-Agent, written as one string split across lines.
agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
         "AppleWebKit/537.36 (KHTML, like Gecko) "
         "Chrome/108.0.0.0 Safari/537.36")

passwd = getpass.getpass("type your password: ")

def link_runner_auth(link):
    """Return the HTTP status of link, or 'ERROR' if the request fails."""
    http = urllib3.PoolManager()
    # Replace user_id with your own login name.
    headers = urllib3.make_headers(basic_auth=f'user_id:{passwd}',
                                   user_agent=agent)
    try:
        response = http.request('GET', link, headers=headers).status
        time.sleep(0.5)  # short pause so the server is not flooded
    except KeyboardInterrupt:
        print("Keyboard interrupt")
        raise  # let Ctrl+C stop the whole run
    except Exception:
        response = 'ERROR'

    print(f'{link} ---- {response}')
    return response
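Since the original problem was a silent redirect to the login page, it may also help to compare the requested URL with the URL the client finally landed on (with requests, that is `r.url`). The sketch below is only an illustration: `classify` and `LOGIN_MARKERS` are hypothetical names, and the `/login.action` path is an assumption that should be adjusted to whatever your instance uses.

```python
from urllib.parse import urlsplit

# Assumption: path fragment that identifies the login page on your site.
LOGIN_MARKERS = ("/login.action",)


def classify(requested_url, final_url, status):
    """Label a check result as 'ok', 'login-redirect' or 'dead'."""
    # No status at all (request failed) or a 4xx/5xx code means a dead link.
    if status is None or status >= 400:
        return "dead"
    requested_path = urlsplit(requested_url).path
    final_path = urlsplit(final_url).path
    asked_for_login = any(m in requested_path for m in LOGIN_MARKERS)
    landed_on_login = any(m in final_path for m in LOGIN_MARKERS)
    # A 200 from the login page tells us nothing about the page we asked for.
    if landed_on_login and not asked_for_login:
        return "login-redirect"
    return "ok"
```

A result of "login-redirect" means the authentication did not go through, so the 200 status should not be trusted for that link.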

 
