You can use Acrolinx to check the validity of the URLs you use in your content.
When you check your content with an Acrolinx Integration, Acrolinx visits each URL to see whether it's still valid. URLs that aren't valid appear as Style issues.
You can use Acrolinx to check for URLs in the following formats:
URLs that you write in the text. For example:
URLs that you include as attributes of an HTML or XML tag. For example:
You can check URLs in all XML attributes except URLs to namespaces or document types.
Acrolinx won't highlight the attribute when you click on the URL issue card in the Sidebar.
Turn on Checking for Broken URLs
- Open the coreserver.properties file. From the Dashboard, go to Maintenance > Configuration Properties, then follow the folder structure config > server > bin and click the file coreserver.properties. You can then edit the properties directly from the Dashboard.
Alternatively, you can edit coreserver.properties from the configuration directory:
- Add the following property:
- Click Save and restart your core server.
Here is a list of additional properties you might want to use:
Don't highlight redirected URLs.
The default value is true.
The maximum amount of time that Acrolinx spends checking URLs in one check.
The default value is 1000 milliseconds.
Acrolinx Integrations can't display the results until Acrolinx has validated all the URLs or the maximum wait time is exceeded.
How often Acrolinx rechecks each URL in the URL cache.
The default value is every 720 minutes (12 hours) after the URL was last checked.
The maximum number of URLs that the cache can contain before the oldest URLs get overwritten.
Save the CSV file that lists the URL cache in a different location.
The default value is linkchecker/linkrepository.csv.
Remember that the path has to be relative to the directory: \server\www\output\
How Does Acrolinx Check for Broken URLs?
Let's dive into how Acrolinx looks for and finds broken URLs.
When you run a check, Acrolinx finds the URLs in your content and assigns each one a status, for example 'Page not found'.
Acrolinx grabs the HTTP header information from each new URL, and saves the URL in a cache that's stored on the Core Platform. Acrolinx saves each URL in the cache with a timestamp of when it was checked and the validity of the URL.
Acrolinx reviews the URL cache every 5 minutes to see if there are any URLs due for rechecking. You can change the time interval for rechecking URLs. By default, each URL in the cache is rechecked every 12 hours. Acrolinx rechecks any URLs that are older than this time interval, and updates the timestamp and status information.
When you check, Acrolinx searches the cache for URLs that it has already validated in previous checks. If Acrolinx finds a match in the cache, Acrolinx looks at the URL's validation information. If the status says that it's an invalid URL, Acrolinx passes the status information to the Acrolinx Integration as a Style issue.
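The caching behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Acrolinx's implementation: the class names, method names, and status labels ("ok", "not_found") are hypothetical, and the only value taken from this page is the default recheck interval of 720 minutes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Default recheck interval from this page: 720 minutes (12 hours).
RECHECK_INTERVAL = timedelta(minutes=720)


@dataclass
class CacheEntry:
    status: str           # hypothetical status label, e.g. "ok" or "not_found"
    checked_at: datetime  # when the URL was last validated


class UrlCache:
    """Illustrative in-memory stand-in for the URL cache (hypothetical API)."""

    def __init__(self, recheck_interval: timedelta = RECHECK_INTERVAL) -> None:
        self.recheck_interval = recheck_interval
        self.entries: Dict[str, CacheEntry] = {}

    def store(self, url: str, status: str, now: datetime) -> None:
        # Save (or refresh) the URL with its status and a new timestamp.
        self.entries[url] = CacheEntry(status=status, checked_at=now)

    def lookup(self, url: str) -> Optional[str]:
        # Return the cached status, or None if the URL is not in the cache.
        entry = self.entries.get(url)
        return entry.status if entry else None

    def due_for_recheck(self, now: datetime) -> List[str]:
        # URLs whose last check is older than the recheck interval.
        return [
            url
            for url, entry in self.entries.items()
            if now - entry.checked_at > self.recheck_interval
        ]
```

In this sketch, a periodic job (every 5 minutes in the description above) would call due_for_recheck, validate each returned URL again, and call store with the new status and timestamp.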
You can see all the URLs in the cache as a CSV (comma-separated values) file. To open the CSV file, use the URL:
Acrolinx looks for any text that begins with "http://", "https://" or "www" and validates the URL syntax according to standard URL syntax guidelines. Any URLs that don't conform to the correct URL syntax get the status 'ERROR' in the URL cache.
Example of invalid URL syntax: http://emptyURL:nothing
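To make the detection and syntax step more concrete, here is a rough Python sketch. It assumes that a URL candidate runs up to the next whitespace character and uses Python's standard urllib.parse as a stand-in for the syntax rules. The function names are hypothetical, and Acrolinx's own validation may be stricter or more lenient.

```python
import re
from typing import List
from urllib.parse import urlsplit

# Text that begins with "http://", "https://" or "www", up to the next whitespace.
URL_CANDIDATE = re.compile(r"\b(?:https?://|www)\S+")


def find_url_candidates(text: str) -> List[str]:
    """Return every substring that looks like the start of a URL."""
    return URL_CANDIDATE.findall(text)


def has_valid_syntax(url: str) -> bool:
    """Rough syntax check: the URL needs a host and, if present, a numeric port."""
    # Give "www" URLs a scheme so that urlsplit can locate the host.
    if url.startswith("www"):
        url = "http://" + url
    try:
        parts = urlsplit(url)
        parts.port  # raises ValueError when the port is not a valid number
    except ValueError:
        return False
    return bool(parts.hostname)
```

With this sketch, the example http://emptyURL:nothing fails the check because "nothing" is not a valid port number.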
|URL hasn't yet been checked.|
|Response code 200 to 210. Acrolinx received the HTTP header information from the URL destination.|
|Response code 300 to 307. Acrolinx could connect to the URL, but the request was redirected.|
|Response code 404 from the web server. An UnknownHostException is also mapped to this state.|
|The request timed out.|
|All other unknown server and client errors (response code 400-449, except 404, and 500-510). Also indicates Java exceptions during the check procedure.|
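The mapping in the table above could be expressed roughly as follows. The status labels returned here ("unchecked", "ok", "redirected", "not_found", "timeout", "error") are placeholders invented for this sketch and are not the names Acrolinx writes to the URL cache. The function name and its parameters are likewise hypothetical.

```python
from typing import Optional


def classify_response(
    code: Optional[int],
    timed_out: bool = False,
    unknown_host: bool = False,
) -> str:
    """Map the outcome of a URL request to a placeholder status label."""
    if timed_out:
        return "timeout"      # the request timed out
    if unknown_host or code == 404:
        return "not_found"    # 404, or the host could not be resolved
    if code is None:
        return "unchecked"    # the URL hasn't been checked yet
    if 200 <= code <= 210:
        return "ok"           # header information was received
    if 300 <= code <= 307:
        return "redirected"   # connected, but the request was redirected
    return "error"            # all other server and client errors
```

For example, classify_response(301) returns "redirected" and classify_response(403) returns "error".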