Configure How Acrolinx Checks for Broken URLs


You can use Acrolinx to check the validity of the URLs you use in your content.

When you check your content with an Acrolinx Integration, Acrolinx visits each URL to see if it's still valid. URLs that aren't valid appear as Style issues.

You can use Acrolinx to check for URLs in the following formats:

  • URLs that you write in the text. For example:

    Visit our website http://www.acrolinx.com for more information.

    You can check URLs in all XML attributes except URLs to namespaces or document types.

  • URLs that you include as attributes of an HTML or XML tag. For example:

    Visit our <A HREF="http://www.acrolinx.com"> website </A>
    
    Visit our website at <xref format="html" href="http://www.acrolinx.com" scope="external"></xref>

    Acrolinx won't highlight the attribute when you click on the URL issue card in the Sidebar.

Turn on Checking for Broken URLs

  1. Open the coreserver.properties file.

    To edit coreserver.properties from the Dashboard, go to Maintenance > Configuration Properties, then follow the folder structure config > server > bin and click on the file coreserver.properties. You can then edit the properties directly from the Dashboard.

    Alternatively, you can edit coreserver.properties from the configuration directory: %ACROLINX_CONFIGURATION_ROOT%\server\bin\coreserver.properties

  2. Add the following property:

    enableLinkChecker=true

  3. Click Save and restart your core server.
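If you manage coreserver.properties with a script instead of the Dashboard, the edit in step 2 could be sketched as follows. This is only an illustration: the helper name set_property is hypothetical, and it assumes every entry uses the simple key=value form. You still need to restart your core server afterwards.

```python
def set_property(properties_text: str, key: str, value: str) -> str:
    """Return the properties text with key set to value.

    Replaces the first existing entry for key, or appends a new line.
    Assumption: entries use the plain key=value form.
    """
    lines = properties_text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and comment lines untouched.
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = new_line
            break
    else:
        # The key wasn't present, so add it at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

For example, set_property(text, "enableLinkChecker", "true") returns the updated file content, which you can then write back to coreserver.properties.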


Here is a list of additional properties you might want to use:

Property
    Description

linkChecker.flagRedirects=false
    Don't highlight redirected URLs.
    The default value is true.

linkChecker.maxWaitTimeInMs
    The maximum amount of time that Acrolinx spends checking URLs in one check.
    The default value is 1000 milliseconds.
    Acrolinx Integrations can't display the results until Acrolinx validates all the URLs or exceeds the maximum wait time.
    For example, linkChecker.maxWaitTimeInMs=30000

linkChecker.refreshIntervalInMin
    How often Acrolinx rechecks each URL in the URL cache.
    The default value is every 720 minutes (12 hours) after the URL was last checked.
    For example, linkChecker.refreshIntervalInMin=60

linkChecker.maxCacheSize
    The maximum number of URLs that the cache can contain before the oldest URLs get overwritten.
    For example, linkChecker.maxCacheSize=100000

linkChecker.linkRepositoryPath
    The location of the CSV file that lists the URL cache. Set this property to save the file in a different location.
    The default value is linkchecker/linkrepository.csv.
    Remember that the path has to be relative to the directory: \server\www\output\
    For example, linkChecker.linkRepositoryPath=NEWDIR/NEWNAME.csv

How Does Acrolinx Check for Broken URLs?

Let's dive into how Acrolinx looks for and finds broken URLs.

When you run a check, Acrolinx finds the URLs in your content and classifies each one by status. For example, 'Page not found'.

Acrolinx grabs the HTTP header information from each new URL, and saves the URL in a cache that's stored on the Core Platform. Acrolinx saves each URL in the cache with a timestamp of when it was checked and the validity of the URL.

Acrolinx reviews the URL cache every 5 minutes to see if there are any URLs due for rechecking. You can change the time interval for rechecking URLs with the property linkChecker.refreshIntervalInMin. The default time interval for rechecking a URL in the cache is every 12 hours. Acrolinx rechecks any URLs that are older than this time interval, and updates the timestamp and status information.

When you check, Acrolinx searches the cache for URLs that it has already validated in previous checks. If Acrolinx finds a match in the cache, Acrolinx looks at the URL's validation information. If the status says that it's an invalid URL, Acrolinx passes the status information to the Acrolinx Integration as a Style issue.
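The caching behavior described above can be pictured with a small sketch. This is not Acrolinx's implementation: the names UrlCache, CacheEntry, lookup, recheck_due, and fetch_status are hypothetical, and the caller passes in the current time so that the example needs no network access or clock.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Documented default for linkChecker.refreshIntervalInMin (12 hours).
REFRESH_INTERVAL_MIN = 720


@dataclass
class CacheEntry:
    status: str        # for example "OK" or "PAGE_NOT_FOUND"
    checked_at: float  # timestamp of the last check, in seconds


class UrlCache:
    """Hypothetical model of a URL cache with timestamps and rechecks."""

    def __init__(self, fetch_status: Callable[[str], str],
                 refresh_interval_min: int = REFRESH_INTERVAL_MIN):
        # fetch_status stands in for the real HTTP header request.
        self._fetch_status = fetch_status
        self._refresh_seconds = refresh_interval_min * 60
        self._entries: Dict[str, CacheEntry] = {}

    def lookup(self, url: str, now: float) -> str:
        """Return the cached status, validating the URL first if it's new."""
        entry = self._entries.get(url)
        if entry is None:
            entry = CacheEntry(self._fetch_status(url), now)
            self._entries[url] = entry
        return entry.status

    def recheck_due(self, now: float) -> List[str]:
        """Recheck entries older than the refresh interval.

        Returns the URLs that were rechecked. A scheduler would call this
        periodically (the article mentions every 5 minutes).
        """
        rechecked = []
        for url, entry in self._entries.items():
            if now - entry.checked_at >= self._refresh_seconds:
                entry.status = self._fetch_status(url)
                entry.checked_at = now
                rechecked.append(url)
        return rechecked
```

In this model, a second lookup of the same URL returns the stored status without contacting the URL again, and only recheck_due refreshes the status and timestamp.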

You can see all the URLs in the cache as a CSV (comma-separated values) file. To open the CSV file, use the URL: 

http://<SERVER_ADDRESS>:8031/output/linkchecker/linkrepository.csv
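If you want to work with the CSV file in a script, a minimal sketch might look like this. The function names are hypothetical, the default path mirrors the documented default of linkChecker.linkRepositoryPath, and the sketch makes no assumption about which columns the file contains.

```python
import csv
import io
from typing import List


def link_repository_url(server_address: str, port: int = 8031,
                        path: str = "linkchecker/linkrepository.csv") -> str:
    """Build the URL of the cache CSV file from the pattern shown above."""
    return f"http://{server_address}:{port}/output/{path}"


def parse_rows(csv_text: str) -> List[List[str]]:
    """Split downloaded CSV text into rows of cells, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(csv_text)) if row]
```

If you changed linkChecker.linkRepositoryPath, pass the new relative path as the path argument.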

URL Syntax

Acrolinx looks for any text that begins with "http://", "https://" or "www" and validates the URL syntax according to standard URL syntax guidelines. Any URLs that don't conform to the correct URL syntax get the status 'ERROR' in the URL cache.

Example of invalid URL syntax: http://emptyURL:nothing

Acrolinx also logs URLs with the status 'ERROR' when Acrolinx can't contact the URLs because of a disruption in internet connectivity. When your internet connection is down, URLs with the status 'ERROR' are logged in the URL cache only and won't appear as Style issues.
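To get a feel for the detection and syntax check described in this section, here is a rough sketch. The exact rules Acrolinx applies aren't documented here, so this stand-in uses a simple regular expression and Python's urllib.parse. The function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Text that begins with "http://", "https://" or "www", up to the next space.
URL_START = re.compile(r"\b(?:https?://|www)\S+")


def find_url_candidates(text: str) -> list:
    """Return URL-like strings, without trailing sentence punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_START.findall(text)]


def has_valid_syntax(url: str) -> bool:
    """Approximate syntax check: http(s) scheme, a host, and a usable port."""
    if url.startswith("www"):
        # Assumption: treat a bare "www" address as an http URL.
        url = "http://" + url
    try:
        parts = urlsplit(url)
        parts.port  # raises ValueError if the port isn't a valid number
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

With this sketch, http://emptyURL:nothing fails the check because "nothing" isn't a valid port number, while http://www.acrolinx.com passes.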

URL Status

Status          Description
NOT_CHECKED     URL hasn't yet been checked.
OK              Response code 200 to 210. Acrolinx received the HTTP header information from the URL destination.
REDIRECTED      Response code 300 to 307. Acrolinx could connect to the URL, but the request was redirected.
PAGE_NOT_FOUND  Response code 404 from the web server. An UnknownHostException is also mapped to this state.
TIMEOUT         The request timed out.
ERROR           All other unknown server and client errors (response code 400-449, except 404, and 500-510). Also indicates Java exceptions during the check procedure.
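The status table can be read as a simple mapping from the outcome of a request to a status. The sketch below illustrates that mapping and isn't Acrolinx's code. The function name classify and its parameters are hypothetical, and it assumes that any response code outside the documented ranges, or a failed request with no response code, counts as 'ERROR'.

```python
from typing import Optional


def classify(response_code: Optional[int], timed_out: bool = False,
             unknown_host: bool = False) -> str:
    """Map the outcome of one URL request to a status from the table."""
    if timed_out:
        return "TIMEOUT"
    # An unknown host is treated like a 404 response.
    if unknown_host or response_code == 404:
        return "PAGE_NOT_FOUND"
    if response_code is None:
        # Assumption: no response and no known cause means an exception.
        return "ERROR"
    if 200 <= response_code <= 210:
        return "OK"
    if 300 <= response_code <= 307:
        return "REDIRECTED"
    # Assumption: everything else, including 400-449 and 500-510.
    return "ERROR"
```

NOT_CHECKED doesn't appear in the function because it describes a URL before any request has been made.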