Data Center

Add content to your library, manage web crawling, and activate connectors.

Before you can track your content in Content Cube, you'll need to tell Acrolinx where to find it. You'll do this by adding content to your content library. Think of your content library as the Acrolinx version of a song library. It's where you store content that you'll later organize into collections (like playlists).

When you add content to your library, Acrolinx reviews it in two steps: a crawl and a check. You'll kick off the process in the data center. Content Cube will run an initial crawl and check of your content to establish a baseline. Then, it will automatically crawl and check your content library on a weekly basis.

To learn how to start your first crawl, read on.

Start a Crawl

To identify checkable text on your website, Acrolinx uses a crawler with the user agent Acrolinx-bot. All you have to do is provide Acrolinx-bot with the domains and subdomains that you want to crawl. For example, acrolinx.com and docs.acrolinx.com. Once you add a domain, Acrolinx-bot automatically crawls all of the content in that domain on a weekly basis. You can run up to 100 individual crawls at a time. Learn more about Acrolinx-bot.
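
If you want to confirm that the crawler is reaching your site, one option is to look for its user agent in your web server's access logs. Here's a minimal sketch in Python, assuming the user-agent string contains Acrolinx-bot and that your server writes a plain-text access log (the log path is a placeholder):

    # Count requests from Acrolinx-bot in a web server access log.
    # Assumes the user-agent string contains "Acrolinx-bot"; the log
    # path is a placeholder for wherever your server writes access logs.
    hits = 0
    with open("/var/log/nginx/access.log") as log:
        for line in log:
            if "Acrolinx-bot" in line:
                hits += 1
    print(f"Acrolinx-bot requests: {hits}")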

To make sure that Acrolinx captures the right content, you can also fine-tune a crawl. If you work in marketing and want to review the content that you use to convert prospects, you might make sure that Acrolinx crawls URLs with paths like /product/ or /products/.

When you add a domain to Content Cube, you don't have to include the subdomain www. But the root domain (let's say acrolinx.com) will sometimes redirect to a URL that includes a www. For example, www.acrolinx.com. If this happens, the crawler might identify only one page for acrolinx.com, but many more pages for www.acrolinx.com.
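
Not sure whether your root domain redirects? One quick way to check is to request it without following redirects and look at the response headers. Here's a minimal sketch in Python, assuming the third-party requests package (the domain is just an example):

    # Check whether a root domain redirects to a www URL.
    # Assumes the third-party "requests" package is installed.
    import requests

    resp = requests.head("https://acrolinx.com", allow_redirects=False, timeout=10)
    print(resp.status_code)              # 301 or 302 indicates a redirect
    print(resp.headers.get("Location"))  # e.g. https://www.acrolinx.com/

If you see a redirect to a www URL, that www subdomain is likely where the crawler will find most of your pages.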

To add a new domain to your content library, do the following:

  1. Go to Profile and settings > Admin Console.
  2. In the WEB CRAWLING tab of the data center, click the plus icon Add new domain to open the Web Crawler Setup.
  3. Enter the domain or subdomain that you want to crawl. For example, docs.acrolinx.com.

    Be sure to leave out the protocol. For example, http:// or https://.

    • Do add: docs.acrolinx.com
    • Don't add: http://docs.acrolinx.com
  4. Optional: Fine-tune your crawl with the following settings:

    Already have some web-crawling experience? Learn how to customize your crawl with our advanced crawl settings.
    Max. pages to crawl - Defines approximately how many pages Acrolinx should crawl.

    Max. crawl depth - Determines how many links away from the start URL Acrolinx-bot will follow during a single crawl. For example, a crawl depth of 2 covers the start page, the pages it links to, and the pages those pages link to.

    Crawl these paths - Limits the crawl to certain pages within a domain. When you list one or more of the paths that follow the root domain in a page’s URL, Acrolinx does the following:

    • Automatically adds the path to the virtual robots.txt file as allow:[input]. This tells Acrolinx to only visit URLs where that path comes directly after the domain. For example, my.domain/blog.
    • Uses the URL as an alternative_start_url.
    • Automatically adds disallow: / to the virtual robots.txt file. This keeps Acrolinx from crawling anything other than the paths you list.

    If you add /blog under Crawl these paths, for example, the crawler will only access pages that have my.domain/blog in the URL (see the sketch after these steps). To include multiple paths, list each path on a separate line. For example:

    /blog
    /news/articles/product-updates

    Don't crawl these paths - Ignores certain pages within a domain during a crawl. When you list one or more of the paths that follow the root domain in a page’s URL, the paths are added to the virtual robots.txt file as disallow:[input]. This tells Acrolinx not to follow URLs with those paths.

    If you add /blog under Don't crawl these paths, for example, the crawler won't access pages with my.domain/blog in the URL. To exclude multiple paths, list each path on a separate line. For example:

    /blog
    /news/articles/product-updates
  5. Click Save to start your crawl.
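
To picture what the path settings in step 4 produce, here's a sketch of the virtual robots.txt rules described above, with /blog listed under Crawl these paths. It uses Python's standard robots.txt parser as a stand-in to test which URLs such rules would match; it's an illustration of the documented allow/disallow behavior, not the crawler's actual internals:

    # Illustrative stand-in for the virtual robots.txt described in step 4.
    # The user-agent line and the parser itself are assumptions for this sketch.
    from urllib import robotparser

    rules = [
        "User-agent: *",
        "Allow: /blog",  # from "Crawl these paths"
        "Disallow: /",   # added automatically to block everything else
    ]
    parser = robotparser.RobotFileParser()
    parser.parse(rules)
    parser.modified()  # mark the rules as loaded so can_fetch() will answer

    print(parser.can_fetch("*", "https://my.domain/blog/post-1"))  # True
    print(parser.can_fetch("*", "https://my.domain/pricing"))      # False

Listing /blog under Don't crawl these paths instead would produce disallow: /blog on its own, and the two results above would flip.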

Track the Status of a Crawl

To keep an eye on how your crawl or check is doing, you can check the table in the WEB CRAWLING tab of the data center. Here, you'll see the following information:

Domain - Name of the start domain or subdomain for the crawl.

Allowed URL Paths - List of allowed paths specified when the crawl was created.

Crawl Summary - Summary of the latest crawl. This includes:

  • Pages crawled - Number of pages crawled so far.
  • Crawl length - Time spent crawling.
  • Initiated - Time and date a domain was sent to Acrolinx for crawling.
  • Started - Time and date Acrolinx started to crawl a domain.
  • Finished - Time and date of crawl completion.

The status column shows where the crawl stands:

  • Crawling - Crawl in progress.
  • Checking - Acrolinx running a check.
  • Finished - Crawling and checking complete.
  • Canceled - Crawl was canceled manually.

Check Summary - Summary of the latest check. This includes:

  • Pages checked - Number of pages that Acrolinx has checked.
  • Pages failed - Number of pages that Acrolinx couldn't check.

A status icon shows whether the check is in progress, complete, or incomplete.

Edit Crawl Settings

After you add a domain or subdomain to your content library, you can always go back and adjust the related crawl settings.

To edit the settings for a crawl, do the following:

  1. Go to Profile and settings > Admin Console.
  2. In the WEB CRAWLING tab of the data center, click Edit crawl settings next to the domain or subdomain you want to change.
  3. Adjust the settings for your crawl.
  4. Click Save to update your settings and run a recrawl.

Recrawl a Domain

Acrolinx recrawls and rechecks your content on a weekly basis. If you've made changes and want to recrawl your whole domain immediately, you can trigger the process manually.

To manually trigger a recrawl, do the following:

  1. Go to Profile and settings > Admin Console.
  2. In the WEB CRAWLING tab of the data center, click Recrawl this domain next to the domain or subdomain you want to update.

Delete a Domain

To permanently delete a domain from your content library, do the following:

  1. Go to Profile and settings > Admin Console.
  2. In the WEB CRAWLING tab of the data center, click the trash icon Delete from content library next to the domain or subdomain you want to delete.
  3. Click Delete in the confirmation dialog to permanently delete the domain from your content library.

Advanced Crawl Settings

If you're familiar with the ins and outs of web crawling, you can customize your crawl with advanced crawl settings in Content Cube. These settings are useful if you only want to track certain parts of your website. They can also help if Acrolinx hits any snags during a crawl.

To add advanced crawl settings, do the following:

  1. Go to Profile and settings > Admin Console.
  2. Open the Web Crawler Settings window.
    • For a new domain or subdomain, click WEB CRAWLING > Add new domain.
    • For an existing domain or subdomain, click Edit crawl settings next to the domain name.
  3. Choose the settings that you'd like to apply to your crawl:

    Never crawl URLs with query parameters - Turns off the option to specify query parameters. Selected by default.

    Only crawl URLs with these query parameters - Specify the query parameters that you want the Acrolinx-bot to crawl. For example, the URL my.domain/search?lang=en uses the query parameter lang.

    Never crawl URLs with these query parameters - Specify the query parameters that you want the Acrolinx-bot to ignore.

    Respect nofollow tags - The Acrolinx-bot will obey nofollow directives and skip links marked as nofollow.

    Respect noindex tags - The Acrolinx-bot will obey noindex directives and skip pages marked as noindex.

    Follow alternates - The Acrolinx-bot will crawl any links listed as "alternate."

    Turn on AJAX crawling - The Acrolinx-bot will crawl AJAX applications.

    Follow canonicals - The Acrolinx-bot will crawl any URLs mentioned in canonical tags.

    Turn on JavaScript crawling - The Acrolinx-bot will crawl JavaScript-rendered content.

    Follow HTTP redirects (3xx) - The Acrolinx-bot will crawl every page in a page's redirect chain.

    Turn on mobile crawling - The Acrolinx-bot will identify itself as a mobile device. By default, the Acrolinx-bot identifies itself as a desktop device.

    Follow links on error pages (4xx and 5xx) - The Acrolinx-bot will crawl any links on 4xx and 5xx error pages.

    Crawl Behind Sign-In - Provide sign-in details for a password-protected site that you want the Acrolinx-bot to crawl.

    Custom Request Headers - Specify any authentication headers needed for the Acrolinx-bot to access your content (see the sketch after these steps).
  4. Click Save to start crawling.
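
For example, if your content sits behind header-based authentication, Custom Request Headers is where you'd supply something like the header below. Before handing a header to the crawler, you can confirm yourself that it grants access. Here's a minimal sketch in Python, again assuming the third-party requests package; the header value and URL are placeholders:

    # Confirm that an authentication header grants access before giving
    # the same header to the crawler. Header value and URL are placeholders.
    import requests

    headers = {"Authorization": "Bearer YOUR-TOKEN"}
    resp = requests.get("https://docs.example.com/", headers=headers, timeout=10)
    print(resp.status_code)  # 200 means the header works; 401 or 403 means it doesn't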