Data Center
Add content to your library, manage web crawling, and activate connectors.
Before you can track your content in Content Cube, you'll need to tell Acrolinx where to find it. You'll do this by adding content to your content library. Think of your content library as the Acrolinx version of a song library. It's where you store content that you'll later organize into collections (like playlists).
When you add content to your library, Acrolinx reviews it in two steps: a crawl and a check. You'll kick off the process in the data center. Content Cube runs an initial crawl and check of your content to give you a baseline. Then, it automatically recrawls and rechecks your content library on a weekly basis.
To learn how to start your first crawl, read on.
Start a Crawl
To identify checkable text on your website, Acrolinx uses a crawler with the user agent `Acrolinx-bot`. All you have to do is provide Acrolinx-bot with the domains and subdomains that you want to crawl, for example, `acrolinx.com` and `docs.acrolinx.com`. Once you add a domain, Acrolinx-bot automatically crawls all of the content in that domain on a weekly basis. You can run up to 100 individual crawls at a time. Learn more about Acrolinx-bot.
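Want to confirm that Acrolinx-bot is actually reaching your site? One option is to filter your web server's access logs for its user agent. Here's a minimal sketch, assuming a typical log format that records the user agent on each request line (the file name is a placeholder):

```python
# A minimal sketch: list requests from Acrolinx-bot in a web server
# access log. "access.log" is an assumed file name; adjust the path
# and format for your server.
with open("access.log", encoding="utf-8") as log:
    hits = [line.rstrip() for line in log if "Acrolinx-bot" in line]

print(f"Found {len(hits)} requests from Acrolinx-bot")
for line in hits[:10]:  # show the first few matches
    print(line)
```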
To make sure that Acrolinx captures the right content, you can also fine-tune a crawl. If you work in marketing and want to review the content that you use to convert prospects, you might make sure that Acrolinx crawls URLs with paths like `/product/` or `/products/`.
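Before you configure anything, you could also preview which of your URLs such a path filter would keep. A quick sketch with made-up example URLs:

```python
# A sketch with assumed example URLs: preview which pages a path
# filter like /product/ or /products/ would match.
from urllib.parse import urlparse

urls = [
    "https://acrolinx.com/product/overview",   # matches
    "https://acrolinx.com/products/pricing",   # matches
    "https://acrolinx.com/blog/latest-news",   # filtered out
]

for url in urls:
    if urlparse(url).path.startswith(("/product/", "/products/")):
        print("crawl:", url)
```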
When you add a domain to Content Cube, you don't have to include the subdomain `www`. But the root domain (let's say `acrolinx.com`) will sometimes redirect to a URL that includes a `www`, such as `www.acrolinx.com`. If this happens, the crawler might only identify one page for `acrolinx.com`, but many more pages for `www.acrolinx.com`.
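Not sure which variant your site serves? You can follow the redirect chain for the root domain before you add it. Here's a minimal sketch using Python's standard library; the domain is just an example, and some servers answer HEAD requests differently than GET:

```python
# A minimal sketch: follow redirects for a root domain to see whether
# it lands on a www URL, then add whichever variant the site serves.
from urllib.request import Request, urlopen

def final_url(domain: str) -> str:
    # urlopen follows 3xx redirects automatically and exposes the
    # final URL on the response object.
    req = Request(f"https://{domain}/", method="HEAD")
    with urlopen(req, timeout=10) as resp:
        return resp.url

print(final_url("acrolinx.com"))  # might print https://www.acrolinx.com/
```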
To add new content to your library, do the following:
- Go to Profile and settings > Admin Console.
- In the WEB CRAWLING tab of the data center, click the plus icon Add new domain to open the Web Crawler Setup.
- Enter the domain or subdomain that you want to crawl. For example, `docs.acrolinx.com`. Be sure to leave out the protocol (`http://` or `https://`).
  - Do add: `docs.acrolinx.com`
  - Don't add: `http://docs.acrolinx.com`
- Optional: Fine-tune your crawl with the following settings. Already have some web-crawling experience? Learn how to customize your crawl with our advanced crawl settings.
  - Max. pages to crawl: Defines approximately how many pages Acrolinx should crawl.
  - Max. crawl depth: Determines how many link levels below the start URL Acrolinx-bot will follow during a single crawl.
  - Crawl these paths: Limits the crawl to certain pages within a domain. When you list one or more of the paths that follow the root domain in a page's URL, Acrolinx does the following:
    - Automatically adds each path to the virtual robots.txt file as `allow:[input]`. This tells Acrolinx to only visit URLs with that specific path directly after the domain name. For example, `my.domain/blog`.
    - Uses the URL as an `alternative_start_url`.
    - Automatically adds `disallow: /` to the virtual robots.txt file. This keeps Acrolinx from crawling anything other than the paths you list.
    If you add `/blog` under Crawl these paths, for example, the crawler will only access pages that have `my.domain/blog` in the URL (see the sketch after these steps). To include multiple paths, list each path on a separate line. For example:
    `/blog`
    `/news/articles/product-updates`
  - Don't crawl these paths: Ignores certain pages within a domain during a crawl. When you list one or more of the paths that follow the root domain in a page's URL, the paths are added to the virtual robots.txt file as `disallow:[input]`. This tells Acrolinx not to follow URLs with those paths. If you add `/blog` under Don't crawl these paths, for example, the crawler won't access pages with `my.domain/blog` in the URL. To exclude multiple paths, list each path on a separate line.
- Click Save to start your crawl.
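To see how these path rules play out, here's a small sketch of the allow/disallow logic using Python's standard robots.txt parser. The actual virtual robots.txt that Acrolinx builds internally may differ in detail; this only mirrors the behavior described above:

```python
# A sketch of the allow/disallow logic behind "Crawl these paths",
# using Python's stdlib robots.txt parser. The actual virtual
# robots.txt that Acrolinx generates may look different.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Acrolinx-bot",
    "Allow: /blog",   # one allow line per path you list
    "Disallow: /",    # keeps the crawler off everything else
])

print(rp.can_fetch("Acrolinx-bot", "https://my.domain/blog/post"))  # True
print(rp.can_fetch("Acrolinx-bot", "https://my.domain/pricing"))    # False
```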
Track the Status of a Crawl
To keep an eye on how your crawl or check is doing, you can review the table in the data center. Here, you'll see the following information:
Section | Description |
---|---|
Domain | Name of the start domain or subdomain for the crawl. |
Allowed URL Paths | List of the allowed paths specified when the crawl was created. |
Crawl Summary | Status and summary of the latest crawl. |
Check Summary | Status and summary of the latest check. |
Edit Crawl Settings
After you add a domain or subdomain to your content library, you can always go back and adjust the related crawl settings.
To edit the settings for a crawl, do the following:
- Go to Profile and settings > Admin Console.
- In the WEB CRAWLING tab of the data center, click Edit crawl settings next to the domain or subdomain you want to change.
- Adjust the settings for your crawl.
- Click Save to update your settings and run a recrawl.
Recrawl a Domain
Acrolinx rechecks and recrawls your content on a weekly basis. If you've made changes and want to recrawl your whole domain immediately, you can trigger the process manually.
To manually trigger a recrawl, do the following:
- Go to Profile and settings > Admin Console.
- In the WEB CRAWLING tab of the data center, click Recrawl this domain next to the domain or subdomain you want to update.
Delete a Domain
To permanently delete a domain from your content library, do the following:
- Go to Profile and settings > Admin Console.
- In the WEB CRAWLING tab of the data center, click the trash icon Delete from content library next to the domain or subdomain you want to delete.
- Click Delete in the confirmation dialog to permanently delete the domain from your content library.
Advanced Crawl Settings
If you're familiar with the ins and outs of web crawling, you can customize your crawl with advanced crawl settings in Content Cube. These settings are useful if you only want to track certain parts of your website, or if Acrolinx hits any snags during a crawl.
To add advanced crawl settings, do the following:
- Go to Profile and settings > Admin Console.
- Open the Web Crawler Settings window.
- For a new domain or subdomain, click WEB CRAWLING > Add new domain.
- For an existing domain or subdomain, click Edit crawl settings next to the domain name.
- Choose the settings that you'd like to apply to your crawl:
Setting | Description |
---|---|
Never crawl URLs with query parameters | Turns off the option to specify query parameters. Selected by default. |
Only crawl URLs with these query parameters | Specify the query parameters that you want the Acrolinx-bot to crawl. |
Never crawl URLs with these query parameters | Specify the query parameters that you want the Acrolinx-bot to ignore. |
Respect `nofollow` tags | The Acrolinx-bot will honor `nofollow` directives and won't follow those links. |
Respect `noindex` tags | The Acrolinx-bot will honor `noindex` directives and won't index those pages. |
Follow alternates | The Acrolinx-bot will crawl any links listed as "alternate." |
Turn on AJAX crawling | The Acrolinx-bot will crawl AJAX applications. |
Follow canonicals | The Acrolinx-bot will crawl any URLs mentioned in canonical tags. |
Turn on JavaScript crawling | The Acrolinx-bot will crawl JavaScript-rendered content. |
Follow HTTP redirects (3xx) | The Acrolinx-bot will crawl every page in a page's redirect chain. |
Turn on mobile crawling | The Acrolinx-bot will identify itself as a mobile device. By default, it identifies itself as a desktop device. |
Follow links on error pages (4xx and 5xx) | The Acrolinx-bot will crawl any links on 4xx and 5xx error pages. |
Crawl Behind Sign-In | Provide sign-in details for a password-protected site that you want the Acrolinx-bot to crawl. |
Custom Request Headers | Specify any authentication headers needed for the Acrolinx-bot to access your content (see the sketch below). |
- Click Save to start crawling.
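For context on the Custom Request Headers setting: the headers you enter travel with each request the crawler makes, much like in the sketch below. The header name and token are placeholders, not values from Acrolinx; use whatever your site's authentication layer expects.

```python
# A sketch of a crawl request that carries a custom authentication
# header. The Authorization value is a placeholder; substitute the
# header and token that your site's auth layer expects.
from urllib.request import Request, urlopen

req = Request(
    "https://docs.acrolinx.com/",
    headers={
        "Authorization": "Bearer <your-token>",  # placeholder token
        "User-Agent": "Acrolinx-bot",
    },
)
with urlopen(req, timeout=10) as resp:
    print(resp.status, resp.url)
```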