Import data by crawling your website

Import data by crawling your website in Productsup.

Productsup can crawl your website to import additional product data. This feature is designed for cases where no product data source is available and one needs to be created from scratch.

The Website Crawler begins at a single start domain and then crawls all product landing pages.
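
Conceptually, this is a breadth-first link traversal that stays on one domain. The Python sketch below only illustrates that idea and is not Productsup's implementation; it assumes the requests and beautifulsoup4 libraries are installed:

```python
# Illustrative breadth-first crawl: start at one URL and follow only links
# that stay on the same domain. Not Productsup's actual crawler.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        pages.append(url)
        # Resolve links to absolute URLs and keep only those on the start domain.
        for anchor in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```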

Note

This feature needs to be enabled by Productsup support and comes with additional costs. If you are interested in using this service, please get in touch.

Before running this service, let your website administrators know about the upcoming crawl. This ensures the crawler won't be blocked from accessing your product data.
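
One way to check in advance whether a site's robots.txt would block a crawler is Python's standard urllib.robotparser. This is a generic sketch; the user agent string and URLs are placeholders:

```python
# Check robots.txt rules before crawling (generic sketch, placeholder values).
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.test.com/robots.txt")
robots.read()  # fetch and parse the robots.txt file

# True if this user agent may fetch the given URL according to robots.txt.
print(robots.can_fetch("ExampleCrawler/1.0", "https://www.test.com/p/123"))
```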

Set up the Website Crawler

  1. Navigate to your site

  2. Navigate to Data Sources

  3. Click Add Data Source

  4. Add the Website Crawler data source

  5. Give your data source a custom name (if desired)

    1. This name replaces the default data source name on the Data Sources overview page

  6. Click Continue

    Once you have added the data source, you can set it up:

  7. Enter your domain (for example, starting with www and not http)

  8. Enter your start URL(s)

    1. This is where Productsup starts crawling your products

    2. If product pages or categories are not directly linked from this initial URL, you can enter more than one URL

    web_crawler1.png
  9. Confirm with your website admins that you have permission to crawl the website

    1. Once you have confirmed this, tick the permissions box
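
Step 8 allows more than one start URL for the case where some categories are not linked from the first one. As a rough illustration, the snippet below reuses the hypothetical crawl() helper from the earlier sketch to seed a crawl with several URLs (all URLs are placeholders):

```python
# Seed the crawl with several start URLs (reuses the illustrative crawl()
# helper defined in the earlier sketch; URLs are placeholders).
start_urls = [
    "http://www.test.com/shop",
    "http://www.test.com/outlet",  # not linked from /shop, so seeded directly
]

pages = []
for url in start_urls:
    pages.extend(crawl(url))
print(len(pages), "pages crawled")
```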

You can then carry out optional steps, including changing the default values:

  1. If your website is hosted on a subdomain, enable the Crawl subdomains option

  2. For example, the following website is hosted on the shop subdomain of the interdimensionallogistics.com domain: http://shop.interdimensionallogistics.com

  3. If you want to use your own servers to crawl your website, select the Use proxy server option

  4. Enter the server host, username, password, and port

  5. To limit the crawler so that it tries to crawl only product pages, you can enter a URL part into the Link contains field:

    1. The crawler keeps links containing this part and only follows links that contain it

    2. You can use wildcards here (see the Wildcards section below)

  6. To include only links containing a certain keyword, add it in the Filters (include) section

  7. To exclude links containing a certain keyword, add it in the Filters (exclude) section

    web_crawler2.png
  8. You can select the number of crawlers that crawl at the same time

  9. Under Retry on error, define how many times the crawler should retry unreachable pages

  10. Under Crawler timeout per page, you can set how long the crawler should wait for a request to be answered before aborting

  11. You can set a limit on how many pages to crawl in the Max count of pages to crawl section

    web_crawler3.png
  12. Finally, click Save
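
Taken together, the settings above can be pictured as a single configuration. The field names below are illustrative only; they mirror the UI labels and are not a Productsup API:

```python
# Hypothetical mirror of the Website Crawler form; names and values are
# illustrative, not a real Productsup API.
crawler_config = {
    "domain": "www.test.com",               # starts with www, not http
    "start_urls": ["http://www.test.com/shop"],
    "crawl_subdomains": False,              # enable for e.g. shop.test.com
    "proxy": {                              # only if Use proxy server is on
        "host": "proxy.example.internal",
        "port": 8080,
        "username": "crawler",
        "password": "secret",
    },
    "link_contains": "*/p/*",               # only follow links matching this part
    "filters_include": ["product"],         # keep links containing these keywords
    "filters_exclude": ["login", "cart"],   # drop links containing these keywords
    "concurrent_crawlers": 4,               # crawlers running at the same time
    "retry_on_error": 3,                    # attempts for unreachable pages
    "crawler_timeout_per_page": 30,         # seconds to wait per request
    "max_pages_to_crawl": 10000,            # overall crawl limit
}
```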

Wildcards

In the Link contains and Filters (include/exclude) fields, you can use wildcard characters to avoid having to enter exact parameters.

An asterisk matches any number of characters, no matter which characters they are

If you input */p/, this means that your URL should end with /p/, for example, http://www.test.com/p/

If you input /p/*, this means your URL should start with /p/, for example, /p/123456

If you input */p/*, this means that /p/ can be anywhere in your URL, for example, http://www.test.com/p/123

You can use as many asterisks as you wish, for example, */cat/*/p/* matches: http://www.test.com/cat/123/p/456
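
These asterisk examples behave exactly like shell-style patterns, which you can try with Python's standard fnmatch module. This only illustrates the semantics and is not Productsup's matcher:

```python
import fnmatch

# Each assertion mirrors one of the asterisk examples above.
assert fnmatch.fnmatchcase("http://www.test.com/p/", "*/p/")
assert fnmatch.fnmatchcase("/p/123456", "/p/*")
assert fnmatch.fnmatchcase("http://www.test.com/p/123", "*/p/*")
assert fnmatch.fnmatchcase("http://www.test.com/cat/123/p/456", "*/cat/*/p/*")
```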

A question mark matches any single character, no matter which character it is

You can use single or multiple question marks to match a set number of characters. For example, */???/*/p/* matches: http://www.test.com/cat/123/p/456
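
Again in fnmatch terms, each ? stands for exactly one character, so ??? matches the three-character cat segment in the example (illustrative only, not Productsup's matcher):

```python
import fnmatch

# ??? matches exactly three characters ("cat"); * matches any run of characters.
assert fnmatch.fnmatchcase("http://www.test.com/cat/123/p/456", "*/???/*/p/*")
# A two-character segment does not satisfy the three question marks.
assert not fnmatch.fnmatchcase("http://www.test.com/ca/123/p/456", "*/???/*/p/*")
```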