
Import data by crawling your website


Introduction

Productsup can crawl your website to import additional product data. This feature is intended for cases where no other product data sources are available and you need to establish one.

The Website Crawler begins at a single start domain and then crawls all product landing pages it can reach.
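Conceptually, the crawl is a breadth-first traversal: fetch the start page, collect its links, and repeat for every page on the same domain. The following is a minimal illustrative sketch in Python using only the standard library; it is not the platform's actual implementation, and the page limit and timeout values are arbitrary.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=50):
        """Breadth-first crawl: start at one domain and follow internal links."""
        domain = urlparse(start_url).netloc
        queue, seen = deque([start_url]), {start_url}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable page; a real crawler would retry
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                # Stay on the start domain, mirroring the crawler's default scope.
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen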

Prerequisites

The Website Crawler feature is part of the Crawler Module, which is available at an additional cost in all platform editions. Contact support@productsup.com to discuss adding it to your organization.

  • The Crawler Module contains the following features:

    • Website Crawler

    • Data Crawler

    • Image Properties Crawler

  • Inform your website admins about the upcoming crawl before running this feature. This notice helps ensure the crawler doesn't face restrictions when you start the Website Crawler.
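One way to confirm the crawler won't be blocked is to check the site's robots.txt against the crawler's default user agent (shown in the advanced settings below). Here is a sketch using Python's standard urllib.robotparser; the domain and product URL are hypothetical:

    from urllib.robotparser import RobotFileParser

    # The crawler's default user agent, as listed under advanced settings.
    USER_AGENT = (
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) "
        "Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)"
    )

    robots = RobotFileParser("https://www.example.com/robots.txt")
    robots.read()  # fetch and parse the live robots.txt

    if robots.can_fetch(USER_AGENT, "https://www.example.com/p/123456"):
        print("robots.txt permits crawling this page.")
    else:
        print("Blocked by robots.txt; ask your website admin for access.")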

Set up Website Crawler

  1. Go to Data Sources from your site's main menu, and select ADD DATA SOURCE. Then choose Website Crawler and select Add.

  2. (Optional) Give your data source a custom name. This custom name replaces the name of the data source on the Data Sources Overview page. Then, select Continue.

  3. Enter the website domain in Domain, for example, beginning with www. The website domain is where Productsup starts crawling product data.

  4. (Optional) Enter one or more URLs in Start URLs if some product pages or categories aren't linked from the initial website domain.

  5. Switch the Crawl Subdomains button to On if your website has a subdomain. For example, in your company's shop URL, www.shop.interdimensionallogistics.com, interdimensionallogistics is your company name and shop identifies your subdomain. See the sketch after these steps for how this setting widens the crawl scope.

  6. Ensure you have permission from your website admin to crawl the website.

    1. Once you have confirmed your permission, tick the permissions box.

  7. Select Save.
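To illustrate the Crawl Subdomains toggle from step 5, here is a hedged sketch of how such a scope check might work; the in_scope helper and the domains are hypothetical, not the platform's internal logic:

    from urllib.parse import urlparse

    def in_scope(url, start_domain, crawl_subdomains=False):
        """Return True if url belongs to the start domain or, optionally, a subdomain."""
        host = urlparse(url).netloc.lower()
        start = start_domain.lower()
        if host == start:
            return True
        # With subdomains on, www.shop.interdimensionallogistics.com also matches.
        return crawl_subdomains and host.endswith("." + start)

    url = "https://www.shop.interdimensionallogistics.com/p/1"
    print(in_scope(url, "interdimensionallogistics.com", crawl_subdomains=True))  # True
    print(in_scope(url, "interdimensionallogistics.com"))                         # False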

Optional Website Crawler advanced settings

  1. Select the Use Proxy Server button if you want to crawl your website through your own proxy server. See the proxy sketch after these settings.

  2. Enter your proxy server credentials in the Proxy Host, Proxy Username, and Proxy Password fields.

  3. To limit crawling to product pages only, enter a portion of the URL in Link Contains:

    • The crawler detects links containing this URL portion and only crawls those pages.

    • You can use wildcards (*).

  4. To include links containing specific keywords, add them in the Filters (include) section.

  5. To exclude links containing specific keywords, add them in the Filters (exclude) section.

  6. User Agent is the name the crawler uses to access your website. You can modify the default User Agent according to your needs, for instance, by adding a hash for increased security. See the following examples:

    Default: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)

    Modified: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler) 8jbks7698sdha123kjnsad9

  7. In Concurrent Crawlers, select how many crawler instances run simultaneously.

  8. Under Retry on Error, define how many times the crawler should retry unreachable pages. See the sketch after this list for how retries and the per-page timeout might interact.

  9. Under Crawler timeout per page (seconds), set how long the crawler should wait for a page to respond before aborting the request.

  10. You can limit how many pages the crawler visits in Max count of pages to crawl.

  11. Select Save.
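For steps 1 and 2, a crawler typically routes its requests through the configured proxy. Here is a sketch using Python's standard urllib; the proxy host and credentials are placeholders standing in for the values you enter in Proxy Host, Proxy Username, and Proxy Password:

    from urllib.request import ProxyHandler, build_opener

    # Placeholder values for Proxy Host, Proxy Username, and Proxy Password.
    proxy = ProxyHandler({
        "http": "http://proxy-user:proxy-pass@proxy.example.com:8080",
        "https": "http://proxy-user:proxy-pass@proxy.example.com:8080",
    })
    opener = build_opener(proxy)
    page = opener.open("http://www.example.com/p/123456", timeout=30).read()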

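For steps 8 and 9, Retry on Error and Crawler timeout per page (seconds) work together per request: the crawler waits up to the timeout on each attempt and gives up after the configured number of retries. A minimal sketch of that interaction, assuming illustrative default values:

    from urllib.request import urlopen

    def fetch_with_retries(url, retries=3, timeout=30):
        """Try a page up to `retries` times, waiting `timeout` seconds per attempt."""
        last_error = None
        for attempt in range(retries):
            try:
                return urlopen(url, timeout=timeout).read()
            except OSError as error:  # covers timeouts and HTTP/URL errors
                last_error = error
        raise last_error  # page still unreachable after all configured attempts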

Wildcards

In the Link Contains, Filters (include), and Filters (exclude) fields, you can use wildcard characters so that you don't need to enter exact URL values.

  • An asterisk (*) matches any number of characters, no matter which characters.

  • If you input */p/, your URL should end with /p/, for example, http://www.test.com/p/.

  • If you input /p/*, your URL should start with /p/, for example, /p/123456.

  • If you input */p/*, /p/ can appear anywhere in your URL, for example, http://www.test.com/p/123.

  • You can use as many asterisks as you wish. For example, */cat/*/p/* matches http://www.test.com/cat/123/p/456.

  • A question mark (?) matches exactly one character, no matter which character.

  • You can use single or multiple question marks to match a set number of characters. For example, */???/*/p/* matches http://www.test.com/cat/123/p/456.
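These rules behave like shell-style glob patterns, so you can experiment with them using Python's standard fnmatch module. Here is a sketch reproducing the examples above; it mimics the documented behavior rather than the platform's actual matcher:

    from fnmatch import fnmatch

    print(fnmatch("http://www.test.com/p/", "*/p/"))                    # True: ends with /p/
    print(fnmatch("/p/123456", "/p/*"))                                 # True: starts with /p/
    print(fnmatch("http://www.test.com/p/123", "*/p/*"))                # True: /p/ anywhere
    print(fnmatch("http://www.test.com/cat/123/p/456", "*/cat/*/p/*"))  # True: multiple *
    print(fnmatch("http://www.test.com/cat/123/p/456", "*/???/*/p/*"))  # True: ??? matches cat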