Crawl product landing pages using the Data Crawler

Crawl product landing pages using the Data Crawler in Productsup.

Note

To use the Data Crawler, you must:

  • Have a column in your feed that contains your product landing page links

  • Be permitted to crawl these pages (you confirm this in the service's settings)

You should also:

  • Inform your website admin about the crawler

  • Align with your website admin to make sure the configuration is in line with what your website can handle

  • Be aware that crawling is a time-consuming process and can slow down your website's performance

  • Be aware that once the Data Crawler has started running, pressing the cancel job button does not stop it

The Data Crawler is a service that crawls the product landing pages you provide in your feed. It was created for cases where product data is missing from the data source but is available on the website.

It does this by crawling your product links and returning the HTML code of each page. From there, you can then Extract product information from the Data Crawler.
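
As a rough illustration of this crawl-then-extract flow (not the platform's internal implementation), the sketch below fetches one product landing page and pulls values out of the returned HTML. The URL, user agent string, and lookup expressions are placeholder assumptions, and it uses the common third-party requests and lxml libraries:

    # Conceptual sketch only: fetch one product landing page and extract values
    # from its HTML, similar to what the Data Crawler and the extraction boxes
    # (for example, HTML Get element by Xpath) do for you in the platform.
    import requests
    from lxml import html

    URL = "https://www.example.com/products/blue-sneaker"  # placeholder product link
    USER_AGENT = "ExampleCrawler/1.0"                       # placeholder user agent

    response = requests.get(URL, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()

    # The crawler stores the raw HTML; extraction happens afterwards.
    tree = html.fromstring(response.text)
    price = tree.xpath('//span[@id="price"]/text()')  # lookup by ID via XPath (placeholder)
    title = tree.xpath('//h1/text()')                 # lookup by tag name (placeholder)

    print(price[0] if price else "no price found")
    print(title[0] if title else "no title found")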

Adding the Data Crawler

To add the Data Crawler, you should:

  1. Go to Data Services in your site's main menu.

  2. Select ADD SERVICE.

  3. Add the Data Crawler service

  4. Give the service a name (if desired)

  5. Define a custom column prefix (if desired)

  6. Select whether to use the service on the import or intermediate level

    • the level refers to where your product landing page links are found

      • if these are in your import file, you can select import

      • if you first need to optimize/create your links before running the crawler, select intermediate

  7. Click add

  8. Under URL column, select the column that contains your product links

  9. You can modify the user agent if desired

    • this is the name the crawler uses to identify itself when accessing your product links

    • you may wish to whitelist this user agent for access to your website

  10. In concurrent crawlers, select the number of crawlers that access your links at the same time (see the conceptual sketch after this list)

    • a lower number generally means less impact on your website’s performance

    • a higher number generally means the crawling completes more quickly

  11. Under request timeout (seconds), set how long the crawler waits for a response

  12. Set the interval at which a product should be recrawled under expires after (days)

    • the crawler will always crawl new or changed links

    • it will only recrawl links for new information at the interval you set

    • you can set this interval to be -1 if you want to recrawl products on every run

  13. You can save a compressed version of the crawled data by activating the save data compressed function

    • If this setting is enabled, the content can only be extracted with specific boxes:

      • HTML Get element by ID

      • HTML Get element by Tag name

      • HTML Get element by Xpath

  14. Select trigger during a refresh of the Data View if you want the crawler to also run when you refresh the Data View

  15. Add an HTTP Username and HTTP Password if you want to crawl sites protected by basic authentication

  16. Check the permissions box to confirm that you are permitted to crawl the pages

    • Otherwise, you cannot use this service

  17. Click save
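
The settings in steps 9, 10, 11, and 15 correspond to standard HTTP crawling concepts: a custom user agent, a concurrency limit, a per-request timeout, and optional basic authentication. The sketch below is only a conceptual illustration of those concepts, not the Data Crawler's implementation; all URLs and credentials in it are placeholders:

    # Conceptual illustration of the crawler settings above, not the Data Crawler itself.
    # All URLs, the user agent, and the credentials are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    PRODUCT_URLS = [
        "https://www.example.com/products/1",
        "https://www.example.com/products/2",
    ]
    USER_AGENT = "ExampleCrawler/1.0"   # step 9: the name the crawler identifies itself with
    CONCURRENT_CRAWLERS = 2             # step 10: how many links are fetched at the same time
    REQUEST_TIMEOUT = 30                # step 11: seconds to wait for a response
    BASIC_AUTH = ("user", "password")   # step 15: only for sites behind basic authentication

    def crawl(url: str) -> dict:
        response = requests.get(
            url,
            headers={"User-Agent": USER_AGENT},
            timeout=REQUEST_TIMEOUT,
            auth=BASIC_AUTH,            # omit for publicly accessible pages
        )
        # Roughly the kind of per-URL information the result columns below expose.
        return {
            "url": url,
            "http_code": response.status_code,
            "content_type": response.headers.get("Content-Type"),
            "size_download": len(response.content),
            "data": response.text,
        }

    with ThreadPoolExecutor(max_workers=CONCURRENT_CRAWLERS) as pool:
        results = list(pool.map(crawl, PRODUCT_URLS))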

Once you’ve successfully crawled your product landing pages by triggering a run of your site, you receive new columns in the Platform. These columns start with three underscores, followed by the column prefix you set (the examples below use the prefix service_datacrawler):

  • ___service_datacrawler_url: Contains the crawled URL

  • ___service_datacrawler_data: Contains the source code of the crawled page

  • ___service_datacrawler_date: Shows the date of the last crawl with a timestamp

  • ___service_datacrawler_http_code: Shows the HTTP status of the crawled web page

  • ___service_datacrawler_content_type: Contains encoding information, for example, html/UTF-8

  • ___service_datacrawler_size_download: Contains the size of the crawled page

  • ___service_datacrawler_total_time: Shows the entire duration of the crawling

  • ___service_datacrawler_md5_url: Contains the crawled URL encoded as an MD5 hash (see the sketch below)

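For reference, an MD5 hash of a URL like the one this column appears to contain can be reproduced with a short snippet; the URL below is a placeholder:

    # Compute the MD5 hash of a crawled URL (assumption based on the column name).
    import hashlib

    url = "https://www.example.com/products/blue-sneaker"    # placeholder product link
    print(hashlib.md5(url.encode("utf-8")).hexdigest())      # 32-character hex digest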

Editing an existing Data Crawler

To edit the settings of your Data Crawler, you should:

  1. Navigate to your site

  2. Navigate to Data Services on the left-hand tab

  3. Click the settings wheel to open the service's settings

Deleting an existing Data Crawler

To delete your Data Crawler, you should:

  1. Navigate to your site

  2. Navigate to Data Services on the left-hand tab

  3. Click the settings wheel

  4. Scroll to the bottom of the page and click remove this service