Service - Data Crawler

What is the Data Crawler?

The Data Crawler is a service that imports additional product data from the product detail pages you provide in your feed. It was created for cases where product data on the website cannot be added to the Data Source, but is needed for your export feeds.

This service will create an Additional Data Source in the site it is activated in. It will contain the HTML code of your product detail pages, which can be extracted into Intermediate columns with the help of our rule boxes.

Preconditions

  • This feature requires a Data Source with unique IDs and links to the product detail pages of your products.

  • You need to be the owner of the domain you are crawling.

Important

  • You can only crawl one deeplink column per site.

  • Inform your website admin about the crawler and make sure the configuration is in line with the capability of the website. If they normally block crawlers, they can whitelist the user agent.

  • Before using the Data Crawler service, always define the ID column (a column in your feed that is unique per row) in the Settings section of the Data Sources tab.

1. General settings

Step 1: Add new service

You can find instructions on how to add a service, define its position and the stage it is executed on in our general Data Services section.

Step 2: Setup

In order to align the Data Crawler with the capabilities of your website, you can make multiple configurations in the setup of the service. This needs to be set up before the first run and can be adjusted any time afterwards.

URL Column: Select the column containing a link to your product detail page. The crawler will access this link and save the HTML code in the Additional Data Source of your site. Make sure this URL does not contain any tracking parameters as they would be triggered by the crawl.
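
To illustrate the point about tracking parameters, here is a minimal Python sketch of how links could be cleaned before they are put into the URL Column. The function name `strip_tracking` and the list of parameter names are assumptions for illustration; which parameters count as tracking depends on your own setup.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples of tracking parameters; adjust them to your own setup.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def strip_tracking(url: str) -> str:
    """Return the URL without query parameters that look like tracking."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("https://example.com/p/123?color=red&utm_source=feed"))
# prints https://example.com/p/123?color=red
```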

User Agent: The User Agent is the "name" that the crawler uses to access your website. The default User Agent can be modified according to your needs, e.g. by adding a hash for increased security. Example:

Default: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)

Modified: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler) 8jbks7698sdha123kjnsad9
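
On the website side, an admin could use the added hash to recognize the crawler. The following Python sketch shows one possible check; the function name `is_allowed_crawler` is hypothetical, and the hash is the example value from above, not a real secret. How the check is wired into your web server or firewall is up to your own infrastructure.

```python
CRAWLER_TOKEN = "(productsup.io/crawler)"
SECRET_HASH = "8jbks7698sdha123kjnsad9"  # example hash from the text above


def is_allowed_crawler(user_agent: str) -> bool:
    """Accept only User Agents that end with the crawler token plus the hash."""
    return user_agent.strip().endswith(f"{CRAWLER_TOKEN} {SECRET_HASH}")


default_ua = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) "
    "Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)"
)
modified_ua = default_ua + " 8jbks7698sdha123kjnsad9"

print(is_allowed_crawler(default_ua))   # prints False
print(is_allowed_crawler(modified_ua))  # prints True
```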

Concurrent Crawlers: Use the dropdown to select the number of crawlers that access the URL Column at the same time. You might want to align this setting with your website admin. If this is not possible, start with a low number of concurrent crawlers and only increase it if your website's performance doesn't seem to be affected. In general, the more you choose, the faster the crawling works.

Request Timeout (seconds): Set how long the crawler should wait for a request to be answered before aborting (adjust it to your website's maximum or average response time).
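
To give a feeling for how the number of concurrent crawlers and the request timeout interact, here is a minimal Python sketch using only the standard library. It is not the implementation of the Data Crawler; the names `fetch`, `crawl`, `concurrent_crawlers` and `request_timeout` are assumptions chosen to mirror the settings described here.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def fetch(url: str, timeout: float) -> int:
    """Fetch one page and return its HTTP status; raises on timeout or error."""
    with urlopen(Request(url), timeout=timeout) as response:
        return response.status


def crawl(urls, concurrent_crawlers=2, request_timeout=10.0, fetch_page=fetch):
    """Request all URLs with at most `concurrent_crawlers` requests in flight."""

    def safe_fetch(url):
        try:
            return fetch_page(url, request_timeout)
        except Exception as error:  # timeouts and HTTP errors end up here
            return error

    urls = list(urls)
    with ThreadPoolExecutor(max_workers=concurrent_crawlers) as pool:
        # pool.map keeps the input order, so results line up with the URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

A higher `concurrent_crawlers` value finishes the list sooner but puts more simultaneous load on the web server, which is why starting low is the safer choice.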

Expires After (days): Select the number of days after which a product should be re-crawled. The crawler checks for new product IDs with every scheduled run and crawls new IDs, but only refreshes existing IDs after the number of days entered in this field. If you want the crawler to refresh the data with every scheduled run, enter -1 in this input field. This may be required if you are crawling fast-changing information like prices and availabilities from your website.

It is recommended to set it to the highest number possible as unnecessary crawling can lower performance of your website's servers and increase the processing time within the site.
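
The re-crawl rule described for Expires After can be sketched in Python as follows. This is an illustration of the described behavior only; the function name `needs_crawl` is hypothetical, and whether the service compares with "at least" or "more than" the configured number of days is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_crawl(last_crawled: Optional[datetime], now: datetime,
                expires_after_days: int) -> bool:
    """Decide whether a product ID should be crawled in this run."""
    if last_crawled is None:
        return True  # new product ID: always crawled
    if expires_after_days == -1:
        return True  # -1: refresh with every run
    # Assumption: an entry expires once it is at least N days old.
    return now - last_crawled >= timedelta(days=expires_after_days)


now = datetime(2024, 1, 31)
print(needs_crawl(None, now, 30))                   # prints True
print(needs_crawl(datetime(2024, 1, 20), now, 30))  # prints False
print(needs_crawl(datetime(2024, 1, 20), now, -1))  # prints True
```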

Save data compressed (for big data/HTML sites): The compression option limits the amount of storage used by the crawled data (the HTML code of the product detail pages). If this setting is enabled, the content can only be extracted with specific boxes such as HTML Get element by ID, HTML Get element by Tag name and HTML Get element by Xpath. On Intermediate level, the data is only shown in compressed form.

Trigger during a refresh of the Data View: By default, the Data Service is only triggered on a manual or scheduled run. Activating this button additionally triggers the service every time you refresh your data in Data View.

HTTP Username/HTTP Password: If you want to crawl sites protected by basic authentication (for example your staging environment), you can set the username and password here.
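
For background, HTTP basic authentication sends the username and password as a Base64-encoded `Authorization` header. The Python sketch below shows how such a header is built, which can help when testing the credentials against your staging environment yourself. The function name `basic_auth_header` and the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


print(basic_auth_header("user", "pass"))
# prints {'Authorization': 'Basic dXNlcjpwYXNz'}
```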

Permissions: You need to confirm that you are permitted to crawl the pages, otherwise you cannot use the service.

Step 3: Triggering the first run

This service will be executed on Intermediate level, so you need to trigger a full Import & Export of the data if you want to start it manually. Otherwise it will be started with the next Scheduling and run with every full processing of the site moving forward.

As the first run will crawl all product detail pages, a good time for it is at night or at other times when the product pages are not facing high customer traffic, in order to avoid performance problems.

Edit Setup/Delete Service

If you want to edit or delete the service, go to Data Sources, Services, and click on "Setup". You will find a Delete option in the Danger Area.

2. How to work with the crawled data

After the first successful crawl, the Additional Data Source will appear in your site.

This Data Source will add multiple new columns to your site, all starting with three underscores to show that these columns are Productsup "system" columns and not part of your original Data Source. The three leading underscores also hide the columns in the Data View/Data Edit section, so you need to actively select them in the "Views" feature if you want to see them.

Note: Depending on the amount of crawled data, the performance of the Data View could decrease. We recommend showing as few products as possible.

If you search your Dataflow Import columns for "___service", all "system" columns created by Productsup Services will appear:

*___id:* Contains the unique product ID

*___service_datacrawler_url:* Contains the crawled URL

*___service_datacrawler_data:* Contains the source code of the crawled page

*___service_datacrawler_date:* Shows the date of the last crawl with timestamp

*___service_datacrawler_http_code:* Shows the HTTP status of the crawled web page, which can be useful if you want to exclude pages that no longer exist

*___service_datacrawler_content_type:* Contains encoding information (e.g.: html/UTF-8)

*___service_datacrawler_size_download:* Contains the size of the crawled page

*___service_datacrawler_total_time:* Shows the entire duration of the crawling

*___service_datacrawler_md5_url:* Contains the MD5 hash of the crawled URL
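
As an illustration of how the HTTP status column can be used to exclude pages that no longer exist, here is a small Python sketch working on rows represented as dictionaries. In the platform itself this kind of filtering is done with rule boxes; the function name `keep_existing_pages` and the row layout are assumptions for illustration.

```python
def keep_existing_pages(rows):
    """Keep only rows whose crawled page answered with HTTP status 200."""
    return [
        row for row in rows
        # Compare as text so that both "200" and 200 are accepted.
        if str(row.get("___service_datacrawler_http_code")) == "200"
    ]


rows = [
    {"___id": "1", "___service_datacrawler_http_code": "200"},
    {"___id": "2", "___service_datacrawler_http_code": "404"},
]
print([row["___id"] for row in keep_existing_pages(rows)])
# prints ['1']
```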

Mapping in Dataflow and extracting the data

Map the ___service_datacrawler_data, which contains the source code you want to extract data from, to as many Intermediate columns as necessary. If you want to extract additional images you should create multiple custom Intermediate columns for this and connect them all with the Import column containing your crawled data.
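
To show what extracting values from the crawled source code means in practice, here is a minimal Python sketch that collects the image links from a page's HTML, similar in spirit to searching the source code by tag name. It is only an illustration outside the platform; the class `ImageCollector` and the function `extract_image_urls` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in the document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip images without a src attribute
                self.sources.append(src)


def extract_image_urls(html: str) -> list:
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


page = '<div><img src="a.jpg"><img src="b.jpg"/><img alt="no source"></div>'
print(extract_image_urls(page))
# prints ['a.jpg', 'b.jpg']
```

Each extracted link would correspond to one of the custom Intermediate columns mentioned above.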

Boxes that search the source code for certain tag names and extract the values from them, such as HTML Get element by ID, HTML Get element by Tag name and HTML Get element by Xpath, are recommended for use with Data Crawler data.

Stopping the Crawler

It is not possible to stop our crawlers once the service has started. The Cancel Job button in the Dashboard will appear once the site is running, but it only stops the Import & Export process itself, not the crawlers. Keep this in mind when triggering processes with the full set of data.

Please contact our Support if you need assistance - working with crawled data is advanced and we are happy to help!