Service - Image Properties Crawler

What is an Image Properties Crawler?

The Image Properties Crawler is a service that provides the metadata of the images in your feed. It detects image properties, such as the image's size and type, and verifies that the image link is reachable.

This service creates an Additional Data Source in the site where it is activated. The Additional Data Source contains the metadata of your images, which you can extract into Intermediate columns with our rule boxes and then use as needed.

Preconditions

  • This feature requires a Data Source with unique IDs and links to the product detail pages of your products.

  • You need the permission of the owner of the domain on whose servers the images are hosted.

Important

  • You can only crawl one image column per site.

  • Inform your website admin about the crawler and make sure the configuration is in line with the capabilities of the website. If the admin normally blocks crawlers, they can whitelist the crawler's user agent.

  • Before using the Image Properties Crawler service, always define the ID column (a column in your feed that is unique per row) in the Settings section of the Data Sources tab:

Figure 1. Image


1. General settings

Step 1: Add new service

You can find instructions on how to add a service, define its position and the stage it is executed on in our general Data Services section.

Step 2: Setup

In order to align the crawler with the capabilities of your website, you can configure several options in the setup of the service. Configure them before the first run; you can adjust them at any time afterwards.


Image URL Column: Select the column containing a link to your image. The crawler will access this link and save the data as an Additional Data Source of your site. Make sure this URL does not contain any tracking parameters as they would be triggered by the crawl.

Download & Fetch Image Metadata: Activate this option if you want to extract the image metadata, such as size, type or HTTP status.

User Agent: The User Agent is the "name" that the crawler uses to access your website. The default User Agent can be modified according to your needs, e.g. by adding a hash for increased security. Example:

Default: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)

Modified: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler) 8jbks7698sdha123kjnsad9
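The effect of the modified User Agent can be illustrated with a minimal sketch. The helper names `with_token` and `is_whitelisted` are hypothetical, and the server-side check is only an assumption about how a website admin might whitelist the crawler; it is not part of the service itself.

```python
import secrets

# Default User Agent as quoted in the example above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) "
    "Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler)"
)


def with_token(user_agent: str, token: str) -> str:
    """Append a secret token to the User Agent, separated by a space."""
    return f"{user_agent} {token}"


def is_whitelisted(request_user_agent: str, expected_user_agent: str) -> bool:
    """Hypothetical server-side check: allow only the exact agreed User Agent."""
    # compare_digest compares in constant time, which suits secret tokens.
    return secrets.compare_digest(
        request_user_agent.encode("utf-8"), expected_user_agent.encode("utf-8")
    )


# The token is the example value from the text above.
modified = with_token(DEFAULT_USER_AGENT, "8jbks7698sdha123kjnsad9")
```

With a check of this kind, a request that sends only the default User Agent would be rejected, while a request carrying the agreed token would pass.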

Concurrent Crawlers: In the dropdown, select the number of crawlers that access the Image URL Column at the same time. Ideally, agree on this setting with your website admin. If this is not possible, start with a low number of concurrent crawlers and only increase it if your website's performance doesn't seem to be affected. In general, the more crawlers you choose, the faster the crawl.

Request Timeout (seconds): Set how long the crawler should wait for a request to be answered before aborting. Adjust this value to your website's maximum or average response time.
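How Concurrent Crawlers and Request Timeout interact can be sketched as follows. This is an illustration built on the Python standard library, not the service's actual implementation; the function names `fetch_status` and `crawl` and the convention of returning 0 for unanswered requests are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
import urllib.error
import urllib.request


def fetch_status(url: str, timeout_seconds: float, user_agent: str) -> int:
    """Request one image URL and return its HTTP status code (0 on failure)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status, e.g. 404
    except OSError:
        return 0  # timeout or unreachable host (assumed convention)


def crawl(
    urls: Iterable[str],
    concurrent_crawlers: int,
    fetch: Callable[[str], int],
) -> Dict[str, int]:
    """Fetch all URLs with at most `concurrent_crawlers` requests in flight."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=concurrent_crawlers) as pool:
        statuses = list(pool.map(fetch, url_list))
    return dict(zip(url_list, statuses))
```

In this sketch, a call such as `crawl(urls, 2, lambda u: fetch_status(u, 10, user_agent))` would never send more than two requests to the website at once, and each request would give up after roughly ten seconds without a response. A higher worker count finishes sooner but puts more simultaneous load on the server.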

Expires After (days): Enter the number of days after which a product should be re-crawled. With every scheduled run, the crawler checks for new product IDs and crawls them, but it only refreshes existing IDs after the number of days entered in this field. If you want the crawler to refresh the data with every scheduled run, enter -1 in this input field. This may be required if you are crawling fast-changing information like prices and availabilities from your website.

Trigger during a refresh of the Data View: By default, the Data Service is only triggered on a manual or scheduled run. Activating this button additionally triggers the service every time you refresh your data in Data View.

For Expires After (days), it is recommended to enter the highest number possible, as unnecessary crawling can lower the performance of your website's servers and increase the processing time within the site.
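The Expires After (days) rule can be summarized in a short sketch. The function name `needs_crawl` is hypothetical, and treating an entry as expired exactly when the full number of days has passed is an assumption; the source does not specify the boundary.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_crawl(
    last_crawled: Optional[datetime],
    expires_after_days: int,
    now: datetime,
) -> bool:
    """Decide whether a product ID should be crawled in the current run."""
    if last_crawled is None:
        return True  # new product IDs are always crawled
    if expires_after_days == -1:
        return True  # -1 means: refresh with every run
    # Assumption: the entry expires once the full number of days has elapsed.
    return now - last_crawled >= timedelta(days=expires_after_days)
```

With a value of 30, for example, an image crawled yesterday would be skipped in today's run, while a product ID that has never been crawled would be picked up immediately.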

Step 3: Triggering the first run

This service is executed at the Intermediate level, so you need to trigger a full Import & Export of the data if you want to start it manually. Otherwise, it starts with the next scheduled run and then runs with every full processing of the site.

As the first run crawls the chosen image link of all products, schedule it at night or at another time when your website is not facing high customer traffic, in order to avoid performance problems.

Edit Setup/Delete Service

If you want to edit or delete the service, go to Data Sources, Services, and click on "Setup". You will find a Delete option in the Danger Area.

Figure 2. Delete


2. How to work with the crawled data

After the first successful crawl, the Additional Data Source will appear:

Figure 3. Source


This Data Source adds multiple new columns to your site, all starting with three underscores to show that these columns are Productsup "system" columns and not part of your original Data Source. The three leading underscores also cause the columns to be hidden in the Data View/Data Edit section, so you need to actively select them in the "Views" feature if you want to see them:

Figure 4. GIF


Note: Depending on the amount of crawled data, the performance of the Data View may decrease. We recommend showing as few products as possible.

Figure 5. GIF


If you search your Dataflow Import columns for "___service", all "system" columns created by Productsup Services will appear:

Figure 6. DF


*___service_imagecrawler_content_type:* Contains the HTTP response content type

*___service_imagecrawler_date:* Shows the date of the last crawl with timestamp

*___service_imagecrawler_height:* Contains the height of the crawled image

*___service_imagecrawler_width:* Contains the width of the crawled image

*___service_imagecrawler_http_code:* Shows the HTTP status of the crawled image, which can be useful if you want to exclude products without reachable images

*___service_imagecrawler_url:* Shows the URL of the crawled image

*___service_imagecrawler_md5_image:* Contains the MD5 checksum of the original image file

*___service_imagecrawler_md5_url:* Contains the MD5 checksum of the original image URL

*___service_imagecrawler_mime:* Shows the original image MIME type

*___service_imagecrawler_site_download:* Contains the original image size in bytes

*___service_imagecrawler_total_time:* Shows the total time it took to crawl the image
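To illustrate the difference between the two MD5 columns, the following sketch computes a checksum of the URL string and a checksum of the image file with Python's `hashlib`. The helper names, the UTF-8 encoding of the URL and the hexadecimal output format are assumptions; the exact format used by the service is not confirmed here.

```python
import hashlib
from typing import Dict


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of `data` as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def checksums(image_url: str, image_bytes: bytes) -> Dict[str, str]:
    """Build the two checksum values for one crawled image."""
    return {
        # Depends only on the URL text (assumed to be UTF-8 encoded).
        "___service_imagecrawler_md5_url": md5_hex(image_url.encode("utf-8")),
        # Depends only on the file content, regardless of where it is hosted.
        "___service_imagecrawler_md5_image": md5_hex(image_bytes),
    }
```

Two products that point to the same file under different URLs would therefore share the image checksum but differ in the URL checksum.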

Mapping in Dataflow and working with the crawled data

Map the Import column you want to base your optimizations on to the Intermediate level. From there, use our rule boxes and segments for the following use cases:

  • Skip products without a reachable image based on their HTTP status by adding a "Skip Row if Value in" box on the connection containing input from ___service_imagecrawler_http_code.

  • Exclude products that don't meet a channel's image size requirements by using the "Skip Row if Value in" box on the width or height information. You can do this on export level by adding a custom column beginning with three underscores and adding the box directly in the Dataflow.

  • By analyzing the data in ___service_imagecrawler_md5_image in Data View you can identify duplicate images that might have different URLs.

  • Create Segments for different sizes of images, so you can apply the "FB Use Image Designer Template" box to these groups. This enables you to apply different templates to different image sizes.
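The logic behind these use cases can be sketched outside the platform as plain Python over exported rows. This is only an illustration of the conditions involved, not how the rule boxes or Segments are implemented; the `id` key, the helper names and the thresholds are hypothetical, and all values are assumed to arrive as strings.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]

HTTP_CODE = "___service_imagecrawler_http_code"
WIDTH = "___service_imagecrawler_width"
HEIGHT = "___service_imagecrawler_height"
MD5_IMAGE = "___service_imagecrawler_md5_image"


def has_reachable_image(row: Row) -> bool:
    """Keep only rows whose image answered with HTTP status 200."""
    return row.get(HTTP_CODE) == "200"


def meets_min_size(row: Row, min_width: int, min_height: int) -> bool:
    """Check the crawled dimensions against a channel's minimum size."""
    try:
        return int(row[WIDTH]) >= min_width and int(row[HEIGHT]) >= min_height
    except (KeyError, ValueError):
        return False  # missing or empty dimensions count as not meeting it


def duplicate_images(rows: List[Row]) -> Dict[str, List[str]]:
    """Group product IDs that share the same image checksum."""
    ids_by_checksum: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        checksum = row.get(MD5_IMAGE, "")
        if checksum:
            ids_by_checksum[checksum].append(row["id"])
    return {c: ids for c, ids in ids_by_checksum.items() if len(ids) > 1}


def size_segment(row: Row, threshold: int) -> str:
    """Assign a row to a hypothetical 'large' or 'small' image segment."""
    return "large" if meets_min_size(row, threshold, threshold) else "small"
```

Here, `has_reachable_image` and `meets_min_size` correspond to the two skip conditions, `duplicate_images` to the checksum analysis, and `size_segment` to the grouping used for different templates.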

Stopping the Crawler

It is not possible to stop our crawlers once the service has started. The Cancel Job button appears in the Dashboard once the site is running, but it only stops the Import & Export process itself, not the crawlers. Keep this in mind when triggering processes with the full set of data.

Please contact our Support if you need assistance - we are happy to help!