Detect image metadata with the Image Properties Crawler

Crawl your image links to enhance your feed with image metadata using the Image Properties Crawler data service in Productsup.

Introduction

The Image Properties Crawler is a Productsup data service that gathers image metadata by crawling image links. The crawled metadata includes image type, height, width, file size, etc.

A popular use case for the Image Properties Crawler is to check the availability of your image links based on the links' HTTP responses. To ensure your products don't use broken images, you can let the platform use the metadata gathered via this service to skip products with unreachable image links during export. See Use the crawled image metadata for more information.

Note

The Image Properties Crawler service is available for the import and intermediate stages.

When crawling the image links listed in one of your feed's columns, the Image Properties Crawler extracts image metadata from the responses it receives for those links. It creates an additional data source in your Productsup site that adds and populates the following columns:

  • ___service_imagecrawler_url duplicates the crawled image link for technical purposes.

  • ___service_imagecrawler_date states the date of the last crawl in the Unix time format.

  • ___service_imagecrawler_http_code displays the HTTP status code of the crawled link.

  • ___service_imagecrawler_width contains the original image width.

  • ___service_imagecrawler_height contains the original image height.

  • ___service_imagecrawler_mime stores the image MIME type.

  • ___service_imagecrawler_content_type displays the HTTP response with the content type of the crawled link.

  • ___service_imagecrawler_size_download states the original image file size in bytes.

  • ___service_imagecrawler_total_time shows how long it took the crawler to fetch the image.

  • ___service_imagecrawler_md5_url contains the MD5 hash of the crawled URL.

  • ___service_imagecrawler_md5_image stores the MD5 hash of the image.
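As an illustration of how to read these columns, the following Python sketch decodes a Unix timestamp such as the one in ___service_imagecrawler_date and computes MD5 hashes comparable to the ___service_imagecrawler_md5_url and ___service_imagecrawler_md5_image columns. The sample URL, the sample values, and the helper name md5_hex are hypothetical. The sketch also assumes that the service hashes the UTF-8 bytes of the URL and the raw bytes of the image, which this article doesn't confirm.

```python
import hashlib
from datetime import datetime, timezone


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of the given bytes as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical image link from a feed column.
image_url = "https://example.com/images/shoe.jpg"

# Assumption: the URL is hashed as UTF-8 bytes.
url_hash = md5_hex(image_url.encode("utf-8"))

# Assumption: the image hash is taken over the raw file content.
# These bytes stand in for a downloaded image file.
image_bytes = b"placeholder image content"
image_hash = md5_hex(image_bytes)

# A Unix timestamp is the number of seconds since 1970-01-01 00:00:00 UTC.
crawl_date = datetime.fromtimestamp(1700000000, tz=timezone.utc)

print(url_hash)
print(image_hash)
print(crawl_date.isoformat())
```

Because an MD5 hash is a fixed-length fingerprint rather than an encoding, you can't restore the URL or the image from it. Identical inputs always produce identical hashes, which is what makes the column useful for spotting repeated images.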

Warning

The Image Properties Crawler can slow down the performance of your Productsup site because it is a resource-intensive process.

Prerequisites

Note

The Image Properties Crawler service is part of the Crawler Module, which is available at an additional cost in all platform editions. Contact your Customer Success Manager to discuss adding it to your account.

The Crawler Module contains the following features:

To set up the Image Properties Crawler data service, you need:

  1. A product identifier. See Set a product identifier for more information.

  2. The rights to the domain you want to crawl. You must be the owner of the crawled website.

  3. A column in your feed that contains image URLs. The URLs must not contain tracking parameters.

Before running the Image Properties Crawler data service, you should discuss the specifics of your website's performance with your website admin to gather the information required for setting up the service. Find answers to these questions:

  1. How many crawlers can access your website at a time?

  2. What are your website's average and maximum response times?
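To see why these two answers matter, you can roughly estimate how long a crawl takes from the number of links, the number of concurrent crawlers, and the time per request. The following Python sketch is a simplified model, not a description of how Productsup schedules requests. It assumes the crawlers work through the links in equal batches, and the function name and sample numbers are hypothetical.

```python
import math


def estimate_crawl_seconds(link_count: int,
                           concurrent_crawlers: int,
                           seconds_per_request: float) -> float:
    """Estimate crawl duration, assuming links are fetched in equal batches."""
    if concurrent_crawlers < 1:
        raise ValueError("concurrent_crawlers must be at least 1")
    # Each batch fetches up to `concurrent_crawlers` links in parallel.
    batches = math.ceil(link_count / concurrent_crawlers)
    return batches * seconds_per_request


# Hypothetical feed with 50,000 image links and 10 concurrent crawlers.
typical = estimate_crawl_seconds(50_000, 10, 0.5)    # average response time
worst_case = estimate_crawl_seconds(50_000, 10, 10)  # every request times out

print(f"Typical: about {typical / 60:.0f} minutes")
print(f"Worst case: about {worst_case / 3600:.1f} hours")
```

An estimate like this can help you and your website admin decide on the number of concurrent crawlers and the request timeout before the first run.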

Add the Image Properties Crawler

  1. Go to Data Services from your site's main menu and select ADD SERVICE.

  2. Search for Image Properties Crawler, select Add, and give it a desired name and column prefix.

    By default, ___service_imagecrawler is the column prefix.

  3. Choose the stage containing the column with the crawled URLs in Service Data Level and select Add.

  4. Choose the column in your feed containing your image links in Image URL Column.

    If you chose Import in Step 3, the drop-down list Image URL Column displays the columns of your import stage. If you chose Intermediate in that field, the drop-down list contains your intermediate-stage columns.

    Note

    If you want to crawl multiple columns containing image links, you need to create a copy of your Productsup site and set up an Image Properties Crawler service in the copied site to crawl another link column.

  5. In User Agent, you can see the name the crawler uses to access your website. By default, the name is Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4 (productsup.io/crawler). You can modify it as needed, for example, by adding a hash at the end of the crawler name for security reasons.

    Tip

    To prevent your website from blocking the crawler, whitelist the crawler name specified in this field.

  6. In Concurrent Crawlers, choose the number of crawlers that can access your website simultaneously. By default, the number is 10.

    Warning

    If you engage more crawlers than your website can handle, the crawling process can run quicker but may cause website performance issues.

  7. In Request Timeout (seconds), you can set how long the crawler should wait for a response from your website. The expected input is a whole number of seconds.

    Use your website's maximum or average response time to enable the Image Properties Crawler service to run efficiently.

    By default, the crawler waits 10 seconds for a response before proceeding to a different link.

  8. Enter the number of days the Image Properties Crawler service should wait before recrawling a link in Expires After (days). The expected input is a whole number of days.

    The Image Properties Crawler service crawls all new or changed links in your URL column every time the site runs. The Expires After (days) field determines when the service should recrawl the links it has already crawled.

    Tip

    You can enter -1 in Expires After (days) to recrawl all image links every time your site runs in Productsup.

  9. If you want the platform to run the Image Properties Crawler service during every refresh in Data View, select the checkmark icon in Trigger during a refresh of the Data View.

  10. Select Save.

  11. To have the platform process the new data service, select Run in the top-right corner of your site's view.

    Warning

    Once the Image Properties Crawler has started to run, you can't stop it.

    The first run of this service crawls all image links in the column you selected in Step 4, so it may take a while. To minimize website performance issues, run the service for the first time at night or at another time with low customer traffic on your website.

    If you need to work in Data View during a crawl, you can speed up Data View loading times by minimizing the number of displayed products per page in the upper ribbon.

    Note

    If you can't see the columns the Image Properties Crawler added to your feed in Data View, ensure the platform hasn't hidden them:

    1. Go to Data View from your site's main menu and choose the relevant stage or export in the drop-down list on your left.

    2. Select the menu icon on the right and then select the eye icon.

    3. Find the attributes in the list that use your column prefix (___service_imagecrawler by default) and select the eye icon next to each attribute you want Data View to display.

    4. Close the pop-up menu.

    The naming of the attributes created by the Image Properties Crawler data service depends on the column prefix you chose in Step 2. The attribute names the data service generates always start with three underscores (___), which means the platform doesn't send those attributes to your export channels.
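The recrawl behavior configured in Expires After (days) can be sketched in Python as follows. This is only an illustration of the rules described in the steps above, not the platform's implementation. The function name is hypothetical, and treating a link as expired exactly when its age reaches the configured number of days is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_due_for_recrawl(last_crawl_unix: Optional[int],
                       expires_after_days: int,
                       now: Optional[datetime] = None) -> bool:
    """Decide whether a link should be crawled during the current run."""
    # New links have no previous crawl date, so they are always crawled.
    if last_crawl_unix is None:
        return True
    # -1 means every link is recrawled on every run.
    if expires_after_days == -1:
        return True
    now = now or datetime.now(timezone.utc)
    last_crawl = datetime.fromtimestamp(last_crawl_unix, tz=timezone.utc)
    # Assumption: a link expires once its age reaches the configured days.
    return now - last_crawl >= timedelta(days=expires_after_days)


# Hypothetical example: a link last crawled on 1 January, checked on 10 January.
last = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
today = datetime(2024, 1, 10, tzinfo=timezone.utc)

print(is_due_for_recrawl(last, 7, now=today))
print(is_due_for_recrawl(last, 30, now=today))
```

The last_crawl_unix value corresponds to what the service stores in ___service_imagecrawler_date. A changed link counts as a new link in this sketch because it has no previous crawl date.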

Use the crawled image metadata

Once you map the new columns created by the crawler to the relevant columns of the subsequent stages in Dataflow, you can start working with the crawled data using rule boxes:

  • Use Image Designer Template (limited) - This rule box lets you apply Image Designer templates to the images in your feed. You can create different segments in Data View for different image sizes and use this rule box to apply different Image Designer templates to different segments.

  • Skip Row If Value In - This rule box lets you skip products with unwanted values during export. If applied to ___service_imagecrawler_http_code, it can skip all products whose image links are unreachable. If applied to ___service_imagecrawler_height, ___service_imagecrawler_width, or ___service_imagecrawler_size_download, it can skip products whose image heights, widths, or file sizes don't meet channel requirements. To skip products with unwanted images, you can:

    1. Create a system column on the export stage of Dataflow by adding three underscores (___) at the beginning of the column's name.

    2. Map a relevant column created by the Image Properties Crawler to the new export column.

    3. Apply the Skip Row If Value In rule box on the export stage.

    Tip

    Your links may display a placeholder image for unavailable images, for example, one containing the text Image Not Found. In this case, the crawler receives a successful HTTP status code because the link works and contains an image. This means you can't skip these placeholder images using the rule box Skip Row If Value In with the column ___service_imagecrawler_http_code. You also can't exclude such images based on their links because the links are most likely different each time.

    You can use the column ___service_imagecrawler_md5_image to identify such cases:

    1. Go to Data View and select the needed stage in the drop-down list on the left.

    2. Find the column ___service_imagecrawler_md5_image and select Analyze.

    3. Check if any values repeat in the Distinct Values section.

    4. Select a repeating value to see all products with the same image.

    5. Once you have identified the value that belongs to the unwanted placeholder image, go to the export stage containing the attribute you mapped to ___service_imagecrawler_md5_image and apply the Skip Row If Value In rule box to that attribute.

    If you need help, contact support@productsup.com.
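If you want to check the crawled data outside the platform, for example in an exported file, the following Python sketch mirrors the checks described above. It flags rows with unreachable or undersized images and lists image hashes that occur more than once, which may point to a placeholder image. The sample rows, the helper names, and the choice to treat only 2xx status codes as reachable are assumptions made for illustration. In Productsup itself, you configure these checks with rule boxes, not with code.

```python
from collections import Counter

PREFIX = "___service_imagecrawler"  # default column prefix


def has_reachable_image(row: dict) -> bool:
    """Assumption: only 2xx HTTP status codes count as reachable."""
    code = str(row.get(f"{PREFIX}_http_code", ""))
    return code.isdigit() and 200 <= int(code) < 300


def meets_min_size(row: dict, min_width: int, min_height: int) -> bool:
    """Check the crawled width and height against minimum values."""
    try:
        width = int(row[f"{PREFIX}_width"])
        height = int(row[f"{PREFIX}_height"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or empty values fail the check
    return width >= min_width and height >= min_height


def repeated_image_hashes(rows: list, min_count: int = 2) -> dict:
    """Return image hashes shared by at least `min_count` rows."""
    hashes = (row.get(f"{PREFIX}_md5_image") for row in rows)
    counts = Counter(h for h in hashes if h)
    return {h: n for h, n in counts.items() if n >= min_count}


# Hypothetical rows; the hash values are shortened placeholders.
rows = [
    {"id": "A1", f"{PREFIX}_http_code": "200", f"{PREFIX}_width": "1200",
     f"{PREFIX}_height": "1200", f"{PREFIX}_md5_image": "aaa"},
    {"id": "A2", f"{PREFIX}_http_code": "404", f"{PREFIX}_width": "",
     f"{PREFIX}_height": "", f"{PREFIX}_md5_image": ""},
    {"id": "A3", f"{PREFIX}_http_code": "200", f"{PREFIX}_width": "600",
     f"{PREFIX}_height": "600", f"{PREFIX}_md5_image": "bbb"},
    {"id": "A4", f"{PREFIX}_http_code": "200", f"{PREFIX}_width": "600",
     f"{PREFIX}_height": "600", f"{PREFIX}_md5_image": "bbb"},
]

kept = [r["id"] for r in rows
        if has_reachable_image(r) and meets_min_size(r, 800, 800)]
print(kept)
print(repeated_image_hashes(rows))
```

A repeated hash isn't proof of a placeholder, because several products can legitimately share one image. Review the affected products before you skip them.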

Edit the Image Properties Crawler

  1. Go to Data Services from your site's main menu.

  2. Search for your data service.

  3. Select the cogwheel icon next to the desired data service and edit its settings.

  4. Select Save.

Delete the Image Properties Crawler

  1. Go to Data Services from your site's main menu.

  2. Select the cogwheel icon next to the desired data service.

  3. In the Danger Area panel, select Remove this service.

  4. Select Yes.