Extract product data from the crawled code with rule boxes

After running the Data Crawler service in Productsup, you need to use rule boxes to extract your product data from the crawled code.

Introduction

This doc explains how you can extract product data from the code you crawled using the Data Crawler service. See Crawl product landing pages with the Data Crawler to set up the Data Crawler service.

Tip

Extracting information from the crawled HTML code is a technically advanced task. If you require assistance, you may turn to your in-house developers, reach out to your Customer Success Manager, or contact support@productsup.com.

Once you have set up and run the Data Crawler service, you should extract the needed information from the HTML code you crawled. You can use the following rule boxes to do it:

  • HTML getElementById - This rule box lets you extract the crawled data using the IDs defined in the tags in the source code. See Extract data via tag IDs for more information.

  • HTML getElementByTagName - This rule box lets you extract the crawled data using name attributes defined in the tags in the source code. See Extract data via tag names for more information.

  • HTML getElementByXpath - This rule box lets you extract the crawled data using the path to the relevant elements of the source code. See Extract data via Xpath for more information.

  • Split strings - These rule boxes let you extract the crawled data by splitting a string into parts at a chosen splitter and keeping only the parts you need. See Extract data via split strings for more information.

  • Regex - These rule boxes let you extract specific information from your crawled data using regular expressions (regex) that define advanced search patterns. See Extract data via regex for more information.

Prerequisites

To prepare for extracting relevant data from the crawled HTML code, you need to:

  1. Set up and run the Data Crawler service. See Crawl product landing pages with the Data Crawler for more information.

  2. Map the ___service_datacrawler_data column containing the source code to an intermediate or export column where you can apply the needed rule boxes.

  3. Identify the elements in the crawled HTML code that you need to extract data from:

    Note

    If you plan to use the rule boxes HTML getElementById, HTML getElementByTagName, or HTML getElementByXpath, you must take the steps outlined in the following list. Split string and regex rule boxes don't require these steps.

    1. Locate the data you want to extract on one of your crawled product pages.

    2. Use the functionality provided by your browser to inspect the element of the page containing the needed data and access its source code.

    3. Find the tag that stores the data in your source code and copy the tag's name, ID, or Xpath, depending on the rule box you plan to use. Here are examples of these elements:

      1. A tag's name - In the following code snippet, the tag <meta> has a name attribute containing the value description.

        <meta name="description" content="Crawl your landing pages and extract additional product data from your website to enhance your feed with the Data Crawler data service in Productsup.">
        
      2. A tag's ID - In the following code snippet, the tag <div> has an id attribute containing the value top-pager.

        <div id="top-pager">
            <ul class="pager">
                <li class="previous">...</li>
                <li class="next">...</li>
            </ul>
        </div>

      3. Xpath - An Xpath is the full path to a chosen element in the code structure. Once you find the needed tag in the developer's panel of your browser, you can open the context menu to copy its full Xpath. For example, the first <li> tag in the previous code snippet has the following Xpath: /div/ul/li[1].

        Tip

        Sometimes the crawled code of your product page can differ from the code of that page accessed on your live website in the developer's panel of your browser. The Xpaths may vary as well. To get the Xpath to the needed element in your crawled code, you can:

        1. Copy your crawled code and paste it into a local file on your computer.

        2. Save the file as an HTML file, that is, with the .html extension.

        3. Open the HTML file in your browser and proceed using the developer's panel.
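If you prefer to script the three steps from the previous tip, the following minimal Python sketch shows one way to do it. It isn't part of Productsup: the function name save_crawled_html, the file name crawled_page, and the sample code are hypothetical, and the sketch assumes you have already copied the crawled code into a string.

```python
import tempfile
from pathlib import Path


def save_crawled_html(crawled_code: str, target: Path) -> Path:
    """Write the crawled code to a local .html file and return its path."""
    # Force the .html extension so that the browser renders the file as a page.
    target = target.with_suffix(".html")
    target.write_text(crawled_code, encoding="utf-8")
    return target


# Hypothetical sample standing in for the code copied from the crawled column.
sample_code = '<div id="top-pager"><ul class="pager"><li class="previous">...</li></ul></div>'
saved_file = save_crawled_html(sample_code, Path(tempfile.gettempdir()) / "crawled_page")

# To inspect the file, open it in your browser, for example:
# import webbrowser; webbrowser.open(saved_file.as_uri())
```

Once the file is open in your browser, you can copy the Xpath from the developer's panel as described in the steps above.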

Extract data via tag IDs

If the data you want to extract from the code has a specific ID defined in its tag, you can extract this data using the rule box HTML getElementById.

You can add this rule box both in Data View and Dataflow. Here is how to add it in Data View:

  1. Go to Data View from the site's main menu.

  2. Choose the needed export channel or the intermediate stage in the drop-down menu on the left.

  3. Select Edit in the column of the attribute where you want to apply the rule box.

  4. Select the Add Box drop-down menu.

  5. Search for and select the HTML getElementById rule box.

  6. In ID, enter the id value of the relevant tag in your page's source code, as discussed in step 3 of Prerequisites.

  7. Select Save.

Tip

After extracting the needed element using this rule box, you may need to drill further into your crawled data. You can use split string and regex rule boxes to do so. See Extract data via split strings and Extract data via regex for more information.
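To check locally which part of the code a given ID selects before you enter it in the rule box, you can try a sketch like the following. It uses Python's standard xml.etree.ElementTree module, which only parses well-formed markup, so it may fail on real-world HTML. The sample code is based on the snippet in Prerequisites with a hypothetical <body> wrapper, and the sketch only illustrates an ID lookup. It doesn't describe how the rule box works internally or what exactly it returns.

```python
import xml.etree.ElementTree as ET

# Sample markup based on the snippet in Prerequisites, wrapped in <body>.
crawled_code = """
<body>
  <div id="top-pager">
    <ul class="pager">
      <li class="previous">...</li>
      <li class="next">...</li>
    </ul>
  </div>
</body>
""".strip()

root = ET.fromstring(crawled_code)

# Find the first element, at any depth, whose id attribute equals "top-pager".
element = root.find(".//*[@id='top-pager']")

# Serialize the matched element to see which part of the code the ID selects.
extracted = ET.tostring(element, encoding="unicode")
print(extracted)
```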

Extract data via tag names

If the data you want to extract from the code has a specific name defined in its tag, you can extract this data using the rule box HTML getElementByTagName.

You can add this rule box both in Data View and Dataflow. Here is how to add it in Data View:

  1. Go to Data View from the site's main menu.

  2. Choose the Intermediate stage or the desired export channel from the drop-down menu in the upper ribbon.

  3. Select Edit in the column of the attribute where you want to apply the rule box.

  4. Select the Add Box drop-down menu.

  5. Search for and select the HTML getElementByTagName rule box.

  6. In Tagname, enter the name value of the relevant tag in your page's source code, as discussed in step 3 of Prerequisites.

  7. In Occurance, specify the number of times the platform should extract data if the crawled code contains multiple matches of the defined tag name.

  8. Select Save.

Tip

After extracting the needed element using this rule box, you may need to drill further into your crawled data. You can use split string and regex rule boxes to do so. See Extract data via split strings and Extract data via regex for more information.
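To see how many tags in your crawled code carry a given name attribute, you can run a local check such as the following Python sketch. It uses the standard html.parser module. The class NameAttributeFinder and the shortened sample code are hypothetical, and the sketch assumes you want the content attribute of the matching <meta> tag. What the rule box itself returns isn't specified here, so treat this only as a way to count and inspect the occurrences.

```python
from html.parser import HTMLParser


class NameAttributeFinder(HTMLParser):
    """Collect the attributes of every tag whose name attribute matches."""

    def __init__(self, wanted_name: str) -> None:
        super().__init__()
        self.wanted_name = wanted_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for the current tag.
        attributes = dict(attrs)
        if attributes.get("name") == self.wanted_name:
            self.matches.append(attributes)


crawled_code = (
    '<head><meta name="description" content="Crawl your landing pages.">'
    '<meta name="robots" content="index"></head>'
)

finder = NameAttributeFinder("description")
finder.feed(crawled_code)

# Each match is one occurrence of a tag with name="description".
first_match = finder.matches[0]
print(len(finder.matches), first_match["content"])
```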

Extract data via Xpath

If you know the Xpath leading to the element that stores your data in the crawled code, you can extract this data using the rule box HTML getElementByXpath.

You can add this rule box both in Data View and Dataflow. Here is how to add it in Data View:

  1. Go to Data View from the site's main menu.

  2. Choose the needed export channel or the intermediate stage in the drop-down menu on the left.

  3. Select Edit in the column of the attribute where you want to apply the rule box.

  4. Select the Add Box drop-down menu.

  5. Search for and select the HTML getElementByXpath rule box.

  6. In XPath, enter the path to the relevant element of your page's source code, as discussed in step 3 of Prerequisites.

  7. In Occurance, specify the number of times the platform should extract data if the crawled code contains multiple matches of the defined path.

  8. Select Save.

Tip

After extracting the needed element using this rule box, you may need to drill further into your crawled data. You can use split string and regex rule boxes to do so. See Extract data via split strings and Extract data via regex for more information.
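To test a path against a copy of your crawled code before entering it in the rule box, you can use a sketch like the following. It relies on Python's standard xml.etree.ElementTree module, which supports only a limited subset of Xpath and requires well-formed markup, so the platform may evaluate some paths differently. The sample code reuses the snippet from Prerequisites.

```python
import xml.etree.ElementTree as ET

crawled_code = (
    '<div id="top-pager">'
    '<ul class="pager">'
    '<li class="previous">...</li>'
    '<li class="next">...</li>'
    "</ul>"
    "</div>"
)

# The parsed root is the <div> element itself, so the full Xpath
# /div/ul/li[1] becomes the relative path ul/li[1] in ElementTree.
root = ET.fromstring(crawled_code)
first_item = root.find("ul/li[1]")

# Without the position, the path matches every <li> inside the <ul>.
all_items = root.findall("ul/li")

print(first_item.get("class"))
print([item.get("class") for item in all_items])
```

Comparing the single match with the list of all matches shows whether your path is specific enough or matches several elements.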

Extract data via split strings

To split the crawled code into parts and keep only the parts you need, you can use the following split string rule boxes:

  • Split String - This rule box splits a string into parts and removes the unneeded data.

  • Split String for PLA - This rule box splits a string into parts and removes the unneeded data. If the rule box finds no splitter character in a string, it empties the string.

  • Split String & Filter - This rule box splits a string into parts, removes the unneeded data, and trims the length of your strings according to a character limit.

  • Split String and Count Items - This rule box splits a string into parts, counts them, and replaces the string with the number of data parts found in it.

See Split strings for more information on each rule box.
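The following Python sketch illustrates the general idea behind three of the behaviors described in the list above: keeping one part, emptying the string when no splitter is found, and counting parts. The function names, the splitter, and the sample value are hypothetical. The actual options of each rule box are covered in Split strings, so this is only an approximation of the described behavior.

```python
def split_and_keep(value: str, splitter: str, index: int) -> str:
    """Split value on splitter and keep only the part at position index."""
    parts = value.split(splitter)
    # Return an empty string if the requested position doesn't exist.
    return parts[index] if -len(parts) <= index < len(parts) else ""


def split_or_empty(value: str, splitter: str, index: int) -> str:
    """Like split_and_keep, but empty the string if the splitter is absent."""
    if splitter not in value:
        return ""
    return split_and_keep(value, splitter, index)


def split_and_count(value: str, splitter: str) -> int:
    """Split value on splitter and return the number of parts."""
    return len(value.split(splitter))


crawled_value = "Home > Shoes > Sneakers"

last_category = split_and_keep(crawled_value, " > ", 2)
no_splitter = split_or_empty("Sneakers", " > ", 0)
part_count = split_and_count(crawled_value, " > ")
```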

Extract data via regex

If you need to search your crawled data for information that matches specific search patterns defined using regex, you can use the following regex rule boxes:

  • Preg Replace - This rule box lets you search your data for regex matches and replace the matches with values of your choice.

  • Preg Match - This rule box lets you search your data for a regex match, preserve the matching part of the data, and remove the rest of the string. The Preg Match rule box stops scanning a string as soon as it finds a match, so the rule box saves only the first match if there are multiple ones in a string.

  • Preg Match All - This rule box has the same functionality as the Preg Match rule box, but it lets the platform find and preserve multiple regex matches within a string.

  • Set Value if Match (Regex) - This rule box lets you specify the values of one attribute based on the results of a regex search performed in a different attribute.

See Regular expressions for more information on each rule box.
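To try out a pattern on a sample of your crawled data first, you can use a sketch like the following. It uses Python's standard re module, whose syntax is similar but not identical to the regex flavor used by the platform, so verify the final pattern in the rule box itself. The sample value and both patterns are hypothetical.

```python
import re

crawled_value = (
    '<span class="price">EUR 19.99</span>'
    '<span class="price">EUR 24.50</span>'
)
price_pattern = r"\d+\.\d{2}"

# First match only, comparable to the described Preg Match behavior.
first = re.search(price_pattern, crawled_value)
first_price = first.group(0) if first else ""

# Every match, comparable to the described Preg Match All behavior.
all_prices = re.findall(price_pattern, crawled_value)

# Replace matches, comparable to the described Preg Replace behavior:
# here every HTML tag is replaced with a single space.
without_tags = re.sub(r"<[^>]+>", " ", crawled_value).strip()

print(first_price, all_prices, without_tags)
```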