Skip to main content

Clean and format your data

Clean and format your data in Productsup.

Before moving on to optimizing your data, the best practice would be to clean up and format your initial data. This could be to remove HTML tags, sanitize your text, or remove block capitals.

Dealing with whitespaces in your data

Your data may have too many - or indeed too few - whitespaces in between your words which you will wish to sort out.

Removing consecutive whitespaces

If you have more than one whitespace in a row, you can remove them using the remove consecutive whitespace box. One whitespace will always be left so that the words remain separated.

  • Add the remove consecutive whitespace box

This will turn this input:

Taupe   Sandals -   silver -    leather - Size 39

Into the following output:

Taupe Sandals - silver - leather - Size 39

Note

You could also remove or replace set numbers of whitespaces using the text replace box

Removing trailing or leading whitespaces

If you have extra whitespaces at the beginning or end of your content, you can remove these using the remove spaces at the beginning and end (trim) box.

  • Add the remove consecutive whitespace box

This will turn this input:

Buy this great "Tailenders of the world unite" t-shirt!

Into the following output:

Buy this great "Tailenders of the world unite" t-shirt!

Separating words in camelCase

If you have words in camelCase (where the second part of the word starts with a capital letter) then you can split these words using the separate words box.

  • Add the separate words box

This will turn this input: brownShoes

into this output: brown Shoes

You can consider normalizing the capitalization using capitalization boxes.

Dealing with capitalization

You may wish to change the capitalization on certain data, such as by making every word start with a capitalized letter or every word in lowercase.

Removing block capitals

You can convert block capitals using the capitalize words box.

  1. Add the capitalize words box

  2. Select whether you would like to amend to

    • Convert uppercase: This Capitalizes The First Letter Of Every Word

    • Only first letter: This capitalizes the first letter of each sentence

capitalize.png

Converting capitals to ‘sentence case’

To convert block capitals into ‘sentence case’, where only the first word in the sentence is capitalized, consider using the uppercase to human box.

  • Add the uppercase to human box

This will turn this input: THESE ARE LOUD NOISES. VERY LOUD NOISES.

into this output: These are loud noises. Very loud noises.

Converting to lowercase

To convert block capitals into all lowercase, consider using the lowercase box.

  • Add the lowercase box

This will turn this input: THESE ARE LOUD NOISES. VERY LOUD NOISES.

into this output: these are loud noises. very loud noises.

Spellcheck

You can use the spellcheck box to check for misspelled words and provide suggestions for how to correct the spelling.

  1. Add the spellcheck box

  2. Select the column where your input values should come from

  3. Select the language in which your values are in

  4. You can optionally choose to not spellcheck words shorter than a certain character length

spell_check.png

Dealing with invalid GTINs

Most export channels require each product to have a valid identifier, and this is often a GTIN. You can use the filter GTINs box to detect and delete values that are not valid GTINs.

  1. Add the filter GTINs box

  2. Select one of the advanced behavior options - these will try and manipulate your data into valid GTINs before performing a final check if they are valid

    • Fill up to 14 digits: add one or more leading zeroes until the value reaches 14 digits

    • Fill up to 13 digits: add one or more leading zeroes until the value reaches 13 digits

    • Trim leading zeroes: remove all zeroes from the beginning of the value

    • Leave unchanged: the original value is kept

Decoding HTML and removing HTML tags

On occasion, you may have HTML tags or encoded information in your feed that you wish to decode and display in human-readable text.

Fixing and decoding HTML

If you have broken or encoded HTML, you can convert this to ‘normal’ characters using the fix and decode HTML box.

  • Add the fix and decode the HTML box.

When using this box, for example, you can turn

Input:

Fu&szlig, Apple &amp, Banana

into

Output:

Fuß, Apple & Banana

Fix and decode base64

If you have broken or encoded base64 tags, you can convert them to ‘normal’ characters using the fix and decode base64 box.

  • Add the fix and decode base64 box.

Removing HTML tags

You can remove any leftover HTML tags from your data using the remove HTML tags box.

  • Add the remove HTML tags box.

When using this box, for example, you can turn

Input:

<h4 class="text">Power Smoothie - start the day with a smoothie.</h4><br/>

into

Output:

Power Smoothie - start the day with a smoothie.

Removing UTF-8 control characters

If your data contains UTF-8 control characters that you wish to remove, consider using the remove UTF-8 control characters box.

  • Add the remove UTF-8 control characters box.

Note

Should not all broken control characters be removed using the above box, you could also use the following regular expression in the preg replace box to remove them:

/[\x{0080}-\x{009F}]+/u

Detecting and extracting information from a column

It may be the case that information such as the brand or color of a product is only available within another column, such as the description of your products. If you wish to isolate this information and use it in another column, you can do so using the following boxes.

Detecting brands

To detect brands and extract them to another column, you can use the detect brand box.

  1. Add the detect brand box.

  2. Input the column in which the brand information is contained.

  3. Choose the output method: whether you want to append, prepend, replace, or assign the brand you have detected.

  4. Select what should occur if there is no match.

As an example, you may wish to extract the brand from the description column and append it to the title column.

Description

title

Nike PRE Montreal Racer Trainers grey upper material: leather and textile, internal material: textile, sole: high-quality synthetics, cover sole: textile Size: 30-36

Sportswear PRE Montreal Racer Trainers

After applying the box to the title column according to the screenshot below, your title would look like this:

Sportswear PRE Montreal Racer Trainers Nike

detect_brand.png

Detecting colors

To detect colors and extract them to another column, you can use the detect color box.

  1. Add the detect color box.

  2. Input the column in which the brand information is contained.

  3. Choose the output method: whether you want to append, prepend, replace, or assign the color you have detected.

  4. Select which language your colors are in.

  5. Select what should occur if there is no match.

detect_color.png

Detecting and extracting values based on a custom list

You can also detect and extract custom values and not necessarily brand or color. To do this, you first have to upload a custom list of values that you wish to extract.

  1. Add a Standard Inclusion/Exclusion list with your custom values to extract.

  2. Add the detect and extract value box.

  3. Define the input column in which your information is contained.

  4. Select the detection list which you previously uploaded.

  5. Choose the mode, whether you want to append, prepend, replace, or assign the color you have detected.

  6. Select the match type which defines where and how the values from your list should be matched.

  7. Select which case you want to match the values in your list: sensitive or insensitive.

  8. Select what should occur if there is no match.

detect.png