Product comparison is one of the crucial aspects of competitive intelligence (CI). There are two modes of product comparison:
- Comparison of own product price across multiple channels
- Comparison of own product price with suitable competitor product
But, the greatest challenge in this journey is how to get the correct comparison product!!
In this article we would explore how an NLP technique, Fuzzy String Matching (FSM), can help in accomplishing the former, especially for price tracking in e-commerce. FSM is sometimes also called as Approximate String Matching.
A product is sold across multiple online channels/retailers by numerous resellers. One thing that becomes almost impossible to standardize is name of the product displayed on the website. Though it would be same product but different websites and resellers have their distinctive way of representing the product. For e.g. the product – Tomato Chilli Chutney, by Kitchens of India, can be written as
- Kitchens of India Tomato Chilli Chutney, 300gms
- Tomato Chilli Chutney-300g, Kitchens of India
- Tomato Chilli Chutney
- Chilli Tomato Chutney by Kitchens of India, 300g
As humans, we can easily state that these products are nothing but one. As humans, we have limits. It would become rather impossible to classify such cases as one, manually, when the list increase to tens of thousands for just one website. Imagine the quantum when, this matching has to be done across 10s of websites and for 100s of product categories!
In such scenarios, FSM comes quite handy with multiple string matching algorithms. Each algorithm has its own specific utility and fitment which can be verified during model building phase. Like Clustering, where categorization is done on the basis of distance between two instances, sometimes using Euclidean distance, FSM also matches strings based on distance calculating algorithms. They are categorized as
- Edit-based distance
- q-grams based distance (also called as n-grams)
- Heuristic distance
Edit-based distance algorithm, distance is the count of changes, like substitution, insertion, deletion, or even transposition of characters, to make the two strings match. E.g. to match aaa with aba, only one substitution is required, so the distance is 1. Algorithms falling in this categories are – Hamming, generalized Levenshtein, Longest Common Substring, optimal string alignment, and generalized Damerau-Levenshtein
q-grams distance is the count of q-character sized packets which are common between both the strings. Larger the count, better the match. E.g. for ‘aaa’ & ‘aba’ for q-gram with length of 2 characters, the vector for first string would be {aa} and for second (ab,ba}. Since there is no match so the distance is Zero. Algorithms in this category are q-gram, Jaccard, and cosine
Heuristic distance is more of a user based application, with no specific mathematical base. Heuristic Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other. The other advancement to Jaro is Jaro-Winkler distances.
R and Python offer multiple packages to implement FSM.
R – Stringdist, fuzzywuzzyR
Python – fuzzywuzzy, python-Levenshtein