The task of the function is to compare the input text with the existing one and, if it has changed, signal it . Such a task arose when writing a bot that notifies about news on a specific page of the site. The fact is that the text on a (someone else’s) site is updated manually, and therefore it often happens that the text has been updated, but the author decided to correct a few words, spelling errors, etc. I used to stupidly compare 2 texts, whether they are the same or not, but now I want to make it more or less civilized so that the trigger does not work if a comma is simply added in the new text. So far, I do it this way:
def findMessageOverlap (new_message, old_message): overlap_count = 0 for word in new_message.strip (''): if word in old_message: overlap_count + = 1 overlap_percentage = (100 * overlap_count) / len (old_message.strip ('')) # if overlap percentage is less than 50%, there is new message on the site if overlap_percentage & lt; 50: return overlap_percentage
I convert the difference in the words of the new text into percentages, and if the text differs by more than 50% from the old one, then I signal it. How would you experienced developers solve this problem? Because I’m just a beginner.
Answer 1, authority 100%
An interesting problem) There are a lot of discussions on it in the theory of algorithms.
As examples of solutions:
The most common approach for your task (it is just used to detect plagiarism).
This is the minimum number of operations to move from one line to another.
libraries to help:
in particular, look towards SequenceMatcher . This class will help you get rid of predefined “noise” (commas, tabs, unnecessary parts of speech)
library for implementation of the Levenshtein Distance search algorithm
PS: I apologize for not providing a clear solution for your task.
But it is abstract enough to just insert a piece of code that will help you out. But in the links I have indicated, you will quite find the building blocks for its solution.