Home python What technique should you use to parse mutable text (in Python)?

What technique should you use to parse mutable text (in Python)?




The task of the function is to compare the input text with the existing one and, if it has changed, signal it . Such a task arose when writing a bot that notifies about news on a specific page of the site. The fact is that the text on a (someone else’s) site is updated manually, and therefore it often happens that the text has been updated, but the author decided to correct a few words, spelling errors, etc. I used to stupidly compare 2 texts, whether they are the same or not, but now I want to make it more or less civilized so that the trigger does not work if a comma is simply added in the new text. So far, I do it this way:

def findMessageOverlap (new_message, old_message):
  overlap_count = 0
  for word in new_message.strip (''):
    if word in old_message:
      overlap_count + = 1
  overlap_percentage = (100 * overlap_count) / len (old_message.strip (''))
  # if overlap percentage is less than 50%, there is new message on the site
  if overlap_percentage & lt; 50:
    return overlap_percentage

I convert the difference in the words of the new text into percentages, and if the text differs by more than 50% from the old one, then I signal it. How would you experienced developers solve this problem? Because I’m just a beginner.

Answer 1, authority 100%

An interesting problem) There are a lot of discussions on it in the theory of algorithms.
As examples of solutions:

Shingle Algorithm

The most common approach for your task (it is just used to detect plagiarism).

python implementations

python implementation part2

Levenshtein distance

This is the minimum number of operations to move from one line to another.

python implementation

libraries to help:

  1. difflib

in particular, look towards SequenceMatcher . This class will help you get rid of predefined “noise” (commas, tabs, unnecessary parts of speech)

  1. python-Levenshtein

library for implementation of the Levenshtein Distance search algorithm

PS: I apologize for not providing a clear solution for your task.
But it is abstract enough to just insert a piece of code that will help you out. But in the links I have indicated, you will quite find the building blocks for its solution.

Programmers, Start Your Engines!

Why spend time searching for the correct question and then entering your answer when you can find it in a second? That's what CompuTicket is all about! Here you'll find thousands of questions and answers from hundreds of computer languages.

Recent questions