Home python Filling a two-dimensional array / Vectorizing text

Filling a two-dimensional array / Vectorizing text




We need to fill in a two-dimensional array. the element with index (i, j) in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence. (sentences are read from a file, split into lists, and all words found in sentences are added to the dictionary d, where the key is a word and the ordinal is a value).

input 22 sentences, reduced to lists of the form ['in', 'comparison', 'to', 'dogs',' cats', 'have', 'not', 'undergone', 'major ',' changes', 'during', 'the', 'domestication', 'process'] the result should be a 22×253 matrix (22 as sentences, 253 as total unique words used in sentences). Words are collected in a dictionary of the form [word: index]. If a word from the dictionary occurs 2 times in 1 sentence, and its index according to the dictionary is 1, in place of the element m [1, 1] there must be 2, etc.

I created an empty matrix and ran the search, but it still remains zero, I don’t understand where the error is

m = np.zeros ((number_line, len (new_line)))
i = 0
for line in f.readlines ():
  for x in line:
    a = line.count (x)
    j = d [x]
    m [i, j] = a
  i + = 1

Answer 1, authority 100%

At the request of the author of the question I give an example of a solution with loops.

import numpy as np
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from collections import Counter
text = "" "Displays osx displays.
osx selection.
"" "
sentences = sent_tokenize (text)
tok = RegexpTokenizer ('(? u) \\ b \\ w \\ w + \\ b')
vocab = {'displays': 0, 'osx': 1, 'selection': 2}
res = np.zeros ((len (sentences), len (vocab)))
for i, s in enumerate (sentences):
  for w, cnt in Counter (w.lower () for w in tok.tokenize (s)). items ():
    if w in vocab.keys ():
      res [i, vocab [w]] = cnt


In [254]: res
Out [254]:
array ([[2., 1., 0.],
    [0., 1., 1.],
    [0., 0., 0.]])
In [255]: vocab
Out [255]: {'displays': 0, 'osx': 1, 'selection': 2}

NOTE: for real-world problems it is better to use another solution .

Answer 2, authority 97%

Use sklearn.feature_extraction.text.CountVectorizer and Pandas.SparseDataFrame .

For large texts, this will work orders of magnitude faster (compared to a solution using nested loops) and take several orders of magnitude less memory (the final data is presented as sparse (sparse ) matrices)


import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
sentences = [
 "'It's Raining Cats and Dogs",
 "Do cats like dogs or hot dogs?",
 "Cats prefer hot dogs!"
cv = CountVectorizer (stop_words = 'english')
r = pd.SparseDataFrame (cv.fit_transform (sentences),
            columns = cv.get_feature_names (),
            default_fill_value = 0)


In [201]: r
Out [201]:
  cats dogs hot like prefer raining
0 1 1 0 0 0 1
1 1 2 1 1 0 0
2 1 1 1 0 1 0

Programmers, Start Your Engines!

Why spend time searching for the correct question and then entering your answer when you can find it in a second? That's what CompuTicket is all about! Here you'll find thousands of questions and answers from hundreds of computer languages.

Recent questions