We need to fill in a two-dimensional array. the element with index (i, j)
in this matrix must be equal to the number of occurrences of the j-th word in the i-th sentence. (sentences are read from a file, split into lists, and all words found in sentences are added to the dictionary d, where the key is a word and the ordinal is a value).
input 22 sentences, reduced to lists of the form ['in', 'comparison', 'to', 'dogs',' cats', 'have', 'not', 'undergone', 'major ',' changes', 'during', 'the', 'domestication', 'process']
the result should be a 22×253 matrix (22 as sentences, 253 as total unique words used in sentences). Words are collected in a dictionary of the form [word: index]. If a word from the dictionary occurs 2 times in 1 sentence, and its index according to the dictionary is 1, in place of the element m [1, 1]
there must be 2, etc.
I created an empty matrix and ran the search, but it still remains zero, I don’t understand where the error is
m = np.zeros ((number_line, len (new_line)))
i = 0
for line in f.readlines ():
for x in line:
a = line.count (x)
j = d [x]
m [i, j] = a
i + = 1
Answer 1, authority 100%
At the request of the author of the question I give an example of a solution with loops.
import numpy as np
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from collections import Counter
text = "" "Displays osx displays.
osx selection.
Nothing!
"" "
sentences = sent_tokenize (text)
tok = RegexpTokenizer ('(? u) \\ b \\ w \\ w + \\ b')
vocab = {'displays': 0, 'osx': 1, 'selection': 2}
res = np.zeros ((len (sentences), len (vocab)))
for i, s in enumerate (sentences):
for w, cnt in Counter (w.lower () for w in tok.tokenize (s)). items ():
if w in vocab.keys ():
res [i, vocab [w]] = cnt
Result:
In [254]: res
Out [254]:
array ([[2., 1., 0.],
[0., 1., 1.],
[0., 0., 0.]])
In [255]: vocab
Out [255]: {'displays': 0, 'osx': 1, 'selection': 2}
NOTE: for real-world problems it is better to use another solution .
Answer 2, authority 97%
Use sklearn.feature_extraction.text.CountVectorizer and Pandas.SparseDataFrame .
For large texts, this will work orders of magnitude faster (compared to a solution using nested loops) and take several orders of magnitude less memory (the final data is presented as sparse (sparse
) matrices)
Example:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
sentences = [
"'It's Raining Cats and Dogs",
"Do cats like dogs or hot dogs?",
"Cats prefer hot dogs!"
]
cv = CountVectorizer (stop_words = 'english')
r = pd.SparseDataFrame (cv.fit_transform (sentences),
columns = cv.get_feature_names (),
default_fill_value = 0)
Result:
In [201]: r
Out [201]:
cats dogs hot like prefer raining
0 1 1 0 0 0 1
1 1 2 1 1 0 0
2 1 1 1 0 1 0