Two files are given: the first contains text, and the second contains words and the coefficients characterizing those words, with exactly one word and one coefficient per line. Roughly like this:
21230 Apple
23121 Pillow
The text files contain more than 30,000 words. The file with the words is always the same one and its contents do not change, but it has more than a million rows.
I need to get the coefficients of all words from the first file while spending as little time as possible.
At this stage I have a list (call it firstarr) of all words from the text and a list of words with coefficients (call it secarr), in which the word in row i can be accessed via
secarr[i][1]
and the coefficient via
secarr[i][0]
I have never used Python before and do not know many aspects of the language, so all I have come up with is a nested loop:
coef = 0
for word in firstarr:
    # Scan the whole list of [coefficient, word] rows for every word
    for i in range(len(secarr)):
        if word == secarr[i][1]:
            coef += int(secarr[i][0])
            break
Another idea is to do the equivalent with a membership test:
for word in firstarr:
    if word in secarr:
        coef += int(secarr[secarr.index(word)][0])
But I doubt that this shortens the search time (though I may be wrong). Are there more elegant solutions that reduce the time spent on the search?
Answer 1, Authority 100%
Searching for coefficients in a million-row list by linear scan is one of the most inefficient approaches. Looking values up in a dictionary (hash table) is the solution to your task.
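To illustrate the difference, here is a minimal sketch on made-up data. The names pairs, scan, and lookup are hypothetical and only mirror the shape of the list from the question. A list scan checks rows one by one, while a dictionary finds the key by its hash, which on average does not depend on the number of rows:

```python
# Hypothetical sample in the same shape as secarr: [coefficient, word] rows
pairs = [["21230", "Apple"], ["23121", "Pillow"]]

def scan(pairs, word):
    """Linear scan: checks rows one by one, O(n) per lookup."""
    for coef, w in pairs:
        if w == word:
            return int(coef)
    return None

# Build the hash table once: word -> coefficient
lookup = {w: int(coef) for coef, w in pairs}

print(scan(pairs, "Pillow"))  # 23121
print(lookup.get("Pillow"))   # 23121
```

Both calls return the same value, but with a million rows and 30,000 words the scan may need billions of comparisons, while the dictionary needs about one lookup per word.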
Build a dictionary (dict) from the second file and save it to a file. For example:
import pickle
# Dictionary with coefficients
d = {
    'Word1': 101,
    'Word2': 102,
    'Word3': 103,
    'Word4': 104,
    # ...
}
# The dictionary can be built from your "second" file, but for that you need to know its structure
with open('dict.pickle', 'wb') as dictfile:
    pickle.dump(d, dictfile)
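If the second file really looks like the sample in the question (a coefficient, a space, then a word on each line), the dictionary could be built roughly as follows. This is only a sketch: the function name load_coefficients, the file name words.txt, and the UTF-8 encoding are assumptions, and the demo writes its own temporary file instead of reading yours.

```python
import os
import pickle
import tempfile

def load_coefficients(path):
    """Read 'coefficient word' lines into a dict: word -> int coefficient."""
    result = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            coef, word = parts
            result[word.strip()] = int(coef)
    return result

# Demo on a temporary file holding the two sample lines from the question
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'words.txt')  # hypothetical file name
    with open(source, 'w', encoding='utf-8') as f:
        f.write('21230 Apple\n23121 Pillow\n')
    d = load_coefficients(source)

    # Save the dictionary and read it back to check the round trip
    target = os.path.join(tmp, 'dict.pickle')
    with open(target, 'wb') as dictfile:
        pickle.dump(d, dictfile)
    with open(target, 'rb') as dictfile:
        restored = pickle.load(dictfile)

print(restored)  # {'Apple': 21230, 'Pillow': 23121}
```

Since the word file never changes, this conversion only has to run once; later runs can load the pickled dictionary directly.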
Then, before calculating the coefficients, load the dictionary from the file:
import pickle
with open('dict.pickle', 'rb') as dictfile:
    secarr = pickle.load(dictfile)
coef = 0.0
for word in firstarr:
    # Dictionary lookup by key instead of scanning a list
    if word in secarr:
        coef += secarr[word]
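The same summation can be written more compactly with dict.get, which returns a default value for words that are not in the dictionary. A minimal sketch, where the contents of secarr and firstarr are made up for illustration:

```python
# Hypothetical data standing in for the unpickled dictionary and the text words
secarr = {'Apple': 21230, 'Pillow': 23121}
firstarr = ['Apple', 'Table', 'Pillow', 'Apple']

# Words missing from the dictionary contribute 0 to the total
coef = sum(secarr.get(word, 0) for word in firstarr)

print(coef)  # 65581
```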