I am trying to read the ports file from IANA. It is stored in UTF-8 encoding without a BOM.
But on one of the lines, the readline() function fails with this error:
'charmap' codec can't decode byte 0x98 in position 7938: character maps to <undefined>
The line in the file looks like this:
# Jim Harlan <jimh&infowest.com>
What workaround can I use for this? Or is there a proper solution?
UPD
As a workaround, deleting this line will do (for some reason it is only this one), but only while debugging, because later the partners will tear my hair out. Here is the code I use for this operation:
try:
    file = open(path, 'r')
    while True:
        line = file.readline()
        if (not line):
            break
        print(line)
finally:
    file.close()
Answer 1, authority 100%
Try using the codecs module from the standard library:
import codecs
fileObj = codecs.open("someFilePath", "r", "utf_8_sig")
text = fileObj.read()  # or read line by line
fileObj.close()
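A minimal sketch of what the utf_8_sig codec does, assuming a throwaway temporary file (the helper read_text and the sample contents are made up for illustration): it strips a leading BOM if one is present and otherwise behaves like plain utf-8, so it should also be safe for a file saved without a BOM.

```python
import os
import tempfile

def read_text(data: bytes, encoding: str) -> str:
    """Write raw bytes to a temporary file and read them back as text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)

BOM = b"\xef\xbb\xbf"                       # UTF-8 byte order mark
payload = "# Jim Harlan\n".encode("utf-8")  # hypothetical sample line

with_bom_plain = read_text(BOM + payload, "utf-8")    # BOM survives as '\ufeff'
with_bom_sig = read_text(BOM + payload, "utf_8_sig")  # BOM is stripped
no_bom_sig = read_text(payload, "utf_8_sig")          # nothing to strip
```

Since the file in the question has no BOM, utf_8_sig and utf-8 would be expected to give the same result here; the key point is that some explicit UTF-8 encoding is passed.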
Answer 2, authority 100%
To read a UTF-8 encoded text file in Python, you can use the io.open() function, which is available as the built-in open() in Python 3:
#!/usr/bin/env python
import io
with io.open(path, encoding='utf-8') as file:
    for line in file:
        process(line)
If the file contains encoding errors (the encoding itself is correct, but a few bytes are malformed), you can pass an error handler such as errors='ignore' (or another value as appropriate).
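To illustrate the error handlers, here is a small sketch that decodes a made-up byte string containing one invalid byte. The sample data and the helper decode_with are hypothetical and not taken from the IANA file; io.TextIOWrapper is used only to avoid touching the disk, and open() accepts the same errors argument.

```python
import io

# Hypothetical sample: ASCII text with one stray byte (0x98) that is not valid UTF-8.
raw = b"Jim \x98Harlan\n"

def decode_with(errors: str) -> str:
    """Decode the sample through a text wrapper using the given error handler."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors=errors) as f:
        return f.read()

ignored = decode_with("ignore")    # the bad byte is silently dropped
replaced = decode_with("replace")  # the bad byte becomes U+FFFD

try:
    decode_with("strict")          # the default handler raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Note that 'ignore' loses data without any trace, so 'replace' is often the safer choice when you want to see where the damage is.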
Do not use codecs.open(), which may not work correctly with universal newlines mode.
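The newline difference can be seen in a short sketch, assuming a temporary file with Windows-style line endings (the contents are made up). codecs.open() opens the underlying file in binary mode, so \r\n is left as is, while io.open() translates it to \n by default.

```python
import codecs
import io
import os
import tempfile
import warnings

fd, path = tempfile.mkstemp()
try:
    # Hypothetical sample with Windows line endings.
    with os.fdopen(fd, "wb") as f:
        f.write(b"first\r\nsecond\r\n")

    # io.open() in text mode translates '\r\n' to '\n' by default.
    with io.open(path, encoding="utf-8") as f:
        io_lines = f.readlines()

    with warnings.catch_warnings():
        # codecs.open() may emit a DeprecationWarning on recent Python versions.
        warnings.simplefilter("ignore", DeprecationWarning)
        # codecs.open() leaves the line endings untouched.
        with codecs.open(path, "r", "utf-8") as f:
            codecs_lines = f.readlines()
finally:
    os.remove(path)
```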
You don't need to change your console code page to cp65001 to read a UTF-8 file.
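The console code page does not affect how open() decodes a file; what matters is the encoding argument, and when it is omitted Python falls back to a locale-dependent default. The sketch below reproduces the error message from the question under the assumption that this default was cp1251 (a guess, the question does not say which code page was in use), where byte 0x98 has no mapping.

```python
import locale

# The encoding open() typically falls back to when none is given; it varies by system.
default_encoding = locale.getpreferredencoding(False)

# U+00D8 is one example of a character whose UTF-8 form (C3 98) contains byte 0x98.
utf8_bytes = "\u00d8".encode("utf-8")

try:
    utf8_bytes.decode("cp1251")  # assumed legacy default, see the note above
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the file's real encoding works regardless of the console code page.
decoded = utf8_bytes.decode("utf-8")
```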
If you want to print Unicode to the Windows console, see How to output a Unicode string from Python to the Windows console?
Answer 3, authority 25%
I kept running into this error over and over again. I found the solution here.
import codecs
file = codecs.open("yourFile", "r", "utf-8")
data = file.read()
file.close()
Also run this on the command line:
chcp 65001
These simple steps solved the problem.
Answer 4, authority 12%
import codecs
file = codecs.open(path, encoding='utf-8', mode='r')