How to determine whether a file is binary or text?

Given an arbitrary file, I need to write a program (C / C++) that determines whether it is text or binary. Is there an existing algorithm for this?

What criteria are appropriate to use?


Answer 1, Authority 100%

A text file is just a kind of binary file; only the format in which the data is recorded differs. To know that format, you need to know how the file was written. Once written, the file is just a featureless sequence of bytes to the program that reads it. No unambiguous standard has ever emerged. There are attempts, such as file extensions, to tell the reading program how the data was recorded, but these conventions differ between operating systems.

You can approach it the same way a document's encoding is detected: by analyzing the content.

  1. For 8-bit encodings, it is easy to check for non-printable characters.
  2. For UTF-8 and other multi-byte encodings, the task becomes somewhat more complicated.
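
As a rough illustration of these two cases, here is a minimal C++ sketch. The function names `has_unprintable_8bit` and `is_valid_utf8` are hypothetical, and the choice to tolerate tab, LF and CR as "printable" control characters is my assumption, not something fixed by any standard. The UTF-8 check only verifies that the bytes form well-formed sequences; it does not prove the file is meaningful text.

```cpp
#include <cstddef>
#include <string>

// 8-bit case: true if the buffer contains a control character
// other than tab, LF or CR (assumed to be acceptable in text).
bool has_unprintable_8bit(const std::string& data) {
    for (unsigned char c : data) {
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') return true;
        if (c == 0x7F) return true;  // DEL
    }
    return false;
}

// UTF-8 case: true if the buffer is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(const std::string& data) {
    const auto* p = reinterpret_cast<const unsigned char*>(data.data());
    const std::size_t n = data.size();
    // Smallest code point that legitimately needs 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len;   // total bytes in this sequence
        unsigned long cp;  // decoded code point
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence cut off at end of buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}
```

If only a prefix of the file is examined, a multi-byte sequence may be cut at the end of the buffer and rejected by this sketch, so the last few bytes of a sample probably deserve lenient treatment.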

Answer 2, Authority 14%

Thank you to everyone who responded.

On Linux there is the file command, which solves this task.
Is there an equivalent for Windows? I tried downloading the source and compiling it, but some header files were always missing.

Fine Free File Command.


Answer 3, Authority 14%

Of course, it is impossible to determine this unambiguously, but a text file will not contain characters such as #0 (a byte with value zero). Other non-printing characters are trickier: #13 and #10 are the carriage return and line feed characters that mark the end of a line.

I think the optimal algorithm would be something like this:

  1. Look for the zero byte.
  2. Look for some non-printing characters (but not all of them!).
  3. Look at the number of characters such as space and line feed + carriage return; their frequency differs from that of other characters.
  4. Heavy artillery: look at the ratio of different byte values. In sufficiently large binary files the distribution is roughly uniform, whereas in text files some letters occur more often.

If you are not dealing with Unicode, the results should be very good (~95%).
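
The checks listed in this answer could be sketched in C++ roughly as follows. This is only an illustration under my own assumptions: the names `byte_entropy`, `looks_like_text` and `read_prefix` are hypothetical, and the thresholds (5% suspicious control bytes, 7.5 bits of entropy per byte, a 4096-byte minimum for the distribution test, a 64 KiB sample) are guesses that would need tuning on real files. Step 3 is left out because the answer gives no concrete ratio to test against, and the "roughly uniform distribution" of step 4 is approximated here with Shannon entropy.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <string>

// Shannon entropy of the byte distribution, in bits per byte (0..8).
double byte_entropy(const std::string& data) {
    if (data.empty()) return 0.0;
    std::array<std::size_t, 256> counts{};
    for (unsigned char c : data) ++counts[c];
    double h = 0.0;
    for (std::size_t n : counts) {
        if (n == 0) continue;
        const double p = static_cast<double>(n) / data.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Heuristic guess: true if the buffer looks like non-Unicode text.
bool looks_like_text(const std::string& data) {
    if (data.empty()) return true;  // assumption: treat an empty file as text
    std::size_t suspicious = 0;
    for (unsigned char c : data) {
        if (c == 0) return false;  // step 1: a zero byte means binary
        const bool allowed = (c == '\t' || c == '\n' || c == '\r' || c == '\f');
        if ((c < 0x20 && !allowed) || c == 0x7F) ++suspicious;  // step 2
    }
    // Assumed threshold: more than 5% odd control bytes means binary.
    if (suspicious * 100 > data.size() * 5) return false;
    // Step 4: on a large enough sample, a nearly flat byte distribution
    // (entropy close to 8 bits per byte) suggests binary data.
    if (data.size() >= 4096 && byte_entropy(data) > 7.5) return false;
    return true;
}

// Reads at most `limit` bytes from the start of a file in binary mode.
// Returns an empty string if the file cannot be opened.
std::string read_prefix(const std::string& path, std::size_t limit = 65536) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(limit, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(limit));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

A caller would typically write something like `looks_like_text(read_prefix("some_file"))`, where the path is a placeholder. Since an unreadable file yields an empty buffer here, real code should check that the file opened before trusting the result.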
