There is a lot of contradictory information about the performance of float and double on the x86-64 platform, and I would like to get to the bottom of it. Since this question is hard to answer unequivocally, and in practice people usually just use double precision, I propose considering the situations in which it is genuinely worth using float instead of double.
Answer 1, Authority 100%
tl;dr: float is, as expected, faster than double, so if you work with large amounts of data and float gives you enough accuracy, choose float. If float's accuracy is not enough, your choice is simple: double. If you have no particular requirements at all, choose either; you will not see the difference.

As a participant in the dispute mentioned above, I decided to write an answer. To understand what the performance would be, I first studied a little theory, for which I wrote the following code:
#include <cstddef>

int main()
{
    volatile double darray[] = {5.234234, 2.2143213, 3.214212, 4.123155};
    volatile float farray[] = {5.234234f, 2.2143213f, 3.214212f, 4.123155f};
    volatile double dres = 0.0;
    volatile float fres = 0.0f;
    for (size_t i = 0; i < 4; ++i)
        dres += darray[i];
    for (size_t i = 0; i < 4; ++i)
        fres += farray[i];
    fres = 0.0f;
}
which gives us the following assembly (GCC):
        mov     rax, QWORD PTR [rbp-96]
        movsd   xmm0, QWORD PTR [rbp-64+rax*8]
        movsd   xmm1, QWORD PTR [rbp-104]
        addsd   xmm1, xmm0
        movq    rax, xmm1
        mov     QWORD PTR [rbp-104], rax
        add     QWORD PTR [rbp-96], 1
.L2:
        cmp     QWORD PTR [rbp-96], 3
        jbe     .L3
        mov     QWORD PTR [rbp-88], 0
        jmp     .L4
.L5:
        mov     rax, QWORD PTR [rbp-88]
        movss   xmm0, DWORD PTR [rbp-80+rax*4]
        movss   xmm1, DWORD PTR [rbp-108]
        addss   xmm1, xmm0
        movd    eax, xmm1
        mov     DWORD PTR [rbp-108], eax
        add     QWORD PTR [rbp-88], 1
.L4:
        cmp     QWORD PTR [rbp-88], 3
        jbe     .L5
This is not the entire output, but there is enough information here. Two instructions interest us: addss and addsd, the scalar SSE instructions for adding float (the first) and double (the second). My first thought was to look in a manual; maybe it says there which one is faster? Such a manual exists, but a quick inspection showed that I would not find the answer there: judging by the manual, these instructions should execute equally fast. Fine. Let us leave this path and try to build the previous code with AVX2 in Visual Studio; we get the following assembly:
; 6    :
; 7    :     volatile double dres = 0.0;
; 8    :     volatile float fres = 0.0f;
; 9    :     for (size_t i = 0; i < 4; ++i)
        xor     eax, eax
        vxorps  xmm2, xmm2, xmm2
        vmovsd  QWORD PTR dres$[rsp], xmm0
        vmovss  DWORD PTR fres$[rsp], xmm2
        mov     ecx, eax
        npad    9
$LL4@main:
; 10   :         dres += darray[i];
        vmovsd  xmm1, QWORD PTR darray$[rsp+rcx*8]
        vmovsd  xmm0, QWORD PTR dres$[rsp]
        inc     rcx
        vaddsd  xmm1, xmm1, xmm0
        vmovsd  QWORD PTR dres$[rsp], xmm1
        cmp     rcx, 4
        jb      SHORT $LL4@main
        npad    1
$LL7@main:
; 11   :     for (size_t i = 0; i < 4; ++i)
; 12   :         fres += farray[i];
        vmovss  xmm1, DWORD PTR farray$[rsp+rax*4]
        vmovss  xmm0, DWORD PTR fres$[rsp]
        inc     rax
        vaddss  xmm1, xmm1, xmm0
        vmovss  DWORD PTR fres$[rsp], xmm1
        cmp     rax, 4
        jb      SHORT $LL7@main
The code has hardly changed, except that the operations are now called vaddsd and vaddss. I did not dig into the manual for these instructions; I assume the situation there is similar to what we have already seen.
So let us take another route: we know that float is 32-bit, while double is 64-bit. This inevitably has to affect performance; the only question is how. My knowledge of SIMD instructions is very limited, so I do not understand why neither GCC nor Visual Studio used any packed instructions for adding the numbers. Can anyone tell me why? I had already decided that there were no such instructions, but this article claims that there are: vaddpd and vaddps. Both take 256-bit arguments, i.e. one such operation can add 8 floats or 4 doubles at once. Now that is something: float, by virtue of its smaller size, should indeed be faster, and here we have found a mechanism by which that is actually so.
Another important factor that can put float ahead is its smaller impact on the cache: since it is half the size, the load on the cache is lower. Thus, without digging any deeper, we arrive at the conclusion that, in general, comes to mind immediately: float is faster than double.
It remains to check this in practice; for that we use the following code:
#include <iostream>
#include <vector>
#include <numeric>
#include <chrono>
#include <algorithm>
#include <string>

int main()
{
    const size_t size = 1'000'000'000;
    std::vector<double> dvector(size, 2.2143213);
    std::vector<float> fvector(size, 2.2143213f);
    auto start = std::chrono::high_resolution_clock::now();
    volatile double dres = std::accumulate(dvector.begin(), dvector.end(), 0.0);
    auto doubleElapsed = (std::chrono::high_resolution_clock::now() - start).count();
    start = std::chrono::high_resolution_clock::now();
    volatile float fres = std::accumulate(fvector.begin(), fvector.end(), 0.0f);
    auto floatElapsed = (std::chrono::high_resolution_clock::now() - start).count();
    std::cout << "float elapsed: " << floatElapsed << "\n";
    std::cout << "double elapsed: " << doubleElapsed << "\n";
    float ratio = std::max<float>(floatElapsed, doubleElapsed) /
                  std::min<float>(floatElapsed, doubleElapsed);
    std::string relation = floatElapsed < doubleElapsed ?
        std::string("faster") : std::string("slower");
    std::cout << "float is " << ratio << " times " << relation << "!\n";
}
On my PC (Haswell), this code built with Visual Studio 2015 with AVX2 gives float a stable advantage of 1.2-1.3x; there are much higher peak values, but I did not pay attention to them. Even without AVX2 (I tried different options) everything looks about the same.

Of course, the measurements are quite simple and the argument quite superficial (I did not set out to do a full-fledged study; I do not currently have time for that), but even this shows that people claiming you should choose double because double is faster than float are not right.
And one more test, where I used intrinsics to compute the sum (I may not have used them in the best way, but this is the extent of my knowledge):
#include <immintrin.h>
#include <iostream>
#include <vector>
#include <numeric>
#include <chrono>
#include <algorithm>
#include <string>

float accumulate(const std::vector<float>& vec)
{
    // start from zero: an undefined initial accumulator would corrupt the sum
    __m256 res = _mm256_setzero_ps();
    for (size_t i = 0; i < vec.size(); i += 8)
    {
        // unaligned load: vector storage is not guaranteed to be 32-byte aligned
        __m256 m1 = _mm256_loadu_ps(&vec[i]);
        res = _mm256_add_ps(m1, res);
    }
    float out[8];
    _mm256_storeu_ps(out, res);
    return std::accumulate(std::begin(out), std::end(out), 0.0f);
}

double accumulate(const std::vector<double>& vec)
{
    __m256d res = _mm256_setzero_pd();
    for (size_t i = 0; i < vec.size(); i += 4)
    {
        __m256d m1 = _mm256_loadu_pd(&vec[i]);
        res = _mm256_add_pd(m1, res);
    }
    double out[4];
    _mm256_storeu_pd(out, res);
    return std::accumulate(std::begin(out), std::end(out), 0.0);
}
int main()
{
    const size_t size = 1'000'000;
    std::vector<double> dvector(size, 2.2143213);
    std::vector<float> fvector(size, 2.2143213f);
    auto start = std::chrono::high_resolution_clock::now();
    volatile double dres = accumulate(dvector);
    auto doubleElapsed = (std::chrono::high_resolution_clock::now() - start).count();
    start = std::chrono::high_resolution_clock::now();
    volatile float fres = accumulate(fvector);
    auto floatElapsed = (std::chrono::high_resolution_clock::now() - start).count();
    std::cout << "float elapsed: " << floatElapsed << "\n";
    std::cout << "double elapsed: " << doubleElapsed << "\n";
    float ratio = std::max<float>(floatElapsed, doubleElapsed) /
                  std::min<float>(floatElapsed, doubleElapsed);
    std::string relation = floatElapsed < doubleElapsed ?
        std::string("faster") : std::string("slower");
    std::cout << "float is " << ratio << " times " << relation << "!\n";
}
With this code, on the same machine, I get a 2.3-2.5x advantage for float.
Answer 2, Authority 35%
And what does performance even mean in this context? SIMD itself tells us clearly that any vector holds more floats, so any vector operation on float will always be faster than the same operation on double, if you rely only on the quantitative characteristics of the algorithms. There is nothing to compare in such a context.
If you still want to compare, you can take the code from this answer, changing the operand types to double/float respectively (and the instructions with _mm_cmpgt_epi32 to _mm_cmpgt_pd/_mm_cmpgt_ps). All the performance measurements are there.
Answer 3
Since the weight of float and double is the same, the speed of compilation and execution will depend only on the number of digits before and after the decimal point; in my opinion, use the type float.