floating point arithmetic
Floating-Point Arithmetic
Simply stated, floating-point arithmetic is arithmetic performed on floating-point representations by any number of automated devices.
Traditionally, this definition is phrased so as to apply only to arithmetic performed on floating-point representations of real numbers (i.e., to finite elements of the collection of floating-point numbers), though several additional types of floating-point data, including signed infinities and NaNs, are also commonly allowed as inputs for such functions.
Despite the succinctness of the definition, it is worth noting that the most widely adopted standards in computing consider nearly the entirety of floating-point theory under the heading "floating-point arithmetic." One reason for this breadth is that any floating-point representation can represent only a finite subset of the continuum of real numbers; this finiteness presents a variety of obstacles, chief among which is the fact that certain properties of real arithmetic (e.g., associativity of addition) sometimes fail to hold for floating-point numbers (IEEE Computer Society 2008). As a result, any comprehensive treatment of floating-point arithmetic must address numerous preliminaries, including representations of floating-point numbers and rounding, before ever discussing the actual operations themselves.
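The failure of associativity mentioned above is easy to reproduce. The following is a minimal sketch in Python, whose `float` type is a binary double on common platforms; the three values are hypothetical and were chosen only so that rounding discards the small addend in one grouping but not the other.

```python
# Hypothetical operands: two huge values that cancel, plus one small value.
a, b, c = 1e20, -1e20, 1.0

# Grouping 1: a + b is exactly 0.0, so adding c gives 1.0.
left = (a + b) + c

# Grouping 2: b + c cannot be represented exactly and rounds back to -1e20,
# so adding a gives 0.0.
right = a + (b + c)

print(left, right, left == right)
```

Both groupings would give 1 in real arithmetic; here they differ because each intermediate result is rounded to the nearest representable number.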
As of 2014, the most commonly implemented standard for floating-point arithmetic is the IEEE Standard 754-2008 for Floating-Point Arithmetic (abbreviated IEEE 754-2008, and referred to as IEEE 754 henceforth). This standard is a major revision of its predecessor, IEEE 754-1985, and specifies nearly every aspect of floating-point theory. In particular, IEEE 754 addresses the following aspects of floating-point theory in considerable detail:
1. Floating-point representations and formats.
2. Attributes of floating-point representations, including rounding of floating-point numbers.
3. Arithmetic and algebraic operations on floating-point representations.
4. Infinity, not-a-number values (NaNs), signs, and exceptions.
A number of the above topics are discussed across multiple sections of the standard's documentation (IEEE Computer Society 2008).
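As a small illustration of the last item in the list, the sketch below shows how infinities, NaNs, and signed zeros behave in Python floats. The variable names are illustrative only. Note one Python-specific deviation: dividing a float by zero raises `ZeroDivisionError` instead of returning an infinity, so the example avoids that case.

```python
import math

inf, nan = math.inf, math.nan

inf_absorbs = (inf + 1.0 == inf)           # adding a finite value leaves infinity unchanged
inf_minus_inf = inf - inf                  # an invalid operation, which produces a NaN
nan_equals_itself = (nan == nan)           # a NaN compares unequal to everything, itself included
reciprocal_of_inf = 1.0 / inf              # a finite value divided by infinity is zero
zeros_compare_equal = (0.0 == -0.0)        # the two zeros compare equal...
sign_of_neg_zero = math.copysign(1.0, -0.0)  # ...but the sign is still stored

print(inf_absorbs, math.isnan(inf_minus_inf), nan_equals_itself)
print(reciprocal_of_inf, zeros_compare_equal, sign_of_neg_zero)
```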
The "required" arithmetic operations defined by IEEE 754 on floating-point representations are addition, subtraction, multiplication, division, square root, and fused multiply-add (a ternary operation defined by fma(x, y, z) = (x × y) + z, with only one rounding applied to the final result); these are required in the sense that conformance to the standard requires these operations to be supported with correct rounding throughout. A number of other "recommended" operations are also provided within the standard, some of which are arithmetic in nature; these are recommended in the sense that support for them is not strictly required. Finally, note that the standard includes a collection of utility functions which may also be considered arithmetic, namely copy, negate, and abs, as well as a number of closely related functions defined for vector-valued input (IEEE Computer Society 2008, pp. 46-47).
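To see why fused multiply-add is treated as its own operation, it helps to compare one rounding against two. The sketch below emulates the single rounding with exact rational arithmetic; `fma_emulated` is a hypothetical helper written for this illustration, not a library function, and the operands are chosen so that the product is not exactly representable as a double.

```python
from fractions import Fraction


def fma_emulated(x: float, y: float, z: float) -> float:
    """Return x*y + z with a single rounding (finite inputs only)."""
    # Fraction(float) is exact, so no rounding happens in this line.
    exact = Fraction(x) * Fraction(y) + Fraction(z)
    # Converting back to float rounds once, to the nearest double.
    return float(exact)


# Hypothetical operands: x*y equals 1 - 2**-60 exactly, which is not a double.
x = 1.0 + 2.0**-30
y = 1.0 - 2.0**-30
z = -1.0

two_roundings = x * y + z             # the product rounds to 1.0, then 1.0 + z
one_rounding = fma_emulated(x, y, z)  # keeps the -2**-60 term

print(two_roundings, one_rounding)
```

The separate multiply and add lose the small term entirely, while the single-rounding result retains it. This emulation is slow and is meant only to show the definition, not to stand in for a hardware instruction.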
The above table summarizes the recommended arithmetic operations within IEEE 754. Note that the particulars of the exceptions labeled "Several cases" are addressed in detail in the IEEE 754 documentation (IEEE Computer Society 2008, pp. 43-45).
As noted above, even some of the basic required arithmetic operators can produce counterintuitive results because of floating-point representations and rounding. This stems from the fact that the "normal" arithmetic operations are assumed within IEEE 754 to have infinite precision, while the values of floating-point addition, subtraction, multiplication, and division, written symbolically as ⊕, ⊖, ⊗, and ⊘, respectively, are computed by performing the "normal" operations of +, −, ×, and ÷, respectively, on floating-point numbers written in terms of a common exponent and rounding the result to a fixed number of significant digits (by way of the so-called preferred exponent) afterward. As a result, loss of precision, overflow, and underflow can all occur during the arithmetic and/or rounding steps of the computation. For example, the result of adding l and is exactly
(1)
On the other hand, in a framework with radix and 7-digit precision, the value returned by floating-point addition would be
(2)
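The gap between an exact sum and its rounded floating-point counterpart can be reproduced with Python's `decimal` module. This is a sketch under stated assumptions: the operands are hypothetical stand-ins, not the values of the original example, and the arithmetic is assumed to be radix 10 with 7 significant digits and round-half-even rounding.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Hypothetical operands, each representable with 7 significant digits.
a = Decimal("123456.7")
b = Decimal("101.7654")

# A wide context stands in for "infinite precision" on these inputs.
wide = Context(prec=50)
exact_sum = wide.add(a, b)    # 123558.4654

# A 7-digit context rounds the same sum to 7 significant digits.
ctx7 = Context(prec=7, rounding=ROUND_HALF_EVEN)
rounded_sum = ctx7.add(a, b)  # 123558.5

print(exact_sum, rounded_sum)
```

The exact result needs ten significant digits, so the 7-digit context has to drop the last three and round.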
Similarly, given and , one has that
(3)
using the 7-digit precision assumed above. However, one has that
(4)
thus yielding a complete lack of precision. Note that in extreme cases like this, systems implementing IEEE 754 won't actually yield as a result: in particular, such a scenario will signal an underflow exception.
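Both effects, loss of significant digits under subtraction and the signaling of underflow, can be sketched with Python's `decimal` module. Everything specific here is an assumption made for illustration: the operands are hypothetical, the context uses 7 significant decimal digits, and the exponent range (`Emin=-9`, `Emax=9`) is made artificially narrow so that underflow is easy to reach.

```python
from decimal import Context, Decimal, Underflow

# Hypothetical 7-digit decimal format with a deliberately tiny exponent range.
# traps=[] means exceptional conditions set flags instead of raising.
ctx7 = Context(prec=7, Emin=-9, Emax=9, traps=[])
wide = Context(prec=50)  # stands in for exact arithmetic on these inputs

# Two nearby "true" values and their 7-digit roundings.
x_true, y_true = Decimal("1.2345674"), Decimal("1.2345656")
x = ctx7.create_decimal(x_true)            # 1.234567
y = ctx7.create_decimal(y_true)            # 1.234566

true_diff = wide.subtract(x_true, y_true)  # 0.0000018
computed_diff = ctx7.subtract(x, y)        # 0.000001, one significant digit left

# A product far below the smallest representable magnitude rounds to zero,
# and the context records that underflow occurred.
tiny = ctx7.multiply(Decimal("1E-10"), Decimal("1E-10"))
underflow_flagged = bool(ctx7.flags[Underflow])

print(true_diff, computed_diff, tiny, underflow_flagged)
```

The subtraction itself is exact on the rounded operands; the damage comes from the earlier roundings, which the subtraction exposes. The underflow flag is how the context reports that the tiny result is not a true zero.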