
Numeric Types

Numeric types represent numbers in a computer. Because a computer has a finite number of bits for each value, most stored values are approximations of the real number they stand for.
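
To make the approximation concrete, here is a minimal sketch in Python, assuming the usual 64-bit binary float. It compares the value actually stored for 0.1 with the decimal value that was intended. The names `stored` and `intended` are only illustrative.

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, with no rounding.
stored = Decimal(0.1)
# Decimal(str) keeps the decimal digits exactly as written.
intended = Decimal("0.1")

print(stored)              # a long expansion slightly above 0.1
print(stored == intended)  # False
print(stored - intended)   # the approximation error, roughly 5.55e-18
```

The difference is tiny, but it is not zero, which is why comparing floats for exact equality is usually a mistake.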

Math in a computer combines stored data with a collection of operators. The bits carry no meaning by themselves; their interpretation is up to the operators. When we talk about the type of a number, we are usually referring to the operations we can perform on it, rather than to how it is stored.
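
As a rough illustration of bits being interpreted by the operator, the sketch below reads the same four bytes as three different types using Python's `struct` module. The byte values and the little-endian byte order are assumptions chosen for the example.

```python
import struct

# Four bytes of storage, written in little-endian order (pattern 0xBF800000).
raw = bytes([0x00, 0x00, 0x80, 0xBF])

as_int32 = struct.unpack("<i", raw)[0]    # signed 32-bit integer
as_uint32 = struct.unpack("<I", raw)[0]   # unsigned 32-bit integer
as_float32 = struct.unpack("<f", raw)[0]  # 32-bit float

print(as_int32)    # -1082130432
print(as_uint32)   # 3212836864
print(as_float32)  # -1.0
```

The storage never changes; only the format code, which stands in for the choice of operator, decides what number the bits mean.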

Here are some Numeric Types that programming languages usually support.

  • Signed Integer (32 bits) - Usually has fast, native support in the processor. Sometimes called an int.
    • Two's Complement - Represents numbers as a string of bits. Negating N inverts its bits and adds one, so bit-vector addition gives (inv(N) + 1) + N = 0 once the carry out of the top bit is discarded.
    • One's Complement - Represents numbers as a string of bits. Negating N inverts its bits, so bit-vector addition gives inv(N) + N = all ones, which is the negative zero of this representation.
    • Signed Magnitude - Represents numbers as one sign bit S, with the remaining bits as an unsigned magnitude M. N = (-1)^S x M
  • Signed Integer (8 bits) - Sometimes called a byte. Sometimes called an octet in the context of networking.
  • Signed Integer (16 bits) - Sometimes called a short
  • Signed Integer (24 bits) - Uncommon, sometimes called a medium. Some GPUs include support for these.
  • Signed Integer (64 bits) - Sometimes called a long. This is the most common word size on machines today.
  • Signed Integer (128 bits) - Less common. C# provides one as Int128.
  • Unsigned Integer (32 bits) - Represents a magnitude from 0 to 2^B - 1, where B is the number of bits. Overflow usually wraps around, so arithmetic is modulo 2^B.
  • Unsigned Integer (8 bits) - Sometimes also called a byte, or an octet. C calls these unsigned char; whether plain char is signed depends on the implementation.
  • Unsigned Integer (16 bits) - Sometimes called an unsigned short. Java calls this a char.
  • Unsigned Integer (64 bits) - Called an unsigned long long in C.
  • Float (32 bit) - Represents a number in binary scientific notation: a sign, a significand, and a power-of-two exponent.
  • Double (64 bit) - Also a floating point number, but bigger. Sometimes called a Float64. Double is short for double precision floating point number.
  • Fixed point - Represents a fractional value with the radix point at a fixed position. Simpler to implement than floating point, but covers much less range. This type is not native to most CPUs.
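  • Arbitrary Precision Integer - Sometimes called big integers. Some languages, such as Python, use them as the native integer type.
  • Arbitrary Precision Decimal - Represents a decimal number with as many digits as needed, at a cost in speed and memory.
  • Complex (x + jy) - Usually stored as a real part x and an imaginary part y.
  • Complex (m * theta) - An alternative representation in polar form, with magnitude m and angle theta. Uncommon, but better for multiplication: magnitudes multiply and angles add.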
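
The three signed integer encodings listed above can be sketched with plain bit operations. This is a minimal illustration that assumes an 8-bit width to keep the patterns short; `invert`, `twos_complement`, `ones_complement`, and `sign_magnitude` are hypothetical helper names, not functions from any library.

```python
BITS = 8
MASK = (1 << BITS) - 1  # 0xFF, keeps results within 8 bits

def invert(n):
    """Flip every bit of an 8-bit pattern (inv(N) in the list above)."""
    return ~n & MASK

def twos_complement(value):
    """Bit pattern of a signed value in two's complement."""
    return value & MASK

def ones_complement(value):
    """Bit pattern of a signed value in one's complement."""
    return value if value >= 0 else invert(-value)

def sign_magnitude(value):
    """One sign bit followed by an unsigned magnitude."""
    sign = 1 if value < 0 else 0
    return (sign << (BITS - 1)) | abs(value)

print(format(twos_complement(-5), "08b"))  # 11111011
print(format(ones_complement(-5), "08b"))  # 11111010
print(format(sign_magnitude(-5), "08b"))   # 10000101

n = 5
# Two's complement: (inv(N) + 1) + N wraps around to zero.
print((invert(n) + 1 + n) & MASK)          # 0
# One's complement: inv(N) + N is all ones, the "negative zero" pattern.
print(format(invert(n) + n, "08b"))        # 11111111
```

The same helpers work for other widths by changing `BITS`, as long as the value fits in the chosen width.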

There are also many less common types, often implemented by libraries rather than by the language itself.

  • Rational - A ratio of two integers. Go supports this through math/big.Rat in its standard library.
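
As a small sketch of what a rational type buys you, the example below uses Python's `fractions.Fraction` as a stand-in for a rational type in general. It is not the Go API mentioned above, only an illustration of the same idea.

```python
from fractions import Fraction

# A rational is stored as two integers and reduced to lowest terms.
ratio = Fraction(6, 8)
print(ratio.numerator, ratio.denominator)  # 3 4

# Arithmetic on rationals is exact, unlike the same sum in floating point.
third = Fraction(1, 3)
print(third + third + third == 1)          # True
print(0.1 + 0.2 == 0.3)                    # False
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
```

The trade-off is that numerators and denominators can grow without bound, so rationals are typically slower than native floats.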

As a consequence of their representation, most operations on numeric types do not behave exactly like the mathematical functions they are named after. For example, adding two Int32 numbers can overflow, resulting in a wrong answer. Also, floating point addition and multiplication are not associative: (a + b) + c != a + (b + c) for some values. These cases are usually rare enough that they aren't a problem, but care should be taken to program defensively.
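
Both pitfalls can be reproduced in a few lines. The sketch below assumes that Int32 overflow wraps around in two's complement, which is the behavior in languages such as Java and Go but is not guaranteed everywhere (signed overflow is undefined in C). Because Python integers never overflow, `add_int32` is a hypothetical helper that simulates the wrap.

```python
INT32_MAX = 2**31 - 1

def add_int32(a, b):
    """Add two integers and wrap the result into the signed 32-bit range."""
    total = (a + b) & 0xFFFFFFFF  # keep the low 32 bits
    # Patterns with the top bit set stand for negative numbers.
    return total - 0x100000000 if total >= 0x80000000 else total

print(add_int32(INT32_MAX, 1))  # -2147483648, not 2147483648

# Floating point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A defensive program might check for overflow before adding, or compare floats with a tolerance (for example `math.isclose`) instead of `==`.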

numeric.1740690391.txt.gz · Last modified: by carl