Data Compression

On this page, you will learn about different data compression algorithms.
DAT-1.D.1

The size of data (the number of bits required to store it) affects the time it takes to send that data across the Internet. So, people use data compression algorithms to reduce the size of images, sounds, movies and some other kinds of data.

DAT-1.D.3

The amount of size reduction depends on two things:

  1. the amount of redundancy in the original data
  2. the compression algorithm applied
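
As a rough illustration of the first point, the sketch below compresses two inputs of the same size with the same algorithm. It assumes Python's standard `zlib` module as the compressor; the names `redundant` and `varied` and the size of 10,000 bytes are made up for this example.

```python
import random
import zlib

SIZE = 10_000  # number of bytes in each sample (arbitrary choice)

# Highly redundant data: the same byte repeated over and over.
redundant = b"A" * SIZE

# Data with almost no redundancy: pseudo-random bytes (seeded so runs match).
rng = random.Random(0)
varied = bytes(rng.getrandbits(8) for _ in range(SIZE))

redundant_compressed = zlib.compress(redundant)
varied_compressed = zlib.compress(varied)

print("redundant:", len(redundant), "bytes ->", len(redundant_compressed), "bytes")
print("varied:   ", len(varied), "bytes ->", len(varied_compressed), "bytes")
```

The repeated data should shrink to a tiny fraction of its size, while the random-looking data should barely shrink at all (it may even grow slightly because of the format's overhead).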

There are two broad categories of data compression algorithms: lossless and lossy, depending on whether information is lost.

Lossless Compression

Lossless data compression algorithms (such as the one used by the PNG image format) are reversible (there is no loss in quality); you can reconstruct the original data.

DAT-1.D.4

Lossless compression works by removing redundant data. These algorithms can usually reduce the number of bits required to store or transmit the data while guaranteeing that the original data can be perfectly reconstructed.
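
To see the "perfectly reconstructed" guarantee in practice, here is a minimal sketch that compresses some repetitive text and then decompresses it again. It assumes Python's standard `zlib` module as one example of a lossless compressor; the sample text is invented for the illustration.

```python
import zlib

# Repetitive sample text, so there is redundancy to remove.
text = ("the quick brown fox " * 50).encode("utf-8")

compressed = zlib.compress(text)        # fewer bytes to store or send
restored = zlib.decompress(compressed)  # reverse the compression

print(len(text), "bytes before,", len(compressed), "bytes after")
print("identical to the original?", restored == text)
```

The comparison on the last line checks every byte, so a result of `True` means nothing was lost.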

(Image: BJC logo, uncompressed)

Run-length encoding is an example of lossless compression. Consider the 158 pixels in the top row of the BJC logo (at right). The first 60 pixels are white. Then come five pixels of yellowish orange (the top slice of the "b"). And the rest of that row is white.

(Image: part of the top row of pixels in the BJC logo: ...4 white, 6 yellow-orange, 3 white...)

Instead of storing all 158 pixels individually, we could compress them with run-length encoding and just store six values (three numbers and three colors):

pixel count   color code
60            FFFFFF
5             E5A84A
93            FFFFFF
DAT-1.D.2

Those six values (60, FFFFFF, 5, E5A84A, 93, FFFFFF) can be reconstructed into that whole first row of the image (158 pixels). So, fewer bits does not necessarily mean less information.
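
The run-length encoding of that top row can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular image format stores its data; the function names `rle_encode` and `rle_decode` are made up for this example, and each pixel is represented by its color code as a string.

```python
def rle_encode(pixels):
    """Collapse a list of pixels into (count, color) runs."""
    runs = []
    for color in pixels:
        if runs and runs[-1][1] == color:
            runs[-1][0] += 1          # same color as the current run: extend it
        else:
            runs.append([1, color])   # different color: start a new run
    return [(count, color) for count, color in runs]


def rle_decode(runs):
    """Expand (count, color) runs back into the full list of pixels."""
    pixels = []
    for count, color in runs:
        pixels.extend([color] * count)
    return pixels


# The 158 pixels described above: 60 white, 5 yellowish orange, 93 white.
top_row = ["FFFFFF"] * 60 + ["E5A84A"] * 5 + ["FFFFFF"] * 93

encoded = rle_encode(top_row)
print(encoded)  # [(60, 'FFFFFF'), (5, 'E5A84A'), (93, 'FFFFFF')]
print(rle_decode(encoded) == top_row)  # True
```

Decoding the three runs gives back exactly the same 158 pixels, which is what makes this compression lossless.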

Lossy Compression

Lossy data compression algorithms are not fully reversible; you can reconstruct only an approximation of the original data.

DAT-1.D.5

Lossy compression works by removing details that people aren't likely to notice. The most commonly used lossy compression algorithm for pictures is called JPEG (or JPG; both are pronounced "jay peg" and stand for "Joint Photographic Experts Group," the committee that invented it). JPEG works by preserving most of the brightness information for each pixel (since human eyes are sensitive to that) and performing a kind of averaging process on the color information (because human eyes aren't as good at distinguishing color, especially colors close to white).
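
The sketch below illustrates the "keep brightness, average color" idea on a single row of pixels. It is a heavily simplified illustration, not the real JPEG algorithm (which involves several more steps). It assumes the standard JFIF formulas for splitting a pixel into brightness (Y) and color (Cb, Cr) values; the function names and the choice to average over pairs of neighboring pixels are made up for this example.

```python
def rgb_to_ycbcr(r, g, b):
    """Split an RGB pixel into brightness (y) and two color values (cb, cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Convert back to RGB, rounding and clamping each value to 0..255."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(max(0, min(255, round(v))) for v in (r, g, b))


def average_color_in_pairs(pixels):
    """Keep every pixel's own brightness, but share one averaged color per pair."""
    result = []
    for i in range(0, len(pixels), 2):
        pair = [rgb_to_ycbcr(*p) for p in pixels[i:i + 2]]
        avg_cb = sum(p[1] for p in pair) / len(pair)
        avg_cr = sum(p[2] for p in pair) / len(pair)
        for y, _, _ in pair:
            result.append(ycbcr_to_rgb(y, avg_cb, avg_cr))
    return result


row = [(255, 0, 0), (0, 0, 255)]  # a red pixel next to a blue pixel
approximation = average_color_in_pairs(row)
print(approximation)
print("same as the original?", approximation == row)
```

Because the two pixels now share one set of color values, only half as much color information needs to be stored, but the original red and blue cannot be recovered exactly. That is the trade-off that makes the compression lossy.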

Below are an original, uncompressed picture of pebbles in a pond and a highly compressed JPEG of the same image. Can you tell which is which?
(Images: two versions of the pond pebbles picture)

You probably can tell which is which, especially if you look for sharp edges or very shiny spots. But the compressed file uses about 1/30th of the space used by the original, and you can still tell that it's a picture of rocks. So, for many purposes the compressed version would be good enough. Lossy algorithms usually let you control the degree of precision, and people generally select less extreme compression settings, so the compressed file looks much more like the original than in this example.

What size is this file when encoded in different formats?

Here are the sizes of the pond pebbles picture in four different formats:
format                                                size
BMP encoding every pixel individually (shown above)   148 kB
PNG                                                   106 kB
JPEG with least compression                           94 kB
JPEG with most compression (shown above)              5 kB
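
As a quick check of the "1/30th" figure, the arithmetic on the sizes listed above can be sketched as follows. The dictionary `sizes_kb` simply restates those four sizes; the variable names are made up for this example.

```python
# File sizes in kB, copied from the list above.
sizes_kb = {
    "BMP": 148,
    "PNG": 106,
    "JPEG (least compression)": 94,
    "JPEG (most compression)": 5,
}

original = sizes_kb["BMP"]
for fmt, size in sizes_kb.items():
    print(f"{fmt}: {size} kB, {size / original:.0%} of the BMP size")

# How many times smaller is the most-compressed JPEG than the BMP?
ratio = original / sizes_kb["JPEG (most compression)"]
print(f"The BMP is about {ratio:.1f} times the size of the smallest JPEG.")
```

The last line works out to 148 / 5 = 29.6, which rounds to the factor of 30 mentioned earlier.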

The MP3 format, which you almost certainly use for portable music files, is a lossy compression format. It tends to emphasize high frequencies, so people accustomed to MP3 music find uncompressed versions of the same music boomy (bassy).

Which is best?

Both types of data compression exist because each is useful in certain circumstances:

  1. These questions are similar to those you will see on the AP CSP exam.

    A film student records a movie on his smartphone and then saves a copy on his computer. He notices that the saved copy is of much lower image quality than the original. Which of the following could NOT be a possible explanation for the lower image quality?
        The movie was saved using fewer bits per second (a lower bit rate) than the original movie.
        The copy of the movie file was somehow corrupted in the process of saving.
        The movie was saved using a lossy compression technique.
        Whenever a file is saved from one computer to another, some information is always lost.

    A visual artist is processing a digital image. Which of the following describe a lossless transformation from which the original image can be recovered? Choose two answers.
        Creating the negative of an image, where colors are reversed (dark areas appear light).
        Blurring the edges of an image.
        Creating a grayscale copy of an image.
        Creating a vertically flipped copy of the image.

    DAT-1.D

    For which of the following kinds of data would lossy compression be okay? Check as many as apply.
        The HTML code for this web page.
        Your computer's desktop picture.
        A live-action movie on Netflix.
        A cartoon on Netflix.
        A digital book, to be read on a computer.