### Data Compression and Decompression Algorithms

Table of Contents Introduction……………………………………………………………………….. ……………………………. …2 1. Data Compression…………………………………….. ………………………………………………………2 1. 1Classification of Compression……………………………………………………………………… 2 1. 2 Data Compression methods………………………………………….. ……………………………3 2. Lossless Compression Algorithm…………….. ………………………………………………………. 4 2. 1 Run-Length Encoding…………………….. ……………………………. ……………………………4 2. 1. 1 Algorithm…………………………….. …………………………………………………………………. 5 2. 1. 2Complexity …………………………….. ……………………………….. ………………………………. 5 2. 1. 3 Advantages and disadvantage………….. ……………………. ………………………………6 3.

Huffmann coding……………………………….. ………………………. …………………………………6 3. 1 Huffmann encoding………………………….. ……………………………………………………….. 6 3. 2 Algorithm………………………………………………………….. ………………………………………. 7 4. Lempel-Ziv algorithm………………………………………….. ……………………………………………7 4. 1 Lempel-Ziv78…………………………………………….. ………………………………………………. 8 4. 2Encoding Algorithm…………………………………………………………………………………….. 8 4. 3 Decoding Algorithm………………………………………………………………………………….. 12 5. Lempel-Ziv Welch……………………………………………………………………………………………. 14 5. 1 Encoding Algorithm………………………………………………………………………………….. 14 5. 2 Decoding Algorithm………………………………………………………………………………….. 6 References…………………………………………………………………………………………………………. 17 INTRODUCTION: Data compression is a common requirement for most of the computerized applications. There are number of data compression algorithms, which are dedicated to compress different data formats. Even for a single data type there are number of different compression algorithms, which use different approaches. This paper examines lossless data compression algorithms. 1. DATA COMPRESSION: In computer science data compression involves encoding information using fewer bits than the original representation.

Compression is useful because it helps reduce the consumption of resources such as data space or transmission capacity. Because compressed data must be decompressed to be used, this extra processing imposes computational or other costs through decompression. 1. 1 Classification of Compression: a) Static/non-adaptive compression. b) Dynamic/adaptive compressioin. a) Static/Non-adaptive Compression: A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble.

The classic static defined-word scheme is Huffman coding. b) Dynamic/adaptive compression: A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. 2. 2 Data Compression Methods: 1) Losseless Compression: Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in Lossless compression is possible because most real-world data has statistical redundancy. For example, an image may have areas of colour that do not change over several pixels; instead of coding “red pixel, red pixel, … the data may be encoded as “279 red pixels”. Lossless compression is used in cases where it is important that the original and the decompressed data be identical, or where deviations from the original data could be deleterious. Typical examples are executable programs, text documents, and source code. Some image file formats, like PNG or GIF, use only lossless compression 2) Loosy Compression: In information technology, lossy compression is a data encoding method that compresses data by discarding (losing) some of it. The procedure aims to inimize the amount of data that needs to be held, handled, and/or transmitted by a computer. Lossy compression is most commonly used to compress multimedia data (audio, video, and still images), especially in applications such as streaming media and internet telephony. If we take a photo of a sunset over the sea, for example there are going to be groups of pixels with the same colour value, which can be reduced. Lossy algorithms tend to be more complex, as a result they achieve better results for bitmaps and can accommodate for the lose of data. The compressed file is an estimation of the original data.

One of the disadvantages of lossy compression is that if the compressed file keeps being compressed, then the quality will degraded drastically. 2. Lossless Compression Algorithms: 2. 1 Run-Length Encoding(RLE): RLE stands for Run Length Encoding. It is a lossless algorithm that only offers decent compression ratios in specific types of data. How RLE works: RLE is probably the easiest compression algorithm. It replaces sequences of the same data values within a file by a count number and a single value. Suppose the following string of data (17 bytes) has to be compressed: ABBBBBBBBBCDEEEEF

Using RLE compression, the compressed file takes up 10 bytes and could look like this: A 8B C D 4E F 2. 1. 1 Algorithm: for (i=0;i<length;i++) { J<-0; Count[i]=1; do { J++; If (str[i+j]==str[i]) { Count[i]++; }while(str[i+j]==str[i]) If count[i]==1; Cout<<str[i++] Else{Cout<< count[i]<<str[i]; } } } Also you can see, RLE encoding is only effective if there are sequences of 4 or more repeating characters because three characters are used to conduct RLE so coding two repeating characters would even lead to an increase in file size.

It is important to know that there are many different run-length encoding schemes. The above example has just been used to demonstrate the basic principle of RLE encoding. Sometimes the implementation of RLE is adapted to the type of data that are being compressed. 2. 1. 2 Complexity and Data Compression: We’re used to talk about complexity of an algorithm measuring time and we usually try to find the fastest implementation, like in search algorithms. Here it is not so important to compress data quickly, but to compress as much as possible so the output is as small as possible without lossing data.

A great feature of run-length encoding is that this algorithm is easy to implement. 2. 1. 3 Advantages and disadvantages: This algorithm is very easy to implement and does not require much CPU horsepower. RLE compression is only efficient with files that contain lots of repetitive data. These can be text files if they contain lots of spaces for indenting but line-art images that contain large white or black areas are far more suitable. Computer generated colour images (e. g. architectural drawings) can also give fair compression ratios. Where is RLE compression used? RLE compression can be used in the following file formats: PDF files 3. HUFFMANN CODING: Huffman coding is a popular method for compressing data with variable-length codes. Given a set of data symbols (an alphabet) and their frequencies of occurrence (or, equivalently, their probabilities), the method constructs a set of variable-length codewords with the shortest average length and assigns them to the symbols. Huffman coding serves as the basis for several applications implemented on popular platforms. Some programs use just the Huffman method, while others use it as one step in a multistep compression process. 3. 1 Huffman Encoding:

The Huffman encoding algorithm starts by constructing a list of all the alphabet symbols in descending order of their probabilities. It then constructs, from the bottom up, a binary tree with a symbol at every leaf. This is done in steps, where at each step two symbols with the smallest probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced with an auxiliary symbol representing the two original symbols. When the list is reduced to just one auxiliary symbol (representing the entire alphabet), the tree is complete. The tree is then traversed to determine the codewords of the symbols. . 2 Algorithm: Huffmann(A) { n=|A|; Q=A; For(i=1 to n-1) { z=new node; Left[z]=Extract_min(Q); Right[z]=Extract_min(Q); f[z]=f[left[z]]+f[right[z]]; insert(Q,z); } return Extract_min(Q); //return root } 4. The Lempel-ziv Algorithms: The Lempel Ziv Algorithm is an algorithm for lossless data compression. It is not a single algorithm, but a whole family of algorithms, stemming from the two algorithms proposed by Jacob Ziv and Abraham Lempel in their landmark papers in 1977 and 1978. Lempel Ziv algorithms are widely used in compression utilities such as gzip, GIF image compression.

Following are the variants of Lempel-ziv algos; LZ77Variants| LZR| LZSS| LZB| LZH| LZ78variants| LZW| LZC| LZT| LZMW| 4. 1 Lempel-ziv78: The LZ78 is a dictionary-based compression algorithm. The codewords output by the algorithm consist of two elements: an index referring to the longest matching dictionary entry and the first non-matching symbol. In addition to outputting the codeword for storage/transmission, the algorithm also adds the index and symbol pair to the dictionary. When a symbol that not yet in the dictionary is encountered, the codeword has the index value 0 and it is added to the dictionary as well.

With this method, the algorithm gradually builds up a dictionary. 4. 2 Algorithm: Dictionary empty ; Prefix empty ; DictionaryIndex 1; while(characterStream is not empty) { Char char next character in characterStream; if(Prefix + Char exists in the Dictionary) Prefix Prefix + Char ; else { if(Prefix is empty) CodeWordForPrefix 0 ; else CodeWordForPrefix DictionaryIndex for Prefix ; Output: (CodeWordForPrefix, Char) ; insertInDictionary( ( DictionaryIndex , Prefix + Char) ); DictionaryIndex++ ; Prefix empty ; } } Example 1: LZ78 Compression: Encode (i. e. compress) the string ABBCBCABABCAABCAAB using the LZ78 algorithm. Compressed message: The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B) Note: The above is just a representation, the commas and parentheses are not transmitted. Steps : 1. A is not in the Dictionary; insert it 2. B is not in the Dictionary; insert it 3. B is in the Dictionary. BC is not in the Dictionary; insert it. 4. B is in the Dictionary. BC is in the Dictionary. BCA is not in the Dictionary; insert it. 5. B is in the Dictionary. BA is not in the Dictionary; insert it. 6. B is in the Dictionary. BC is in the Dictionary.

BCA is in the Dictionary. BCAA is not in the Dictionary; insert it. 7. B is in the Dictionary. BC is in the Dictionary. BCA is in the Dictionary. BCAA is in the Dictionary. BCAAB is not in the Dictionary; insert it. LZ78 Compression :No of bits transmitted: Uncompressed String: ABBCBCABABCAABCAAB Number of bits = Total number of characters * 8 = 18 * 8 = 144 bits Suppose the codewords are indexed starting from 1: Compressed string( codewords): (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) Codeword index 1 2 3 4 5 6 7

Each code word consists of an integer and a character: The character is represented by 8 bits. The number of bits n required to represent the integer part of the codeword with index i is given by: Codeword (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B) index 1 2 3 4 5 6 7 Bits: (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) + (3 + 8) = 71 bits The actual compressed message is: 0A0B10C11A010A100A110B 4. 3 Decompression Algorithm: Dictionary empty ; DictionaryIndex 1 ; hile(there are more (CodeWord, Char) pairs in codestream){ CodeWord next CodeWord in codestream ; char character corresponding to CodeWord ; (codeWord = = 0) String empty ; else String string at index CodeWord in Dictionary ; Output: String + Char ; insertInDictionary( (DictionaryIndex , String + Char) ) ; DictionaryIndex++; } Example : LZ78 Decompression Decompressed message: The decompressed message is: ABBCBCABABCAABCAAB 5. Lempel-ziv Welch: This improved version of the original LZ78 algorithm is perhaps the most famous modification and is sometimes even mistakenly referred to as the Lempel Ziv algorithm.

Published by Terry Welch in 1984it basically applies the LZSS principle of not explicitly transmitting the next nonmatching symbol to the LZ78 algorithm. The only remaining output of this improved algorithm are fixed-length references to the dictionary (indexes). If the message to be encoded consists of only one character, LZW outputs the code for this character; otherwise it inserts two- or multi-character, overlapping,distinct patterns of the message to be encoded in a Dictionary. Overlapping: The last character of a pattern is the first character of the next pattern. 5. 1 Algorithm:

Initialize Dictionary with 256 single character strings and their corresponding ASCII codes; Prefix first input character; CodeWord 256; while(not end of character stream){ Char next input character; if(Prefix + Char exists in the Dictionary) Prefix Prefix + Char; else{ Output: the code for Prefix; insertInDictionary( (CodeWord , Prefix + Char) ) ; CodeWord++; Prefix Char; } } Output: the code for Prefix; Example : Compression using LZW Encode the string BABAABAAA by the LZW encoding algorithm. 1. BA is not in the Dictionary; insert BA, output the code for its prefix: code(B) 2.

AB is not in the Dictionary; insert AB, output the code for its prefix: code(A) 3. BA is in the Dictionary. BAA is not in Dictionary; insert BAA, output the code for its prefix: code(BA) 4. AB is in the Dictionary. ABA is not in the Dictionary; insert ABA, output the code for its prefix: code(AB) 5. AA is not in the Dictionary; insert AA, output the code for its prefix: code(A) 6. AA is in the Dictionary and it is the last pattern; output its code: code(AA) Compressed message: The compressed message is: <66><65><256><257><65><260> LZW: Number of bits transmitted

Example: Uncompressed String: aaabbbbbbaabaaba Number of bits = Total number of characters * 8 = 16 * 8 = 128 bits Compressed string (codewords): <97><256><98><258><259><257><261> Number of bits = Total Number of codewords * 12 = 7 * 12 = 84 bits Note: Each codeword is 12 bits because the minimum Dictionary size is taken as 4096, and 212 = 4096 5. 2 Decoding algorithm: Initialize Dictionary with 256 ASCII codes and corresponding single character strings as their translations; PreviousCodeWord first input code; Output: string(PreviousCodeWord) ;

Char character(first input code); CodeWord 256; while(not end of code stream){ CurrentCodeWord next input code ; if(CurrentCodeWord exists in the Dictionary) String string(CurrentCodeWord) ; else String string(PreviousCodeWord) + Char ; Output: String; Char first character of String ; insertInDictionary( (CodeWord , string(PreviousCodeWord) + Char ) ); PreviousCodeWord CurrentCodeWord ; CodeWord++ ; } Summary of LZW decoding algorithm: output: string(first CodeWord); while(there are more CodeWords){ if(CurrentCodeWord is in the Dictionary) output: string(CurrentCodeWord); else utput: PreviousOutput + PreviousOutput first character; insert in the Dictionary: PreviousOutput + CurrentOutput first character; } Example : LZW Decompression Use LZW to decompress the output sequence <66> <65> <256> <257> <65> <260> 1. 66 is in Dictionary; output string(66) i. e. B 2. 65 is in Dictionary; output string(65) i. e. A, insert BA 3. 256 is in Dictionary; output string(256) i. e. BA, insert AB 4. 257 is in Dictionary; output string(257) i. e. AB, insert BAA 5. 65 is in Dictionary; output string(65) i. e. A, insert ABA 6. 60 is not in Dictionary; output previous output + previous output first character: AA, insert AA References: * http://www. sqa. org. uk/e-learning/BitVect01CD/page_86. htm * http://www. gukewen. sdu. edu. cn/panrj/courses/mm08. pdf * http://www. cs. cmu. edu/~guyb/realworld/compression. pdf * http://www. stoimen. com/blog/2012/01/09/computer-algorithms-data-compression-with-run-length-encoding/ * http://www. ics. uci. edu/~dan/pubs/DC-Sec1. html#Sec_1 * http://www. prepressure. com/library/compression_algorithms/flatedeflate * http://en. wikipedia. org/wiki/Data_compression