Go to page content

Information that makes security stakeholders better off

Article — by Alfonso De Gregorio, 06 December 2010

Untwisted: Bit-sliced TEA time

Or an introduction to bitslicing a block cipher, for those who will not confuse the algorithm with their tea bread.

Welcome back! A dear friend of mine always tells me I have the ability to untwist the intricacies of problems in a passionate way. I do not claim his opinion is necessarily right. Yet, he deserves all the credit and my gratitude for the name of the column I am kicking off, today. Untwisted is about giving a closer look at topics of interest and trying to grasp something new through an hands-on approach. Untwisted is about programming, researching, analyzing, hosting a Sunday afternoon project, writing about it in clear terms (hopefully) and ultimately having fun. It is a technical space written for security practitioners by a colleague.


Without further ado, the first victim of the column are bitslice implementations of cryptographic algorithms. I present a pedagogical bitsliced implementation of the Tiny Encryption Algorithm (TEA). Because the post is aimed at providing instructional value, the source below is optimized for readability and correctness, than raw speed - a topic worth one of the next blog posts by itself.

In cryptography, the technique was first introduced by Eli Biham in his A Fast New DES Implementation in Software, at the 1997 edition of Fast Sofware Encryption (FSE). It was dubbed bitslice by Matthew Kwan who re-coined the term to describe the arrangement given by Biham to the bits of the state.

In fact, the speed-up was achieved by viewing a processor as a SIMD parallel computer, which can compute n one-bit operations simultaneously, where n is the word size. In a bitslice implementation, one logical instruction in software mimics the simultaneous execution of n hardware logical gates. To do that, each n-bit input is "sliced" across n different words, where the first bit is typically from the first block, the second bit belongs to the second, and so on. So, implementing a block cipher with a 64-bit blocksize, 32 blocks, 64-bit long each, can be set in 64 32-bit registers, where the first register contains the first bit of each of the 32 blocks, the second register the seconds, etc. Using this arrangement for describing the state, 32 encryptions can be run simultaneously, or as many as the register size of the reference processor.

Since their introduction, bitslice implementations have received a great deal of attention by cryptographers, interested in providing efficient hardware or software implementation for their algorithms or efficient masking against Differential Power Analysis (DPA), and cryptanalysts desiring to speed up the exhaustive key searches, or the precomputation steps of Time-Memory Trade-Off tables, or applying cube-attacks and cube testers.

Notwithstanding the interest shown by the security community, only a few resources are available online for those who want to start writing bitslice implementations. This article tries to contribute towards filling this gap.


How to bitslice TEA?

Let's start by giving a look to our target. TEA is a Feistel cipher designed in the early 1990s by David Wheeler and Roger Needham at Cambriage, and notable for its simplicity of description and implementation. It cycles 32 times in 64 Fesitel rounds, processing 64bit long blocks and using a 128bit key.

The general idea to bitslice TEA, and other algorithms, is to devise a sensible disposition for the state and a way to implement a bit-level vectorization of each processing step. Though this might seem rather prosaic, convenient datatypes can help you to avoid confusion while improving your confidence with this technique.

Here is a possible arrangement of the inputs and the types used throughout the rest of this article.

/* wordsize-way bit-level vectorization */
typedef unsigned long int vector_width_t;
#define VECTOR_AT_ZERO  0 

typedef vector_width_t parallel_blocks_t[TEA_BLOCK_SIZE];
typedef vector_width_t parallel_keys_t[TEA_KEY_SIZE];
/* v points to the wordsize-way vectorized plaintext/ciphertext, 
 * k to the vectorized key */
/* input quantities are disposed in the following way:     
   v0 <- v[0..31]      k0 <- k[0..31]      k2 <- k[64..95]
   v1 <- v[32..63]     k1 <- k[32..63]     k3 <- k[96..127]
void encrypt(parallel_blocks_t v, const parallel_keys_t k, size_t r){
  /* ... */
void decrypt(parallel_blocks_t v, const parallel_keys_t k, size_t r){
  /* ... */

v and k are arrays containing the n-way bit-level vectorized plaintext/ciphertext block and the secret key. They are long respectively 64 and 128 vector_width_t elements, where vector_width_t is defined according to the desired parallelism level. Here, for the sake of simplicity, vector_width_t aliases unsigned long int, that typically maps on the natural word size of the processor. Please, refer to the code for the actual implementation.

If we could bitslice the various operations that build up each TEA round (ie, additions, subtractions, left shifts, right shifts), the design of a bitsliced TEA would become pretty straightforward. Iterating the encryption rounds would appear like this:

while (r > 0) {
    sum += delta;

    /* bitslice the lshift: v1_lshift_4 = v1 << 4;                */
    /* bitslice addition: v1_lshift_4_plus_k0 = v1_lshift_4 + k0; */
    /*                    v1_plus_sum = v1 + sum;                 */
    /* bitslice the rshift: v1_rshift_5 = v1 >> 5;                */ 
    /* bitslice addition: v1_rshift_5_plus_k1 = v1_rshift_5 + k1; */
    /* again addition:    v0 += v1_lshift_4_plus_k0               *
     *                        ^ v1_plus_sum                       *
     *                        ^ v1_rshift_5_plus_k1;              */
    /* idem                v0_lshift_4 = v0 << 4;                 */
    /*                     v0_lshift_4_plus_k2 = v0_lshift_4 + k2;*/
    /*                     v0_plus_sum = v0 + sum;                */       
    /*                     v0_rshift_5 = v0 >> 5;                 */
    /*                     v0_rshift_5_plus_k3 = v0_rshift_5 + k3;*/
    /*                     v1 += v0_lshift_4_plus_k2              *
     *                         ^ v0_plus_sum                      * 
     *                         ^ v0_rshift_5_plus_k3;             */

In a similar way would appear the decryption rounds, where, bitslicing the subtraction, we would be able to decrement all the instances of v1 and v0 in parallel, by they respective final three terms.

Taken one by one, the operations above should be a no-brainer. Let's proceed with order and implement them.

As you might have noticed from the TEA description and the assignment of the input variables, the TEA rounds works, by design, with 32-bit quantities: namely, v0 and v1, containing the plaintext/ciphertext block, and k0, k1, k2, k3, containing the secret key. This is why you will see the bitslice implementation of the required operations working with 32 registers at a time; they contain the n-way bit-level vectorization of the input variables.

Lshift and Rshift

The left-shift would translate (in software) in 32 reassignments.

/* lshift v1 by 4 */      
shift = 4;      
for (i = 31; i >= 0; i--)             
  v1_lshift_4[i] = (i >= shift) ? v[offset_v1 + i - shift] : 0;

Likewise, the right-shift:

/* rshift v1 by 5 */
shift = 5;
for (i = 0; i < 32; ++i)
    v1_rshift_5[i] = (i < (32-shift)) ? v[offset_v1 + i + shift] : 0;



For the addition of 32-bit numbers we need to implement a a 32-bit full adder. One way to build it is chaining 32 full adders into a ripple-carry adder. In software we can use just the same approach. The code is from the Dan Bernstein's bitsliced implementation of CubeHash, available as part of the SUPERCOPbenchmarking toolkit.

 /* add k0 to v1_lshift_4 */
carry = 0;
for (i = 0;i < 32;++i) {
    ai = v1_lshift_4[i];       
    bi = k[offset_k0 + i];       
    aandb = ai & bi;       
    axorb = ai ^ bi;       
    v1_lshift_4_plus_k0[i] = axorb ^ carry;       
    carry &= axorb;       
    carry |= aandb;     

  Full Subtractor



In a similar fashion, we can build a n-bit full subtractor mirroring the work of a subtractor circuit. Here, I have mirrored a borrow-ripple subtractor, by connecting n full subtractors in series, and I have used it to implement the decryption rounds. Each borrow-out B-out from a full subtractor is connected to the borrow-in B-in of the subtractor at the immediately higher position.


/* xor the three terms and decrement v1 */
borrow = 0;    
for (i = 0;i < 32;++i) {       
    ai = v[offset_v0 + i];       
    bi = v1_lshift_4_plus_k0[i] \
       ^ v1_plus_sum[i]         \
       ^ v1_rshift_5_plus_k1[i];       
    notaandb = (ai ^ VECTOR_AT_ONE) & bi;       
    axorb = ai ^ bi;       
    v[offset_v1 + i] = axorb ^ borrow;       
    borrow = notaandb \
           | ((ai ^ VECTOR_AT_ONE) & borrow) \
           | (bi & borrow);     

These are all the building blocks you need for a bitslice implementation of TEA. The code provides the routines needed for encoding and decoding the variables in the correct arrangement and what you need for start playing with it.

By now, you should feel prepared to give a closer look to this technique, maybe use it in your own code (and cryptographic code, if you are a cryptographer) or recognize and understand it in implementations provided by others. I very much welcome your comments and I look forward to hearing from you about how you are doing with bitslicing and how you are putting this to use. Thanks for reading so far the first issue of Untwisted.


Code and Exercises

Download the most recevent revision, or get the source code from its Mercurial repository:

hg clone https://hg.opensource.secYOUre.com/bstea

If you feel inclined, here are few exercises to try:

  • Write a bitslice implementation of another algorithm of your choice. Possible candidates are: IDEA, XTEA, XXTEA, or EnRUPT;
  • Run the cipher in counter (CTR) mode of operation and devise how to best take advantage of the bitsliced implementation to add the keystream to the plaintext/ciphertext;
  • Optimize the bitslice implementation for speed in software. Consider different processors or the use of instruction sets for doing SIMD within a register (SWAR). How fast can you sip at your TEA?


  • avatar


    Posted 2 years, 3 months ago.

    Hi, I am new to bitslicing topic. Actually I need help in understanding this technique as I want to use it for developing a fast encryption software. I am particularly interested in DES and AES. I used source codes available for DES bitslicing and measured the encryption rate on my PC, also compared with openssl encryption rate. I find them to be same, how to optimize bitslicing further to enhance performance, I need to beat openssl encryption rate of 1 million encryptions/sec achieved on my PC. I want upto 5 times performance increase, I hope you will provide valuable information for solving my problem. thanx, waiting for ur reply !!

  • avatar

    Alfonso De Gregorio

    Posted 2 years, 3 months ago.

    I am not sure which bitsliced DES implementation you have used in your benchmarks. Still, OpenSSL comes with a plain DES implementation and a general purpose CPU-based bitsliced implementations can certainly outperform OpenSSL in DES-evaluations per second.

    As a case in point, the John the Ripper bitsliced implementation (pro edition) uses SSE2 instructions and performs 36.85Mencryptions/s, a significant improvement over OpenSSL with 13.32 Mencryptions/s.

    However, this is still not enough to achieve a 5-fold speedup over OpenSSL.

    To get faster, consider using the CUDA programming model. A couple of years ago, Agosta et al. proposed the first GPU-based bitsliced implementation of DES (geared toward password breaking and based on GTX260). Their implementation outperforms John the Ripper by a factor of 10, with a peak performance of 373.58 MKeys per second:

    Record Setting Software Implementation of DES Using CUDA

    This is 28-fold speedup over OpenSSL.

    For the sake of completeness, the COPACOBANA FPGA implementation tops 2^16 Mencryptions/s.

    Thanks and let me know how you are doing with your project.

  • avatar


    Posted 2 years, 3 months ago.

    Wow, awesome speeds, and thanx for ur prompt reply. I used Matthew Kawn's code but it is the old one (1998) in C language with no SSE optimizations. I found it simple to understand so tested with it. I will look at john the ripper and also the GPU based alternative.

    Will be in touch to let you know about the progress of my project.

  • avatar


    Posted 1 year ago.