Reduced Library with Good Synthesis Performance 2/4

  
gate count               1751
number of cells           568
number of library cells    92
number of used cells       50
max fanin                   4
max input capacitance      94
max internal fanout        34
critical path  0fF       2123
critical path  6fF       2462

By selectively removing cells, the library size can be more than halved with only a 0.9% loss in performance.

The interesting observation here is that removing the x1 drive strength cells does not worsen performance significantly. For most functions, these are the cells with the largest transistor sizes that fit in the smallest area before folding is needed. The conclusion, however, is that cells with half-sized transistors are better: they load the critical path less when driving non-critical outputs, and x2 or stronger drive strengths are chosen for the critical path itself.

The critical path itself is shown below on the left, with the full library critical path on the right.

    <  92 cell library critical path   >   <  188 cell library critical path  >
    x 1           3                   51   x 1           3             61
 1  bf1v0x12     15  a->z      191   140   bf1v0x12     15  a->z      208   147
 2  nd4v0x3       1  d->z      294   103   nd4v0x3       1  d->z      311   103
 3  oai21v0x8     4  b->z      375    81   oai21v0x8     4  b->z      392    81
 4  iv1v0x12      1  a->z      424    49   xor2v0x4      1  b->z      473    81
 5  oai21v0x8     4  a2->z     513    89   cgi2v0x3      3  c->z      578   105
 6  xor2v0x4      1  b->z      610    97   iv1v0x6       1  a->z      627    49
 7  cgi2v0x3      3  a->z      729   119   cgi2v0x3      3  c->z      730   103
 8  iv1v0x4       1  a->z      784    55   iv1v0x6       1  a->z      779    49
 9  cgi2v0x3      3  c->z      888   104   cgi2v0x3      3  c->z      889   110
10  iv1v0x4       1  a->z      943    55   iv1v0x6       1  a->z      938    49
11  cgi2v0x3      3  c->z     1039    96   cgi2v0x3      3  c->z     1055   117
12  iv1v0x4       1  a->z     1094    55   iv1v0x6       1  a->z     1104    49
13  cgi2v0x3      3  c->z     1203   109   cgi2v0x3      3  c->z     1209   105
14  iv1v0x4       1  a->z     1257    54   iv1v0x6       1  a->z     1258    49
15  cgi2v0x3      3  c->z     1362   105   cgi2v0x3      3  c->z     1361   103
16  iv1v0x4       1  a->z     1416    54   iv1v0x6       1  a->z     1410    49
17  cgi2v0x3      4  c->z     1534   118   cgi2v0x3      4  c->z     1529   119
18  xnr2v0x3      1  a->z     1638   104   xnr2v0x3      1  a->z     1633   104
19  xor2v0x4      1  b->z     1727    89   xor2v0x4      1  b->z     1722    89
20  cgi2v0x3      2  a->z     1822    95   cgi2v0x3      2  a->z     1817    95
21  iv1v0x4       1  a->z     1877    55   iv1v0x4       1  a->z     1871    54
22  cgi2v0x3      2  c->z     1960    83   cgi2v0x3      2  c->z     1960    89
23  iv1v0x4       1  a->z     2010    50   iv1v0x6       1  a->z     2009    49
24  cgi2v0x2      2  c->z     2096    86   cgi2v0x3      2  c->z     2090    81
25  an2v0x4       2  b->z     2205   109   an2v0x8       2  b->z     2194   104
26  an2v0x8       2  b->z     2315   110   an2v0x8       2  b->z     2304   110
27  xor2v0x2      0  b->z     2462   147   xaon21v0x3    0  a2->z    2441   137
    r 14                                   r 15

These two critical paths are nearly the same. Apart from changes in drive strength, the cells differ only at gates 4, 5, 6 and 27 (gate 7 also uses a different input pin). It is useful to analyse the differences and see where the loss in speed occurs.
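
The differences can be read straight off the arrival times in the table above. The following Python sketch does the arithmetic. The list and function names are illustrative only, and the numbers are the cumulative delays (taken to be in ps) copied from the two halves of the table.

```python
# Arrival times at the output of each gate, copied from the table above.
# Index 0 is the arrival at input pin x(1); index n is the output of gate n.
ARRIVAL_92 = [51, 191, 294, 375, 424, 513, 610, 729, 784, 888, 943, 1039, 1094,
              1203, 1257, 1362, 1416, 1534, 1638, 1727, 1822, 1877, 1960, 2010,
              2096, 2205, 2315, 2462]
ARRIVAL_188 = [61, 208, 311, 392, 473, 578, 627, 730, 779, 889, 938, 1055, 1104,
               1209, 1258, 1361, 1410, 1529, 1633, 1722, 1817, 1871, 1960, 2009,
               2090, 2194, 2304, 2441]


def segment_delay(arrival, first_gate, last_gate):
    """Delay from the input of first_gate to the output of last_gate."""
    return arrival[last_gate] - arrival[first_gate - 1]


def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old


# Carry chain, gates 7 to 16 (ten cells) in each library.
carry_92 = segment_delay(ARRIVAL_92, 7, 16)
carry_188 = segment_delay(ARRIVAL_188, 7, 16)
carry_penalty = percent_increase(carry_92, carry_188)

# How much earlier the 92 cell path leaves gate 1.
gate1_lead = ARRIVAL_188[1] - ARRIVAL_92[1]

print(f"carry chain: {carry_92} vs {carry_188} ({carry_penalty:.1f}% slower)")
print(f"gate 1 output: 92 cell library ahead by {gate1_lead}")
```

Differencing cumulative arrival times rather than summing the per-stage column keeps the check independent of any rounding in the individual stage delays.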

  1. The path from cell #7 to cell #16 is a succession of inverting carry generators and inverters. The largest carry generators are chosen, and optimally the inverters should have an x6 drive strength. In the 92 cell library these don't exist, so an x4 drive strength inverter is used instead. This increases the delay of these 10 cells by 2.9%, from 783ps to 806ps.
    This is not a lot. In fact, the book on Logical Effort by Sutherland, Sproull and Harris predicts this (see for example Figure 3.7). It is however more than the overall increase in delay, which is only 0.9%.
  2. The output of cell #1 is actually reached 17ps earlier in the 92 cell library than in the 188 cell library. This is because the multiplier is made up of many parallel critical paths. To keep all of these faster than 2441ps in the 188 cell library, the loading on the input pin x(1) and on the input buffer is greater, and this also slows down the path which finally turns out to be critical.
  3. The output r(14) in the 188 cell library circuit is driven by an xor2v0x3. This cell doesn't exist in the 92 cell library and is replaced by the xor2v0x2, which is slower. This makes the r(14) output the critical path (3ps slower than the r(15) output) in the 92 cell library.
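
The small penalty in item 1 is what a simple logical effort argument would lead one to expect: delay has a shallow minimum around the optimum gate size. The Python sketch below is a textbook-style two-stage model, not a simulation of the vsclib cells. The stage effort of 4 and the total parasitic delay of 2 are assumed round numbers, and the function name is hypothetical.

```python
def relative_delay(size_ratio, stage_effort=4.0, parasitic=2.0):
    """Delay of a two-stage path when the second gate is mis-sized.

    size_ratio is actual size / optimum size of the second gate.
    Scaling that gate by size_ratio multiplies the effort of the stage
    driving it by size_ratio and divides its own stage effort by size_ratio.
    stage_effort and parasitic are ASSUMED values, not taken from the library.
    Returns the delay relative to the optimally sized path.
    """
    optimum = 2.0 * stage_effort + parasitic
    actual = stage_effort * (size_ratio + 1.0 / size_ratio) + parasitic
    return actual / optimum


# An x4 inverter where an x6 would be optimal is a size ratio of 4/6.
for ratio in (0.5, 4.0 / 6.0, 1.0, 1.5, 2.0):
    penalty = 100.0 * (relative_delay(ratio) - 1.0)
    print(f"size ratio {ratio:.2f}: {penalty:.1f}% slower than optimum")
```

Under these assumptions a one-third undersizing costs a few percent on the two stages concerned, the same order as the 2.9% measured on the carry chain. The exact figure depends on the real stage efforts and parasitics, which this sketch does not know.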

If further cells are now removed, they will either be the high drive cells needed for the critical path, or cells like the xaon21v0x3, which can both appear on the critical path and significantly reduce the cell count. So from this analysis, the minimum set of combinatorial cells which gives the best performance is 92 cells. Increasing the library to 188 cells gives a slight performance benefit, 0.9% as measured with the multiplier. Whether to include the extra cells is a choice for the library developer.

Table of synthesis results

              critical path (ps)  gate count  cell count  porosity  library cells  used cells  notes
synthesis 1                 4279        1561         923       43%              9           8  basic inverters, NAND & NOR gates
synthesis 2                 4236        1472         792       45%             15          12  AND & OR gates
synthesis 3                 4157        1357         696       46%             19          16  AOI & OAI gates, 2/1 and 2/2
synthesis 4                 4157        1357         696       46%             20          16  mxi2 2-way inverting mux
synthesis 5                 3983        1343         668       48%             21          16  cgi2 carry generator inverting
synthesis 6                 3948        1352         668       48%             28          18  inverters with multiple drive strengths
synthesis 7                 3061        1433         666       51%             70          27  x2 drive strengths for all functions
synthesis 8                 3056        1456         666       52%             70          30  BOOG with x1 drive strengths
synthesis 9                 2960        1476         666       53%             70          32  BOOG with x05 drive strengths
synthesis 10                2963        1480         666       53%             76          34  nd2a and nr2a cells
synthesis 11                2963        1480         666       53%             79          34  nd2ab type of 2-OR
CyHP library                3778        1539         832       46%             18          17  Minimum size library
synthesis 12                2908        1362         553       54%             91          38  AND/OR into XOR/XNOR
synthesis 13                2893        1378         551       55%            103          39  aoi211, aoi31, oai211 & oai31
synthesis 14                2931        1400         562       55%            104          38  3-XOR gate, 1/2 stage delays
synthesis 15                2886        1390         536       56%            109          40  3-XOR/XNOR gates as 2×2-I/P gates
synthesis 16                2665        1514         538       60%            136          46  x3 drive strength cells
synthesis 17                2567        1571         540       61%            155          49  x4 drive strength cells
synthesis 18                2523        1611         540       62%            167          49  x6 drive strength cells
synthesis 19                2497        1625         538       62%            179          54  x8 drive strength cells
synthesis 20                2493        1628         541       62%            188          55  buffers to decouple non-critical paths
synthesis 21                2441        1758         563       64%            188          55  input buffers
synthesis 22                2550        1717         535       64%            188          55  optimised Alliance flow
synthesis 23                2439        1695         560       63%            188          58  current 209 cell vsclib
synthesis 24                2462        1751         568       64%             92          50  reduced 92 cell library
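
The trade-off between library size and speed can be pulled out of the table above with a few lines of Python. This is only a sketch: the variable and function names are illustrative, the tuples copy the critical path and library cell columns, and synthesis 21 is assumed to be the 188 cell reference against which the 92 cell library is compared.

```python
# (run, critical path in ps, library cells), copied from the table above.
RESULTS = [
    ("synthesis 1", 4279, 9), ("synthesis 2", 4236, 15),
    ("synthesis 3", 4157, 19), ("synthesis 4", 4157, 20),
    ("synthesis 5", 3983, 21), ("synthesis 6", 3948, 28),
    ("synthesis 7", 3061, 70), ("synthesis 8", 3056, 70),
    ("synthesis 9", 2960, 70), ("synthesis 10", 2963, 76),
    ("synthesis 11", 2963, 79), ("CyHP library", 3778, 18),
    ("synthesis 12", 2908, 91), ("synthesis 13", 2893, 103),
    ("synthesis 14", 2931, 104), ("synthesis 15", 2886, 109),
    ("synthesis 16", 2665, 136), ("synthesis 17", 2567, 155),
    ("synthesis 18", 2523, 167), ("synthesis 19", 2497, 179),
    ("synthesis 20", 2493, 188), ("synthesis 21", 2441, 188),
    ("synthesis 22", 2550, 188), ("synthesis 23", 2439, 188),
    ("synthesis 24", 2462, 92),
]

by_name = {name: (delay, cells) for name, delay, cells in RESULTS}


def slowdown_percent(run, reference):
    """Critical path increase of run over reference, in percent."""
    delay, _ = by_name[run]
    ref_delay, _ = by_name[reference]
    return 100.0 * (delay - ref_delay) / ref_delay


# Assumption: synthesis 21 is the 188 cell baseline for the comparison.
_, full_cells = by_name["synthesis 21"]
_, reduced_cells = by_name["synthesis 24"]
size_ratio = reduced_cells / full_cells
penalty = slowdown_percent("synthesis 24", "synthesis 21")
fastest = min(RESULTS, key=lambda row: row[1])

print(f"library size ratio: {size_ratio:.2f}")
print(f"delay penalty: {penalty:.2f}%")
print(f"fastest run: {fastest[0]} at {fastest[1]} ps")
```

Keeping the rows as plain tuples makes it easy to rerun the comparison against a different baseline, for example synthesis 23.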