# A 4Gb/s CMOS FULLY-DIFFERENTIAL ANALOG DUAL DELAY-LOCKED LOOP CLOCK/DATA RECOVERY CIRCUIT

Zhiwei Mao and Ted H. Szymanski

Optical Network Research Group, ECE Department, McMaster University, Hamilton, Ontario, Canada L8S 4K1

# ABSTRACT

A 4Gb/s power- and area-efficient clock/data recovery (CDR) circuit is proposed. A fully-differential design is employed to reject common-mode noise and to significantly reduce power/ground bounce. An analog dual delay-locked loop (DLL) architecture continuously aligns the clock sampling edge to the center of the incoming data eye opening. A self-correcting function overcomes the phase capture range limitation of traditional DLLs. The prototype circuit is implemented in 0.18um CMOS technology. The CDR occupies a small area of 200um x 320um and dissipates 27mW from a 2V power supply.

## 1. INTRODUCTION

As the speed of VLSI systems increases rapidly, small, low-power, high-speed I/O interfaces have been widely studied in recent years. Both delay-locked loops (DLLs) and phase-locked loops (PLLs) can be employed in CDR circuits to cancel clock/data skew and to improve overall system timing. In cases where a reference clock is available, DLLs are often used because, in contrast to PLLs, they do not accumulate phase error. Moreover, DLLs generally have a much simpler design and are inherently stable. The drawbacks of conventional DLLs are their limited phase capture range and the propagation of input clock jitter. In addition, digital DLLs [1] have inevitable quantization error and generally require more area and power, whereas analog DLL designs [2] have been regarded as more sensitive to noise.

This paper proposes a novel CMOS CDR circuit that adopts a fully-differential structure to reduce sensitivity to common-mode noise and applies an analog dual DLL to achieve continuous phase alignment and robust data recovery. The CDR core circuit occupies a small area and consumes low power at a data rate of 4Gb/s.

This paper is arranged as follows: section 2 presents the CDR architecture, section 3 discusses circuit design issues in the prototype implementation of the architecture in 0.18um CMOS technology, section 4 shows the prototype chip implementation and simulation results and section 5 concludes this paper.



Fig. 1. Analog Dual-DLL CDR Architecture.

## 2. ARCHITECTURE

A simplified block diagram of the proposed CDR is shown in Fig. 1. It incorporates two DLLs: a 3x oversampling DLL tracking loop that aligns the data sampling clock at the center of the data eye opening, and an eye-measuring loop (EM-loop) [3] that adjusts the intervals of the 3 sampling clocks according to the amount and shape of the jitter distribution. The incoming data is 3x sampled by 6 differential master/slave (M/S) D-flipflops. Consequently, the data recovery frequency is half of the incoming data bit rate.

The DLL tracking loop consists of a main voltage-controlled delay line (VCDL1), 3x samplers, an early/late detector, a charge-pump (CP1), and a loop-filter (LF1). Compared to tracked 2x oversampling, the tracked 3x oversampling technique with dead-zone phase detection has lower phase noise, and a DLL using 3x oversampling allows a larger pumping current or a smaller loop capacitor for the same loop bandwidth [5].
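The decision made by an early/late detector with a dead zone can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the gate-level detector of the prototype: it assumes three binary samples (left, center, right) taken inside one bit slot, and the mapping of the transition position to "early" or "late" is an assumed convention. The function name `early_late_decision` is hypothetical.

```python
def early_late_decision(left: int, center: int, right: int) -> str:
    """Classify one 3x-oversampled bit slot (behavioral sketch).

    Assumed convention (not taken from the paper):
      - left != center: the left clock still sampled the previous bit,
        so the sampling clocks are early relative to the data -> "early"
      - center != right: the right clock already sampled the next bit,
        so the sampling clocks are late -> "late"
      - all three samples equal: no transition was seen by the outer
        clocks, i.e. the dead zone -> "hold" (no charge pumping)
    """
    if left == center == right:
        return "hold"
    if left != center and center == right:
        return "early"
    if left == center and center != right:
        return "late"
    # Both outer samples differ from the center: treated here as an
    # ambiguous slot (assumption) and ignored.
    return "hold"


if __name__ == "__main__":
    for samples in [(0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 1, 0)]:
        print(samples, "->", early_late_decision(*samples))
```

Only slots whose data transition falls between the outer clocks and the center clock produce a pumping event in this model, which is the property the dead zone relies on.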



Fig. 2. 3x sampling clocks and jitter distribution.

The EM-loop consists of a VCDL2, a wide/narrow detector, CP2 and LF2. In the tracked 3x oversampling technique, only the two tail portions at both sides of the jitter histogram, as shown in Fig. 2, activate the phase adjustment. When the intervals of the 3 sampling clocks are fixed at 1/3 of the bit interval, the portion of jitter participating in phase detection is completely determined by the jitter distribution. In situations of severe jitter, the operative portion of the jitter histogram is large, so the phase detector generates more pumping current, and the phase noise at the output clock and data is larger. Conversely, in situations of low jitter, the lack of transition information causes the clock edges to drift, resulting in sub-optimum performance [3].

With the EM-loop, the sampling interval is adjusted according to the distribution of the jitter histogram. Referring to Fig. 2, the amount of operative jitter  $A_i$  can be controlled based on the knowledge of transition density of the data stream. In fact,  $A_i$  is determined by the ratio of the Up/Down pumping currents:

$$\frac{I_{up}}{I_{down}} = \frac{P(\text{wide})}{P(\text{narrow})} = \frac{A_i \alpha (2 - A_i \alpha)}{(1 - A_i \alpha)^2}$$

For typical transition density  $\alpha = 0.5$  and  $A_i = 0.2$ , the pumping current ratio  $I_{up} / I_{down}$  should be around 1/4.
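The quoted value can be checked numerically. The sketch below simply evaluates the ratio expression above; the function name `pump_current_ratio` is a hypothetical helper, not part of the design.

```python
def pump_current_ratio(alpha: float, a_i: float) -> float:
    """Evaluate I_up / I_down = x(2 - x) / (1 - x)^2 with x = A_i * alpha."""
    x = a_i * alpha
    p_wide = x * (2.0 - x)        # numerator of the ratio
    p_narrow = (1.0 - x) ** 2     # denominator of the ratio
    return p_wide / p_narrow


ratio = pump_current_ratio(alpha=0.5, a_i=0.2)
print(f"I_up/I_down = {ratio:.4f}")  # prints "I_up/I_down = 0.2346"
```

With $A_i \alpha = 0.1$ the ratio is $0.19 / 0.81 \approx 0.235$, which is close to the 1/4 stated above. Note that the numerator and denominator sum to 1 for any $A_i \alpha$, as expected if the two expressions are the probabilities of complementary events.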

For robust acquisition of the CDR, the EM-loop bandwidth should be much smaller than that of the DLL tracking loop so that the two DLLs do not conflict. The DLL tracking loop can then swiftly track the center of the data eye opening, while the EM-loop adjusts the sampling interval based on the result averaged over a relatively long period.

All of the main blocks of the CDR are implemented with fully differential structure which reduces the effect of common-mode noise, the magnitude of the current fluctuation on power supply, and thus, the power/ground bouncing noise. To the best of our knowledge, this is the first fully differential DLL design used in a CDR circuit. In order to suit large scale integration, no external filter components are required in this design. The CDR core uses 696 transistors and 4 on-chip capacitors.

## 3. CIRCUIT DESIGN

#### 3.1. Voltage-Control Delay Lines (VCDLs)

The VCDL is the most critical part of the CDR circuit design. In addition to good noise immunity, the VCDL must have a large delay range, good linearity, and sufficient bandwidth. Several differential delay lines have been analyzed, and the interpolation topology is chosen for its large delay range and good linearity. A bipolar delay interpolation circuit in [4] was converted to a CMOS design, and the control path is implemented as a folded PMOS stage to avoid transistor stacking and thus allow a low power supply. Fig. 3 shows one of the nine delay stages in VCDL1. The pair M3-M4 constitutes the fast path and M5~M10 the slow path. The total delay is the interpolation of the fast and slow paths, with their weights controlled by the folded PMOS current-steering pair M11-M12.



To achieve unlimited phase adjustment range, VCDL1 is designed to have a $\pm 2\pi$ phase delay range, as shown in Fig. 4(a). When the delay adjustment reaches $+2\pi$ or $-2\pi$, the correcting circuit resets the control voltage and forces the phase adjustment to return to the initial state. The same topology as VCDL1 is used for VCDL2; only the transistors are resized to satisfy the delay requirements for the sampling clock interval and to deal with the increased capacitive loads introduced by the output clock buffers. Specifically, VCDL2 is required to have a delay range of $1/4 \sim 1/2$ of the data bit interval. The post-layout simulation of a single delay stage in VCDL2 is shown in Fig. 4(b).
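For orientation, the VCDL2 delay requirement can be converted into absolute time at the 4Gb/s data rate. The sketch below is plain arithmetic on the figures stated in the text; the variable names are illustrative only.

```python
DATA_RATE_BPS = 4e9                       # 4 Gb/s, from the text

bit_interval_ps = 1e12 / DATA_RATE_BPS    # one data bit interval: 250 ps
vcdl2_min_ps = bit_interval_ps / 4        # lower end of the 1/4 ~ 1/2 range
vcdl2_max_ps = bit_interval_ps / 2        # upper end of the 1/4 ~ 1/2 range

print(f"bit interval      : {bit_interval_ps:.1f} ps")
print(f"VCDL2 delay range : {vcdl2_min_ps:.1f} ps to {vcdl2_max_ps:.1f} ps")
```

Under these numbers VCDL2 has to span roughly 62.5 ps to 125 ps, which brackets the nominal 1/3-bit sampling interval of about 83 ps.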



Fig. 4. Post-layout simulation of: (a) VCDL1; (b) a delay stage in VCDL2.

#### 3.2. Differential Charge-Pumps

The charge pump employs a well-known current-switching technique with common-mode feedback (CMFB) [4], as shown in Fig. 5. Besides providing immunity to common-mode noise, the differential structure allows NMOS gate capacitors to be used for C1 and C2. Gate-oxide capacitors generally have high capacitance density but need an appropriate dc bias. In Fig. 5, the dc bias voltages of C1 and C2 are controlled by the CMFB circuit (M5~M9).



Fig. 5. Differential Charge-Pump.

For CP1, the normal charge-pump can be used by setting  $I_{up} = I_{down} = I_1 - I_{CM} = I_2 - I_{CM} = I_3 = I_4$ , where  $I_{CM}$  is the current drawn by M7 or M8. However, CP2 requires  $I_{up} \neq I_{down}$  as discussed in section 2. It is possible to convert from a symmetric charge-pump to asymmetric one by rearranging the 4 current sources  $I_1 \sim I_4$ ; for example, in order to create  $I_{up}/I_{down} = \beta$ , the 4 current sources should satisfy:

$$\begin{cases} I_1 - I_{CM} = \beta (I_2 - I_{CM}) \\ I_3 = I_4 \\ I_1 + I_2 - 2I_{CM} = I_3 + I_4 \end{cases}$$
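The three conditions leave one degree of freedom, so the current sources can be sized once one pumping current is chosen. The sketch below assumes $I_{up} = I_1 - I_{CM}$ and $I_{down} = I_2 - I_{CM}$, in line with the symmetric case above; the function name and the numeric values are hypothetical and are not taken from the prototype.

```python
def asymmetric_pump_currents(beta: float, i_down: float, i_cm: float):
    """Solve the three conditions for I1..I4 given beta, I_down and I_CM.

    Assumes I_up = I1 - I_CM and I_down = I2 - I_CM (an interpretation of
    the symmetric case in the text, not a statement from the paper).
    """
    i1 = i_cm + beta * i_down             # I1 - I_CM = beta * (I2 - I_CM)
    i2 = i_cm + i_down                    # definition of I_down
    i3 = i4 = (beta + 1.0) * i_down / 2   # I3 = I4, and I3 + I4 = I_up + I_down
    return i1, i2, i3, i4


# Hypothetical example values: beta = 1/4, I_down = 40 uA, I_CM = 10 uA.
i1, i2, i3, i4 = asymmetric_pump_currents(beta=0.25, i_down=40e-6, i_cm=10e-6)
for name, value in zip(("I1", "I2", "I3", "I4"), (i1, i2, i3, i4)):
    print(f"{name} = {value * 1e6:.1f} uA")
```

The third condition balances the total sourced and sunk current, so the tail sources I3 and I4 each carry the average of the up and down pumping currents.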

#### 3.3. Self-Correcting Circuit

As already mentioned, the DLL has a limited delay range. When the max/min limit delay is met, the DLL cannot introduce more delay variation. A self-correcting (SC) circuit is often used to move back or correct the delay to a midpoint where it can acquire control again.



Fig. 6. Proposed self-correcting Circuit.

An SC circuit, shown in Fig. 6, is implemented in the proposed CDR to monitor the differential control voltages in CP1. When the phase adjustment of VCDL1 reaches $+2\pi$ or $-2\pi$, the max/min delay is met, and either Vo- or Vo+ in Fig. 5 hits the lower limit Vmin and generates a reset signal. M10 is then turned on to merge the charges on C1 and C2 and force the differential control voltage to zero. Consequently, the delay of VCDL1 changes by exactly one bit, and the center sampling clock is kept at the same locked position as before the reset. The three inverters in Fig. 6 work as a signal delay line. This method minimizes the number of errors as long as the differential signal integrated on C1 and C2 can be quickly removed.
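The reset action can be summarized with a behavioral model. This is a sketch under assumptions, not the transistor-level circuit of Fig. 6: it assumes C1 holds Vo+ and C2 holds Vo-, treats the merge through M10 as ideal charge sharing, and ignores the inverter delay line. The function name `self_correct` and the voltages used are hypothetical.

```python
def self_correct(vo_plus: float, vo_minus: float, v_min: float,
                 c1: float = 1.0, c2: float = 1.0):
    """Return (vo_plus, vo_minus, reset_fired) after one monitoring step.

    If either control voltage has reached the lower limit v_min, the
    charges on the two capacitors are merged (ideal charge sharing), which
    forces the differential control voltage to zero.
    """
    if min(vo_plus, vo_minus) > v_min:
        return vo_plus, vo_minus, False       # still inside the range
    # Charge conservation: (C1*V1 + C2*V2) / (C1 + C2)
    v_merged = (c1 * vo_plus + c2 * vo_minus) / (c1 + c2)
    return v_merged, v_merged, True


# Hypothetical voltages: Vo- has dropped to the assumed limit of 0.4 V.
vp, vm, fired = self_correct(vo_plus=1.4, vo_minus=0.4, v_min=0.4)
print(f"reset={fired}, Vo+={vp:.2f} V, Vo-={vm:.2f} V, diff={vp - vm:.2f} V")
```

With equal capacitors the merged voltage is the common-mode level of the two nodes, so only the differential component is removed.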

## 4. SIMULATION RESULTS

The test chip is implemented in a 1mm x 1mm area using TSMC 0.18um 1P6M CMOS technology. A 2.0V power supply is used for both core and I/O circuits. The CDR core occupies 200um x 320um and consumes only 27mW at 4Gb/s. The on-chip filter capacitors take about 1/4 of the CDR core area. Fig. 7 shows the layout diagram of the test chip.



Fig. 7. Layout diagram of the CDR test chip.

Both pre-layout and post-layout simulations are performed to verify the performance of the proposed CDR circuits. Table I lists post-layout simulation performance characteristics of the test chip.

| Parameter                   | Value         |
|-----------------------------|---------------|
| Technology                  | 0.18um CMOS   |
| Power supply                | 2V            |
| Power of core               | 27mW          |
| Area of core                | 200um x 320um |
| Data rate                   | 4Gb/s         |
| DLL tracking loop lock time | < 60ns        |
| EM-loop lock time           | < 300ns       |

Table I. Performance Characteristics of the CDR.
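A few derived figures follow directly from the numbers in Table I. The sketch below is plain arithmetic on the reported values; the derived quantities (energy per bit, core area in mm², average supply current) are not stated in the paper and are computed here only for orientation.

```python
# Values reported for the CDR core (post-layout simulation).
power_w = 27e-3          # 27 mW
supply_v = 2.0           # 2 V supply
data_rate_bps = 4e9      # 4 Gb/s
core_w_um, core_h_um = 200, 320

# Derived quantities (not stated in the paper).
energy_per_bit_pj = power_w / data_rate_bps * 1e12   # about 6.75 pJ/bit
core_area_mm2 = core_w_um * core_h_um * 1e-6         # about 0.064 mm^2
filter_cap_area_mm2 = core_area_mm2 / 4              # "about 1/4" of the core
avg_supply_current_ma = power_w / supply_v * 1e3     # about 13.5 mA

print(f"energy per bit       : {energy_per_bit_pj:.2f} pJ")
print(f"core area            : {core_area_mm2:.3f} mm^2")
print(f"filter capacitor area: {filter_cap_area_mm2:.3f} mm^2 (approx.)")
print(f"average core current : {avg_supply_current_ma:.1f} mA")
```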

A simple data stream with  $\alpha = 2/3$  ("100100100...") is sent to the input of the CDR at 4Gb/s data rate. The pulse width of each '1' is intentionally set a little less than that of a '0', thereby introducing small jitter on the input data. Fig. 8 shows the locking procedure of the dual DLL. Fig. 8(a) is the initial state at t = 4ns, where the center clock (Clk c) is located far away from the center of the data bit and, especially, the left clock (Clk 1) is out of the bit slot. At t = 41.5ns in Fig. 8(b), the DLL tracking loop is locked and all 3 sampling clocks are moved into the bit slot. However, Clk c still does not point to the center of the bit interval because the sampling interval is much smaller than the data bit slot. When t = 163 ns in Fig. 8(c), the EM-loop also gets locked, and as a result the sampling interval has been increased to fit the data interval. A little timing shift between the data and sampling clocks exists in Fig. 8(c) due to the setup time required by the MS-DFF samplers.
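The transition density quoted for the test pattern can be verified by counting transitions over one period of the repeating sequence. The helper `transition_density` below is a hypothetical name introduced for this check; it assumes the pattern repeats indefinitely, so the wrap-around from the last bit to the first is counted.

```python
def transition_density(period: str) -> float:
    """Transitions per bit for an endlessly repeated bit pattern."""
    n = len(period)
    # Compare each bit with its successor, wrapping around at the end.
    transitions = sum(period[i] != period[(i + 1) % n] for i in range(n))
    return transitions / n


alpha = transition_density("100")
print(f"alpha = {alpha:.4f}")  # prints "alpha = 0.6667"
```

Each period "100" contains one 1-to-0 and one 0-to-1 transition over three bits, giving $\alpha = 2/3$ as stated.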

Fig. 8. Post-layout simulation of locking procedure.

Fig. 9 shows the locking procedure of the control voltages generated after the differential charge-pumps and loop-filters in both the DLL tracking loop (Vcon10 and Vcon11) and the EM-loop (Vcon20 and Vcon21). Initially, the control voltages are reset before t=5ns. When t=41ns, Vcon10 and Vcon11 are settled, which implies a locked DLL tracking loop. When t=160ns, Vcon20 and Vcon21 tend to be stable, corresponding to the locked EM-loop. This agrees well with the procedure in Fig. 8. Fig. 10 exhibits the recovered data (1<sup>st</sup>\_bit and 2<sup>nd</sup>\_bit) and clock signal. The prototype chip has been submitted for fabrication through the Canadian Microelectronics Corporation. More experimental data will be gathered after the chip is fabricated.





Fig. 10. Simulation of the recovered data and clock.

## 5. CONCLUSION

An analog dual DLL CDR composed of 696 transistors is described. The analog dual DLL architecture tracks the center of the data bit slot and adjusts the interval of the 3 sampling clocks to achieve optimal data recovery. The CDR applies a fully-differential structure, which has good noise immunity and greatly reduces the power/ground bouncing noise. Simulation results show that this CDR is capable of operating at 4Gb/s and has a fast locking time. The test chip is laid out in a 0.18um CMOS process, consumes low power, and occupies a small area. Consequently, this chip is suitable for applications requiring multi-channel high-speed I/O and a high level of integration.

## 6. ACKNOWLEDGEMENT

This research was supported by the Ontario Graduate Scholarships in Science and Technology, by the Natural Sciences and Engineering Research Council of Canada grant #121602, by the L.R. Wilson/Bell Canada Enterprises Chair in Data Communications at McMaster University, and by the Canadian Microelectronics Corporation through computing equipment and CMOS IC fabrication services.

## 7. REFERENCES

[1] M.-J. Edward Lee et al., "An 84-mW 4-Gb/s Clock and Data Recovery Circuit for Serial Link Applications," *2001 Symposium on VLSI Circuits Digest of Technical Papers*, pp. 149-152, 2001.

[2] X. Maillard et al., "A 900-Mb/s CMOS Data Recovery DLL Using Half-Frequency Clock," *IEEE J. Solid-State Circuits*, Vol. 37, No. 6, pp. 711-715, Jun. 2002.

[3] S.H. Lee et al., "A 5 Gb/s 0.25um CMOS jitter-tolerant variable-interval oversampling clock/data recovery circuit," *ISSCC Digest of Technical Papers*, pp. 256-465, 2002.

[4] Behzad Razavi, "Monolithic phase-locked loops and clock recovery circuits: theory and design," IEEE Press, NY, 1996.

[5] Yongsarn Moon et al., "A 0.6-2.5-GBaud CMOS tracked 3x oversampling transceiver with dead-zone phase detection for robust clock/data recovery," *IEEE J. Solid-State Circuits*, Vol. 36, pp. 1974-1983, Dec. 2001.