{"title": "A Silicon Primitive for Competitive Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 713, "page_last": 719, "abstract": null, "full_text": "A silicon primitive for competitive learning \n\nDavid Usu \n\nMiguel Figueroa \n\nChris Diorio \n\nComputer Science and Engineering \n\nThe University of Washington \n\n114 Sieg Hall, Box 352350 \n\nSeattle, W A 98195-2350 USA \n\nhsud, miguel, diorio@cs.washington.edu \n\nAbstract \n\nCompetitive learning is a technique for training classification and \nclustering networks. We have designed and fabricated an 11-\ntransistor primitive, that we term an automaximizing bump circuit, \nthat implements competitive learning dynamics. The circuit per(cid:173)\nforms a similarity computation, affords nonvolatile storage, and \nimplements simultaneous local adaptation and computation. We \nshow that our primitive is suitable for implementing competitive \nlearning in VLSI, and demonstrate its effectiveness in a standard \nclustering task. \n\n1 Introduction \n\nCompetitive learning is a family of neural learning algorithms that has proved use(cid:173)\nful for training many classification and clustering networks [1]. In these networks, a \nneuron's synaptic weight vector typically represents a tight cluster of data points. \nUpon presentation of a new input to the network, the neuron representing the closest \ncluster adapts its weight vector, decreasing the difference between the weight vector \nand present input. Details on this adaptation vary for different competitive learning \nrules, but the general functionality of the synapse is preserved across various com(cid:173)\npetitive learning networks. These functions are weight storage, similarity computa(cid:173)\ntion, and competitive learning dynamics. \nMany VLSI implementations of competitive learning have been reported in the lit(cid:173)\nerature [2]. 
These circuits typically use digital registers or capacitors for weight storage. Digital storage is expensive in terms of die area and power consumption; capacitive storage typically requires a refresh scheme to prevent weight decay. In addition, these implementations require separate computation and weight-update phases, increasing complexity. More importantly, neural networks built with these circuits typically do not adapt during normal operation. \n\nSynapse transistors [3][4] address the problems raised in the previous paragraph. These devices use floating-gate technology to provide nonvolatile analog storage and local adaptation in silicon. The adaptation mechanisms do not perturb the operation of the device, thus enabling simultaneous adaptation and computation. Unfortunately, the adaptation mechanisms provide dynamics that are difficult to translate into existing neural-network learning rules. Allen et al. [5] proposed a silicon competitive-learning synapse that used floating-gate technology in the early 1990s. However, that approach suffers from asymmetric adaptation, due to separate mechanisms for increasing and decreasing weight values. In addition, they neither characterized the adaptation dynamics of their device nor demonstrated competitive learning with it. \nWe present a new silicon primitive, the automaximizing bump circuit, that uses synapse transistors to implement competitive learning in silicon. This 11-transistor circuit computes a similarity measure, provides nonvolatile storage, implements local adaptation, and performs simultaneous adaptation and computation. In addition, the circuit naturally exhibits competitive learning dynamics. 
In this paper, we derive the properties of the automaximizing bump circuit directly from the physics of synapse transistors, and corroborate our analysis with data measured from a chip fabricated in a 0.35 µm CMOS process. In addition, experiments on a competitive learning circuit, and software simulations of the learning rule, show that this device provides a suitable primitive for competitive learning. \n\n2 Synapse transistors \n\nThe automaximizing bump circuit's behavior depends on the storage and adaptation properties of synapse transistors, so this section briefly reviews these devices. A synapse transistor comprises a floating-gate MOSFET, with a control gate capacitively coupled to the floating gate, and an associated tunneling implant. The transistor uses floating-gate charge to implement a nonvolatile analog memory, and outputs a source current that varies with both the stored value and the control-gate voltage. The synapse uses two adaptation mechanisms: Fowler-Nordheim tunneling [6] increases the stored charge; impact-ionized hot-electron injection (IHEI) [7] decreases the charge. Because tunneling and IHEI can both be active during normal transistor operation, the synapse enables simultaneous adaptation and computation. \nA voltage difference between the floating gate and the tunneling implant causes electrons to tunnel from the floating gate, through the gate oxide, to the tunneling implant. We can approximate this current (with respect to fixed tunneling and floating-gate voltages, Vtun0 and Vg0) as [4]: \n\nItun = Itun0 e^((ΔVtun - ΔVg)/Vx)   (1) \n\nwhere Itun0 and Vx are constants that depend on Vtun0 and Vg0, and ΔVtun and ΔVg are the deviations of the tunneling and floating-gate voltages from these fixed levels. \nIHEI adds electrons to the floating gate, decreasing its stored charge. 
The IHEI current increases with the transistor's source current and drain-to-source voltage; over a small drain-voltage range, we model this dependence as [3][4]: \n\nIinj = Iinj0 (Is/Is0) e^(Vsd/Vy)   (2) \n\nwhere the constant Vy depends on the VLSI process, and Ut is the thermal voltage. \n\n3 Automaximizing bump circuit \n\nThe automaximizing bump circuit (Fig. 1) is an adaptive version of the classic bump-antibump circuit [8]. It uses synapse transistors to implement the three essential functions of a competitive learning synapse: storage of a weight value μ, computation of a similarity measure between the input and μ, and the ability to move μ closer to the input. Both circuits take two inputs, V1 and V2, and generate three currents. \n\nFigure 1. (a) Automaximizing bump circuit. M1-M5 form the classic bump-antibump circuit; we added M6-M11 and the floating gates. (b) Data showing that the circuit computes a similarity between the input, Vin, and the stored value, μ, for three different stored weights. Vin is represented as V1 = +Vin/2, V2 = -Vin/2. \n\nThe two outside currents, I1 and I2, are a measure of the dissimilarity between the two inputs; the center current, Imid, is a measure of their similarity: \n\nImid = Ib / (1 + λ cosh^2(κΔV/2Ut))   (3) \n\nwhere λ and κ are process- and design-dependent parameters, ΔV is the voltage difference between V1 and V2, and Ib is a bias current. Imid is symmetric with respect to the difference between V1 and V2, and approximates a Gaussian centered at ΔV = 0. \nWe augment the bump-antibump circuit by adding floating gates and tunneling junctions to M1-M5, turning them into synapse transistors; M1 and M3 share the same floating gate and tunneling junction, as do M2 and M4. We also add transistors M6-M11 to control IHEI. 
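As a rough illustration of the similarity computation in Eq. 3, the following sketch evaluates a bump-shaped Imid in software. The parameter values (i_b, lam, kappa) are illustrative assumptions, not values fitted to the fabricated circuit:

```python
import math

def bump_similarity(v_in, mu=0.0, i_b=1e-9, lam=0.25, kappa=0.7, u_t=0.0258):
    """Sketch of Eq. 3: a bell-shaped similarity between the input v_in
    and a stored weight mu. Parameter values are illustrative only."""
    delta_v = v_in - mu
    return i_b / (1.0 + lam * math.cosh(kappa * delta_v / (2.0 * u_t)) ** 2)

# The output peaks when v_in equals the stored weight, falls off on either
# side, and is symmetric in (v_in - mu), like the measured bumps in Fig. 1(b).
assert bump_similarity(0.0) > bump_similarity(0.05) > bump_similarity(0.1)
assert abs(bump_similarity(0.1) - bump_similarity(-0.1)) < 1e-15
```

Shifting mu moves the location of the peak without changing the shape of the bump, which is the behavior reported for the stored weights in Fig. 1(b).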
For convenience, we will refer to our new circuit simply as a bump circuit. The charges stored on the bump circuit's floating gates, Q1 and Q2, shift Imid's peak away from ΔV = 0 by an amount determined by their difference. We interpret this difference as the weight, μ, stored by the circuit, and interpret Imid as a similarity measure between the circuit's input and stored weight. \n\nTunneling and IHEI adapt the bump circuit's weight. The circuit is automaximizing because tunneling and IHEI naturally tune the peak of Imid to coincide with the present input. This high-level behavior coincides with the dynamics of competitive learning; both act to decrease the difference between a stored weight and the applied input. Therefore, no explicit computation of the direction or magnitude of weight updates is necessary; the circuit naturally performs these computations for us. Consequently, we only need to indicate when the circuit should adapt, not how it adapts. Applying ~10V to Vtun and ~0V to Vinj activates adaptation. Applying <8V to Vtun and >2V to Vinj deactivates adaptation. \n\n3.1 Weight storage \n\nThe bump circuit's weight value derives directly from the charge on its floating gates. A synapse transistor's floating-gate charge looks, for all practical purposes, like a voltage source, Vs, applied to the control gate. This voltage source has a value Vs = Q/Cin, where Cin is the control-gate to floating-gate coupling capacitance and Q is the floating-gate charge. We encode the input to the bump circuit, Vin, as a differential signal: V1 = Vin/2 and V2 = -Vin/2 (similar results follow for any symmetric encoding of Vin). As a result, Imid computes the similarity between the two floating-gate voltages: Vfg1 = Vs1 + Vin/2 and Vfg2 = Vs2 - Vin/2, where Vs1 and Vs2 are the voltages due to the charge stored on the floating gates. 
We define the bump circuit's weight, μ, as: \n\nμ = Vs2 - Vs1   (4) \n\nThis weight corresponds to the value of Vin that equalizes the two floating-gate voltages (and maximizes Imid). Part (b) of Fig. 1 shows the bump circuit's Imid output for three weight values, as a function of the differential input. We see that different stored values change the location of the peak, but do not change the shape of the bump. Because floating-gate charge is nonvolatile, the weight is also nonvolatile. \n\nThe differential encoding of the input makes the bump circuit's adaptation symmetric with respect to (Vin - μ). Without loss of generality, we can represent Vin as: \n\nVin = μ + (Vin - μ)   (5) \n\nIf we apply Vin/2 and -Vin/2 to the two input terminals, we arrive at the following two floating-gate voltages: \n\nVfg1 = (Vs2 + Vs1 + Vin - μ) / 2   (6) \n\nVfg2 = (Vs2 + Vs1 - Vin + μ) / 2   (7) \n\nBy reversing the sign of (Vin - μ), we obtain the same floating-gate voltages on the opposite terminals. Because the floating-gate voltages are independent of the sign of (Vin - μ), the bump circuit's learning rule is symmetric with respect to (Vin - μ). \n\n3.2 Adaptation \n\nWe now explore the bump circuit's adaptation dynamics. We define ΔVfg = Vfg1 - Vfg2. From Eqs. 4-7, we can see that Vin - μ = ΔVfg. Consequently, the learning rate, dμ/dt, is equivalent to -dΔVfg/dt. In our subsequent derivations, we consider only positive ΔVfg, because adaptation is symmetric (albeit with a change of sign). We show complete derivations of the equations in this section in [9]. \n\nTunneling causes adaptation by decreasing the difference between the floating-gate voltages Vfg1 and Vfg2. Electron tunneling increases the voltage of both floating gates but, because tunneling increases exponentially with smaller floating-gate voltages (see Eq. 1), tunneling decreases the difference. 
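The symmetry argument of Eqs. 6-7 is easy to check numerically. A minimal sketch, with illustrative stored voltages (the values of Vs1 and Vs2 below are arbitrary assumptions):

```python
def floating_gate_voltages(v_in, v_s1, v_s2):
    """Eqs. 6-7: floating-gate voltages under the differential encoding
    V1 = +Vin/2, V2 = -Vin/2, with the weight mu = Vs2 - Vs1 (Eq. 4)."""
    mu = v_s2 - v_s1
    v_fg1 = (v_s2 + v_s1 + v_in - mu) / 2.0   # equals Vs1 + Vin/2
    v_fg2 = (v_s2 + v_s1 - v_in + mu) / 2.0   # equals Vs2 - Vin/2
    return v_fg1, v_fg2

# Reversing the sign of (Vin - mu) swaps the two floating-gate voltages,
# so the learning rule is symmetric in (Vin - mu).
v_s1, v_s2 = 1.0, 1.3          # illustrative stored values; mu = 0.3
a1, a2 = floating_gate_voltages(0.3 + 0.05, v_s1, v_s2)
b1, b2 = floating_gate_voltages(0.3 - 0.05, v_s1, v_s2)
assert abs(a1 - b2) < 1e-12 and abs(a2 - b1) < 1e-12
```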
Assuming that M1's floating-gate voltage is lower than M2's, the change in ΔVfg due to electron tunneling is: \n\ndΔVfg/dt = -(Itun1 - Itun2)/Cfg   (8) \n\nWe substitute Eq. 1 into Eq. 8 and solve for the tunneling learning rule: \n\ndΔVfg/dt = -It0 e^((ΔVtun - ΔV0)/Vx) sinh((ΔVfg - φ)/2Vx)   (9) \n\nwhere It0 = Itun0/Cfg, Vx is a model constant, ΔV0 = (ΔVfg1 + ΔVfg2)/2, and φ models the tunneling mismatch between synapse transistors. This rule depends on three factors: a controllable learning rate, ΔVtun; the difference between Vin and μ, ΔVfg; and the average floating-gate voltage, ΔV0. \n\nFigure 2. (a) Measured adaptation rates, due to tunneling and IHEI, along with fits from Eqs. 9 and 11. (b) Composite adaptation rate, along with a fit from (12). We slowed the IHEI adaptation rate (by using a higher Vinj), compared with the data from part (a), to cause better matching between tunneling and IHEI. \n\nThe circuit also uses IHEI to decrease ΔVfg. We bias the bump circuit so that only transistors M1 and M2 exhibit IHEI. According to Eq. 2, IHEI depends linearly on a transistor's source current, but exponentially on its source-to-drain voltage. Consequently, we decrease ΔVfg by controlling the drain voltages of M1 and M2. Coupled current mirrors (M6-M7 and M8-M9) at the drains of M1 and M2 simultaneously raise the drain voltage of the transistor that is sourcing a larger current, and lower the drain voltage of the transistor that is sourcing a smaller current. 
The transistor with the smaller source current will experience a larger Vsd, and thus exponentially more IHEI, causing its source current to increase rapidly. Diodes (M10 and M11) further increase the drain voltage of the transistor with the larger current, further reducing its IHEI. The net effect is that IHEI acts to equalize the currents and, likewise, the floating-gate voltages. Recently, Hasler proposed a similar method for controlling IHEI in a floating-gate differential pair [4]. \nAssuming I1 > I2, the change in ΔVfg due to IHEI is: \n\ndΔVfg/dt = -(Iinj1 - Iinj2)/Cfg   (10) \n\nWe expand the learning rule by substituting Eq. 2 into Eq. 10. To compute values for the drain voltages of M1 and M2, we assume that all of I1 flows through M11 and all of I2 flows through M7. The IHEI learning rule is given below: \n\ndΔVfg/dt = -Ij0 e^(ωΔV0) (e^(γVinj) Φ1(ΔVfg) - e^(ηVinj) Φ2(ΔVfg))   (11) \n\nwhere Ij0 = Iinj0/Cfg, γ = -2/Vy, η = -1/Vy, and ξ = κ/Vy. Φ1 and Φ2 are given by: \n\nΦ1(ΔVfg) = ((Ib - Imid)/2 · cosh(κΔVfg/2Ut))^(1 - 2Ut/κVy) e^(-σΔVfg)   (12) \n\nΦ2(ΔVfg) = ((Ib - Imid)/2 · cosh(κΔVfg/2Ut)) e^(-σΔVfg) (1 - e^(-κΔVfg/Ut))^(-Ut/Vy)   (13) \n\nwhere σ = (1 - Ut/Vy)κ/2Ut and ω = κ/2Ut - κ/2Vy - 1/Vy. Like tunneling, the IHEI rule depends on three factors: a controllable learning rate, Vinj; the difference between Vin and μ, ΔVfg; and ΔV0. Part (a) of Fig. 2 shows measurements of dΔVfg/dt versus ΔVfg due to tunneling and IHEI, along with fits to Eqs. 9 and 11, respectively. \n\nIHEI and tunneling facilitate adaptation by adding and removing electrons from the floating gates, respectively. In isolation, either mechanism would eventually drive the bump circuit out of its operating range. To obtain useful adaptation, we need to activate both mechanisms at the same time. 
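To build intuition for the tunneling rule alone (Eq. 9), the following toy Euler integration shows ΔVfg relaxing toward the mismatch offset φ rather than toward zero. Every constant here is an illustrative assumption, not a measured or fitted value:

```python
import math

# Toy Euler integration of the tunneling rule (Eq. 9). All constants are
# illustrative assumptions; real values depend on Vtun, C_fg, and the process.
i_t0 = 1e-3      # I_tun0 / C_fg, in V/s (assumed)
v_x = 1.0        # tunneling model constant, in V (assumed)
dv_tun = 0.0     # deviation of the tunneling voltage, held fixed here
dv_0 = 0.0       # average floating-gate deviation, held fixed here
phi = 0.02       # tunneling mismatch between the two synapses (assumed)

dv_fg = 0.4      # initial (Vin - mu), in volts
dt = 1.0
for _ in range(20000):
    rate = (-i_t0 * math.exp((dv_tun - dv_0) / v_x)
            * math.sinh((dv_fg - phi) / (2.0 * v_x)))
    dv_fg += rate * dt

# Tunneling drives dv_fg toward the mismatch offset phi, not to zero;
# this is the residual error that the combined rule (with IHEI) reduces.
assert abs(dv_fg - phi) < 1e-3
```

The sinh term makes the adaptation rate grow rapidly with the weight error, which matches the qualitative shape of the tunneling curve in Fig. 2(a).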
There is an added benefit to combining tunneling and IHEI: Part (a) of Fig. 2 shows that tunneling acts more strongly for smaller values of ΔVfg, while IHEI shows the opposite behavior. The mechanisms complement each other, providing adaptation over more than a 1V range in ΔVfg. We combine Eqs. 9 and 11 to derive the bump learning rule: \n\n-dΔVfg/dt = It0 e^((ΔVtun - ΔV0)/Vx) sinh((ΔVfg - φ)/2Vx) + Ij0 e^(ωΔV0) (e^(γVinj) Φ1(ΔVfg) - e^(ηVinj) Φ2(ΔVfg))   (14) \n\nPart (b) of Fig. 2 illustrates the composite weight-update dynamics. When ΔVfg is small, adaptation is primarily driven by IHEI, while tunneling dominates for larger values of ΔVfg. \n\nThe bump learning rule is unlike any learning rule that we have found in the literature. Nevertheless, it exhibits several desirable properties. First, it naturally moves the bump circuit's weight towards the present input. Second, the weight update is symmetric with respect to the difference between the stored value and the present input. Third, we can vary the weight-update rate over many orders of magnitude by adjusting Vtun and Vinj. Finally, because the bump circuit uses synapse transistors to perform adaptation, the circuit can adapt during normal operation. \n\n4 Competitive learning with bump circuits \n\nBelow, we summarize results from simulations of the bump learning rule and from a competitive learning circuit fabricated in the TSMC 0.35 µm process; for further details, consult [9]. We first compared the performance of a software neural network on a standard clustering task, using the bump learning rule (fitted to data from Fig. 2), against a basic competitive learning rule (learning rate ρ = 0.01): \n\ndμ/dt = ρ (Vin - μ)   (15) \n\nWe trained both networks on data drawn from a mixture of 32 Gaussians in a 32-dimensional space. 
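The hard competitive learning baseline of Eq. 15 can be sketched as follows. The dimensionality, neuron count, and cluster parameters below are deliberately small illustrative choices, not the paper's 32-Gaussian experimental setup:

```python
import random

# Minimal sketch of hard competitive learning (Eq. 15): only the winning
# (closest) weight vector moves toward each input. All sizes are illustrative.
random.seed(0)
dim, n_neurons, rho = 2, 4, 0.01
centers = [[random.random() for _ in range(dim)] for _ in range(n_neurons)]

def sample():
    """Draw a point near a randomly chosen cluster center."""
    c = random.choice(centers)
    return [v + random.gauss(0.0, 0.05) for v in c]

def closest(weights, x):
    """Index of the weight vector nearest to x (the competition winner)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def coding_error(weights, data):
    """Sum of squared distances from each point to its closest neuron."""
    return sum(sum((w - v) ** 2 for w, v in zip(weights[closest(weights, x)], x))
               for x in data)

weights = [[random.random() for _ in range(dim)] for _ in range(n_neurons)]
test = [sample() for _ in range(200)]
before = coding_error(weights, test)

for _ in range(5000):
    x = sample()
    win = closest(weights, x)
    # Eq. 15: d(mu)/dt = rho * (Vin - mu), applied to the winner only.
    weights[win] = [w + rho * (v - w) for w, v in zip(weights[win], x)]

after = coding_error(weights, test)
assert after < before   # training reduces the quantization (coding) error
```

The coding error computed here is the same figure of merit used to compare the bump rule against Eq. 15 in Fig. 3(a).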
The Gaussian means were drawn from the interval [0,1], and the covariance matrix was the diagonal matrix 0.1·I. On each input presentation, the network updated the weight vector of the closest neuron using either the bump learning rule or Eq. 15. We measured the performance of the two learning rules by evaluating the coding error of each trained network on a test set drawn from the same distribution as the training data. The coding error is the sum of the squared distances between each test point and its closest neuron. Part (a) of Fig. 3 shows that the bump circuit's rule compares favorably with the hard competitive learning rule. \n\nOur VLSI circuit (Part (b) of Fig. 3) comprised two neurons with a one-dimensional input (each neuron was a single bump circuit), and a feedback network to control adaptation. The feedback network comprised a winner-take-all (WTA) circuit [10] that detected which bump was closest to the present input, and additional circuitry [9] that generated Vtun and Vinj from the WTA output. We tested this circuit on a clustering task: learning the centers of a mixture of two Gaussians. In part (c) of Fig. 3, we compare the performance of our circuit with a simulated neural network using Eq. 15. The VLSI circuit performed comparably with the neural network, demonstrating that our bump circuit, in conjunction with simple feedback mechanisms, can implement competitive learning in VLSI. We can generalize the circuitry to multiple dimensions (multiple bump circuits per neuron) and multiple neurons; each neuron requires only one Vtun and one Vinj signal. \n\nFigure 3. 
(a) Comparison of a neural network using the bump learning rule versus a standard competitive learning rule. We drew the training data from a mixture of thirty-two Gaussians, and averaged the results over ten trials. (b) A competitive learning circuit. (c) Performance of a competitive learning circuit versus a neural network for learning a mixture of two Gaussians. \n\nAcknowledgements \n\nThis work was supported by the NSF under grants BES 9720353 and ECS 9733425, and by a Packard Foundation Fellowship. \n\nReferences \n\n[1] M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, Cambridge, MA: The MIT Press, 1995. \n\n[2] H.C. Card, D.K. McNeill, and C.R. Schneider, \"Analog VLSI circuits for competitive learning networks\", Analog Integrated Circuits and Signal Processing, vol. 15, pp. 291-314, 1998. \n\n[3] C. Diorio, \"A p-channel MOS synapse transistor with self-convergent memory writes\", IEEE Transactions on Electron Devices, vol. 47, no. 2, pp. 464-472, 2000. \n\n[4] P. Hasler, \"Continuous-time feedback in floating-gate MOS circuits\", to appear in IEEE Transactions on Circuits and Systems II, Feb. 2001. \n\n[5] T. Allen et al., \"Electrically adaptable neural network with post-processing circuitry\", U.S. Patent No. 5,331,215, issued July 19, 1994. \n\n[6] M. Lenzlinger and E.H. Snow, \"Fowler-Nordheim tunneling into thermally grown SiO2\", Journal of Applied Physics, vol. 40, no. 1, pp. 278-283, 1969. \n\n[7] E. Takeda, C. Yang, and A. 
Miura-Hamada, Hot Carrier Effects in MOS Devices, San Diego, CA: Academic Press, 1995. \n\n[8] T. Delbruck, \"Bump circuits for computing similarity and dissimilarity of analog voltages\", CNS Memo 26, California Institute of Technology, 1993. \n\n[9] D. Hsu, M. Figueroa, and C. Diorio, \"A silicon primitive for competitive learning\", UW CSE Technical Report no. 2000-07-01, 2000. \n\n[10] J. Lazzaro, S. Ryckebusch, M.A. Mahowald, and C.A. Mead, \"Winner-take-all networks of O(n) complexity\", in Advances in Neural Information Processing Systems, vol. 1, San Mateo, CA: Morgan Kaufmann, pp. 703-711, 1989. \n", "award": [], "sourceid": 1905, "authors": [{"given_name": "David", "family_name": "Hsu", "institution": null}, {"given_name": "Miguel", "family_name": "Figueroa", "institution": null}, {"given_name": "Chris", "family_name": "Diorio", "institution": null}]}