At HADILAB, we are passionate about developing high-performance, application-specific computing systems. Currently, our research focuses on hardware-level strategies to optimize the efficiency of machine learning platforms and applications. Additionally, we are exploring hardware-efficient machine learning techniques tailored to neurotechnology. Notably, we have been collaborating with neuroscientists to develop digital processing hardware methods for brain-computer interfacing.
Looking ahead, our research will emphasize hardware-efficient domain-specific acceleration, particularly for machine learning and neuroscience applications. This involves leveraging platforms such as FPGAs and ASICs as spatially parallel accelerators. Our goal is to create hardware-efficient, domain-specific architectures that balance programmability, configurability, cost-effectiveness, power efficiency, and performance. Another exciting frontier involves designing efficient hardware architectures for brain data analysis and decoding.
Beyond these primary pursuits, our broader research interests include:
Advancements in hardware-efficient deep learning and its applications,
Innovations in neurotechnology and brain-computer interfaces,
Massively parallel reconfigurable system architectures,
CAD algorithm development for VLSI physical design automation, and
Asynchronous circuits and clock domain crossing management.
Through these endeavors, we aim to push the boundaries of engineering innovation, contribute to multiple fields, and make a meaningful impact on technological advancement.
Fine-Grained Inference Engine (FGIE) for Deep Learning
Modern Deep Neural Networks (DNNs) deliver remarkable performance on a variety of complex tasks, such as recognition, classification, and natural language processing.
To keep pace with ever-increasing workloads, deep learning algorithms have become extremely compute- and memory-intensive, making them infeasible to deploy on compact, embedded platforms with limited power and cost budgets.
Common methods to minimize and accelerate deep learning involve pruning, quantization, and compression of the neural model.
While these techniques achieve dramatic model reduction, in several cases they incur accuracy degradation.
Moreover, methods involving custom hardware still suffer from a large silicon footprint and high power consumption due to massive computation and external memory accesses.
This work employs online Most-Significant-Digit-First (MSDF) digit-serial arithmetic to enable early termination of the computation.
Using online MSDF bit-serial arithmetic for DNN inference (1) enables early termination of ineffectual computations, (2) enables mixed-precision operations, (3) allows higher frequencies without compromising latency, and (4) alleviates the infamous weight-memory bottleneck.
The proposed technique is efficiently implemented on FPGAs due to their concurrent fine-grained nature and the availability of on-chip distributed SRAM blocks.
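To make the early-termination idea concrete, the following is a minimal software sketch, under assumptions of our own choosing rather than the published FGIE datapath: a dot product that consumes activation bits MSB-first and stops once the sign of a ReLU-bound output is resolved, since the remaining low-order bits can no longer flip it.

```python
# Minimal sketch of an MSB-first bit-serial dot product with early
# termination. The ReLU sign-resolution rule and all names are
# illustrative assumptions, not the published FGIE datapath.

def msdf_dot_relu(weights, activations, bits=8):
    """ReLU(dot(weights, activations)) for unsigned `bits`-wide activations,
    consuming activation bits MSB-first; stops once the sign is fixed."""
    partial = 0
    max_pos = sum(w for w in weights if w > 0)   # total positive weight mass
    for b in range(bits - 1, -1, -1):            # most significant bit first
        for w, a in zip(weights, activations):
            if (a >> b) & 1:                     # bit b of this activation
                partial += w << b
        # the remaining bits (b-1 .. 0) can add at most max_pos * (2**b - 1)
        if partial + max_pos * ((1 << b) - 1) < 0:
            return 0                             # sign resolved: terminate early
    return max(partial, 0)

# Example: the large negative product resolves the sign after one bit.
assert msdf_dot_relu([3, -7, 2], [5, 200, 9]) == 0   # full sum is -1367
```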
Compared to other bit-serial methods, our Fine-Grained Inference Engine (FGIE) improves energy efficiency by 1.8× while achieving similar performance gains.
The FGIE architecture has been published in the 2019 International Conference on Field-Programmable Technology (FPT).
[Paper: PDF]
The FGIE architecture. (top) An FGIE tile, processing s synapses in parallel. (bottom) An FGIE layer, computing p neurons with s synapses each.
Switched Multi-Ported RAM
Switched ports, first introduced in my dissertation, are a generalization of true (bidirectional) ports, where a certain number of write ports can be dynamically switched into a different number of read ports using a common read/write control signal.
While a true port is a pair of read/write ports, switched ports are best described as a set.
Furthermore, a given application may have multiple sets, each set with a different read/write control.
While previous work generates multi-ported RAM solutions that contain only true ports or only simple (unidirectional) ports, my research demonstrates that restricting designs to these two models is too limiting and prevents optimizations from being applied.
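As a behavioral illustration, a switched-port set can be modeled in software as a bank of ports whose direction is re-purposed each cycle by a shared read/write control; the interface and names below are illustrative assumptions, not the released RTL.

```python
# Behavioral model of one switched-port set: a shared rw control re-purposes
# n_write write ports into n_read read ports (the counts need not be equal).
# Names and the interface are illustrative assumptions, not the released RTL.

class SwitchedPortSet:
    def __init__(self, depth, n_write, n_read):
        self.mem = [0] * depth
        self.n_write, self.n_read = n_write, n_read

    def cycle(self, rw, addrs, wdata=()):
        """One clock cycle: all ports in the set either write or read."""
        if rw == "write":                        # ports act as n_write writers
            for a, d in zip(addrs[:self.n_write], wdata):
                self.mem[a] = d
            return []
        return [self.mem[a] for a in addrs[:self.n_read]]   # n_read readers

# Example: 2 write ports switch into 3 read ports under one control signal.
ram = SwitchedPortSet(depth=16, n_write=2, n_read=3)
ram.cycle("write", [4, 9], [111, 222])
assert ram.cycle("read", [4, 9, 0]) == [111, 222, 0]
```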
The switched-ports technique has been accepted for publication in ACM Transactions on Reconfigurable Technology and Systems (TRETS), a leading journal in FPGA technology, in an upcoming special issue on reconfigurable components with source.
[Paper: PDF]
[Code: GitHub]
The general problem of switched ports is optimized by solving the corresponding set cover problem via ILP.
This is the first time an optimization model has been used to construct multi-ported RAMs.
Switched ports have a tangible impact on the performance of parallel computation systems (e.g. CGRAs), where the switched ports mechanism can be utilized to dramatically increase parallelism.
A memory compiler that automates the construction of multi-ported RAMs with switched ports was released as an open-source library.
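To convey the flavor of the optimization, here is the textbook weighted set-cover ILP sketched in PuLP; the reduction from switched-port mapping to set cover is the paper's contribution and is not reproduced here, and the instance data below is purely illustrative.

```python
# Textbook weighted set-cover ILP, sketched with PuLP; the reduction of
# switched-port mapping to set cover is in the paper. Instance data here
# is purely illustrative (e.g., cost could be RAM blocks per candidate).
import pulp

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2}, {2, 3, 4}, {4, 5}, {1, 5}]
cost = [2, 3, 2, 1]

prob = pulp.LpProblem("set_cover", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(subsets))]
prob += pulp.lpSum(cost[i] * x[i] for i in range(len(subsets)))
for e in universe:                 # every element must be covered
    prob += pulp.lpSum(x[i] for i, s in enumerate(subsets) if e in s) >= 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))

chosen = [i for i in range(len(subsets)) if x[i].value() > 0.5]
print(chosen)                      # minimum-cost cover, e.g. [1, 3]
```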
Another publication describing this memory compiler and the optimization problem has been published in the 2016 IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM '16), a leading conference in reconfigurable computing.
[Paper: PDF, DOI]
[Talk: PDF, PPT]
[Code: GitHub]
Associative Arrays
Content-addressable memories (CAMs), the hardware implementation of associative arrays, are capable of searching the entire memory space for a specific value within a single clock cycle.
While a standard RAM returns the data stored at a given memory address, a CAM returns the addresses that contain given data.
To do so, it must perform a memory-wide search for the requested value, and multiple addresses may match.
Hence, CAMs are massively parallel search engines accessing all memory content to compare with the searched pattern simultaneously.
CAMs are also heavy power consumers due to their large memory bandwidth requirement and concurrent compare operations.
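Functionally, a BCAM behaves as follows (a behavioral sketch only; real hardware performs all comparisons concurrently in a single cycle):

```python
# Behavioral model of a binary CAM: write is RAM-like (address -> data),
# but search compares the pattern against every stored word and returns
# all matching addresses. Hardware does this in one cycle; the loop below
# only models the functionality.

class BCAM:
    def __init__(self, depth):
        self.mem = [None] * depth

    def write(self, addr, data):
        self.mem[addr] = data

    def search(self, pattern):
        # conceptually concurrent across all addresses
        return [a for a, d in enumerate(self.mem) if d == pattern]

cam = BCAM(8)
cam.write(3, 0xAB)
cam.write(6, 0xAB)
assert cam.search(0xAB) == [3, 6]      # multiple matches are possible
```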
CAMs are used in a variety of scientific fields requiring high-speed associative searches.
CAMs are keystones of network processors, specifically used for IP lookup engines for packet forwarding, intrusion detection, packet filtering and classification.
In addition, CAMs are used for memory management in associative caches and translation lookaside buffers (TLBs), pattern matching, data compression and databases.
Despite their importance, the high implementation cost of CAMs means they are used sparingly.
As a result, FPGA vendors do not provide dedicated CAM circuitry or any special infrastructure to enable the construction of efficient CAMs.
Instead, designers tend to use algorithmic search heuristics, causing dramatic performance degradation.
In my dissertation, I propose two approaches to implement area-efficient binary CAMs (BCAMs).
The first approach fits deep BCAMs with narrow patterns and is capable of utilizing on-chip SRAM blocks as BCAMs with only 20% storage overhead.
This approach is even more area-efficient than custom-made BCAMs, where the footprint of each CAM cell occupies the area of two SRAM cells.
The second approach is suitable for wide patterns and is capable of generating 9 times wider BCAMs compared to other approaches.
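For background, the classic way to emulate a BCAM with plain RAM, the baseline such approaches improve upon, keeps a "transposed" RAM indexed by the pattern whose words are match vectors over CAM addresses; the sketch below (with assumed naming) also shows why its 2^(pattern width) × depth storage restricts naive RAM-based BCAMs to very narrow patterns.

```python
# Classic transposed-RAM BCAM emulation (background baseline, not the
# dissertation's hierarchical-search or II-BCAM methods): a RAM indexed
# by the *pattern* stores a one-hot match vector over CAM addresses, so
# a search is a single RAM read. Storage grows as 2**pattern_bits * depth.

class TransposedBCAM:
    def __init__(self, depth, pattern_bits):
        self.mvec = [0] * (1 << pattern_bits)   # match vector per pattern
        self.stored = [None] * depth            # shadow copy for updates

    def write(self, addr, pattern):
        if self.stored[addr] is not None:
            self.mvec[self.stored[addr]] &= ~(1 << addr)  # clear old entry
        self.mvec[pattern] |= 1 << addr
        self.stored[addr] = pattern

    def search(self, pattern):
        return self.mvec[pattern]               # one-hot matching addresses

cam = TransposedBCAM(depth=8, pattern_bits=4)
cam.write(3, 0xA)
cam.write(6, 0xA)
assert cam.search(0xA) == (1 << 3) | (1 << 6)
```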
Content-Addressable Memory (CAM) abstraction as a massively parallel search engine accessing all memory content to compare with the searched pattern simultaneously
Only a few research papers have been published in this area; hence, my solution is anticipated to form the basis of future research.
Furthermore, FPGA and cell-based ASIC vendors may adopt this approach due to its high efficiency and low cost.
A preliminary version of the proposed hierarchical-search BCAM has been published in the 2014 IEEE International Conference on Field-Programmable Technology (ICFPT '14)
[Paper: PDF, DOI] [Poster: PDF, VSD] [Code: GitHub].
The follow-on II-BCAM approach has been published in the 2015 IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM '15)
[Paper: PDF, DOI] [Talk: PDF, PPT] [Code: GitHub].
Both are leading conferences in FPGA technology. Furthermore, the HDL developed in this project was released as an open-source library.
Correction and Recovery of Timing Errors in Tightly Coupled CGRAs and Processor Arrays
Overclocking a CPU is a common practice among home-built PC enthusiasts where the CPU is operated at a higher frequency than its speed rating.
This practice is unsafe because modern CPUs cannot detect timing errors, which can therefore go practically unnoticed by the end user.
Using a timing speculation technique such as Razor, it is possible to detect timing errors in CPUs.
To date, Razor has been shown to correct only unidirectional, feed-forward processor pipelines.
Our approach safely overclocks 2D arrays by extending Razor correction to cover bidirectional communication in a tightly coupled or lockstep fashion.
To recover from an error, stall wavefronts are produced that propagate across the device.
Multiple errors may arise in close proximity in time and space; if the corresponding stall wavefronts collide, they merge to produce a single unified wavefront, allowing recovery from multiple errors with one stall cycle.
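The merging behavior can be illustrated with a toy simulation of our own (not the published circuit): a breadth-first wavefront launched from every error site at once naturally merges colliding fronts, so each cell stalls exactly once.

```python
# Toy simulation of stall-wavefront recovery in a 2D array (illustrative
# model, not the published circuit). A multi-source BFS from all error
# sites assigns each cell the cycle in which the (merged) wavefront
# stalls it; colliding fronts merge, so every cell stalls exactly once.
from collections import deque

def stall_schedule(errors, rows, cols):
    when = {e: 0 for e in errors}              # error cells stall immediately
    q = deque(errors)
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n = (r + dr, c + dc)
            if 0 <= n[0] < rows and 0 <= n[1] < cols and n not in when:
                when[n] = when[(r, c)] + 1     # stalled one cycle later
                q.append(n)
    return when

# Two nearby errors: their wavefronts merge, and every cell still
# appears exactly once in the schedule (a single stall per cell).
sched = stall_schedule([(0, 0), (2, 2)], rows=4, cols=4)
assert len(sched) == 16 and sched[(1, 1)] == 2
```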
We demonstrate the correctness and viability of our approach by constructing a proof-of-concept prototype running on a conventional Altera FPGA.
Our approach can be applied to custom computing arrays, systolic arrays, CGRAs, and also time-multiplexed FPGAs such as those produced by Tabula.
As a result, these devices can be overclocked and safely tolerate dynamic, data-dependent timing errors.
Alternatively, instead of overclocking, this same technique can be used to 'undervolt' the power supply and save energy.
Our method of correction and recovery of timing errors has been published in the 2013 IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM '13)
[Paper: PDF, DOI] [Talk: PDF, PPT] [Code: GitHub].
Synthesis of Hybrid Mesh/Tree Clock Distribution Networks
Combining nonuniform meshes with unbuffered trees (UBTs) produces a variation-tolerant hybrid clock distribution network.
Clock skew variations are selectively reduced based on circuit timing information generated by static timing analysis (STA).
The skew variation reduction procedure is prioritized for critical timing paths, since these paths are more sensitive to skew variations.
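As a sketch of this prioritization (an assumed mapping, not the paper's exact procedure), each sink's STA slack can be translated into a skew-variation budget, tightening the budget where slack is scarce:

```python
# Illustrative prioritization only (assumed mapping, not the paper's exact
# procedure): sinks with small STA slack get a tight skew-variation budget,
# which would drive a denser local mesh; large-slack sinks get a loose one.

def skew_targets(sink_slack_ps, tight_ps=5.0, loose_ps=25.0):
    lo, hi = min(sink_slack_ps.values()), max(sink_slack_ps.values())
    span = (hi - lo) or 1.0                    # avoid division by zero
    return {sink: tight_ps + (loose_ps - tight_ps) * (s - lo) / span
            for sink, s in sink_slack_ps.items()}

# Sinks on critical paths (10 ps slack) receive the tightest target.
targets = skew_targets({"ff1": 10.0, "ff2": 120.0, "ff3": 400.0})
assert targets["ff1"] == 5.0 and targets["ff3"] == 25.0
```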
A framework for skew variation management is proposed. The algorithm has been implemented with a standard 65 nm cell library using standard EDA tools and tested on several benchmark circuits.
As compared to other nonuniform mesh construction methods that do not support managed skew tolerance, experimental results exhibit a 41% average reduction in metal area and a 43% average reduction in power dissipation.
As compared to other methods that employ skew tolerance management techniques but do not use a hybrid clock topology, an 8% average reduction in metal area and a 9% average reduction in power dissipation are achieved.
The hybrid mesh/tree clock distribution networks synthesis approach has been published in INTEGRATION, the VLSI journal (2013)
[Paper: PDF, DOI] [Code: GitHub].
Example of the proposed hybrid nonuniform mesh/unbuffered tree structure; the skew map is shown in the background. Darker regions indicate a tighter variation target. The circular spots are clock sinks.