[2020-11-14] High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs


Published: November 14, 2020

This is an ASPLOS 2020 paper that uses processing-in-memory (PIM) for NN acceleration. I knew little about PIM techniques and not much about SRAM structure, so I picked up some basic knowledge of both from this paper. I also learned a lot about the implementation tools and the benchmark measurement tools.

SRAM, PIM

Memory organization of SRAM: SRAM -> slices -> banks -> sub-banks -> subarrays -> partitions -> cells.
Any PIM strategy that disturbs this tightly coupled subarray organization will either hurt normal data access (the memory property) or increase PIM execution latency.
One of the most common PIM techniques is to assert multiple wordlines where the input operands are located and let the bitlines discharge to obtain the computed output. A Boolean operation takes 1 cycle, but the technique is limited to bitwise operations.
Multiple row activation (MRA) creates a write-bias condition and may cause data corruption.
Interconnect dominates the energy and latency of SRAM.
PIM performance can be determined by the number of operations per PIM cycle (PIM-OPC).
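
To make the multiple-wordline technique concrete, here is a minimal behavioral sketch in Python. It assumes the usual bitline-computing model: with two rows activated, the bitline stays high only where both cells store 1 (AND), and the complementary bitline stays high only where both store 0 (NOR). The function name `bitline_compute` and the `width` parameter are made up for illustration and do not come from the paper; the sketch also ignores the data-corruption risk of MRA.

```python
def bitline_compute(row_a: int, row_b: int, width: int = 8):
    """Model activating two wordlines at once in a subarray with `width` columns.

    Each bit position is one column. Returns (AND, NOR) as sensed on the
    bitline and the complementary bitline (an assumed, idealized model).
    """
    mask = (1 << width) - 1
    # Bitline stays precharged only if both cells in the column hold 1.
    and_out = row_a & row_b & mask
    # Complementary bitline stays precharged only if both cells hold 0.
    nor_out = ~(row_a | row_b) & mask
    return and_out, nor_out


and_out, nor_out = bitline_compute(0b1100, 0b1010, width=4)
# XOR can be derived from the two sensed values: NOT (AND OR NOR).
xor_out = ~(and_out | nor_out) & 0b1111
```

Every column computes in parallel in a single access, which is why a Boolean operation takes one cycle, and also why anything beyond bitwise logic needs extra work.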

Motivation

Most existing accelerators for DNN workloads incorporate quantization and use different bit precisions for different layers. This makes DNNs a good use case for LUT-based approaches. Interconnect dominates the energy and latency of SRAM, so it is desirable to avoid moving data through the interconnect and to compute within a subarray instead. Data parallelism can then be enabled by processing concurrently in the individual subarrays. However, integrating specialized hardware within each subarray would significantly degrade memory density and negatively impact normal memory operation. Hence BFree: a LUT-based compute approach that can be configured for multiple operations.

BFree Architecture

The BFree architecture incorporates compute capabilities at subarray granularity within the conventional cache memory organization. A configuration block (CB) stores metadata. The BFree compute engine (BCE) is a three-stage in-order pipeline:
1. Read instructions from the CB and decode them.
2. Compute the LUT address.
3. Access the LUT at that address; the result is then further accumulated/processed.
The bitlines are decoupled for the LUT rows alone to reduce access cost. To reduce the number of LUT entries, only the entries for 4-bit operands are stored in the subarrays. For higher-precision operands, the BCE decomposes the operands into 4-bit operands and accumulates the partial products. Further, the number of entries can be reduced to 49 by exploiting some fundamental multiplication properties: products are stored only if both operands are odd. Other operations such as division or activation functions can also be implemented with LUT-based approaches.
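
The LUT reduction and the 4-bit decomposition can be sketched in a few lines of Python. This is only my reading of the idea, not the paper's implementation: I assume the 49 entries are the 7 x 7 products of the odd values 3 through 15, that multiplication by 0 or 1 needs no lookup, and that even operands are written as an odd value times a power of two so the shift can be applied after the lookup. The names `LUT`, `split_odd`, `mul4` and `mul8` are mine.

```python
# Assumed table: products of odd 4-bit operands 3, 5, ..., 15 (7 x 7 = 49 entries).
ODD_OPERANDS = range(3, 16, 2)
LUT = {(a, b): a * b for a in ODD_OPERANDS for b in ODD_OPERANDS}


def split_odd(x: int):
    """Return (odd, shift) such that x == odd << shift, for x > 0."""
    shift = 0
    while x % 2 == 0:
        x >>= 1
        shift += 1
    return x, shift


def mul4(a: int, b: int) -> int:
    """Multiply two 4-bit operands using only the 49-entry table plus shifts."""
    if a == 0 or b == 0:
        return 0
    odd_a, shift_a = split_odd(a)
    odd_b, shift_b = split_odd(b)
    if odd_a == 1:
        product = odd_b          # multiplying by 1 needs no lookup
    elif odd_b == 1:
        product = odd_a
    else:
        product = LUT[(odd_a, odd_b)]
    return product << (shift_a + shift_b)


def mul8(a: int, b: int) -> int:
    """Multiply two 8-bit operands by accumulating four 4-bit partial products."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return (
        (mul4(a_hi, b_hi) << 8)
        + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
        + mul4(a_lo, b_lo)
    )
```

For example, `mul4(6, 10)` looks up 3 x 5 = 15 and shifts left by 2 to get 60. An 8-bit multiply costs four such lookups plus shifts and adds, which is the kind of accumulation the third pipeline stage would be responsible for.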

Evaluations

Tools for the BFree design:
SPICE simulation for the SRAM, LUT and BCE design
Synopsys DC and IC Compiler for automatic place and route
Synopsys PrimeTime-PX for power evaluation

Benchmarks:
1) In-cache technique: Neural Cache
2) Eyeriss (I'm not sure how the Eyeriss energy/performance is compared. Did the authors implement the Eyeriss design?)
3) CPU, using the PyTorch profiler, with Intel RAPL tools to measure power
4) GPU, using the TensorFlow profiler, with nvidia-smi to measure GPU power

This PIM solution achieves 1.72x better performance and 3.14x better energy efficiency than the state-of-the-art DNN in-memory accelerators when running Inception v3. The authors' analysis shows speedups of 101x and 3x, and energy-efficiency gains of 91x and 11x, over CPU and GPU respectively.


Yunjie Pan
