Abstract:
Since the earliest computer systems, the memory subsystem has been one of
their essential components. However, the different pace of change between
microprocessors and memory has become one of the greatest challenges that
designers must address in order to develop more powerful computer systems.
This problem, known as the memory gap, is further compounded by the limited
scalability and high energy consumption of conventional memory technologies
(DRAM and SRAM), which has led researchers to consider new non-volatile memory
(NVM) technologies as potential candidates to replace them. Among NVMs,
phase-change memory (PCM) and spin-transfer torque RAM (STT-RAM) are currently
regarded as the best alternatives.
Although PCM and STT-RAM have significant advantages over DRAM and SRAM,
they also suffer from drawbacks that must be mitigated before they can
be employed as memory technologies in the next generation of computers.
Notably, the main constraints are the slow and energy-hungry write operations
of both technologies and the limited endurance of PCM cells, which can no
longer be modified after a relatively small number of writes. This thesis
presents two proposals aimed at efficiently managing write operations on
these kinds of memories.
The first proposal, conceived for a system with a PCM-based main memory, is
intended to reduce the number of writes to main memory by operating at the
cache controller level, through the replacement policy used in the memory
hierarchy level adjacent to main memory (the last-level cache, LLC). As a
starting point, conventional LLC replacement policies (oriented toward
improving system performance) were evaluated in terms of the number of writes
they generate to main memory. Once the algorithm producing the fewest writes
to main memory was identified, several modifications were proposed in order
to find a replacement policy that satisfies the twofold goal of minimizing the
number of writes to PCM main memory (and hence reducing the corresponding
energy consumption) while not penalizing system performance. The proposed algorithms
have been implemented and integrated into the gem5 architectural simulator in
order to simulate the target environment, in which main memory is modeled
according to PCM features and the last-level cache operates with the designed
replacement policies. The behavior of these algorithms is evaluated when
running different kinds of applications: sequential programs, parallel
programs, and multiprogrammed workloads. Experimental results show that, on
average and compared with a conventional LRU algorithm, some of our proposals
extend the memory lifetime by up to 20–45% and reduce the energy consumption
of the memory hierarchy by up to 9%, while hardly degrading performance.
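The metric behind this first proposal, the number of writes that the LLC
replacement policy sends to main memory, can be illustrated with a minimal
Python sketch. It is not the thesis's algorithm: the function name
`count_writebacks`, the single fully associative set, and the `prefer_clean`
variant are all assumptions made here purely for illustration.

```python
from collections import OrderedDict


def count_writebacks(trace, capacity, prefer_clean=False):
    """Simulate one fully associative write-back cache set (illustrative only).

    trace: iterable of (address, is_write) pairs.
    Returns the number of dirty evictions, i.e. writes sent to main memory.
    Dirty blocks still resident at the end of the trace are not counted.
    """
    cache = OrderedDict()  # address -> dirty flag, ordered from LRU to MRU
    writebacks = 0
    for addr, is_write in trace:
        if addr in cache:
            # Hit: keep the dirty flag sticky and promote the block to MRU.
            cache[addr] = cache[addr] or is_write
            cache.move_to_end(addr)
            continue
        if len(cache) >= capacity:
            victim = next(iter(cache))  # plain LRU victim
            if prefer_clean:
                # Hypothetical variant: evict the least recently used
                # clean block if one exists, to avoid a write-back.
                for candidate, dirty in cache.items():
                    if not dirty:
                        victim = candidate
                        break
            if cache.pop(victim):  # pop returns the victim's dirty flag
                writebacks += 1
        cache[addr] = is_write  # new block enters at the MRU position
    return writebacks


trace = [(0, True), (1, False), (2, False), (0, True), (3, False)]
lru_writes = count_writebacks(trace, capacity=2)
clean_first_writes = count_writebacks(trace, capacity=2, prefer_clean=True)
```

On this toy trace the clean-first variant avoids the write-back that plain LRU
incurs, but such a policy can also evict blocks that are about to be reused,
which is why write reduction has to be weighed against performance.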
In the second proposal, conceived for a system with an STT-RAM last-level
cache, a mechanism is presented that predicts unnecessary writes to the LLC,
so that writes identified as useless are filtered out of the LLC and performed
directly in main memory. For this purpose, the reuse locality exhibited by the
stream of references arriving at the LLC was first explored, in contrast to
the temporal locality exhibited by the stream of references arriving at the
cache levels closer to the processor. Once this property had been verified and
evaluated, it was exploited by means of a component able to detect blocks that
exhibit reuse. This reuse detector is in charge of managing the LLC contents:
blocks predicted to be non-dead are inserted into the LLC, while those
predicted to have no reuse bypass it, which reduces the number of writes to
this level and the corresponding energy consumption. For the evaluation of
this approach, the inclusion
of the reuse detector (as well as the modifications required to adapt
the coherence mechanism) was implemented in the gem5 architectural simulator,
where the LLC was also modeled according to STT-RAM features. The proposal
was then evaluated using sequential applications and multiprogrammed
workloads in a multiprocessor environment. Experimental results reveal that
this content management technique, applied to an STT-RAM LLC and compared with
an STT-RAM LLC baseline without a reuse detector, yields energy reductions of
around 40% in the shared LLC of a multiprocessor system, an additional energy
reduction of more than 6% in main memory, and performance improvements of
3–7%, depending on the specific features of the multiprocessor systems
evaluated.
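The bypass idea of the second proposal can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the design
evaluated in the thesis: the `ReuseDetector` class, its tag-only LRU table,
its size, and the rule "a block shows reuse if its address was seen recently"
are all hypothetical, and LLC hits and coherence are not modeled.

```python
from collections import OrderedDict


class ReuseDetector:
    """Hypothetical tag-only table remembering recently seen block addresses."""

    def __init__(self, entries):
        self.entries = entries
        self.tags = OrderedDict()  # ordered from least to most recently seen

    def seen_before(self, addr):
        """Return True if addr is in the table; record it as most recent."""
        hit = addr in self.tags
        if hit:
            self.tags.move_to_end(addr)
        else:
            if len(self.tags) >= self.entries:
                self.tags.popitem(last=False)  # drop the oldest tag
            self.tags[addr] = True
        return hit


def llc_fill_writes(fills, detector=None):
    """Count blocks written into the LLC for a stream of fill requests.

    Without a detector every fill is written into the LLC. With one, a
    block seen for the first time is predicted to have no reuse and
    bypasses the LLC (assumed policy, for illustration only).
    """
    writes = 0
    for addr in fills:
        if detector is None or detector.seen_before(addr):
            writes += 1
    return writes


fills = [1, 2, 3, 1, 4, 1, 2]
baseline_writes = llc_fill_writes(fills)
filtered_writes = llc_fill_writes(fills, ReuseDetector(entries=4))
```

In this toy stream only the fills for blocks already seen by the detector are
written into the LLC, which is the effect the abstract attributes to bypassing
blocks without reuse.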
Description:
Thesis, Doctor of Computer Engineering, Universidad Complutense de Madrid, Spain.