Browse by Author "Kadayif, Ismail"
Now showing 1 - 17 of 17
Results Per Page
Sort Options
Item: Boosting Performance of Directory-based Cache Coherence Protocols with Coherence Bypass at Subpage Granularity and A Novel On-chip Page Table (Assoc Computing Machinery, 2016) — Soltaniyeh, Mohammadreza; Kadayif, Ismail; Ozturk, Ozcan
Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approaches in many-core CMPs to keep the data blocks coherent at the last level private caches. However, the area overhead and high associativity requirement of the directory structures may not scale well with an increasingly higher number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core; therefore, it is not necessary to keep track of these in the directory structure. In this study, we have two major contributions. First, we show that compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at subpage level helps to detect considerably more private data blocks. Consequently, it significantly reduces the percentage of blocks required to be tracked in the directory compared to similar page level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory structure to scale gracefully with the increasing number of cores. Memory block classification at subpage level, however, may increase the frequency of the Operating System's (OS) involvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some portion of the performance benefits of subpage level data classification. To overcome this, we propose a distributed on-chip page table as our second contribution.

Item: Capturing and optimizing the interactions between prefetching and cache line turnoff (Elsevier, 2008) — Kadayif, Ismail; Zorlubas, Ayhan; Koyuncu, Selcuk; Kabal, Olcay; Akcicek, Davut; Sahin, Yucel; Kandemir, Mahmut
While numerous prior studies focused on performance and energy optimizations for caches, their interactions have received much less attention. This is unfortunate since in general the performance-oriented techniques influence the energy behavior of the cache, and the energy-oriented techniques usually increase program execution cycles. The overall energy and performance behavior of caches in embedded systems when multiple techniques co-exist remains an open research problem. This paper first studies this interaction and demonstrates how performance and energy optimizations can affect each other. We then propose three optimization schemes that turn off cache lines in a prefetching-sensitive manner. Specifically, these schemes treat prefetched cache lines differently from the lines brought to the cache in a normal way (i.e., through a load operation) when turning off the cache lines. Our experiments with five randomly selected codes from the SPEC2000 suite indicate that the proposed approaches save significant leakage energy. Our results also show that the performance degradations incurred by the proposed approaches are very small. (c) 2008 Elsevier B.V. All rights reserved.

Item: Classifying Data Blocks at Subpage Granularity With an On-Chip Page Table to Improve Coherence in Tiled CMPs (IEEE-Inst Electrical Electronics Engineers Inc, 2018) — Soltaniyeh, Mohammadreza; Kadayif, Ismail; Ozturk, Ozcan
As shown in some prior studies, a significant percentage of data blocks accessed in parallel codes are private, and not keeping track of those blocks can improve the effectiveness of directory structures in chip multiprocessors (CMPs). In this paper, we have two major contributions.
First, we show that compared to the classification of cache blocks at page granularity, data block classification (DBC) at subpage level helps to detect considerably more private data blocks. Based on this idea, we propose two different approaches for enhancing the effectiveness of directory caches in tiled CMPs. In the first approach, called quasi-dynamic subpage level DBC (QDBC), a data block is assumed to be private from the beginning of the program execution and stays private as long as the corresponding subpage is accessed by only one core. Our second approach, called dynamic subpage level DBC, turns a data block into private again after all blocks within the corresponding subpage are evicted from the private cache hierarchy. Memory block classification at subpage level, however, may increase the frequency of operating system involvement in updating the maintenance bits in page table entries. To overcome this, we propose, as a second contribution, a distributed table called the on-chip page table (o-CPT), which stores recently accessed page translations in the system. Our simulation results show that, compared to page level data classification, the QDBC and DBC approaches relying on the o-CPT can detect significantly more private data blocks and considerably improve system performance.

Item: Coherency Traffic Reduction in Manycore Systems (IEEE, 2022) — Derebasoglu, Erdem; Kadayif, Ismail; Ozturk, Ozcan
With the increasing number of cores in manycore accelerators and chip multiprocessors (CMPs), it gets more challenging to provide cache coherency efficiently. Although snooping-based protocols are appropriate solutions for small-scale systems, they are inefficient for large systems because of the limited bandwidth. Therefore, large-scale manycores require directory-based solutions, where a hardware structure called the directory holds the information. This directory keeps track of all memory blocks and which caches store a copy of these blocks.
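The directory bookkeeping and private-block bypass summarized in these entries can be illustrated with a minimal Python sketch; the class, state names, and promotion rule below are illustrative simplifications, not structures from the papers:

```python
# Minimal sketch of a directory that tracks sharers per block but skips
# tracking for blocks classified as private (hypothetical model).
class Directory:
    def __init__(self):
        self.sharers = {}        # block address -> set of sharing core ids
        self.private_owner = {}  # block address -> sole accessing core id

    def access(self, block, core):
        if block not in self.sharers and block not in self.private_owner:
            # First touch: classify as private; no directory entry needed.
            self.private_owner[block] = core
            return "private"
        if block in self.private_owner:
            if self.private_owner[block] == core:
                return "private"  # still accessed by one core only
            # A second core touches the block: promote it to shared
            # and start tracking it in the directory.
            self.sharers[block] = {self.private_owner.pop(block), core}
            return "promoted"
        self.sharers[block].add(core)
        return "shared"

d = Directory()
assert d.access(0x40, 0) == "private"   # only core 0 so far, untracked
assert d.access(0x40, 0) == "private"
assert d.access(0x40, 1) == "promoted"  # now tracked in the directory
assert d.access(0x40, 2) == "shared"
assert len(d.sharers[0x40]) == 3
```

The point of the bypass is visible in the sketch: while a block stays private, the directory holds no entry for it at all, which is what shrinks the required directory capacity.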
The directory sends messages only to caches that store relevant blocks and also coordinates simultaneous accesses to a cache block. As directory-based protocols scale to many cores, performance, network-on-chip (NoC) traffic, and bandwidth become major problems. In this paper, we present software mechanisms to improve the effectiveness of directory-based cache coherency in manycore and multicore systems with shared memory. In multithreaded applications, some of the data accesses, such as those to read-only (private) data, do not disrupt cache coherency, but they still produce coherency messages among cores. However, if data is accessed by at least two cores and at least one of the accesses is a write operation, it is called shared data and requires cache coherency. In our proposed system, private data and shared data are determined at compile time, and the cache coherency protocol applies only to shared data. We implement our approach in two stages. First, we use Andersen's static pointer analysis to analyze the program and mark its private instructions, i.e., instructions that load or store private data. Then, we use these analyses to decide at runtime whether the cache coherency protocol will be applied. Our simulation results on parallel benchmarks show that our approach reduces cycle count, dynamic random access memory (DRAM) accesses, and coherency traffic by up to 13%.

Item: Energy reduction in 3D NoCs through communication optimization (Springer Wien, 2015) — Ozturk, Ozcan; Akturk, Ismail; Kadayif, Ismail; Tosun, Suleyman
Network-on-Chip (NoC) architectures and three-dimensional (3D) integrated circuits have been introduced as attractive options for overcoming the barriers in interconnect scaling while increasing the number of cores. Combining these two approaches is expected to yield better performance and higher scalability. This paper explores the possibility of combining these two techniques in a heterogeneity-aware fashion.
Specifically, on a heterogeneous 3D NoC architecture, we explore how different types of processors can be optimally placed to minimize data access costs. Moreover, we select the optimal set of links with optimal voltage levels. The experimental results indicate significant savings in energy consumption across a wide range of values of our major simulation parameters.

Item: Exploiting potentially dead blocks for improving data cache reliability against soft errors (IEEE, 2007) — Akcicek, Davut; Koyuncu, Selcuk; Sen, Hande; Kadayif, Ismail
Soft errors due to energetic particle strikes are a big concern for systems that must run in a reliable manner. This reliability concern has become more serious with technology scaling and aggressive leakage control mechanisms. Since cache memories consume the largest fraction of on-chip real estate, they are more vulnerable to soft errors, as compared to many other system components. This paper proposes a solution to the problem of designing a reliable data cache without trading reliability for performance and area, which is a typical characteristic of conventional parity and ECC based protection techniques. Although parity is simple and fast, it can detect only odd numbers of errors without correcting any of them. On the other hand, ECC techniques are more complex and time-consuming, and have the capability of correcting some of the errors. Our technique enhances data cache reliability by storing the replica(s) of data items in active use into cache lines which hold data not likely to be reused. The bookkeeping information about replicas is maintained in a small fully associative cache called the shadow cache. Exploiting the replicas to correct soft errors enhances data reliability. Since we keep the replicas in potentially dead blocks, the performance loss is negligible, with only a small extra chip area requirement for the shadow cache.
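The replica-based correction idea in this entry can be sketched in a few lines; the parity check and the dictionaries standing in for cache lines and the shadow cache are illustrative, not the paper's actual hardware structures:

```python
# Sketch: keep a replica of an actively used word (conceptually, in a "dead"
# cache line tracked by a small shadow cache); on a detected parity error,
# recover the word from its replica. All names are illustrative.
def parity(word):
    return bin(word).count("1") % 2

class ReplicatedCache:
    def __init__(self):
        self.lines = {}   # address -> (word, stored parity bit)
        self.shadow = {}  # address -> replica word (kept in dead lines)

    def write(self, addr, word):
        self.lines[addr] = (word, parity(word))
        self.shadow[addr] = word      # replica in a potentially dead line

    def read(self, addr):
        word, p = self.lines[addr]
        if parity(word) != p:         # soft error detected by parity
            word = self.shadow[addr]  # correct it from the replica
            self.lines[addr] = (word, parity(word))
        return word

c = ReplicatedCache()
c.write(0x100, 0b1011)
word, p = c.lines[0x100]
c.lines[0x100] = (word ^ 0b0100, p)   # inject a single-bit flip
assert c.read(0x100) == 0b1011        # the replica restores the value
```

This captures why parity alone suffices for detection in the scheme: correction comes from the replica rather than from a stronger (and slower) code.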
Our experimental results indicate that our technique, compared to two similar previous techniques, is more effective for enhancing L1 data cache reliability in modern superscalar machines, with only negligible degradation in performance.

Item: Hardware/Software Approaches for Reducing the Process Variation Impact on Instruction Fetches (Assoc Computing Machinery, 2013) — Kadayif, Ismail; Turkcan, Mahir; Kiziltepe, Seher; Ozturk, Ozcan
As technology moves towards finer process geometries, it is becoming extremely difficult to control critical physical parameters such as channel length, gate oxide thickness, and dopant ion concentration. Variations in these parameters lead to dramatic variations in access latencies in Static Random Access Memory (SRAM) devices. This means that different lines of the same cache may have different access latencies. A simple solution to this problem is to adopt the worst-case latency paradigm. While this egalitarian cache management is simple, it may introduce significant performance overhead during instruction fetches, when both address translation (instruction Translation Lookaside Buffer (TLB) access) and instruction cache access take place, making this solution infeasible for future high-performance processors. In this study, we first propose some hardware and software enhancements and then, based on those, investigate several techniques to mitigate the effect of process variation on the instruction fetch pipeline stage in modern processors. For address translation, we study an approach that performs the virtual-to-physical page translation once, then stores it in a special register, reusing it as long as the execution remains on the same instruction page.
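The translation-reuse idea just described can be sketched as follows; the 4 KiB page size, the counter, and the class structure are illustrative assumptions, not details from the paper:

```python
# Sketch of reusing one virtual-to-physical translation while execution
# stays on the same instruction page.
PAGE_SHIFT = 12  # assume 4 KiB pages

class Fetcher:
    def __init__(self, page_table):
        self.page_table = page_table
        self.tr = None        # special register holding (vpn, pfn)
        self.tlb_lookups = 0  # counts full translation lookups

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & 0xFFF
        if self.tr is None or self.tr[0] != vpn:
            self.tlb_lookups += 1  # lookup only on a page change
            self.tr = (vpn, self.page_table[vpn])
        return (self.tr[1] << PAGE_SHIFT) | offset

f = Fetcher({0x10: 0x2A, 0x11: 0x2B})
for a in [0x10000, 0x10004, 0x10008, 0x11000, 0x11004]:
    f.translate(a)
assert f.tlb_lookups == 2  # one lookup per page visited, not per fetch
```

Because instruction streams exhibit strong spatial locality within a page, most fetches in this model hit the special register and never touch the TLB at all.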
To handle varying access latencies across different instruction cache lines, we annotate the cache access latency of instructions within themselves to give the circuitry a hint about how long to wait for the next instruction to become available.

Item: Modeling and improving data cache reliability (2007) — Kadayif, Ismail; Kandemir, Mahmut
Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially for those running in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. Therefore, it is very important to employ reliability enhancing mechanisms in processor/memory designs to protect them against soft errors. To do so, we first need to model soft errors, and then study cost/reliability tradeoffs among various reliability enhancing techniques based on the model so that system requirements could be met. Since cache memories take the largest fraction of on-chip real estate today and their share is expected to continue to grow in future designs, they are more vulnerable to soft errors, as compared to many other components of a computing system. In this paper, we first focus on a soft error model for L1 data caches, and then explore different reliability enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability with which a fault in the cache can be visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the existence of soft errors. Our first scheme prevents an error from propagating to the lower levels in the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is to be replaced.
The second scheme selectively invalidates cache blocks to reduce their vulnerable periods, decreasing their chances of catching any soft errors. Based on the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, by using our first scheme, it is possible to improve the AVFC metric by 32% without any performance loss. On the other hand, the second scheme enhances the AVFC metric between 60% and 97%, at the cost of a performance degradation which varies from 0% to 21.3%, depending on how aggressively the cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme which tries to bring a fresh copy of the invalidated block into the cache via prefetching. Our experimental results indicate that this scheme can reduce the performance overheads to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement the second scheme achieves. © Copyright 2007 ACM.

Item: Modeling and Improving Data Cache Reliability (Assoc Computing Machinery, 2007) — Kadayif, Ismail; Kandemir, Mahmut
Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially for those running in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. Therefore, it is very important to employ reliability enhancing mechanisms in processor/memory designs to protect them against soft errors. To do so, we first need to model soft errors, and then study cost/reliability tradeoffs among various reliability enhancing techniques based on the model so that system requirements could be met.
Since cache memories take the largest fraction of on-chip real estate today and their share is expected to continue to grow in future designs, they are more vulnerable to soft errors, as compared to many other components of a computing system. In this paper, we first focus on a soft error model for L1 data caches, and then explore different reliability enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability with which a fault in the cache can be visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the existence of soft errors. Our first scheme prevents an error from propagating to the lower levels in the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is to be replaced. The second scheme selectively invalidates cache blocks to reduce their vulnerable periods, decreasing their chances of catching any soft errors. Based on the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, by using our first scheme, it is possible to improve the AVFC metric by 32% without any performance loss. On the other hand, the second scheme enhances the AVFC metric between 60% and 97%, at the cost of a performance degradation which varies from 0% to 21.3%, depending on how aggressively the cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme which tries to bring a fresh copy of the invalidated block into the cache via prefetching.
Our experimental results indicate that this scheme can reduce the performance overheads to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement the second scheme achieves.

Item: Modeling soft errors for data caches and alleviating their effects on data reliability (Elsevier, 2010) — Kadayif, Ismail; Sen, Hande; Koyuncu, Selcuk
Soft errors caused by strikes arising from energetic particles pose a significant reliability concern for computing systems. In this study, we first introduce a model for soft error occurrence and propagation in cache memories. Based on this model, we define a metric called Architectural Vulnerability Factor for Caches (AVFC), which represents the probability with which a fault in the cache can be visible in the final output of the program. We then propose three architectural schemes for improving reliability. Our first scheme prevents an error from propagating to the lower levels in the memory hierarchy by not forwarding the unmodified data words of dirty cache blocks to the L2 cache at write-backs. The second scheme selectively invalidates cache blocks to reduce their vulnerable periods. To reduce the performance overhead caused by block invalidation, our third scheme tries to bring a fresh copy of the invalidated block into the cache via prefetching. The experimental results for the SPEC2000 suite show that, based on the proposed model, our first and third schemes together can improve data reliability by roughly 96% at the cost of less than 1% overhead in execution time, considerably more than the improvements achieved by two well-known techniques, namely the write-through and early write-back cache mechanisms. (C) 2010 Elsevier B.V.
All rights reserved.

Item: Prefetching-aware cache line turnoff for saving leakage energy (IEEE, 2006) — Kadayif, Ismail; Kandemir, Mahmut; Li, Feihui
While numerous prior studies focused on performance and energy optimizations for caches, their interactions have received much less attention. This paper studies this interaction and demonstrates how performance and energy optimizations can affect each other. More importantly, we propose three optimization schemes that turn off cache lines in a prefetching-sensitive manner. These schemes treat prefetched cache lines differently from the lines brought to the cache in a normal way (i.e., through a load operation) when turning off the cache lines. Our experiments with applications from the SPEC2000 suite indicate that the proposed approaches save significant leakage energy with very small degradation in performance.

Item: Reducing data TLB power via compiler-directed address generation (IEEE-Inst Electrical Electronics Engineers Inc, 2007) — Kadayif, Ismail; Nath, Partho; Kandemir, Mahmut; Sivasubramaniam, Anand
Address translation using the translation lookaside buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach to optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers (TRs) and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the TRs provided by the hardware and has to maximize the reuse of such translations to be effective. The authors propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses.
Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with the mechanisms, the approach can, in fact, provide performance benefits in certain cache-addressing strategies.

Item: Reducing Performance Impact of Process Variation For Data Caches (IEEE, 2013) — Kadayif, Ismail; Tuncer, Kadir
As process technologies move toward finer geometries, it is becoming extremely difficult to keep critical physical device parameters within desired bounds, including channel length, gate oxide thickness, and dopant ion concentration. Variations in these parameters can lead to dramatic variations in access latencies in Static Random Access Memory (SRAM) devices: different lines of the same cache may have different access latencies. A simple solution to this problem is to adopt the worst-case latency paradigm. While this egalitarian cache management is simple, it may introduce significant performance overhead for data cache accesses. To overcome varying access latencies across different data cache lines, we employ a small table storing the access latencies of cache lines. This table is accessed during data cache access to give the hardware a hint about how long to wait for data to become available.

Item: Reducing performance impact of process variation for data caches (IEEE Computer Society, 2013) — Kadayif, Ismail; Tuncer, Kadir
As process technologies move toward finer geometries, it is becoming extremely difficult to keep critical physical device parameters within desired bounds, including channel length, gate oxide thickness, and dopant ion concentration. Variations in these parameters can lead to dramatic variations in access latencies in Static Random Access Memory (SRAM) devices: different lines of the same cache may have different access latencies.
A simple solution to this problem is to adopt the worst-case latency paradigm. While this egalitarian cache management is simple, it may introduce significant performance overhead for data cache accesses. To overcome varying access latencies across different data cache lines, we employ a small table storing the access latencies of cache lines. This table is accessed during data cache access to give the hardware a hint about how long to wait for data to become available. © 2013 The Chamber of Turkish Electrical Engineers-Bursa.

Item: Reliable Address Translation for Instructions (IEEE, 2015) — Kadayif, Ismail; Ugurlu, Bora
As a result of technology scaling, spatial multi-bit soft errors have become a major concern for SRAM-based storage structures, such as caches, buffers, and register files, in the design of reliable computer systems. Conventional techniques, such as bit interleaving or stronger coding, cannot provide designers with effective solutions to the problem of reliable address generation in instruction translation lookaside buffers (iTLB) because of high power and/or latency overheads. In this study, we aim to generate reliable address translations for instructions without compromising either performance or power consumption. To do so, we propose to use a pair of identical registers storing the last address translation, which are referred to as context frame registers (CFR). As long as the control flow of programs stays in the same page, address translations are supplied by these two registers instead of the iTLB. Since the two CFRs keep the same address translation, spatial multi-bit errors are detected by comparing their contents.
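The paired-register comparison just described can be sketched as follows, including the fallback to the iTLB on a mismatch; the class, counters, and error injection are illustrative, not the paper's hardware:

```python
# Sketch of the paired context-frame-register (CFR) idea: two copies of the
# last translation are compared on every use; a mismatch or a page change
# falls back to the (strongly coded) iTLB.
class CFRTranslator:
    def __init__(self, itlb):
        self.itlb = itlb            # vpn -> pfn, assumed ECC-protected
        self.cfr = [None, None]     # two identical registers
        self.itlb_accesses = 0

    def lookup(self, vpn):
        self.itlb_accesses += 1
        pfn = self.itlb[vpn]
        self.cfr = [(vpn, pfn), (vpn, pfn)]  # refill both copies
        return pfn

    def translate(self, vpn):
        a, b = self.cfr
        if a is not None and a == b and a[0] == vpn:
            return a[1]             # both copies agree: trust them
        return self.lookup(vpn)     # miss or mismatch: use the iTLB

t = CFRTranslator({7: 70})
assert t.translate(7) == 70
assert t.translate(7) == 70         # served from the CFR pair
assert t.itlb_accesses == 1
t.cfr[0] = (7, 99)                  # inject an upset into one copy
assert t.translate(7) == 70         # mismatch detected; iTLB consulted
assert t.itlb_accesses == 2
```

Comparing two full copies detects any error pattern that does not corrupt both registers identically, which is why the scheme targets spatial multi-bit upsets without per-access ECC cost.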
If their contents do not match, we obtain the address translation from the iTLB as usual, which uses strong coding for error detection and correction.

Item: Reliable address translation for instructions (Institute of Electrical and Electronics Engineers Inc., 2016) — Kadayif, Ismail; Ugurlu, Bora
As a result of technology scaling, spatial multi-bit soft errors have become a major concern for SRAM-based storage structures, such as caches, buffers, and register files, in the design of reliable computer systems. Conventional techniques, such as bit interleaving or stronger coding, cannot provide designers with effective solutions to the problem of reliable address generation in instruction translation lookaside buffers (iTLB) because of high power and/or latency overheads. In this study, we aim to generate reliable address translations for instructions without compromising either performance or power consumption. To do so, we propose to use a pair of identical registers storing the last address translation, which are referred to as context frame registers (CFR). As long as the control flow of programs stays in the same page, address translations are supplied by these two registers instead of the iTLB. Since the two CFRs keep the same address translation, spatial multi-bit errors are detected by comparing their contents. If their contents do not match, we obtain the address translation from the iTLB as usual, which uses strong coding for error detection and correction. © 2015 Chamber of Electrical Engineers of Turkey.

Item: Studying interactions between prefetching and cache line turnoff (Institute of Electrical and Electronics Engineers Inc., 2005) — Kadayif, Ismail; Kandemir, Mahmut; Chen, Guilin
While many prior studies focused on performance and energy optimizations for caches, their interactions have received much less attention.
This is unfortunate since in general the performance-oriented techniques influence the energy behavior of the cache, and the energy-oriented techniques usually increase program execution cycles. The overall energy and performance behavior of caches in embedded systems when multiple techniques co-exist remains an open research problem. This paper studies this interaction and illustrates how performance and energy optimizations affect each other. We also point out several potential optimizations that could be based on this study. © 2005 IEEE.
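The prefetching-sensitive line turnoff recurring in these entries can be sketched with a simple decay model; the decay thresholds and class structure are illustrative assumptions, not parameters from the papers:

```python
# Sketch of prefetching-aware line turnoff: every idle line decays toward
# being turned off to save leakage, but a prefetched line that has never
# been referenced decays faster. Thresholds are illustrative.
NORMAL_DECAY, PREFETCH_DECAY = 4, 2

class Line:
    def __init__(self, prefetched):
        self.prefetched = prefetched
        self.referenced = not prefetched  # a demand load is a reference
        self.idle = 0
        self.on = True

    def tick(self):
        self.idle += 1
        limit = NORMAL_DECAY if self.referenced else PREFETCH_DECAY
        if self.idle >= limit:
            self.on = False  # turn the line off to save leakage energy

demand = Line(prefetched=False)
prefetch = Line(prefetched=True)
for _ in range(3):
    demand.tick()
    prefetch.tick()
assert demand.on          # still within its normal decay window
assert not prefetch.on    # the unused prefetch was turned off earlier
```

Treating never-referenced prefetched lines as early turnoff candidates is one way to reconcile the two optimizations: the prefetcher's speculative fills stop inflating the leakage that the turnoff mechanism is trying to eliminate.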