HiveBrain — Knowledge Dashboard

pattern minor by @import:stackexchange-cs 112d ago

I'm reading Computer Organization and Design MIPS Edition 5th Edition The Hardware/Software Interface on how memory cache works. I came across the following paragraph on page 393; The other key aspect of writes is what occurs on a write miss. We first fetch the words of the block from memory. After the block is fetched and placed into the cache, we can overwrite the word that caused the miss into the cache block. We also write the word to main memory using the full address. I understand what a cache read miss is i.e. when the required word isn't present in the cache, but I'm having trouble understanding what a write miss is(how can we miss to write data?) and hence the subsequent paragraphs.

computer-architecturecpu-cachecsstackoverflow

pattern minor by @import:stackexchange-cs 112d ago

Is PREFETCH an asynchronous operation?

I often hear Prefetching as a technique for speeding up, for example, sequential memory access pattern. The prefetch should occur sufficiently far ahead in time to mitigate the latency of memory access, for example in a loop traversing memory linearly. According to the famous "What Every Programmer Should Know About Memory" paper by Ulrich Drepper, it is written: This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program (emphasis mine) I could not find references on Google or Wikipedia corroborating or proving this. Does anyone know where I can find if this is true? The only reasoning I can think of is rhetorical: it must be asynchronous b/c otherwise prefetching offers no benefit to sequential access... unless the execution time of prefetching a cache line is less than bypassing prefetch and placing the same cache line from RAM directly into cache.

computer-architecturecpu-cachememory-accesscsstackoverflow

pattern minor by @import:stackexchange-cs 112d ago

L1 and Ln cache: when are they written?

I have been following the "High Performance Computer Architecture" course from Georgia Tech (also on YouTube), and unless I've missed something, I cannot see where the following has been explained: If I have a multilevel cache, L1/L2/L3/Ln: 1) what decides what level of this hierarchy a block fetched from memory initially gets put in? 2) If I evict a block from L1, does that mean it gets moved to L2 (replacing a block from L2 depending on the replacement policy), and so forth for L2 to L3, and L3 to Ln, until the block evicted from the last level cache gets written to memory? PS And yes, I have searched for this, so don't downvote assuming I haven't :) Update After the answer given below, I went back to the video course and found this. It seems to suggest that data is initially fetched into the L2 cache, then "fed" to the L1 cache. This also shows how, without the inclusion bit set, L1 and L2 can become "out of sync". 3) Is this generally what happens in all caches? IE if I have a 7 layer cache, will the block be fetched into L7, then fed L7->L6, L6->L5 ... and finally L2->L1? This seems like a lot of work... 4) Except in the situation shown in the linked video (if the caches get "out of sync" then certain data will only be available in one cache even though it started out in both caches), it seems that having the same data in multiple levels of the cache all the time (i.e. when the inclusion bit is set) is a waste; we have to copy it amongst all the levels. This can't be the case though, so what am I missing?

cpu-cachecsstackoverflow

pattern minor by @import:stackexchange-cs 112d ago

Cache effective access time calculation

In order to calculate the effective access time of a memory sub-system, I see some different approaches, a.k.a formulas. All are reasonable, but I don't know how they differ and what is the correct one. Assume a two-level cache and a main memory system with the following specs: ``` h1 = 80% t1 = 10ns L1 cache h2 = 40% t2 = 20ns L2 cache h3 = 100% t3 = 100ns Main memory ``` `t1` means the time to access the L1 while `t2` and `t3` mean the penalty to access L2 and main memory, respectively. I see two formulas as described below: 1- `Teff = t1 + (1-h1)[t2 + (1-h2)t3]` which will be 32. The logic behind that is to access L1, first. So, t1 is always accounted. Then with the miss rate of L1, we access lower levels and that is repeated recursively. I agree with this one! You can see further details here. 2- As discussed here, we can calculate that using `Teff = h1*t1 + (1-h1)*h2*t2 + (1-h1)*(1-h2)*t3` which yields 24. However, that is is reasonable when we say that L1 is accessed sometimes. You can see another example here. Although that can be considered as an architecture, we know that L1 is the first place for searching data. So, the L1 time should be always accounted. Is there any suggestion for that?

cpu-cachememory-accesscsstackoverflowperformance

pattern minor by @import:stackexchange-cs 112d ago

How does cache partitioning prevent covert/side-channel attacks?

In a report on an open-source separation kernel (Muen kernel) I was reading, in the future work section, it says that cache coloring can be implemented to prevent covert/side-channel attacks. It is mentioned that In a second step each subject is associated with a color. All subjects of a given color share the same cache partition. In turn subjects of differing color have no access to identical cache locations, which means the cache cannot be used as a side-channel. I understand what page coloring/cache partitioning is, but I do not understand how having different subjects use different cache partitions can solve side-channel attacks. Can anyone enlighten me on this?

cpu-cachememory-accessmemory-hardwarecachingcs

pattern minor by @import:stackexchange-cs 112d ago

CPU cache - retrieving data from memory

Regarding CPU cache, if the CPU does not find the data it needs in the cache, I understand it then looks for it in the main memory (RAM). (Let's assume we have only one level of cache in order to keep this question as simple as possible.) My question is this: if the data is found in main memory, what happens? Does the processor use this data directly, or is the data loaded into the cache and then read from there?

computer-architecturecpu-cachecsstackoverflowmemory-management

pattern minor by @import:stackexchange-cs 112d ago

Does the aliasing problem show up in a virtually indexed physically tagged cache?

Basically, and as a simple method, we can access cache with Physical Address which is from the TLB. But, as another method, we can access cache with Virtual Address. But, in this case, if the cache is not fully flushed between context switch(other process's datas can be exist in cache), there is an aliasing problem. Same memory can be directed from the different virtual address. But in my text book, including these problem, many can be solved by virtually indexed physically tagged. I think this still can make an aliasing problem. Am I wrong?

computer-architecturecpu-cachecsstackoverflow

pattern minor by @import:stackexchange-cs 112d ago

Write Serialization for Cache Coherence in the presence of Store Buffers

One of the requirements for a coherent memory system is write serialization - "two writes to address X by any two processors are observed in the same order by all processors". I am not sure how this condition would be met when the CPU cores have a store buffer into which stores are retired before updating the memory hierarchy. Suppose Cores A and B both retire a store to address X (which is cached by both A and B), and these stores are now in the store buffers and are yet to update the cache. Now loads from address X on both cores A and B could retire after obtaining the load value from the store buffers ie loads from X on core A see value written by the store to X on core A, similarly, loads from X on core B see value written by the store to X on core B. Say now, the store to X from core A sends a read-upgrade message to core B. What happens next on core B? After coherence transactions have completed for A, the store to X on B updates the local cache. After the stores to X from A and B have completed the final value of X is the store by B. Above, the order for values seen by A for X are write by A followed by the write to B. But this doesn't seem to be the order seen by B. So the question is how is write serialization enforced when store buffers are being used? (This is based on my understand of the various material I have read and would appreciate it if anyone can point to sources that discuss these issues in detail)

computer-architecturecpu-cachecsstackoverflowconcurrency

snippet minor by @import:stackexchange-cs 112d ago

How do stack-based cache algorithms avoid Belady's anomaly?

I was going through page replacement algorithms from Galvin's Operating System book. I encountered this line about LRU: A stack algorithm is one in which the pages kept in memory for a frame set of size N will always be a subset of the pages kept for a frame size of N + 1. And it states that stack based algorithm does not suffer from Belady's anomaly. Can anyone explain how it avoids Belady's anomaly?

cpu-cacheoperating-systemscsstackoverflowmemory-management

pattern moderate by @import:stackexchange-cs 112d ago

Why is quiescent consistency compositional, but sequential consistency is not

I'm having trouble in comparing these two memory consistency models. Essentially for sequential consistency I think of real code like this: ``` int x, y; void ThreadA() { x = 20; //Write int a = y; //Read } void ThreadB() { y = 20; int b = x; } ``` In a sequential consistency environment it's impossible for `a` or `b` to not be 20. That is either `a = 20`, `b = 20` or `a = 20 && b = 20`. But how does quiescent consistency fit in to this example, and why is it compositional?

cpu-cachethreadsshared-memorycsstackoverflow