Hi, I’m Sotiris Apostolakis, a fifth-year Ph.D. student at Princeton University. In this video, I’ll present the ASPLOS 2020 paper titled Perspective: A Sensible Approach to Speculative Automatic Parallelization. This is joint work with Ziyang and Greg from Princeton, Professor Simone Campanoni from Northwestern, and my advisor David August. Let me start by motivating automatic parallelization. Since their inception, multicore processors have been grossly underutilized. Even until recently, Google reported around 30% average core utilization in its data centers, while Twitter had less than 20%. This under-utilization has driven up costs, hurt the environment, and even stunted the growth of multicore technology itself. A major reason for this under-utilization is the difficulty of extracting parallelism fine-grained enough for multicore. Despite many new tools, languages, and libraries for parallelization, programmers are mostly limited to coarse-grained forms of parallelism, such as job or data parallelism. This coarse-grained parallelism, though, is ill-suited for multicore, as it tends to stress shared resources, such as caches and memory bandwidth. Automatic parallelization, on the other hand, has the potential to enable efficient use of multicore systems without undue programmer effort. With automatic parallelization, the programmer writes sequential code, and then the compiler produces an executable that runs efficiently on multiple cores. The next question is why we focused on speculative automatic parallelization. For a long time, parallelizing compilers relied exclusively on properties statically proven by memory analysis and, as a result, had limited applicability. Memory analysis is fundamentally limited in theory, since for any fixed analysis algorithm there is always a counter-example input for which the algorithm is imprecise. Memory analysis is also insufficiently precise in practice, especially for languages like C and C++.
But even if analysis were precise enough, it would still have to conservatively respect all possible inputs, even in cases where real dependences rarely manifest and thus could be ignored. Such cases include, for example, error conditions for malformed inputs. Following the success of speculation in extracting ILP, speculation for TLP, thread-level parallelism, gained traction and overcame the applicability limitations of memory analysis by enabling optimization of the expected case. I will now describe the state-of-the-art approach to speculative automatic parallelization, and then I will discuss its core inefficiencies to motivate our work. Next, I will present our approach, proceed with the evaluation results, and finally conclude my talk. Parallelizing compilers start with the sequential code and use static analysis to analyze the dependences among program operations. The hardest part of static analysis is memory analysis, which analyzes memory accesses. Then, parallelizing compilers apply a sequence of enabling transforms to make the program more amenable to parallelization. Examples include speculative privatization, reduction, memory speculation, and control speculation. After the application of enablers, the program is parallelized, if possible. Many parallelization patterns exist, but the most desirable and simplest is DOALL, the independent execution of loop iterations.
Other patterns like HELIX or PS-DSWP are less efficient but more applicable. Our work targets DOALL, and thus I will only talk about DOALL parallelism in this talk. Overall, automatic parallelization is all about handling dependences. Analysis disproves the existence of dependences, enablers break dependences but at a cost, and parallelization techniques try to tolerate unremovable dependences. All these advancements in compiler research have considerably increased the applicability of automatic parallelization. However, speculation-based enablers have also created profitability problems. Breaking dependences with speculation has introduced costs that often negate the benefits of parallelization. Our work focuses on mitigating these costs and enabling efficient speculative parallelization. We use Privateer as the state-of-the-art baseline system for this work. Privateer, published in PLDI 2012, is a speculative DOALL parallelization system with the highest applicability. We explored its overheads and identified two main inefficiencies. The first one is excessive use of memory speculation. Memory speculation asserts the absence of dependences not manifested during profiling, and it is very expensive to validate due to the communication and bookkeeping costs incurred for every single speculative dependence. This is by far the most common inefficiency among all prior speculative systems. The second inefficiency is expensive speculative privatization, which involves monitoring large write sets to correctly merge the private memory states of parallel workers. Like Privateer, other speculative systems with privatization support suffer from this overhead. Next, I will present an example that illustrates these two inefficiencies. This example is taken from the dijkstra benchmark from MiBench, a benchmark used in the evaluation of Privateer, our baseline system. Here, we have part of a simple for loop.
We have a heavily biased branch that in practice is always taken, even though we cannot prove it statically. We also write and read a memory object through pointer ptr, which is not modified within the loop. Now the question is how a state-of-the-art system would parallelize this loop. As I said before, it’s all about handling dependences. Therefore, we focus on the program dependence graph, the PDG. We start with a conservative view of the PDG for these two instructions, with anti-dependences omitted. To better understand these dependences, observe here two dynamic instances of these instructions for iterations k and j. You can see the two cross-iteration memory dependences and the intra-iteration one. The goal here is to remove all the cross-iteration dependences, namely, the dependences that prevent DOALL parallelism, the independent execution of loop iterations. Now we follow the state-of-the-art approach of trying one thing after another to simplify this PDG. First, static analysis tries to disprove these dependences, but to no avail. Then, we start applying enablers. Control speculation has no effect, since it is only applicable to speculatively dead code.
Yet, memory speculation asserts that there is no cross-iteration dataflow from instruction i1 to i2, simply because this flow was never observed during profiling. Finally, the output dependence can be removed with the application of speculative privatization, which creates private copies of the underlying memory object for each worker. In the end, we managed to remove all cross-iteration dependences, and the loop can now be parallelized with DOALL. But in this process, expensive-to-validate memory speculation was used. Now let’s see why memory speculation validation is so expensive. The problem is that every worker needs to record and communicate its access pattern so that overlaps can be detected. The monitoring operations are depicted here in grey. Let’s go through a simple timeline of this example to demonstrate how validation works. We have two worker threads that execute the loop and a validator thread that checks their access patterns. Worker 1 executes the first iteration, monitors the two memory accesses, and notifies the validator thread that it wrote to the memory location pointed to by ptr. Worker 2 does the same for iteration 2. Then in iteration 3, let’s say that the branch happens not to be taken. In this case, worker 1 reads a value that was written in a previous iteration, which violates the speculative assertion that this cross-iteration dataflow cannot occur, and thus causes a misspeculation. But even without misspeculation, which in fact never happens for this example, workers still perform a lot of additional work, significantly decreasing the profitability of parallelization. Now let’s assume the branch condition in our example is statically proven true, and let’s parallelize this loop again. As before, we start with the conservative PDG and try to simplify it. This time, static analysis is able to remove the cross-iteration dataflow from i1 to i2, since i1 executes on every iteration and thus kills the data flow from the previous iteration.
Then again, speculative privatization resolves the output dependence. Speculation was used to infer the underlying memory object that was privatized. Here, we enable parallelization without using memory speculation, but speculative privatization still turns out to be expensive for prior systems like Privateer. Workers still have to use write monitoring to know in which iteration every byte was last written, and report that so the live-out state is computed correctly. Here, we avoid the read monitoring, but we still have write monitoring. Going through the timeline again, worker 1 executes the first iteration and also records the written memory locations. Then every worker does the same for every iteration, and at the end, every worker communicates the private copy of the privatized object, along with the timestamps. The master thread, which continues execution after the parallel invocation, keeps the latest one as the live-out state, in this case the object from worker 1. Thus, even with just privatization, the overheads can still get pretty high. Just to quantify the two overheads I just described: for the dijkstra benchmark, Privateer had to monitor more than 1 terabyte of reads and writes to validate memory speculation and perform privatization, for an input graph of 3,000 nodes. These numbers are huge, since a lot of monitoring occurs in an innermost loop. In our work, we dramatically reduce these numbers.
I’ll proceed with our approach, the Perspective approach. Overall, the goal of our work is to maintain the applicability of prior speculative automatic parallelization systems without the unnecessary overheads of prior work. Before I describe our approach, let’s go back to the original example and see intuitively how we can parallelize the loop more efficiently. Memory speculation, used by prior work in this example, requires heavy monitoring because it merely asserts the absence of dependences not observed during profiling, without any understanding of why they are not observed; there is no reasoning involved. In this example, the actual reason the cross-iteration data flow from i1 to i2 is not observed is that the branch is always taken. However, the speculative assertion that the branch is always taken is known only to control speculation, which can only reason about speculatively dead code, and no other compiler component becomes aware of this speculative control-flow information. So this information ends up being wasted. The solution is to make memory analysis aware of this speculative assertion. Since control speculation will validate it, memory analysis can treat it as a fact, ignoring the possibility of misspeculation. By doing so, memory analysis can view i1 as executing on every iteration, and thus it can infer that i1 kills the cross-iteration data flow. As for the cross-iteration output dependence, we can avoid write monitoring by having a privatization variant that understands that the privatizable memory object here is overwritten on every iteration, and thus the live-out state is the last iteration’s state. For such a variant to be applicable in this example, though, awareness of the biased branch is also necessary. In the end, by fully leveraging an inexpensive-to-validate speculative assertion, we can efficiently break all cross-iteration dependences without the monitoring of prior work.
To validate the speculative assertion, we only need to check for control misspeculation, which adds practically zero validation overhead. Using these observations, we redesign the compiler to automatically find these opportunities. The high-level goals of our approach are to increase awareness within the compiler of what every component is capable of, to enable collaboration that maximizes the effect of speculative assertions, and to avoid any unnecessary transformations, especially expensive ones. The Perspective approach involves three main contributions. First, unlike prior work that applies a fixed sequence of enabling transforms, we propose a more sensible approach, where we plan first before applying any transformation. The key insight here is that the effect of parallelization-enabling transforms is predictable and easily expressible on the PDG. In the Perspective approach, we start with static analysis that produces an initial version of the PDG. Then, each enabling transformation expresses on the PDG which dependences it can handle, along with an estimated cost. Next, the transform selector examines all the options and picks the cheapest set of enabling transforms that covers all the dependences that need to be removed for DOALL parallelism to be applicable, namely all the cross-iteration dependences. Finally, only the selected transformations are applied. The second contribution is a speculation-aware memory analysis. In prior systems, profile-based speculative assertions are used separately from memory analysis.
Instead, in this work, memory analysis is made aware of speculative assertions, effectively computing their full effect. This often results in new, cheap alternatives for breaking dependences. Additionally, in this design, enablers such as privatization, instead of operating in isolation, also become aware of these speculative assertions. This creates an opportunity for new, efficient enabling transforms that leverage all this exposed information. In fact, Perspective introduces new efficient variants of speculative privatization that avoid the write monitoring of prior work. For instance, I briefly presented in the earlier example a privatization variant that leveraged a control speculation assertion to show that the privatizable memory object is overwritten on every iteration. Another example is a variant that leverages value-prediction assertions to predict the live-out state instead of monitoring it. To recap our motivating example: for the dijkstra benchmark, Perspective, as opposed to the state-of-the-art, avoids the use of memory speculation and expensive privatization, and thus almost eliminates, as you can see, Privateer’s memory access monitoring. The effect in terms of performance is a 4.8x speedup over Privateer for this benchmark. Note that dijkstra was in fact the motivating benchmark for the Privateer paper, which makes this dramatic speedup even more impressive. The Perspective framework is implemented on the LLVM compiler infrastructure and includes a set of profilers, a parallelizing compiler, and a lightweight process-based runtime system. The framework consists of around 80,000 lines of C and C++ code. The compiler workflow is depicted on the right. For an explanation of this workflow and for implementation details, please refer to the paper. Let’s now see some evaluation results. Perspective is evaluated on a commodity shared-memory machine with 28 cores.
The claim that is empirically evaluated is that we maintain the applicability of prior automatic DOALL systems while improving their efficiency. To avoid cherry-picking and to validate our claim, the evaluated benchmarks cover all the parallelizable benchmarks from two state-of-the-art automatic parallelization papers. The benchmarks are from the SPEC CPU, PARSEC, PolyBench, and MiBench suites. This graph shows fully automatic whole-program speedup over sequential execution across various numbers of cores for 12 C/C++ benchmarks on a 28-core machine. As you can see, Perspective achieves scalable performance, in most cases linear speedup, thanks to the elimination of the overheads we discussed. alvinn exhibits a smaller speedup, mostly because the parallelized hot loop covers a smaller part of the whole program compared to other benchmarks. In this graph, we compare Perspective against Privateer, the prior state-of-the-art, and we show speedup over sequential execution on 28 cores. The main takeaway is that Perspective doubles the geomean performance of Privateer. The reason behind this big performance improvement is the dramatic reduction of monitored reads and writes for speculation validation and privatization, as seen in this table. For more evaluation results, including a breakdown of the benefit of each contribution and the effect of misspeculation, please refer to the paper. To conclude, this work represents an important advance in fulfilling the promise of automatic parallelization. We identified and mitigated core inefficiencies of prior speculative parallelization systems with the introduction of a speculation-aware memory analysis, new efficient enabling transforms, and a new compiler design that enables careful planning. The end result is scalable speedups and a doubling of the performance of a state-of-the-art system. Finally, this paper has gone through the artifact evaluation process and was awarded all the available badges.
The artifact is publicly available and can be used to reproduce the main evaluation results and corroborate the claims of this paper. Thank you.