October 9th, 2014
Chakra’s Multi-tiered Pipeline: Historical background
When a function is executed for the first time, Chakra’s parser creates an AST representation of the function’s source. The AST is then converted to bytecode, which is immediately executed by Chakra’s interpreter. While the interpreter is executing the bytecode, it collects data such as type information and invocation counts to create a profile of the functions being executed. This profile data is used to generate highly optimized machine code (a.k.a. JIT’ed code) as a part of the JIT compilation of the function. When Chakra notices that a function or loop-body is being invoked multiple times in the interpreter, it queues up the function in Chakra’s background JIT compiler pipeline to generate optimized JIT’ed code for the function. Once the JIT’ed code is ready, Chakra replaces the function or loop entry points such that subsequent calls to the function or the loop start executing the faster JIT’ed code instead of continuing to execute the bytecode via the interpreter.
Improved Startup Performance: Streamlined execution pipeline
Simple JIT: A new JIT compiling tier
Starting with Windows 10 Technical Preview, Chakra has an additional JIT compilation tier called Simple JIT, which comes into play between executing a function in the interpreter and executing the highly optimized JIT code once that compiled code is ready. As its name implies, Simple JIT avoids generating code with complex optimizations, which depend on profile data collected by the interpreter. In most cases, compiling code with the Simple JIT takes much less time than compiling highly optimized code with the Full JIT compiler. Having a Simple JIT enables Chakra to achieve a faster switchover from bytecode to simple JIT’ed code, which in turn helps Chakra deliver a faster startup for apps and sites. Once the optimized JIT code is generated, Chakra switches execution from the simple JIT’ed code to the fully optimized JIT’ed code. The other inherent advantage of having a Simple JIT tier is that when a bailout happens, function execution can use the faster switchover from the interpreter to Simple JIT until the fully optimized re-JIT’ed code is available.
The Simple JIT compiler is essentially a less optimizing version of Chakra’s Full JIT compiler. Like the Full JIT compiler, the Simple JIT compiler executes on the concurrent background JIT thread, which is now shared between both JIT compilers. One of the key differences between the two JIT execution tiers is that, unlike optimized JIT code, simple JIT’ed code continues to collect profile data while it executes, and the Full JIT compiler uses that data to generate optimized JIT’ed code.
Figure 2 – Chakra’s new Simple JIT tier
Today, the browser and Web applications are used on a multitude of device configurations, be it phones, tablets, 2-in-1s, PCs or Xbox. While some of these device configurations restrict the availability of hardware resources, applications running on top of beefy systems often fail to utilize the full power of the underlying hardware. Since its inception in IE9, Chakra has used one parallel background thread for JIT compilation. Starting with Windows 10 Technical Preview, Chakra is now even more aware of the hardware it is running on. Whenever Chakra determines that it is running on potentially underutilized hardware, it can spawn multiple concurrent background threads for JIT compilation. When more than one concurrent background JIT thread is spawned, Chakra’s JIT compilation payload for both the Simple JIT and the Full JIT is split and queued for compilation across multiple JIT threads. This architectural change to Chakra’s execution pipeline helps reduce the overall JIT compilation latency, in turn making the switchover from the slower interpreted code to a simple or fully optimized version of JIT’ed code substantially faster at times. This change enables the TypeScript compiler to now run up to 30% faster in Chakra.
Figure 3 – Simple and full JIT compilation, along with garbage collection, are performed on multiple background threads, when available
Previewing Equivalent Object Type Specialization
The internal representation of an object’s property layout in Chakra is known as a “Type.” Based on the number of properties and layout of an object, Chakra creates either a Fast Type or a Slower Property Bag Type for each different object layout encountered during script execution. As properties are added to an object, its layout changes and a new type is created to represent the updated object layout. Most objects that have the exact same property layout share the same internal Fast Type.
Figure 4 – Illustration of Chakra’s internal object types
Despite having different property values, objects `o1` and `o2` in the above example share the same type (Type1) because they have the same properties in the same order, while objects `o3` and `o4` have different types (Type2 and Type3 respectively) because their layouts do not exactly match that of `o1` or `o2`.
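As a minimal sketch of the kind of objects this describes, the literals below are hypothetical stand-ins for `o1` through `o4`. Chakra's internal types are not observable from script, so the `layoutOf` helper (also hypothetical) only approximates a layout by the ordered list of property names.

```javascript
// Hypothetical objects consistent with the description of o1..o4.
var o1 = { x: 1, y: 2 };        // properties x, y          -> Type1
var o2 = { x: 10, y: 20 };      // same properties and order -> Type1 again
var o3 = { x: 1, y: 2, z: 3 };  // extra property z          -> a different type
var o4 = { y: 2, x: 1 };        // same names, other order   -> a different type

// Rough proxy for an object's layout: its own property names, in
// insertion order. This is an illustration, not how the engine works.
function layoutOf(obj) {
  return Object.keys(obj).join(",");
}

console.log(layoutOf(o1) === layoutOf(o2)); // same layout despite different values
console.log(layoutOf(o1) === layoutOf(o3)); // layouts differ
console.log(layoutOf(o1) === layoutOf(o4)); // layouts differ (order matters)
```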
To improve the performance of repeat property lookups for an internal Fast Type at a given call site, Chakra creates inline caches for the Fast Type to associate a property name with its slot in the layout. This enables Chakra to directly access the property slot when a known object type repeatedly arrives at a call site. While executing code, if Chakra encounters an object of a different type than what is stored in the inline cache, an inline cache “miss” occurs. When a monomorphic inline cache (one which stores info for only a single type) miss occurs, Chakra needs to find the location of the property by accessing a property dictionary on the new type. This path is slower than getting the location from the inline cache when a match occurs. In IE11, Chakra delivered several type system enhancements, including the ability to create polymorphic inline caches for a given property access. Polymorphic caches can store the type information of more than one Fast Type at a given call site, so that if multiple object types repeatedly arrive at a call site, they continue to perform fast by using the property slot information from the inline cache for that type. The code snippet below is a simplified example that shows polymorphic inline caches in action.
Despite the speedup provided by polymorphic inline caches for multiple types, polymorphic caches are somewhat slower than monomorphic (single-type) caches, as the compiler needs to do a hash lookup for a type match on every access. In Windows 10 Technical Preview, Chakra introduces a new JIT optimization called “Equivalent Object Type Specialization,” which builds on top of “Object Type Specialization” that Chakra has supported since IE10. Object Type Specialization allows the JIT to eliminate redundant type checks against the inline cache when there are multiple property accesses to the same object. Instead of checking the cache for each access, the type is checked only for the first one. If it does not match, a bailout occurs. If it does match, Chakra does not check the type for other accesses as long as it can prove that the type of the object can’t be changed between the accesses. This enables properties to be accessed directly from the slot location that was stored in the profile data for the given type. Equivalent Object Type Specialization extends this concept to multiple types and enables Chakra to directly access property values from the slots, as long as the relative slot location of the properties protected by the given type check matches for all the given types.
The following example shows how this optimization kicks in for the same code sample as above, improving the performance of such coding patterns by over 20%.
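A minimal sketch of such a pattern, assuming two hypothetical constructors (`Point2D` and `Point3D`) whose instances have different layouts: the property accesses inside `sumXY` see both types repeatedly, which is the situation a polymorphic inline cache is meant to handle. Whether and when the engine actually builds the cache is not observable from script.

```javascript
// Two hypothetical constructors that produce objects with different layouts.
function Point2D(x, y) { this.x = x; this.y = y; }
function Point3D(x, y, z) { this.x = x; this.y = y; this.z = z; }

// The accesses p.x and p.y are the sites of interest. With only Point2D
// instances a monomorphic cache would suffice; mixing in Point3D
// instances makes the same sites see two types.
function sumXY(p) {
  return p.x + p.y;
}

var points = [];
for (var i = 0; i < 1000; i++) {
  points.push(i % 2 === 0 ? new Point2D(i, 1) : new Point3D(i, 1, 0));
}

var total = 0;
for (var j = 0; j < points.length; j++) {
  total += sumXY(points[j]); // alternates between the two object types
}
console.log(total);
```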
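As a rough sketch of code that could benefit, assume two hypothetical constructors, `Vec2` and `LabeledVec2`, that both add `x` first and `y` second, so those properties sit at the same relative slots in both layouts. The function `lengthSquared` reads each property more than once from the same object, which is the multi-access shape described above. The comments describe the intended effect, not something script can verify.

```javascript
// Hypothetical constructors: x and y are added first, in the same order,
// so their relative slot positions match across both layouts.
function Vec2(x, y) { this.x = x; this.y = y; }
function LabeledVec2(x, y, label) { this.x = x; this.y = y; this.label = label; }

// Four property reads on the same object with nothing in between that
// could change its type. Per the description, one type check up front
// could cover all of them, and that check could accept either layout
// because x and y live in equivalent slots.
function lengthSquared(v) {
  return v.x * v.x + v.y * v.y;
}

var vectors = [new Vec2(3, 4), new LabeledVec2(6, 8, "big")];
var results = vectors.map(lengthSquared);
console.log(results);
```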
Code Inlining Enhancements
JIT compilers need to strike a balance when inlining. Inlining too much increases memory overhead, in part from pressure on the register allocator as well as on the JIT compiler itself, because a new copy of the inlined function needs to be created in each place it is called. Inlining too little could lead to slower overall performance. Chakra uses several heuristics to make inlining decisions based on data points such as the bytecode size of a function and the location of a function (leaf or non-leaf). For example, the smaller a function, the better chance it has of being inlined.
In Windows 10 Technical Preview, Chakra takes this a step further and now inlines the call/apply target. The simplified example below illustrates this optimization, and in some of the patterns we tested, this optimization improves the performance by over 15%.
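A minimal sketch of the call/apply pattern in question, using hypothetical function names. The small leaf function `area` is reached through `.call` and `.apply` rather than by a direct call; exactly which of these shapes the engine chooses to inline is an assumption here.

```javascript
// Small leaf function: a plausible inlining candidate.
function area(w, h) {
  return w * h;
}

// Invokes the target through Function.prototype.call.
function scaledArea(w, h, factor) {
  return area.call(undefined, w * factor, h * factor);
}

// Forwards its own arguments through Function.prototype.apply,
// a common wrapper pattern.
function forwardToArea() {
  return area.apply(undefined, arguments);
}

console.log(scaledArea(2, 3, 2));
console.log(forwardToArea(4, 5));
```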
Auto-typed Array Optimizations
The code sample below illustrates the hoisting of bounds checks and length loads that is now done by Chakra, to speed up array operations.
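As a sketch of what hoisting means here, the two hypothetical functions below compute the same sum. `sumArray` is the code as a developer would typically write it; `sumArrayHoisted` is a hand-written approximation of the effect of the optimization, with the length loaded once before the loop. The actual transformation happens inside the JIT and also covers the per-element bounds check, which script cannot express directly.

```javascript
// As written by the developer: conceptually, arr.length is loaded and
// each arr[i] access is bounds-checked on every iteration.
function sumArray(arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) {
    total += arr[i];
  }
  return total;
}

// Hand-written approximation of the hoisted form: the length is read
// once, and since i only ranges over 0..len-1, a single up-front check
// can stand in for the per-iteration bounds checks.
function sumArrayHoisted(arr) {
  var total = 0;
  var len = arr.length; // length load hoisted out of the loop
  for (var i = 0; i < len; i++) {
    total += arr[i];
  }
  return total;
}

var data = [1, 2, 3, 4];
console.log(sumArray(data), sumArrayHoisted(data));
```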
Better UI Responsiveness: Garbage Collection Improvements
Chakra has a mark and sweep garbage collector that supports concurrent and partial collections. In Windows 10 Technical Preview, Chakra continues to build upon the GC improvements in IE11 by pushing even more work to the dedicated background GC thread. In IE11, when a full concurrent GC was initiated, Chakra’s background GC would perform an initial marking pass, rescan to find objects that were modified by main thread execution while the background GC thread was marking, and perform a second marking pass to mark objects found during the rescan. Once the second marking pass was complete, the main thread was stopped and a final rescan and marking pass was performed, followed by a sweep performed mostly by the background GC thread to find unreachable objects and add them back to the allocation pool.
Figure 5 – Full concurrent GC life cycle in IE11
In IE11, the final mark pass was performed only on the main thread and could cause delays if there were lots of objects to mark. Those delays contributed to dropped frames or animation stuttering in some cases. In Windows 10 Technical Preview, this final mark pass is now split between the main thread and the dedicated GC thread to reduce main thread execution pauses even further. With this change, the time that Chakra’s GC spends in the final mark phase on the main thread has been reduced by up to 48%.
Figure 6 – Full concurrent GC life cycle in Windows 10 Technical Preview
We are excited to preview the above performance optimizations in Chakra in the Windows 10 Technical Preview. Without making any changes to your app or site code, these optimizations will allow your apps and sites to start up faster by utilizing Chakra’s streamlined execution pipeline that now supports a Simple JIT and multiple background JIT threads, to have increased execution throughput by utilizing Chakra’s more efficient type, inlining and auto-typed array optimizations, and to have improved UI responsiveness due to a more parallelized GC. Given that performance is a never-ending pursuit, we remain committed to refining these optimizations and improving Chakra’s performance further, and we are excited for what’s to come. Stay tuned for more updates and insights as we continue to make progress. In the meantime, if you have any feedback on the above or anything related to Chakra, please drop us a note or simply reach out on Twitter @IEDevChat, or on Connect.
— John-David Dalton, Gaurav Seth, Louis Lafreniere