PHPFixing
Showing posts with label optimization. Show all posts

Sunday, November 27, 2022

[FIXED] how to pass correctly a module in node

 November 27, 2022     architecture, javascript, module, node.js, optimization     No comments   

Issue

app.js --------------

const mongoose = require("mongoose");
const utils = require("./config/utils");
const app = express();

utils.initializeDb(mongoose);
app.listen(3000);

config/utils.js

module.exports = (mongoose) => {
  initializeDb: async () => {
    await mongoose.connect(
      process.env.MONGO_URI,
      {
        useNewUrlParser: true,
      },
      () => {
        console.log("Mongoose connection successfuly started");
      }
    );
  };
};

The error in the terminal/console is: TypeError: utils.initializeDb is not a function.

I'm digging into optimization and encapsulation to clean up my code by passing modules through functions, etc. I tried this, but it gives me the error above. I would like to know what is wrong in the code, and also some tips on how to optimize it. Thank you :)


Solution

The problem is that config/utils.js exports an arrow function whose body, { initializeDb: ... }, is parsed as a block statement containing a label, not as an object literal. So require("./config/utils") gives you a function with no initializeDb property. Export a plain object instead (and note that app.js was also missing the express import):

app.js

    const express = require("express");
    const mongoose = require("mongoose");
    const { initializeDb } = require("./config/utils");
    const app = express();

    initializeDb(mongoose);
    app.listen(3000);

config/utils.js

module.exports = {
  initializeDb: async (mongoose) => {
    await mongoose.connect(
      process.env.MONGO_URI,
      {
        useNewUrlParser: true,
      },
      () => {
        console.log("Mongoose connection successfully started");
      }
    );
  },
};


Answered By - Critical Carpet
Answer Checked By - Willingham (PHPFixing Volunteer)

Tuesday, November 22, 2022

[FIXED] What is better: multiple "if" statements or one "if" with multiple conditions?

 November 22, 2022     if-statement, java, multiple-conditions, optimization     No comments   

Issue

For my work I have to develop a small Java application that parses very large XML files (~300k lines) to select very specific data (using Pattern), so I'm trying to optimize it a little. I was wondering what was better between these 2 snippets:

if (boolean_condition && matcher.find(string)) {
    ...
}

OR

if (boolean_condition) {
    if (matcher.find(string)) {
        ...
    }
}

Other details:

  • These if statements are executed on each iteration inside a loop (~20k iterations)
  • The boolean_condition is a boolean calculated on each iteration using an external function
  • If the boolean is set to false, I don't need to test the regular expression for matches

Thanks for your help.


Solution

One golden rule I follow is to avoid nesting as much as I can. But if avoiding it makes a single if condition too complex, I don't mind nesting.

Besides, you're using the short-circuiting && operator. So if the boolean is false, the matcher won't even be invoked!

So,

if (boolean_condition && matcher.find(string)) {
    ...
}

is the way to go!
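The same short-circuit behavior can be sketched in Python, whose `and` likewise skips the right-hand operand when the left is false (illustrative stand-in names, not the asker's code):

```python
calls = []

def expensive_match(s):
    """Stand-in for matcher.find(): records each invocation."""
    calls.append(s)
    return "x" in s

flag = False
if flag and expensive_match("abc"):     # right-hand side is skipped entirely
    pass
assert calls == []                      # matcher never invoked

flag = True
result = "no match"
if flag and expensive_match("xyz"):     # now the matcher runs, exactly once
    result = "matched"
assert calls == ["xyz"]
```

The single-condition form and the nested form compile to the same check-then-check logic; the difference is purely readability.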



Answered By - adarshr
Answer Checked By - Katrina (PHPFixing Volunteer)

Sunday, November 20, 2022

[FIXED] How to reduce this 5 preg_replaces lines in 1 line?

 November 20, 2022     optimization, php, preg-replace     No comments   

Issue

I believe minimal is better, so I'm wondering how I could reduce/optimize these 5 lines into 1:

#JPG
$post[message] = preg_replace('/<a href="(.+?)\.jpg" target="_blank">(.+?)<\/a>/', '<img src="$1.jpg">', $post[message]);

#JPEG
$post[message] = preg_replace('/<a href="(.+?)\.jpeg" target="_blank">(.+?)<\/a>/', '<img src="$1.jpeg">', $post[message]);

#GIF
$post[message] = preg_replace('/<a href="(.+?)\.gif" target="_blank">(.+?)<\/a>/', '<img src="$1.gif">', $post[message]);

#PNG
$post[message] = preg_replace('/<a href="(.+?)\.png" target="_blank">(.+?)<\/a>/', '<img src="$1.png">', $post[message]);

#BMP
$post[message] = preg_replace('/<a href="(.+?)\.bmp" target="_blank">(.+?)<\/a>/', '<img src="$1.bmp">', $post[message]);

Solution

Use an alternation:

$post[message] = preg_replace('/<a href="(.+?\.(?:jpe?g|gif|png|bmp))" target="_blank">.+?<\/a>/',
                              '<img src="$1">', $post[message]);

Note that I have removed the second capture group, which was the anchor text, as your regex replacement was not even using it.
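For illustration only, here is the same single-pass alternation idea sketched with Python's re module (the sample input is hypothetical):

```python
import re

# One pattern with an alternation group replaces the five separate calls.
pattern = r'<a href="(.+?\.(?:jpe?g|gif|png|bmp))" target="_blank">.+?</a>'

html = ('<a href="pic.jpg" target="_blank">pic</a> and '
        '<a href="anim.gif" target="_blank">anim</a>')
result = re.sub(pattern, r'<img src="\1">', html)
# result == '<img src="pic.jpg"> and <img src="anim.gif">'
```

The `(?:...)` group keeps the extension inside capture group 1 without adding a second group, mirroring the PHP pattern above.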



Answered By - Tim Biegeleisen
Answer Checked By - Pedro (PHPFixing Volunteer)

Saturday, November 12, 2022

[FIXED] What should be stored in cache for web app?

 November 12, 2022     azure-caching, caching, memcached, optimization, web-applications     No comments   

Issue

I realize that this might be a vague question that begets a vague answer, but I'm in need of some real-world examples, thoughts, and/or best practices for caching data for a web app. All of the examples I've read are more technical in nature (how to add or remove cache data from the respective cache store), but I've not been able to find a higher-level strategy for caching.

For example, my web app has an inbox/mail feature for each user. What I've been doing to date is storing typical session data in the cache. In this example, when the user logs in I go to the database and retrieve the user's mail messages and store them in cache. I'm beginning to wonder if I should just maintain a copy of all users' messages in the cache, all the time, and just retrieve them from cache when needed, instead of loading from the database upon login. I have a bunch of other data that's loaded on login (product catalogs and related entities) and login is starting to slow down.

So I guess my question to the community, is what would you do/recommend as an approach in this scenario?

Thanks.


Solution

This might be better suited to https://softwareengineering.stackexchange.com/, but generally you want to cache:

  • Metadata/configuration data that does not change frequently. E.g. country/state lists, external resource addresses, logic/branching settings, product/price/tax definitions, etc.
  • Data that is costly to retrieve or generate and that does not need to frequently change. E.g. historical data sets for reports.
  • Data that is unique to the current user's session.

The last item above is where you need to be careful, as you can drastically increase your app's memory usage by adding a few megabytes of data for every active session. It also implies different levels of caching -- application-wide, per user session, etc.

Generally you should NOT cache data that is under active change.

In larger systems you also need to think about where the cache(s) will sit. Is it possible to have one central cache server, or is it good enough for each server/process to handle its own caching?

Also: you should have some method to quickly reset/invalidate the cached data. For a smaller or less mission-critical app, this could be as simple as restarting the web server. For the large system that I work on, we use a 12 hour absolute expiration window for most cached data, but we have a way of forcing immediate expiration if we need it.
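As a minimal sketch of that last point (absolute expiration plus forced invalidation), assuming a single-process app and hypothetical names:

```python
import time

class SimpleCache:
    """Tiny in-process cache with absolute expiration and forced invalidation."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                  # still fresh: serve from cache
        value = loader()                     # miss or expired: hit the backing store
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key=None):
        """Force immediate expiration of one key, or of the whole cache."""
        if key is None:
            self._store.clear()
        else:
            self._store.pop(key, None)
```

A real deployment would use memcached or a similar shared store instead, but the expiration-window-plus-kill-switch shape is the same.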

This is a really broad question, and the answer depends heavily on the specific application/system you are building. I don't know enough about your specific scenario to say if you should cache all the users' messages, but instinctively it seems like a bad idea since you would seem to be effectively caching your entire data set. This could lead to problems if new messages come in or get deleted. Would you then update them in the cache? Would that not simply duplicate the backing store?

Caching is only a performance optimization technique, and as with any optimization, measure first before making substantial changes, to avoid wasting time optimizing the wrong thing. Maybe you don't need much caching, and it would only complicate your app. Maybe the data you are thinking of caching can be retrieved in a faster way, or less of it can be retrieved at once.



Answered By - Jordan Rieger
Answer Checked By - Cary Denson (PHPFixing Admin)

Monday, October 31, 2022

[FIXED] How to speed up obstacle checking algorithm?

 October 31, 2022     algorithm, optimization, performance, python     No comments   

Issue

I've been given the following problem: "You want to build an algorithm that allows you to build blocks along a number line, and also to check if a given range is block-free. Specifically, we must allow two types of operations:

  1. [1, x] builds a block at position x.
  2. [2, x, size] checks whether it's possible to build a block of size size that begins at position x (inclusive). Returns 1 if possible else 0.

Given a stream of operations of types 1 and 2, return a list of outputs for each type 2 operations."

I tried keeping the blocks in a set so membership lookup is O(1): for a type 2 operation, I loop over range(x, x + size) and check whether any of those points are in the set. This runs too slowly, and I'm looking for faster alternative approaches. I also tried iterating over the entire set of blocks instead, whenever the size specified in the type 2 call is greater than len(blocks), but this also times out. Can anyone think of a faster way to do this? I've been stuck on this for a while.


Solution

Store the blocks in a red-black tree (or any self-balancing tree). For each query, find the smallest element in the tree greater than or equal to x, and return 1 if it is greater than or equal to x + size (i.e. it lies outside [x, x + size - 1]), else 0. This is O((n + m) log n) where n is the number of blocks and m is the number of queries.

If you use a simple binary search tree (rather than a self-balancing one), a large test case with blocks at (1, 2, 3, ..., n) will cause your search tree to be very deep and queries will run in linear (rather than logarithmic) time.
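A minimal Python sketch of this approach. The standard library has no balanced BST, so a sorted list with bisect stands in: queries are O(log n), though list inserts are O(n), so a real implementation would use an actual tree:

```python
import bisect

def process_ops(ops):
    blocks = []  # sorted positions of built blocks (stand-in for a balanced BST)
    out = []
    for op in ops:
        if op[0] == 1:                           # [1, x]: build a block at x
            x = op[1]
            i = bisect.bisect_left(blocks, x)
            if i == len(blocks) or blocks[i] != x:
                blocks.insert(i, x)              # O(n) here; O(log n) in a real tree
        else:                                    # [2, x, size]: is [x, x+size-1] free?
            _, x, size = op
            i = bisect.bisect_left(blocks, x)    # smallest block position >= x
            out.append(1 if i == len(blocks) or blocks[i] >= x + size else 0)
    return out

# e.g. process_ops([[2, 1, 5], [1, 3], [2, 1, 5], [2, 4, 2]]) -> [1, 0, 1]
```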



Answered By - Paul Hankin
Answer Checked By - Pedro (PHPFixing Volunteer)

[FIXED] How to speed up an algorithm to detect lights that cover a point?

 October 31, 2022     algorithm, big-o, optimization, performance, python     No comments   

Issue

"Assume you have a road (or numberline) with lights over head that light up a section of the numberline. Lights are represented by a 2D array where the light at index i covers from arr[i][0] to arr[i][1] (inclusive).

You are also given a list of points on this line, and for each point, you want to compute how many lights illuminate that point. I want to return a list where list[i] = the number of lights that illuminate point i."

Sample input/output: lights = [[1, 7], [5, 11], [7, 9]], points = [7, 1, 5] should return [3, 1, 2]
I wrote a solution where, for each point in points, you loop over the lights array and count how many have lights[i][0] <= point <= lights[i][1]. This solution is too slow, being O(n * k) where n = number of lights and k = number of points. I also tried sorting the lamps and breaking when lights[i][0] > point, but it still times out. Can anyone think of a faster algorithm to compute this array? I'm stumped on how else to approach this.


Solution

This problem is better solved like this:

  • Get all the changes from the lights array and create an array that has twice its size. It will hold tuples of (point, light-change). The light-change component is +1 or -1, depending on whether that point represents the beginning or the end of the light's span. For an end, the stored point should be one more than the point found in the lights array, to indicate that the end point itself is still reached by the light; we only discount it when moving right to the next point.

  • Sort this result by point. There might be duplicates.

  • Then accumulate from left to right the light-change values, so adding up all those -1 and +1 from left to right, and storing the intermediate results again with the point where the light-change was found. The final accumulated value will of course be 0.

  • Optionally make that array unique by point, keeping only the rightmost of any duplicate points, as those carry the correct accumulated value. (I did not do that in the implementation below.)

  • Finally use binary search on that array to translate each given point to the number of lights that shine on that point.

Implementation:

from bisect import bisect

def solve(lights, points):
    pairs = sorted((x + i, -i or 1) for changes in lights for i, x in enumerate(changes))
    offsets, changes = zip(*pairs)
    total = 0
    cumul = [0] + [total := total + change for change in changes]
    return [cumul[bisect(offsets, x)] for x in points]

lights = [[1, 7], [5, 11], [7, 9]]
points = [7, 1, 5]
print(solve(lights, points))  # [3, 1, 2]

And here is the version that removes duplicates. It is only of benefit if there really are many duplicate offsets in the cumul list:

def solve(lights, points):
    pairs = sorted((x + i, -i or 1) for changes in lights for i, x in enumerate(changes))
    total = 0
    cumulpairs = [(x, total := total + change) for x, change in pairs]
    # remove duplicates - assuming Python 3.7+ with ordered dicts
    offsets, cumul = zip(*dict(cumulpairs).items())
    cumul = [0] + list(cumul)
    return [cumul[bisect(offsets, x)] for x in points]


Answered By - trincot
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)

[FIXED] how to hide html elements while waiting for all components to be loaded?

 October 31, 2022     html, javascript, lighthouse, optimization, performance     No comments   

Issue

Chromium Lighthouse tells me:

Performance 89 (orange)

Cumulative Layout Shift 0.388 (red alert!)

Avoid large layout shifts

html :

<html>
   ...
   <body class="loading">
   ...some static elements
   </body>
   ...
</html>

css :

body.loading * { display : none; }

js when DOM is loaded :

document.addEventListener('DOMContentLoaded', (e) => {

    //do many dynamic stuff with the DOM depending on user data (&  register events listeners)

  })

& finally when everything is in place in the dom, through window.onload (or if that event comes too soon, a callback once everything is in place) :

document.body.classList.remove('loading')

If I remove the loading class from the html lighthouse says great perf 100.

Before seeing this Lighthouse analysis, I thought my layout simply went from a loading (CSS) animation to a completely ready page, with no DOM changes visible to the user during the load. So now I assume I am wrong, doing something wrong, and missing something.

How should I do this instead, so loading stays optimal without risking that elements that are not ready are ever displayed?


Solution

It turns out that Lighthouse now doesn't flag for CLS anymore, although I didn't make any changes there (on the contrary, I've added some code in HTML, CSS and JS that should have made the page slower).

So the answer is (at least until proven otherwise):

Hiding elements while the page is loading and JavaScript is not ready doesn't have a negative impact on performance.

Here is minimal code to show an ultra-light spinner while the page is not ready yet:

const MyOnLoad = function () {
    // artificial 2.5 s delay, only so the spinner stays visible in this demo
    setTimeout(function () {
        document.body.classList.remove('loading');
    }, 2500);
};

window.onload = MyOnLoad;
.loading {
  margin: 34vmin auto;
  border: 0; 
  border-radius: 50%;
  border-top: 0.89vmin solid red;
  width: 21vmin;
  height: 21vmin;
  animation: spinner 2s linear infinite;

 }

 @keyframes spinner {
  0%   { transform: rotate(0deg); }
  100% { transform: rotate(360deg); }
 }

 .loading * {display: none;}
<html>
 <body class="loading">
    <main>show only when page is ready</main>
 </body>
</html>



Answered By - mikakun
Answer Checked By - Senaida (PHPFixing Volunteer)

[FIXED] Why are loops always compiled into "do...while" style (tail jump)?

 October 31, 2022     assembly, loops, micro-optimization, optimization, performance     No comments   

Issue

When trying to understand assembly (with compiler optimization on), I see this behavior:

A very basic loop like this

outside_loop;
while (condition) {
     statements;
}

Is often compiled into (pseudocode)

    ; outside_loop
    jmp loop_condition    ; unconditional
loop_start:
    loop_statements
loop_condition:
    condition_check
    jmp_if_true loop_start
    ; outside_loop

However, if the optimization is not turned on, it compiles to normally understandable code:

loop_condition:
    condition_check
    jmp_if_false loop_end
    loop_statements
    jmp loop_condition  ; unconditional
loop_end:

According to my understanding, the compiled code is better resembled by this:

goto condition;
do {
    statements;
    condition:
}
while (condition_check);

I can't see a huge performance boost or code readability boost, so why is this often the case? Is there a name for this loop style, for example "trailing condition check"?


Solution

Related: asm loop basics: While, Do While, For loops in Assembly Language (emu8086)

Terminology: Wikipedia says "loop inversion" is the name for turning a while(x) into if(x) do{}while(x), putting the condition at the bottom of the loop where it belongs.
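As a language-neutral illustration of that transformation (Python here, emulating the do-while with a bottom-of-loop break; the loop body is hypothetical):

```python
def process(items):
    """while-style loop: the condition is tested at the top of every iteration."""
    i = 0
    while i < len(items):
        items[i] *= 2
        i += 1

def process_rotated(items):
    """Rotated form (if + do-while): one test up front to skip a zero-trip
    loop, then the only conditional branch sits at the bottom of the body."""
    i = 0
    if i < len(items):              # skip the loop entirely for empty input
        while True:
            items[i] *= 2
            i += 1
            if i >= len(items):     # bottom-of-loop condition check
                break
```

Both functions compute the same result; the rotated form is what compilers emit because the loop body ends with a single conditional backwards branch.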


Fewer instructions / uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea.

Sometimes this requires "loop rotation" (peeling part of the first iteration so the actual loop body has the conditional branch at the bottom). So you do some of the first iteration and maybe skip the loop entirely, then fall into the loop. Sometimes you also need some code after the loop to finish the last iteration.

Sometimes loop rotation is extra useful if the last iteration is a special case, e.g. a store you need to skip. This lets you implement a while(1) {... ; if(x)break; ...; } loop as a do-while, or put one of the conditions of a multiple-condition loop at the bottom.

Some of these optimizations are related to or enable software pipelining, e.g. loading something for the next iteration. (OoO exec on x86 makes SW pipelining not very important these days but it's still useful for in-order cores like many ARM. And unrolling with multiple accumulators is still very valuable for hiding loop-carried FP latency in a reduction loop like a dot product or sum of an array.)

do{}while() is the canonical / idiomatic structure for loops in asm on all architectures, get used to it. IDK if there's a name for it; I would say such a loop has a "do while structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P Loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.

This pattern is so widely used that on CPUs that use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, unknown backwards branches are predicted taken (because they're probably loop branches). See Static branch prediction on newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.

This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar / out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.


If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).

BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact, compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...


You can help the compiler make more compact asm (fewer instructions outside the loop) by writing do{}while() loops in C when you know the caller isn't allowed to pass size=0 or whatever else guarantees a loop runs at least once.

(Actually 0 or negative for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you choose a narrower type than pointers; check your compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop every time you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, because signed overflow is UB but unsigned isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, though. But if you want to avoid the x86-64 REX prefixes in the loop overhead (for a trivial saving in code size), convincing the compiler not to waste any instructions zero- or sign-extending can be tricky.


I can't see a huge performance boost

Here's an example where that optimization will give speedup of 2x on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.

Required background knowledge for this static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide, it's excellent. (Occasionally outdated in places, though.) See also other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and some explanation of fused vs. unfused domain uops.

You could also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis on these loops.

; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer,  esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.

; NASM syntax
ALIGN 16          ; not required for max performance for tiny loops on most CPUs
.looptop:                 ; while (edi<end_pointer) {
    cmp     edi, esi    ; 32-bit code so this can macro-fuse on Core2
    jae    .done            ; 1 uop, port5 only  (macro-fused with cmp)
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015
    jmp    .looptop         ; 1 uop, p5 only

                            ; Sandybridge/Ivybridge ports each uop can use
.done:                    ; }

This is 4 total fused-domain uops (with macro-fusion of the cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops and Intel pre-Haswell only has 3 ALU ports.

More importantly, port5 pressure is the bottleneck: This loop can execute at only one iteration per 2 cycles because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput somewhat below that.

Writing the loop idiomatically for asm, we get:

ALIGN 16
.looptop:                 ; do {
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015

    cmp     edi, esi        ; 1 uop, macro-fuses with jb
    jb    .looptop          ; } while(edi < end_pointer);

Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).

Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).

This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.

But the important part is that port5 pressure is vastly reduced: only cmp/jb needs it. The other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few % instead of a factor of 2. See How are x86 uops scheduled, exactly?.

Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2 has a loop buffer after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except on Skylake with microcode updates; Intel had to disable the loop buffer because of a partial-register merging bug.)

However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).


If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16 / jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).

    ; index relative to the end of the array, with an index counting up towards zero
    add   rdi, rsi          ; edi = end_pointer
    xor   eax, eax
    sub   eax, esi          ; eax = -length, so [rdi+rax] = first element

 .looptop:                  ; do {
    paddd   xmm0, [rdi + rax]
    add     eax, 16
    jl    .looptop          ; } while(idx+=16 < 0);  // or JNC still works

(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)

On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op / uop.

On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)

On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)

Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.


But all of these loops had no unrolling

Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.

So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)


Benefits with very low iteration count:

Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.

  • For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.

  • For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).

  • For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...

  • For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.

With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.

The latter two loop structures have more jumping around for small trip counts.


Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch / prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.

Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)

Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.

There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.


Loops with multiple exit conditions:

Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.

So you'll often see a structure like

do {
    if () break;

    blah blah;
} while(condition);

Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 5 / cmp eax, 20 / ja .outside_range, unsigned compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions) but sometimes you can't and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
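The unsigned-compare range trick can be modeled outside asm too. Here's a hedged Python sketch of what the sub/cmp/ja sequence computes (masking to 32 bits stands in for unsigned wraparound; this variant checks lo <= x < hi, and the function name is mine):

```python
def in_range(x, lo=5, hi=25):
    # One subtract folds both bounds into a single unsigned compare:
    # (x - lo) wraps around to a huge unsigned value when x < lo, so a
    # single "< (hi - lo)" test rejects both sides of the range at once.
    return ((x - lo) & 0xFFFFFFFF) < (hi - lo)
```

This is the same one-subtract idea the asm sequence uses: the compiler emits one `sub` and one unsigned compare instead of two signed compares and two branches.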


Further reading:

  • Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?

  • Modern Microprocessors A 90-Minute Guide!. A detailed look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.

  • Agner Fog's x86 optimization guide and microarch pdf. This will take you from being able to write (or understand) correct x86 asm to being able to write efficient asm (or see what the compiler should have done).

  • other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).

  • Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators): how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).

  • Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.

  • https://en.wikipedia.org/wiki/Loop_optimization

Sort of off topic:

  • Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons where single-threaded bandwidth is worse than on a quad-core with dual channel memory controllers.

  • How much of ‘What Every Programmer Should Know About Memory’ is still valid? (my answer has commentary on what's changed and what's still relevant in Ulrich Drepper's well-known excellent article.)



Answered By - Peter Cordes
Answer Checked By - Gilberto Lyons (PHPFixing Admin)
Read More
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg

[FIXED] Which operator is faster (> or >=), (< or <=)?

 October 31, 2022     assembly, c, operators, optimization, performance     No comments   

Issue

Is < cheaper (faster) than <=, and similarly, is > cheaper (faster) than >=?

Disclaimer: I know I could measure but that will be on my machine only and I am not sure if the answer could be "implementation specific" or something like that.


Solution

it varies. First start by examining different instruction sets and how the compilers use them. Take the OpenRISC 32 for example, which is clearly MIPS-inspired but does conditionals differently. For the or32 there are compare-and-set-flag instructions: compare these two registers and if less than or equal unsigned then set the flag, compare these two registers and if equal set the flag. Then there are two conditional branch instructions: branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less than, less than or equal, greater than, etc are all going to use the same number of instructions, the same execution time for a conditional branch and the same execution time for not doing the conditional branch.

Now it is definitely going to be true for most architectures that performing the branch takes longer than not performing the branch because of having to flush and re-fill the pipe. Some do branch prediction, etc to help with that problem.

Now on some architectures the size of the instruction may vary: compare gpr0 and gpr1 vs compare gpr0 and the immediate number 1234 may require a larger instruction; you will see this a lot with x86 for example. So although both cases may be a branch-if-less-than, how you encode the "less" depending on what registers happen to hold what values can make a performance difference (sure, x86 does a lot of pipelining, lots of caching, etc to make up for these issues). Another similar example is MIPS and or32, where r0 is always a zero; it is not really a general purpose register, since if you write to it it doesn't change, it is hardwired to a zero. So a compare-if-equal-to-0 MIGHT cost you more than a compare-if-equal to some other number if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen; worst case is having to evict a register to the stack or memory to free up the register to hold the immediate so that the compare can happen.

Some architectures have conditional execution like arm, for the full arm (not thumb) instructions you can on a per instruction basis execute, so if you had code

if(i==7) j=5; else j=9;

the pseudo code for arm would be

cmp i,#7
moveq j,#5
movne j,#9

there is no actual branch, so no pipeline issues you flywheel right on through, very fast.

Comparing one architecture to another, if that is interesting: some, as mentioned (MIPS, or32), have you perform a specific instruction for the comparison; on others, like x86, msp430 and the vast majority, each ALU operation changes the flags; arm and the like change flags only if you tell them to, and otherwise don't, as shown above. So a

while(--len)
{
  //do something
}

loop, the subtract of 1 also sets the flags; if the stuff in the loop was simple enough you could make the whole thing conditional, so you save on separate compare and branch instructions and you save the pipeline penalty. MIPS solves this a little: compare and branch are one instruction, and it executes one instruction after the branch to save a little in the pipe.

The general answer is that you will not see a difference; the number of instructions, execution time, etc are the same for the various conditionals. Special cases like small immediates vs big immediates, etc may have an effect in corner cases, or the compiler may simply choose to do it all differently depending on what comparison you do. If you try to re-write your algorithm to have it give the same answer but use a less than instead of a greater than and equal, you could be changing the code enough to get a different instruction stream. Likewise if you perform too simple a performance test, the compiler can/will optimize out the comparison completely and just generate the results, which could vary depending on your test code, causing different execution. The key to all of this is to disassemble the things you want to compare and see how the instructions differ. That will tell you if you should expect to see any execution differences.



Answered By - old_timer
Answer Checked By - Katrina (PHPFixing Volunteer)

[FIXED] How do I profile a Python script?

 October 31, 2022     optimization, performance, profiling, python, time-complexity     No comments   

Issue

Project Euler and other coding contests often have a maximum time to run or people boast of how fast their particular solution runs. With Python, sometimes the approaches are somewhat kludgey - i.e., adding timing code to __main__.

What is a good way to profile how long a Python program takes to run?


Solution

Python includes a profiler called cProfile. It not only gives the total running time, but also times each function separately, and tells you how many times each function was called, making it easy to determine where you should make optimizations.

You can call it from within your code, or from the interpreter, like this:

import cProfile
cProfile.run('foo()')

Even more usefully, you can invoke the cProfile when running a script:

python -m cProfile myscript.py

To make it even easier, I made a little batch file called 'profile.bat':

python -m cProfile %1

So all I have to do is run:

profile euler048.py

And I get this:

1007 function calls in 0.061 CPU seconds

Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.061    0.061 <string>:1(<module>)
 1000    0.051    0.000    0.051    0.000 euler048.py:2(<lambda>)
    1    0.005    0.005    0.061    0.061 euler048.py:2(<module>)
    1    0.000    0.000    0.061    0.061 {execfile}
    1    0.002    0.002    0.053    0.053 {map}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    1    0.000    0.000    0.000    0.000 {range}
    1    0.003    0.003    0.003    0.003 {sum}
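The same stats can also be captured programmatically with the stdlib pstats module, which is handy for sorting by cumulative time instead of standard name. A minimal sketch (the work() function is just a stand-in):

```python
import cProfile
import io
import pstats

def work():
    # stand-in workload to profile
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# Sort by cumulative time and print the top 5 entries
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Sorting by "cumulative" surfaces the call chains that dominate total runtime, which is usually what you want when deciding where to optimize first.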

EDIT: Updated link to a good video resource from PyCon 2013 titled Python Profiling
Also via YouTube.



Answered By - Chris Lawlor
Answer Checked By - David Goodson (PHPFixing Volunteer)

Thursday, October 20, 2022

[FIXED] How to improve VueJS dev performance (RAM-wise)?

 October 20, 2022     javascript, node.js, optimization, vue.js, vuejs2     No comments   

Issue

I am currently working on a Vue.js project with a pretty large codebase. It is written in Vue 2. What I have noticed is that when I run it in development mode using vue-cli-service serve, it uses >2GB of RAM. I have tried a few configs trying to improve this, such as:

configureWebpack: {
  optimization: {
    removeAvailableModules: false,
    removeEmptyChunks: false,
    splitChunks: false,
    runtimeChunk: true,
  },
  output: {
    pathinfo: false,
  },
},

but it didn't change anything; it still uses >2GB of RAM. I have also disabled devtools (Vue.config.devtools = false), but still no luck.

I need to find a way to reduce the RAM usage in development mode.


Solution

A lot of things can impact your local dev experience. Listing all of them would take a huge amount of time, but you can get a glimpse of what could be optimized here.
That answer is specific to Nuxt but gives some generic leads too.

TLDR being that it can depend on your project (importing huge libraries, loading 3rd party scripts, etc...), the tooling around it (code editors, linters etc...), and your system's specs (SSD, CPU etc...).
It's too broad of a topic to say what exactly is taking you more RAM (and not really worth the effort).

As mentioned in the comments, moving to Vite may help. Get some hardware with 16 GB of RAM, a decent CPU and an SSD and you will be mostly okay if not running crazy unoptimized Docker funky setups.



Answered By - kissu
Answer Checked By - Candace Johnson (PHPFixing Volunteer)

Sunday, October 9, 2022

[FIXED] How could I optimize this following program in R to boost performance? (Monte Carlo simulation involving compute-intensive permutation test)

 October 09, 2022     for-loop, optimization, performance, r, statistics     No comments   

Issue

Hi, I'm trying to optimize the following program for a Monte Carlo power study. I first use an algorithm taken from Efron and Tibshirani (1993) to find the p-value of a permutation test (upper-tailed) for equality of means. I wrote a function called perm_test(), of which the output is a single p-value. Then, I call this function in another program, power_p(), that simulates 1000 permutation tests (returning 1000 p-values). My power estimate is the proportion of these 1000 p-values that are statistically significant, i.e., < 0.05. The entire process takes about 8 minutes to run (2020 MacBook Pro). I'm wondering if anyone has any suggestions for optimizing this process to make it run quicker. Many thanks.

perm_test <- function(delta) {
  
  # Permutation test as described in Efron and Tibshirani (1993), (Algorithm 15.1, p208)
  
  # Draw random samples from a normal distribution
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  # observed diff in means, denoted as D_obs 
  D_obs <- mean(x) - mean(y)
  
  # Create a data frame "N" with  n_x + n_y obs (20 rows in our case)
  N <- data.frame("v" = c(x, y))
  # create a group variable "g" indicating which group each observation belongs to
  N$g <- as.factor(c(rep("x", 10), rep("y", 10)))
  # arrange column "v" in ascending order 
  # notice that column "g" is also re-ordered
  N <- arrange(N, v)
  
  ###############################################################################################
  # There are 20 choose 10 (184756) possibilities for the ordering of "g"                       #       
  # corresponding to all possible ways of partitioning 20 elements into two subsets of size 10  #
  # we take only a random sample of 5000 from those 184756 possibilities                        #
  ###############################################################################################
  
  # Initialize variables
  B <- 5000
  x_mean <- 0
  y_mean <- 0
  D_perm <- rep(0, 5000)
  
  # Loop to randomly generate 5000 different orderings of "g"
  for (i in 1:B) {
    
    # Shuffle the ordering of "g"
    N$g <- sample(N$g)
    # Permuted means of x and y
    x_mean <- tapply(N$v, N$g, mean)[1]
    y_mean <- tapply(N$v, N$g, mean)[2]
    # Find permuted diff in means, denoted as D_perm
    D_perm[i] <- x_mean - y_mean 
    }
  
  # Find p-value 
  P_perm <- sum(D_perm >= D_obs)/ B
  
  # Output
  return(round(P_perm, digits = 5))
  
}

Here's the program that simulates 1000 permutation tests:

power_p <- function(numTrial, delta) {
  
  # Initialize variables
  P_p <- rep(0, numTrial) 
  pwr_p <- 0
  
  # Simulation
  P_p <- replicate(n = numTrial, expr = perm_test(delta))
  
  # Power estimates are the proportions of p-values that are significant (i.e. less than 0.05)
  pwr_p <- sum(P_p < 0.05) / numTrial
  
  # Output 
  return(round(pwr_p, digits = 5))

}

Solution

perm_test2 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  g <- rep(1:2, each = 10)
  B <- 5000
  y_mean <- x_mean <- 0
  D_perm <- rep(0, B)
  for (i in 1:B) {
    ii <- sample(g) == 1L
    D_perm[i] <- (sum(v[ii]) - sum(v[!ii]))/10 
  }
  P_perm <- sum(D_perm >= D_obs)/ B
  return(round(P_perm, digits = 5))
}

In comparison to more complicated approaches, I propose the above one, where I have made some simple improvements to your existing code.

  1. we do not need to use a data.frame; that only takes unnecessary time to subset a vector each time
  2. instead of a factor group vector we can use a simple integer vector
  3. the ordering of the data is not needed
  4. we can reduce the mean calculations. In your code the 2 calls of tapply(N$v, N$g, mean) were the slowest part.
  5. mean(x) is slower than sum(x)/n because it does additional checks etc., so in this situation we can use the faster approach, as the inner loop will be executed 1000 x 5000 times (sims x B).

bench::mark(perm_test_org(0.5), perm_test2(0.5), iterations = 5, check = F,
            relative = T)[, 1:8]
# A tibble: 2 x 8
#   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>         <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>
# 1 perm_test_org(0.5)  15.5   15.3       1        1.00     1        5    16
# 2 perm_test2(0.5)      1      1        13.5      1        1.69     5     2

Approximately 15x faster on my system. 1000 iterations took 33.89 seconds.

Update 2.

We can improve speed even more by:

  1. replacing sample with sample.int
  2. then we see that the g vector isn't needed at all to select the two random groups
  3. in the loop we do not need to sum both parts of vector v, as we can compute sum(v) before the loop; that way we do one less sum inside the loop and calculate the result values at the end.

perm_test3 <- function(delta) {
  x <- rnorm(10, mean = delta, sd = 1)
  y <- rnorm(10, mean = 0, sd = 1)
  D_obs <- mean(x) - mean(y)
  v <- c(x, y)
  B <- 5000
  s <- sum(v)
  D_perm2 <- rep(0, B)
  for (i in 1:B) {
    D_perm2[i] <- sum(v[sample.int(10) < 6])
  }
  D_perm <- D_perm2 - (s - D_perm2)
  P_perm <- sum(D_perm/10 >= D_obs) / B
  return(round(P_perm, digits = 5))
}

Runs in +/- 20 seconds for 1000 iterations. Now the slowest part is the repeated call of sample.int. You can look into faster functions: https://www.r-bloggers.com/2019/04/fast-sampling-support-in-dqrng/
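The sum-one-group trick isn't R-specific. As an illustration, here is a hedged pure-Python port of the perm_test3 idea (stdlib only; the function name, defaults, and fixed seed are mine):

```python
import random

def perm_test_py(delta, n=10, b=5000, seed=42):
    rng = random.Random(seed)
    # Draw the two samples, as in the R version
    x = [rng.gauss(delta, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    d_obs = sum(x) / n - sum(y) / n
    v = x + y
    s = sum(v)                      # pooled total: one group sum determines the other
    hits = 0
    for _ in range(b):
        g1 = sum(rng.sample(v, n))  # sum of a random half of the pool
        # permuted difference in means: (g1 - (s - g1)) / n == (2*g1 - s) / n
        if (2 * g1 - s) / n >= d_obs:
            hits += 1
    return hits / b
```

As in the R version, summing one random group and deriving the other from the precomputed total halves the per-iteration work.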



Answered By - minem
Answer Checked By - Cary Denson (PHPFixing Admin)

Wednesday, March 16, 2022

[FIXED] Best practices for optimizing LAMP sites for speed?

 March 16, 2022     lamp, mysql, optimization, php     No comments   

Issue

I want to know when building a typical site on the LAMP stack how do you optimize it for the best possible load times. I am picturing a typical DB-driven site.

This is a high-level look, and each layer of the stack could probably be pulled out into its own question, so let me break it down layer by layer.

L - At the system level (setup and filesystem), what can you do to improve speed? One thing I can think of is image sizes; can compression here help optimize anything?

A - There have to be a ton of settings related to site speed here in the web server. Not my forte. It probably depends a lot on how many sites are running concurrently.

M - MySQL: in a database-driven site, DB performance is key. Is there a better normalization approach, i.e., using link tables? Web developers often just make simple monolithic tables resembling 1NF, and this can kill performance.

P - Aside from performance-boosting settings like caching, what can the programmer do to affect performance at a high level? I would really like to know if MVC design approaches hurt performance more than quick-and-dirty code. Other simple tips, like whether sessions are faster than cookies, would be interesting to know.

Obviously you have to get down and dirty into the details and find what code is slowing you down. Also I realize that many sites have many different performance characteristics, but let's assume a typical site that has more reads than writes.

I am just wondering if we can compile a bunch of best practices; I fully expect people to link other questions so we can effectively work up a checklist.

My goal is to see if, in addition to the usual performance issues, we can surface some oddball things you might not think of, to go along with a best-practices summary.

So my question is, if you were starting from scratch, how would you make sure your LAMP site was fast?


Solution

Here's a few personal must-dos that I always set up in my LAMP applications.

  • Install mod_deflate for apache, and do not use PHP's gzip handlers. mod_deflate will allow you to compress static content, like javascript/css/static html, as well as the usual dynamic PHP output, and it's one less thing you have to worry about in your code.

  • Be careful with .htaccess files! Enabling .htaccess files for directories in your app means that Apache has to scan the filesystem constantly, looking for .htaccess directives. It is far better to put directives inside the main configuration or a vhost configuration, where they are loaded once. Any time you can get rid of a directory-level access file by moving it into a main configuration file, you save disk access time.

  • Prepare your application's database layer to utilize a connection manager of some sort (I use a Singleton for most applications). It's not very hard to do, and reducing the number of database connections your application opens saves resources.

  • If you think your application will see significant load, memcached can perform miracles. Keep this in mind while you write your code... perhaps one day instead of creating objects on the fly, you will be getting them from memcached. A little foresight will make implementation painless.

  • Once your app is up and running, set MySQL's slow query time to a small number and monitor the slow query log diligently. This will show you where your problem queries are coming from, and allow you to optimize your queries and indexes before they become a problem.

  • For serious performance tweakers, you will want to compile PHP from source. Installing from a package installs a lot of libraries that you may never use. Since PHP environments are loaded into every instance of an Apache thread, even a 5MB memory overhead from extra libraries quickly becomes 250MB of lost memory when there's 50 Apache threads in existence. I keep a list of my standard ./configure line I use when building PHP here, and I find it suits most of my applications. The downside is that if you end up needing a library, you have to recompile PHP to get it. Analyze your code and test it in a devel environment to make sure you have everything you need.

  • Minify your Javascript.

  • Be prepared to move static content, such as images and video, to a non-dynamic web server. Write your code so that any URLs for images and video are easily configured to point to another server in the future. A web server optimized for static content can easily serve tens or even hundreds of times faster than a dynamic content server.
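For the slow-query-log bullet above, the relevant my.cnf settings look something like this (the threshold and file path are illustrative, not prescriptive — tune them to your workload):

```ini
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 0.5   # seconds; "a small number" as suggested above
log_queries_not_using_indexes = 1
```

With long_query_time set low, the log catches queries worth indexing long before they become user-visible problems.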

That's what I can think of off the top of my head. Googling around for PHP best practices will find a lot of tips on how to write faster/better code as well (Such as: echo is faster than print).



Answered By - zombat

Tuesday, March 15, 2022

[FIXED] LAMP stack performance under heavy traffic loads

 March 15, 2022     lamp, optimization, performance, server-load, traffic     No comments   

Issue

I know the title of my question is rather vague, so I'll try to clarify as much as I can. Please feel free to moderate this question to make it more useful for the community.

Given a standard LAMP stack with more or less default settings (a bit of tuning is allowed, client-side and server-side caching turned on), running on modern hardware (16GB RAM, 8-core CPU, unlimited disk space, etc), deploying a reasonably complicated CMS service (a Drupal or WordPress project for argument's sake) - what amounts of traffic, SQL queries, and user requests can I reasonably expect to accommodate before I have to start thinking about performance?

NOTE: I know that specifics will greatly depend on the details of the project, i.e., optimizing MySQL queries, indexing stuff, minimizing filesystem hits - assuming the web developers did a professional job - I'm really looking for a very rough figure in terms of visits per day, traffic during peak visiting times, how many records before (transactional) MySQL fumbles, and so on.

I know the only way to really answer my question is to run load testing on a real project, and I'm concerned that my question may be treated as partly off-topic.

I would like to get a set of figures from people with first-hand experience, e.g. "we ran such and such a set-up and it handled at least this much load [problems started surfacing after such and such]". I'm also greatly interested in any condensed (I'm short on time atm) reading I can do to get a better understanding of the matter.

P.S. I'm meeting a client tomorrow to talk about his project, and I want to be prepared to reason about performance if his project turns out to be akin to FourSquare.


Solution

Very tricky to answer without specifics, as you have noted. If I were tasked with what you have to do, I would take each component in turn (network interface, CPU/memory, physical IO load, SMP locking etc), get the maximum capacity available, and divide by a rough estimate of use per request.

For example, network IO. You might have 1x 1Gb card, which might achieve maybe 100 Mbytes/sec (I tend to use 80% of the theoretical max). How big will a typical 'hit' be? Perhaps 3 kbytes average, for HTML, images etc. That means you can achieve about 33k requests per second before you bottleneck at the physical level. These numbers are absolute maximums; depending on tools and skills you might not get anywhere near them, but nobody can exceed them.
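Spelling out that back-of-envelope arithmetic (using the same assumed numbers):

```python
nic_bits_per_sec = 1_000_000_000         # 1x 1Gb card
usable = 0.8 * nic_bits_per_sec / 8      # 80% of theoretical max, in bytes/sec
avg_hit = 3 * 1024                       # ~3 kbytes per request
max_rps = usable / avg_hit               # physical ceiling, roughly 33k req/sec
print(round(max_rps))
```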

Repeat the above for every component, perhaps varying your numbers a little, and you will quickly build a picture of what is likely to be a concern. Then consider how you can quickly get more capacity in each component: can you just throw $$ at it and gain more performance (e.g. use SSD drives instead of HDs)? Or will you hit a limit that cannot be moved without rearchitecting? Also take into account what resources you have available: do you have lots of skilled programmer time, DBAs, or wads of cash? If you have lots of a resource, you can tend to reduce those constraints more easily and quickly as you move along the experience curve.

Do not forget external components too, firewalls may have limits that are lower than expected for sustained traffic.

Sorry I cannot give you real numbers; our workloads use custom servers, heavy memory caching and other tricks, and not all the products you list. However, I would concentrate most on IO/SQL queries and possibly network IO, as these tend to be harder limits than CPU/memory, although I'm sure others will have a different opinion.



Answered By - rlb

Monday, February 14, 2022

[FIXED] How do I optimize this export feature for LAMP web app?

 February 14, 2022     caching, lamp, optimization     No comments   

Issue

I have a feature which allows the user to create projects and view them on this page. They can import resources (pdf, img, etc.) to be kept along with their projects. So now I want to create a feature which allows the user to export all their stuff, and that of the people who are in the same group as them, all neatly with a pretty ribbon tied in a zip file.

Currently I'm using Archive::Zip to zip up the files preemptively, keeping their CRC32 checksums, and running this as a daily cronjob to cut down the user's waiting time. But if there are any changes to any of the files I will have to rerun the whole thing.

My initial benchmark shows that 103MB of files takes up to 47 seconds to run. The process involves generating XML, linking it to XSL, copying images, HTML for the iframes, and whatnot.

I'm thinking of creating a table or a text file to keep the CRC32 checksum or last-modified date for all of the files in a temporary storage area, and comparing against this list each time the user clicks on export; if there are any changed files, I will remove the old file from the cached zip file and add in the new one. Or I will just keep all the loose files, copy and replace the newer files, and then do the archive on each click.
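The checksum-manifest idea sketched above can be prototyped with the stdlib's zlib.crc32 (a hedged illustration; the function names and the manifest shape are mine):

```python
import os
import zlib

def crc32_of(path, chunk=65536):
    """CRC32 of a file, read in chunks so big resources don't fill RAM."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            crc = zlib.crc32(block, crc)
    return crc

def stale_entries(root, manifest):
    """Paths under root that are new, or whose checksum differs from the
    stored {path: crc32} manifest -- only these need re-archiving."""
    stale = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if manifest.get(p) != crc32_of(p):
                stale.append(p)
    return stale
```

On each export, only the paths returned by stale_entries need to be replaced in the cached zip; everything else can be reused from the nightly build.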

My questions are:

  1. Is this considered as a premature or bad optimization technique?
  2. How should i properly optimize this?
  3. Is there some book or resources that I can learn for these sort of optimization techniques?

Solution

What's wrong with the idea of:

  • setting a flag of some sort whenever a users files change (add, delete or change file).
  • running your nightly compress on each user whose files have changed, then resetting that flag.
  • if the user requests an export when the flag is set, you'll have to do the compress again before export completes (there's no way around that).

To further speed up the user experience, you could also decouple the export request from the export operation. For example, when a user (whose flag is set) requests an export, notify them that it will be done when the compress happens, and set a different flag. Then modify the second step above to also export the newly created package if this second flag is set.

This gives the user immediate feedback that something will happen but moves the grunt work to the future.

Alternatively, you don't have to tie the export to the compress. You could compress every night but allow extra compress/export jobs during the day as needed. It's still good to decouple the request from the event however.
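The flag scheme above can be modeled in a few lines (a hedged Python sketch; in the real app the flags would live in the database and compress_job would run from cron):

```python
# Per-user state: did the files change, and is an export waiting on a rebuild?
state = {"files_changed": False, "export_requested": False}

def on_file_change():
    state["files_changed"] = True

def request_export():
    if state["files_changed"]:
        state["export_requested"] = True   # defer: tell the user it's queued
        return "queued for the next compress run"
    return "cached archive is current -- export now"

def compress_job():                        # nightly, or an extra daytime run
    if state["files_changed"]:
        # ... rebuild the zip here ...
        state["files_changed"] = False
        if state["export_requested"]:
            # ... deliver the freshly built package ...
            state["export_requested"] = False
```

The point of the sketch is the decoupling: request_export returns immediately in both cases, and the expensive rebuild only ever happens inside compress_job.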

Answering your specific questions.

1/ I do not consider this premature or bad optimization. The 'code' is functionally complete since it does all you ask of it so it is the right time for optimizing. In addition, you have identified the bottleneck and are optimizing the right area.

2/ See my text above. You should optimize it by doing exactly what you've done: identify the bottleneck and concentrate on improving that. Given that you're unlikely to get much better compression performance, the decoupling 'trick' I've suggested is a good one. Like progress bars and splash screens, it's usually more to do with the user's perception of speed than with speed itself.

3/ Books? Don't bother, there's thousands of resources on the net. Keep asking on SO and print out all the responses. Eventually your brain will be as full as mine and every new snippet of code will cause you to temporarily forget your wife's name :-).



Answered By - paxdiablo

Sunday, February 6, 2022

[FIXED] How to improve speed on my laravel website?

 February 06, 2022     laravel, optimization, performance, php     No comments   

Issue

I built a website using Laravel and it's extremely slow.

This is the code and the files for the website:

web.php:

Route::get('/product/{url}', 'FrontendController@getProduct')->where('url', '^[\w.-]*$');

FrontendController:

    public function getProduct(String $url){
    $product = Product::where('url', $url)->first();
    $imgs = ProductImage::all()->where('product_id',$product->id);
    return view('frontend.product', ['product' => $product, 'products' => Product::all(), 'images' => $imgs ]);
}

product.blade.php:

   <div class="container-fluid display-products-desktop px-5" style="margin-top:15rem">
    <div class="row mx-0 row-filters-desktop">
        <div class="col-12 d-flex justify-content-end mb-5">
            <a href="" class="cta-links cta-talk-designer" style="padding-right: 5rem">TALK WITH A DESIGNER</a>
            <!-- HOURS CONDITION -->
            @php
            $dt = new DateTime("now", new DateTimeZone('Portugal'));
            @endphp

            @if ($dt->format('H:i:s') >= '09:00:00' && $dt->format('H:i:s') <= '18:00:00')
                <a class="cta-links" data-toggle="modal" data-target="#contact-five-minutes-modal">CALL ME IN 5 MINUTES</a>
            @else
                <a class="cta-links" data-toggle="modal" data-target="#contact-us-modal">CONTACT US</a>
            @endif
        </div>
    </div>
    <div class="row m-0">
        <div class="col-2 p-0 thumb-div" style="border-right: 1px solid rgba(177, 177, 177, 0.8);">
            <div class="swiper-container gallery-thumbs m-0">
                <div class="swiper-wrapper">
                    @foreach ($images as $image)
                    <div class="swiper-slide thumbs"><img src="{{ $image->url }}" alt="" class="gallery-thumbs-slide"></div>
                    @endforeach
                </div>
              </div>
        </div>
        <div class="col-6 p-0">
            <div class="swiper-container gallery-top m-0">
                <div class="swiper-wrapper">
                    @foreach ($images as $image)
                    <div class="swiper-slide top"><img src="{{ $image->url }}" alt="" class="img-fluid"></div>
                    @endforeach
                </div>                
                <div class="swiper-next-2 swiper-dark"></div>
                <div class="swiper-prev-2 swiper-dark"></div>
            </div>
        </div>
        <div class="col-4 p-0 margin-top-text" style="padding-left: 1.5rem !important">
            <p class="product-page-pre-title">meet</p>
            <h1 class="m-0 main-product-title" style="font-weight: 500">{{ $product->name }}</h1>
            <h4 class="mb-5" style="font-weight: 500">by {{ $product->brand }}</h4>
            <p class="paragraph-product-page">{{ $product->description }}</p>
            <h5 class="mt-5 section-title-product-page">MATERIAL AND FINISHES:</h5>
            <p class="paragraph-product-page">{{ $product->finishes }}</p>
            <h5 class="mt-4 section-title-product-page">DIMENSIONS</h5>
            <p class="paragraph-product-page m-0">{{ $product->dimensions }}</p>
            <a class="cta-links" style="margin-top:2.5rem;margin-bottom:2.5rem; display:block;" href="#downloadCatalogue" data-bs-toggle="modal" data-bs-target="#downloadCatalogue">DOWNLOAD CATALOGUE</a>
            <a class="cta-links" style="margin-top:2.5rem;margin-bottom:2.5rem; display:block;" href="#productSheet{{ $product->id}}" data-bs-toggle="modal" data-bs-target="#productSheet{{ $product->id}}">DOWNLOAD PRODUCT SHEET</a>
            <a class="cta-links" style="margin-top:2.5rem;margin-bottom:2.5rem; display:block;" href="#reqCustom{{ $product->id}}" data-bs-toggle="modal" data-bs-target="#reqCustom{{ $product->id}}">REQUEST CUSTOMIZATION</a>
            <button class="product-price-btn" data-bs-toggle="modal" data-bs-target="#getPrice{{ $product->id }}">GET PRICE</button>
        </div>
    </div>    
    <div class="row mx-0">
        <div class="col-12 p-0 text-center" style="margin-top:6rem">
            <h1 class="m-5" style="font-weight: 600">Related Products</h1>
        </div>
    </div>
    <div class="d-row d-flex gap-2 justify-content-center">
        <div class="col-md-3 px-0">
            <img src="{{ $product->related_product_img_1}}" alt="" class="img-fluid">
        </div>
        <div class="col-md-3 px-0">
            <img src="{{ $product->related_product_img_2}}"  alt="" class="img-fluid">
        </div>
        <div class="col-md-3 px-0">
            <img src="{{ $product->related_product_img_3}}"  alt="" class="img-fluid">
        </div>
        <div class="col-md-3 px-0">
            <img src="{{ $product->related_product_img_4}}"  alt="" class="img-fluid">
        </div>
    </div>
    <div class="d-row d-flex gap-2 justify-content-center" style="margin-bottom: 5rem">
        @php
            $slug_1 = str_replace(' ', '-', $product->related_product_name_1);
            $url_1 = strtolower( $slug_1 );
            $slug_2 = str_replace(' ', '-', $product->related_product_name_2);
            $url_2 = strtolower( $slug_2 );
            $slug_3 = str_replace(' ', '-', $product->related_product_name_3);
            $url_3 = strtolower( $slug_3 );
            $slug_4 = str_replace(' ', '-', $product->related_product_name_4);
            $url_4 = strtolower( $slug_4 );
        @endphp
        <div class="col-md-3 px-0">
            <div class="text-center mt-1">
                <p class="product-subtitle">new</p>
                <a href="/product/{{ $url_1 }}"><h4 class="product-title">{{ $product->related_product_name_1}}</h4></a>
                <a class="product-price-link" href="#getPrice{{ $product->related_product_id_1}}" data-bs-toggle="modal" data-bs-target="#getPrice{{ $product->related_product_id_1}}">get price</a>
            </div>
        </div>
        <div class="col-md-3 px-0">
            <div class="text-center mt-1">
                <p class="product-subtitle">new</p>
                <a href="/product/{{ $url_2 }}"><h4 class="product-title">{{ $product->related_product_name_2}}</h4></a>
                <a class="product-price-link" href="#getPrice{{ $product->related_product_id_2}}" data-bs-toggle="modal" data-bs-target="#getPrice{{ $product->related_product_id_2}}">get price</a>
            </div>
        </div>
        <div class="col-md-3 pr-0 pl-1">
            <div class="text-center mt-1">
                <p class="product-subtitle">new</p>
                <a href="/product/{{ $url_3 }}"><h4 class="product-title">{{ $product->related_product_name_3}}</h4></a>
                <a class="product-price-link" href="#getPrice{{ $product->related_product_id_3}}" data-bs-toggle="modal" data-bs-target="#getPrice{{ $product->related_product_id_3}}">get price</a>
            </div>
        </div>
        <div class="col-md-3">
            <div class="text-center mt-1">
                <p class="product-subtitle">new</p>
                <a href="/product/{{ $url_4 }}"><h4 class="product-title">{{ $product->related_product_name_4}}</h4></a>
                <a class="product-price-link" href="#getPrice{{ $product->related_product_id_4}}" data-bs-toggle="modal" data-bs-target="#getPrice{{ $product->related_product_id_4}}">get price</a>
            </div>
        </div>
    </div>
</div>

These are the results shown on a speed test website:

[speed test result screenshots omitted]

I am a beginner, so I've been studying ways to improve this. Do any of you have suggestions on how to improve the website's performance? Thanks in advance.


Solution

Solved it by changing the following code and removing a few Bootstrap modals from the product.blade.php template.

FrontEndController:

public function getProductTest(String $url){
    $product = Product::where('url', $url)->first();
    $imgs = ProductImage::where('product_id',$product->id)->get();
    return view('frontend.product-test', ['product' => $product, 'images' => $imgs ]);
}

The speed went from almost eleven seconds to 1.22 seconds.
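As a side note, the two queries in the fixed controller could also be expressed as a single eager load. This is only a sketch, not the author's code: it assumes an images() hasMany relationship is defined on the Product model, which is not shown above.

```php
<?php
// Hypothetical controller variant using Eloquent eager loading.
// Assumes the Product model defines:
//   public function images() { return $this->hasMany(ProductImage::class); }
public function getProductTest(string $url)
{
    // with('images') loads the product and all of its images up front,
    // avoiding a separate query when the view iterates over them.
    $product = Product::with('images')->where('url', $url)->firstOrFail();
    return view('frontend.product-test', [
        'product' => $product,
        'images'  => $product->images,
    ]);
}
```

firstOrFail() also turns a bad URL into a 404 instead of a null-property error on $product->id.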



Answered By - Marília

Tuesday, January 18, 2022

[FIXED] Optimizing custom Wordpress SQL query for fetching usermeta data

 January 18, 2022     mysql, optimization, query-optimization, sql, wordpress     No comments   

Issue

I have the following query and it works, but it takes insanely long to process due to its structure. I'm therefore in need of assistance to make this query faster.

SQL Query

In the query below, replace PRODUCT_ID with the actual product ID number, wrapped in quotes.

SELECT
    b.order_id,
    customer_meta.meta_value AS customer_id,
    users.user_email,
    qty_table.meta_value AS qty,
    user_meta1.meta_value AS firstname,
    user_meta2.meta_value AS lastname,
    user_meta3.meta_value AS company,
    user_meta4.meta_value AS address,
    user_meta5.meta_value AS city,
    user_meta6.meta_value AS postcode,
    user_meta7.meta_value AS state,
    user_meta8.meta_value AS user_phone
FROM
    wp_woocommerce_order_itemmeta a,
    wp_woocommerce_order_items b,
    wp_postmeta customer_meta,
    wp_users users,
    wp_woocommerce_order_itemmeta qty_table,
    wp_usermeta user_meta1,
    wp_usermeta user_meta2,
    wp_usermeta user_meta3,
    wp_usermeta user_meta4,
    wp_usermeta user_meta5,
    wp_usermeta user_meta6,
    wp_usermeta user_meta7,
    wp_usermeta user_meta8
WHERE
    a.meta_key = '_product_id'
    AND a.meta_value = PRODUCT_ID
    AND a.order_item_id = b.order_item_id
    AND customer_meta.meta_key = '_customer_user'
    AND customer_meta.post_id = b.order_id
    AND user_meta1.meta_key = 'first_name'
    AND user_meta1.user_id = users.id
    AND user_meta2.meta_key = 'last_name'
    AND user_meta2.user_id = users.id
    AND user_meta3.meta_key = 'billing_company'
    AND user_meta3.user_id = users.id
    AND user_meta4.meta_key = 'billing_address_1'
    AND user_meta4.user_id = users.id
    AND user_meta5.meta_key = 'billing_city'
    AND user_meta5.user_id = users.id
    AND user_meta6.meta_key = 'billing_postcode'
    AND user_meta6.user_id = users.id
    AND user_meta7.meta_key = 'billing_state'
    AND user_meta7.user_id = users.id
    AND user_meta8.meta_key = 'billing_phone'
    AND user_meta8.user_id = users.id
    AND users.ID = customer_meta.meta_value
    AND qty_table.meta_key = '_qty'
    AND qty_table.order_item_id = b.order_item_id
ORDER BY user_meta3.meta_value ASC

I need all the information, since I want to list all the users with their first name, last name, company, address, postcode, etc. for a given product that has been bought. So the query itself works, but it is a killer in processing time.

I could use max(CASE WHEN ... THEN ... END) AS a_name, but I only know how to do that successfully with a LEFT JOIN.

Any tips on how to get this query to run better?


Solution

WP, are you listening? WooCommerce, are you listening? I am getting tired of optimizing your database app.

First and foremost, EAV is a terrible schema design. But I won't rant about that. I'll just point out the index(es) that are probably missing or malformed:

wp_usermeta:  PRIMARY KEY(user_id, meta_key)

Without any (191) prefix length tacked on.

Similarly for wp_woocommerce_order_itemmeta.

I may have more abuse to dish out; please provide SHOW CREATE TABLE for the tables being used.
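As a rough sketch of that suggestion (not a drop-in fix: verify the existing keys with SHOW CREATE TABLE first, and note this adds secondary composite indexes rather than replacing the primary keys, since other plugins may rely on the auto-increment id columns):

```sql
-- Let each meta-table join become a single index seek on
-- (owner id, meta_key) instead of a scan filtered by meta_key.
ALTER TABLE wp_usermeta
    ADD INDEX user_id_meta_key (user_id, meta_key);

ALTER TABLE wp_woocommerce_order_itemmeta
    ADD INDEX order_item_id_meta_key (order_item_id, meta_key);
```

With those in place, each of the eight wp_usermeta self-joins in the query above can be satisfied directly from the index.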



Answered By - Rick James

Friday, January 14, 2022

[FIXED] Troubleshooting slow web page load times, how to do it/get better?

 January 14, 2022     apache, lamp, mysql, optimization     No comments   

Issue

I'm the primary developer at a web firm, but I often end up doing some sysadmin work, and I was wondering what resources are available for learning how to troubleshoot slow page load times. I have almost no experience with sysadmin tools. I'm relatively proficient at the Linux/Unix command line, but I have never used any packet-capture software and only know the basics of using dig for IP lookups. My experience with Apache and MySQL is mostly limited to initial setup and then day-to-day use.

Are there any good books or websites that cover the topics needed for accurately diagnosing website performance bottlenecks? If so, what are they? Or is the gamut of technologies too large, and is hands-on experience with them typically how people get good at this?


Solution

Overall, there's no substitute for experience. A concept as broad as "slow web page load time" could be hitting a bottleneck in any number of different places:

  1. Client is slow to resolve IP from domain.
  2. Network between client and domain is slow or congested.
  3. Server is slow to respond to request.
  4. Requested page is large.
  5. Requested resources embedded in page are large.
  6. Page contains server-side code that requires significant processing.
  7. Database is slow to respond.
  8. Page manipulates a lot of data before responding.
  9. Rendered page on client contains a lot of code and runs slowly.
  10. etc.

For any given page, it's a matter of knowing where the bottlenecks could be and determining which one is actually in play in order to address it. Having a full and complete understanding of everything that goes on from end to end in "loading a page" is essential. Identifying patterns of slow load times across multiple disparate requests will help narrow down the potential bottlenecks.

It's very much a case-by-case scenario.
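One cheap way to separate the client/network causes (items 1-2 above) from the server-side ones (items 3 onward) is curl's built-in timing variables. A sketch, with a placeholder URL:

```shell
#!/bin/sh
# Print where the time goes for a single request: DNS lookup,
# TCP connect, time to first byte, and total transfer time.
URL="https://example.com/"   # placeholder; substitute the slow page's URL
curl -s -o /dev/null -w \
'dns:     %{time_namelookup}s
connect: %{time_connect}s
ttfb:    %{time_starttransfer}s
total:   %{time_total}s
' "$URL"
```

If "ttfb" dominates, look server-side (application code, database); if "total" is much larger than "ttfb", the payload size or bandwidth is the likelier culprit.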



Answered By - David

Wednesday, January 12, 2022

[FIXED] Profiling and optimising PHP / MySQL websites

 January 12, 2022     lamp, mysql, optimization, php, query-optimization     No comments   

Issue

I have a server (VPS) that hosts numerous PHP / MySQL websites. Most are quite similar in that they are all hand-coded websites serving text and images from MySQL databases.

Server traffic has increased a fair amount recently and the server is experiencing some slowdown. As such, I want to try to identify bottlenecks in the server so that I can improve its speed.

Does anyone have any tips on how to do this? I have set up timing scripts on some of my larger sites to see how long the pages take to generate, but it's always a really low figure. According to the server stats, the main issue seems to be CPU/MySQL usage. Is there any way to identify queries that are taking a long time?

Thanks, Chris


Solution

Yes, there is a way! MySQL has a built-in feature for this. You can set up a log file to log slow queries.

Other general advice would of course be to use EXPLAIN on common queries and check if everything is indexed properly.
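As a minimal sketch (MySQL 5.1 or later; the file path, threshold, and example table are placeholders), the slow query log can be switched on at runtime without a restart:

```sql
-- Log every statement that takes longer than one second.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- For a query that shows up in the log, check its plan; a "type"
-- of ALL (full table scan) usually points at a missing index.
EXPLAIN SELECT * FROM articles WHERE slug = 'home-page';
```

The mysqldumpslow tool bundled with MySQL can then group the log entries by query pattern to surface the worst offenders.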



Answered By - Carsten


Copyright © PHPFixing