Monday, September 5, 2022

[FIXED] How can I trim empty/whitespace lines?

Tags: c++, text, trim

Issue

I have to process badly mismanaged text with creative indentation. I want to remove the empty (or whitespace-only) lines at the beginning and end of my text without touching anything else: if the first actual line begins with whitespace, or the last actual line ends with whitespace, that whitespace stays.

For example, this:

<lines, empty or with whitespaces ...>
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
<lines, empty or with whitespaces ...>

turns to

<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>

preserving the spaces at the beginning and the end of the actual text lines (the text might also be entirely whitespace).

Replacing matches of the regex (\A\s*(\r\n|\Z)|\r\n\s*\Z) with the empty string does exactly what I want, but a regex is kind of overkill, and I fear it might cost me some time when processing texts with a lot of lines but not much to trim.

On the other hand, an explicit algorithm is easy to make (just read until a non-whitespace/the end while remembering the last line feed, then truncate, and do the same backwards) but it feels like I'm missing something obvious.

How can I do this?


Solution

As you can see from this discussion, trimming whitespace requires a lot of work in C++. This should definitely be included in the standard library.

Anyway, I've checked how to do it as simply as possible, but nothing comes near the compactness of a regex. For speed, it's a different story.

Below are three versions of a program that does the required task: one with regex, one with standard library functions, and one with just a couple of indexes. The last one could be made even faster by avoiding the copy altogether, but I kept the copy for a fair comparison:

#include <string>
#include <sstream>
#include <chrono>
#include <iostream>
#include <regex>
#include <exception>

struct perf {
    std::chrono::steady_clock::time_point start_;
    perf() : start_(std::chrono::steady_clock::now()) {}
    double elapsed() const {
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed_seconds = stop - start_;
        return elapsed_seconds.count();
    }
};

std::string Generate(size_t line_len, size_t empty, size_t nonempty) {
    std::string es(line_len, ' ');
    es += '\n';
    for (size_t i = 0; i < empty; ++i) {
        es += es;
    }

    std::string nes(line_len - 1, ' ');
    nes += "a\n"; // each non-empty line is padding followed by a visible character
    for (size_t i = 0; i < nonempty; ++i) {
        nes += nes;
    }

    return es + nes + es;
}


int main()
{
    std::string test;
    //test = "  \n\t\n  \n  \tTEST\n\tTEST\n\t\t\n  TEST\t\n   \t\n \n  ";
    std::cout << "Generating...";
    std::cout.flush();
    test = Generate(1000, 8, 10);
    std::cout << " done." << std::endl;

    std::cout << "Test 1...";
    std::cout.flush();
    perf p1;
    std::string out1;
    std::regex re(R"(^\s*\n|\n\s*$)");
    try {
        out1 = std::regex_replace(test, re, "");
    }
    catch (std::exception& e) {
        std::cout << e.what() << std::endl;
    }
    std::cout << " done. Elapsed time: " << p1.elapsed() << "s" << std::endl;

    std::cout << "Test 2...";
    std::cout.flush();
    perf p2;
    std::stringstream is(test);
    std::string line;
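    // Skip leading whitespace-only lines; the loop body is intentionally empty.
    // Afterwards `line` holds the first line that contains a non-whitespace character.
    while (std::getline(is, line) && line.find_first_not_of(" \t\n\v\f\r") == std::string::npos);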
    std::string out2 = line;
    size_t end = out2.size();
    while (std::getline(is, line)) {
        out2 += '\n';
        out2 += line;
        if (line.find_first_not_of(" \t\n\v\f\r") != std::string::npos) {
            end = out2.size();
        }
    }
    out2.resize(end);
    std::cout << " done. Elapsed time: " << p2.elapsed() << "s" << std::endl;

    if (out1 == out2) {
        std::cout << "out1 == out2\n";
    }
    else {
        std::cout << "out1 != out2\n";
    }

    std::cout << "Test 3...";
    std::cout.flush();
    perf p3;
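    // Lookup table indexed by unsigned char. Note the polarity: 0 marks whitespace
    // (\t \n \v \f \r and space), 1 marks every other character.
    static bool whitespace_table[] = {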
        1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    };
    size_t sfl = 0; // Start of first line
    for (size_t i = 0, end = test.size(); i < end; ++i) {
        if (test[i] == '\n') {
            sfl = i + 1;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    size_t ell = test.size(); // End of last line
    for (size_t i = test.size(); i-- > 0;) {
        if (test[i] == '\n') {
            ell = i;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    std::string out3 = test.substr(sfl, ell - sfl);
    std::cout << " done. Elapsed time: " << p3.elapsed() << "s" << std::endl;

    if (out1 == out3) {
        std::cout << "out1 == out3\n";
    }
    else {
        std::cout << "out1 != out3\n";
    }

    return 0;
}

Running it on C++ Shell you get these timings:

Generating... done.
Test 1... done. Elapsed time: 4.2288s
Test 2... done. Elapsed time: 0.0077323s
out1 == out2
Test 3... done. Elapsed time: 0.000695783s
out1 == out3

If performance is important, measure it on your real files rather than relying on these synthetic numbers.
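The answer notes that the index-based version could skip the copy altogether. A minimal sketch of that idea, assuming C++17 so that std::string_view is available, might look like the following. The names is_blank and trim_blank_lines are hypothetical and not part of the original program. One more assumption: for input that is entirely whitespace this sketch returns an empty view, which follows the regex in the question (where \A\s*\Z matches the whole text).

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true for the same six characters the programs above
// treat as whitespace (space, \t, \n, \v, \f, \r).
constexpr bool is_blank(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Hypothetical helper: returns a view into `text` without the leading and
// trailing whitespace-only lines. No characters are copied.
std::string_view trim_blank_lines(std::string_view text) {
    std::size_t first = 0; // start of the first line to keep
    std::size_t i = 0;
    for (; i < text.size() && is_blank(text[i]); ++i) {
        if (text[i] == '\n') {
            first = i + 1;
        }
    }
    if (i == text.size()) {
        return {}; // assumption: all-whitespace input trims to nothing
    }

    std::size_t last = text.size(); // one past the end of the last line to keep
    for (std::size_t j = text.size(); j-- > 0 && is_blank(text[j]);) {
        if (text[j] == '\n') {
            last = j;
        }
    }
    return text.substr(first, last - first);
}
```

The returned view points into the caller's buffer, so it is only valid while the original string is alive and unmodified. Construct a std::string from it if an owning copy is needed.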

As a side note, this regex doesn't work on MSVC, because I couldn't find a way to stop ^ and $ from matching at the start and end of every line, that is, to disable the multiline mode of operation. If you run it there, it throws an exception: regex_error(error_complexity): The complexity of an attempted match against a regular expression exceeded a pre-set level. I think I'll ask how to cope with this!



Answered By - Costantino Grana
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)