How to remove HTML tags in C and C++ with RegEx
Regular expressions are a great tool for any programming language.
The other day I saw a simple but interesting question on the internet. Someone posted wanting to know: "How to remove HTML tags in C?"
RegEx quickly came to mind, but with C++.
If you understand regular expressions in C++, it is really very easy. Just:
- Include the <regex> header;
- Define the pattern of the regular expression;
- And finally use the regex_replace() function to replace each match with the string you want.
In summary the code is this:
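#include <iostream>
#include <regex>
#include <string>
int main(){
  std::string html = "<a href=\"https://terminalroot.com/\">This is a link</a>";
  // Matches '<', then any run of characters other than '>', then '>'
  std::regex tags("<[^>]*>");
  // Empty replacement: every match is simply dropped
  std::string remove{};
  std::cout << std::regex_replace(html, tags, remove) << '\n';
  return 0;
}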
Probable output:
This is a link
But in C, things are really not that easy.
C language
You can use regex.h in C, but it only checks for patterns; the replacement is up to you.
For example, to check whether a given string has tags in it, we can use it like this:
#include <stdio.h>
#include <regex.h>
int main(){
  regex_t regex;
  // regcomp returns non-zero if the pattern fails to compile
  if(regcomp(&regex, "<[^>]*>", REG_EXTENDED) != 0){
    return 1;
  }
  // regexec returns 0 when the pattern matches
  int check_regex = regexec(&regex, "<p>Tag</p>", 0, NULL, 0);
  !check_regex ? printf("Has tags!\n") : printf("It has no tags.\n");
  regfree(&regex);
  return 0;
}
Likely output:
Has tags!
For more information, open the POSIX manual page with the command:
man regex.h
Removing HTML tags in C
After checking that a given string has tags (which saves processing when it has none), the next step is to remove them.
I came up with a simple solution of my own that may be contested by C lovers, but it works. The steps are:
- Include headers: stdio.h to use printf; string.h to use strlen; and stdbool.h to use the bool type
- Define a SIZE constant that sets the fixed size of the buffers
- Create a function that returns a char * with the cleaned string. That function works as follows:
  - A for loop goes through the string, one character at a time;
  - It checks whether the current character is <, the character that opens a tag;
  - If yes, it sets the boolean variable tag to true;
  - While tag is false, it concatenates the current character into a temporary output of the same size: out[SIZE];
  - And to resume adding characters, tag goes back to false only after the > closing character is found.
The final code is:
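#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#define SIZE 4096
char * remove_tags(char string[]){
  bool tag = false;
  // static: the buffer must outlive the call, a plain local array would dangle
  static char out[SIZE];
  // strncat needs a terminated destination, so start with an empty string
  out[0] = '\0';
  size_t len = strlen(string);
  // stop at SIZE - 1 so the output can never overflow
  for(size_t i = 0; i < len && i < SIZE - 1; i++){
    if( string[i] == '<'){
      tag = true;
    }
    if(!tag){
      strncat(out, &string[i], 1);
    }
    if(string[i] == '>'){
      tag = false;
    }
  }
  return out;
}
int main(){
  char string[SIZE] = "<a href=\"https://terminalroot.com/\">This is a link</a>";
  printf("%s\n", remove_tags(string));
  return 0;
}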
Probable output:
This is a link
The right thing would be to allocate space on the heap, because a string that contains an HTML document can be huge. But for didactic purposes, and to understand the logic, a fixed size is good enough.