How to remove HTML tags in C and C++ with RegEx
Regular expressions are a great tool for any programming language.
The other day I saw a simple but interesting question on the internet. Someone wanted to know: "How to remove HTML tags in C?"
RegEx quickly came to mind, but with C++.
If you understand regular expressions in C++, it is really very easy. Just:
- Include the `<regex>` header;
- Define the regular expression pattern;
- And finally use the `regex_replace()` function to replace the matches with the string you want.
In summary the code is this:
Probable output:
This is a link
But in C, things are really not that easy.
C language
You can use `regex.h` in C, but it only matches patterns; there is no replace function, so the replacement is up to you.
For example, to check whether a given string has tags in it, we can use it like this:
Likely output:
Has tags!
For more information, see the POSIX regex page of the manual, typically available with the command `man 3 regex` (the exact section may vary by system).
Removing HTML tags in C
After checking whether a given string has tags (which saves processing when there are none), the next step is to remove them.
I came up with a simple solution of my own that C lovers may contest, but it works. The code does the following:
- Include the headers: `stdio.h` to use `printf`; `string.h` to use `strlen`; and `stdbool.h` to use the `bool` type
- Define a `SIZE` constant for the buffer size, which keeps the code fast by avoiding dynamic allocation
- Create a function that returns a `char *` holding the cleaned string. That function works as follows:
  - A `for` loop goes through the string, one character at a time, up to its length;
  - It checks whether the opening character of a tag, `<`, was found in the string;
  - If so, it sets the boolean variable `tag` to `true`;
  - While `tag` is `false`, it concatenates the character into a temporary output buffer of the same size: `out[SIZE]`;
  - To resume adding characters, `tag` is changed back to `false` only after the closing character `>` is found.
The final code is:
Probable output:
This is a link
The right thing would be to allocate space on the heap, because a string that contains an HTML document can be huge. But for didactic purposes, and to understand the logic, it's a good size.