Using Regular Expressions in Erlang

rustkas · 19 July 2021 09:55

I found the answer to my question (related to \s). More details at stackoverflow.com - question.

rustkas · 21 July 2021 05:00

Today I implemented something related to regular expressions that Simon St.Laurent, the author of the very useful book “Introducing Erlang”, had long wanted to do but never found the time for. I added to the erlang-simple-string library the functions needed for convenient work with regular expressions. They will be useful for readers of the book who want to understand how regular expressions work in Erlang.

Now this library is a complete solution describing how to work with strings in Erlang.

rustkas · 21 July 2021 19:50

To preserve my “discoveries” about how regular expressions in Erlang handle certain character classes (which, by the developers’ decision, require additional locale settings), I decided to implement an Erlang library that helps overcome this limitation. As a first step, I released a conversion of the restricted shorthand classes into a simpler form. I named it re_tuner.

Next, using the regular expression capabilities that Erlang provides, I would like to implement a parser for regular expression source text that replaces problematic shorthand classes with simpler counterparts.

Here’s an example of how such a simple conversion works:


research_01_test() ->
    Expected = true,
    ValidCharacterList = get_valid_character_list(),
    RegularExpression = "[[:punct:]]",
    TunedRegularExpression = re_tuner:tune(RegularExpression),
    {ok, MP} = re:compile(TunedRegularExpression),
    Result = check_all_by_regex(MP, ValidCharacterList, true),
    ?assertEqual(Expected, Result).

rustkas · 24 July 2021 10:51

While reading the book “Introducing Regular Expressions” and continuing to explore regular expressions in Erlang, I would like to share examples of how to search for text in a case-insensitive manner. This problem can be solved in several ways, and here are examples:

This is the default solution, using a compile option:

    Regex = "(?:the)",
    {ok, MP} = re:compile(Regex, [caseless]),

And these variants use the option letter i inside the pattern:

    Regex = "(?i)(?:the)",
    Regex = "(?:(?i)the)",
    Regex = "(?i:the)",

In the last form, the option letter i is inserted between the question mark and the colon.
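
The four variants above can be checked side by side. This is a minimal sketch, not the code from the linked project; the module name, function name and sample text are made up for illustration.

```erlang
-module(caseless_demo).
-export([matches/0]).

%% Returns one boolean per variant; each pattern should find "the" in
%% an upper-case subject.
matches() ->
    Subject = "THE END",
    Variants = [
        {"(?:the)", [caseless]},   % compile option
        {"(?i)(?:the)", []},       % inline option at the start of the pattern
        {"(?:(?i)the)", []},       % inline option inside the group
        {"(?i:the)", []}           % option letter between ? and :
    ],
    [begin
         {ok, MP} = re:compile(Regex, Opts),
         %% With {capture, none}, re:run/3 returns just match or nomatch.
         re:run(Subject, MP, [{capture, none}]) =:= match
     end || {Regex, Opts} <- Variants].
```

Calling caseless_demo:matches() should give a list of four true values.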

Full Erlang code solution is here.

rustkas · 25 July 2021 19:05

I turned to another interesting feature of working with regular expressions: Unicode characters. If the character you are looking for has a hexadecimal code with three or more significant digits, the digits must be enclosed in curly braces.

Regex = "\\x{2014}",
Regex = "\\x{6C60}",

Octal numbers have much the same behavior.
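
As an illustration of the braces rule, here is a small sketch (hypothetical module and function names, not the linked example). Without braces, \x in the pattern reads at most two hex digits, so a code such as 2014 (the em dash) needs \x{2014}, and the pattern has to be compiled with the unicode option.

```erlang
-module(hex_escape_demo).
-export([find_em_dash/0]).

find_em_dash() ->
    %% The subject uses Erlang's own \x{...} string escape, so this source
    %% file stays plain ASCII. 16#2014 is the em dash.
    Subject = "a\x{2014}b",
    %% "\\x{2014}" reaches the regex engine as \x{2014}. The unicode option
    %% is required because the code point is above 255.
    {ok, MP} = re:compile("\\x{2014}", [unicode]),
    re:run(Subject, MP, [{capture, first, list}]).
```

The call should return {match, [[16#2014]]}, that is, a one-character string holding the em dash.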

The complete example is here.

rustkas · 27 July 2021 19:23

Regular expressions are fraught with an inextricable stream of subtleties that you need to know in order to get anything valuable.

Here’s another one. Note that the documentation for the re:replace/4 function does not mention the multiline option. But if you want to make a slightly more complex replacement, such as this one (a sort of text-to-HTML converter example),

    FileContent = read_rime(),
    Regex = "(^.*$)",
    % In the replacement text, & stands for the whole match.
    Markup =
        "<!DOCTYPE html>\n"
        "<html lang=\"en\">\n"
        "\t<head>\n"
        "\t\t<title>&</title>\n"
        "\t\t<meta charset=\"utf-8\"/>\n"
        "\t</head>\n"
        "\t<body>\n"
        "\t\t<h1>&</h1>\n",
    NewContent =
        re:replace(FileContent, Regex, Markup, [multiline, {return, list}]),
% ...

then you will definitely need this option: without it, the metacharacters ^ and $ match only at the start and end of the whole text, not at the start and end of each line.

Full code example is here.
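
The effect of multiline is easy to see on a two-line string. The following is a simplified sketch with made-up names, not the converter from the linked project: it wraps each line in a paragraph tag, with and without the option.

```erlang
-module(multiline_demo).
-export([wrap_lines/2]).

%% Wraps every match of ^.+$ in <p>...</p>. ExtraOpts is either [] or
%% [multiline], so both behaviours can be compared.
wrap_lines(Text, ExtraOpts) ->
    Opts = [global, {return, list} | ExtraOpts],
    %% & in the replacement inserts the whole matched text.
    re:replace(Text, "^.+$", "<p>&</p>", Opts).
```

With [multiline], "one\ntwo" should become "<p>one</p>\n<p>two</p>". With an empty option list the pattern cannot match at all, because $ is not allowed before the newline in the middle, and the text comes back unchanged.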

rustkas · 28 July 2021 20:07

I did it!
Ninth chapter of “Introducing Regular Expressions”.

This is where the most interesting work takes place. The author proposes to implement text transformation to html5 using regular expressions and auxiliary tools: sed (“stream editor”) and the Perl programming language capabilities.

I really wanted to make a similar solution using the Erlang programming language and the capabilities of the re module.
It was not easy to come up with similar solutions. But everything worked out. I am very happy about that.

The work done can be found here.
I have also implemented an escript app.

rustkas · 29 July 2021 20:38

The End of the Beginning
I have worked through all the material in the book “Introducing Regular Expressions”. Chapter 10 inspired me to keep deepening my knowledge of regular expressions. Thanks to this book and its explanations of the examples, I have developed a clear understanding of what regular expressions are capable of. Porting the book’s material to Erlang gave me skills in working with the re library in various, sometimes difficult, situations.

I recommend this book as an introduction to regular expressions, and my research too (in the form of projects, as a good illustration of the author’s examples using Erlang).

While reading the book, I also created three hex.pm projects that helped me implement tests of various regular expression functionality (these are my first hex.pm projects):

rustkas · 30 July 2021 09:52

«As with any language, the key to learning regular expressions is practice, practice, practice»
— Ben Forta

Ben Forta wrote a book (two editions: “Regular Expressions in 10 minutes” and “Learning Regular expressions”) with a structure and purpose very similar to those of Michael Fitzgerald’s book “Introducing Regular Expressions”.

I am interested in how he presents this difficult material in a simple way, and I want to work through his books in the same way as I did with the one I have already studied.

I created a new project on GitHub and enthusiastically set to work (I will work with the material taking into account what I have already learned about regular expressions and the re module).

rustkas · 1 August 2021 17:17

Ben Forta’s book meets my expectations. I am now at the fourth chapter. The book describes regular expressions in a simple way, and the author shares his pearls of knowledge. Adapting the book’s examples to the capabilities of Erlang is exciting: I keep finding new features for working with strings (regular expressions are used everywhere, implicitly or explicitly). For example, Erlang considers these two literals to be exactly the same:
"myArray\[[0-9]\]"
"myArray[[0-9]]"

The \ character is ignored. That is interesting.

This question troubles anyone who has ever wanted to dig deeper into the backslash in Erlang.
http://erlang.org/pipermail/erlang-questions/2009-May/043755.html

I have done a lot of experiments researching this question. I came to the conclusion that this is a limitation of how Erlang handles string literals. One way around it is to write the text with a backslash to a text file and read it back, which gives a string with the expected content (including the backslash).

You can search for a literal backslash as follows:
    Regex = "\\\\",

This practice of programming in Erlang develops me as a programmer: I find new ways to organize testing, simplify the project, and improve the libraries I have created.

NobbZ · 1 August 2021 19:32

They are indeed the same string, and therefore will result in the same regex.

As there is no first-class syntax support for regexes in Erlang, you have to escape the backslash in the string so that the regex engine actually sees it.
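
The two points above can be checked directly. This is a minimal sketch with hypothetical module and function names; the sample subject "myArray[0]" is chosen only for illustration.

```erlang
-module(backslash_demo).
-export([same_string/0, escaped/0, unescaped/0, literal_backslash/0]).

%% In a string literal "\[" is just "[", so both literals are one string.
same_string() ->
    "myArray\[[0-9]\]" =:= "myArray[[0-9]]".

%% Doubling the backslash lets the regex engine see \[ and \],
%% which match literal square brackets.
escaped() ->
    re:run("myArray[0]", "myArray\\[[0-9]\\]", [{capture, none}]).

%% Without the backslashes the engine sees the class [[0-9] followed by
%% a literal ], which does not match the same subject.
unescaped() ->
    re:run("myArray[0]", "myArray[[0-9]]", [{capture, none}]).

%% Four backslashes in the literal give the two-character regex \\,
%% which matches one literal backslash in the subject.
literal_backslash() ->
    re:run("a\\b", "\\\\", [{capture, first, list}]).
```

Here escaped/0 should return match and unescaped/0 should return nomatch, while literal_backslash/0 returns the single backslash it found.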

rustkas · 1 August 2021 19:40

Most interestingly, Erlang strings work the same way outside of regular expressions too.

Working actively with Erlang, I see that the better I understand how it handles regular expressions, the clearer all the other functions for presenting or searching text become.

rustkas · 2 August 2021 18:35

I really enjoy working with the editions of Ben Forta’s book. At the end of the fifth chapter there is an interesting example that helped me illustrate an unusual compile option of the re library. This option is ungreedy.

    {ok, MP} = re:compile(Regex, [ungreedy]),

Inverts the “greediness” of the quantifiers so that they are not greedy by default, but become greedy if followed by “?”. It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern.

If option ungreedy is set (an option that is not available in Perl), the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. That is, it inverts the default behavior.

Thanks to it, there is no need to set the lazy quantifier (?) explicitly. It is set implicitly. This is sometimes convenient. I am glad that the author of the book inspired me to implement this example.
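
A short sketch of the difference, assuming a made-up subject with two bold tags (module and function names are hypothetical, and this is not the code from the linked example):

```erlang
-module(ungreedy_demo).
-export([bold_tags/1]).

%% Opts is either [] (greedy quantifiers) or [ungreedy] (lazy by default).
bold_tags(Opts) ->
    Subject = "<b>one</b> and <b>two</b>",
    {ok, MP} = re:compile("<b>.*</b>", Opts),
    {match, Matches} = re:run(Subject, MP, [global, {capture, first, list}]),
    Matches.
```

With [] the greedy .* runs to the last closing tag, so there is a single match covering the whole subject. With [ungreedy] each tag pair is matched separately, as if the pattern had been written with .*? instead.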

How it looks in the code - here.

rustkas · 4 August 2021 09:06

Another option specific to the re module (it has no equivalent in Perl) is the compile option dollar_endonly. It is convenient in combination with the notempty option of re:run or re:replace.

A dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before a newline at the end of the string (but not before any other newlines). This option is ignored if option multiline is specified. There is no equivalent option in Perl, and it cannot be set within a pattern.
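
The description above can be tried on a subject that ends with a newline. A minimal sketch with hypothetical names:

```erlang
-module(dollar_demo).
-export([ends_with_end/1]).

%% Opts is either [] or [dollar_endonly]. The subject ends with a newline.
ends_with_end(Opts) ->
    {ok, MP} = re:compile("end$", Opts),
    re:run("the end\n", MP, [{capture, none}]).
```

By default the dollar is allowed to match just before the final newline, so the call with [] returns match. With [dollar_endonly] the dollar matches only at the very end of the subject, and the same call returns nomatch.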

Code example is here.

rustkas · 4 August 2021 09:39

Another interesting aspect of working with regex in Erlang is what happens when a pattern that does not take Unicode into account is applied to text containing Unicode characters. Most of the time I was getting this error:

*failed*
in function re:run/3 (re.erl, line 788)
  called as run([60,115,99,114,105,112,116,62,13,10,102,117,110,99|...],{re_pattern,0,0,0,<<69,82,67,80,113,0,0,0,2,...>>},[global,{capture,all,list}])
...

**error:badarg
  output:<<"">>

This behavior is a clear indication that the input contains Unicode characters. The solution is either to fix the regex pattern (for example, by compiling it with the unicode option) or to change the input.
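
The error can be reproduced in a few lines. This is a sketch under the assumption that the failing input is a character list containing a code point above 255; the module name, function name and sample subject are invented for the example.

```erlang
-module(unicode_badarg_demo).
-export([run/1]).

%% Opts is either [] or [unicode].
run(Opts) ->
    %% 16#2014 (em dash) is above 255, so without the unicode option this
    %% list is not valid iodata for re:run/3.
    Subject = "\x{2014} ok",
    try
        re:run(Subject, "ok", [{capture, none} | Opts])
    catch
        error:badarg -> badarg
    end.
```

Here run([]) should end in the badarg branch, while run([unicode]) should return match.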

rustkas · 4 August 2021 17:48

I continue to share bits of info about working with regular expressions in Erlang (I study this topic by combining regex study material with the Erlang re module documentation, using TDD).

Using Subexpressions

Working with subexpressions (that is, the parts of a pattern enclosed in parentheses) has its own peculiarities. When searching globally (g - global), it is important to choose what gets captured, and Erlang provides a group of settings for this that are best specified explicitly. Often you want only the whole match, not the subexpressions. The default result may confuse you (as it did me), since in addition to the whole match it includes the match of every subexpression, which often just clutters the data you need. In that case, set the capture option to {capture, first, list} or {capture, first, index}.

Check out the code examples here.
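
To make the difference concrete, here is a small sketch (hypothetical names and sample text, not the linked code) that runs the same global search with different capture settings:

```erlang
-module(capture_demo).
-export([find/1]).

%% Capture is one of the atoms all, first or all_but_first.
find(Capture) ->
    Subject = "ab12 cd34",
    Regex = "([a-z]+)([0-9]+)",
    {match, Found} = re:run(Subject, Regex, [global, {capture, Capture, list}]),
    Found.
```

With all (the default), every match is a list holding the whole match plus both subexpressions. With first, only the whole match is kept, and with all_but_first only the subexpressions.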

rustkas · 5 August 2021 16:27

Unsupported Escape Sequences

In Perl, the sequences \l, \L, \u, and \U are recognized by its string handler and used to modify the case of following characters. PCRE does not support these escape sequences.

So the Erlang re module does not implement them either. This means that when replacing text with regular expressions, you cannot change the case of the replacement (from lowercase to uppercase or vice versa).

rustkas · 6 August 2021 06:13

Embedding Conditions

«A powerful yet infrequently used feature of the regular expression language is the capability to embed conditional processing within an expression»
— Ben Forta

This topic is not explained very clearly in the library documentation, which is probably why I did not give it enough attention. Yet this functionality, hidden from the uninitiated, can bring more logic into a regular expression and make it smarter.

Thanks to the tenth chapter of Ben Forta’s book, I was able to understand the essence of this functionality and will be able to apply it in my work when necessary.

Sample code here.
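
As an illustration of an embedded condition, the sketch below validates a phone number in which a closing parenthesis is required only if an opening one was present. The phone format, module name and function name are assumptions made for this example; it is not the linked sample code.

```erlang
-module(conditional_demo).
-export([valid_phone/1]).

%% Accepts "(123)456-7890" and "123-456-7890", rejects mixed forms.
valid_phone(Text) ->
    %% (\()?       optional opening parenthesis, captured as group 1
    %% (?(1)\)|-)  if group 1 took part in the match, require ")",
    %%             otherwise require "-"
    Regex = "^(\\()?[0-9]{3}(?(1)\\)|-)[0-9]{3}-[0-9]{4}$",
    re:run(Text, Regex, [{capture, none}]) =:= match.
```

The condition (?(1)...) refers back to the first group, so valid_phone("(123-456-7890") and valid_phone("123)456-7890") should both return false.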

rustkas · 9 August 2021 14:20

I am happy. The adaptation of examples from Ben Forta’s excellent book has come to an end.

Erlang developers already have two books for Regex study (with the rebar3 projects):

“Introducing Regular Expressions” (Michael Fitzgerald)
Ben Forta’s book (“Regular Expressions in 10 minutes”, “Learning Regular expressions”)

A lot of examples. The adaptation of the examples given in the book to the realities of Erlang helped me to deepen my understanding of working with regular expressions and the peculiarities of the implementation.

It is always important to have a lot of working examples related to the topic.

It would also be great to implement a desktop application to demonstrate how to work with regular expressions (in Appendix C of Ben Forta’s book, you can find a number of implementations in different languages). I’ll think about it.

rustkas · 22 August 2021 10:24

On August 26, I will share my experience of using regular expressions in Erlang. I invite everyone who is interested in this topic to visit.