Using Regular Expressions in Erlang

Here is another interesting feature of working with regular expressions (with Unicode characters). If you need to match a character whose hexadecimal code has three or more significant digits, those digits must be enclosed in curly braces.

Regex = "\\x{2014}",
Regex = "\\x{6C60}",

Octal numbers have much the same behavior.
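
A minimal sketch of the brace rule for hexadecimal escapes. The module name hex_escape_demo is hypothetical, and the unicode option is my assumption about how the subject is passed (it is needed here because the subject holds code points above 255).

```erlang
-module(hex_escape_demo).
-export([demo/0]).

%% Checks the two-digit and the braced hexadecimal escape forms.
demo() ->
    %% Two hex digits fit without braces: \x41 is "A".
    match = re:run("A", "\\x41", [{capture, none}]),
    %% U+2014 has four hex digits, so the braces are required.
    match = re:run([16#2014], "\\x{2014}", [unicode, {capture, none}]),
    match = re:run([16#6C60], "\\x{6C60}", [unicode, {capture, none}]),
    ok.
```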

The complete example is here.


Regular expressions are full of subtleties that you need to know in order to get anything valuable out of them. :sunglasses:

Here’s another one. Note that the documentation for the re:replace/4 function does not mention the multiline option. But if you want to make a slightly more complex replacement, such as this one (a sort of text-to-HTML converter example),

    FileContent = read_rime(),
    Regex = "(^.*$)",

    %% "&" in the replacement stands for the whole match (here, the first line).
    Markup =
        "<!DOCTYPE html>\n"
        "<html lang=\"en\">\n"
        "\t<head>\n"
        "\t\t<title>&</title>\n"
        "\t\t<meta charset=\"utf-8\"/>\n"
        "\t</head>\n"
        "\t<body>\n"
        "\t\t<h1>&</h1>\n",
    NewContent =
        re:replace(FileContent, Regex, Markup, [multiline, {return, list}]),
% ...

Full code example is here.

then you will definitely need this option, since without it the metacharacters ^ and $ match only at the very start and end of the whole text, not at the start and end of each line.
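
A small sketch of the difference the multiline option makes. The module name multiline_demo and the sample text are made up for illustration; only re:run/3 and re:replace/4 from the standard re module are used.

```erlang
-module(multiline_demo).
-export([demo/0]).

demo() ->
    Text = "first line\nsecond line",
    %% Without multiline, ^ and $ anchor to the whole text only.
    nomatch = re:run(Text, "^second line$", [{capture, none}]),
    match = re:run(Text, "^second line$", [multiline, {capture, none}]),
    %% With multiline and global every line is wrapped; & is the whole match.
    "<p>first line</p>\n<p>second line</p>" =
        re:replace(Text, "^.+$", "<p>&</p>",
                   [multiline, global, {return, list}]),
    ok.
```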


I did it! :sunglasses:
Ninth chapter of “Introducing Regular Expressions”.

This is where the most interesting work takes place. The author proposes to transform text to HTML5 using regular expressions and auxiliary tools: sed (“stream editor”) and the Perl programming language.

I really wanted to make a similar solution using the Erlang programming language and the capabilities of the re module.
It was not easy to come up with similar solutions. But everything worked out. I am very happy about that. :blush:

The work done can be found here.
I have also implemented an escript application.


The End of the Beginning
I have worked through all the material in the book “Introducing Regular Expressions”. Chapter 10 inspired me to keep deepening my knowledge of regular expressions. Thanks to this book and its explanations of the examples, I have developed a clear understanding of what regular expressions are capable of. Adapting the book’s material to Erlang gave me skills in working with the re library in various, sometimes difficult, situations.

I recommend this book as an introduction to regular expressions, and my research too (in the form of projects, as a good illustration of the author’s examples using Erlang).

While reading the book, I also created three hex.pm projects that helped me implement tests of various regular expression functionality (these are my first hex.pm projects :blush:):


«As with any language, the key to learning regular expressions is practice, practice, practice»
Ben Forta

Ben Forta wrote a book (in two editions, “Regular Expressions in 10 minutes” and “Learning Regular expressions”) with a structure and purpose very similar to those of Michael Fitzgerald’s book “Introducing Regular Expressions”.

I am interested in how he presented this difficult material in a simple way, and I want to work through his books in the same way as I did with the one I have already studied.

I created a new project on GitHub and enthusiastically set to work (I will work with the material taking into account the knowledge I have already acquired about regular expressions and the re module).


Ben Forta’s book meets my expectations. I am now on the fourth chapter. The book describes regular expressions in a simple way, and the author shares his pearls of knowledge. Adapting the examples of this book to the capabilities of Erlang is exciting: I keep finding new features for working with strings (regular expressions are used everywhere, implicitly or explicitly). For example, Erlang considers these two notations to be absolutely the same:
"myArray\[[0-9]\]"
"myArray[[0-9]]"

The \ character is ignored. It is interesting.

This question puzzles anyone who has ever tried to dig deeper into the backslash in Erlang.
http://erlang.org/pipermail/erlang-questions/2009-May/043755.html

I have done a lot of experiments researching this question. I came to the conclusion that this is a limitation of Erlang when working with strings. It can be overcome if you write the text with a backslash to a text file and read it back, which gives a line with the expected content (with a backslash).

You can search for a literal backslash by doubling it twice (the regex engine then receives \\, which matches a single backslash):
Regex = "\\\\",


This practice of programming in Erlang develops me as a programmer: I find new ways to organize testing, simplify the project, and improve the libraries I have created.


They are indeed the same string, and therefore will result in the same regex.

As there is no first-class syntax support for regexes in Erlang, you have to escape the backslash in the string so that the regex engine actually sees it.
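
A short sketch of this point, reusing the two strings from the question. The module name backslash_demo and the subject "myArray[0]" are made up for illustration.

```erlang
-module(backslash_demo).
-export([demo/0]).

demo() ->
    %% The unknown escape \[ collapses to [ when the source is read,
    %% so both literals denote the same string.
    true = ("myArray\[[0-9]\]" =:= "myArray[[0-9]]"),
    %% That string is the regex myArray[[0-9]]: a class of "[" or a digit,
    %% followed by a literal "]", so it does not match "myArray[0]".
    nomatch = re:run("myArray[0]", "myArray[[0-9]]", [{capture, none}]),
    %% Doubling the backslash lets the regex engine see \[ and \].
    match = re:run("myArray[0]", "myArray\\[[0-9]\\]", [{capture, none}]),
    %% A literal backslash in the subject: "a\\b" is the 3 characters a, \, b.
    match = re:run("a\\b", "\\\\", [{capture, none}]),
    ok.
```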


Most interestingly, Erlang strings work the same way outside of regular expressions too.


Working actively with Erlang, I find that the better I understand how Erlang handles regular expressions, the clearer all the other functions for presenting or searching textual information become.


I really enjoy working with the editions of Ben Forta’s book. At the end of the fifth chapter there is an interesting example that helped me illustrate a distinctive compile option of the re library: ungreedy.

{ok,MP}= re:compile(Regex,[ungreedy]),

Inverts the “greediness” of the quantifiers so that they are not greedy by default, but become greedy if followed by “?”. It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern.

If option ungreedy is set (an option that is not available in Perl), the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. That is, it inverts the default behavior.

Thanks to it, there is no need to set the lazy quantifier (?) explicitly. It is set implicitly. This is sometimes convenient. I am glad that the author of the book inspired me to implement this example.
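
A minimal sketch of the effect. The module name ungreedy_demo and the sample text are made up for illustration and are not taken from the linked project.

```erlang
-module(ungreedy_demo).
-export([demo/0]).

demo() ->
    Text = "<b>one</b> and <b>two</b>",
    %% Greedy by default: .* runs to the last closing tag.
    {match, ["<b>one</b> and <b>two</b>"]} =
        re:run(Text, "<b>.*</b>", [{capture, first, list}]),
    %% The explicit lazy quantifier stops at the first closing tag.
    {match, ["<b>one</b>"]} =
        re:run(Text, "<b>.*?</b>", [{capture, first, list}]),
    %% With ungreedy the plain .* behaves like the lazy form.
    {ok, MP} = re:compile("<b>.*</b>", [ungreedy]),
    {match, ["<b>one</b>"]} = re:run(Text, MP, [{capture, first, list}]),
    ok.
```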

How it looks in the code - here.


Another option with no Perl equivalent is the compile option dollar_endonly. It is convenient in combination with the notempty option of re:run or re:replace.

A dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before a newline at the end of the string (but not before any other newlines). This option is ignored if option multiline is specified. There is no equivalent option in Perl, and it cannot be set within a pattern.
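
The quoted behavior can be checked with a small sketch. The module name dollar_demo and the subject strings are made up for illustration.

```erlang
-module(dollar_demo).
-export([demo/0]).

demo() ->
    %% By default $ also matches just before a trailing newline.
    match = re:run("abc\n", "abc$", [{capture, none}]),
    %% With dollar_endonly, $ matches only at the very end of the subject.
    nomatch = re:run("abc\n", "abc$", [dollar_endonly, {capture, none}]),
    match = re:run("abc", "abc$", [dollar_endonly, {capture, none}]),
    ok.
```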

Code example is here.


Another interesting situation when working with regex in Erlang arises when you apply a pattern that was not prepared for Unicode to text that contains Unicode. Most of the time I was getting this error:

*failed*
in function re:run/3 (re.erl, line 788)
  called as run([60,115,99,114,105,112,116,62,13,10,102,117,110,99|...],{re_pattern,0,0,0,<<69,82,67,80,113,0,0,0,2,...>>},[global,{capture,all,list}])
...

**error:badarg
  output:<<"">>

This behavior is a strong indication that the input contains Unicode. The solution is either to fix the regex pattern or to change the input.
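
One possible way to fix the pattern side is the unicode option of the re module. This is a sketch under my own assumptions: the module name unicode_demo and the sample subject are hypothetical, and the failing call in the log above may have had a different cause.

```erlang
-module(unicode_demo).
-export([demo/0]).

demo() ->
    %% A six-character Cyrillic word given as a list of code points above 255.
    Subject = [1055, 1088, 1080, 1074, 1077, 1090],
    %% Without the unicode option such a list is not valid iodata,
    %% so re:run/2 raises badarg.
    badarg = try re:run(Subject, ".+") of
                 _ -> no_error
             catch
                 error:badarg -> badarg
             end,
    %% With unicode the subject is treated as a list of code points.
    {match, [Subject]} =
        re:run(Subject, ".+", [unicode, {capture, first, list}]),
    ok.
```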


I continue to share bits of info about working with regular expressions in Erlang (I study this topic by combining regex study material with the Erlang re module documentation, using TDD).

Using Subexpressions

Working with subexpressions (that is, expressions enclosed in parentheses) has its own peculiarities. When searching globally (g - global), it matters which elements are captured, and Erlang provides a group of capture settings that are best specified explicitly. Often you want only the whole match, not the subexpressions, and a capture setting helps with this. The default result may confuse you (as it did me), since in addition to the whole match it includes the matches of the subexpressions, which often simply clutter up the data you need. Therefore, it is important to set the capture parameter to {capture, first, list} or {capture, first, index}.
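
A small sketch of the difference between capturing everything and capturing only the whole match. The module name capture_demo, the date pattern and the sample text are made up for illustration.

```erlang
-module(capture_demo).
-export([demo/0]).

demo() ->
    Text = "2024-01-15 and 2025-02-20",
    Regex = "(\\d{4})-(\\d{2})-(\\d{2})",
    %% Capturing all groups: each match is followed by its subexpressions.
    {match, [["2024-01-15", "2024", "01", "15"],
             ["2025-02-20", "2025", "02", "20"]]} =
        re:run(Text, Regex, [global, {capture, all, list}]),
    %% Capturing only the first element: just the whole matches.
    {match, [["2024-01-15"], ["2025-02-20"]]} =
        re:run(Text, Regex, [global, {capture, first, list}]),
    %% The same as {Offset, Length} pairs.
    {match, [[{0, 10}], [{15, 10}]]} =
        re:run(Text, Regex, [global, {capture, first, index}]),
    ok.
```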

Check out the code examples here.


Unsupported Escape Sequences

In Perl, the sequences \l, \L, \u, and \U are recognized by its string handler and used to modify the case of following characters. PCRE does not support these escape sequences.

So the Erlang re module does not implement them either. This means that when replacing text with a regular expression, you cannot change the case (from lowercase to uppercase and vice versa) in the replacement string.


Embedding Conditions

«A powerful yet infrequently used feature of the regular expression language is the capability to embed conditional processing within an expression»
Ben Forta

This topic is not explained very clearly in the library documentation, which is probably why I did not give it enough attention. Meanwhile, this functionality, hidden from the uninitiated, can bring more logic into a regular expression and make it work even smarter.

Thanks to the tenth chapter of Ben Forta’s book, I was able to understand the essence of this functionality and will be able to apply it, if necessary, in my work.
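
A minimal sketch of an embedded condition of the form (?(1)yes|no), which tests whether capture group 1 took part in the match. The module name condition_demo and the phone-number pattern are my own illustration and are not copied from the linked sample code.

```erlang
-module(condition_demo).
-export([demo/0]).

%% A closing parenthesis is required only if an opening one was matched;
%% otherwise a dash must follow the first three digits.
valid(Phone) ->
    Regex = "^(\\()?\\d{3}(?(1)\\)|-)\\d{3}-\\d{4}$",
    re:run(Phone, Regex, [{capture, none}]) =:= match.

demo() ->
    true = valid("(123)456-7890"),
    true = valid("123-456-7890"),
    false = valid("(123-456-7890"),
    false = valid("123)456-7890"),
    ok.
```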


Sample code here.


I am happy. :blush: The adaptation of examples from Ben Forta’s excellent book has come to an end.

Erlang developers already have two books for Regex study (with the rebar3 projects):

  1. “Introducing Regular Expressions” (+GitHub).
  2. “Learning Regular expressions” (+GitHub).

There are a lot of examples. Adapting the examples from the books to the realities of Erlang helped me deepen my understanding of regular expressions and of the peculiarities of their implementation.

It is always important to have a lot of working examples related to the topic.


It would also be great to implement a desktop application that demonstrates how to work with regular expressions (in Appendix C of Ben Forta’s book, you can find a number of implementations in different languages). I’ll think about it.


On August 26, I will share my experience of using regular expressions in Erlang. I invite everyone who is interested in this topic to visit.


The conference was rich and informative. I hope that everyone who wanted to attend it got access to it.

I continue my research on regular expressions and have started adapting the code examples from the book “Regular Expressions Cookbook, 2nd Edition”. The authors are real gurus of regular expressions who share their deep experience. It is very interesting to apply the experience, ideas and achievements of regular expression experts to the possibilities of the re module.


I am improving the Erlang documentation by adding information on the union, intersection and subtraction of character classes.

I hope this material will be available in the next OTP version.


I am impressed with the global approach of the authors of the book “Regular Expressions Cookbook, 2nd Edition”. They analyzed the capabilities of the standard libraries that ship with programming languages (such as C#, Java, JavaScript, Perl, PHP, Python, Ruby and VB.NET) and offered readers solutions to typical problems that a programmer encounters when working with regular expressions.
Unfortunately, the authors did not include Erlang, let alone Elixir. I hope that if the next edition of this book comes out, this shortcoming will be eliminated (at least I will suggest this to the authors of the book).

I have already adapted the second chapter and most of the third to Erlang’s capabilities. I want regular expressions to help solve complex problems and simplify sometimes non-trivial solutions. I have already implemented functions that I could not find in the Erlang standard library (match_chain/2, match_evaluator/3); I implement them myself and share the results of my research in the re_tuner library.


When working with regular expressions, you need to be aware that line endings are a serious source of errors. If the line-ending setting passed to the regex engine does not agree with the line endings actually used in the text being examined, the results will be incorrect.

Linux \n:
11

Windows \r\n:
22

To prevent possible errors, you can use two methods:

  1. affecting the input text,
  2. setting parameters for regular expression execution.

In my opinion, it is more convenient to use the first method.
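
For comparison, the second method might look like the sketch below, which uses the {newline, anycrlf} option of the re module. The module name newline_demo and the sample text are made up for illustration.

```erlang
-module(newline_demo).
-export([demo/0]).

demo() ->
    %% Text with Windows line endings.
    Text = "one\r\ntwo\r\n",
    Opts = [multiline, global, {capture, first, list}],
    %% With the default LF convention the \r sits between the word and
    %% the newline, so $ never matches right after the word.
    nomatch = re:run(Text, "^\\w+$", Opts),
    %% anycrlf makes the engine accept CR, LF and CRLF as line endings.
    {match, [["one"], ["two"]]} =
        re:run(Text, "^\\w+$", [{newline, anycrlf} | Opts]),
    ok.
```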


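-spec sanitize_text(Text) -> Result
    when
        Text :: string(),
        Result :: string().

%% Normalizes Windows line endings (\r\n) to Unix ones (\n).
sanitize_text(Text) when is_list(Text) ->
    %% string:replace/4 returns chardata (a list of parts), so flatten it
    %% to get the flat string() promised by the spec.
    lists:flatten(string:replace(Text, "\r\n", "\n", all)).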

Source code.
