Using Regular Expressions in Erlang

Because Erlang is not UTF-8 by default (it kind of predates most of that), its charlists use pretty standard Extended ASCII:
Extended ASCII


Yes, it’s all about Unicode. Are you familiar with the material in this article? (It is very important for understanding what happens to characters and how it is recommended to work with them.) It explains a lot, but not everything. Erlang keeps developing, and working with characters has changed significantly (if we compare the current OTP 24 with releases from the first ten or even the second ten).

We can start an Erlang app with Unicode enabled (startup flag +pc unicode). So I did that: if Erlang had Extended ASCII, then evaluating [128] should give us an Extended ASCII character.
I added that option to the rebar3 config too: {erl_opts, [debug_info,{pc, unicode}]}.

Erlang/OTP 24 [erts-12.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V12.0  (abort with ^G)
1> [128].
[128]
2> [1070,1085,1080,1082,1086,1076].
"Юникод"
3> 

Unfortunately, we get a list of data, not an Extended ASCII character. From this I conclude that Extended ASCII does not exist in Erlang. The values of the Extended ASCII elements must be obtained in some other way.
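A minimal sketch of why the second list is shown as a string and the first is not, assuming the shell's decision follows the printability checks in io_lib (the module name printable_demo is made up for illustration). Code point 128 is a C1 control character in both Latin-1 and Unicode, so neither check treats it as printable:

```erlang
-module(printable_demo).
-export([check/0]).

%% Returns {Label, IsPrintable} pairs for a few sample lists.
check() ->
    Cyrillic = [1070,1085,1080,1082,1086,1076],
    [{c1_control,       io_lib:printable_latin1_list([128])},
     {latin1_e_acute,   io_lib:printable_latin1_list([233])},
     %% Code points above 255 are never printable in Latin-1 mode.
     {cyrillic_latin1,  io_lib:printable_latin1_list(Cyrillic)},
     %% In Unicode mode the same list counts as a printable string.
     {cyrillic_unicode, io_lib:printable_unicode_list(Cyrillic)}].
```

So a list is only displayed as text when every element passes the check for the current mode; otherwise it stays a plain list of integers.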


This topic is very interesting to me and I, of course, will try to find answers to my questions. Now, studying how the library re reacts to simple regular expressions, I’m learning a lot of interesting things and in the end I will understand all this.

For example, the shorthand \s finds only(!) a space (ASCII character 32). I expected it to find more characters from the list [9,10,13,32] (that is, to be analogous to the regex [ \\t\\n\\r]). At the same time, \S behaves predictably, like [^ \\t\\n\\r].

tests_09_whitespace_03_tests.erl


Yeah a [128] is not a string, if it displays as something like a string that’s just because it happens to fit, but otherwise that’s just a list of integers. Erlang doesn’t actually have a string type, just lists of integers and binaries are the closest you get, but nothing that would enforce it being a string like you would get in Rust or so.


I asked a question on stackoverflow about the ASCII-related implementation features of the Erlang library for working with regular expressions and received an answer. Excellent!
:slight_smile: :dizzy:


On the subject of the strange behavior of \s, Steve Vinoski gave me some information at stackoverflow.com to think about. Probably, the operation of the re module depends on the system settings (specifically, the value of the LC_ALL variable). It is strange that I did not find anything about this in the documentation for this library.

The road to the truth is long and winding.

I hope there is a way to change system settings inside tests. The application module can help with this. Now I will test its functionality with EUnit. I hope this will be an acceptable solution.


Unfortunately, the application module is not suitable for the purposes of setting up the test configuration. Tests have shown this.


I found the answer to my question (related to \s). More details at stackoverflow.com - question.
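One thing worth ruling out in a case like this, assuming the pattern is written as an ordinary Erlang string literal rather than read from a file: in Erlang source, \s is itself the escape sequence for a space, so the regex engine never sees a backslash unless it is doubled. A small sketch (the module name ws_demo is made up for illustration):

```erlang
-module(ws_demo).
-export([literal_is_space/0, matches/1]).

%% In an Erlang string literal, \s is the escape for a space character.
literal_is_space() ->
    "\s" =:= " ".

%% Returns {SingleBackslashResult, DoubleBackslashResult}
%% for a subject consisting of the one character Char.
matches(Char) ->
    {re:run([Char], "\s",  [{capture, none}]),   % pattern is just " "
     re:run([Char], "\\s", [{capture, none}])}.  % pattern is \s
```

With this, ws_demo:matches(9) (a tab) gives {nomatch, match}, while ws_demo:matches(32) (a space) gives {match, match}.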


Today I have implemented a task related to regular expressions, which the author (Simon St.Laurent) of the very useful book (Introducing Erlang) wanted to do for so long, but somehow there was no time. I have added to the library (erlang-simple-string) the functions necessary for convenient work with regular expressions. They will be useful for those who read this book and want to understand how to work with regular expressions in Erlang.

Now this library is a complete solution describing how to work with strings in Erlang.


To preserve my “discoveries” from studying how Regular Expressions work with certain character classes in Erlang (which, according to the developers’ decision, require additional locale settings), I decided to implement a library in Erlang that would help overcome this limitation. First of all, I implemented the conversion of the restricted shorthand classes into a simpler form. I named it re_tuner.

Further, using the capabilities of regular expressions, which Erlang provides, I would like to implement a parser for regular expression texts that would replace problem shorthand classes with simpler counterparts.

Here’s an example of how such a simple conversion works:


research_01_test() ->
    Expected = true,
    ValidCharacterList = get_valid_character_list(),
    RegularExpression = "[[:punct:]]",
    TunedRegularExpression = re_tuner:tune(RegularExpression),
    {ok, MP} = re:compile(TunedRegularExpression),
    Result = check_all_by_regex(MP, ValidCharacterList, true),
    ?assertEqual(Expected, Result).
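To show the general idea of such a conversion, here is a rough sketch. It is not the actual re_tuner code: the module punct_sketch, its functions, and the choice of hex ranges are my own assumptions for illustration. It replaces the text [[:punct:]] in a pattern with an explicit ASCII character class:

```erlang
-module(punct_sketch).
-export([tune/1, is_punct/1]).

%% Explicit ASCII equivalent of [[:punct:]], written as hex ranges:
%% 21-2F (! to /), 3A-40 (: to @), 5B-60 ([ to `), 7B-7E ({ to ~).
-define(ASCII_PUNCT, "[\\x21-\\x2F\\x3A-\\x40\\x5B-\\x60\\x7B-\\x7E]").

%% Replace every literal occurrence of "[[:punct:]]" in the pattern text.
%% string:replace/4 does plain text replacement, not regex replacement.
tune(Regex) ->
    lists:flatten(string:replace(Regex, "[[:punct:]]", ?ASCII_PUNCT, all)).

%% Check a single ASCII character against the tuned pattern.
is_punct(Char) ->
    {ok, MP} = re:compile("^" ++ tune("[[:punct:]]") ++ "$"),
    re:run([Char], MP, [{capture, none}]) =:= match.
```

For example, punct_sketch:is_punct($!) is true and punct_sketch:is_punct($a) is false.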

While reading the book “Introducing Regular Expressions” and continuing to explore how regular expressions work in Erlang, I would like to share examples of how you can search for text in a case-insensitive manner. This problem can be solved in several ways, and here are examples:

  • This is the default solution.
    Regex = "(?:the)",
    {ok, MP} = re:compile(Regex, [caseless]),
  • And these variants work with the option letter i:
    Regex = "(?i)(?:the)",
    Regex = "(?:(?i)the)",
    Regex = "(?i:the)",
    The option letter i can be inserted between the question mark and the colon.

Full Erlang code solution is here.
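The variants listed above can be compared side by side with a small sketch like this one (the module name caseless_demo and the function all_match/1 are made up for illustration):

```erlang
-module(caseless_demo).
-export([all_match/1]).

%% True if every variant finds "the" in Subject regardless of case.
all_match(Subject) ->
    %% Variant 1: the caseless compile option.
    {ok, Compiled} = re:compile("(?:the)", [caseless]),
    %% Variants 2-4: the inline option letter i.
    Patterns = [Compiled, "(?i)(?:the)", "(?:(?i)the)", "(?i:the)"],
    lists:all(fun(P) ->
                      re:run(Subject, P, [{capture, none}]) =:= match
              end,
              Patterns).
```

caseless_demo:all_match("THE END") returns true, since all four forms behave the same for this subject.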


I turned to another interesting feature of working with regular expressions (with Unicode characters). If you need to find a character whose hexadecimal code has three or more significant digits, the digits must be enclosed in curly braces.

Regex = "\\x{2014}",
Regex = "\\x{6C60}",

Octal numbers have much the same behavior.

The complete example is here.
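A short sketch of both forms of the hexadecimal escape (the module name hex_demo is made up for illustration). It assumes the subject is given as a list of code points and that the unicode option is passed for characters above 255:

```erlang
-module(hex_demo).
-export([short_form/0, braced_form/0, em_dash/0]).

%% Up to two hex digits can be written without braces.
short_form() ->
    re:run("A", "\\x41", [{capture, none}]).

%% The braced form works for the same character.
braced_form() ->
    re:run("A", "\\x{41}", [{capture, none}]).

%% Longer codes need braces; U+2014 is an em dash.
%% The unicode option lets both pattern and subject go beyond 255.
em_dash() ->
    re:run([16#2014], "\\x{2014}", [unicode, {capture, none}]).
```

All three functions return match.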


Regular expressions are fraught with an inextricable stream of subtleties that you need to know in order to get anything valuable. :sunglasses:

Here’s another one. Note that there is no mention of the multiline option in the documentation for the re:replace/4 function. But if you want to make a slightly more complex replacement, such as this one (a sort of text to html converter example),

    FileContent = read_rime(),
    Regex = "(^.*$)",
    
    Markup =
        "
\t<!DOCTYPE html>\n
    <html lang=\"en\">\n
\t	\t<head>\n
\t		\t\t<title>&</title>\n
\t		\t\t<meta charset=\"utf-8\"/>\n
    </head>\n
\t<body>\n
<h1>&</h1>
\t",
    NewContent =
        re:replace(FileContent, Regex, Markup, [multiline,{return, list}]),
% ...

Full code example is here.

then you will definitely need this option, since the metacharacters ^ and $ do not work without it.
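The effect of the option is easier to see on a two-line string. This is a reduced sketch, not the converter itself; the module name multiline_demo and the function wrap/2 are made up for illustration:

```erlang
-module(multiline_demo).
-export([wrap/2]).

%% Wrap every line of Text in <p>...</p>.
%% ExtraOpts is [] or [multiline]; & in the replacement
%% stands for the whole matched text.
wrap(Text, ExtraOpts) ->
    re:replace(Text, "^.*$", "<p>&</p>",
               ExtraOpts ++ [global, {return, list}]).
```

multiline_demo:wrap("a\nb", [multiline]) returns "<p>a</p>\n<p>b</p>", while multiline_demo:wrap("a\nb", []) returns the text unchanged, because without multiline ^ and $ only anchor at the start and end of the whole subject and the dot does not cross the newline.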


I did it! :sunglasses:
Ninth chapter of “Introducing Regular Expressions”.

This is where the most interesting work takes place. The author proposes to implement text transformation to html5 using regular expressions and auxiliary tools: sed (“stream editor”) and the Perl programming language capabilities.

I really wanted to make a similar solution using the Erlang programming language and the capabilities of the re module.
It was not easy to come up with similar solutions. But everything worked out. I am very happy about that. :blush:

The work done can be found here.
I have also implemented the escript-app.


The End of the Beginning
I have worked through all the material in the book “Introducing Regular Expressions”. Chapter 10 inspired me to keep deepening my knowledge of regular expressions. Thanks to this book and its explanations of the examples, I have developed a clear understanding of what regular expressions are capable of. Reworking the book's material in Erlang allowed me to gain skills in working with the re library in various, sometimes difficult situations.

I recommend this book as an introduction to regular expressions, and my research too (in the form of projects, as a good illustration of the author’s examples using Erlang).

While reading the book, I also created three hex.pm projects that helped me implement tests of various regular expression functionality (these are my first hex.pm projects :blush:):


«As with any language, the key to learning regular expressions is practice, practice, practice»
Ben Forta

Ben Forta wrote a book (two editions: “Regular Expressions in 10 minutes” and “Learning Regular expressions”) with a structure and purpose very similar to Michael Fitzgerald’s book “Introducing Regular Expressions”.

I am interested in how he presented this difficult material in a simple way, and I want to work through his books in the same way as I did with the one I have already studied.

I created a new project on GitHub and enthusiastically set to work (I will work with the material, taking into account the already acquired knowledge about regular expressions and the re module.).


Ben Forta’s book satisfies my expectations. Now I am at the fourth chapter. The book describes regular expressions in a simple way, and the author shares his pearls of knowledge. Adapting the examples of this book to the capabilities of Erlang is exciting: I find new features for working with strings (regular expressions are used everywhere, implicitly or explicitly). For example, Erlang considers these two notations to be absolutely the same:
"myArray\[[0-9]\]"
"myArray[[0-9]]"

The \ character is ignored. It is interesting.

This question worries anyone who has ever wanted to dig deeper with this backslash in Erlang.
http://erlang.org/pipermail/erlang-questions/2009-May/043755.html

I have done a lot of experiments researching this question. I came to the conclusion that this is a limitation of Erlang when working with strings. It seems it can be overcome only if you write the text with a backslash to a text file and read it back, getting a line with the expected content (with a backslash).

You can search for a double backslash as follows:
Regex = "\\\\",


This practice of programming in Erlang develops me as a programmer: I find new ways to organize testing, simplify the project, and improve the libraries I have created.


They are indeed the same string, and therefore will result in the same regex.

As there is no first-level syntax support for regexes in Erlang, you have to escape the backslash in the string so that the regex engine actually sees it.
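A small sketch of what that means in practice (the module name backslash_demo and its function names are made up for illustration):

```erlang
-module(backslash_demo).
-export([same_string/0, lengths/0, find_backslash/1,
         unescaped_brackets/1, escaped_brackets/1]).

%% "\[" is not a known escape sequence, so only "[" ends up in the string.
same_string() ->
    "myArray\[[0-9]\]" =:= "myArray[[0-9]]".

%% "\\" is one character (a backslash), "\\\\" is two.
lengths() ->
    {length("\\"), length("\\\\")}.

%% The regex engine needs \\ to match one literal backslash,
%% which takes four backslashes in Erlang source.
find_backslash(Subject) ->
    re:run(Subject, "\\\\", [{capture, none}]).

%% Without doubled backslashes the engine sees the class [[0-9]
%% followed by a literal ].
unescaped_brackets(Subject) ->
    re:run(Subject, "myArray[[0-9]]", [{capture, none}]).

%% With doubled backslashes the engine sees myArray\[[0-9]\].
escaped_brackets(Subject) ->
    re:run(Subject, "myArray\\[[0-9]\\]", [{capture, none}]).
```

For the subject "myArray[0]", unescaped_brackets/1 returns nomatch and escaped_brackets/1 returns match.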


Most interestingly, Erlang strings also work in the same way outside of regular expressions too.


Actively working with Erlang, I recognize that the better I understand how Erlang works with regular expressions, the clearer it becomes to work with textual information in all other functions related to the presentation or search of textual information.


I really enjoy working with editions of Ben Forta’s book. At the end of the fifth chapter, an interesting example appeared, which helped me illustrate the unique option of compiling a regular expression in the library re. This option is ungreedy.

{ok, MP} = re:compile(Regex, [ungreedy]),

Inverts the “greediness” of the quantifiers so that they are not greedy by default, but become greedy if followed by “?”. It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern.

If option ungreedy is set (an option that is not available in Perl), the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. That is, it inverts the default behavior.

Thanks to it, there is no need to set the lazy quantifier (?) explicitly. It is set implicitly. This is sometimes convenient. I am glad that the author of the book inspired me to implement this example.
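To see the inversion on a concrete string, here is a small sketch (the module name ungreedy_demo, the function first_match/2, and the sample text are made up for illustration):

```erlang
-module(ungreedy_demo).
-export([first_match/2]).

%% Return the text of the first match of Regex in a fixed sample string.
first_match(Regex, Opts) ->
    Subject = "<b>one</b> and <b>two</b>",
    {match, [Text]} = re:run(Subject, Regex,
                             Opts ++ [{capture, first, list}]),
    Text.
```

Here first_match("<b>.*</b>", []) returns the whole sample string, while first_match("<b>.*</b>", [ungreedy]) and first_match("(?U)<b>.*</b>", []) both return "<b>one</b>". With the option set, adding the question mark flips the quantifier back: first_match("<b>.*?</b>", [ungreedy]) returns the whole string again.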

How it looks in the code - here.
