Using Regular Expressions in Erlang

After going through Erlang books and other resources, I have found that regular expressions are either mentioned only in passing or not covered at all. Regular expressions are a powerful helper in any programming language, so I wanted to know how they are implemented in Erlang.

When I started implementing my CSV converter solution for input text encoding (you can read about it in the Journal for the “Property-Based Testing with PropEr, Erlang, and Elixir” book), I decided to try the Erlang re library. I soon realized that it is very capable and that I need to understand it in more detail; used skillfully, it is a convenient tool for implementing all sorts of ideas.

I started reading “Introducing Regular Expressions”. The book’s source code examples and the Erlang re library documentation gave me enough information to start my research in this area (implemented as a rebar3-based Erlang project).

If anyone has anything to share on this topic, please do share! :bulb: :blush:



Is there anything like https://rubular.com for Erlang or Elixir? If not that could be a neat project idea for someone! :nerd_face:

I use that site whenever I need to work with them in Ruby - it’s much quicker than testing them each time in a console…


The Erlang re module uses PCRE for its implementation.
RegExr is a very useful tool.


I want to share some of what I wrote while researching how the re library works.

The first thing I noticed was that the backslash has to be escaped (\\d). As a result, regular expression patterns have a distinctive look:


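    % Expected result; InputString is taken from it so both stay in sync.
    Expected = {match, ["707-827-7019"]},
    {match, [InputString]} = Expected,
    % Every regex backslash is doubled because the pattern is an Erlang string literal.
    RegularExpression = "^(\\(\\d{3}\\)|^\\d{3}[.-]?)?\\d{3}[.-]?\\d{4}$",
    {ok, MP} = re:compile(RegularExpression),
    % The final pattern match asserts that re:run/3 returns the expected tuple.
    Expected = re:run(InputString, MP, [{capture, first, list}]).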

Next, I wrote some code that shows which characters a particular character class matches.

    % lists:foreach/2 always returns ok, so that is the expected result.
    Expected = ok,
    ValidCharacterList = lists:seq(0, 255),
    % Character class under study; \f is the form feed character (code 12).
    RegularExpression = "[\\f]",
    {ok, MP} = re:compile(RegularExpression),
    Result =
        lists:foreach(fun(Elem) ->
                         % Test each code as a one-character string.
                         case re:run([Elem], MP) of
                             {match, _} ->
                                 ?debugFmt("Found! = ~p~n", [Elem]);
                             nomatch ->
                                 ok
                         end
                      end,
                      ValidCharacterList),
    ?assertEqual(Expected, Result).

By changing the value of the regular expression, you can study the contents of any character class (such as \d, \D, \w, \W, \s, \S).
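The same idea can be packaged as a small function that returns the matching codes instead of printing them. This is only a sketch; the module name class_probe and the function matching_codes/1 are made up for illustration and are not part of re or of my project.

```erlang
-module(class_probe).
-export([matching_codes/1]).

%% Returns the codes in 0..255 that Pattern matches when each code is
%% tested on its own as a one-character string.
matching_codes(Pattern) ->
    {ok, MP} = re:compile(Pattern),
    [C || C <- lists:seq(0, 255), re:run([C], MP) =/= nomatch].
```

For example, class_probe:matching_codes("\\d") returns "0123456789", and class_probe:matching_codes("[\\f]") returns [12].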


That’s not a regex thing, just a string-literal thing; it’s pretty universal when putting a regex in a string literal in most programming languages.

This is one of the reasons that Elixir added raw string syntaxes and sigils, lol. ^.^

But yeah, you have backslash escapes for the programming language string literals, then another “level” of backslash escapes for the regex backslashes. Same as in C or Java or whatever else. Thankfully a lot of languages are starting to come with ‘raw string literals’ in some form or another now though.
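A minimal sketch of those two levels in Erlang (the module name escape_levels and its function names are hypothetical, chosen only for this illustration):

```erlang
-module(escape_levels).
-export([pattern_chars/0, first_digit/1]).

%% The Erlang literal "\\d" holds only two characters: a backslash (92)
%% and the letter d (100). That two-character text is what PCRE receives.
pattern_chars() ->
    "\\d".

%% PCRE then reads \d as "any digit" and returns the first one found.
first_digit(Text) ->
    re:run(Text, pattern_chars(), [{capture, first, list}]).
```

Here escape_levels:pattern_chars() evaluates to [92,100], and escape_levels:first_digit("a7") returns {match,["7"]}.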


You are right, \\ is not a regex thing; \d is.

Unfortunately, Erlang doesn’t have those Elixir syntax features, but you stop noticing once you have programmed in it a lot. Erlang was made for telephone exchanges, Elixir for the Web; Erlang has since been adapted for the Web.

In Erlang, to avoid cluttering the text with escapes (for example, a literal \ in a regular expression pattern has to be written as \\\\), you can use the hex code of the character instead (for \ it is \\x5C). This can be extremely convenient. The same works for ".
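A short sketch comparing the two spellings; the module name backslash_demo is hypothetical and exists only for this illustration:

```erlang
-module(backslash_demo).
-export([same_match/0]).

%% Searches the three-character input a\b for a literal backslash,
%% once with the escaped form and once with the hex form of the pattern.
same_match() ->
    Input = "a\\b",
    Escaped = re:run(Input, "\\\\", [{capture, first, list}]),
    Hex = re:run(Input, "\\x5C", [{capture, first, list}]),
    {Escaped, Hex}.
```

Both elements of the returned tuple are {match,["\\"]}, so the two patterns find the same single backslash.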


Lol, never thought of using the hexcode, nice. ^.^

Normally I put regex in the erlang environment and just grab it from that. Like from a config file or so.
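Roughly what that could look like, as a sketch only. The application name my_app, the key phone_regex, and the module name regex_env are all made up for this example, and application:set_env/3 stands in for a value that would normally come from a config file.

```erlang
-module(regex_env).
-export([demo/0]).

demo() ->
    %% Stand-in for a config entry; my_app and phone_regex are hypothetical names.
    ok = application:set_env(my_app, phone_regex, "^\\d{3}-\\d{4}$"),
    %% Fetch the pattern from the application environment and use it.
    {ok, Pattern} = application:get_env(my_app, phone_regex),
    re:run("555-1234", Pattern, [{capture, none}]).
```

With {capture, none} the call returns the atom match or nomatch rather than captured text.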


Regex is such a nice tool, but unfortunately I rarely use it, so every time I do it feels a bit awkward at the beginning.
I also use the already mentioned RegExr to write the expression first.

It’s always good to know about regex when solving some coding puzzles like Advent of Code. I used it there with Erlang :slight_smile:


Joooin uuuusss…

Lol, not kidding though, I use regex at least once a day, and occasionally 150 times in a day. ^.^;

Not even with just programming stuff, but in just text editing most often.


So do I. Regular expressions help me when editing text, where it is sometimes necessary to replace combinations of characters, for example swapping square brackets and parentheses.
Here is a list of data (my favorite mistake in markdown format):

- (Day 1)[code/day01]
...

A series of simple regular expressions (search pattern -> replacement) helps me fix this:

  1. \(Day -> \[Day
  2. \] -> \)
  3. \[code -> \(code
  4. \](\r\n)?$ -> \)\r\n
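In Erlang the same kind of repair can also be sketched as a single re:replace/4 call with two capture groups. This is a different technique from the step-by-step editor replacements above, and the module name md_link_fix is hypothetical.

```erlang
-module(md_link_fix).
-export([fix/1]).

%% Turns "(text)[target]" into "[text](target)".
%% Group 1 captures the text in parentheses, group 2 the text in brackets;
%% \\1 and \\2 in the replacement refer back to those groups.
fix(Line) ->
    re:replace(Line,
               "\\(([^)]*)\\)\\[([^\\]]*)\\]",
               "[\\1](\\2)",
               [global, {return, list}]).
```

For example, md_link_fix:fix("- (Day 1)[code/day01]") returns "- [Day 1](code/day01)".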

AdventOfCode_2020_Erlang


I gathered my small study of regular expressions in Erlang for the characters of the ASCII table into a library, ascii_table (it is the first library I have published on hex.pm :slight_smile:).

The function ascii_table:print() prints a table with the most suitable regular expression for each character of the ASCII table.

Dec  Hx     Oct    Regex  String  Description
0    16#0   8#0    \x0    NULL    (null)
1    16#1   8#1    \x1    SOH     (start of heading)
2    16#2   8#2    \x2    STX     (start of text)
3    16#3   8#3    \x3    ETX     (end of text)
4    16#4   8#4    \x4    EOT     (end of transmission)
5    16#5   8#5    \x5    ENQ     (enquiry)
6    16#6   8#6    \x6    ACK     (acknowledge)
7    16#7   8#7    \x7    BEL     (bell)
...

To study how regular expressions work in Erlang, I created a “test lab”: an Erlang module whose structure makes it convenient to experiment with regular expressions.

../posix_01_alpha_tests.erl.

% For research mode, activate the RESEARCH constant.
% Letters.
-module(posix_01_alpha_tests).
%-define(RESEARCH, true).
-define(REGEX, "[[:alpha:]]").

%%
%% Tests
%%
-ifdef(TEST).

-include_lib("eunit/include/eunit.hrl").

-ifdef(RESEARCH).

letters_research_test() ->
    Expected = ok,
    ValidCharacterList = lists:seq(0, 255),
	
    RegularExpression = ?REGEX,
    {ok, MP} = re:compile(RegularExpression),
    Result =
        lists:foreach(fun(Elem) ->
                         case re:run([Elem], MP) of
                             {match, _} ->
                                 ?debugFmt("Found! = ~p~n", [Elem]);
                             nomatch ->
                                 ok
                         end
                      end,
                      ValidCharacterList),
    ?assertEqual(Expected, Result).

-else.

letters_research_01_test() ->
    Expected = true,
    ValidCharacterList =
        lists:seq(65, 90)
        ++ lists:seq(97, 122)
        ++ [170, 181, 186]
        ++ lists:seq(192, 255),
    % POSIX class under test.
    RegularExpression = ?REGEX,
    {ok, MP} = re:compile(RegularExpression),
    % Note: this succeeds as soon as any one character in the list matches.
    {match, _} = re:run(ValidCharacterList, MP),
    Result = true,
    ?assertEqual(Expected, Result).

-endif.

-endif.

The outcome of each such study is a test that captures its findings. The RESEARCH macro switches between two modes: exploration and working mode.
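The working-mode test above passes the whole list to re:run/2 in one call, which reports a match as soon as any character matches. A stricter variant could check every character on its own. The sketch below assumes a hypothetical helper module class_check; it is not part of my project.

```erlang
-module(class_check).
-export([non_matching/2]).

%% Returns the characters of Chars that Pattern does NOT match when each
%% one is tested on its own. An empty list means every character matched.
non_matching(Pattern, Chars) ->
    {ok, MP} = re:compile(Pattern),
    [C || C <- Chars, re:run([C], MP) =:= nomatch].
```

A test could then assert ?assertEqual([], class_check:non_matching(?REGEX, ValidCharacterList)), so any unexpected character shows up in the failure message.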


While doing my research, I noticed that the Erlang implementation has its own peculiarities, and some questions remain unanswered. For example, when I studied the POSIX character classes over the range [0-255], characters above 127 were matched in addition to the expected ones (letters, digits).

[[:upper:]]

    ValidCharacterList =
        lists:seq(65, 90)
        ++ lists:seq(192, 222),

I don’t know why they are added. If anyone knows, please explain! :slight_smile:


Because Erlang is not UTF-8 by default (it kind of predates most of that), so its charlists use pretty standard Extended ASCII:
Extended ASCII


Yes, it’s all about Unicode. Are you familiar with the material in this article? It is very important for understanding what happens to characters and how it is recommended to work with them. It explains a lot, but not everything. Erlang keeps evolving, and working with characters has changed significantly (compare the current OTP 24 with versions from the first ten releases or even the second ten).

We can start Erlang with Unicode enabled (startup flag +pc unicode). So I did, reasoning that if Erlang had Extended ASCII, evaluating [128] would give us an Extended ASCII character.
I added that option to the rebar3 config too: {erl_opts, [debug_info,{pc, unicode}]}.

Erlang/OTP 24 [erts-12.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit]

Eshell V12.0  (abort with ^G)
1> [128].
[128]
2> [1070,1085,1080,1082,1086,1076].
"Юникод"
3> 

Unfortunately, we get back a list of integers, not an Extended ASCII character. From this I conclude that Extended ASCII does not exist in Erlang. The Extended ASCII characters must be obtained in some other way.


This topic is very interesting to me and I will, of course, try to find answers to my questions. Right now, studying how the re library reacts to simple regular expressions, I’m learning a lot of interesting things, and in the end I will understand all of this.

For example, the shorthand \s matches only(!) a space (ASCII character 32). I expected more characters, namely the list [9,10,13,32] (that is, I expected it to be equivalent to the regex [ \\t\\n\\r]). At the same time, \S behaves predictably, like [^ \\t\\n\\r].
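One thing that may be worth ruling out here (an assumption on my part, not a confirmed cause of the result above): in an Erlang string literal, "\s" with a single backslash is itself the escape for the space character, so a pattern written that way is a literal space and never reaches PCRE as \s. The sketch below uses a hypothetical module name, whitespace_probe, to compare the two spellings.

```erlang
-module(whitespace_probe).
-export([matched_by/1]).

%% Returns which of tab (9), line feed (10), carriage return (13) and
%% space (32) are matched by Pattern, testing each code on its own.
matched_by(Pattern) ->
    {ok, MP} = re:compile(Pattern),
    [C || C <- [9, 10, 13, 32], re:run([C], MP) =/= nomatch].
```

With the doubled backslash, whitespace_probe:matched_by("\\s") returns [9,10,13,32]; with the single backslash, whitespace_probe:matched_by("\s") returns only [32], because the pattern is then just " ".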

tests_09_whitespace_03_tests.erl


Yeah a [128] is not a string, if it displays as something like a string that’s just because it happens to fit, but otherwise that’s just a list of integers. Erlang doesn’t actually have a string type, just lists of integers and binaries are the closest you get, but nothing that would enforce it being a string like you would get in Rust or so.
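A small sketch of that "happens to fit" check, using the printability tests from io_lib. The module name printable_probe is hypothetical; as far as I understand, checks of this kind are what decide whether a list of integers is shown as text, but treat that link as an assumption.

```erlang
-module(printable_probe).
-export([classify/1]).

%% Reports whether a list of integers counts as printable text under the
%% Latin-1 rules and under the Unicode rules, in that order.
classify(Codes) ->
    {io_lib:printable_latin1_list(Codes),
     io_lib:printable_unicode_list(Codes)}.
```

printable_probe:classify([128]) gives {false,false}, since 128 falls in a control range under both rule sets, while printable_probe:classify([1070,1085]) gives {false,true}.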


I asked a question on Stack Overflow about the ASCII-related implementation details of the Erlang library for working with regular expressions and received an answer. Excellent!
:slight_smile: :dizzy:


On the strange behavior of \s, Steve Vinoski gave me something to think about on stackoverflow.com. The behavior of the re module probably depends on the system settings (specifically, the value of the LC_ALL variable). It is strange that I did not find anything about this in the documentation for this library.

The road to the truth is long and winding.

I hope there is a way to change system settings from inside tests. The application module can help with this. Next I will test its functionality with EUnit. I hope this will be an acceptable solution.
