Please help me write a universal regular expression to split a string into numbers, a combination of letters in a row and a space.
There is a line corresponding to the United Kingdom postal code. It is necessary to split it into parts and form a common array from them in the same order in which they are written into the string. Each part must match:
-
a letter or several letters in a row, without numbers and other characters, in particular a space, up to the first NOT matching character
-
one or more consecutive digits (number) up to the first NOT matching character
-
another character (in my example this is only one likely space, I represent it as not a letter or a number)
By convention, there can be an indefinite number of array elements, but they are limited by the kingdom postal code template. Here is a picture that perfectly describes the template:
Borrowed from www.getthedata.com
where:
- L – letter
- N – digit
- A – number or letter
- SPACE – space
-
A Optional mark above a symbol means that it will not necessarily be present.
-
Example 1:
The code SW1A 1AA should after transformation become an array with
with the following set of elements:['SW', '1', 'A', '', '1', 'AA']
-
Example 2:
A piece of code SW1A should become an array with the following set
elements:['SW', '1', 'A']
-
Example 3:
A piece of code SW22 3 should become an array with the following set
elements:['SW', '22', '', '3']
-
Postscript : Particularly difficult for me is the fact that the whole postal code can be broken into parts, as well as a separate part of it, down to an empty line. Not being an expert in regular expressions, this task was too much for me: (Perhaps there is a simpler universal solution than using the preg_split
function?
Answer 1, authority 100%
You can tokenize a string like this (see online demo ):
$ strs = ["SW1A 1AA", "SW1A", "SW22 3"];
foreach ($ strs as $ s) {
if (preg_match_all ('~ [a-zA-Z] + | \ d + | [^ \ da-zA-Z] + ~', $ s, $ chunks)) {
print_r ($ chunks [0]);
}
}
// - Array ([0] = & gt; SW [1] = & gt; 1 [2] = & gt; A [3] = & gt; [4] = & gt; 1 [5] = & gt; AA)
// - Array ([0] = & gt; SW [1] = & gt; 1 [2] = & gt; A)
// - Array ([0] = & gt; SW [1] = & gt; 22 [2] = & gt; [3] = & gt; 3)
Regular expression here (see demo )
[a-zA-Z] + | \ d + | [^ \ da-zA-Z] +
It finds
[a-zA-Z] +
– one or more Latin letters|
– or\ d +
– one or more digits|
– or[^ \ da-zA-Z] +
– – one or more characters other than numbers and Latin letters
If you need Unicode support, use '~ \ p {L} + | \ p {N} + | [^ \ p {N} \ {L}] + ~ u'
.