Forgiving regex to extract key-value pairs from plain text files
Recording data manually (typing it with your fingers) in plain text files is still a viable option, even if it is not that common.
If what you are recording consists of multiple data values, you need some kind of key-value format. A simple format like this does the job:
First key:
Lorem ipsum dolor

Second key:
Nunc volutpat cursus
The rules can be summarized like this:
- the key ends with a colon (:)
- the value is on the next line
- key-value pairs are delimited by empty lines
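As a baseline, the three rules above can be matched with a very strict pattern. The sketch below is an illustration only: the pattern, the sample text, and the variable names ($text, $pairs) are assumptions made for this example and are not taken from the article. It shows what "the basics" look like before any forgiveness is added.

```php
<?php
// Hypothetical sample input that follows the three rules exactly.
$text = implode("\n", [
    'First key:',
    'Lorem ipsum dolor',
    '',
    'Second key:',
    'Nunc volutpat cursus',
]);

// Strict baseline pattern (an assumption for illustration):
// a line ending in a colon, then exactly one value line.
// The "m" flag lets ^ and $ match at line boundaries.
preg_match_all('/^(.+):\n(.+)$/m', $text, $matches);

// Pair each captured key with its captured value.
$pairs = array_combine($matches[1], $matches[2]);
print_r($pairs);
```

A pattern this strict has no tolerance for a value typed on the same line as its key, for spaces after the colon, or for values that span several lines.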
Whatever format you choose, you have to leave room for human error; extra spaces, for example, are very common.
You might also want to allow some flexibility, or you might expect long strings of text that should be broken across multiple lines to improve readability.
Here's an entry with some exaggerated formatting issues:
Alpha:
Lorem ipsum

Beta gamma:
dolor sit amet
consectetur adipiscing
elit

delta: quam vehicula

Epsilon:-Zeta:
Curabitur interdum massa

Eta:
Maecenas ac felis

Theta:

Iota:
Morbi at lobortis
In the end, you want to extract a list of keys and a list of values like the ones below, and that takes a regex that goes beyond the basics:
[
    [
        "Alpha",
        "Beta gamma",
        "delta",
        "Epsilon:-Zeta",
        "Eta",
        "Theta",
        "Iota"
    ],
    [
        "Lorem ipsum",
        "dolor sit amet\nconsectetur adipiscing\nelit",
        "quam vehicula",
        "Curabitur interdum massa",
        "Maecenas ac felis",
        "",
        "Morbi at lobortis"
    ]
]
The regex that allows this amount of forgiveness looks like this:
[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)
Use this pattern with PHP's preg_match_all function, and you have a solid starting point for working with the data.
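A minimal sketch of that call might look like the following. The pattern is the one from the article, wrapped in "/" delimiters. The sample text is an assumption: it mirrors the messy entry, but the exact placement of the extra spaces and blank lines is made up for this example, as are the variable names.

```php
<?php
// The pattern from the article, wrapped in "/" delimiters for PHP.
$pattern = '/[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)/';

// Hypothetical messy input; the stray spaces and extra blank lines
// are assumptions chosen to exercise the pattern.
$text = implode("\n", [
    '  Alpha:  ',
    'Lorem ipsum   ',
    '',
    '',
    'Beta gamma:',
    'dolor sit amet',
    'consectetur adipiscing',
    'elit',
    '',
    'delta:   quam vehicula',
    '',
    'Epsilon:-Zeta:',
    'Curabitur interdum massa',
    '',
    'Eta:   ',
    'Maecenas ac felis',
    '',
    'Theta:',
    '',
    'Iota:',
    'Morbi at lobortis',
]);

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches; groups 1 and 2 hold keys and values.
$keys   = $matches[1];
$values = $matches[2];

echo json_encode([$keys, $values], JSON_PRETTY_PRINT), "\n";
```

With this input, the printed structure should contain the same two lists shown earlier: seven keys, and seven values with an empty string for Theta. If you prefer a map, array_combine($keys, $values) pairs them up, assuming the keys are unique.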
To deconstruct the pattern, head over to RegEx101 where you will see the captured groups highlighted and the tokens annotated.
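If you would rather read the breakdown in code, the same pattern can be spread over several lines with PCRE's "x" (extended) flag, which ignores whitespace outside character classes and allows # comments. This is only a sketch: the comments are one reading of the pattern, and the sample text and variable names are assumptions for this example.

```php
<?php
// The article's pattern, annotated token by token using the "x" flag.
$annotated = '/
    [ ]*        # optional spaces before the key
    (.+)        # group 1: the key, greedy, so it runs to the last colon on the line
    :           # the colon that ends the key
    [ ]*        # optional spaces after the colon
    \n?         # optional line break, so the value may also sit on the key line
    (           # group 2: the value
      (?:
        .+      # one or more characters on the current line
        \n?     # optionally step over a single line break
        [^\s]   # then require a non-whitespace character
      )*        # repeat for multi-line values; zero repeats gives an empty value
    )
/x';

// The compact form, for comparison.
$compact = '/[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)/';

// Hypothetical sample text.
$text = "Alpha:  \nLorem ipsum  \n\ndelta: quam vehicula\n\nTheta:\n\nIota:\nMorbi";

preg_match_all($compact, $text, $a);
preg_match_all($annotated, $text, $b);

// Both forms should capture exactly the same keys and values.
var_dump($a === $b);
```

The trailing [^\s] is what trims spaces at the end of a value and stops the match at a blank line. A side effect worth knowing is that each repetition consumes at least two characters, so a one-character value comes back as an empty string.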