Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.
For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):
""","""
This is correctly interpreted as:
quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value
I.e. a single value of:
","
How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escape for the following character or might be end of data’ state. The third quote puts you back into an ‘inside data’ state.
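The states described above can be sketched as a small state machine. This is a minimal illustration in Python, not a complete CSV parser: the `State` names and the `parse_csv` function are made up for this example, it assumes comma delimiters and doubled quotes as the escape, and it is lenient about malformed input.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # nothing read yet for the current value
    UNQUOTED = auto()         # inside a value that did not start with a quote
    IN_QUOTES = auto()        # the 'inside data' state
    QUOTE_IN_QUOTES = auto()  # quote seen inside data: escape or end of data?


def parse_csv(text):
    """Return a list of rows, each a list of string values."""
    rows, row, value = [], [], []
    state = State.FIELD_START
    for ch in text:
        if state is State.IN_QUOTES:
            if ch == '"':
                state = State.QUOTE_IN_QUOTES
            else:
                value.append(ch)  # commas and newlines are plain data here
        elif state is State.QUOTE_IN_QUOTES and ch == '"':
            value.append('"')  # doubled quote: an escaped quote
            state = State.IN_QUOTES
        elif state is State.FIELD_START and ch == '"':
            state = State.IN_QUOTES  # opening quote of a quoted value
        elif ch == ',' or ch == '\n':
            row.append(''.join(value))  # end of the current value
            value = []
            state = State.FIELD_START
            if ch == '\n':
                rows.append(row)  # end of the current row
                row = []
        elif ch != '\r':  # skip the CR of a CRLF line ending outside quotes
            value.append(ch)
            state = State.UNQUOTED
    # Flush the last row when the text does not end with a newline.
    if state is not State.FIELD_START or row:
        row.append(''.join(value))
        rows.append(row)
    return rows


print(parse_csv('""","""'))  # [['","']]
```

The same quote character takes three different branches depending on the current state, which is exactly the information the parser has to carry from one character to the next. In real code you would normally reach for an existing parser, such as Python’s standard `csv` module, rather than write your own.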
No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point will be garbage.
You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
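As a quick illustration of that subset, here is about the simplest regex-based approach, splitting each line on commas. The `naive_split` name is hypothetical and this is only a sketch of the failure mode, not a recommended technique.

```python
import re


def naive_split(line):
    """Split one line on every comma, with no notion of quoting."""
    return re.split(r',', line)


# Fine while the data values contain no commas or quotes.
print(naive_split('1,2,3'))              # ['1', '2', '3']

# Breaks as soon as a quoted value contains a comma:
# the single value "Smith, John" is cut in two.
print(naive_split('1,"Smith, John",3'))  # ['1', '"Smith', ' John"', '3']
```

The second line is a well-formed 3 column row, but the split returns 4 pieces and leaves the quotes in the data.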
Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
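For comparison, a TSV file that sticks to that rule can be split with two very simple regexes. This is a sketch under that assumption only: `parse_tsv` is a hypothetical name, and it will give wrong results for any TSV dialect that does allow tabs or line breaks inside values.

```python
import re


def parse_tsv(text):
    """Split TSV text into rows of values, assuming no tabs or line
    breaks ever appear inside a value."""
    if not text:
        return []
    # Drop one trailing line ending so it does not produce an empty last row.
    text = re.sub(r'\r?\n\Z', '', text)
    lines = re.split(r'\r?\n', text)
    return [re.split(r'\t', line) for line in lines]


print(parse_tsv('a\tb\nc\td\n'))  # [['a', 'b'], ['c', 'd']]
```

No state is needed here because a tab always means ‘next value’ and a line ending always means ‘next row’, whatever came before.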
See additionally on Stackoverflow: