Date: 2012-11-20 05:31 pm (UTC)
From: [personal profile] ptc24
Erm. I think you're right, in that there are issues over "parse".

I think the trap I've fallen into is letting the questioner and answerer define "parse" for me. The questioner was trying to do something, the answerer said "You can't parse HTML with regexes", and I've been assuming that there's been consensus that this is at least relevant. I've been on the other side of this debate before (insisting on narrow, strict definitions of "parsing"), so maybe I should slap my wrists. That said, I'm pretty sure that whatever is meant, parsing is about syntax, not semantics.

It's possible that the questioner's original request is more akin to some other syntactic task, such as tokenisation or chunking. But "chunking" is one of those things that may or may not count as parsing - let's look at Wikipedia:

"Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence."

So when we're finding tags in HTML, we're identifying some constituents of the text stream, constituents which I think are too complex to count as "tokens" (if "tokens" are even an admissible notion in concrete HTML). So what the questioner is trying to do could be construed as a form of partial shallow parsing.
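To make that concrete, here's a rough sketch of what I mean by "finding tags" as chunking. The pattern and the function name are my own inventions for illustration, and the pattern is deliberately naive: it assumes there's no ">" inside attribute values and it knows nothing about comments, CDATA or script bodies. It picks out tag-shaped constituents and says nothing at all about how they nest.

```python
import re

# Hypothetical, deliberately naive pattern for tag-shaped constituents.
# Assumes no '>' inside attribute values; ignores comments, CDATA, scripts.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def chunk_tags(html):
    """Return (tag_name, start, end) for each tag-like span in the text.

    This only identifies constituents; it does not recover which tag
    contains which, i.e. no internal structure and no tree.
    """
    return [(m.group(1).lower(), m.start(), m.end()) for m in TAG.finditer(html)]

sample = '<p>Hello <b class="x">world</b></p>'
chunks = chunk_tags(sample)
for name, start, end in chunks:
    print(name, repr(sample[start:end]))
```

The output is a flat list of spans, which is about what a chunker gives you: the pieces are identified, but nothing says that the b sits inside the p.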

Nested tags: Arbitrary nested tags? Tags nested at unlimited depth?