jack | How do I parse HTML using a Regex?

jack

Bookmark organising

When I organised my bookmarks, I realised that most fell into the categories of

Stuff I use every day and want quickly accessible. Eg. email, social networking homepages, the wikipedia page on the unicode checkmark, etc.
Something to read later. Eg. links that seem interesting, computer games, books and films to consider buying or renting, etc.
Something to read periodically, eg. news sites, social networking friends pages, feeds, blogs I follow, many webcomics divided into "daily", "bi/triweekly" etc.
Something I may occasionally want as a reference. Eg. step-by-step instructions for stuff I do occasionally.
Stuff that's useless, doesn't update, but I just keep coming back to because it's so awesome, such as The world flag rating page (do not make your country's flag in photoshop, tricolors are overused), The Evil Overlord List of movie-stereotypical mistakes I will not make if I'm ever an evil overlord, and the Earth destruction advisory board FAQ on non-dilettante ways to destroy the earth.

The last category was a minor surprise to me, as I'd not realised in advance it was a category I'd need. But I really do need it, because even if I don't need those links, if I don't have it, my mind keeps saying "don't forget the earth destruction advisory board, what if it updates the earth destruction status[1]", so I need a place to put them, just to get them out of the way of all the other categories!

I do the same with physical objects too: if I want to keep something and it doesn't have a place, I make a place for things I keep for that reason, however stupid. Then, if I decide it's stupid and I don't need to keep it, I can throw it out later, having already separated it from stuff I'm keeping for a more useful reason.

The reason I mention this now is that last night several of us were talking about an answer on stack overflow that is incredibly awesome and made the rounds several times recently, but some less-programmer-y people hadn't seen, which is one of the most recent links promoted to my list of "stuff on the internet I personally find most awesome".

Link for khalinche and ceb from last night, how do I use a regex to detect certain sorts of tag in HTML text

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454

There is a question on Stack Overflow asking how to use a regex to detect certain sorts of tag in HTML text, and the first answer (linked above) is a work of genius. As the answerer gets more and more emphatic about his opinion, it's really funny and accurate (even if you don't know what the words mean, it's still funny and you can get the gist of the answer if you scroll through slowly to the end). :)

Footnotes

[1] On 10 September, 2008, it did, advancing the "Earth destruction advisory count" from 0 to 1. There is a supplementary FAQ on the event at http://qntm.org/board, starting with "The Earth hasn't been destroyed! What are you talking about?"

Crossposts: http://cartesiandaemon.livejournal.com/797921.html


From:

ptc24

You can't parse [X]HTML with regex.

Other true statements, if we take the presuppositions that are needed to make that statement true, and apply them elsewhere:

You can't parse natural language with regex

You can't parse natural language with any computer program we currently have or know how to write

You can't parse natural language with a human being

Now obviously there's something funny about that last statement.

So the next comment:

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

Alternatively, to parse a limited-but-ill-specified set of HTML with potentially less than 100% reliability.

From:

jack

I'm sorry, I think you're right, but I don't understand what you're trying to say (I don't actually have much theoretical background on regular expressions and context free grammars, I've only really encountered them from the programming side, not the maths side).

Are you saying that the case that specific question was asking about was one that could be parsed fine by a regular expression, so the extended rant about not using regular expressions wasn't a good answer? But what does that have to do with humans parsing natural language?

From:

simont

Possibly one of the suppositions ptc24 is alluding to is the implicit difference in standards of success between natural languages and computer languages. With any computer language, we aim for perfection in parsing: if just one program, no matter how fiddly, satisfies the syntax specification but is not correctly understood by the parser, then that's a bug and it wants fixing. Hence, the fact that a regexp will inevitably fall down in some complicated corner case of HTML or other, no matter how many common cases it gets right, is sufficient to justify the statement that regexps cannot parse HTML.

But if you applied the same standard to natural language, you'd have to say that nobody and nothing can parse it!

From:

ptc24

My first instinct was to say "bingo", but there's also the case of well-constrained subsets of HTML, for example single tags, or HTML from some specific known source, or whatever.

I once wrote a handful of regexes to replace an XML parser - the XML parser was slow and leaked memory, and the data we were processing was in a format that was nice and tractable. People didn't like that. Solution: download the data in a non-XML format. Suddenly the handful-of-regexes approach was nothing to complain about. Also, as a result, the download took half as long.

From:

ptc24

There seem to be two issues here:

1) "You can't parse [X]HTML with regex" meaning "You can't parse [X]HTML with regex with full reliability and generality." This is what my main point was. That's my point about humans parsing natural language. In fact, the point about humans was mainly about reliability, whereas I suppose the problem that's relevant here is generality. (Arguably, you could say "there is no one person who understands every natural language, and certainly no-one who understands every conceivable natural language." Possibly if some theorists are right there are natural languages that are comprehensible to some space aliens but not to H. sapiens.)

2) You're correct, the specific question was about HTML tags - a subset of HTML, which doesn't have the problems with recursion that full HTML has - and so regexes are (likely to be) OK.
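As a rough illustration of point 2, here is a minimal sketch in Python of the kind of regex the original question was after: find opening tags, skip closing tags and self-closing ones. The pattern and the name `opening_tags` are illustrative assumptions, not taken from the Stack Overflow thread, and the sketch assumes no attribute value contains a literal `>`.

```python
import re

# Hypothetical pattern: "<", a tag name, anything up to the next ">",
# but only if the character just before that ">" is not "/".
# Assumes attribute values never contain a literal ">".
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")


def opening_tags(html):
    """Return the names of opening tags that are not self-closing."""
    return [match.group(1) for match in OPEN_TAG.finditer(html)]


sample = '<p>Hello <a href="foo">link</a><br /><hr class="foo" /></p>'
print(opening_tags(sample))  # ['p', 'a']
```

No nesting is tracked here at all: each tag is looked at in isolation, which is why recursion does not come into it.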

From:

simont

You draw a careful distinction there between reliability and generality, but don't say what each one means. Would I be right in guessing that 'generality' is about how few examples of the language you give up on and say 'dunno', and 'reliability' is about how few wrong answers you give for those cases in which you don't say 'dunno'?

From:

ptc24

I'd count both of these as "reliability". Issues of generality I think include the range of inputs it makes sense to run things over. Something might do well for copy-edited text but not for blog posts, something might do well for flowing text but not for things in tables, something might be OK for genuine written text but not for transcripts of speech, etc. Also, you can imagine systems which work with full reliability, in that they capture 100% of the cases of some narrowly-but-well-defined phenomenon in some narrowly-but-well-defined case.

There's a bare mass noun problem here, like a bare plural problem. If I have some regexes, and they reliably parse some set of web pages that I am interested in, and those web pages are in HTML, do my regexes "parse HTML"? They don't, and can't, parse all HTML, they can and do parse some HTML. But what does it mean to "parse HTML"?

From:

jack

You're right, breaking it down into two issues clarifies it, thank you.

One

OK, you're right, though this seems to be a fact about how programmers use language differently to people in casual conversation, not anything to do with regexes specifically. "Can I use X to do Y" could mean is there any conceivable circumstance where I could use X to do Y (eg. can I do an emergency tracheotomy with a pen lid). But if I go on stack overflow and ask "can I divide by two to find a square root", the helpful answer is not "yes, it works for 4" but "no, you need to use an iterative approximation".
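To make the square-root analogy concrete, here is a small sketch assuming Newton's method as the "iterative approximation" (the comment doesn't say which one is meant). The function name `newton_sqrt` and the tolerance are illustrative choices.

```python
def newton_sqrt(x, tolerance=1e-12, max_steps=100):
    """Approximate sqrt(x) by Newton's iteration (an assumed choice of method)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(max_steps):
        # Stop once guess squared is close enough to x (relative error).
        if abs(guess * guess - x) <= tolerance * x:
            break
        # Average the guess with x / guess; this converges on sqrt(x).
        guess = (guess + x / guess) / 2
    return guess


# "Divide by two" happens to be right for 4 and wrong almost everywhere else.
for x in (4, 9, 2):
    print(x, x / 2, newton_sqrt(x))
```

Halving agrees with the iteration only at 4; for 9 it gives 4.5 where the iteration settles near 3.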

This particular questioner didn't use the phrase "can I do Y with X", but I think most people reading would understand the answer in the sense of "is X a sensible method of doing Y".

For that matter, I think a better interpretation of what the answer is trying to say might be that you can't parse (in the sense of "interpreting the HTML structure of the entire document") a typical HTML document using a regex, even though you can do useful processing and parsing that fall short of parsing it in its entirety (see below).

Two

Of course, it's complicated by the fact that the questioner wasn't asking that, and what they literally asked could be done perfectly fine with a regex.

But I think this is one of the awkward cultural protocols which hasn't quite found an answer on stack overflow: what to do if someone asks "how can I do [horrible idea]? I can't do [obvious non-horrible idea] but I don't want to talk about why not". Was this guy actually going to use his html-tag parsing code to do something you could sensibly do without trying to parse a full CFG, eg. scraping a website in a predictable format and stripping out HTML tags? Or was he, as many answerers seemed to suspect, doing this as a first step to trying to parse it in a more general way, in which case "I know it's not what you asked, but what you're trying to do won't work and you should do it this other way" is actually the most correct and helpful answer?

From:

ptc24

But if I go on stack overflow and ask "can I divide by two to find a square root", the helpful answer is not "yes, it works for 4" but "no, you need to use an iterative approximation".

I think this example is missing something. In the case above, the working-for-4 case is like a stopped clock being right twice a day. There are cases which are more like a non-stopped clock being right most of the time, except when there's a GMT/BST switch or you're travelling between timezones or the clock runs out of batteries or whatever, and you have a choice between "these aren't an issue for me in my particular case", "these are an issue but I can correct for these by hand when needed" and "I need a good clock that handles all of these cases (and probably a few others that I've forgotten about) automatically".

So, can I tell the time by looking at my kitchen clock? Does the presence or absence of batteries in the back of my clock affect whether I can tell the time with it by looking at it? Does the possibility of my flatmate correcting the clock for GMT or leaving it uncorrected (without telling me either way) affect whether I can tell the time with it by looking at it?

ETA: is trying to parse, or process, (X)(HT)ML with a regex like trying to tell the time with a sundial?

Edited Date: 2012-11-20 03:38 pm (UTC)

From:

jack

Ah! I think maybe we do have an ambiguity in "parse".

I think I interpreted "parse html" to mean "parse an html document as html", ie. inherently involving parsing the salient features of html, notably nested tags. But you equally validly interpreted "parse html" in a more general way as "parse an html document in a structured way to get data out of it in a useful form".

And the first way, "parse html" is inherently problematic, because even if it works in some cases, "works" implies "generate some sort of data representation of the structure of the document" which is exactly what you can't do. But the second way, "parse html" includes things like "scraping useful information out of it" which is 100% valid in a reading-a-clock way.
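A sketch of the two readings side by side, using Python's standard library. The first reading needs something that tracks structure (here `html.parser.HTMLParser`, counting nesting depth); the second only pulls data out (here a regex for `href` values). The class and function names are made up for illustration, and the depth count assumes every opened tag is explicitly closed.

```python
import re
from html.parser import HTMLParser


class DepthTracker(HTMLParser):
    """Reading one: follow the document's structure (nesting depth)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth -= 1


def max_nesting_depth(html):
    # Assumes every opened tag is explicitly closed (no bare <br> etc.).
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth


# Reading two: scrape data out without caring about structure.
# Assumes double-quoted href values.
HREF = re.compile(r'href="([^"]*)"')


def scrape_links(html):
    return HREF.findall(html)


page = '<div><div><a href="http://qntm.org/board">board</a></div></div>'
print(max_nesting_depth(page))  # 3
print(scrape_links(page))       # ['http://qntm.org/board']
```

The regex gets the link out without knowing or caring how deeply it is nested, which is the reading-a-clock sense of "works".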

Does that sound right?

From:

ptc24

Erm. I think you're right, in that there are issues over "parse".

I think the trap I've fallen into is letting the questioner and answerer define "parse" for me. The questioner was trying to do something, the answerer said "You can't parse HTML with regexes", and I've been assuming that there's been consensus that this is at least relevant. I've been on the other side of this debate before (insisting on narrow, strict definitions of "parsing"), so maybe I should slap my wrists. That said, I'm pretty sure that whatever is meant, parsing is about syntax, not semantics.

It's possible that the questioner's original request is more akin to some other syntactic task, such as tokenisation, or chunking. But "chunking" is one of those things that may or may not count as parsing - let's look at wikipedia:

"Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence."

So when we're finding tags in HTML, we're identifying some constituents of the text stream, constituents which I think are too complex to count as "tokens" (if "tokens" are even an admissible notion in concrete HTML). So what the questioner is trying to do could be construed as a form of partial shallow parsing.
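As a sketch of what that "partial shallow parsing" might look like, the following Python splits a text stream into tag chunks and text chunks without looking inside the tags or relating them to one another. The pattern and the name `chunk` are illustrative assumptions; comments, CDATA and `>` inside quoted attribute values are deliberately ignored.

```python
import re

# Either a whole "<...>" run, or a run of text containing no "<".
# Simplifying assumption: no ">" inside attribute values, no comments.
CHUNK = re.compile(r"<[^>]*>|[^<]+")


def chunk(html):
    """Label each piece of the stream as a tag or as text."""
    return [
        ("tag" if piece.startswith("<") else "text", piece)
        for piece in CHUNK.findall(html)
    ]


for kind, piece in chunk("<p>Hi <b>there</b></p>"):
    print(kind, repr(piece))
```

The output is a flat sequence of constituents; nothing in it records that `<b>` sits inside `<p>`.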

Nested tags: Arbitrary nested tags? Tags nested at unlimited depth?
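On the depth question, a sketch of the difference, using Python's standard `re` module (which has no recursive patterns): a regex can be built to accept nesting up to some fixed depth, but each extra level has to be written into the pattern, whereas a counter handles any depth. The helper names are hypothetical and only `<div>` tags without attributes are considered.

```python
import re


def nested_div_pattern(max_depth):
    """Build a regex accepting <div> nesting up to max_depth levels."""
    pattern = r"[^<]*"  # depth 0: text only
    for _ in range(max_depth):
        # Each level wraps the previous pattern in one more <div>...</div>.
        pattern = rf"(?:[^<]*<div>{pattern}</div>)*[^<]*"
    return re.compile(pattern)


def divs_balanced(html):
    """Check balance at any depth with a simple counter."""
    depth = 0
    for tag in re.findall(r"</?div>", html):
        depth += -1 if tag.startswith("</") else 1
        if depth < 0:
            return False
    return depth == 0


two_deep = "<div><div>x</div></div>"
three_deep = "<div><div><div>x</div></div></div>"
depth_two = nested_div_pattern(2)
print(depth_two.fullmatch(two_deep) is not None)    # True
print(depth_two.fullmatch(three_deep) is not None)  # False
print(divs_balanced(three_deep))                    # True
```

So bounded nesting is within reach of a regex if the bound is known in advance; unlimited depth is where something that can count is needed.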

From:

andrewducker

Stuff I use every day: Bookmarks on the Firefox bookmark toolbar.
Something to read later: http://getpocket.com/
Periodical Stuff: http://www.google.co.uk/reader/
Reference Stuff: Bookmarked. Or added to Delicious.
Awesome Stuff: Delicious.
