How do I parse HTML using a Regex?
Nov. 20th, 2012 10:57 am
Bookmark organising
When I organised my bookmarks, I realised that most fell into the categories of
- Stuff I use every day and want quickly accessible. Eg. email, social networking homepages, the wikipedia page on the unicode checkmark, etc.
- Something to read later. Eg. links that seem interesting, computer games, books and films to consider buying or renting, etc.
- Something to read periodically, eg. news sites, social networking friends pages, feeds, blogs I follow, many webcomics divided into "daily", "bi/triweekly" etc.
- Something I may occasionally want as a reference. Eg. step-by-step instructions for stuff I do occasionally.
- Stuff that's useless, doesn't update, but I just keep coming back to because it's so awesome, such as The world flag rating page (do not make your country's flag in photoshop, tricolors are overused), The Evil Overlord List of movie-stereotypical mistakes I will not make if I'm ever an evil overlord, and the Earth destruction advisory board FAQ on non-dilettante ways to destroy the earth
The last category was a minor surprise to me, as I'd not realised in advance it was a category I'd need. But I really do need it, because even if I don't need those links, if I don't have it, my mind keeps saying "don't forget the earth destruction advisory board, what if it updates the earth destruction status[1]" so I need a place to put them, just to get them out of the way of all the other categories!
I do the same with physical objects too: if I want to keep it and it doesn't have a place, make a place for things I keep for that reason however stupid. Then, if I decide it's stupid and I don't need to keep it, I can throw it out later, having already separated it from stuff I'm keeping for a more useful reason.
The reason I mention this now is that last night several of us were talking about an answer on stack overflow that is incredibly awesome and made the rounds several times recently, but some less-programmer-y people hadn't seen, which is one of the most recent links promoted to my list of "stuff on the internet I personally find most awesome".
Link for khalinche and ceb from last night, how do I use a regex to detect certain sorts of tag in HTML text
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454
There is a question on Stack Overflow asking how to use a regex to detect certain sorts of tag in HTML text, and the first answer (linked above) is a work of genius. As the answerer gets more and more emphatic about his opinion, it's really funny and accurate (even if you don't know what the words mean, it's still funny, and you can get the gist of the answer if you scroll through slowly to the end). :)
Footnotes
[1] On 10 September, 2008, it did, advancing the "Earth destruction advisory count" from 0 to 1. There is a supplementary FAQ on the event at http://qntm.org/board, starting with "The Earth hasn't been destroyed! What are you talking about?"
no subject
Date: 2012-11-20 01:16 pm (UTC)
Other true statements, if we take the presuppositions that are needed to make that statement true, and apply them elsewhere:
You can't parse natural language with regex
You can't parse natural language with any computer program we currently have or know how to write
You can't parse natural language with a human being
Now obviously there's something funny about that last statement.
So the next comment:
Alternatively, to parse a limited-but-ill-specified set of HTML with potentially less than 100% reliability.
no subject
Date: 2012-11-20 02:20 pm (UTC)
Are you saying that the case that specific question was asking about was one that could be parsed fine by a regular expression, so the extended rant about not using regular expressions wasn't a good answer? But what does that have to do with humans parsing natural language?
no subject
Date: 2012-11-20 02:31 pm (UTC)
But if you applied the same standard to natural language, you'd have to say that nobody and nothing can parse it!
no subject
Date: 2012-11-20 02:52 pm (UTC)
I once wrote a handful of regexes to replace an XML parser - the XML parser was slow and leaked memory, and the data we were processing was in a format that was nice and tractable. People didn't like that. Solution: download the data in a non-XML format. Suddenly the handful-of-regexes approach was nothing to complain about. Also, as a result, the download took half as long.
no subject
Date: 2012-11-20 02:34 pm (UTC)
1) "You can't parse [X]HTML with regex" meaning "You can't parse [X]HTML with regex with full reliability and generality." This is what my main point was. That's my point about humans parsing natural language. In fact, the point about humans was mainly about reliability, whereas I suppose the problem that's relevant here is generality. (Arguably, you could say "there is no one person who understands every natural language, and certainly no-one who understands every conceivable natural language." Possibly, if some theorists are right, there are natural languages that are comprehensible to some space aliens but not to H. sapiens.)
2) You're correct, the specific question was about HTML tags - a subset of HTML, which doesn't have the problems with recursion that full HTML has - and so regexes are (likely to be) OK.
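To make the "tags are a subset" point concrete, here is a minimal Python sketch of the kind of thing the question was after: finding opening tags while skipping closing and self-closing ones. The pattern and the `open_tags` helper are my own illustration, not anything from the Stack Overflow thread, and the pattern assumes attribute values never contain a ">" character.

```python
import re

# Matches an opening tag such as <p> or <a href="foo">, but not a closing
# tag (</p>) or a self-closing tag (<br />). Illustrative pattern only: it
# assumes attribute values contain no ">" characters.
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")

def open_tags(text):
    """Return the names of the opening tags found in text, in order."""
    return [m.group(1) for m in OPEN_TAG.finditer(text)]

sample = '<p>Hello <a href="foo">link</a><br /><hr class="x" /></p>'
print(open_tags(sample))  # ['p', 'a']
```

No recursion is involved because each tag is matched on its own, without any attempt to pair it with its closing tag.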
no subject
Date: 2012-11-20 02:58 pm (UTC)
no subject
Date: 2012-11-20 03:13 pm (UTC)
There's a bare mass noun problem here, like a bare plural problem. If I have some regexes, and they reliably parse some set of web pages that I am interested in, and those web pages are in HTML, do my regexes "parse HTML"? They don't, and can't, parse all HTML, they can and do parse some HTML. But what does it mean to "parse HTML"?
no subject
Date: 2012-11-20 03:09 pm (UTC)
One
OK, you're right, though this seems to be a fact about how programmers use language differently to people in casual conversation, not anything to do with regexes specifically. "Can I use X to do Y" could mean is there any conceivable circumstance where I could use X to do Y (eg. can I do an emergency tracheotomy with a pen lid). But if I go on stack overflow and ask "can I divide by two to find a square root", the helpful answer is not "yes, it works for 4" but "no, you need to use an iterative approximation".
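For what it's worth, the square root analogy can be spelled out in a few lines of Python. This is only a sketch: `newton_sqrt` is a name I made up, and Newton's iteration is just one possible iterative approximation.

```python
def newton_sqrt(x, tolerance=1e-12):
    """Approximate the square root of a non-negative x by Newton's iteration."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    # Repeatedly average the guess with x / guess until guess squared is
    # within a small relative tolerance of x.
    while abs(guess * guess - x) > tolerance * x:
        guess = (guess + x / guess) / 2
    return guess

# Dividing by two gives the right answer for 4 and for nothing else positive.
for x in (4, 9, 2):
    print(x, x / 2, newton_sqrt(x))
```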
This particular questioner didn't use the phrase "can I do Y with X", but I think most people reading would understand the answer in the sense of "is X a sensible method of doing Y".
For that matter, I think a better interpretation of what the answer is trying to say might be that you can't parse (in the sense of "interpreting the HTML structure of the entire document") a typical HTML document using a regex, even though you can do useful processing and parsing that fall short of parsing it in its entirety (see below).
Two
Of course, it's complicated by the fact that the questioner wasn't asking that, and what they literally asked could be done perfectly fine with a regex.
But I think this is one of the awkward cultural protocols which hasn't quite found an answer on stack overflow: what to do if someone asks "how can I do [horrible idea]? I can't do [obvious non-horrible idea] but I don't want to talk about why not". Was this guy actually going to use his html-tag parsing code to do something you could sensibly do without trying to parse a full CFG, eg. scraping a website in a predictable format and stripping out HTML tags? Or was he, as many answerers seemed to suspect, doing this as a first step to trying to parse it in a more general way, in which case "I know it's not what you asked, but what you're trying to do won't work and you should do it this other way" is actually the most correct and helpful answer?
no subject
Date: 2012-11-20 03:36 pm (UTC)
I think this example is missing something. In the case above, the working-for-4 case is like a stopped clock being right twice a day. There are cases which are more like a non-stopped clock being right most of the time, except when there's a GMT/BST switch or you're travelling between timezones or the clock runs out of batteries or whatever, and you have a choice between "these aren't an issue for me in my particular case", "these are an issue but I can correct for these by hand when needed" and "I need a good clock that handles all of these cases (and probably a few others that I've forgotten about) automatically".
So, can I tell the time by looking at my kitchen clock? Does the presence or absence of batteries in the back of my clock affect whether I can tell the time with it by looking at it? Does the possibility of my flatmate correcting the clock for GMT or leaving it uncorrected (without telling me either way) affect whether I can tell the time with it by looking at it?
ETA: is trying to parse, or process, (X)(HT)ML with a regex like trying to tell the time with a sundial?
no subject
Date: 2012-11-20 04:59 pm (UTC)
I think I interpreted "parse html" to mean "parse an html document as html", ie. inherently involving parsing the salient features of html, notably nested tags. But you equally validly interpreted "parse html" in a more general way as "parse an html document in a structured way to get data out of it in a useful form".
And the first way, "parse html" is inherently problematic, because even if it works in some cases, "works" implies "generate some sort of data representation of the structure of the document" which is exactly what you can't do. But the second way, "parse html" includes things like "scraping useful information out of it" which is 100% valid in a reading-a-clock way.
Does that sound right?
no subject
Date: 2012-11-20 05:31 pm (UTC)
I think the trap I've fallen into is letting the questioner and answerer define "parse" for me. The questioner was trying to do something, the answerer said "You can't parse HTML with regexes", and I've been assuming that there's been consensus that this is at least relevant. I've been on the other side of this debate before (insisting on narrow, strict definitions of "parsing"), so maybe I should slap my wrists. That said, I'm pretty sure that whatever is meant, parsing is about syntax, not semantics.
It's possible that the questioner's original request is more akin to some other syntactic task, such as tokenisation, or chunking. But "chunking" is one of those things that may or may not count as parsing - let's look at wikipedia:
"Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence."
So when we're finding tags in HTML, we're identifying some constituents of the text stream, constituents which I think are too complex to count as "tokens" (if "tokens" are even an admissible notion in concrete HTML). So what the questioner is trying to do could be construed as a form of partial shallow parsing.
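As a rough illustration of what that "partial shallow parsing" might look like, here is a Python sketch that splits a string into a flat list of tag chunks and text chunks without recording any nesting. The pattern and the `chunk` function are hypothetical, and the pattern ignores comments, CDATA, scripts and ">" inside attribute values.

```python
import re

# A tag chunk is "<", anything that is not ">", then ">"; a text chunk is a
# run of characters containing no "<". Illustrative pattern only.
CHUNK = re.compile(r"<[^>]+>|[^<]+")

def chunk(text):
    """Split text into a flat list of tag chunks and text chunks."""
    return CHUNK.findall(text)

print(chunk("<p>Hi <b>there</b></p>"))
# ['<p>', 'Hi ', '<b>', 'there', '</b>', '</p>']
```

The output identifies the constituents but says nothing about which tag contains which, which is roughly the shallow/full distinction in the Wikipedia quote.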
Nested tags: Arbitrary nested tags? Tags nested at unlimited depth?
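On the nesting question, a small Python sketch may help show where a single regex runs out. The non-greedy pattern below pairs the outermost opening tag with the first closing tag it meets, whereas keeping a running depth counter on top of the standard library's `html.parser` copes with any depth. `naive_div_contents`, `DepthTracker` and `max_nesting` are names invented for this example, and the sample string is made up.

```python
import re
from html.parser import HTMLParser

nested = "<div><div><div>deep</div></div></div><div>shallow</div>"

def naive_div_contents(text):
    """Non-greedy regex attempt at extracting the first div's contents."""
    match = re.search(r"<div>(.*?)</div>", text)
    return match.group(1) if match else None

class DepthTracker(HTMLParser):
    """Records the deepest nesting level reached by one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

def max_nesting(html_text, tag="div"):
    """Return the deepest nesting of the given tag in html_text."""
    tracker = DepthTracker(tag)
    tracker.feed(html_text)
    tracker.close()
    return tracker.max_depth

print(naive_div_contents(nested))  # <div><div>deep
print(max_nesting(nested))         # 3
```

A regex written for one fixed maximum depth can be made to work, but the pattern has to grow with every extra level, which is the "unlimited depth" problem in a nutshell.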
no subject
Date: 2012-11-20 07:16 pm (UTC)
Something to read later: http://getpocket.com/
Periodical Stuff: http://www.google.co.uk/reader/
Reference Stuff: Bookmarked. Or added to Delicious.
Awesome Stuff: Delicious.