Description
I have an application that scrapes an entire website, and a full run takes about 12 hours. It uses `WebClient.DownloadString` to get the HTML, then uses `HtmlParser.ParseDocument` to parse it. Then I do a lot of other parsing on top of that. This happens for about 293,000 pages, so I'm trying to save any little bit of time that I can.
I've noticed that I've got a lot of places where I call `IHtmlDocument.GetElementsByTagName(TAG).FirstOrDefault(QUERY_SELECTOR)`. I believe I could collapse this into some sort of `IHtmlDocument.QuerySelector(QUERY_SELECTOR)`, which in theory should save time by returning after the first match, but some preliminary testing has shown `QuerySelector` to be slower than the old method. For instance:
IElement element = doc.GetElementsByTagName("h2").FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));
takes about 1 ms, whereas
IElement element = doc.QuerySelector("h2:contains('Climbing Directory')");
takes about 23 ms.
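For reference, here is a minimal sketch of how the two lookups could be timed side by side. It is not the exact harness behind the numbers above. The `SelectorTiming` class, the `AverageMs` helper, the stand-in markup, and the iteration count are all hypothetical, and it assumes a recent AngleSharp version where `HtmlParser` lives in `AngleSharp.Html.Parser`. Each lookup gets one untimed warm-up call, since a single cold measurement may include one-time costs such as JIT compilation.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class SelectorTiming
{
    // Hypothetical helper: runs the lookup repeatedly and returns the average time per call in ms.
    public static double AverageMs(Func<IElement> lookup, int iterations)
    {
        lookup(); // warm-up call, excluded from the measurement

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            lookup();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        // Stand-in markup (an assumption); the real pages come from WebClient.DownloadString.
        const string html =
            "<html><body><h2>Intro</h2><h2>Climbing Directory</h2></body></html>";
        IHtmlDocument doc = new HtmlParser().ParseDocument(html);

        // Old approach: collect the h2 elements, then filter with LINQ.
        double linqMs = AverageMs(
            () => doc.GetElementsByTagName("h2")
                     .FirstOrDefault(x => x.TextContent.Contains("Climbing Directory")),
            1000);

        // New approach: a single selector using the :contains pseudo-class.
        double selectorMs = AverageMs(
            () => doc.QuerySelector("h2:contains('Climbing Directory')"),
            1000);

        Console.WriteLine($"GetElementsByTagName + FirstOrDefault: {linqMs:F4} ms");
        Console.WriteLine($"QuerySelector: {selectorMs:F4} ms");
    }
}
```

Running this against a real downloaded page instead of the stand-in markup would show whether the gap persists once warm-up is excluded.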
Any suggestions for improving my code?