SgmlReader is a versatile .NET library written in C# for parsing HTML/SGML files. The original SgmlReader community was hosted on GotDotNet, which has since been phased out. MindTouch Dream and MindTouch Deki use the SgmlReader library extensively, and we have found and fixed a few bugs in it along the way. In the spirit of the original author, we are contributing these changes back through the MindTouch Developer Center site.
The latest version of SgmlReader can be downloaded from SourceForge.Net or from our public SVN repository. If you find or fix issues in SgmlReader, please post in the SgmlReader forum.
The following sample code parses an HTML document into an XmlDocument:
using System.IO;
using System.Xml;

XmlDocument FromHtml(TextReader reader) {
    // set up SgmlReader to treat the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    // fold element and attribute names to lowercase
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create the document, keeping whitespace as-is and
    // disabling resolution of external resources such as DTDs
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
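
The function above could be exercised as in the following minimal sketch. It assumes the SgmlReader assembly is referenced by the project. The class name HtmlToXmlExample, the inline HTML string, and the XPath query are illustrative choices and are not part of SgmlReader. FromHtml is repeated inside the class only so that the example compiles on its own.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical wrapper class, used only for this illustration.
class HtmlToXmlExample {

    // Same conversion as the sample above, made static so Main can call it.
    static XmlDocument FromHtml(TextReader reader) {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }

    static void Main() {
        // Deliberately sloppy HTML: mixed-case tags, an unclosed <P> and a bare <br>.
        string html = "<HTML><BODY><P>Hello<br>world</BODY></HTML>";

        using (TextReader reader = new StringReader(html)) {
            XmlDocument doc = FromHtml(reader);

            // Print the XML that was loaded into the document.
            Console.WriteLine(doc.OuterXml);

            // Element names were folded to lowercase, so the query uses "p".
            XmlNodeList paragraphs = doc.SelectNodes("//p");
            Console.WriteLine("paragraph count: " + paragraphs.Count);
        }
    }
}
```

Because the result is an ordinary XmlDocument, the usual System.Xml APIs (SelectNodes, SelectSingleNode, OuterXml, and so on) can be used on it from that point on.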