Trying out an HTML parser (CyberNeko HTML)*1

What you need

NekoHTML
- nekohtml.jar
xerces2j
- xercesImpl.jar

Test: load HTML from a given URL and look at the document tree

package jp.seraphyware.htmlparsertest;

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoHTMLTest {

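    public static void main(final String[] args) throws Exception {
        // Open a connection to the page to be parsed.
        final URL url = new URL("http://java.sun.com/");
        final URLConnection urlConnection = url.openConnection();
        // NekoHTML's DOMParser builds a W3C DOM tree from the HTML input.
        final DOMParser parser = new DOMParser();
        final InputStream is = urlConnection.getInputStream();
        try {
            parser.parse(new InputSource(is));
        }
        finally {
            // Always release the network stream, even if parsing fails.
            is.close();
        }
        // Walk the resulting tree starting from the root element.
        final Document doc = parser.getDocument();
        final Element root = doc.getDocumentElement();
        walkTree("", root);
    }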
    
    // Recursively prints each element's tag name and its non-blank text children.
    private static void walkTree(final String level, final Element elm) {
        System.out.println(level + "<" + elm.getTagName() + ">");
        final NodeList children = elm.getChildNodes();
        final int len = children.getLength();
        for (int idx = 0; idx < len; idx++) {
            final Node child = children.item(idx);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Child element: recurse one indentation level deeper.
                walkTree(level + " ", (Element) child);
            }
            else if (child.getNodeType() == Node.TEXT_NODE) {
                // Text node: print it unless it is whitespace only.
                final String txt = child.getNodeValue();
                if (txt != null && txt.trim().length() > 0) {
                    System.out.println(level + txt);
                }
            }
        }
    }
}
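
Since parser.getDocument() returns an ordinary org.w3c.dom.Document, the tree does not have to be walked by hand; it can also be passed to the standard XML APIs. Below is a minimal sketch that collects link targets with XPath. To keep it runnable without nekohtml.jar, it builds the Document with the JDK's own DocumentBuilder from a small well-formed snippet, which is only a stand-in for the NekoHTML parser. The class name XPathOnDomDemo and the helpers parse and collectLinks are made up for this illustration. As far as I know, NekoHTML reports element names in upper case by default, so against its real output the expression may need to be "//A[@href]"; treat that as something to verify.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOnDomDemo {

    // Stand-in for NekoHTML (assumption): parses a well-formed snippet with the JDK parser.
    static Document parse(final String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the href value of every <a> element that has an href attribute.
    static List<String> collectLinks(final Document doc) throws Exception {
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final NodeList anchors = (NodeList) xpath.evaluate(
                "//a[@href]", doc, XPathConstants.NODESET);
        final List<String> hrefs = new ArrayList<String>();
        for (int idx = 0; idx < anchors.getLength(); idx++) {
            hrefs.add(((Element) anchors.item(idx)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(final String[] args) throws Exception {
        final String xhtml = "<html><body><a href=\"/one\">1</a>"
                + "<p><a href=\"/two\">2</a></p><a>no href</a></body></html>";
        System.out.println(collectLinks(parse(xhtml))); // prints [/one, /two]
    }
}
```

In the real program, the Document passed to collectLinks would be the one obtained from parser.getDocument().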

Impressions

NekoHTML lets you handle HTML as XML nodes, so its main advantage seems to be that you get the benefits of XML-related tools as they are.
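However, for some reason it does not seem to read escaped characters such as "&nbsp;" properly. They come out as a plain "?". I assume this is a problem with how I am using it, but I don't really understand this behavior.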
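
One possible explanation, offered as a guess rather than a confirmed diagnosis: the parser may be resolving &nbsp; to U+00A0 correctly, and the "?" appears only when System.out encodes the text into a console charset that has no such character. The sketch below shows that effect using only the JDK. US-ASCII stands in for the console encoding (an assumption; the actual console here was presumably a Japanese encoding), and the class name NbspDemo with its helpers describe and roundTrip is hypothetical. Printing code points instead of characters is one way to check what the parser really returned.

```java
import java.nio.charset.StandardCharsets;

class NbspDemo {

    // Renders each code point as U+XXXX so invisible characters can be inspected.
    static String describe(final String text) {
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            final int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Encodes to US-ASCII and back; unmappable characters are replaced with '?'.
    static String roundTrip(final String text) {
        final byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(final String[] args) {
        final String text = "a\u00A0b"; // what "a&nbsp;b" is expected to become in the DOM
        System.out.println(describe(text));  // prints U+0061 U+00A0 U+0062
        System.out.println(roundTrip(text)); // prints a?b
    }
}
```

If describe applied to the text from child.getNodeValue() shows U+00A0, the character was read correctly and only the output step is losing it.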
Pages written in EUC-JP, such as "yahoo.co.jp", were also read without any garbled characters, so it does not look like anything special has to be done about encoding.
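
Should some page come out garbled after all, one option would be to give the parser an explicit hint. A rough sketch, under assumptions: it pulls the charset out of a Content-Type value such as the one URLConnection.getContentType() returns and sets it with InputSource.setEncoding. Both of those are standard JDK calls, but whether NekoHTML gives this hint priority over its own detection is an assumption I have not verified. The class name EncodingHint and its helpers charsetOf and toInputSource are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Locale;

import org.xml.sax.InputSource;

class EncodingHint {

    // Extracts the charset parameter from a Content-Type value, or null if absent.
    static String charsetOf(final String contentType) {
        if (contentType == null) {
            return null;
        }
        for (final String part : contentType.split(";")) {
            final String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.length() == 0 ? null : value;
            }
        }
        return null;
    }

    // Wraps the stream and sets the encoding only when the header names one.
    static InputSource toInputSource(final InputStream in, final String contentType) {
        final InputSource source = new InputSource(in);
        final String charset = charsetOf(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return source;
    }

    public static void main(final String[] args) {
        final InputSource source = toInputSource(
                new ByteArrayInputStream(new byte[0]), "text/html; charset=EUC-JP");
        System.out.println(source.getEncoding()); // prints EUC-JP
    }
}
```

In the test program this would replace new InputSource(is) with EncodingHint.toInputSource(is, urlConnection.getContentType()).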