Parsing HTML in C# with Html Agility Pack

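Why Html Agility Pack? (hereafter HAP)

There are many ways to parse HTML in .NET; Microsoft itself provides MSHTML for manipulating HTML files. After some searching, though, Html Agility Pack stood out: it is the most frequently recommended C# HTML parser on Stack Overflow. HAP is open source, easy to use, and fast.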

How to use HAP?

1. Download it from http://htmlagilitypack.codeplex.com/

2. Unzip the archive

3. In your Visual Studio solution, right-click the project -> Add Reference -> select HtmlAgilityPack.dll in the unzipped folder -> OK

4. Add using HtmlAgilityPack; at the top of your code

    HtmlWeb webClient = new HtmlWeb();
    HtmlDocument doc = webClient.Load("http://xxx");

    // SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection hrefList = doc.DocumentNode.SelectNodes(".//a[@href]");

    if (hrefList != null)
    {
        foreach (HtmlNode href in hrefList)
        {
            HtmlAttribute att = href.Attributes["href"];
            doSomething(att.Value);
        }
    }

The example above loads a web page, selects every link (that is, every <a href=...></a>), and while iterating reads each link's target (href.Attributes["href"].Value) and passes it to doSomething().

Q: How do I select an HTML node by its ID?

A: Use @id='xxx', e.g.,

    HtmlNode bugSum = doc.DocumentNode.SelectSingleNode("//h2[@id='summary']");

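Q: How do I get a node's text content or HTML content?

A: Use InnerText for the text alone, InnerHtml for the markup inside the node, and OuterHtml for the markup including the node's own tag:

    node.InnerText.Trim()
    node.InnerHtml
    node.OuterHtml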
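To make the difference between the three properties concrete, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced and that the project allows C# top-level statements; the sample HTML string is made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

var doc = new HtmlDocument();
doc.LoadHtml("<div id='d'> Hello <b>world</b> </div>"); // made-up sample markup

HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='d']");

string text = node.InnerText.Trim(); // text only, tags stripped
string inner = node.InnerHtml;       // markup between <div ...> and </div>
string outer = node.OuterHtml;       // markup including the <div> tag itself

Console.WriteLine(text);
Console.WriteLine(inner);
Console.WriteLine(outer);
```

InnerText keeps the surrounding whitespace of the source, which is why the original snippet calls Trim() on it.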

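Q: How do I find nodes within the HTML tree?

A: For example, to find, starting from the root, the first table under the div with id=container:

    HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@id='container']/table[1]");

Note that a path starting with "//" searches from the document root, and the double slash matches descendants at any depth. A single slash "/" matches direct children only (it does not look at grandchildren). A leading ".//" searches the descendants of the current node rather than the root. Continuing from the line above, to select every tr that is a direct child of table: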

    HtmlNodeCollection tr = table.SelectNodes("./tr");
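The path prefixes described above can be compared side by side. The following is a small sketch under the same assumptions as before (HtmlAgilityPack referenced, top-level statements); the markup and the variable names are invented for the demonstration. It counts the span elements matched by each form of path.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id='a'><span>1</span><div><span>2</span></div></div>" +
    "<div id='b'><span>3</span></div>"); // made-up sample markup

HtmlNode a = doc.DocumentNode.SelectSingleNode("//div[@id='a']");

int all = doc.DocumentNode.SelectNodes("//span").Count; // every span in the document
int direct = a.SelectNodes("./span").Count;             // direct children of a only
int nested = a.SelectNodes(".//span").Count;            // all descendants of a
int fromRoot = a.SelectNodes("//span").Count;           // "//" starts at the root even when called on a

Console.WriteLine($"{all} {direct} {nested} {fromRoot}"); // 3 1 2 3
```

The last line is the case that tends to surprise people: calling SelectNodes on a node does not restrict a "//" path to that node, so use ".//" when you mean "below this node".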

Q: How do I get a node's ID?

A: Simple: node.Id
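As a short illustration of reading IDs, the sketch below uses the Id property together with GetElementbyId and GetAttributeValue. It assumes the HtmlAgilityPack package is referenced and top-level statements are available; the sample markup is made up, and "(none)" is just an arbitrary default chosen for the example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

var doc = new HtmlDocument();
doc.LoadHtml("<h2 id='summary'>Bug summary</h2><h2>No id here</h2>"); // made-up markup

foreach (HtmlNode h2 in doc.DocumentNode.SelectNodes("//h2"))
{
    // GetAttributeValue returns the supplied default when the attribute is absent
    string id = h2.GetAttributeValue("id", "(none)");
    Console.WriteLine($"{h2.InnerText}: {id}");
}

// Look the node up by ID, then read the ID back through the Id property
HtmlNode summary = doc.GetElementbyId("summary");
string summaryId = summary.Id;
Console.WriteLine(summaryId);
```

Note the spelling: the property is Id and the lookup method is GetElementbyId (lowercase "b"), and C# is case sensitive.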

Q: If the HTML is held in a string, can Html Agility Pack still process it?

A: Yes. Load the string first; everything after that works the same way:

    // load the original html
    string html = "some html stuff";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

Q: I processed the loaded HTML (changed the content of some nodes, deleted some nodes, and so on), so why is the result unchanged?

A: Perhaps you forgot to save your changes. Assuming the HTML is held in a string:

    // requires: using System.IO; using System.Text;

    // load the original html
    string html = "some html stuff";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // make some changes
    doSomething();

    // save the changes; sb.ToString() then holds the modified html
    var sb = new StringBuilder();
    using (var writer = new StringWriter(sb))
    {
        doc.Save(writer);
    }
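Here is a self-contained version of the load, modify, save round trip, as a sketch only. It assumes the HtmlAgilityPack package is referenced and top-level statements are available. The sample markup and the specific change (replacing a node's InnerHtml) stand in for the doSomething() placeholder and are not from the original post.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

// Load the original html (made-up sample)
var doc = new HtmlDocument();
doc.LoadHtml("<div id='msg'>old text</div>");

// Make a change: replace the content of the div
HtmlNode msg = doc.DocumentNode.SelectSingleNode("//div[@id='msg']");
msg.InnerHtml = "new text";

// Save the changed document into a string
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
    doc.Save(writer);
}
string saved = sb.ToString();

Console.WriteLine(saved);
```

The original html string is never modified; the change only exists in the HtmlDocument until it is written out, which is why the saved string is the one to use afterwards.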

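Q: How do I strip the outer HTML tag and keep only its content?

A: Use the RemoveChild method. Suppose the node is <a href=xxx>ABCD</a> and you want to keep ABCD without the <a></a>. First get hold of that node, say it is called link:

    link.ParentNode.RemoveChild(link, true);

The second argument, true, keeps the grandchildren, that is, the children of the removed node (here the content ABCD), which stay in the parent. Passing false removes the node together with everything inside it.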
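The effect of the second argument can be seen by running both variants on the same input. This is a minimal sketch assuming the HtmlAgilityPack package is referenced and top-level statements are available; the markup and the variable names keep and drop are made up for the example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

string html = "<div>See <a href='xxx'>ABCD</a> here</div>"; // made-up sample

// Variant 1: remove the <a> element but keep its children (the text ABCD)
var keep = new HtmlDocument();
keep.LoadHtml(html);
HtmlNode link = keep.DocumentNode.SelectSingleNode("//a");
link.ParentNode.RemoveChild(link, true);
string kept = keep.DocumentNode.InnerText;

// Variant 2: remove the <a> element together with its content
var drop = new HtmlDocument();
drop.LoadHtml(html);
link = drop.DocumentNode.SelectSingleNode("//a");
link.ParentNode.RemoveChild(link, false);
string dropped = drop.DocumentNode.InnerText;

Console.WriteLine(kept);
Console.WriteLine(dropped);
```

In both variants the a element itself is gone from the tree; only the first one still contains the text ABCD.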

Q: How do I insert a node?

A: There are two ways. The first is to call the Insert method directly:

    HtmlNode pathNode = doc.CreateElement("tr");
    HtmlNode tbody = doc.DocumentNode.SelectSingleNode("//table[1]");
    tbody.ChildNodes.Insert(0, pathNode);

Here pathNode is the node to insert, and we want it to become the first row of the table (the node held in tbody). The first argument of Insert, 0, is the position: pathNode is inserted into tbody as its first child node.

This approach is awkward when you do not know the index at which to insert. If, for example, you want to place a new node before or after a particular node, the InsertBefore / InsertAfter methods are more convenient:

    HtmlNode parentNode = trNode.ParentNode;
    HtmlNode tr = postdoc.CreateElement("tr");
    parentNode.InsertBefore(tr, trNode);

Here the new node tr is inserted before trNode. Note that InsertBefore must be called on the parent node of trNode.
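The sketch below puts InsertBefore and InsertAfter together on a tiny table so the resulting order can be checked. It assumes the HtmlAgilityPack package is referenced and top-level statements are available; the markup and the id values "before" and "after" are invented so the rows can be told apart.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

var doc = new HtmlDocument();
doc.LoadHtml("<table><tr id='r1'><td>1</td></tr></table>"); // made-up sample

HtmlNode trNode = doc.DocumentNode.SelectSingleNode("//tr[@id='r1']");
HtmlNode parentNode = trNode.ParentNode;

// New row placed in front of the existing one
HtmlNode before = doc.CreateElement("tr");
before.SetAttributeValue("id", "before");
parentNode.InsertBefore(before, trNode);

// New row placed right behind the existing one
HtmlNode after = doc.CreateElement("tr");
after.SetAttributeValue("id", "after");
parentNode.InsertAfter(after, trNode);

// List the rows in document order
HtmlNodeCollection rows = parentNode.SelectNodes("./tr");
foreach (HtmlNode row in rows)
{
    Console.WriteLine(row.GetAttributeValue("id", ""));
}
```

Both calls are made on parentNode, not on trNode, and the existing row is passed as the second argument to mark the reference position.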

Source: http://blog.csdn.net/flying881114/article/details/6610888


2012-4-18 | Category: Asp.Net/C#/WCF