IronPython, Beautiful Soup, win32 app

Does Beautiful Soup work with IronPython? If so, with which version of IronPython? How easy is it to distribute a Windows desktop app on .NET 2.0 using IronPython (mostly C# calling some Python code for parsing HTML)?

Recommended answer

I was asking myself this same question. After struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code, I decided to go looking for an alternative, native .NET solution. BeautifulSoup is a wonderful bit of code, and at first it didn't look like there was anything comparable available for .NET, but then I found the HTML Agility Pack. If anything, I think I've actually gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces an elegant XML DOM from it that can be queried via XPath. With a couple of lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.
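
To illustrate the XDocument remark, here is a minimal sketch of one way it might be done, not necessarily the approach the answer had in mind. It asks the HTML Agility Pack to serialize the parsed document as XML and then hands that text to LINQ to XML. The class name XDocumentSketch, the helper ToXDocument, and the sample HTML string are made up for illustration. Note that LINQ to XML needs .NET 3.5 or later, so this would not run on the .NET 2.0 target mentioned in the question.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

public class XDocumentSketch
{
    // Hypothetical helper: parse HTML with the HTML Agility Pack,
    // write it back out as XML, and load that XML into an XDocument.
    public static XDocument ToXDocument(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // serialize as XML rather than HTML
        doc.LoadHtml(html);

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            return XDocument.Parse(writer.ToString());
        }
    }

    public static void Main()
    {
        // Invented sample input with an unclosed <br> tag.
        string html = "<html><body><table><tr>"
                    + "<td>January 1</td><td>New Year's Day<br></td>"
                    + "</tr></table></body></html>";

        XDocument xdoc = ToXDocument(html);

        // LINQ to XML query: the text of every table cell.
        var cellTexts = from cell in xdoc.Descendants("td")
                        select cell.Value.Trim();

        foreach (string text in cellTexts)
        {
            Console.WriteLine(text);
        }
    }
}
```

Going through a string keeps the sketch short; for large pages it may be worth checking whether writing to an XmlWriter directly suits you better.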

Edit

Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:

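using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace GovParsingTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download the page and parse it into an HtmlDocument.
            HtmlWeb hw = new HtmlWeb();
            string url = @"http://www.house.gov/house/House_Calendar.shtml";
            HtmlDocument doc = hw.Load(url);

            // The calendar table lives inside <div id="primary">.
            HtmlNode docNode = doc.DocumentNode;
            HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");

            // SelectSingleNode and SelectNodes return null when nothing matches.
            HtmlNodeCollection tableRows = (div == null) ? null : div.SelectNodes(".//tr");

            if (tableRows == null)
            {
                Console.WriteLine("No table rows found; the page layout may have changed.");
            }
            else
            {
                foreach (HtmlNode row in tableRows)
                {
                    // Skip header rows and anything without a date cell and an event cell.
                    HtmlNodeCollection cells = row.SelectNodes(".//td");
                    if (cells == null || cells.Count < 2)
                    {
                        continue;
                    }

                    HtmlNode dateNode = cells[0];
                    HtmlNode eventNode = cells[1];

                    // Drill down to the first leaf node so only its text is printed.
                    while (eventNode.HasChildNodes)
                    {
                        eventNode = eventNode.FirstChild;
                    }

                    Console.WriteLine(dateNode.InnerText);
                    Console.WriteLine(eventNode.InnerText);
                    Console.WriteLine();
                }
            }

            //Console.WriteLine(div.InnerHtml);
            Console.ReadKey();
        }
    }
}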