Sunday, December 5, 2010

Web Scraping: Search, Extract and Work with data from across the web

I read this excellent article by Marshall Kirkpatrick.

Needlebase, Dapper and Extractiv are web tools that are democratizing the ability to extract and work with data from across the web. They are to text processing what blogging was to text publishing.

Dapper, now acquired by Yahoo, is perhaps the least impressive of the three tools. It lets you build an RSS feed from changes made to any field on any web page.
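To make the idea of "a feed from changes to a field" concrete, here is a minimal Python sketch. It is not Dapper's implementation; the function names, the hashing approach and the item layout are my own assumptions. It fingerprints the text of a watched field and only produces an RSS-style item when that text differs from the last time it was seen.

```python
import hashlib
import xml.etree.ElementTree as ET


def field_fingerprint(text):
    """Return a stable hash of the field's text, ignoring surrounding whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def feed_item_if_changed(previous_fingerprint, field_text, page_url):
    """Compare the field with its previous fingerprint.

    Returns (current_fingerprint, item_xml). item_xml is None when the
    field is unchanged, otherwise an RSS-style <item> as a string.
    Both names are hypothetical helpers for this illustration.
    """
    current = field_fingerprint(field_text)
    if current == previous_fingerprint:
        return current, None

    item = ET.Element("item")
    ET.SubElement(item, "title").text = "Field changed"
    ET.SubElement(item, "link").text = page_url
    ET.SubElement(item, "description").text = field_text.strip()
    return current, ET.tostring(item, encoding="unicode")
```

In practice you would fetch the page on a schedule, pull out the field you care about, and append each non-empty item to the feed.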

The brand-new Extractiv is a bulk web-crawling and semantic analysis tool that seems very easy to use. It is free for up to 1,000 URLs per web crawl, one web crawl at a time, and up to 1,000 docs/day. The paid plans are not very expensive ($99 per month and $299 per month).

Needlebase sits in the middle, and it's free. Needlebase was built as a side project of travel search company ITA Software (Google is currently in legal negotiations to acquire ITA). It is a great new point-and-click tool for extracting, sorting and visualizing data from pages around the web. Needlebase lets you view web pages through a virtual browser, then point and click to train it to understand which fields on that page are of interest to you and how those fields relate to each other. Then the program goes and scrapes the data from all of those fields, publishes them into a table, list or map, and recommends merges of cells that appear to be mistakenly separate. It's very cool, and it lets non-technical people quickly and easily do things with data that used to require the assistance of someone more technical.
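For the curious, the scrape-then-merge workflow can be sketched by hand in a few lines of Python. This is only an illustration under my own assumptions, not how Needlebase works internally: I assume each record on the page is marked with a `record` CSS class and each field with a class named after the field, and I use a simple string-similarity threshold to suggest merges. The class names, `FieldExtractor` and `suggest_merges` are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose CSS class names a field of interest."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)  # field names we were "trained" on
        self.rows = []             # one dict per record found on the page
        self._current = None       # field whose text we are about to read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "record" in classes:    # assumed marker for the start of a record
            self.rows.append({})
        for cls in classes:
            if cls in self.fields and self.rows:
                self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.rows[-1][self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def suggest_merges(rows, field, threshold=0.8):
    """Return pairs of distinct values that look like the same thing."""
    values = sorted({row[field] for row in rows if field in row})
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```

Feed the extractor a page with `extractor.feed(html)`, read the table from `extractor.rows`, and pass it to `suggest_merges(extractor.rows, "name")` to get candidate duplicates for a human to confirm.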

Power is now on your side ...