A Scraping Library

As part of a project I’m working on, I needed to get documents from state institutions. Instead of writing code specific to each site, I decided to try creating a “universal” document scraper. It can be found as a separate module within the main project: https://github.com/Glamdring/state-alerts/. The project is written in Scala and can be used in any JVM project (provided you add the Scala library jar as a dependency). It is meant for scraping documents rather than arbitrary data. It can probably be extended to do that, but for now I’d like it to stay oriented towards (state) documents and open data, rather than become a tool for commercial scraping (which is often frowned upon).

It is now in a more or less stable form: I’ve already deployed the application and it works properly, so I’ll just share a short description of the functionality. The point is to be able to specify scraping via configuration only. The class used to configure individual scraping instances is ExtractorDescriptor. There you specify a number of things:

  • Target URL, HTTP method, and body parameters (in case of POST). You can put a placeholder {x} in the URL, which will be used for paging
  • The type of document (PDF, doc, HTML) and the type of the scraping workflow, i.e. how the document is reached from the target page. There are 4 options, depending on whether there’s a separate details page, whether there’s only a table, and where the link to the document is located
  • XPath expressions for the elements containing metadata and the links to the documents. There’s a different expression depending on where the information is located: in a table or on a separate details page
  • Date format for the date of the document; optionally a regex can be used, in case the date cannot be located precisely by XPath
  • Simple “heuristics”: if you know the URL structure of the document you are looking for, there’s no need to locate it via XPath
  • Other configuration, like JavaScript requirements, whether scraping should fail on error, etc.
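
To make two of these settings more concrete, here is a minimal sketch in plain Java of what the {x} paging placeholder and the date format with a regex fallback amount to. It does not use the library: the class DescriptorSketch, its method names, the example URL and the sample text are all made up for illustration, and the library’s real implementation may well differ.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the library: it only illustrates
// the idea behind two of the descriptor settings listed above.
final class DescriptorSketch {

    // Replace the {x} paging placeholder with a concrete page number.
    static String pageUrl(String urlTemplate, int page) {
        return urlTemplate.replace("{x}", Integer.toString(page));
    }

    // Find the date inside surrounding text with a regex (first capturing
    // group), then parse the match using the configured date format.
    static Optional<LocalDate> extractDate(String text, String dateRegex, String dateFormat) {
        Matcher matcher = Pattern.compile(dateRegex).matcher(text);
        if (!matcher.find()) {
            return Optional.empty(); // no date in this piece of text
        }
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateFormat);
        return Optional.of(LocalDate.parse(matcher.group(1), formatter));
    }

    public static void main(String[] args) {
        // Assumed example values; any site will have its own URL and formats.
        System.out.println(pageUrl("http://example.org/documents?page={x}", 3));
        System.out.println(extractDate("Published on 14.03.2013 by the council",
                "(\\d{2}\\.\\d{2}\\.\\d{4})", "dd.MM.yyyy"));
    }
}
```

The regex only isolates the date text; the date format still does the actual parsing, which is why both settings can be needed for the same site.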

When you have an ExtractorDescriptor instance ready (for Java apps you can use the builder to create one), you can create a new Extractor(descriptor), and then (usually from a scheduled job) call extractor.extractDocuments(since).
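
A scheduled job around that call could look roughly like the sketch below. Only the call shape extractDocuments(since) comes from the library; DocumentSource is a hypothetical stand-in for Extractor, and the use of Instant for the timestamp and String for the documents are assumptions made to keep the example self-contained.

```java
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for the library's Extractor. The parameter and
// return types here are assumptions, not the library's real signature.
interface DocumentSource {
    List<String> extractDocuments(Instant since);
}

// One polling step: fetch everything newer than the last successful run.
final class PollingJob implements Runnable {
    private final DocumentSource source;
    private final Consumer<List<String>> sink;
    private Instant since;

    PollingJob(DocumentSource source, Consumer<List<String>> sink, Instant initialSince) {
        this.source = source;
        this.sink = sink;
        this.since = initialSince;
    }

    @Override
    public synchronized void run() {
        Instant startedAt = Instant.now();
        try {
            sink.accept(source.extractDocuments(since));
            since = startedAt; // advance only after a successful run
        } catch (RuntimeException e) {
            // Keep the old timestamp so the next run retries the same window.
            // Catching here also matters because an exception escaping the task
            // would stop scheduleAtFixedRate from running it again.
            System.err.println("Scraping failed, will retry: " + e.getMessage());
        }
    }

    synchronized Instant since() {
        return since;
    }
}

final class SchedulerSketch {
    // Run the job immediately and then once per period, on a single thread.
    static ScheduledExecutorService start(Runnable job, long periodHours) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(job, 0, periodHours, TimeUnit.HOURS);
        return scheduler;
    }
}
```

With the real library you would wrap the Extractor in a DocumentSource and pass something like new PollingJob(source, documents -> store(documents), Instant.EPOCH) to SchedulerSketch.start, where store is whatever persistence you already have. Taking the timestamp before the extraction starts means documents published during a run are picked up by the next one, at the cost of occasionally seeing the same document twice.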

The result is a list of documents (there are two methods: one returns a Scala list and one returns a Java list).

The library depends on htmlunit, nekohtml, scala, xml-apis and some more, all visible in the pom. It doesn’t support multiple parsers. It also doesn’t handle distributed running of scraping tasks; this you should handle yourself. No jar release or Maven dependency is published yet, so if you need it, you have to check it out and build it yourself. I hope it is useful, though. If not as code, then at least as an approach to getting data from web pages programmatically.
