I’m pleased to announce that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:
Read XML and HTML with read_xml() and read_html().
Navigate the tree with xml_children(), xml_siblings() and xml_parent(). Alternatively, use XPath to jump directly to the nodes you’re interested in with xml_find_one() and xml_find_all(). Get the full path to a node with xml_path().
Extract various components of a node with xml_text(), xml_attrs(), xml_attr(), and xml_name().
Convert to list with as_list().
Where appropriate, functions support namespaces with a global url -> prefix lookup table. See xml_ns() for more details.
Convert relative URLs to absolute with url_absolute(), and transform in the opposite direction with url_relative(). Escape and unescape special characters with url_escape() and url_unescape().
Support for modifying and creating XML documents is planned in a future version.
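As a rough sketch of how the navigation and extraction functions listed above fit together, here is a short session on a small made-up document. The element names, attributes and values are invented purely for illustration.

```r
library(xml2)

# A made-up document, written on one line so no whitespace text is involved.
doc <- read_xml("<library><book id='b1' lang='en'><title>R Basics</title></book><book id='b2'><title>XML in R</title></book></library>")

# Navigate: children of the root, then back up and sideways from one child.
books <- xml_children(doc)
first <- books[[1]]
xml_name(xml_parent(first))
#> [1] "library"
xml_name(xml_siblings(first))
#> [1] "book"

# Extract: text, a single attribute (NA where it is missing), all attributes.
xml_text(books)
#> [1] "R Basics" "XML in R"
xml_attr(books, "lang")
#> [1] "en" NA
xml_attrs(first)

# Convert the whole tree to a nested R list and inspect its structure.
str(as_list(doc))
```

xml_attr() is vectorised over the nodeset and returns NA for nodes that lack the attribute, whereas xml_attrs() returns every attribute of a node as a named character vector.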
This package owes a debt of gratitude to Duncan Temple Lang, whose XML package has made it possible to use XML with R for almost 15 years!
Usage
You can install it by running:
install.packages("xml2")
(If you’re on a Mac, you might need to wait a couple of days: CRAN is busy rebuilding all the packages for R 3.2.0, so it’s running a bit behind.)
Here’s a small example working with an inline XML document:
library(xml2)
x <- read_xml("<foo>
<bar>text <baz id = 'a' /></bar>
<bar>2</bar>
<baz id = 'b' />
</foo>")
xml_name(x)
#> [1] "foo"
xml_children(x)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
baz
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
xml_path(baz)
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"
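The same approach should carry over to HTML. The following is a minimal sketch, assuming an inline HTML fragment and a hypothetical base URL (http://example.com/site/), that pulls out the links and resolves them with url_absolute():

```r
library(xml2)

# Made-up HTML fragment; the base URL used below is also hypothetical.
page <- read_html("<html><body><a href='/about'>About</a> <a href='docs/intro.html'>Intro</a></body></html>")

# Find every <a> node, then read its text and href attribute.
links <- xml_find_all(page, ".//a")
xml_text(links)
#> [1] "About" "Intro"
hrefs <- xml_attr(links, "href")

# Resolve the relative hrefs against the assumed base URL.
url_absolute(hrefs, "http://example.com/site/")
#> [1] "http://example.com/about"
#> [2] "http://example.com/site/docs/intro.html"

# Escape and unescape special characters in a URL component.
url_escape("a b")
#> [1] "a%20b"
url_unescape("a%20b")
#> [1] "a b"
```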
Development
Xml2 is still under active development. If you notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.