I’m pleased to announce that dplyr 0.7.0 is now on CRAN! (This was dplyr 0.6.0 previously; more on that below.) dplyr provides a “grammar” of data transformation, making it easy and elegant to solve the most common data manipulation challenges. dplyr supports multiple backends: as well as in-memory data frames, you can also use it with remote SQL databases. If you haven’t heard of dplyr before, the best place to start is the Data transformation chapter in R for Data Science.

You can install the latest version of dplyr with:

install.packages("dplyr")

Features

dplyr 0.7.0 is a major release including over 100 improvements and bug fixes, as described in the release notes. In this blog post, I want to discuss one big change and a handful of smaller updates. This version of dplyr also saw a major revamp of database connections. That’s a big topic, so it’ll get its own blog post next week.

Tidy evaluation

The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short. Tidy eval is a system for capturing expressions and later evaluating them in the correct context. It is is important because it allows you to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 x 3
#>         homeworld   height  mass
#>
#>  1       Alderaan 176.3333  64.0
#>  2    Aleen Minor  79.0000  15.0
#>  3         Bespin 175.0000  79.0
#>  4     Bestine IV 180.0000 110.0
#>  5 Cato Neimoidia 191.0000  90.0
#>  6          Cerea 198.0000  82.0
#>  7       Champala 196.0000   NaN
#>  8      Chandrila 150.0000   NaN
#>  9   Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

This makes it possible to write your functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:

starwars_mean <- function(my_var) {
  my_var <- enquo(my_var)

  starwars %>%
    group_by(!!my_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}
starwars_mean(homeworld)

You can also use the new .data pronoun to refer to variables with strings:

my_var <- "homeworld"

starwars %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.

To learn more about how tidy eval helps solve data analysis challenge, please read the new programming with dplyr vignette. Tidy evaluation is implemented in the rlang package, which also provides a vignette on the theoretical underpinnings. Tidy eval is a rich system and takes a while to get your head around it, but we are confident that learning tidy eval will pay off, especially as it roles out to other packages in the tidyverse (tidyr and ggplot2 are next on the todo list).

The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).

Character encoding

We have done a lot of work to ensure that dplyr works with encodings other than Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data. Please let us know if you have problems!

New datasets

dplyr has some new datasets that will help write more interesting examples:

starwars
#> # A tibble: 87 x 13
#>                  name height  mass    hair_color  skin_color eye_color
#>
#>  1     Luke Skywalker    172    77         blond        fair      blue
#>  2              C-3PO    167    75                  gold    yellow
#>  3              R2-D2     96    32           white, blue       red
#>  4        Darth Vader    202   136          none       white    yellow
#>  5        Leia Organa    150    49         brown       light     brown
#>  6          Owen Lars    178   120   brown, grey       light      blue
#>  7 Beru Whitesun lars    165    75         brown       light      blue
#>  8              R5-D4     97    32            white, red       red
#>  9  Biggs Darklighter    183    84         black       light     brown
#> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
#> # ... with 77 more rows, and 7 more variables: birth_year ,
#> #   gender , homeworld , species , films ,
#> #   vehicles , starships
storms
#> # A tibble: 10,010 x 13
#>     name  year month   day  hour   lat  long              status category
#>
#>  1   Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1
#>  2   Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1
#>  3   Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1
#>  4   Amy  1975     6    27    18  30.5 -79.0 tropical depression       -1
#>  5   Amy  1975     6    28     0  31.5 -78.8 tropical depression       -1
#>  6   Amy  1975     6    28     6  32.4 -78.7 tropical depression       -1
#>  7   Amy  1975     6    28    12  33.3 -78.0 tropical depression       -1
#>  8   Amy  1975     6    28    18  34.0 -77.0 tropical depression       -1
#>  9   Amy  1975     6    29     0  34.4 -75.8      tropical storm        0
#> 10   Amy  1975     6    29     6  34.0 -74.8      tropical storm        0
#> # ... with 10,000 more rows, and 4 more variables: wind ,
#> #   pressure , ts_diameter , hu_diameter
band_members
#> # A tibble: 3 x 2
#>    name    band
#>
#> 1  Mick  Stones
#> 2  John Beatles
#> 3  Paul Beatles
band_instruments
#> # A tibble: 3 x 2
#>    name  plays
#>
#> 1  John guitar
#> 2  Paul   bass
#> 3 Keith guitar

New and improved verbs

mtcars %>% pull(-1) %>% str()
#>  num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
mtcars %>% pull(cyl) %>% str()
#>  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

Thanks to Paul Poncet for the idea!

iris %>% summarise_if(is.numeric, mean)
starwars %>% select_if(Negate(is.list))
storms %>% group_by_at(vars(month:hour))

Other important changes

You can change the default behaviour by calling pkgconfig::set_config("dplyr::na_matches", "never").

Breaking changes

From time-to-time I discover that I made a mistake in an older version of dplyr and developed what is now a clearly suboptimal API. If the problem isn’t too big, I try to just leave it - the cost of making small improvements is not worth it when compared to the cost of breaking existing code. However, there are bigger improvements where I believe the short-term pain of breaking code is worth the long-term payoff of a better API.

Regardless, it’s still frustrating when an update to dplyr breaks your code. To minimise this pain, I plan to do two things going forward:

Contributors

dplyr is truly a community effort. Apart from the dplyr team (myself, Kirill Müller, and Lionel Henry), this release wouldn’t have been possible without patches from Christophe Dervieux, Dean Attali, Ian Cook, Ian Lyttle, Jake Russ, Jay Hesselberth, Jennifer (Jenny) Bryan, @lindbrook, Mauro Lepore, Nicolas Coutin, Daniel, Tony Fischetti, Hiroaki Yutani and Sergio Oller. Thank you all for your contributions!