- Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.
- Connect to External Data Sources using spark_read_source(), spark_write_source(), spark_read_jdbc() and spark_write_jdbc().
- Use the Latest Frameworks including dplyr 0.7, DBI 0.7, RStudio 1.1 and Spark 2.2.
and several improvements across:
- Spark Connections add a new Databricks connection that enables using sparklyr in Databricks through mode = "databricks", and add support for Yarn Cluster through master = "yarn-cluster"; connection speed was also improved.
- Dataframes add support for sdf_pivot(), sdf_separate_column(), sdf_bind_rows() and sdf_bind_cols(), all described below.
- Machine Learning adds support for multinomial regression in ml_logistic_regression(), a new ml_model_data() function, and a vocabulary.only parameter in ft_count_vectorizer() to simplify LDA.
- Many other improvements: initial support for broom over ml_generalized_linear_regression(); dplyr support for the %regexp% operator; download_scalac() to help you install the required Scala compilers while developing extensions; and simpler Hive database management with src_databases() to query the available databases (a couple of these are sketched right after this list). RStudio also started a joint effort with Microsoft to support a cross-platform Spark installer under github.com/rstudio/spark-install.
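As a quick sketch of a couple of these (assuming the connection sc and the Spark table iris_tbl created later in this post):

# dplyr filtering with the new %regexp% operator, translated to Spark SQL
iris_tbl %>% dplyr::filter(Species %regexp% "^vir")

# List the Hive databases available to this connection
src_databases(sc)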
Additional changes and improvements can be found in the sparklyr NEWS file.
Distributed R
sparklyr 0.6 provides support for executing distributed R code through spark_apply(). For instance, after connecting and copying some data:
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris)
We can apply an arbitrary R function, say jitter(), to each column over each row as follows:
iris_tbl %>% spark_apply(function(e) sapply(e[,1:4], jitter))
One can also group by columns to apply an operation over each group of rows, say, to perform linear regression over each group as follows:
spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
Packages can be used since they are automatically distributed to the worker nodes; however, using spark_apply() requires R to be installed on each worker node. Please refer to Distributing R Computations for additional information and examples.
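For instance, here is a minimal sketch of calling a package function from inside spark_apply(); it assumes the stringr package is installed on the driver, so sparklyr can distribute it to the workers:

# stringr is shipped to the workers along with the closure (assumed installed locally)
iris_tbl %>%
  spark_apply(function(e) data.frame(upper = stringr::str_to_upper(e$Species)))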
External Data Sources
sparklyr 0.6 adds support for connecting Spark to databases. This feature is useful if your Spark environment is separate from your data environment, or if you use Spark to access multiple data sources. You can use spark_read_source() and spark_write_source() with any data connector available in Spark Packages. Alternatively, you can use spark_read_jdbc() and spark_write_jdbc() with a JDBC driver for almost any data source.
For example, you can connect to Cassandra using spark_read_source(). Notice that the Cassandra connector version needs to match the Spark version, as defined in their version compatibility section.
config <- spark_config()
config[["sparklyr.defaultPackages"]] <- c(
  "datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11")

sc <- spark_connect(master = "local", config = config)

spark_read_source(
  sc,
  "emp",
  "org.apache.spark.sql.cassandra",
  list(keyspace = "dev", table = "emp"))
To connect to MySQL, one can download the MySQL connector and use spark_read_jdbc() as follows:
config <- spark_config()
config$`sparklyr.shell.driver-class-path` <-
  "~/Downloads/mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar"

sc <- spark_connect(master = "local", config = config)

spark_read_jdbc(sc, "person_jdbc", options = list(
  url = "jdbc:mysql://localhost:3306/sparklyr",
  user = "root",
  password = "<password>",
  dbtable = "person"))
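Writing back works similarly; here is a sketch using spark_write_jdbc() (the person_copy target table name is hypothetical):

# Persist a Spark DataFrame back to MySQL (person_copy is a made-up table name)
person_tbl <- dplyr::tbl(sc, "person_jdbc")
spark_write_jdbc(person_tbl, name = "person_copy", options = list(
  url = "jdbc:mysql://localhost:3306/sparklyr",
  user = "root",
  password = "<password>",
  dbtable = "person_copy"))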
See also crassy, a sparklyr extension being developed to read data from Cassandra with ease.
Dataframes
sparklyr 0.6 includes many improvements for working with DataFrames. Here are some additional highlights, illustrated with these two example tables:
x_tbl <- sdf_copy_to(sc, data.frame(a = c(1, 2, 3), b = c(2, 3, 4)))
y_tbl <- sdf_copy_to(sc, data.frame(b = c(3, 4, 5), c = c("A", "B", "C")))
Pivoting DataFrames
It is now possible to pivot (i.e., cross tabulate) one or more columns using sdf_pivot():
sdf_pivot(y_tbl, b ~ c, list(b = "count"))
Binding Rows and Columns
Binding DataFrames by rows and columns is supported through sdf_bind_rows() and sdf_bind_cols(), as shown below.
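For example, with the x_tbl and y_tbl created above (sdf_bind_rows() matches columns by name; sdf_bind_cols() binds them side by side):

sdf_bind_rows(x_tbl, y_tbl)
sdf_bind_cols(x_tbl, y_tbl)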
Separating Columns
Separate lists into columns with ease. This is especially useful when working with model predictions that are returned as lists instead of scalars. In this example, each element in the probability column contains two items. We can use sdf_separate_column() to isolate the item that corresponds to the probability that vs equals one.
library(magrittr)

mtcars[, c("vs", "mpg")] %>%
  sdf_copy_to(sc, .) %>%
  ml_logistic_regression("vs", "mpg") %>%
  sdf_predict() %>%
  sdf_separate_column("probability", list("P[vs=1]" = 2))
Multinomial Regression
sparklyr 0.6 adds support for multinomial regression for Spark 2.1.0 or higher:
iris_tbl %>% ml_logistic_regression("Species", features = c("Sepal_Length", "Sepal_Width"))
Improved Text Mining with LDA
ft_tokenizer() was introduced in sparklyr 0.5, but sparklyr 0.6 introduces ft_count_vectorizer() and vocabulary.only to simplify LDA:
library(janeaustenr)

lines_tbl <- sdf_copy_to(sc, austen_books()[c(1, 3), ])

lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
The vocabulary can be printed with:
lines_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features", vocabulary.only = TRUE)
That’s all for now, disconnecting:
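spark_disconnect(sc)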