big data

sparklyr 1.3: Higher-order Functions, Avro and Custom Serializers

2020-07-16 Yitao Li
sparklyr 1.3 is now available, featuring integration of Spark higher-order functions and data import/export in Avro and in user-defined serialization formats. Read more →

sparklyr 1.2: Foreach, Spark 3.0 and Databricks Connect

2020-05-06 Yitao Li
A new version of sparklyr is now available on CRAN! In this sparklyr 1.2 release, the following improvements are in the spotlight: A registerDoSpark() method to create a foreach parallel backend powered by Spark that enables hundreds of existing R packages to run in Spark. Support for Databricks Connect, allowing sparklyr to connect to remote Databricks clusters. Improved support for Spark structures when collecting and querying their nested attributes with dplyr. Read more →

sparklyr 1.1: Foundations, Books, Lakes and Barriers

2020-01-29 Javier Luraschi
Today we are excited to share that sparklyr 1.1 is now available on CRAN! In a nutshell, you can use sparklyr to scale datasets across computing clusters running Apache Spark. For this particular release, we would like to highlight the following new features: Delta Lake enables database-like properties in Spark. Spark 3.0 preview is now available through sparklyr. Barrier Execution paves the way to use Spark with deep learning frameworks. Read more →

sparklyr 1.0: Apache Arrow, XGBoost, Broom and TFRecords

2019-03-15 Javier Luraschi
With much excitement built over the past three years, we are thrilled to share that sparklyr 1.0 is now available on CRAN! The sparklyr package provides an R interface to Apache Spark. It supports dplyr, MLlib, streaming, extensions and many other features; however, this particular release enables the following new features: Arrow enables faster and larger data transfers between Spark and R. XGBoost enables training gradient boosting models over distributed datasets. Read more →

sparklyr 0.9: Streams and Kubernetes

2018-10-01 Javier Luraschi
Today we are excited to share that a new release of sparklyr is available on CRAN! This 0.9 release enables you to: Create Spark structured streams to process real-time data from many data sources using dplyr, SQL, pipelines, and arbitrary R code. Monitor connection progress with upcoming RStudio Preview 1.2 features and support for properly interrupting Spark jobs from R. Use Kubernetes clusters with sparklyr to simplify deployment and maintenance. Read more →

See RStudio + sparklyr for big data at Strata + Hadoop World

2017-02-13 Roger Oberg
If big data is your thing, you use R, and you’re headed to Strata + Hadoop World in San Jose on March 13 and 14, you can experience in person how easy and practical it is to analyze big data with R and Spark. In a beginner-level talk by RStudio’s Edgar Ruiz and an intermediate-level workshop by Win-Vector’s John Mount, we cover the spectrum: what R is, what Spark is, how sparklyr works, and what is required to set up and tune a Spark cluster. Read more →

SparkR preview by Vincent Warmerdam

2015-05-28 Garrett Grolemund
This is a guest post by Vincent Warmerdam on the SparkR preview in RStudio. Apache Spark is the hip new technology on the block. It allows you to write scripts in a functional style, and the technology behind it lets you run iterative tasks very quickly on a cluster of machines. It has been benchmarked as faster than Hadoop for most machine learning use cases (by a factor of 10 to 100), and soon Spark will also support the R language. Read more →