Is dplyr faster than base R?

Data manipulation with PLYR, DPLYR and reshape2

Transcript

Hung Quan Vu
Department of Computer Science
Supervisor: Jakob Lüttgau

Contents:
1. PLYR
   a. Definition
   b. Advantages
   c. Basics
   d. Example
   e. Transform and summarize
   f. Nested chunking of the data
   g. Other useful options
   h. Disadvantages
2. DPLYR
   a. Definition
   b. Advantages
   c. Basics
3. Reshape2
   a. Definition
   b. Improvement
   c. File format
   d. The packages

PLYR

a. What is plyr?
- plyr is an R package that simplifies splitting data apart, doing something to each piece, and combining the results again. These are common data manipulation steps. Importantly, plyr makes it easy to control the input and output data format from a syntactically consistent set of functions.
- For example, plyr is very effective if we want to:
  - fit the same model to every subset of a data frame
  - quickly calculate summary statistics for every group
  - carry out group-wise transformations such as scaling or standardization
- All of this is already possible with base R functions (such as split and the apply family of functions), but plyr makes it a bit easier with:
  - completely consistent names, arguments and outputs
  - convenient parallelization through the foreach package
  - input from and output to data.frames, matrices and lists
  - progress bars to track long-running operations
  - built-in error recovery and informative error messages
  - labels that are retained across all transformations

b. Why do we want to use plyr?
- plyr > base apply functions > for loops

apply > loop
+) The code is cleaner (once you are familiar with the concept). It is easier to write and read, and also less error-prone, because: (a) you don't have to deal with subsetting; (b) you don't have to deal with saving the results.
+) Apply functions can be faster than loops, sometimes dramatically so.

plyr > base apply functions
+) plyr has a common syntax, which is easier to remember.

+) plyr requires less code because it takes care of the input and output format.
+) plyr can easily be run in parallel -> faster

c. PLYR basics
- plyr builds on the built-in apply functions by giving you control over the input and output formats and by keeping the syntax consistent across all variations. It also adds some niceties such as error handling, parallel processing, and progress bars.
- The basic format is two letters followed by ply(). The first letter refers to the input format and the second to the output format.
- The three main letters are:
  d = data frame
  a = array (includes matrices)
  l = list

  output \ input   data frame   list    array
  data frame       ddply        ldply   adply
  list             dlply        llply   alply
  array            daply        laply   aaply

- Some less common letters: m, r, _
  m = multi-argument function input
  r = replicate a function n times
  _ = discard the output
  The underscore (_) can be useful for plotting: the function does something with the data (e.g. adds line segments to a plot) and then throws away the output (e.g. d_ply()).
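The naming scheme can be tried out on a small input. The following is a minimal sketch, assuming the plyr package is installed; the list x is a made-up example and not part of the original slides. The input is always a list (first letter l), and only the second letter changes.

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors.
x <- list(a = 1:3, b = 4:6)

# Same input, same function, three different output formats.
as_list  <- llply(x, mean)   # list in, list out
as_array <- laply(x, mean)   # list in, array (here a plain vector) out
as_df    <- ldply(x, function(v) data.frame(mean = mean(v)))  # list in, data frame out

# The underscore variant runs the function for its side effect only
# and discards whatever the function returns.
l_ply(x, function(v) cat("sum:", sum(v), "\n"))

print(as_list)
print(as_array)
print(as_df)
```

Only the function name changes between the three calls, which is the consistency the slide refers to.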

d. A general example with plyr
Let's take a simple example: take a data frame, split it up by year, calculate the coefficient of variation of the count, and return a data frame.

> set.seed(1)
> d <- data.frame(year = rep(2000:2002, each = 3),
+                 count = round(runif(9, 0, 20)))
> print(d)
  year count
> library(plyr)
> ddply(d, "year", function(x) {
+   mean.count <- mean(x$count)
+   sd.count <- sd(x$count)
+   cv <- sd.count / mean.count
+   data.frame(cv.count = cv)
+ })
  year cv.count
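For comparison, the same split, apply, combine steps can be written out by hand in base R. This is only a sketch: the counts below are made-up toy values rather than the output of the random draw above, and the variable names pieces, cv_by_year and result are illustrative.

```r
# Hypothetical toy data: three years with three counts each.
d <- data.frame(year  = rep(2000:2002, each = 3),
                count = c(5, 7, 11, 18, 4, 18, 19, 13, 13))

# Split: one data frame per year.
pieces <- split(d, d$year)

# Apply: compute the coefficient of variation within each piece.
cv_by_year <- lapply(pieces, function(x) {
  data.frame(year = x$year[1], cv.count = sd(x$count) / mean(x$count))
})

# Combine: stack the per-year results into a single data frame.
result <- do.call(rbind, cv_by_year)
rownames(result) <- NULL
print(result)
```

ddply performs these three steps in one call and keeps track of the group labels itself, which is the bookkeeping the base version has to do by hand.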

e. Transform and summarize
- It is often convenient to use these functions inside a plyr call.
- transform behaves just as it does as a base R function and modifies an existing data frame.
- summarize creates a new, usually condensed, data frame.

> ddply(d, "year", summarize, mean.count = mean(count))
  year mean.count
> ddply(d, "year", transform, total.count = sum(count))
  year count total.count
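The difference between the two is easiest to see in the shape of the result. A minimal sketch, assuming plyr is installed and using made-up toy counts (not the values from the slides):

```r
library(plyr)

# Hypothetical toy data: three years with three counts each.
d <- data.frame(year  = rep(2000:2002, each = 3),
                count = c(5, 7, 11, 18, 4, 18, 19, 13, 13))

# summarize: one row per group, only the grouping and summary columns remain.
s <- ddply(d, "year", summarize, mean.count = mean(count))

# transform: every original row is kept and gains a new column.
tr <- ddply(d, "year", transform, total.count = sum(count))

print(dim(s))   # rows and columns of the condensed result
print(dim(tr))  # rows and columns of the extended result
```

Here s has one row per year, while tr has as many rows as d plus the extra total.count column.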

Bonus function: mutate. mutate works almost exactly like transform but lets you build new columns on columns you have just created in the same call.

> ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
+       cv = sigma / mu)
  year count mu sigma cv

f. Nested chunking of the data
- The basic syntax can easily be extended to break the data apart based on multiple columns:

> baseball.dat <- subset(baseball, year > 2000)  # data from the plyr package
> x <- ddply(baseball.dat, c("year", "team"), summarize,
+            homeruns = sum(hr))
> head(x)
  year team homeruns
(the rows shown are for the teams ANA, ARI, ATL, BAL, BOS and CHA)
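Both points from this slide can be checked on a small data frame. The sketch below assumes plyr is installed; the data, including the site column, is made up for illustration. It first shows that base transform cannot refer to a column created in the same call while mutate can, and then groups by two columns at once.

```r
library(plyr)

# Hypothetical toy data with two grouping columns.
d <- data.frame(year  = rep(2000:2002, each = 3),
                site  = rep(c("a", "b", "a"), times = 3),
                count = c(5, 7, 11, 18, 4, 18, 19, 13, 13))

# transform evaluates all new columns against the original data,
# so cv cannot see the freshly created mu and the call fails.
transform_failed <- inherits(
  try(transform(d, mu = mean(count), cv = sd(count) / mu), silent = TRUE),
  "try-error"
)

# mutate creates the columns one after another, so cv can use mu.
m <- ddply(d, "year", mutate, mu = mean(count), cv = sd(count) / mu)

# Nested chunking: one result row per year/site combination.
by_both <- ddply(d, c("year", "site"), summarize, total = sum(count))

print(transform_failed)
print(head(m))
print(by_both)
```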

g. Other useful options
- Handling errors: you can use the failwith function to control how errors are handled.

> f <- function(x) if (x == 1) stop("error!") else 1
> safe.f <- failwith(NA, f, quiet = TRUE)
> # llply(1:2, f)
> llply(1:2, safe.f)
[[1]]
[1] NA

[[2]]
[1] 1

- Parallel processing: in combination with doMC (or doSMP on Windows) you can run separate function calls on each core of your computer. On a dual-core machine this can double the speed in some situations. Set .parallel = TRUE.

> x <- c(1:10)
> wait <- function(i) Sys.sleep(0.1)
> system.time(llply(x, wait))
   user  system elapsed
> system.time(sapply(x, wait))
   user  system elapsed
> library(doMC)
> registerDoMC(2)
> system.time(llply(x, wait, .parallel = TRUE))
   user  system elapsed

h. Why wouldn't I want to use plyr?
plyr can be slow, especially when you work with very large datasets that involve a lot of subsetting. Hadley is working on this, and the latest development versions of plyr can run much faster. There are three quicker options:

(1) Use a base R apply function:

> system.time(ddply(baseball, "id", summarize, length(year)))
   user  system elapsed
> system.time(tapply(baseball$year, baseball$id, function(x) length(x)))
   user  system elapsed

(2) Use an immutable data frame. When subsetting, an immutable data frame (idata.frame) returns a pointer to the original object instead of making a copy of it. Copying is often the limiting step in an apply function.

> system.time(ddply(idata.frame(baseball), "id", summarize, length(year)))
   user  system elapsed

(3) Use the data.table package:

> library(data.table)
> dt <- data.table(baseball, key = "id")
> system.time(dt[, length(year), by = list(id)])
   user  system elapsed
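Option (1) can be sketched without any extra packages. The data frame below is a made-up stand-in for the baseball data, with only an id and a year column. It counts rows per id in three base R ways, which should agree with each other.

```r
# Hypothetical stand-in for the baseball data: one row per player-season.
d <- data.frame(id   = c("a", "a", "b", "c", "c", "c"),
                year = c(2000, 2001, 2000, 2000, 2001, 2002))

# tapply: apply length() to the years of each id.
n_tapply <- tapply(d$year, d$id, length)

# table: counts the rows per id directly.
n_table <- table(d$id)

# aggregate: same count, returned as a data frame.
n_agg <- aggregate(year ~ id, data = d, FUN = length)

print(n_tapply)
print(n_table)
print(n_agg)

# Timing any of the variants works the same way as on the slide.
print(system.time(tapply(d$year, d$id, length)))
```

Which variant is fastest depends on the data size, so it is worth timing the candidates on the real data rather than relying on a toy example like this one.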

DPLYR

a. What is dplyr?
- dplyr is the next iteration of plyr
- it focuses only on data frames
- dplyr is faster than plyr and has a more consistent API

b. Why do we want to use dplyr?
- Time is money, which is why Romain Francois wrote the key pieces in Rcpp to provide blazing-fast performance. Performance will only get better over time, especially as we figure out the best way to make use of multiple processors (x faster than plyr).
- Tabular data is tabular data regardless of where it lives, so you should use the same functions to work with it. With dplyr, anything you can do to a local data frame you can also do to a remote database table. PostgreSQL, MySQL, SQLite and Google BigQuery have built-in support; adding a new backend is a matter of implementing S3 methods.
- The bottleneck in most data analysis is the time it takes to figure out what to do with your data, and dplyr makes this easier by providing a small set of single-purpose functions (group_by, summarize, mutate, filter, select and arrange). Each function does only one thing, but does it well.
- Chaining commands with %.% (%>%)

c. The basic manipulation functions
- filter: subset rows. Multiple conditions passed as separate arguments are combined with &; to express OR, use | within a single condition.
- select: subset columns. Multiple columns can be returned.
- arrange: reorder rows. Accepts multiple inputs and ascending/descending order.
- mutate: add new columns, possibly based on other columns; multiple inputs create multiple columns.
- summarize: apply functions within each group so that each group is reduced to a single row. Multiple inputs create multiple output summaries.

- Chaining commands
Nested R commands are often hard to read:
  o You have to read the order of operations from the innermost function to the outermost.
  o As a result, the arguments of the outermost functions end up a long way from the function itself.
  o With the chain function or the %.% operator (later replaced by %>%), dplyr lets you write the operations as a linear sequence, which is much more logical.

- Databases
According to Hadley, dplyr supports the three most popular open-source databases (SQLite, MySQL and PostgreSQL) and Google BigQuery. This is useful when we don't want to extract our data from the database.
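The readability argument for chaining can be illustrated with a short pipeline. The sketch below assumes a dplyr version that provides the %>% operator; the data frame uses made-up toy counts. It writes the same computation once as nested calls and once as a chain, and also shows how filter combines conditions.

```r
library(dplyr)

# Hypothetical toy data: three years with three counts each.
d <- data.frame(year  = rep(2000:2002, each = 3),
                count = c(5, 7, 11, 18, 4, 18, 19, 13, 13))

# Nested form: has to be read from the inside out.
nested <- arrange(
  summarise(group_by(filter(d, count > 4), year), mean.count = mean(count)),
  desc(mean.count)
)

# Chained form: the same steps, read top to bottom.
chained <- d %>%
  filter(count > 4) %>%
  group_by(year) %>%
  summarise(mean.count = mean(count)) %>%
  arrange(desc(mean.count))

# Comma-separated conditions are combined with AND ...
both   <- filter(d, year == 2002, count > 13)
# ... while OR is written with | inside a single condition.
either <- filter(d, year == 2000 | count > 18)

print(chained)
print(both)
print(either)
```

Both forms return the same table; the chained version simply lists the verbs in the order in which they are applied.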

Reshape2

a. What is reshape2?
- An R package that simplifies converting data between long and wide format.
- A reboot of the reshape package.
- Much more focused on data conversion, and much faster.
- Improves speed at the expense of functionality.

b. Improvement
- Much faster and more memory-efficient.
- cast is replaced by two functions depending on the output type: dcast produces data frames and acast produces matrices/arrays.
- Multi-dimensional margins are now possible.
- Some features have been removed, e.g. the cast operator and the ability to return multiple values from an aggregation function.
- Better development practices such as namespaces and tests.

c. What are wide and long formats?
Wide-format data has a column for each variable. E.g.:

#   ozone wind temp

And here is long-format data:

#    variable value
# 1  ozone
# 2  ozone
# 3  ozone
# 4  ozone
# 5  wind
# 6  wind
# 7  wind
# 8  wind
# 9  temp
# 10 temp
# 11 temp
# 12 temp

Long-format data has one column for the possible variable types and one column for the values of those variables. Long-format data is not necessarily just two columns. For example, we could have ozone measurements for every day of the year; in that case we could use another column for the day. In other words, there are different levels of "longness". The final shape we want our data in depends on what we are going to do with it.

- It turns out that some types of data analysis need wide-format data and others need long-format data. In practice we need long-format data much more often than wide-format data. For example, ggplot2 requires long-format data (technically, tidy data), plyr requires long-format data, and most modeling functions (such as lm(), glm(), and gam()) require long-format data. But most people find it easier to record their data in wide format.

d. The packages
- melt converts wide format to long format
- cast converts long format to wide format
- "Think of working with metal: if you melt metal, it drips and becomes long. If you cast it into a mold, it becomes wide." - Sean Anderson
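A round trip between the two formats can be sketched as follows, assuming the reshape2 package is installed. The data frame wide holds made-up toy measurements with an extra day column as identifier; melt takes it to long format and dcast (the data frame variant of cast) brings it back.

```r
library(reshape2)

# Hypothetical wide-format data: one column per variable, one row per day.
wide <- data.frame(day   = 1:4,
                   ozone = c(41, 36, 12, 18),
                   wind  = c(7.4, 8.0, 12.6, 11.5),
                   temp  = c(67, 72, 74, 62))

# melt: wide -> long. day identifies the observation; every other column
# becomes a (variable, value) pair.
long <- melt(wide, id.vars = "day")

# dcast: long -> wide. One row per day, one column per variable.
wide_again <- dcast(long, day ~ variable, value.var = "value")

print(head(long))
print(wide_again)
```

The long table has one row per day and variable, and casting it with the formula day ~ variable recovers the original column layout.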

Summary:
- plyr
  - is an R package that simplifies splitting data apart, doing something to each piece, and combining the results
  - the basic format is two letters followed by ply()
  - is not always fast
- dplyr
  - is the next iteration of plyr
  - focuses only on data frames
  - is faster than plyr and has a more consistent API
  - manipulation functions: filter, arrange, mutate, select, summarize
  - chaining commands: with chain or %.%, dplyr lets you write operations as a linear sequence, which is much more logical than nested R commands
- reshape2
  - is an R package that simplifies converting data between long and wide format
  - is a reboot of the reshape package
  - basic functions: melt converts wide format to long format; cast converts long format to wide format
