Title: | Tidy Verbs for Fast Data Manipulation |
---|---|
Description: | A toolkit of tidy data manipulation verbs with 'data.table' as the backend. Combining the merits of syntax elegance from 'dplyr' and computing performance from 'data.table', 'tidyfst' intends to provide users with state-of-the-art data manipulation tools with least pain. This package is an extension of 'data.table'. While enjoying a tidy syntax, it also wraps combinations of efficient functions to facilitate frequently-used data operations. |
Authors: | Tian-Yuan Huang [aut, cre] |
Maintainer: | Tian-Yuan Huang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.8.1 |
Built: | 2024-10-16 06:03:16 UTC |
Source: | https://github.com/hope-data-science/tidyfst |
Inverse operation of match.
x %notin% y
x %notin% y
x |
vector or NULL |
y |
vector or NULL |
"a" %in% letters[1:3] "a" %notin% letters[1:3] 1 %in% 1:3 1 %notin% 1:3
"a" %in% letters[1:3] "a" %notin% letters[1:3] 1 %in% 1:3 1 %notin% 1:3
Order the rows of a data frame rows by the values of selected columns.
arrange_dt(.data, ...)
arrange_dt(.data, ...)
.data |
data.frame |
... |
Arrange by what group? Minus symbol means arrange by descending order. |
data.table
iris %>% arrange_dt(Sepal.Length) # minus for decreasing order iris %>% arrange_dt(-Sepal.Length) # arrange by multiple variables iris %>% arrange_dt(Sepal.Length,Petal.Length)
iris %>% arrange_dt(Sepal.Length) # minus for decreasing order iris %>% arrange_dt(-Sepal.Length) # arrange by multiple variables iris %>% arrange_dt(Sepal.Length,Petal.Length)
This function first export the data.frame to a temporal file, and then parse it back as a fst table (class name is "fst_table").
as_fst(.data)
as_fst(.data)
.data |
A data.frame |
An object of class fst_table
## Not run: iris %>% as_fst() -> iris_fst iris_fst ## End(Not run)
## Not run: iris %>% as_fst() -> iris_fst iris_fst ## End(Not run)
Bind any number of data frames by row, making a longer result. Similar to 'dplyr::bind_rows', however, columns with same names but different data types would be coerced to a single proper data type.
bind_rows_dt(...)
bind_rows_dt(...)
... |
Data frames to combine. Each argument can either be a data frame, a list that could be a data frame, or a list of data frames. Columns are matched by name, and any missing columns will be filled with 'NA'. |
data.table
bind_rows_dt(iris[1:3,],iris[6:8,]) # data frames with same name but different type # numeric data would be coerced to character data in this case df1 <- data.frame(x = 1:2, y = letters[1:2]) df2 <- data.frame(x = 4:5, y = 1:2) bind_rows_dt(df1, df2)
bind_rows_dt(iris[1:3,],iris[6:8,]) # data frames with same name but different type # numeric data would be coerced to character data in this case df1 <- data.frame(x = 1:2, y = letters[1:2]) df2 <- data.frame(x = 4:5, y = 1:2) bind_rows_dt(df1, df2)
For a data.frame with numeric values, add a new column specifying the column name of the first max/min value each row.
col_max(.data, ..., .name = "max_col") col_min(.data, ..., .name = "min_col")
col_max(.data, ..., .name = "max_col") col_min(.data, ..., .name = "min_col")
.data |
A data.frame with numeric column(s) |
... |
Variables for screening, could receive what 'select_dt' receives. When starts with '-'(minus symbol) or '!', return the negative columns. |
.name |
The column name of the new added column |
A data.table
https://stackoverflow.com/questions/17735859/for-each-row-return-the-column-name-of-the-largest-value
set.seed(199057) DT <- data.table(matrix(sample(10, 100, TRUE), ncol=10)) DT col_max(DT) col_max(DT,V1:V3) col_max(DT,.name = "max_col_name") col_min(DT) col_min(DT,2:4) col_max(iris)
set.seed(199057) DT <- data.table(matrix(sample(10, 100, TRUE), ncol=10)) DT col_max(DT) col_max(DT,V1:V3) col_max(DT,.name = "max_col_name") col_min(DT) col_min(DT,2:4) col_max(iris)
Turns implicit missing values into explicit missing values. All the combinations of column values (should be unique) will be constructed. Other columns will be filled with NAs or constant value.
complete_dt(.data, ..., fill = NA)
complete_dt(.data, ..., fill = NA)
.data |
data.frame |
... |
Specification of columns to expand.The selection of columns is
supported by the flexible |
fill |
Atomic value to fill into the missing cell, default uses |
When the provided columns with addtion data are of different length, all the unique combinations would be returned. This operation should be used only on unique entries, and it will always returned the unique entries.
If you supply fill parameter, these values will also replace existing explicit missing values in the data set.
data.table
df <- data.table( group = c(1:2, 1), item_id = c(1:2, 2), item_name = c("a", "b", "b"), value1 = 1:3, value2 = 4:6 ) df %>% complete_dt(item_id,item_name) df %>% complete_dt(item_id,item_name,fill = 0) df %>% complete_dt("item") df %>% complete_dt(item_id=1:3) df %>% complete_dt(item_id=1:3,group=1:2) df %>% complete_dt(item_id=1:3,group=1:3,item_name=c("a","b","c"))
df <- data.table( group = c(1:2, 1), item_id = c(1:2, 2), item_name = c("a", "b", "b"), value1 = 1:3, value2 = 4:6 ) df %>% complete_dt(item_id,item_name) df %>% complete_dt(item_id,item_name,fill = 0) df %>% complete_dt("item") df %>% complete_dt(item_id=1:3) df %>% complete_dt(item_id=1:3,group=1:2) df %>% complete_dt(item_id=1:3,group=1:3,item_name=c("a","b","c"))
Count the unique values of one or more variables.
count_dt(.data, ..., sort = TRUE, .name = "n") add_count_dt(.data, ..., .name = "n")
count_dt(.data, ..., sort = TRUE, .name = "n") add_count_dt(.data, ..., .name = "n")
.data |
data.table/data.frame data.frame will be automatically converted to data.table. |
... |
Variables to group by, could receive what 'select_dt' receives. |
sort |
logical. If TRUE result will be sorted in desending order by resulting variable. |
.name |
character. Name of resulting variable. Default uses "n". |
data.table
iris %>% count_dt(Species) iris %>% count_dt(Species,.name = "count") iris %>% add_count_dt(Species) iris %>% add_count_dt(Species,.name = "N") mtcars %>% count_dt(cyl,vs) mtcars %>% count_dt("cyl|vs") mtcars %>% count_dt(cyl,vs,.name = "N",sort = FALSE) mtcars %>% add_count_dt(cyl,vs) mtcars %>% add_count_dt("cyl|vs")
iris %>% count_dt(Species) iris %>% count_dt(Species,.name = "count") iris %>% add_count_dt(Species) iris %>% add_count_dt(Species,.name = "N") mtcars %>% count_dt(cyl,vs) mtcars %>% count_dt("cyl|vs") mtcars %>% count_dt(cyl,vs,.name = "N",sort = FALSE) mtcars %>% add_count_dt(cyl,vs) mtcars %>% add_count_dt("cyl|vs")
Returns a vector whose elements are the cumulative mean of the elements of the argument.
cummean(x)
cummean(x)
x |
a numeric or complex object, or an object that can be coerced to one of these. |
cummean(1:10)
cummean(1:10)
Select only unique/distinct rows from a data frame.
distinct_dt(.data, ..., .keep_all = FALSE, fromLast = FALSE)
distinct_dt(.data, ..., .keep_all = FALSE, fromLast = FALSE)
.data |
data.frame |
... |
Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables. |
.keep_all |
If |
fromLast |
Logical indicating if duplication should be
considered from the reverse side. Defaults to |
data.table
iris %>% distinct_dt() iris %>% distinct_dt(Species) iris %>% distinct_dt(Species,.keep_all = TRUE) mtcars %>% distinct_dt(cyl,vs) mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE) mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE,fromLast = TRUE)
iris %>% distinct_dt() iris %>% distinct_dt(Species) iris %>% distinct_dt(Species,.keep_all = TRUE) mtcars %>% distinct_dt(cyl,vs) mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE) mtcars %>% distinct_dt(cyl,vs,.keep_all = TRUE,fromLast = TRUE)
A set of tools to deal with missing values in data.frames. It can dump, replace, fill (with next or previous observation) or delete entries according to their missing values.
drop_na_dt(.data, ...) replace_na_dt(.data, ..., to) delete_na_cols(.data, prop = NULL, n = NULL) delete_na_rows(.data, prop = NULL, n = NULL) fill_na_dt(.data, ..., direction = "down") shift_fill(x, direction = "down")
drop_na_dt(.data, ...) replace_na_dt(.data, ..., to) delete_na_cols(.data, prop = NULL, n = NULL) delete_na_rows(.data, prop = NULL, n = NULL) fill_na_dt(.data, ..., direction = "down") shift_fill(x, direction = "down")
.data |
data.frame |
... |
Colunms to be replaced or filled. If not specified, use all columns. |
to |
What value should NA replace by? |
prop |
If proportion of NAs is larger than or equal to "prop", would be deleted. |
n |
If number of NAs is larger than or equal to "n", would be deleted. |
direction |
Direction in which to fill missing values. Currently either "down" (the default) or "up". |
x |
A vector with missing values to be filled. |
drop_na_dt
drops the entries with NAs in specific columns.
fill_na_dt
fill NAs with observations ahead ("down") or below ("up"),
which is also known as last observation carried forward (LOCF) and
next observation carried backward(NOCB).
delete_na_cols
could drop the columns with NA proportion larger
than or equal to "prop" or NA number larger than or equal to "n",
delete_na_rows
works alike but deals with rows.
shift_fill
could fill a vector with missing values.
data.table
https://stackoverflow.com/questions/23597140/how-to-find-the-percentage-of-nas-in-a-data-frame
https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na
https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b")) df %>% drop_na_dt() df %>% drop_na_dt(x) df %>% drop_na_dt(y) df %>% drop_na_dt(x,y) df %>% replace_na_dt(to = 0) df %>% replace_na_dt(x,to = 0) df %>% replace_na_dt(y,to = 0) df %>% replace_na_dt(x,y,to = 0) df %>% fill_na_dt(x) df %>% fill_na_dt() # not specified, fill all columns df %>% fill_na_dt(y,direction = "up") x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4)) x x %>% delete_na_cols() x %>% delete_na_cols(prop = 0.75) x %>% delete_na_cols(prop = 0.5) x %>% delete_na_cols(prop = 0.24) x %>% delete_na_cols(n = 2) x %>% delete_na_rows(prop = 0.6) x %>% delete_na_rows(n = 2) # shift_fill y = c("a",NA,"b",NA,"c") shift_fill(y) # equals to shift_fill(y,"down") shift_fill(y,"up")
df <- data.table(x = c(1, 2, NA), y = c("a", NA, "b")) df %>% drop_na_dt() df %>% drop_na_dt(x) df %>% drop_na_dt(y) df %>% drop_na_dt(x,y) df %>% replace_na_dt(to = 0) df %>% replace_na_dt(x,to = 0) df %>% replace_na_dt(y,to = 0) df %>% replace_na_dt(x,y,to = 0) df %>% fill_na_dt(x) df %>% fill_na_dt() # not specified, fill all columns df %>% fill_na_dt(y,direction = "up") x = data.frame(x = c(1, 2, NA, 3), y = c(NA, NA, 4, 5),z = rep(NA,4)) x x %>% delete_na_cols() x %>% delete_na_cols(prop = 0.75) x %>% delete_na_cols(prop = 0.5) x %>% delete_na_cols(prop = 0.24) x %>% delete_na_cols(n = 2) x %>% delete_na_rows(prop = 0.6) x %>% delete_na_rows(n = 2) # shift_fill y = c("a",NA,"b",NA,"c") shift_fill(y) # equals to shift_fill(y,"down") shift_fill(y,"up")
Quickly create dummy (binary) columns from character and factor type columns in the inputted data (and numeric columns if specified.) This function is useful for statistical analysis when you want binary columns rather than character columns.
dummy_dt(.data, ..., longname = TRUE)
dummy_dt(.data, ..., longname = TRUE)
.data |
data.frame |
... |
Columns you want to create dummy variables from. Very flexible, find in the examples. |
longname |
logical. Should the output column labeled with the
original column name? Default uses |
If no columns provided, will return the original data frame. When NA exist in the input column, they would also be considered. If the input character column contains both NA and string "NA", they would be merged.
This function is inspired by fastDummies package, but provides
simple and precise usage, whereas fastDummies::dummy_cols
provides more
features for statistical usage.
data.table
https://stackoverflow.com/questions/18881073/creating-dummy-variables-in-r-data-table
iris %>% dummy_dt(Species) iris %>% dummy_dt(Species,longname = FALSE) mtcars %>% head() %>% dummy_dt(vs,am) mtcars %>% head() %>% dummy_dt("cyl|gear") # when there are NAs in the column df <- data.table(x = c("a", "b", NA, NA),y = 1:4) df %>% dummy_dt(x) # when NA and "NA" both exist, they would be merged df <- data.table(x = c("a", "b", NA, "NA"),y = 1:4) df %>% dummy_dt(x)
iris %>% dummy_dt(Species) iris %>% dummy_dt(Species,longname = FALSE) mtcars %>% head() %>% dummy_dt(vs,am) mtcars %>% head() %>% dummy_dt("cyl|gear") # when there are NAs in the column df <- data.table(x = c("a", "b", NA, NA),y = 1:4) df %>% dummy_dt(x) # when NA and "NA" both exist, they would be merged df <- data.table(x = c("a", "b", NA, "NA"),y = 1:4) df %>% dummy_dt(x)
Wrapper for read_fst
and write_fst
from fst, but use a different default. For data import, always return a data.table.
For data export, always compress the data to the smallest size.
export_fst(x, path, compress = 100, uniform_encoding = TRUE) import_fst( path, columns = NULL, from = 1, to = NULL, as.data.table = TRUE, old_format = FALSE )
export_fst(x, path, compress = 100, uniform_encoding = TRUE) import_fst( path, columns = NULL, from = 1, to = NULL, as.data.table = TRUE, old_format = FALSE )
x |
a data frame to write to disk |
path |
path to fst file |
compress |
value in the range 0 to 100, indicating the amount of compression to use. Lower values mean larger file sizes. The default compression is set to 50. |
uniform_encoding |
If 'TRUE', all character vectors will be assumed to have elements with equal encoding. The encoding (latin1, UTF8 or native) of the first non-NA element will used as encoding for the whole column. This will be a correct assumption for most use cases. If 'uniform.encoding' is set to 'FALSE', no such assumption will be made and all elements will be converted to the same encoding. The latter is a relatively expensive operation and will reduce write performance for character columns. |
columns |
Column names to read. The default is to read all columns. |
from |
Read data starting from this row number. |
to |
Read data up until this row number. The default is to read to the last row of the stored dataset. |
as.data.table |
If TRUE, the result will be returned as a |
old_format |
must be FALSE, the old fst file format is deprecated and can only be read and converted with fst package versions 0.8.0 to 0.8.10. |
'import_fst' returns a data.table with the selected columns and rows. 'export_fst' writes 'x' to a 'fst' file and invisibly returns 'x' (so you can use this function in a pipeline).
## Not run: export_fst(iris,"iris_fst_test.fst") iris_dt = import_fst("iris_fst_test.fst") iris_dt unlink("iris_fst_test.fst") ## End(Not run)
## Not run: export_fst(iris,"iris_fst_test.fst") iris_dt = import_fst("iris_fst_test.fst") iris_dt unlink("iris_fst_test.fst") ## End(Not run)
Choose rows where conditions are true.
filter_dt(.data, ...)
filter_dt(.data, ...)
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. |
data.table
iris %>% filter_dt(Sepal.Length > 7) iris %>% filter_dt(Sepal.Length == max(Sepal.Length)) # comma is not supported in tidyfst after v0.9.8 # which means you can't use: ## Not run: iris %>% filter_dt(Sepal.Length > 7, Sepal.Width > 3) ## End(Not run) # use following code instead iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
iris %>% filter_dt(Sepal.Length > 7) iris %>% filter_dt(Sepal.Length == max(Sepal.Length)) # comma is not supported in tidyfst after v0.9.8 # which means you can't use: ## Not run: iris %>% filter_dt(Sepal.Length > 7, Sepal.Width > 3) ## End(Not run) # use following code instead iris %>% filter_dt(Sepal.Length > 7 & Sepal.Width > 3)
A tookit of APIs for reading fst file as data.table, could select by column, row and conditional filtering.
parse_fst(path) slice_fst(ft, row_no) select_fst(ft, ...) filter_fst(ft, ...) summary_fst(ft)
parse_fst(path) slice_fst(ft, row_no) select_fst(ft, ...) filter_fst(ft, ...) summary_fst(ft)
path |
path to fst file |
ft |
An object of class fst_table, returned by |
row_no |
An integer vector (Positive) |
... |
The filter conditions |
summary_fst
could provide some basic information about
the fst table.
parse_fst
returns a fst_table class.
select_fst
and filter_fst
returns a data.table.
## Not run: fst::write_fst(iris,"iris_test.fst") # parse the file but not reading it parse_fst("iris_test.fst") -> ft ft class(ft) lapply(ft,class) names(ft) dim(ft) summary_fst(ft) # get the data by query ft %>% slice_fst(1:3) ft %>% slice_fst(c(1,3)) ft %>% select_fst(Sepal.Length) ft %>% select_fst(Sepal.Length,Sepal.Width) ft %>% select_fst("Sepal.Length") ft %>% select_fst(1:3) ft %>% select_fst(1,3) ft %>% select_fst("Se") ft %>% select_fst("nothing") ft %>% select_fst("Se|Sp") ft %>% select_fst(cols = names(iris)[2:3]) ft %>% filter_fst(Sepal.Width > 3) ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica") ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3) unlink("iris_test.fst") ## End(Not run)
## Not run: fst::write_fst(iris,"iris_test.fst") # parse the file but not reading it parse_fst("iris_test.fst") -> ft ft class(ft) lapply(ft,class) names(ft) dim(ft) summary_fst(ft) # get the data by query ft %>% slice_fst(1:3) ft %>% slice_fst(c(1,3)) ft %>% select_fst(Sepal.Length) ft %>% select_fst(Sepal.Length,Sepal.Width) ft %>% select_fst("Sepal.Length") ft %>% select_fst(1:3) ft %>% select_fst(1,3) ft %>% select_fst("Se") ft %>% select_fst("nothing") ft %>% select_fst("Se|Sp") ft %>% select_fst(cols = names(iris)[2:3]) ft %>% filter_fst(Sepal.Width > 3) ft %>% filter_fst(Sepal.Length > 6 , Species == "virginica") ft %>% filter_fst(Sepal.Length > 6 & Species == "virginica" & Sepal.Width < 3) unlink("iris_test.fst") ## End(Not run)
Carry out data manipulation within specified groups. Different from group_dt
,
the implementation is split into two operations, namely grouping and implementation.
Using setkey
and setkeyv
in data.table
to carry out group_by
-like functionalities in dplyr. This is
not only convenient but also efficient in computation.
group_by_dt(.data, ..., cols = NULL) group_exe_dt(.data, ...)
group_by_dt(.data, ..., cols = NULL) group_exe_dt(.data, ...)
.data |
A data frame |
... |
Variables to group by for |
cols |
A character vector of column names to group by. |
group_by_dt
and group_exe_dt
are a pair of functions
to be used in combination. It utilizes the feature of key setting in data.table,
which provides high performance for group operations, especially when you have
to operate by specific groups frequently.
A data.table with keys
# aggregation after grouping using group_exe_dt as.data.table(iris) -> a a %>% group_by_dt(Species) %>% group_exe_dt(head(1)) a %>% group_by_dt(Species) %>% group_exe_dt( head(3) %>% summarise_dt(sum = sum(Sepal.Length)) ) mtcars %>% group_by_dt("cyl|am") %>% group_exe_dt( summarise_dt(mpg_sum = sum(mpg)) ) # equals to mtcars %>% group_by_dt(cols = c("cyl","am")) %>% group_exe_dt( summarise_dt(mpg_sum = sum(mpg)) )
# aggregation after grouping using group_exe_dt as.data.table(iris) -> a a %>% group_by_dt(Species) %>% group_exe_dt(head(1)) a %>% group_by_dt(Species) %>% group_exe_dt( head(3) %>% summarise_dt(sum = sum(Sepal.Length)) ) mtcars %>% group_by_dt("cyl|am") %>% group_exe_dt( summarise_dt(mpg_sum = sum(mpg)) ) # equals to mtcars %>% group_by_dt(cols = c("cyl","am")) %>% group_exe_dt( summarise_dt(mpg_sum = sum(mpg)) )
Carry out data manipulation within specified groups.
group_dt(.data, by = NULL, ...) rowwise_dt(.data, ...)
group_dt(.data, by = NULL, ...) rowwise_dt(.data, ...)
.data |
A data.frame |
by |
Variables to group by,unquoted name of grouping variable of list of unquoted names of grouping variables. |
... |
Any data manipulation arguments that could be implemented on a data.frame. |
If you want to use summarise_dt
and mutate_dt
in
group_dt
, it is better to use the "by" parameter in those functions,
that would be much faster because you don't have to use .SD
(which takes
extra time to copy).
data.table
https://stackoverflow.com/questions/36802385/use-by-each-row-for-data-table
iris %>% group_dt(by = Species,slice_dt(1:2)) iris %>% group_dt(Species,filter_dt(Sepal.Length == max(Sepal.Length))) iris %>% group_dt(Species,summarise_dt(new = max(Sepal.Length))) # you can pipe in the `group_dt` iris %>% group_dt(Species, mutate_dt(max= max(Sepal.Length)) %>% summarise_dt(sum=sum(Sepal.Length))) # for users familiar with data.table, you can work on .SD directly # following codes get the first and last row from each group iris %>% group_dt( by = Species, rbind(.SD[1],.SD[.N]) ) #' # for summarise_dt, you can use "by" to calculate within the group mtcars %>% summarise_dt( disp = mean(disp), hp = mean(hp), by = cyl ) # but you could also, of course, use group_dt mtcars %>% group_dt(by =.(vs,am), summarise_dt(avg = mean(mpg))) # and list of variables could also be used mtcars %>% group_dt(by =list(vs,am), summarise_dt(avg = mean(mpg))) # examples for `rowwise_dt` df <- data.table(x = 1:2, y = 3:4, z = 4:5) df %>% mutate_dt(m = mean(c(x, y, z))) df %>% rowwise_dt( mutate_dt(m = mean(c(x, y, z))) )
iris %>% group_dt(by = Species,slice_dt(1:2)) iris %>% group_dt(Species,filter_dt(Sepal.Length == max(Sepal.Length))) iris %>% group_dt(Species,summarise_dt(new = max(Sepal.Length))) # you can pipe in the `group_dt` iris %>% group_dt(Species, mutate_dt(max= max(Sepal.Length)) %>% summarise_dt(sum=sum(Sepal.Length))) # for users familiar with data.table, you can work on .SD directly # following codes get the first and last row from each group iris %>% group_dt( by = Species, rbind(.SD[1],.SD[.N]) ) #' # for summarise_dt, you can use "by" to calculate within the group mtcars %>% summarise_dt( disp = mean(disp), hp = mean(hp), by = cyl ) # but you could also, of course, use group_dt mtcars %>% group_dt(by =.(vs,am), summarise_dt(avg = mean(mpg))) # and list of variables could also be used mtcars %>% group_dt(by =list(vs,am), summarise_dt(avg = mean(mpg))) # examples for `rowwise_dt` df <- data.table(x = 1:2, y = 3:4, z = 4:5) df %>% mutate_dt(m = mean(c(x, y, z))) df %>% rowwise_dt( mutate_dt(m = mean(c(x, y, z))) )
For 'import_fst_chunked', if a large fst file which could not be imported into the memory all at once, this function could read the fst file by chunks and preprocessed the chunk to ensure the results yielded by the chunks are small enough to be summarised in the end. For 'get_fst_chunk_size', this function can measure the memory used by a specified row number.
import_fst_chunked( path, chunk_size = 10000L, chunk_f = identity, combine_f = rbindlist ) get_fst_chunk_size(path, nrows)
import_fst_chunked( path, chunk_size = 10000L, chunk_f = identity, combine_f = rbindlist ) get_fst_chunk_size(path, nrows)
path |
Path to fst file |
chunk_size |
Integer. The number of rows to include in each chunk |
chunk_f |
A function implemented on every chunk. |
combine_f |
A function to aggregate all the elements from the list of results from chunks. |
nrows |
Number of rows to test. |
For 'import_fst_chunked', default to the whole data.frame in data.table. Could be adjusted to any type. For 'get_fst_chunk_size', return the file size.
## Not run: # Generate some random data frame with 10 million rows and various column types nr_of_rows <- 1e7 df <- data.frame( Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE), Integer = sample(1L:100L, nr_of_rows, replace = TRUE), Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE), Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE)) ) # Write the file to disk fst_file <- tempfile(fileext = ".fst") write_fst(df, fst_file) # Get the size of 10000 rows get_fst_chunk_size(fst_file,1e4) # File all rows that Integer == 7 by chunks import_fst_chunked(fst_file,chunk_f = \(x) x[Integer==7]) ## End(Not run)
## Not run: # Generate some random data frame with 10 million rows and various column types nr_of_rows <- 1e7 df <- data.frame( Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE), Integer = sample(1L:100L, nr_of_rows, replace = TRUE), Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE), Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE)) ) # Write the file to disk fst_file <- tempfile(fileext = ".fst") write_fst(df, fst_file) # Get the size of 10000 rows get_fst_chunk_size(fst_file,1e4) # File all rows that Integer == 7 by chunks import_fst_chunked(fst_file,chunk_f = \(x) x[Integer==7]) ## End(Not run)
Impute the columns of data.frame with its mean, median or mode.
impute_dt(.data, ..., .func = "mode")
impute_dt(.data, ..., .func = "mode")
.data |
A data.frame |
... |
Columns to select |
.func |
Character, "mode" (default), "mean" or "median". Could also define it by oneself. |
A data.table
Pclass <- c(3, 1, 3, 1, 3, 2, 2, 3, NA, NA) Sex <- c('male', 'male', 'female', 'female', 'female', 'female', NA, 'male', 'female', NA) Age <- c(22, 38, 26, 35, NA, 45, 25, 39, 28, 40) SibSp <- c(0, 1, 3, 1, 2, 3, 2, 2, NA, 0) Fare <- c(7.25, 71.3, 7.92, NA, 8.05, 8.46, 51.9, 60, 32, 15) Embarked <- c('S', NA, 'S', 'Q', 'Q', 'S', 'C', 'S', 'C', 'S') data <- data.frame('Pclass' = Pclass, 'Sex' = Sex, 'Age' = Age, 'SibSp' = SibSp, 'Fare' = Fare, 'Embarked' = Embarked) data data %>% impute_dt() # defalut uses "mode" as `.func` data %>% impute_dt(is.numeric,.func = "mean") data %>% impute_dt(is.numeric,.func = "median") my_fun = function(x){ x[is.na(x)] = (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))/2 x } data %>% impute_dt(is.numeric,.func = my_fun)
Pclass <- c(3, 1, 3, 1, 3, 2, 2, 3, NA, NA) Sex <- c('male', 'male', 'female', 'female', 'female', 'female', NA, 'male', 'female', NA) Age <- c(22, 38, 26, 35, NA, 45, 25, 39, 28, 40) SibSp <- c(0, 1, 3, 1, 2, 3, 2, 2, NA, 0) Fare <- c(7.25, 71.3, 7.92, NA, 8.05, 8.46, 51.9, 60, 32, 15) Embarked <- c('S', NA, 'S', 'Q', 'Q', 'S', 'C', 'S', 'C', 'S') data <- data.frame('Pclass' = Pclass, 'Sex' = Sex, 'Age' = Age, 'SibSp' = SibSp, 'Fare' = Fare, 'Embarked' = Embarked) data data %>% impute_dt() # defalut uses "mode" as `.func` data %>% impute_dt(is.numeric,.func = "mean") data %>% impute_dt(is.numeric,.func = "median") my_fun = function(x){ x[is.na(x)] = (max(x,na.rm = TRUE) - min(x,na.rm = TRUE))/2 x } data %>% impute_dt(is.numeric,.func = my_fun)
To use facilities provided by data.table, but do not have to load data.table package.
in_dt(.data, ...) as_dt(.data)
in_dt(.data, ...) as_dt(.data)
.data |
A data.frame |
... |
Recieve |
The as_dt
could turn any data frame to data.table class. If the data is
not a data frame, return error.
The in_dt
function creates a virtual environment in data.table, it could be
piped well because it still follows the principals of tidyfst, which are: (1) Never
use in place replacement and (2) Always recieves a data frame (data.frame/tibble/data.table)
and returns a data.table. Therefore, the in place functions like :=
will still
return the results.
iris %>% as_dt() iris %>% in_dt(order(-Sepal.Length),.SD[.N],by=Species)
iris %>% as_dt() iris %>% in_dt(order(-Sepal.Length),.SD[.N],by=Species)
Wrappers of set operations in data.table. Only difference is it could be applied to non-data.table data frames by recognizing and coercing them to data.table automatically.
intersect_dt(x, y, all = FALSE) union_dt(x, y, all = FALSE) setdiff_dt(x, y, all = FALSE) setequal_dt(x, y, all = TRUE)
intersect_dt(x, y, all = FALSE) union_dt(x, y, all = FALSE) setdiff_dt(x, y, all = FALSE) setequal_dt(x, y, all = TRUE)
x |
A data.frame |
y |
A data.frame |
all |
Logical. When |
A data.table
x = iris[c(2,3,3,4),] x2 = iris[2:4,] y = iris[c(3:5),] intersect_dt(x, y) # intersect intersect_dt(x, y, all=TRUE) # intersect all setdiff_dt(x, y) # except setdiff_dt(x, y, all=TRUE) # except all union_dt(x, y) # union union_dt(x, y, all=TRUE) # union all setequal_dt(x, x2, all=FALSE) # setequal setequal_dt(x, x2) # setequal all
x = iris[c(2,3,3,4),] x2 = iris[2:4,] y = iris[c(3:5),] intersect_dt(x, y) # intersect intersect_dt(x, y, all=TRUE) # intersect all setdiff_dt(x, y) # except setdiff_dt(x, y, all=TRUE) # except all union_dt(x, y) # union union_dt(x, y, all=TRUE) # union all setequal_dt(x, x2, all=FALSE) # setequal setequal_dt(x, x2) # setequal all
The mutating joins add columns from 'y' to 'x', matching rows based on the keys:
* 'inner_join_dt()': includes all rows in 'x' and 'y'. * 'left_join_dt()': includes all rows in 'x'. * 'right_join_dt()': includes all rows in 'y'. * 'full_join_dt()': includes all rows in 'x' or 'y'.
Filtering joins filter rows from 'x' based on the presence or absence of matches in 'y':
* 'semi_join_dt()' return all rows from 'x' with a match in 'y'. * 'anti_join_dt()' return all rows from 'x' without a match in 'y'.
inner_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) left_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) right_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) full_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) anti_join_dt(x, y, by = NULL, on = NULL) semi_join_dt(x, y, by = NULL, on = NULL)
inner_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) left_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) right_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) full_join_dt(x, y, by = NULL, on = NULL, suffix = c(".x", ".y")) anti_join_dt(x, y, by = NULL, on = NULL) semi_join_dt(x, y, by = NULL, on = NULL)
x |
A data.table |
y |
A data.table |
by |
(Optional) A character vector of variables to join by. If 'NULL', the default, '*_join_dt()' will perform a natural join, using all variables in common across 'x' and 'y'. A message lists the variables so that you can check they're correct; suppress the message by supplying 'by' explicitly. To join by different variables on 'x' and 'y', use a named vector. For example, 'by = c("a" = "b")' will match 'x$a' to 'y$b'. To join by multiple variables, use a vector with length > 1. For example, 'by = c("a", "b")' will match 'x$a' to 'y$a' and 'x$b' to 'y$b'. Use a named vector to match different variables in 'x' and 'y'. For example, 'by = c("a" = "b", "c" = "d")' will match 'x$a' to 'y$b' and 'x$c' to 'y$d'. |
on |
(Optional)
Indicate which columns in x should be joined with which columns in y.
Examples included:
1. |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
A data.table
workers = fread(" name company Nick Acme John Ajax Daniela Ajax ") positions = fread(" name position John designer Daniela engineer Cathie manager ") workers %>% inner_join_dt(positions) workers %>% left_join_dt(positions) workers %>% right_join_dt(positions) workers %>% full_join_dt(positions) # filtering joins workers %>% anti_join_dt(positions) workers %>% semi_join_dt(positions) # To suppress the message, supply 'by' argument workers %>% left_join_dt(positions, by = "name") # Use a named 'by' if the join variables have different names positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions' workers %>% inner_join_dt(positions2, by = c("name" = "worker")) # the syntax of 'on' could be a bit different workers %>% inner_join_dt(positions2,on = "name==worker")
workers = fread(" name company Nick Acme John Ajax Daniela Ajax ") positions = fread(" name position John designer Daniela engineer Cathie manager ") workers %>% inner_join_dt(positions) workers %>% left_join_dt(positions) workers %>% right_join_dt(positions) workers %>% full_join_dt(positions) # filtering joins workers %>% anti_join_dt(positions) workers %>% semi_join_dt(positions) # To suppress the message, supply 'by' argument workers %>% left_join_dt(positions, by = "name") # Use a named 'by' if the join variables have different names positions2 = setNames(positions, c("worker", "position")) # rename first column in 'positions' workers %>% inner_join_dt(positions2, by = c("name" = "worker")) # the syntax of 'on' could be a bit different workers %>% inner_join_dt(positions2,on = "name==worker")
Find the "next" or "previous" values in a vector.
It has wrapped data.table's shift
function.
lead_dt(x, n = 1L, fill = NA) lag_dt(x, n = 1L, fill = NA)
lead_dt(x, n = 1L, fill = NA) lag_dt(x, n = 1L, fill = NA)
x |
A vector |
n |
a positive integer of length 1, giving the number of positions to lead or lag by. Default uses 1 |
fill |
Value to use for padding when the window goes beyond the input length.
Default uses |
A vector
lead_dt(1:5) lag_dt(1:5) lead_dt(1:5,2) lead_dt(1:5,n = 2,fill = 0)
lead_dt(1:5) lag_dt(1:5) lead_dt(1:5,2) lead_dt(1:5,n = 2,fill = 0)
Turning a wide table to its longer form. It takes multiple columns and collapses into key-value pairs.
longer_dt(.data, ..., name = "name", value = "value", na.rm = FALSE)
longer_dt(.data, ..., name = "name", value = "value", na.rm = FALSE)
.data |
A data.frame |
... |
Pattern for unchanged group or unquoted names. Pattern can accept
regular expression to match column names. It can recieve what |
name |
Name for the measured variable names column. The default name is 'name'. |
value |
Name for the molten data values column(s). The default name is 'value'. |
na.rm |
If |
A data.table
## Example 1: stocks = data.frame( time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4) ) stocks stocks %>% longer_dt(time) stocks %>% longer_dt("ti") # Example 2: library(tidyr) billboard %>% longer_dt( -"wk", name = "week", value = "rank", na.rm = TRUE ) # or use: billboard %>% longer_dt( artist,track,date.entered, name = "week", value = "rank", na.rm = TRUE ) # or use: billboard %>% longer_dt( 1:3, name = "week", value = "rank", na.rm = TRUE )
## Example 1: stocks = data.frame( time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4) ) stocks stocks %>% longer_dt(time) stocks %>% longer_dt("ti") # Example 2: library(tidyr) billboard %>% longer_dt( -"wk", name = "week", value = "rank", na.rm = TRUE ) # or use: billboard %>% longer_dt( artist,track,date.entered, name = "week", value = "rank", na.rm = TRUE ) # or use: billboard %>% longer_dt( 1:3, name = "week", value = "rank", na.rm = TRUE )
Convenient fucntions to implement conversion between tidy table and named matrix.
mat_df(m) df_mat(df, row, col, value)
mat_df(m) df_mat(df, row, col, value)
m |
A matrix |
df |
A data.frame with at least 3 columns, one for row name, one for column name, and one for values. The names for column and row should be unique. |
row |
Unquoted expression of column name for row |
col |
Unquoted expression of column name for column |
value |
Unquoted expression of column name for values |
For mat_df
, a data.frame.
For df_mat
, a named matrix.
mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3])) mm tdf = mat_df(mm) tdf mat = df_mat(tdf,row,col,value) setequal(mm,mat) tdf %>% setNames(c("A","B","C")) %>% df_mat(A,B,C)
mm = matrix(c(1:8,NA),ncol = 3,dimnames = list(letters[1:3],LETTERS[1:3])) mm tdf = mat_df(mm) tdf mat = df_mat(tdf,row,col,value) setequal(mm,mat) tdf %>% setNames(c("A","B","C")) %>% df_mat(A,B,C)
Adds or updates columns in data.frame.
mutate_dt(.data, ..., by) transmute_dt(.data, ..., by)
mutate_dt(.data, ..., by) transmute_dt(.data, ..., by)
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. |
by |
(Optional) Mutate by what group? |
data.table
iris %>% mutate_dt(one = 1,Sepal.Length = Sepal.Length + 1) iris %>% transmute_dt(one = 1,Sepal.Length = Sepal.Length + 1) # add group number with symbol `.GRP` iris %>% mutate_dt(id = 1:.N,grp = .GRP,by = Species)
iris %>% mutate_dt(one = 1,Sepal.Length = Sepal.Length + 1) iris %>% transmute_dt(one = 1,Sepal.Length = Sepal.Length + 1) # add group number with symbol `.GRP` iris %>% mutate_dt(id = 1:.N,grp = .GRP,by = Species)
Update or add columns when the given condition is met.
mutate_when
integrates mutate
and case_when
in dplyr and make a new tidy verb for data.table. mutate_vars
is
a super function to do updates in specific columns according to conditions.
mutate_when(.data, when, ..., by) mutate_vars(.data, .cols = NULL, .func, ..., by)
mutate_when(.data, when, ..., by) mutate_vars(.data, .cols = NULL, .func, ..., by)
.data |
data.frame |
when |
An object which can be coerced to logical mode |
... |
Name-value pairs of expressions for |
by |
(Optional) Mutate by what group? |
.cols |
Any types that can be accepted by |
.func |
Function to be run within each column, should return a value or vectors with same length. |
data.table
iris[3:8,] iris[3:8,] %>% mutate_when(Petal.Width == .2, one = 1,Sepal.Length=2) iris %>% mutate_vars("Pe",scale) iris %>% mutate_vars(is.numeric,scale) iris %>% mutate_vars(-is.factor,scale) iris %>% mutate_vars(1:2,scale) iris %>% mutate_vars(.func = as.character)
iris[3:8,] iris[3:8,] %>% mutate_when(Petal.Width == .2, one = 1,Sepal.Length=2) iris %>% mutate_vars("Pe",scale) iris %>% mutate_vars(is.numeric,scale) iris %>% mutate_vars(-is.factor,scale) iris %>% mutate_vars(1:2,scale) iris %>% mutate_vars(.func = as.character)
Create or melt list columns in data.frame.
Analogous function for nest
and unnest
in tidyr.
unnest_dt
will automatically remove other list-columns except for the
target list-columns (which would be unnested later). Also, squeeze_dt
is
designed to merge multiple columns into list column.
nest_dt(.data, ..., mcols = NULL, .name = "ndt") unnest_dt(.data, ...) squeeze_dt(.data, ..., .name = "ndt") chop_dt(.data, ...) unchop_dt(.data, ...)
nest_dt(.data, ..., mcols = NULL, .name = "ndt") unnest_dt(.data, ...) squeeze_dt(.data, ..., .name = "ndt") chop_dt(.data, ...) unchop_dt(.data, ...)
.data |
data.table, nested or unnested |
... |
The variables for nest group(for |
mcols |
Name-variable pairs in the list, form like |
.name |
Character. The nested column name. Defaults to "ndt".
|
In the nest_dt
, the data would be nested to a column named 'ndt',
which is short for nested data.table.
The squeeze_dt
would not remove the originial columns.
The unchop_dt
is the reverse operation of chop_dt
.
These functions are experiencing the experimental stage, especially
the unnest_dt
. If they don't work on some circumtances, try tidyr
package.
data.table, nested or unnested
https://www.r-bloggers.com/much-faster-unnesting-with-data-table/
https://stackoverflow.com/questions/25430986/create-nested-data-tables-by-collapsing-rows-into-new-data-tables
# examples for nest_dt # nest by which columns? mtcars %>% nest_dt(cyl) mtcars %>% nest_dt("cyl") mtcars %>% nest_dt(cyl,vs) mtcars %>% nest_dt(vs:am) mtcars %>% nest_dt("cyl|vs") mtcars %>% nest_dt(c("cyl","vs")) # change the nested column name mtcars %>% nest_dt(cyl,.name = "data") # nest two columns directly iris %>% nest_dt(mcols = list(petal="^Pe",sepal="^Se")) # nest more flexibly iris %>% nest_dt(mcols = list(ndt1 = 1:3, ndt2 = "Pe", ndt3 = Sepal.Length:Sepal.Width)) # examples for unnest_dt # unnest which column? mtcars %>% nest_dt("cyl|vs") %>% unnest_dt(ndt) mtcars %>% nest_dt("cyl|vs") %>% unnest_dt("ndt") df <- data.table( a = list(c("a", "b"), "c"), b = list(c(TRUE,TRUE),FALSE), c = list(3,c(1,2)), d = c(11, 22) ) df df %>% unnest_dt(a) df %>% unnest_dt(2) df %>% unnest_dt("c") df %>% unnest_dt(cols = names(df)[3]) # You can unnest multiple columns simultaneously df %>% unnest_dt(1:3) df %>% unnest_dt(a,b,c) df %>% unnest_dt("a|b|c") # examples for squeeze_dt # nest which columns? iris %>% squeeze_dt(1:2) iris %>% squeeze_dt("Se") iris %>% squeeze_dt(Sepal.Length:Petal.Width) iris %>% squeeze_dt(1:2,.name = "data") # examples for chop_dt df <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1) df %>% chop_dt(y,z) df %>% chop_dt(y,z) %>% unchop_dt(y,z)
# examples for nest_dt # nest by which columns? mtcars %>% nest_dt(cyl) mtcars %>% nest_dt("cyl") mtcars %>% nest_dt(cyl,vs) mtcars %>% nest_dt(vs:am) mtcars %>% nest_dt("cyl|vs") mtcars %>% nest_dt(c("cyl","vs")) # change the nested column name mtcars %>% nest_dt(cyl,.name = "data") # nest two columns directly iris %>% nest_dt(mcols = list(petal="^Pe",sepal="^Se")) # nest more flexibly iris %>% nest_dt(mcols = list(ndt1 = 1:3, ndt2 = "Pe", ndt3 = Sepal.Length:Sepal.Width)) # examples for unnest_dt # unnest which column? mtcars %>% nest_dt("cyl|vs") %>% unnest_dt(ndt) mtcars %>% nest_dt("cyl|vs") %>% unnest_dt("ndt") df <- data.table( a = list(c("a", "b"), "c"), b = list(c(TRUE,TRUE),FALSE), c = list(3,c(1,2)), d = c(11, 22) ) df df %>% unnest_dt(a) df %>% unnest_dt(2) df %>% unnest_dt("c") df %>% unnest_dt(cols = names(df)[3]) # You can unnest multiple columns simultaneously df %>% unnest_dt(1:3) df %>% unnest_dt(a,b,c) df %>% unnest_dt("a|b|c") # examples for squeeze_dt # nest which columns? iris %>% squeeze_dt(1:2) iris %>% squeeze_dt("Se") iris %>% squeeze_dt(Sepal.Length:Petal.Width) iris %>% squeeze_dt(1:2,.name = "data") # examples for chop_dt df <- data.table(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1) df %>% chop_dt(y,z) df %>% chop_dt(y,z) %>% unchop_dt(y,z)
nth
get the value from a vector with its position,
while maxth
and minth
get the nth highest or lowest value
from the vector.
nth(v, n = 1) maxth(v, n = 1) minth(v, n = 1)
nth(v, n = 1) maxth(v, n = 1) minth(v, n = 1)
v |
A vector |
n |
For |
A single value.
https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column/66367996#66367996
x = 1:10 nth(x, 1) nth(x, 5) nth(x, -2) y = c(10,3,4,5,2,1,6,9,7,8) maxth(y,3) minth(y,3)
x = 1:10 nth(x, 1) nth(x, 5) nth(x, -2) y = c(10,3,4,5,2,1,6,9,7,8) maxth(y,3) minth(y,3)
Provides an estimate of the memory that is being used to store an R object. A wrapper of 'object.size', but use a nicer printing unit.
object_size(object)
object_size(object)
object |
an R object. |
An object of class "object_size"
iris %>% object_size()
iris %>% object_size()
Count the number of times each pair
of items appear together within a group.
For example, this could count the number of times two words appear within documents.
This function has referred to pairwise_count
in widyr package,
but with very different defaults on several parameters.
pairwise_count_dt( .data, .group, .value, upper = FALSE, diag = FALSE, sort = TRUE )
pairwise_count_dt( .data, .group, .value, upper = FALSE, diag = FALSE, sort = TRUE )
.data |
A data.frame. |
.group |
Column name of counting group. |
.value |
Item to count pairs, will end up in |
upper |
When |
diag |
Whether to include diagonal (V1==V2) in output. Default uses |
sort |
Whether to sort rows by counts. Default uses |
A data.table with 3 columns (named as "V1","V2" and "n"), containing combinations in "V1" and "V2", and counts in "n".
dat <- data.table(group = rep(1:5, each = 2), letter = c("a", "b", "a", "c", "a", "c", "b", "e", "b", "f")) pairwise_count_dt(dat,group,letter) pairwise_count_dt(dat,group,letter,sort = FALSE) pairwise_count_dt(dat,group,letter,upper = TRUE) pairwise_count_dt(dat,group,letter,diag = TRUE) pairwise_count_dt(dat,group,letter,diag = TRUE,upper = TRUE) # The column name could be specified using character. pairwise_count_dt(dat,"group","letter")
dat <- data.table(group = rep(1:5, each = 2), letter = c("a", "b", "a", "c", "a", "c", "b", "e", "b", "f")) pairwise_count_dt(dat,group,letter) pairwise_count_dt(dat,group,letter,sort = FALSE) pairwise_count_dt(dat,group,letter,upper = TRUE) pairwise_count_dt(dat,group,letter,diag = TRUE) pairwise_count_dt(dat,group,letter,diag = TRUE,upper = TRUE) # The column name could be specified using character. pairwise_count_dt(dat,"group","letter")
Add percentage for counts in the data.frame, both numeric and character with '
percent(x, digits = 1) add_prop(.data, count_name = last(names(.data)), digits = 1)
percent(x, digits = 1) add_prop(.data, count_name = last(names(.data)), digits = 1)
x |
A number (numeric). |
digits |
How many digits to keep in the percentage. Default uses 1. |
.data |
A data frame. |
count_name |
Column name of counts (Character). Default uses the last column of data.frame. |
https://stackoverflow.com/questions/7145826/how-to-format-a-number-as-percentage-in-r
percent(0.9057) percent(0.9057,3) iris %>% count_dt(Species) %>% add_prop() iris %>% count_dt(Species) %>% add_prop(count_name = "n",digits = 2)
percent(0.9057) percent(0.9057,3) iris %>% count_dt(Species) %>% add_prop() iris %>% count_dt(Species) %>% add_prop(count_name = "n",digits = 2)
This function is a wrapper for require
and
detach
. pkg_load
checks to see if a
package is installed, if not it attempts to install the package
from CRAN. pkg_unload
can detach one or more loaded packages.
pkg_load(..., pkg_names = NULL) pkg_unload(..., pkg_names = NULL)
pkg_load(..., pkg_names = NULL) pkg_unload(..., pkg_names = NULL)
... |
Name(s) of package(s). |
pkg_names |
(Optional)Character vector containing packages to load
or unload. Default uses |
require
,
detach
,
p_load
,
p_unload
## Not run: pkg_load(data.table) pkg_unload(data.table) pkg_load(stringr,fst) pkg_unload(stringr,fst) pkg_load(pkg_names = c("data.table","fst")) p_unload(pkg_names = c("data.table","fst")) pkg_load(data.table,stringr,fst) pkg_unload("all") # shortcut to unload all loaded packages ## End(Not run)
## Not run: pkg_load(data.table) pkg_unload(data.table) pkg_load(stringr,fst) pkg_unload(stringr,fst) pkg_load(pkg_names = c("data.table","fst")) p_unload(pkg_names = c("data.table","fst")) pkg_load(data.table,stringr,fst) pkg_unload("all") # shortcut to unload all loaded packages ## End(Not run)
This function allow user to define how data.table is printed.
print_options( topn = 5, nrows = 100, class = TRUE, row.names = TRUE, col.names = "auto", print.keys = TRUE, trunc.cols = FALSE )
print_options( topn = 5, nrows = 100, class = TRUE, row.names = TRUE, col.names = "auto", print.keys = TRUE, trunc.cols = FALSE )
topn |
The number of rows to be printed from the beginning and
end of tables with more than |
nrows |
The number of rows which will be printed before truncation is enforced. |
class |
If |
row.names |
If |
col.names |
One of three flavours for controlling the display of column names in output. |
print.keys |
If |
trunc.cols |
If |
Notice that tidyfst has a slightly different printing default for data.table, which is it always prints the keys and variable class (not like data.table).
None. This function is used for its side effect of changing options.
iris %>% as.data.table() print_options(topn = 3,trunc.cols = TRUE) iris %>% as.data.table() # set all settings to default in tidyfst print_options() iris %>% as.data.table()
iris %>% as.data.table() print_options(topn = 3,trunc.cols = TRUE) iris %>% as.data.table() # set all settings to default in tidyfst print_options() iris %>% as.data.table()
Extract vector from data.frame, works likt '[['. Analogous function for pull
in dplyr
pull_dt(.data, col)
pull_dt(.data, col)
.data |
data.frame |
col |
A name of column or index (should be positive). |
vector
mtcars %>% pull_dt(2) mtcars %>% pull_dt(cyl) mtcars %>% pull_dt("cyl")
mtcars %>% pull_dt(2) mtcars %>% pull_dt(cyl) mtcars %>% pull_dt("cyl")
Recode discrete variables, including numerice and character variable.
rec_num(x, rec, keep = TRUE) rec_char(x, rec, keep = TRUE)
rec_num(x, rec, keep = TRUE) rec_char(x, rec, keep = TRUE)
x |
A numeric or character vector. |
rec |
String with recode pairs of old and new values. Find the usage in examples. |
keep |
Logical. Decide whether to keep the original values if not recoded.
Defaults to |
A vector.
x = 1:10 x rec_num(x, rec = "1=10; 4=2") rec_num(x, rec = "1:3=1; 4:6=2") rec_num(x, rec = "1:3=1; 4:6=2",keep = FALSE) y = letters[1:5] y rec_char(y,rec = "a=A;b=B") rec_char(y,rec = "a,b=A;c,d=B") rec_char(y,rec = "a,b=A;c,d=B",keep = FALSE)
x = 1:10 x rec_num(x, rec = "1=10; 4=2") rec_num(x, rec = "1:3=1; 4:6=2") rec_num(x, rec = "1:3=1; 4:6=2",keep = FALSE) y = letters[1:5] y rec_char(y,rec = "a=A;b=B") rec_char(y,rec = "a,b=A;c,d=B") rec_char(y,rec = "a,b=A;c,d=B",keep = FALSE)
Change the position of columns, using the same syntax as 'select_dt()'. Check similar function as 'relocate' in dplyr.
relocate_dt(.data, ..., how = "first", where = NULL)
relocate_dt(.data, ..., how = "first", where = NULL)
.data |
A data.frame |
... |
Columns to move |
how |
The mode of movement, including "first","last","after","before". Default uses "first". |
where |
Destination of columns selected by |
A data.table with rearranged columns.
df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a") df df %>% relocate_dt(f) df %>% relocate_dt(a,how = "last") df %>% relocate_dt(is.character) df %>% relocate_dt(is.numeric, how = "last") df %>% relocate_dt("[aeiou]") df %>% relocate_dt(a, how = "after",where = f) df %>% relocate_dt(f, how = "before",where = a) df %>% relocate_dt(f, how = "before",where = c) df %>% relocate_dt(f, how = "after",where = c) df2 <- data.table(a = 1, b = "a", c = 1, d = "a") df2 %>% relocate_dt(is.numeric, how = "after", where = is.character) df2 %>% relocate_dt(is.numeric, how="before", where = is.character)
df <- data.table(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a") df df %>% relocate_dt(f) df %>% relocate_dt(a,how = "last") df %>% relocate_dt(is.character) df %>% relocate_dt(is.numeric, how = "last") df %>% relocate_dt("[aeiou]") df %>% relocate_dt(a, how = "after",where = f) df %>% relocate_dt(f, how = "before",where = a) df %>% relocate_dt(f, how = "before",where = c) df %>% relocate_dt(f, how = "after",where = c) df2 <- data.table(a = 1, b = "a", c = 1, d = "a") df2 %>% relocate_dt(is.numeric, how = "after", where = is.character) df2 %>% relocate_dt(is.numeric, how="before", where = is.character)
Rename one or more columns in the data.frame.
rename_dt(.data, ...) rename_with_dt(.data, .fn, ...)
rename_dt(.data, ...) rename_with_dt(.data, .fn, ...)
.data |
data.frame |
... |
statements of rename, e.g. 'sl = Sepal.Length' means the column named as "Sepal.Length" would be renamed to "sl" |
.fn |
A function used to transform the selected columns. Should return a character vector the same length as the input. |
data.table
iris %>% rename_dt(sl = Sepal.Length,sw = Sepal.Width) %>% head() iris %>% rename_with_dt(toupper) iris %>% rename_with_dt(toupper,"^Pe")
iris %>% rename_dt(sl = Sepal.Length,sw = Sepal.Width) %>% head() iris %>% rename_with_dt(toupper) iris %>% rename_with_dt(toupper,"^Pe")
While replace_na_dt
could replace all NAs to another
value, replace_dt
could replace any value(s) to another specific
value.
replace_dt(.data, ..., from = is.nan, to = NA)
replace_dt(.data, ..., from = is.nan, to = NA)
.data |
A data.frame |
... |
Colunms to be replaced. If not specified, use all columns. |
from |
A value, a vector of values or a function returns a logical value.
Defaults to |
to |
A value. Defaults to |
A data.table.
iris %>% mutate_vars(is.factor,as.character) -> new_iris new_iris %>% replace_dt(Species, from = "setosa",to = "SS") new_iris %>% replace_dt(Species,from = c("setosa","virginica"),to = "sv") new_iris %>% replace_dt(Petal.Width, from = .2,to = 2) new_iris %>% replace_dt(from = .2,to = NA) new_iris %>% replace_dt(is.numeric, from = function(x) x > 3, to = 9999 )
iris %>% mutate_vars(is.factor,as.character) -> new_iris new_iris %>% replace_dt(Species, from = "setosa",to = "SS") new_iris %>% replace_dt(Species,from = c("setosa","virginica"),to = "sv") new_iris %>% replace_dt(Petal.Width, from = .2,to = 2) new_iris %>% replace_dt(from = .2,to = NA) new_iris %>% replace_dt(is.numeric, from = function(x) x > 3, to = 9999 )
The enhanced data.frame, including tibble and data.table, do not
support row names. To link to some base r facilities, there should be functions
to save information in row names. These functions are analogous to
rownames_to_column
and column_to_rownames
in tibble.
rn_col(.data, var = "rowname") col_rn(.data, var = "rowname")
rn_col(.data, var = "rowname") col_rn(.data, var = "rowname")
.data |
A data.frame. |
var |
Name of column to use for rownames. |
rn_col
returns a data.table,
col_rn
returns a data frame.
mtcars %>% rn_col() mtcars %>% rn_col("rn") mtcars %>% rn_col() -> new_mtcars new_mtcars %>% col_rn() -> old_mtcars old_mtcars setequal(mtcars,old_mtcars)
mtcars %>% rn_col() mtcars %>% rn_col("rn") mtcars %>% rn_col() -> new_mtcars new_mtcars %>% col_rn() -> old_mtcars old_mtcars setequal(mtcars,old_mtcars)
Rounds values in its first argument to the specified number of decimal places, returning character, ensuring first decimal digits are showed even when they are zeros.
round0(x, digits = 0)
round0(x, digits = 0)
x |
A numeric vector. |
digits |
Integer indicating the number of decimal places. Defaults to 0. |
A character vector.
https://stackoverflow.com/questions/42105336/how-to-round-a-number-and-make-it-show-zeros
a = 14.0034 round0(a,2) b = 10 round0(b,1)
a = 14.0034 round0(a,2) b = 10 round0(b,1)
Select a number or proportion of rows randomly from the data frame
sample_dt
is a merged version of sample_n_dt
and
sample_frac_dt
, this could be convenient.
sample_dt(.data, n = NULL, prop = NULL, replace = FALSE, by = NULL) sample_n_dt(.data, size, replace = FALSE, by = NULL) sample_frac_dt(.data, size, replace = FALSE, by = NULL)
sample_dt(.data, n = NULL, prop = NULL, replace = FALSE, by = NULL) sample_n_dt(.data, size, replace = FALSE, by = NULL) sample_frac_dt(.data, size, replace = FALSE, by = NULL)
.data |
A data.frame |
n |
Number of rows to select |
prop |
Fraction of rows to select |
replace |
Sample with or without replacement? Default uses |
by |
(Optional) Character. Specify if you want to sample by group. |
size |
For |
data.table
sample_n_dt(mtcars, 10) sample_n_dt(mtcars, 50, replace = TRUE) sample_frac_dt(mtcars, 0.1) sample_frac_dt(mtcars, 1.5, replace = TRUE) sample_dt(mtcars,n=10) sample_dt(mtcars,prop = 0.1) # sample by group(s) iris %>% sample_n_dt(2,by = "Species") iris %>% sample_frac_dt(.1,by = "Species") mtcars %>% sample_n_dt(1,by = c("cyl","vs"))
sample_n_dt(mtcars, 10) sample_n_dt(mtcars, 50, replace = TRUE) sample_frac_dt(mtcars, 0.1) sample_frac_dt(mtcars, 1.5, replace = TRUE) sample_dt(mtcars,n=10) sample_dt(mtcars,prop = 0.1) # sample by group(s) iris %>% sample_n_dt(2,by = "Species") iris %>% sample_frac_dt(.1,by = "Species") mtcars %>% sample_n_dt(1,by = c("cyl","vs"))
Select specific column(s) via various ways. One can select columns by their column names, indexes or regular expression recognizing the column name(s).
select_dt(.data, ..., cols = NULL, negate = FALSE) select_mix(.data, ..., rm.dup = TRUE)
select_dt(.data, ..., cols = NULL, negate = FALSE) select_mix(.data, ..., rm.dup = TRUE)
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications functions. It can also recieve conditional function to select columns. When starts with '-'(minus symbol) or '!', return the negative columns. |
cols |
(Optional)A numeric or character vector. |
negate |
Applicable when regular expression and "cols" is used.
If |
rm.dup |
Should duplicated columns be removed? Defaults to |
data.table
iris %>% select_dt(Species) iris %>% select_dt(Sepal.Length,Sepal.Width) iris %>% select_dt(Sepal.Length:Petal.Length) iris %>% select_dt(-Sepal.Length) iris %>% select_dt(-Sepal.Length,-Petal.Length) iris %>% select_dt(-(Sepal.Length:Petal.Length)) iris %>% select_dt(c("Sepal.Length","Sepal.Width")) iris %>% select_dt(-c("Sepal.Length","Sepal.Width")) iris %>% select_dt(1) iris %>% select_dt(-1) iris %>% select_dt(1:3) iris %>% select_dt(-(1:3)) iris %>% select_dt(1,3) iris %>% select_dt("Pe") iris %>% select_dt(-"Se") iris %>% select_dt(!"Se") iris %>% select_dt("Pe",negate = TRUE) iris %>% select_dt("Pe|Sp") iris %>% select_dt(cols = 2:3) iris %>% select_dt(cols = 2:3,negate = TRUE) iris %>% select_dt(cols = c("Sepal.Length","Sepal.Width")) iris %>% select_dt(cols = names(iris)[2:3]) iris %>% select_dt(is.factor) iris %>% select_dt(-is.factor) iris %>% select_dt(!is.factor) # select_mix could provide flexible mix selection select_mix(iris, Species,"Sepal.Length") select_mix(iris,1:2,is.factor) select_mix(iris,Sepal.Length,is.numeric) # set rm.dup to FALSE could save the duplicated column names select_mix(iris,Sepal.Length,is.numeric,rm.dup = FALSE)
iris %>% select_dt(Species) iris %>% select_dt(Sepal.Length,Sepal.Width) iris %>% select_dt(Sepal.Length:Petal.Length) iris %>% select_dt(-Sepal.Length) iris %>% select_dt(-Sepal.Length,-Petal.Length) iris %>% select_dt(-(Sepal.Length:Petal.Length)) iris %>% select_dt(c("Sepal.Length","Sepal.Width")) iris %>% select_dt(-c("Sepal.Length","Sepal.Width")) iris %>% select_dt(1) iris %>% select_dt(-1) iris %>% select_dt(1:3) iris %>% select_dt(-(1:3)) iris %>% select_dt(1,3) iris %>% select_dt("Pe") iris %>% select_dt(-"Se") iris %>% select_dt(!"Se") iris %>% select_dt("Pe",negate = TRUE) iris %>% select_dt("Pe|Sp") iris %>% select_dt(cols = 2:3) iris %>% select_dt(cols = 2:3,negate = TRUE) iris %>% select_dt(cols = c("Sepal.Length","Sepal.Width")) iris %>% select_dt(cols = names(iris)[2:3]) iris %>% select_dt(is.factor) iris %>% select_dt(-is.factor) iris %>% select_dt(!is.factor) # select_mix could provide flexible mix selection select_mix(iris, Species,"Sepal.Length") select_mix(iris,1:2,is.factor) select_mix(iris,Sepal.Length,is.numeric) # set rm.dup to FALSE could save the duplicated column names select_mix(iris,Sepal.Length,is.numeric,rm.dup = FALSE)
Given either regular expression,
separate_dt()
turns a single character column into two columns.
separate_dt( .data, separated_colname, into, sep = "[^[:alnum:]]+", remove = TRUE )
separate_dt( .data, separated_colname, into, sep = "[^[:alnum:]]+", remove = TRUE )
.data |
A data frame. |
separated_colname |
Column to be separated, can be a character or alias. |
into |
Character vector of length 2. |
sep |
Separator between columns. |
remove |
If |
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")) df %>% separate_dt(x, c("A", "B")) # equals to df %>% separate_dt("x", c("A", "B")) # If you just want the second variable: df %>% separate_dt(x,into = c(NA,"B"))
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")) df %>% separate_dt(x, c("A", "B")) # equals to df %>% separate_dt("x", c("A", "B")) # If you just want the second variable: df %>% separate_dt(x,into = c(NA,"B"))
'slice_dt()' lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. It is accompanied by a number of helpers for common use cases:
* 'slice_head_dt()' and 'slice_tail_dt()' select the first or last rows. * 'slice_sample_dt()' randomly selects rows. * 'slice_min_dt()' and 'slice_max_dt()' select rows with highest or lowest values of a variable.
slice_dt(.data, ..., by = NULL) slice_head_dt(.data, n, by = NULL) slice_tail_dt(.data, n, by = NULL) slice_max_dt(.data, order_by, n, by = NULL, with_ties = TRUE) slice_min_dt(.data, order_by, n, by = NULL, with_ties = TRUE) slice_sample_dt(.data, n, replace = FALSE, by = NULL)
slice_dt(.data, ..., by = NULL) slice_head_dt(.data, n, by = NULL) slice_tail_dt(.data, n, by = NULL) slice_max_dt(.data, order_by, n, by = NULL, with_ties = TRUE) slice_min_dt(.data, order_by, n, by = NULL, with_ties = TRUE) slice_sample_dt(.data, n, replace = FALSE, by = NULL)
.data |
A data.table |
... |
Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. |
by |
Slice by which group(s)? |
n |
When larger than or equal to 1, the number of rows. When between 0 and 1, the proportion of rows to select. |
order_by |
Variable or function of variables to order by. |
with_ties |
Should ties be kept together? The default, 'TRUE', may return more rows than you request. Use 'FALSE' to ignore ties, and return the first 'n' rows. |
replace |
Should sampling be performed with ('TRUE') or without ('FALSE', the default) replacement. |
A data.table
a = iris slice_dt(a,1,2) slice_dt(a,2:3) slice_dt(a,141:.N) slice_dt(a,1,.N) slice_head_dt(a,5) slice_head_dt(a,0.1) slice_tail_dt(a,5) slice_tail_dt(a,0.1) slice_max_dt(a,Sepal.Length,10) slice_max_dt(a,Sepal.Length,10,with_ties = FALSE) slice_min_dt(a,Sepal.Length,10) slice_min_dt(a,Sepal.Length,10,with_ties = FALSE) slice_sample_dt(a,10) slice_sample_dt(a,0.1) # use by to slice by group ## following codes get the same results slice_dt(a,1:3,by = "Species") slice_dt(a,1:3,by = Species) slice_dt(a,1:3,by = .(Species)) slice_head_dt(a,2,by = Species) slice_tail_dt(a,2,by = Species) slice_max_dt(a,Sepal.Length,3,by = Species) slice_max_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE) slice_min_dt(a,Sepal.Length,3,by = Species) slice_min_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE) # in `slice_sample_dt`, "by" could only take character class slice_sample_dt(a,.1,by = "Species") slice_sample_dt(a,3,by = "Species") slice_sample_dt(a,51,replace = TRUE,by = "Species")
a = iris slice_dt(a,1,2) slice_dt(a,2:3) slice_dt(a,141:.N) slice_dt(a,1,.N) slice_head_dt(a,5) slice_head_dt(a,0.1) slice_tail_dt(a,5) slice_tail_dt(a,0.1) slice_max_dt(a,Sepal.Length,10) slice_max_dt(a,Sepal.Length,10,with_ties = FALSE) slice_min_dt(a,Sepal.Length,10) slice_min_dt(a,Sepal.Length,10,with_ties = FALSE) slice_sample_dt(a,10) slice_sample_dt(a,0.1) # use by to slice by group ## following codes get the same results slice_dt(a,1:3,by = "Species") slice_dt(a,1:3,by = Species) slice_dt(a,1:3,by = .(Species)) slice_head_dt(a,2,by = Species) slice_tail_dt(a,2,by = Species) slice_max_dt(a,Sepal.Length,3,by = Species) slice_max_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE) slice_min_dt(a,Sepal.Length,3,by = Species) slice_min_dt(a,Sepal.Length,3,by = Species,with_ties = FALSE) # in `slice_sample_dt`, "by" could only take character class slice_sample_dt(a,.1,by = "Species") slice_sample_dt(a,3,by = "Species") slice_sample_dt(a,51,replace = TRUE,by = "Species")
Work like the '*_join_dt' series functions, joining tables with common or customized keys in various ways. The only difference is the joining is case insensitive like SQL.
sql_join_dt(x, y, by = NULL, type = "inner", suffix = c(".x", ".y"))
sql_join_dt(x, y, by = NULL, type = "inner", suffix = c(".x", ".y"))
x |
A data.table |
y |
A data.table |
by |
(Optional) A character vector of variables to join by. If 'NULL', the default, '*_join_dt()' will perform a natural join, using all variables in common across 'x' and 'y'. A message lists the variables so that you can check they're correct; suppress the message by supplying 'by' explicitly. To join by different variables on 'x' and 'y', use a named vector. For example, 'by = c("a" = "b")' will match 'x$a' to 'y$b'. To join by multiple variables, use a vector with length > 1. For example, 'by = c("a", "b")' will match 'x$a' to 'y$a' and 'x$b' to 'y$b'. Use a named vector to match different variables in 'x' and 'y'. For example, 'by = c("a" = "b", "c" = "d")' will match 'x$a' to 'y$b' and 'x$c' to 'y$d'. Notice that in 'sql_join', the joining variables would turn to upper case in the output table. |
type |
Which type of join would you like to use? Default uses "inner", other options include "left", "right", "full", "anti", "semi". |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
A data.table
dt1 = data.table(x = c("A","b"),y = 1:2) dt2 = data.table(x = c("a","B"),z = 4:5) sql_join_dt(dt1,dt2)
dt1 = data.table(x = c("A","b"),y = 1:2) dt2 = data.table(x = c("a","B"),z = 4:5) sql_join_dt(dt1,dt2)
Summarise group of values into one value for each group. If there is only one group, then only one value would be returned. The summarise function should always return a single value.
summarise_dt(.data, ..., by = NULL) summarize_dt(.data, ..., by = NULL) summarise_when(.data, when, ..., by = NULL) summarize_when(.data, when, ..., by = NULL) summarise_vars(.data, .cols = NULL, .func, ..., by) summarize_vars(.data, .cols = NULL, .func, ..., by)
summarise_dt(.data, ..., by = NULL) summarize_dt(.data, ..., by = NULL) summarise_when(.data, when, ..., by = NULL) summarize_when(.data, when, ..., by = NULL) summarise_vars(.data, .cols = NULL, .func, ..., by) summarize_vars(.data, .cols = NULL, .func, ..., by)
.data |
data.frame |
... |
List of variables or name-value pairs of summary/modifications
functions for |
by |
unquoted name of grouping variable of list of unquoted names of grouping variables. For details see data.table |
when |
An object which can be coerced to logical mode |
.cols |
Columns to be summarised. |
.func |
Function to be run within each column, should return a value or vectors with same length. |
summarise_vars
could complete summarise on specific columns.
data.table
iris %>% summarise_dt(avg = mean(Sepal.Length)) iris %>% summarise_dt(avg = mean(Sepal.Length),by = Species) mtcars %>% summarise_dt(avg = mean(hp),by = .(cyl,vs)) # the data.table way mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs)) # `.` is short for list iris %>% summarise_vars(is.numeric,min) iris %>% summarise_vars(-is.factor,min) iris %>% summarise_vars(1:4,min) iris %>% summarise_vars(is.numeric,min,by ="Species") mtcars %>% summarise_vars(is.numeric,mean,by = c("vs", "am")) # use multiple functions on multiple columns iris %>% summarise_vars(is.numeric,.func = list(mean,sd,median)) iris %>% summarise_vars(is.numeric,.func = list(mean,sd,median),by = Species)
iris %>% summarise_dt(avg = mean(Sepal.Length)) iris %>% summarise_dt(avg = mean(Sepal.Length),by = Species) mtcars %>% summarise_dt(avg = mean(hp),by = .(cyl,vs)) # the data.table way mtcars %>% summarise_dt(cyl_n = .N, by = .(cyl, vs)) # `.` is short for list iris %>% summarise_vars(is.numeric,min) iris %>% summarise_vars(-is.factor,min) iris %>% summarise_vars(1:4,min) iris %>% summarise_vars(is.numeric,min,by ="Species") mtcars %>% summarise_vars(is.numeric,mean,by = c("vs", "am")) # use multiple functions on multiple columns iris %>% summarise_vars(is.numeric,.func = list(mean,sd,median)) iris %>% summarise_vars(is.numeric,.func = list(mean,sd,median),by = Species)
Convenient printing of time elapsed. A wrapper of
data.table::timetaken
, but showing the results more directly.
sys_time_print(expr) pst(expr)
sys_time_print(expr) pst(expr)
expr |
Valid R expression to be timed. |
A character vector of the form HH:MM:SS,
or SS.MMMsec if under 60 seconds (invisibly for show_time
). See examples.
sys_time_print(Sys.sleep(1)) a = iris sys_time_print({ res = iris %>% mutate_dt(one = 1) }) res
sys_time_print(Sys.sleep(1)) a = iris sys_time_print({ res = iris %>% mutate_dt(one = 1) }) res
An efficient way to transpose data frames(data.frame/data.table/tibble).
t_dt(.data)
t_dt(.data)
.data |
A data.frame/data.table/tibble |
This function would return the original data.frame structure, keeping all the row names and column names. If the row names are not available or, "V1,V2..." will be provided.
A transposed data.frame
t_dt(iris) t_dt(mtcars)
t_dt(iris) t_dt(mtcars)
Duplicating rows according to a weighting variable. This is the opposite operation of 'count_dt'. Analogous to 'tidyr::uncount'.
uncount_dt(.data, wt, .remove = TRUE)
uncount_dt(.data, wt, .remove = TRUE)
.data |
A data.frame |
wt |
A vector of weights. |
.remove |
Should the column for |
df <- data.table(x = c("a", "b"), n = c(1, 2)) uncount_dt(df, n) uncount_dt(df,n,FALSE)
df <- data.table(x = c("a", "b"), n = c(1, 2)) uncount_dt(df, n) uncount_dt(df,n,FALSE)
Convenience function to paste together multiple columns into one.
unite_dt( .data, united_colname, ..., sep = "_", remove = FALSE, na2char = FALSE )
unite_dt( .data, united_colname, ..., sep = "_", remove = FALSE, na2char = FALSE )
.data |
A data frame. |
united_colname |
The name of the new column, string only. |
... |
A selection of columns. If want to select all columns, pass "" to the parameter. See example. |
sep |
Separator to use between values. |
remove |
If |
na2char |
If |
df <- expand.grid(x = c("a", NA), y = c("b", NA)) df # Treat missing value as NA, default df %>% unite_dt("z", x:y, remove = FALSE) # Treat missing value as character "NA" df %>% unite_dt("z", x:y, na2char = TRUE, remove = FALSE) df %>% unite_dt("xy", x:y) # Select all columns iris %>% unite_dt("merged_name",".")
df <- expand.grid(x = c("a", NA), y = c("b", NA)) df # Treat missing value as NA, default df %>% unite_dt("z", x:y, remove = FALSE) # Treat missing value as character "NA" df %>% unite_dt("z", x:y, na2char = TRUE, remove = FALSE) df %>% unite_dt("xy", x:y) # Select all columns iris %>% unite_dt("merged_name",".")
fread
from data.table could not recognize the encoding
and return the correct form, this could be unconvenient for text mining tasks. The
utf8-encoding
could use "UTF-8" as the encoding to override the current
encoding of characters in a data frame.
utf8_encoding(.data)
utf8_encoding(.data)
.data |
A data.frame. |
A data.table with characters in UTF-8 encoding
Transform a data frame from long format to wide by increasing the number of columns and decreasing the number of rows.
wider_dt(.data, ..., name, value = NULL, fun = NULL, fill = NA)
wider_dt(.data, ..., name, value = NULL, fun = NULL, fill = NA)
.data |
A data.frame |
... |
Optional. The unchanged group in the transformation.
Could use integer vector, could receive what |
name |
Chracter.One column name of class to spread |
value |
Chracter.One column name of value to spread.
If |
fun |
Should the data be aggregated before casting?
Defaults to |
fill |
Value with which to fill missing cells. Default uses |
The parameter of 'name' and 'value' should always be provided and should be explicit called (with the parameter names attached).
data.table
stocks = data.frame( time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4) ) %>% longer_dt(time) -> longer_stocks longer_stocks longer_stocks %>% wider_dt("time", name = "name", value = "value") longer_stocks %>% mutate_dt(one = 1) %>% wider_dt("time", name = "name", value = "one") ## using "fun" parameter for aggregation DT <- data.table(v1 = rep(1:2, each = 6), v2 = rep(rep(1:3, 2), each = 2), v3 = rep(1:2, 6), v4 = rnorm(6)) ## for each combination of (v1, v2), add up all values of v4 DT %>% wider_dt(v1,v2, value = "v4", name = ".", fun = sum)
stocks = data.frame( time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 0, 1), Y = rnorm(10, 0, 2), Z = rnorm(10, 0, 4) ) %>% longer_dt(time) -> longer_stocks longer_stocks longer_stocks %>% wider_dt("time", name = "name", value = "value") longer_stocks %>% mutate_dt(one = 1) %>% wider_dt("time", name = "name", value = "one") ## using "fun" parameter for aggregation DT <- data.table(v1 = rep(1:2, each = 6), v2 = rep(rep(1:3, 2), each = 2), v3 = rep(1:2, 6), v4 = rnorm(6)) ## for each combination of (v1, v2), add up all values of v4 DT %>% wider_dt(v1,v2, value = "v4", name = ".", fun = sum)