R data.table symbols and operators you should know (2024)

Make your R data.table code more efficient and elegant with these special symbols and operators. Plus, learn about the new fcase() function.

By Sharon Machlis

Executive Editor, Data & Analytics, InfoWorld


R data.table code becomes more efficient — and elegant — when you take advantage of its special symbols and functions. With that in mind, we’ll look at some special ways to subset, count, and create new columns.

For this demo, I’m going to use data from the 2019 Stack Overflow developers survey, with about 90,000 responses. If you want to follow along, you can download the data from Stack Overflow.

If the data.table package is not installed on your system, install it from CRAN and then load it as usual with library(data.table). To start, you may want to read in just the first few rows of the data set to make it easier to examine the data structure. You can do that with data.table’s fread() function and the nrows argument. I’ll read in 10 rows:

data_sample <- fread("data/survey_results_public.csv", nrows = 10)
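
If you don't have the survey file handy, the nrows behavior can be tried on any small CSV. The sketch below is only an illustration: the temporary file and its three columns are made-up stand-ins for the real survey data.

```r
library(data.table)

# Hypothetical stand-in for the survey file: a tiny CSV in a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Respondent,Hobbyist,ConvertedComp",
  "1,Yes,61000",
  "2,No,95179",
  "3,Yes,8820",
  "4,No,"
), tmp)

peek <- fread(tmp, nrows = 2)  # header plus the first two data rows only
full <- fread(tmp)             # the whole file

dim(peek)  # rows and columns in the sample
dim(full)  # rows and columns in the full read
str(peek)  # column names and types at a glance
```

Reading a handful of rows first is a cheap way to check column names and types before committing to the full read.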

As you’ll see, there are 85 columns to examine. (If you want to know what all the columns mean, there are files in the download with the data schema and a PDF of the original survey.)

To read in all the data, I’ll use:

mydt <- fread("data/survey_results_public.csv")

Next, I’ll create a new data.table with just a few columns to make it easier to work with and see results. A reminder that data.table uses this basic syntax:

mydt[i, j, by]

The data.table package introduction says to read this as “take dt, subset or reorder rows using i, calculate j, grouped by by.” Keep in mind that i and j are similar to base R’s bracket ordering: rows first, columns second. So i is for operations you’d do on rows (choosing rows based on row numbers or conditions); j is what you’d do with columns (select columns or create new columns from calculations). However, note also that you can do a lot more inside data.table brackets than a base R data frame. And the “by” section is new to data.table.

Since I’m selecting columns, that code goes in the “j” spot, which means the brackets need a comma first to leave the “i” spot empty:

mydt[, j]
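
To make the three slots concrete, here is a minimal sketch on a toy table. The table and its column names (team, score) are invented for this illustration and are not part of the survey data.

```r
library(data.table)

# Toy table; column names are invented for this illustration
toy <- data.table(
  team  = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 30, 40, 50)
)

rows_only <- toy[score > 15]   # i only: keep rows where score > 15
cols_only <- toy[, "score"]    # j only: the leading comma leaves i empty

# i, j, and by together: filter rows, compute a mean, one result per team
grouped <- toy[score > 15, list(avg = mean(score)), by = team]
grouped
```

Reading the last line aloud in the order i, j, by ("rows with score above 15, average the score, per team") is a reasonable way to check that an expression does what you intended.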

Select data.table columns

One of the things I like about data.table is that it’s easy to select columns either quoted or unquoted. Unquoted is often more convenient (that’s usually the tidyverse way). But quoted is useful if you’re using data.table inside your own functions, or if you want to pass in a vector you created somewhere else in your code.

You can select data.table columns the typical base R way, with a conventional vector of quoted column names. For example:

dt1 <- mydt[, c("LanguageWorkedWith", "LanguageDesireNextYear", 
"OpenSourcer", "CurrencySymbol", "ConvertedComp",
"Hobbyist")]

If you want to use them unquoted, create a list instead of a vector and you can pass in the unquoted names.

dt1 <- mydt[, list(LanguageWorkedWith, LanguageDesireNextYear, 
OpenSourcer, CurrencySymbol, ConvertedComp,
Hobbyist)]

And now we come to our first special symbol. Instead of typing out list(), you can just use a dot:

dt1 <- mydt[, .(LanguageWorkedWith, LanguageDesireNextYear, 
OpenSourcer, CurrencySymbol, ConvertedComp,
Hobbyist)]

That .() is a shortcut for list() inside data.table brackets.
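
The three selection styles can be compared side by side on a small table. This is a sketch with made-up columns (id, lang, years); the last line also shows that naming an element inside .() renames the resulting column.

```r
library(data.table)

# Toy table; column names are invented for this illustration
toy <- data.table(id = 1:3, lang = c("R", "Python", "SQL"), years = c(5, 2, 9))

quoted  <- toy[, c("id", "lang")]         # character vector of quoted names
long    <- toy[, list(id, lang)]          # unquoted names wrapped in list()
short   <- toy[, .(id, lang)]             # same request using the dot alias
renamed <- toy[, .(id, language = lang)]  # a named element renames the column

all.equal(quoted, long)
all.equal(long, short)
names(renamed)
```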

What if you want to use an already-existing vector of column names? Putting the vector object name inside data.table brackets won’t work. If I create a vector with quoted column names, like so:

mycols <- c("LanguageWorkedWith", "LanguageDesireNextYear", 
"OpenSourcer", "CurrencySymbol", "ConvertedComp", "Hobbyist")

Then this code will not work:

dt1 <- mydt[, mycols]

Instead, you need to put .. (that’s two dots) in front of the vector object name:

dt1 <- mydt[, ..mycols]

Why two dots? That seemed kind of random to me until I read the explanation. Think of it like the two dots in a Unix command-line terminal that move you up one directory. Here, you’re moving up one namespace, from the environment inside data.table brackets up to the environment the brackets were called from (the global environment, in this case). (That really does help me remember it!)
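
The "one level up" idea also applies inside your own functions, where the vector of names lives in the function's frame rather than the global environment. A minimal sketch follows; the toy table and the helper name pick_cols are hypothetical, and with = FALSE is shown as the older spelling of the same request.

```r
library(data.table)

# Toy table; column names are invented for this illustration
toy <- data.table(id = 1:3, x = c(2.5, 3.5, 4.5), y = c("a", "b", "c"))
keep <- c("id", "y")

by_dots <- toy[, ..keep]              # look up `keep` outside the table
by_with <- toy[, keep, with = FALSE]  # older way to ask for the same thing

# Hypothetical helper: `cols` is found in the function's own frame
pick_cols <- function(dt, cols) dt[, ..cols]
from_fn <- pick_cols(toy, "x")

names(by_dots)
names(from_fn)
```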

Count data.table rows

On to the next symbol. To count by group, you can use data.table’s .N symbol, where .N stands for “number of rows.” It can be the total number of rows, or number of rows per group if you’re aggregating in the “by” section.

This expression returns the total number of rows in the data.table:

mydt[, .N]

The following example calculates the number of rows grouped by one variable: whether people in the survey also code as a hobby (the Hobbyist variable).

mydt[, .N, Hobbyist]

# returns:
   Hobbyist     N
1:      Yes 71257
2:       No 17626

You can use the plain column name within data.table brackets if there is just one variable. If you want to group by two or more variables, use the . symbol. For example:

mydt[, .N, .(Hobbyist, OpenSourcer)]
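
Here is a small sketch of .N in each of those roles, on five made-up rows (the "Often" value is invented for brevity). It names the by argument explicitly, which is equivalent to the positional form used above, and it also shows .N in the i slot, where it picks out the last row.

```r
library(data.table)

# Five made-up responses; values are illustrative only
toy <- data.table(
  Hobbyist    = c("Yes", "Yes", "No", "Yes", "No"),
  OpenSourcer = c("Never", "Never", "Never", "Often", "Often")
)

total    <- toy[, .N]                                 # total number of rows
by_one   <- toy[, .N, by = Hobbyist]                  # one grouping column
by_two   <- toy[, .N, by = .(Hobbyist, OpenSourcer)]  # two grouping columns
last_row <- toy[.N]                                   # .N in i: the last row

by_one
by_two
```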

To order results from highest to lowest, you can add a second set of brackets after the first. The .N symbol automatically generates a column named N (of course you can rename it if you want), so ordering by the number of rows can look something like this:

mydt[, .N, .(Hobbyist, OpenSourcer)][order(Hobbyist, -N)]

As I learn data.table code, I find it helpful to read it step by step. So I’d read this as “For all rows in mydt (since there’s nothing in the ‘i’ spot), count number of rows, grouping by Hobbyist and OpenSourcer. Then order first by Hobbyist and then number of rows descending.”

That’s equivalent to this dplyr code:

mydf %>%
  count(Hobbyist, OpenSourcer) %>%
  arrange(Hobbyist, -n)

If you find the tidyverse’s conventional multi-line approach more readable, this data.table code also works:

mydt[, .N, 
.(Hobbyist, OpenSourcer)][
order(Hobbyist, -N)
]
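
Chaining brackets is just shorthand for saving an intermediate result and then operating on it. The sketch below, on made-up rows, runs the count-then-sort logic both ways so you can confirm the two forms agree.

```r
library(data.table)

# Six made-up responses; values are illustrative only
toy <- data.table(
  Hobbyist    = c("Yes", "Yes", "No", "Yes", "No", "No"),
  OpenSourcer = c("Never", "Never", "Never", "Often", "Often", "Often")
)

# Two explicit steps: count per group, then sort the counts
counts <- toy[, .N, by = .(Hobbyist, OpenSourcer)]
sorted <- counts[order(Hobbyist, -N)]

# The same thing as one chained expression
chained <- toy[, .N, by = .(Hobbyist, OpenSourcer)][order(Hobbyist, -N)]

all.equal(sorted, chained)
chained
```

The second set of brackets sees the N column because it operates on the result of the first set, not on the original table.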

Add columns to a data.table

Next, I’d like to add columns to see if each respondent uses R, if they use Python, if they use both, or if they use neither. The LanguageWorkedWith column has information about languages used, and a few rows of that data look like this:

[Screenshot: sample rows of the LanguageWorkedWith column. Credit: Sharon Machlis]

Each answer is a single character string. Most have multiple languages separated by a semicolon.

As is often the case, it’s easier to search for Python than R, since you can’t just search for "R" in the string (Ruby and Rust also contain a capital R) the way you can search for "Python". This is the simpler code to create a TRUE/FALSE vector that checks if each string in LanguageWorkedWith contains Python:

ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE)

If you know SQL, you’ll recognize that “like” syntax. I, well, like %like%. It’s a nice streamlined way to check for pattern matching. The function documentation says it’s meant to be used inside data.table brackets, but actually you can use it in any of your code, not just with data.tables. I checked with data.table creator Matt Dowle, who said the advice to use it inside the brackets is because some extra performance optimization happens there.
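
As a quick check of %like% outside data.table brackets, here is a sketch on a plain character vector of made-up answers. It also suggests that, since %like% already returns TRUE or FALSE for each element, the ifelse() wrapper gives the same result as the bare expression in this case.

```r
library(data.table)

# Made-up answers in the survey's semicolon-separated style
langs <- c("Python;R", "Ruby;Rust", "C++;Python")

has_python <- langs %like% "Python"                      # a logical vector
via_grepl  <- grepl("Python", langs)                     # base R equivalent
wrapped    <- ifelse(langs %like% "Python", TRUE, FALSE) # the wrapped form

has_python
identical(has_python, via_grepl)
identical(has_python, wrapped)
```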

Next, here’s code to add a column called PythonUser to the data.table:

dt1[, PythonUser := ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE)]

Notice the := operator. Python has an operator like that, too, and ever since I heard it called the “walrus operator,” that’s what I call it. I think it’s officially “assignment by reference.” That’s because the code above changed the existing object dt1 data.table by adding the new column — without needing to save it to a new variable.
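
One practical consequence of assignment by reference is that a second name bound to the same table sees the new column too, while an explicit copy() does not. A minimal sketch with a made-up one-column table:

```r
library(data.table)

dt       <- data.table(x = 1:3)
alias    <- dt        # a second name for the same object, not a copy
snapshot <- copy(dt)  # an independent copy taken before the change

dt[, y := x * 2L]     # adds y in place; no reassignment needed

names(dt)        # the new column is there
names(alias)     # the alias sees it too
names(snapshot)  # the copy is unchanged
```

If you need to keep the original untouched, taking a copy() before using := is the usual approach.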

To search for R, I’ll use the regular expression "\\bR\\b", which says: “Find a pattern that starts with a word boundary (the \\b), then an R, and then ends with another word boundary.” (I can’t just look for "R;" because the last item in each string doesn’t have a semicolon.)

This adds an RUser column to dt1:

dt1[, RUser := ifelse(LanguageWorkedWith %like% "\\bR\\b", TRUE, FALSE)]
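
To see why the word boundaries matter, compare a plain "R" pattern with the bounded one on a few made-up answers:

```r
library(data.table)

# Made-up answers in the survey's semicolon-separated style
langs <- c("Python;R", "Ruby;Rust", "R;SQL", "C;R;Java", "JavaScript")

naive <- langs %like% "R"        # also matches the R in Ruby and Rust
exact <- langs %like% "\\bR\\b"  # matches R only as a standalone word

data.table(langs, naive, exact)
```

The second row is the interesting one: "Ruby;Rust" matches the naive pattern but not the bounded one, because each capital R there is followed by another letter. R at the very start, in the middle, or at the end of a string still matches.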

If you wanted to add both columns at once with := you would need to turn that walrus operator into a function by backquoting it, like this:

dt1[, `:=`(
PythonUser = ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE),
RUser = ifelse(LanguageWorkedWith %like% "\\bR\\b", TRUE, FALSE)
)]
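
The functional form can be tried on a toy table, as sketched below. The sketch also shows a second spelling that data.table accepts, with a character vector of new names on the left and a list of values on the right; the column names ending in 2 are made up to keep the two results apart.

```r
library(data.table)

# Made-up answers in the survey's semicolon-separated style
toy <- data.table(LanguageWorkedWith = c("Python;R", "Ruby;Rust", "R;SQL"))

# Functional form: one name = value pair per new column
toy[, `:=`(
  PythonUser = LanguageWorkedWith %like% "Python",
  RUser      = LanguageWorkedWith %like% "\\bR\\b"
)]

# Alternative: names on the left, a list of values on the right
toy[, c("PythonUser2", "RUser2") := .(
  LanguageWorkedWith %like% "Python",
  LanguageWorkedWith %like% "\\bR\\b"
)]

toy
```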

More useful data.table operators

There are several other data.table operators worth knowing. The %between% operator has this syntax:

myvector %between% c(lower_value, upper_value)

So if I want to filter for all responses where compensation was between 50,000 and 100,000 paid in US dollars, this code works:

comp_50_100k <- dt1[CurrencySymbol == "USD" & 
ConvertedComp %between% c(50000, 100000)]

The second line above is the between condition. Note that the %between% operator includes both the lower and upper values when it checks.
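
The inclusive behavior is easy to confirm on a few made-up values that sit on and just outside the bounds. The sketch also uses the function form, between(), whose incbounds argument switches to exclusive bounds.

```r
library(data.table)

# Made-up compensation values around the two bounds
comp <- c(49999, 50000, 75000, 100000, 100001)

inclusive <- comp %between% c(50000, 100000)
exclusive <- between(comp, 50000, 100000, incbounds = FALSE)

data.table(comp, inclusive, exclusive)
```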

Another useful operator is %chin%. It works like base R’s %in% but is optimized for speed and is for character vectors only. So, if I want to filter for all rows where the OpenSourcer column was either “Never” or “Less than once per year” this code works:

rareos <- dt1[OpenSourcer %chin% c("Never", "Less than once per year")]

This is pretty similar to base R, except that in base R you must specify the data frame name inside the brackets and add a comma after the filter expression:

rareos_df <- df1[df1$OpenSourcer %in% c("Never", "Less than once per year"),]
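
On character vectors, %chin% and %in% should give the same answer, so the difference is speed rather than results. A minimal sketch with made-up values:

```r
library(data.table)

# Made-up answers; the strings follow the survey's wording
freq <- c("Never", "Once a month or more often",
          "Less than once per year", "Never")
targets <- c("Never", "Less than once per year")

fast <- freq %chin% targets  # data.table's character-only matcher
base <- freq %in% targets    # base R equivalent

fast
identical(fast, base)
```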

The new fcase() function

For this final demo, I’ll start by creating a new data.table with just people who reported compensation in US dollars:

usd <- dt1[CurrencySymbol == "USD" & !is.na(ConvertedComp)]

Next, I’ll create a new column called Language for whether someone uses just R, just Python, both, or neither. And I’ll use the new fcase() function. At the time this article was published, fcase() was available only in data.table’s development version. (It has since been released on CRAN, starting with data.table 1.13.0.) If you already have data.table installed, you can update to the latest dev version with this command:

data.table::update.dev.pkg()

The fcase() function is similar to SQL’s CASE WHEN statement and dplyr’s case_when() function. The basic syntax is fcase(condition1, "value1", condition2, "value2") and so on. A default value for “everything else” can be added with default = value.

Here is code to create the new Language column:

usd[, Language := fcase(
RUser & !PythonUser, "R",
PythonUser & !RUser, "Python",
PythonUser & RUser, "Both",
!PythonUser & !RUser, "Neither"
)]

I put each condition on a separate line because I find it easier to read, but you don’t have to.
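
As a variation, the last condition can be replaced with the default argument mentioned above, so that anything not matched by an earlier condition falls through to "Neither". This is a sketch on a made-up four-row table covering each combination, and it assumes a data.table version that includes fcase().

```r
library(data.table)

# One made-up row per combination of the two flags
toy <- data.table(
  RUser      = c(TRUE, FALSE, TRUE, FALSE),
  PythonUser = c(FALSE, TRUE, TRUE, FALSE)
)

toy[, Language := fcase(
  RUser & !PythonUser, "R",
  PythonUser & !RUser, "Python",
  PythonUser & RUser,  "Both",
  default = "Neither"  # used when no condition above is TRUE
)]

toy
```

Conditions are checked in order and the first TRUE one wins, so put more specific conditions before more general ones.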

A caution: If you’re using RStudio, the data.table structure doesn’t automatically update in the top right RStudio pane after you create a new column with the walrus operator. You need to manually click the refresh icon to see changes in the number of columns.

There are a few other symbols I won’t cover in this article. You can find a list of them in the “special symbols” data.table help file by running help("special-symbols"). One of the most useful, .SD, already has its own Do More With R article and video, “How to use .SD in the R data.table package.”

For more R tips, head to the “Do More With R” page on InfoWorld or check out the “Do More With R” YouTube playlist.


Sharon Machlis is Director of Editorial Data & Analytics at Foundry, where she works on data analysis and in-house editor tools in addition to writing. Her book Practical R for Mass Communication and Journalism was published by CRC Press. She was named Digital Analytics Association's 2021 Top (Data) Practitioner, winner of the 2023 Jesse H. Neal journalism award for best instructional content, 2014 Azbee national gold award for investigative reporting, and 2017 Azbee gold for how-to article, among other awards. You can find her on Mastodon at masto.machlis.com/@smach.


Copyright © 2020 IDG Communications, Inc.

