Descriptive Statistics in R

Frequency distributions can be made in R using the table() function. A relative frequency distribution can easily be calculated by dividing the frequency distribution by the number of observations (the base function prop.table() performs the same division). The number of observations in a vector of data can be computed by using the length() function. To get the percent frequency distribution, simply use the same code that was used to get the relative frequency distribution and multiply it by 100. A pie chart can be made by using the pie() function and a bar chart can be created by using the barplot() function. The input for both of these functions is the frequency distribution.

 > table(x) # Frequency Distribution
 > table(x) / length(x) # Relative Frequency Distribution
 > table(x) / length(x) * 100 # Percent Frequency Distribution
 > pie(table(x))
 > barplot(table(x))
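
As a minimal sketch of the steps above, the following uses a small made-up vector of drink orders. The data and the variable names freq, rel_freq and pct_freq are assumptions chosen for illustration only.

```r
# Hypothetical qualitative data (illustrative values only)
x <- c("cola", "lemonade", "cola", "water", "cola",
       "lemonade", "water", "cola", "lemonade", "cola")

freq     <- table(x)            # frequency distribution
rel_freq <- freq / length(x)    # relative frequency distribution
pct_freq <- rel_freq * 100      # percent frequency distribution

print(freq)
print(rel_freq)
print(pct_freq)

# prop.table() performs the same division in one step
print(prop.table(freq))

pie(freq)       # pie chart of the frequency distribution
barplot(freq)   # bar chart of the frequency distribution
```

Storing the result of table() once and reusing it avoids recomputing the frequency distribution for every summary and plot.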

Creating frequency distributions for quantitative data is trickier, as you have to define the classes, or groupings. One quick way of defining the classes is to use the ones R chooses when creating a histogram. A histogram can be created in R using the hist() function. To define your own classes, use the seq() function to set the class limits and the cut() function to place each observation into one of the classes. Once the classes are determined, a frequency distribution for quantitative data can be created by applying table(), the same function that was used for qualitative data, to the classed data. A cumulative frequency distribution can be created by applying the cumsum() function to the frequency distribution.

 > hist(x) # Histogram
 > breaks <- seq(from, to, by) # Define Classes
 > classes <- cut(x, breaks) # Place Data in Classes
 > table(classes) # Frequency Distribution
 > cumsum(table(classes)) # Cumulative Frequencies
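
The class-building steps can be sketched as follows. The data values and the class width of 5 are assumptions made up for this example, and the names hist_breaks, breaks, classes, freq and cum_freq are illustrative.

```r
# Hypothetical quantitative data (illustrative values only)
x <- c(12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
       22, 23, 22, 21, 33, 28, 14, 18, 16, 13)

# Classes R would choose for a histogram; plot = FALSE skips the drawing
hist_breaks <- hist(x, plot = FALSE)$breaks
print(hist_breaks)

# Alternatively, define the class limits manually: width 5 from 10 to 35
breaks <- seq(from = 10, to = 35, by = 5)

# cut() assigns each observation to a class; intervals are (lower, upper] by default
classes <- cut(x, breaks)

freq     <- table(classes)   # frequency distribution
cum_freq <- cumsum(freq)     # cumulative frequency distribution

print(freq)
print(cum_freq)
```

Note that cumsum() is applied to the frequency distribution rather than to the raw data, so the last cumulative frequency equals the number of observations.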

A crosstabulation can be created by using the same function that was used for a frequency distribution, table(). However, instead of inputting only one variable, input two variables. Row percentages can be calculated by dividing the crosstabulation by the sum of the rows, using the rowSums() function, and multiplying by 100. Similarly, column percentages can be calculated by dividing the crosstabulation by the sum of the columns, using the colSums() function, and multiplying by 100. Because R recycles a vector down the rows of a table, the column sums have to be applied with sweep() rather than with a plain division. A scatter diagram can be created by using the plot() function and a trend line can be created using the abline() function.

 > table(x, y) # Crosstabulation
 > table(x, y) / rowSums(table(x, y)) * 100 # Row Percentages
 > sweep(table(x, y), 2, colSums(table(x, y)), "/") * 100 # Column Percentages
 > plot(x, y) # Scatter Diagram
 > abline(lm(y ~ x)) # Trend Line
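
A minimal sketch of the crosstabulation steps, assuming two made-up qualitative variables x and y and two made-up quantitative variables u and v for the scatter diagram. All values and the names tab, row_pct and col_pct are illustrative assumptions.

```r
# Hypothetical paired qualitative data (illustrative values only)
x <- c("low", "low", "high", "high", "low", "high", "low", "high", "high", "low")
y <- c("yes", "no",  "yes",  "yes",  "no",  "no",   "yes", "yes",  "yes",  "no")

tab <- table(x, y)   # crosstabulation
print(tab)

# Row percentages: plain division works because R recycles down the rows
row_pct <- tab / rowSums(tab) * 100

# Column percentages: sweep() divides each column by its own total
col_pct <- sweep(tab, 2, colSums(tab), "/") * 100

print(row_pct)
print(col_pct)

# prop.table() with a margin argument is an equivalent shortcut
print(prop.table(tab, margin = 1) * 100)   # rows
print(prop.table(tab, margin = 2) * 100)   # columns

# Hypothetical quantitative data for the scatter diagram
u <- c(1, 2, 3, 4, 5)
v <- c(2.1, 3.9, 6.2, 7.8, 10.1)
plot(u, v)           # scatter diagram
abline(lm(v ~ u))    # least-squares trend line
```

A quick check on the result is that every row of row_pct and every column of col_pct sums to 100.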

Many basic numerical summaries can be computed in R by simply using their name. For example, the function for calculating the mean is simply mean(). Similarly, the function for calculating the median is median(). Computing the mode is more difficult as there is no function for directly calculating the mode. To get the mode, use the table() function and identify which value has the highest frequency. The function for calculating percentiles is quantile(). While the previous functions only need one argument, the data, this function takes a second argument to request a specific percentile. The second argument is the desired percentile, expressed as a decimal. For example, if you want the 30th percentile, you set the second argument to .30. Quartiles can be computed using the same function by inputting the appropriate value for the second argument: .25, .50 or .75. If the second argument is omitted, quantile() returns the minimum, the three quartiles and the maximum.

 > mean(x)
 > median(x)
 > table(x) # Mode
 > quantile(x, probs) # Percentiles
 > quantile(x, probs = c(.25, .50, .75)) # Quartiles
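
The following sketch works through these summaries on a small made-up vector. The data and the names freq and mode_value are assumptions for illustration. Keep in mind that quantile() interpolates between observations by default, so its results can differ slightly from hand calculations based on a textbook rule.

```r
# Hypothetical quantitative data (illustrative values only)
x <- c(3, 7, 7, 2, 9, 7, 4, 3, 8, 5)

print(mean(x))     # arithmetic mean
print(median(x))   # middle value of the sorted data

# Mode: tabulate the data and keep the value(s) with the highest frequency
freq <- table(x)
mode_value <- as.numeric(names(freq)[freq == max(freq)])
print(mode_value)

# 30th percentile, expressed as a decimal
print(quantile(x, probs = 0.30))

# Quartiles
print(quantile(x, probs = c(0.25, 0.50, 0.75)))
```

Comparing the frequencies against max(freq) returns every tied value, so the same code also handles data with more than one mode.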

Although there is a function called range(), it does not calculate the range of the data. Instead, it gives you the minimum and maximum values of the data. Since the range is simply the difference between these two values, you can apply the diff() function afterwards to compute the range. Alternatively, you can subtract the result of min() from the result of max(). To compute the interquartile range, use the function IQR(). It is important to note that all the letters in this function name are capitalized, as R is a case-sensitive programming language. To compute the variance, simply use the function var(). Similarly, to compute the standard deviation, use the function sd(). Both treat the data as a sample and divide by n - 1. There is no function for calculating the coefficient of variation in base R, so it must be done manually using the functions for mean and standard deviation.

 > diff(range(x)) # Range
 > IQR(x) # Interquartile Range
 > var(x) # Variance
 > sd(x) # Standard Deviation
 > sd(x) / mean(x) * 100 # Coefficient of Variation
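
A minimal sketch of the measures of variability, assuming a small made-up vector. The values and the name cv are illustrative only.

```r
# Hypothetical quantitative data (illustrative values only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

print(range(x))          # minimum and maximum, not the range itself
print(diff(range(x)))    # range: maximum minus minimum
print(max(x) - min(x))   # same result using max() and min()

print(IQR(x))            # interquartile range (function name is all capitals)
print(var(x))            # sample variance (divides by n - 1)
print(sd(x))             # sample standard deviation

# Coefficient of variation, computed manually as a percentage
cv <- sd(x) / mean(x) * 100
print(cv)
```

Since sd() is the square root of var(), either one can be used to check the other.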

Calculating skewness requires the use of a package as it is not part of base R. Install the "moments" package using the install.packages() function and then load it by using the library() function. Note that quotation marks are required around the package name in the former function but not the latter. Then simply use the skewness() function to calculate the skewness. z-scores can be calculated manually using the functions for mean and standard deviation (the base function scale() performs the same standardization). To calculate the covariance, use the function cov(). Similarly, to calculate the correlation coefficient, use the function cor().

 > install.packages("moments") # Install Package (only needed once)
 > library(moments) # Load Package
 > skewness(x)
 > (x - mean(x)) / sd(x) # z-Scores
 > cov(x, y) # Covariance
 > cor(x, y) # Correlation Coefficient
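
As a sketch of these last measures, the following uses two short made-up vectors. The data and the name z are assumptions for illustration. The skewness() call is guarded so that the example still runs when the "moments" package has not been installed.

```r
# Hypothetical paired quantitative data (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# z-scores computed manually from the mean and standard deviation
z <- (x - mean(x)) / sd(x)
print(z)

# scale() standardizes in the same way but returns a one-column matrix
print(as.vector(scale(x)))

print(cov(x, y))   # sample covariance
print(cor(x, y))   # correlation coefficient

# Skewness needs the "moments" package; skip it if the package is missing
if (requireNamespace("moments", quietly = TRUE)) {
  print(moments::skewness(y))
} else {
  message('Package "moments" is not installed; run install.packages("moments") first.')
}
```

The correlation coefficient is the covariance divided by the product of the two standard deviations, which gives a simple way to cross-check cov() against cor().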