Applied Supervised Learning with R

Data Structures in R

In any programming language, data structures are the fundamental units of storing information and making it ready for further processing. Depending on the type of data, various forms of data structures are available for storing and processing. Each of the data structures explained in the next section has its characteristic features and applicability.

In this section, we will explore each of them and how to use them with our data.

Vector

A vector is the most fundamental of all the data structures; its values are stored in a one-dimensional array. A vector is most suitable for a single variable with a series of values. In step 4 of Exercise 3, Reading a JSON File and Storing the Data in DataFrame, we assigned column names to a DataFrame by combining them with the c() function, as shown here:

c_names <- c("S.No","District","Area","Production","PTY")

We can extract the second value in the vector by specifying its index in square brackets next to the vector name. Let's review the following code, where we subset the value at the second index:

c_names[2]

The output is as follows:

## [1] "District"
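Square brackets accept more than a single position. The following is a minimal sketch of a few other common ways to subset the same c_names vector (note that R indices start at 1); the particular selections are chosen only for illustration:

```r
# Same vector as in the text
c_names <- c("S.No", "District", "Area", "Production", "PTY")

first_and_third <- c_names[c(1, 3)]        # select several positions at once
without_first   <- c_names[-1]             # a negative index drops that position
not_pty         <- c_names[c_names != "PTY"]  # logical subsetting by condition

first_and_third
without_first
not_pty
length(c_names)                            # number of elements in the vector
```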

The collection of strings combined with the c() function is a vector. A vector can store a homogeneous collection of characters, integers, or floating-point values. If we try to store an integer together with a character value, an implicit type conversion (coercion) happens, which converts all the values to character.

Caution

Note that it might not be the expected behavior every time. Caution is required, especially when the data is not clean. It may otherwise cause errors that are harder to find than the usual programming errors.
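A small sketch of this coercion, using a hypothetical vector named mixed, may make the risk easier to see. One stray text value silently turns the whole vector into character, and converting it back yields NA for the value that is not a number:

```r
# Two integers and one string: the whole vector becomes character
mixed <- c(1L, 2L, "three")
class(mixed)
mixed

# Converting back to integer: the non-numeric entry becomes NA
# (R would normally warn here; the warning is suppressed for this sketch)
nums <- suppressWarnings(as.integer(mixed))
nums
is.na(nums)
```

Checking class() right after reading or combining data is a cheap way to catch this kind of silent conversion early.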

Matrix

A matrix is a two-dimensional data structure, organized in rows and columns, which makes it suitable for storing tabular data. Like a vector, a matrix allows only a homogeneous collection of data in its rows and columns.

The following code generates 16 random numbers drawn from a binomial distribution with the number of trials (size) equal to 100 and the success probability (prob) equal to 0.4. The rbinom() function in R is useful for generating such random numbers:

r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)

Now, to store r_numbers as a matrix, use the following command:

matrix(r_numbers, nrow = 4, ncol = 4)

The output is as follows:

##      [,1] [,2] [,3] [,4]

## [1,]   48   39   37   39

## [2,]   34   41   32   38

## [3,]   40   34   42   46

## [4,]   37   42   36   44
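By default, matrix() fills the values column by column. The sketch below illustrates this and shows basic indexing. The seed and the names m_col and m_row are assumptions added only to make the example reproducible, so the generated numbers will differ from the output above:

```r
set.seed(42)  # hypothetical seed, used only so the sketch is reproducible
r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)

# Default fill order is column by column; byrow = TRUE fills row by row
m_col <- matrix(r_numbers, nrow = 4, ncol = 4)
m_row <- matrix(r_numbers, nrow = 4, ncol = 4, byrow = TRUE)

m_col[2, 3]   # single value at row 2, column 3
m_col[, 1]    # entire first column, returned as a vector
dim(m_col)    # number of rows and columns

# With column-wise fill, the first column holds the first four values
identical(m_col[, 1], r_numbers[1:4])

# The two fill orders are transposes of each other
identical(m_row, t(m_col))
```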

Let's extend the text mining example from Exercise 4, Reading a CSV File with Text Column and Storing the Data in VCorpus, to understand the usage of a matrix in text mining.

Consider the following two reviews. Use lapply() to convert the first two reviews to character with as.character and print them:

lapply(review_corpus[1:2], as.character)

The output is as follows:

## $`1`

## [1] "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat, and it smells better. My Labrador is finicky, and she appreciates this product better than most."

## $`2`

## [1] "Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as \"Jumbo\"."

Now, in the following exercise, we will transform the data to remove stopwords, extra whitespace, and punctuation from these two paragraphs. We will then perform stemming (both looking and looked will be reduced to look). Also, for consistency, we will convert all the text to lowercase.
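As a rough illustration of what these transformations do, the following sketch applies lowercasing, punctuation removal, stopword removal, and whitespace cleanup to one sentence using only base R. It is not how the tm package implements them, it omits stemming, and the stop_words vector is a tiny hypothetical list chosen for this sentence:

```r
review <- "The product looks more like a stew than a processed meat,   and it smells better."

# A tiny, hypothetical stopword list (real stopword lists are much longer)
stop_words <- c("the", "a", "than", "and", "it", "more")

text <- tolower(review)                      # convert to lowercase
text <- gsub("[[:punct:]]", "", text)        # remove punctuation

# Build a pattern that matches any stopword as a whole word
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
text <- gsub(pattern, "", text, perl = TRUE) # remove stopwords

text <- trimws(gsub("\\s+", " ", text))      # collapse repeated spaces and trim the ends
text
```

The word-boundary markers in the pattern matter: without them, removing the stopword a would also strip the letter from words such as meat.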

Exercise 5: Performing Transformation on the Data to Make it Available for the Analysis

In this exercise, we will perform the transformation on the data to make it available for further analysis.

Perform the following steps to complete the exercise:

  1. First, use the following commands to convert all the characters in the data to lowercase:

    top_2_reviews <- review_corpus[1:2]

    top_2_reviews <- tm_map(top_2_reviews,content_transformer(tolower))

    lapply(top_2_reviews[1], as.character)

    The output is as follows:

    ## [1] "i have bought several of the vitality canned dog food products and have found them all to be of good quality. the product looks more like a stew than a processed meat, and it smells better. my labrador is finicky, and she appreciates this product better than most."

  2. Next, remove the stopwords, such as a, the, and an, from the data:

    top_2_reviews <- tm_map(top_2_reviews,removeWords, stopwords("english"))

    lapply(top_2_reviews[1], as.character)

    The output is as follows:

    ## [1] " bought several vitality canned dog food products found good quality. product looks like stew processed meat smells better. labrador finicky appreciates product better ."

  3. Remove extra whitespaces between words using the following command:

    top_2_reviews <- tm_map(top_2_reviews,stripWhitespace)

    lapply(top_2_reviews[1], as.character)

    The output is as follows:

    ## [1] " bought several vitality canned dog food products found good quality. product looks like stew processed meat smells better. labrador finicky appreciates product better ."

  4. Perform the stemming process, which will only keep the root of the word. For example, looking and looked will become look:

    top_2_reviews <- tm_map(top_2_reviews,stemDocument)

    lapply(top_2_reviews[1], as.character)

    The output is as follows:

    ## [1] " bought sever vital can dog food product found good quality. product look like stew process meat smell better. labrador finicki appreci product better ."

    Now that we have the text processed and cleaned up, we can create a document term matrix that stores the frequency of occurrence of each distinct word in the two reviews. Each row of the matrix represents one review, and the columns are the distinct words. Many of the values are zero because not every word is present in each review. In this example, the sparsity is 49%, which means that only 51% of the matrix cells contain non-zero values.

  5. Create a Document Term Matrix (DTM), in which each row represents one review (also referred to as a Doc) and each column a unique word from the corpus:

    dtm <- DocumentTermMatrix(top_2_reviews)

    inspect(dtm)

    The output is as follows:

    ## <<DocumentTermMatrix (documents: 2, terms: 37)>>

    ## Non-/sparse entries: 38/36

    ## Sparsity : 49%

    ## Maximal term length: 10

    ## Weighting : term frequency (tf)

    ##

    ## Terms

    ## Docs "jumbo". actual appreci arriv better better. bought can dog error

    ## 1 0 0 1 0 1 1 1 1 1 0

    ## 2 1 1 0 1 0 0 0 0 0 1

    ## Terms

    ## Docs finicki food found good intend jumbo label labrador like look meat

    ## 1 1 1 1 1 0 0 0 1 1 1 1

    ## 2 0 0 0 0 1 1 1 0 0 0 0


    We can use this document term matrix in plenty of ways. For the sake of brevity in this introduction to the matrix, we will skip the details of the Document Term Matrix here.

    The DTM shown in the previous output is stored in a list format. To convert it to a matrix, we can use the as.matrix() function. The matrix contains two documents (reviews) and 37 unique words. The count of a particular word in a document is retrieved by specifying the row and column index or name in the matrix.

  6. Now, store the results in a matrix using the following command:

    dtm_matrix <- as.matrix(dtm)

  7. To find the dimension of the matrix, that is, 2 documents and 37 words, use the following command:

    dim(dtm_matrix)

    The output is as follows:

    ## [1] 2 37

  8. Now, print a subset of the matrix:

    dtm_matrix[1:2,1:7]

    The output is as follows:

    ## Terms

    ## Docs "jumbo". actual appreci arriv better better. bought

    ## 1 0 0 1 0 1 1 1

    ## 2 1 1 0 1 0 0 0

  9. Finally, count the occurrences of the word product in document 1 using the following command:

    dtm_matrix[1,"product"]

    The output is as follows:

    ## [1] 3
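To see what a document term matrix contains without relying on the tm package, here is a minimal base R sketch that counts words for two short, made-up documents. The docs vector and the name toy_dtm are hypothetical and are not the reviews from the exercise:

```r
# Two tiny, already-cleaned documents (hypothetical text)
docs <- c("1" = "product look like stew product better",
          "2" = "product arriv label jumbo peanut")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector of words per document
terms  <- sort(unique(unlist(tokens)))        # the vocabulary becomes the columns

# Count every vocabulary term in each document
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = terms))),
                 integer(length(terms)))

toy_dtm <- t(counts)          # rows = documents, columns = terms
colnames(toy_dtm) <- terms

toy_dtm
dim(toy_dtm)                  # documents x distinct terms
toy_dtm["1", "product"]       # count of "product" in document 1
mean(toy_dtm == 0)            # share of zero cells, that is, the sparsity
```

Each cell is simply a word count, so the name-based lookup in the last lines works the same way as with the dtm_matrix in the exercise.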

List

While the vector and matrix are both useful structures for various computations in a program, they might not be sufficient for storing a real-world dataset, which most often contains data of mixed types. For example, a customer table in a CRM application has the customer name and age together in two columns. The list offers a structure that allows data of different types to be stored together.

In the following exercise, along with generating 16 random numbers, we will use the sample() function to select 16 letters from the English alphabet. The list() function stores both the integers and characters together.
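The customer example can be sketched as follows. The names and ages are made up for illustration; the point is that each element of the list keeps its own type, whereas combining the same values in a single vector would coerce them to character:

```r
# Hypothetical customer data: a character vector and an integer vector
customer <- list(name = c("Asha", "Ben", "Chen"),
                 age  = c(34L, 27L, 45L))

sapply(customer, class)    # each list element keeps its own type
customer$age + 1L          # the ages are still integers, so arithmetic works

# Combining everything in one vector would convert the ages to character
class(c(customer$name, customer$age))
```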

Exercise 6: Using the List Method for Storing Integers and Characters Together

In this exercise, we will use the list() function to store randomly generated numbers and characters. The random numbers will be generated using the rbinom() function, and the random characters will be selected from the English letters A-Z.

Perform the following steps to complete the exercise:

  1. First, generate 16 random numbers drawn from a binomial distribution with the size parameter equal to 100 and the probability of success equal to 0.4:

    r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)

  2. Now, select 16 letters from the built-in LETTERS vector without repetition:

    #sample() will generate 16 random letters from the English alphabet without repetition

    r_characters <- sample(LETTERS, size = 16, replace = FALSE)

  3. Put r_numbers and r_characters into a single list. The list() function will create the data structure list with r_numbers and r_characters:

    list(r_numbers, r_characters)

    The output is as follows:

    ## [[1]]

    ## [1] 48 53 38 31 44 43 36 47 43 38 43 41 45 40 44 50

    ##

    ## [[2]]

    ## [1] "V" "C" "N" "Z" "E" "L" "A" "Y" "U" "F" "H" "D" "O" "K" "T" "X"

    The output shows a list with the integer and character vectors stored together.

  4. Now, store the list in a variable so that we can retrieve the integer and character vectors from it:

    r_list <- list(r_numbers, r_characters)

  5. Next, retrieve values in the character vector using the following command:

    r_list[[2]]

    The output is as follows:

    ## [1] "V" "C" "N" "Z" "E" "L" "A" "Y" "U" "F" "H" "D" "O" "K" "T" "X"

  6. Finally, retrieve the first value in the character vector:

    (r_list[[2]])[1]

    The output is as follows:

    ## [1] "V"

    Though this solves the requirement of storing heterogeneous data types together, it still doesn't enforce any integrity checks on the relationship between the values in the two vectors. Suppose we would like to map every letter to one integer: in the previous output, V represents 48, C represents 53, and so on.

    A list is not robust enough to handle such a one-to-one mapping. Consider the following code: if we generate 18 random characters instead of 16, the list still allows us to store them. The last two characters now have no associated integer.

  7. Now, generate 16 random numbers drawn from a binomial distribution with parameter size equal to 100 and probability of success equal to 0.4:

    r_numbers <- rbinom(n = 16, size = 100, prob = 0.4)

  8. Select any 18 letters from the built-in LETTERS vector without repetition:

    r_characters <- sample(LETTERS, 18, FALSE)

  9. Place r_numbers and r_characters into a single list:

    list(r_numbers, r_characters)

    The output is as follows:

    ## [[1]]

    ## [1] 48 53 38 31 44 43 36 47 43 38 43 41 45 40 44 50

    ##

    ## [[2]]

    ## [1] "V" "C" "N" "Z" "E" "L" "A" "Y" "U" "F" "H" "D" "O" "K" "T" "X" "P" "Q"
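Because list() accepts elements of different lengths without complaint, any integrity check has to be done by hand. The following sketch shows one way to do that with lengths(). The seed, the element names numbers and characters, and the same_length variable are assumptions introduced for illustration:

```r
set.seed(1)  # hypothetical seed, used only so the sketch is reproducible
r_numbers    <- rbinom(n = 16, size = 100, prob = 0.4)
r_characters <- sample(LETTERS, size = 18, replace = FALSE)

# Naming the elements makes them easier to retrieve later
r_list <- list(numbers = r_numbers, characters = r_characters)

lengths(r_list)   # length of each element: 16 and 18

# A manual check, since list() itself does not compare the lengths
same_length <- length(unique(lengths(r_list))) == 1
same_length

r_list[["characters"]][1]    # [[ ]] extracts the vector itself
class(r_list["characters"])  # [ ] returns a smaller list instead
```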

Activity 2: Create a List of Two Matrices and Access the Values

In this activity, you will create two matrices and retrieve a few values using the index of the matrix. You will also perform operations such as multiplication and subtraction.

Perform the following steps to complete the activity:

  1. Create two matrices of size 10 x 4 and 4 x 5, filled with numbers randomly generated from a binomial distribution (use the rbinom() function). Call the matrices mat_A and mat_B, respectively.
  2. Now, store the two matrices in a list.
  3. Using the list, access the value at row 4 and column 2 of mat_A and store it in variable A, and access the value at row 2 and column 1 of mat_B and store it in variable B.
  4. Multiply the values A and B, and subtract the result from the value at row 2 and column 1 of mat_A.

    Note

    The solution for this activity can be found at page 440.