The Natural Language Processing Workshop

Types of Data

To deal with data effectively, we need to understand the various forms in which it exists. There are two main ways to categorize data: by structure and by content. Both are explained in the upcoming sections.

Categorizing Data Based on Structure

Data can be divided on the basis of structure into three categories, namely, structured, semi-structured, and unstructured data, as shown in the following diagram:

Figure 2.1: Categorization based on structure

These three categories are as follows:

  • Structured data: This is the most organized form of data. It is represented in tabular formats such as Excel files and comma-separated values (CSV) files. The following image shows what structured data usually looks like:

Figure 2.2: Structured data

The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes.
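As a minimal sketch of why this form is the easiest to work with, the following Python snippet reads a small CSV table using only the standard library. The column names and the five records are hypothetical stand-ins for the table in the figure, not the book's actual data.

```python
import csv
import io

# Hypothetical records standing in for the table in the figure;
# the names, ages, and cities are made up for illustration.
CSV_TEXT = """name,age,city
Asha,31,Pune
Ben,45,Leeds
Chen,28,Taipei
Dana,39,Oslo
Eli,52,Lima
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = load_rows(CSV_TEXT)

# Each row is a person and each key is one of their attributes,
# so a column can be pulled out with a simple comprehension.
ages = [int(row["age"]) for row in rows]

print(len(rows), "rows with columns", list(rows[0].keys()))
print("average age:", sum(ages) / len(ages))
```

Because every row has the same columns, no interpretation of the content is needed: a field is addressed by its row and its column name.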

  • Semi-structured data: This type of data is not presented in a tabular structure, but it can be transformed into a table. Here, information is usually stored between tags that follow a definite pattern. XML and HTML files are examples of semi-structured data. The following screenshot shows how semi-structured data can appear:

Figure 2.3: Semi-structured data

The format shown in the preceding screenshot is a markup language. Here, the data is stored hierarchically between tags. Markup formats such as XML are widely supported, and many parsers are available that can convert this data into structured data.
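The conversion from tagged data to a table can be sketched with Python's built-in xml.etree.ElementTree parser. The XML snippet, its tag names (employees, employee, name, role), and the helper xml_to_rows are assumptions made for illustration; they are not taken from the screenshot.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tag and attribute names are assumptions.
XML_TEXT = """
<employees>
  <employee id="1">
    <name>Asha</name>
    <role>Engineer</role>
  </employee>
  <employee id="2">
    <name>Ben</name>
    <role>Analyst</role>
  </employee>
</employees>
"""

def xml_to_rows(text):
    """Flatten each <employee> element into one dict (one table row)."""
    root = ET.fromstring(text.strip())
    rows = []
    for emp in root.findall("employee"):
        rows.append({
            "id": emp.get("id"),             # attribute on the tag
            "name": emp.findtext("name"),    # text inside <name>
            "role": emp.findtext("role"),    # text inside <role>
        })
    return rows

for row in xml_to_rows(XML_TEXT):
    print(row)
```

The tags follow a definite pattern, so a fixed set of lookups is enough to recover the rows and columns.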

  • Unstructured data: This type of data is the most difficult to deal with. It is hard to convert into a form that machine learning algorithms can use without losing some information. Text corpora and images are examples of unstructured data. The following image shows what unstructured data looks like:

Figure 2.4: Unstructured data

This is called unstructured data because a program cannot extract employee details from the preceding text snippet by simple parsing. To extract information from text like this, the algorithm has to understand the semantics of the language.
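To make the limits of simple parsing concrete, here is a small sketch that applies one regular expression to three sentences. The sentences and the pattern are hypothetical and are not the text from the figure. The pattern captures only the phrasing it was written for, and the other two sentences, which carry the same kind of information, are missed.

```python
import re

# Hypothetical sentences that state similar facts in different ways.
SENTENCES = [
    "John joined the company as an engineer in 2015.",
    "Since 2018, Mary has been working with us as an analyst.",
    "Our accountant, Ravi, came on board three years ago.",
]

# A naive pattern that only covers the phrasing of the first sentence.
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+) joined the company as an? "
    r"(?P<role>\w+) in (?P<year>\d{4})"
)

def extract(sentence):
    """Return the captured fields as a dict, or None if nothing matches."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None

results = [extract(s) for s in SENTENCES]
for sentence, result in zip(SENTENCES, results):
    print(result, "<-", sentence)
```

Writing a new pattern for every possible phrasing does not scale, which is why extracting information from free text calls for techniques that model the language itself.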

Categorizing Data Based on Content

Data can be divided into four categories based on content, as shown in the following diagram:

Figure 2.5: Categorizing data based on content

Let's look at each category here:

  • Text data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
  • Image data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
  • Audio data: This refers to voice recordings, music, and so on. This type of data can only be heard.
  • Video data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.

With that, we have learned about the different types of data and their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some of the preprocessing steps for cleaning data.