Types of Data
To deal with data effectively, we need to understand the various forms in which it exists. There are two main ways to categorize data: by structure and by content. Both are explained in the upcoming sections.
Categorizing Data Based on Structure
On the basis of structure, data can be divided into three categories, namely structured, semi-structured, and unstructured, as shown in the following diagram:
data:image/s3,"s3://crabby-images/2b642/2b64219645073a640966feaf42c26f4b31d79c21" alt=""
Figure 2.1: Categorization of data based on structure
These three categories are explained in detail here:
- Structured Data: This is the most organized form of data. It is laid out in rows and columns, in tabular formats such as Excel files and comma-separated values (CSV) files. The following figure shows what structured data usually looks like:
data:image/s3,"s3://crabby-images/741ae/741ae5ba9c0f1151c85253ab5c837b0e3a179530" alt=""
Figure 2.2: Structured data
- Semi-Structured Data: This type of data is not presented in a tabular structure, but it can be represented in a tabular format after transformation. Here, information is usually stored between tags that follow a definite pattern. XML and HTML files are examples of semi-structured data. The following figure shows how semi-structured data can appear:
data:image/s3,"s3://crabby-images/d4b2a/d4b2aba77f1f2f2e002d69907c4b5d9ad2421f5a" alt=""
Figure 2.3: Semi-structured data
- Unstructured Data: This type of data is the most difficult to deal with because it has no predefined fields or layout. Machine learning algorithms find it difficult to interpret unstructured data without some loss of information. Text corpora and images are examples of unstructured data. The following figure shows what unstructured data looks like:
data:image/s3,"s3://crabby-images/187bc/187bc3301e98b52f930cc90a20b1fc4a9e0f2c2c" alt=""
Figure 2.4: Unstructured data
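To make the three categories concrete, here is a minimal sketch using only the Python standard library. The sample records (names and ages), the tag names, and the variable names are invented for illustration and do not come from the figures above. It shows that structured CSV data can be read directly as rows, that semi-structured XML needs a small transformation to reach the same tabular shape, and that unstructured text offers no fields to extract without further processing.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns are explicit, so a CSV reader recovers them directly.
csv_text = "name,age\nAlice,30\nBob,25\n"
structured_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# Semi-structured: values sit between tags that follow a definite pattern,
# so a short transformation flattens them into the same tabular shape.
xml_text = """
<people>
  <person><name>Alice</name><age>30</age></person>
  <person><name>Bob</name><age>25</age></person>
</people>
""".strip()
root = ET.fromstring(xml_text)
semi_structured_rows = [
    {field.tag: field.text for field in person}
    for person in root.findall("person")
]

# Unstructured: there are no columns or tags to rely on. Splitting on
# whitespace gives tokens, but nothing says which token is a name or an age.
raw_text = "Alice is 30 years old. Bob, her colleague, is 25."
tokens = raw_text.split()

print(structured_rows)
print(semi_structured_rows)
print(tokens)
```

After the transformation, the XML records hold the same rows as the CSV records, whereas the raw sentence would need pre-processing before a similar table could be built from it.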
Categorization of Data Based on Content
On the basis of content, data can be divided into four categories, as shown in the following figure:
data:image/s3,"s3://crabby-images/baa72/baa728da17888584e51bda310a9c6f079ffcd958" alt=""
Figure 2.5: Categorization of data based on content
Let's look at each category here:
- Text Data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
- Image Data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
- Audio Data: This refers to recordings of someone's voice, music, and so on. This type of data can only be heard.
- Video Data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.
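As a rough illustration of these four content categories, the sketch below guesses a file's category from its name using Python's standard `mimetypes` module. The helper name `content_category` and the sample filenames are hypothetical, and the approach assumes the file extension is trustworthy; it does not inspect the file's actual contents.

```python
import mimetypes

CONTENT_CATEGORIES = {"text", "image", "audio", "video"}

def content_category(filename):
    """Guess the content category of a file from its extension (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        return "other"
    # A MIME type looks like "image/jpeg"; the part before "/" is the broad category.
    top_level = mime_type.split("/")[0]
    return top_level if top_level in CONTENT_CATEGORIES else "other"

for name in ["book.txt", "photo.jpg", "speech.mp3", "clip.mp4"]:
    print(name, "->", content_category(name))
```

Anything that does not map to one of the four categories, such as a compressed archive, falls into `"other"` in this sketch.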
We have learned about the different types of data as well as their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some pre-processing steps for cleaning data.