Types of Data
To deal with data effectively, we need to understand the various forms in which it exists. There are two main ways to categorize data, by structure and by content, as explained in the upcoming sections.
Categorizing Data Based on Structure
Data can be divided on the basis of structure into three categories: structured, semi-structured, and unstructured data, as shown in the following diagram:
These three categories are as follows:
- Structured data: This is the most organized form of data. It is represented in tabular formats such as Excel files and comma-separated values (CSV) files. The following image shows what structured data usually looks like:
The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes.
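As a minimal sketch of the row-and-column idea, the following Python snippet reads a small CSV string with the standard `csv` module. The column names (`name`, `age`, `city`) and the values are hypothetical and are not taken from the table above; they only illustrate that each row maps to one person and each column to one attribute.

```python
import csv
import io

# Hypothetical CSV content; the names and values are invented for illustration.
CSV_TEXT = """name,age,city
Asha,34,Pune
Ben,28,Leeds
Chen,41,Taipei
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one dict per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


rows = load_rows(CSV_TEXT)

# Each row is one person; each key is one attribute (column).
for row in rows:
    print(row["name"], row["age"], row["city"])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as `age` would still need to be converted before any arithmetic.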
- Semi-structured data: This type of data is not presented in a tabular structure, but it can be transformed into a table. Here, information is usually stored between tags following a definite pattern. XML and HTML files are examples of semi-structured data. The following screenshot shows how semi-structured data can appear:
The format shown in the preceding screenshot is called markup language format. Here, the data is stored between tags, hierarchically. It is a widely used format, and many parsers are available that can convert this data into structured data.
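To illustrate how tagged data can be converted into a table-like structure, here is a short sketch using Python's standard `xml.etree.ElementTree` parser. The XML document, its tag names (`employees`, `employee`, `name`, `role`), and the helper `xml_to_rows` are assumptions made for this example, not the content of the screenshot.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; tag names and values are invented for illustration.
XML_TEXT = """
<employees>
  <employee>
    <name>Asha</name>
    <role>Engineer</role>
  </employee>
  <employee>
    <name>Ben</name>
    <role>Analyst</role>
  </employee>
</employees>
"""


def xml_to_rows(xml_text):
    """Flatten each <employee> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in employee}
        for employee in root.findall("employee")
    ]


rows = xml_to_rows(XML_TEXT)

# Each dict now corresponds to one row of a table with columns "name" and "role".
for row in rows:
    print(row)
```

This flattening works here because every record follows the same shallow pattern; more deeply nested documents would need a more deliberate mapping from the hierarchy to columns.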
- Unstructured data: This type of data is the most difficult to deal with. Machine learning algorithms find it difficult to process unstructured data without some loss of information. Text corpora and images are examples of unstructured data. The following image shows what unstructured data looks like:
This is called unstructured data because a program cannot extract employee details from the preceding text snippet by simple parsing. The algorithm has to understand the semantics of the language before it can extract information from such text.
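The limits of simple parsing can be sketched with a regular expression. In the example below, the two sentences and the pattern are hypothetical; the pattern captures employee details only when the sentence follows one fixed wording, and returns nothing as soon as the same kind of information is phrased differently.

```python
import re

# Hypothetical sentences; both describe an employee, but with different wording.
SENTENCES = [
    "John works as a data analyst in London.",
    "Based in Paris, Maria has been our lead designer since 2019.",
]

# A naive pattern that only matches "<name> works as a/an <role> in <city>."
PATTERN = re.compile(
    r"^(?P<name>\w+) works as an? (?P<role>[\w ]+) in (?P<city>\w+)\.$"
)


def extract(sentence):
    """Return the captured fields as a dict, or None if the wording differs."""
    match = PATTERN.match(sentence)
    return match.groupdict() if match else None


results = [extract(sentence) for sentence in SENTENCES]

for sentence, result in zip(SENTENCES, results):
    print(sentence, "->", result)
```

The first sentence yields a name, role, and city, while the second yields `None` even though a human reader finds the same details in it. Covering every possible phrasing with patterns is not practical, which is why unstructured text calls for methods that model the language itself.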
Categorizing Data Based on Content
Data can be divided into four categories based on content, as shown in the following diagram:
Let's look at each category here:
- Text data: This refers to text corpora consisting of written sentences. This type of data can only be read. An example would be the text corpus of a book.
- Image data: This refers to pictures that are used to communicate messages. This type of data can only be seen.
- Audio data: This refers to voice recordings, music, and so on. This type of data can only be heard.
- Video data: A continuous series of images coupled with audio forms a video. This type of data can be seen as well as heard.
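As a rough sketch, the four content categories above line up with the top-level media types that Python's standard `mimetypes` module guesses from a file extension. The helper name `content_category` and the file names are hypothetical, and the guess relies on the extension alone rather than on the actual file contents.

```python
import mimetypes

CATEGORIES = {"text", "image", "audio", "video"}


def content_category(filename):
    """Return 'text', 'image', 'audio', 'video', or None, based on the extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    # A MIME type looks like "image/png"; the part before "/" is the broad category.
    top_level = mime_type.split("/")[0]
    return top_level if top_level in CATEGORIES else None


# Hypothetical file names, one per category.
for name in ["book.txt", "photo.png", "speech.mp3", "lecture.mp4"]:
    print(name, "->", content_category(name))
```

Files whose media type falls outside these four groups, or whose extension is unknown, return `None` in this sketch.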
With that, we have learned about the different types of data and their categorization on the basis of structure and content. When dealing with unstructured data, it is necessary to clean it first. In the next section, we will look into some of the preprocessing steps for cleaning data.