Digital data
Digital data can be broken down into structured digital data and unstructured digital data. Structured data is best known as relational data, but is really any text-based data stored in such a way that enables it to be accessed and queried to an agreed standard.
Relational data is stored in a well-defined mathematical structure with official rules and standards for accessing and manipulating it. Other types of databases on the market store text data that conforms to other standards (for example, ADABAS and IMS/DB).
Any data that is not stored in a well-defined structured format can by default be seen as unstructured. The traditional view is that unstructured data is just any binary data.
There is a fuzzy area between structured and unstructured data. It is more accurate to say that there are degrees of structure, with a lot of overlap.
It's possible to store unstructured data in a column in a relational table, which is structured. The physical database files containing structured data are binary and stored in a proprietary format without well-defined rules, so they are considered unstructured. A proprietary format is one where the vendor (the maker of the format) controls and decides its behavior. There is no agreed standard or peer review for its format. There are gray areas here too, as can be shown with the Adobe PDF format. Though the format was controlled by Adobe and considered proprietary, in 2008 it was made open and released to the general community (3).
Data stored in NoSQL or XML can be considered to be stored in a semi-structured format. For XML there are rules for accessing and querying it, but the data itself and its structure can vary. It can conform to agreed standards or be stored in a raw format.
Simply saying that text data is structured and binary data is unstructured is not sufficient, as a text file (created in Notepad or vi) can contain a random set of characters that has no definition, follows no rules, and conforms to no standard.
Unstructured data can be broken down into different groups. A well-known group is multimedia or rich media. Here there are types such as digital image, audio, video, and document (though there are more in this list). Some of these types are well defined and can contain embedded XML that conforms to an agreed set of standards (this is covered further in Chapter 2, Understanding Digital Objects). The format of the binary data can also follow agreed rules. The digital image format JPEG is an open standard. For video, MPEG is also an open standard. Multimedia is therefore a category of unstructured data that is well defined. The category is fluid, changes as technology changes, and is unlikely to conform to the mathematical and well-proven relational structure.
So we can now define all data as follows:
- Structured: The structured data is any data stored in a well-defined, non-proprietary system. This data is primarily text based. It typically conforms to ACID(4).
The structured data is anything that has an enforced composition to the atomic data types(5).
- Semi-structured: The semi-structured data is any data stored in a system that conforms to some rules and can be proprietary. This data is primarily text based. It does not have to conform to ACID.
- Well-defined unstructured: The well-defined unstructured data is binary data that is well defined and conforms mostly to an agreed standard.
- Unstructured: The unstructured data is binary data that is proprietary.
The challenge is that, even based on these definitions, some data falls across more than one definition. This is typical of what one encounters when dealing with unstructured data. There is no concise and easy-to-use definition. The temptation is to say that unstructured data is just any data that is not structured. But with example data sets such as NoSQL, XML, and a multitude of other storage systems, there is a feeling that they should belong to structured. In that case, is HTML structured or unstructured? HTML looks like a close relative of XML (both derive from SGML), but errors are tolerated in HTML and its tag names are not case sensitive, whereas XML is strict on both counts. Almost any raw text file can be labeled as HTML and still be accepted by a browser, but you can't do the same with XML. An XML file with one syntax error in it is not XML because it doesn't conform to the XML rule set.
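The difference in strictness between XML and HTML can be seen in a minimal sketch using Python's standard library. The helper names (is_well_formed_xml, TagCollector, html_tags) and the sample string are assumptions made for illustration only; they are not part of any product discussed in this book.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(text):
    """Return True only if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase, whatever case was typed.
        self.tags.append(tag)


def html_tags(text):
    """Feed text to the lenient parser and return the start tags it found."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags


# Hypothetical sample: mixed-case tags, none of them closed.
broken = "<P>Unclosed paragraph<BR><b>bold"

print(is_well_formed_xml(broken))     # the XML parser rejects it
print(html_tags(broken))              # the HTML parser still finds the tags
print(is_well_formed_xml("<a></A>"))  # XML tag names are case sensitive
```

The XML parser refuses the whole input on the first violation, whereas the HTML parser carries on and reports what it can, which is why the same text can count as HTML but not as XML.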
A well-known joke is, what is the name of a boomerang that doesn't return? A stick! Except that when one looks at the true history of boomerangs, most were designed not to return. Yet we think of a boomerang as any object that returns when thrown. An object of almost any shape can be used as a boomerang. This has been shown by boomerang experts, who use letters of the alphabet as the shape of boomerangs just to show how versatile the ability of a thrown object to return can be. The point to be made is that our traditional, innate sense of what something should be and where it belongs is not always right.
One can also say that unstructured data is really structured data that hasn't been defined correctly yet. Because of the exceptions to the rule it might not even be valid to break data up into structured and unstructured. Yet by breaking it up and identifying each set, one can associate rules with it, understand its limitations, and formulate new concepts around it. So it is useful to be able to do this.
When we look at the situation of a digital image being stored in a relational database like Oracle, we actually see two different things. We see the digital image, which is binary data conforming to a well-defined standard, but it's being stored in a structured system. We can treat what the data represents and where it is stored as two separate matters.
So let's look at this further. If we now separate the storage mechanism from the data itself, we can have unstructured data stored in a relational database. The unstructured data is a separate entity and even though it's handled using ACID that is not important as the data itself is unstructured. Of course, that raises some new issues. What about some of the text elements stored in a structured database, are they structured or unstructured? What if we store a date value that behaves as structured, is fixed in its definition and conforms to a mathematical standard? If the date is stored in a varchar field (which means variable character length) then it's not structured. This is because any value can be put into it. We could enter in 12th Jan 2005, 30-Feb 2012, or 01.02.03. Any value without validation can be stored in it. If we store an address in a varchar field, is that structured or unstructured? If we store the values in an abstract data type, it can be classified as structured data as methods can be applied to it and the structure is well defined and controlled. If the address is stored in only a varchar field, then any value can be added in free-form and it is unstructured. A similar situation holds for names and a raft of other values (this is covered further in Chapter 3, The Multimedia Warehouse). So it appears that a lot of the individual data items in a structured database might actually be unstructured. This issue is well known in data warehouses, where a lot of time is spent cleaning the data into a structured format.
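The varchar date problem can be illustrated with a short sketch. It is written in Python rather than SQL, and the function names (is_valid_iso_date, possible_readings) and the choice of YYYY-MM-DD as the validated form are assumptions for illustration. A real database would enforce this through a date column type instead.

```python
from datetime import date, datetime

# The three free-form values from the text, as they might sit in a varchar field.
free_form_values = ["12th Jan 2005", "30-Feb 2012", "01.02.03"]


def is_valid_iso_date(text):
    """Return True only if text is a real calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def possible_readings(text):
    """List every calendar date a dotted two-digit value could stand for."""
    readings = set()
    for fmt in ("%d.%m.%y", "%m.%d.%y", "%y.%m.%d"):
        try:
            readings.add(datetime.strptime(text, fmt).date())
        except ValueError:
            pass  # under this reading the value is not a real date
    return sorted(readings)


# A free-form field accepts all three values; a validated date accepts none.
for value in free_form_values:
    print(value, is_valid_iso_date(value))

print(is_valid_iso_date("2012-02-30"))  # well formatted, but no such day
print(possible_readings("01.02.03"))    # three different dates, all plausible
```

The last call shows why cleaning such data is costly: 01.02.03 can be read as three different dates, and nothing in the stored value says which one was meant.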
So again we come to a situation where trying to clearly define structured and unstructured data always brings up inconsistencies and exceptions to the rule. At this point we realize that this isn't an issue at all and come to a better understanding of how one has to rethink the whole strategy of working with unstructured data. A document can contain only photos. Is it a document or a photo album? If a video has only an audio track but no picture, is it still a video? Is an animated GIF image a video? Even when looking at two images and comparing them, how can we say they are the same? If one image differs from the other by one byte, is it still the same? If we compare two seemingly identical videos, but one is missing only the final frame, which has no audio or picture, are they the same or different? The world of unstructured data introduces us to a world where our traditional rules for dealing with commonly held concepts break down and don't make sense any more. The strict definitions we are used to and comfortable with for defining relational data fall apart when dealing with unstructured data.
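The one-byte question can be made concrete with a small sketch. The two measures used here (a SHA-256 comparison and a simple count of matching byte positions) are assumptions chosen for illustration, and the byte strings are stand-ins rather than real image data. Neither measure says anything about whether two images look the same to a person.

```python
import hashlib


def byte_identical(a, b):
    """Strict test: the two byte strings have the same SHA-256 digest."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def fraction_of_matching_bytes(a, b):
    """Loose test: share of byte positions that agree, relative to the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


# 1024 bytes standing in for the contents of an image file.
original = bytes(range(256)) * 4

# A copy in which exactly one byte has been changed.
changed = bytearray(original)
changed[100] ^= 0xFF
altered = bytes(changed)

# A copy that is missing only its final byte.
truncated = original[:-1]

print(byte_identical(original, altered))                # strict answer
print(fraction_of_matching_bytes(original, altered))    # loose answer
print(fraction_of_matching_bytes(original, truncated))
```

The strict test says the copies are different, while the loose test says they are almost entirely the same. Which answer is correct depends on the question being asked, which is exactly the difficulty described above.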
For a database management system to begin to correctly handle unstructured data, it must initially have support for objects. An object can be seen as a grouping of fields with associated rules. The grouping of fields can be referred to as an Abstract Data Type (or ADT). The associated rules are called methods. The data as stored can be linked directly to other data items, which is referred to as a reference. The data items themselves can repeat and can be stored hierarchically or in a nested structure. Object-oriented systems are known to conflict with relational systems because they break a number of the rules involved in data normalization(6). In the late 1990s this caused the market to divide between using relational or object databases, as each offered strengths and weaknesses. Oracle managed to combine the two in its database, allowing data architects to pick the best method. With the embedding of Online Analytical Processing (OLAP) and XML into the database in later releases, the Oracle database grew from being relational to one supporting most structures.
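The object concepts just described (a grouping of fields, methods, references, and repeating or nested items) can be sketched as follows. The sketch uses Python classes rather than database object types, and every class and field name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    """A grouping of fields (an ADT) with an associated rule (a method)."""
    street: str
    city: str
    postcode: str

    def one_line(self):
        return f"{self.street}, {self.city} {self.postcode}"


@dataclass
class Photographer:
    name: str
    address: Address  # nested structure: one object stored inside another


@dataclass
class DigitalImage:
    filename: str
    content: bytes
    taken_by: Optional[Photographer] = None            # reference to another item
    keywords: List[str] = field(default_factory=list)  # repeating data item

    def size_in_bytes(self):
        return len(self.content)


owner = Photographer("Jane", Address("1 High Street", "Perth", "6000"))
photo = DigitalImage("plant.jpg", b"\xff\xd8\xff", taken_by=owner,
                     keywords=["plant", "garden"])

print(photo.taken_by.address.one_line())
print(photo.size_in_bytes(), photo.keywords)
```

The repeating keywords list and the nested address are the kind of structure that conflicts with strict normalization, where each would normally be moved into its own table.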
With the recent rise in popularity of NoSQL, again the debate has been raised about which is better to use, a relational system or a NoSQL one? The experienced data architects, who remember the relational/object debate, will realize that it's not really one or the other, it's using the one that can satisfy a number of conditions that are business dependent, including the ability to do the following:
- Scale (support large numbers of users and/or large volumes of data)
- Be open (not proprietary), or accept being locked into a vendor
- Provide data integrity and prevent data corruption or loss
Most databases can enable unstructured data to be stored in them, but do not support the management, control, and manipulation of that data. Most pay lip service to unstructured data and encourage it to be stored externally. Even Oracle, which has built-in support for unstructured data and provides a powerful database environment for handling it, still has serious limitations (this is covered further in Chapter 9, Understanding the Limitations of Oracle Products). Even though it is a market leader in unstructured data management, there are still a large number of major improvements the database needs.
Metadata
Throughout this book, most chapters will cover the usage of metadata. With unstructured data management, metadata is crucial. It is the data that describes the unstructured data and gives meaning to it. Each type of unstructured data object has its own metadata. It might be as simple as a filename, or as complex as a complete set of relational records. Without metadata the unstructured data loses meaning.
The metadata is primarily used for searching. Without it, it's not possible to construct a multimedia warehouse. It is also used for assigning a description. A person might see a photo of a plant. The metadata might have a description of what that photo is, giving meaning and context to the photo.
The metadata is also used to relate unstructured data objects, which in turn adds intelligence and structure to them. It is also used to store information about the object, such as its name, when it was created, who created it, and who modified it.
The metadata can be used to represent any knowledge about the unstructured object. It's typically stored in a structured format. Currently the trend is to use XML, but this has not always been the case. Additionally, metadata can be matched to data in relational databases or NoSQL databases.
As will be shown in the following chapters, the metadata usage can be rich, varied, and complex. At the moment, because of limitations in computer technology, metadata is crucial for most systems that want to extensively use unstructured data. A computer asked to "find me the video with the picture of the person John in it" would have great difficulty answering. Likewise, a request to "find me all audio files with a lyre bird singing after sunset" would be equally hard to answer. If a human operator attaches metadata containing this information, then a search of the multimedia using that metadata can answer these questions.
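The two questions above become simple lookups once an operator has attached the relevant metadata. The following is a minimal sketch only: the MediaObject class, the find function, the metadata keys, and the file names are all hypothetical, and a real multimedia warehouse would hold this information in database tables or XML rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class MediaObject:
    filename: str
    media_type: str
    metadata: dict = field(default_factory=dict)  # key -> list of attached values


def find(objects, media_type, **required):
    """Return objects of the given type whose metadata holds every required value."""
    results = []
    for obj in objects:
        if obj.media_type != media_type:
            continue
        if all(value in obj.metadata.get(key, [])
               for key, value in required.items()):
            results.append(obj)
    return results


# Metadata attached by a human operator (hypothetical examples).
library = [
    MediaObject("holiday.mp4", "video", {"people": ["John", "Mary"]}),
    MediaObject("city.mp4", "video", {"people": ["Mary"]}),
    MediaObject("forest.wav", "audio",
                {"species": ["lyre bird"], "time_of_day": ["after sunset"]}),
    MediaObject("dawn.wav", "audio",
                {"species": ["lyre bird"], "time_of_day": ["sunrise"]}),
]

videos_with_john = find(library, "video", people="John")
lyre_birds_at_night = find(library, "audio",
                           species="lyre bird", time_of_day="after sunset")

print([obj.filename for obj in videos_with_john])
print([obj.filename for obj in lyre_birds_at_night])
```

The search never inspects the video or audio content itself. It can only be as good as the metadata that was attached, which is why the attaching step matters so much.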
Unfortunately, manually attaching metadata is a time-consuming and costly exercise. A number of sites are investigating crowdsourcing to resolve it (see Chapter 3, The Multimedia Warehouse) or just bringing in a number of people to go through and identify the unstructured data.
As computer technology improves and new algorithms are discovered, the need to manually attach metadata will diminish. Computers are already good at facial recognition and can convert speech to text, but they have major limitations and still struggle in complex situations that humans handle easily. It is envisaged that in the next 20 years technology will improve to the point where algorithms that can identify objects and people in a video or photo, and understand sounds and complex speech in audio files, become commonplace. When this point is reached, the need for metadata will be reduced and constrained to a smaller, more tightly controlled subset. It will not disappear entirely, however. Some metadata will always exist and always be needed.
As the veil over unstructured data is slowly lifted, and as knowledge and understanding grow, so will the use of metadata. As covered in the previous point, the relative need for it will diminish over time, but the market for its use will grow. For example, if the current market represented 100 units, and if multimedia represented 30 percent of it, that would be 30 units. If its usage over time dropped to 5 percent of the current market, that would be 5 units. But if the market expanded to 10,000 units, 5 percent would be 500 units, which is five times the size of the entire current market and more than sixteen times the current 30 units. So even though the relative need will be reduced, the market as it grows will demand an increasing usage of metadata.
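The arithmetic in this example can be checked with a few lines of code. The figures are the illustrative ones from the paragraph above, not market data, and the function name share_in_units is an assumption.

```python
def share_in_units(market_units, percent_share):
    """Units accounted for by a given percentage share of a market."""
    return market_units * percent_share / 100


current_use = share_in_units(100, 30)          # today's market, 30 percent share
shrunken_use = share_in_units(100, 5)          # same market, share falls to 5 percent
future_use = share_in_units(10_000, 5)         # grown market, 5 percent share

print(current_use, shrunken_use, future_use)   # 30.0 5.0 500.0
print(future_use / 100)                        # 5.0 times the whole current market
print(round(future_use / current_use, 1))      # 16.7 times the current usage
```

A smaller share of a much larger market is still a larger absolute amount, which is the whole of the argument.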
The uses for metadata will start to strain relational databases, and object relational databases will be pushed to their limits to identify and handle its changing complexities. Time-based structures (effectively four-dimensional) will be needed. Oracle's flashback capabilities will need to be ramped up in data warehouses to handle large-scale, complex queries. The fuzzy data structures needed to handle the vagaries of some multimedia types are difficult to represent and query in most databases. Neural structures are another story altogether, and most computer systems can't even cope with the basic handling of them. It's feasible in concept to attach a neural network as metadata to an object type, detailing how to recognize and handle components within it(7).