Defining unstructured data
A starting point is needed for defining exactly what unstructured data is. The goal of this section is to begin to describe and define the base components of unstructured data.
Terminology
In reviewing this book, an important question was raised: what is the best term to describe the concept of storing and delivering digital information? On investigation, a number of terms that came close were discovered, though none truly described the concept we were trying to express.
The following is a list of some of the terms discovered and reviewed, including definitions found on the Internet.
A digital file is a collection of binary data represented as bytes, contained and assigned a name to identify it. Digital files traditionally exist within a filesystem. They can also be captured and stored in a database.
A digital image is a representation of a two-dimensional image as a finite set of digital values, called picture elements or pixels(8). It is commonly known as a digital photo.
In various current usages, a digital object or asset may comprise a single media file or group of files including or excluding some or all associated metadata. The framework's apparent usage of a digital object to denote a single media file excluding its associated metadata should be made explicit to avoid misreading in opposition to the term's other contemporary usages. This recommendation for explicit definition would apply equally to the term digital asset should that language be adopted instead(9).
There are a number of definitions of digital content available. They are as follows:
- Any digital data traffic should be viewed as a digital content product
- Digital content products would seem logically to include those that have a digital representation
- Digital content products would include any products that are encoded in digital form
- Products that are in digital format and that form part of the content of a repository, collection, exhibition, or archive(10)
- The definition of digital content encompasses images, music, and videos(11)
A digital asset is a digital object that can be clearly identified as a singular item or component, and which may be ascribed a value. Computer systems can be built to manage these assets. Such a system is referred to as a Digital Asset Management System (DAMS), which is a system for organizing and managing access to digital materials.
Digital materials is a broad term encompassing digital surrogates created as a result of converting analogue materials to digital form (digitization), born-digital materials, for which there has never been and is never intended to be an analogue equivalent, and digital records(12).
Digital libraries (DLs) are organized collections of digital information. They combine the structuring and gathering of information, which libraries and archives have always done, with the digital representation that computers have made possible(13).
A DL contains digital representations of the objects found in it. Most understanding of the DL probably also assumes that it will be accessible via the Internet, though not necessarily to everyone. But the idea of digitization is perhaps the only characteristic of a digital library on which there is a universal agreement(14).
Analyzing the digital object
Each of the preceding definitions is correct, but the issue is that none truly conveys what it means to manage unstructured data and deliver it. Each definition is restrictive and does not adapt to changing digital technology. Most assume a digital image is a photo or document, and all assume they are owned. As will be shown further, these assumptions do not stand up under closer scrutiny.
What did stand out was that most definitions conveyed the idea of representation, that is, the digital information is meant to symbolize something, be it a photo, document, or video.
So which term should be used? After reviewing all terms the one that seems to have the most potential is a digital object. This is the term that will be used throughout most of the book. It is far easier to use an existing term that people are familiar with than it is to create a new one or define an acronym.
It is then important to accurately define what a digital object actually is. With technology changing, any classic definition we give today is likely to be out of date within a couple of years. The standard perception that the general public has of a digital object is a photograph taken by a digital camera. As will be explained later, a digital photograph is just a subset of type Picture. In fact, when looking at digital objects we are looking at ways of representing data, which is ultimately used by one of our traditional five senses.
When looking at the types of digital objects available they can be broken up as shown in the following table:
These are the traditional object types used throughout the world, but one needs to consider what types of images will exist in 50 years' time. It is nearly impossible to predict this, so to accurately define a digital object we need to look at how we as humans deal with digital objects and use this to future-proof our definition. This involves looking at the senses humans use for viewing digital objects and then expanding on this.
So let's redefine the definition of a digital object to the following:
A digital object is a representation of anything, stored in binary format, to be used by our senses.
Why "to be used by our senses"? If there is no intention of use for a digital object, it can be classified as a digital file. A Windows DLL, a Unix executable, or security attributes are all digital representations of something, but they are not digital objects because they are not used by our senses. They are used by computers for the management of data. By specifying that the binary representation has to be used by our senses, the boundaries of use for that digital object are captured and can then be further defined.
The traditional view is that we have five senses: sight, sound, touch, taste, and smell. When looking in greater detail at these senses we can break them down as shown in the following table:
There is not much difference between taste and smell, as both involve chemical reactions. Interestingly, we can actually taste with our nose(15). When it comes to sound we can actually feel certain low vibration sounds. For watching movies on DVDs, this is an important experience and part of the entertainment value. In this case the deep bass, which is emitted by certain speakers, is felt by the body through touch. So one digital object can be used by multiple senses.
By equating it to a sense we can resolve a number of real world problems associated with defining an image. For example, a document that can be viewed or read using sight, can be converted to Braille and then read by touch. For those familiar with the TV show Red Dwarf, in that series they even explored the concept of reading a book using smell.
Just because we currently are not using one of our senses for viewing a digital object, it does not mean it should be excluded. A good example is taste. Currently it is very hard to simulate taste in a digital sense, but this doesn't mean that in ten years' time the concept of artificial taste will not be invented.
Digital object types
A digital object can be broken down into image types. Each image type can be further broken into image subtypes. We can then apply conversion and transformation rules to each subtype to modify the digital object.
By breaking down the senses into their core concepts and then equating them to traditional image concepts, it now becomes possible to identify traditional object types and then define them.
A digital object does not need to have any meaning associated with it, nor does it have to represent a real world scene (which is the traditional view of a photograph). A picture of an abstract painting is a digital object and the white noise of an empty TV channel can be classified as a digital object.
For simplicity we will maintain a digital object as having to be stored in the binary format. Though there are audio and video formats that use analogue signals, these formats can be expressed in a binary digital format. Even when looking at artificial intelligence and the use of neural networks, this can be represented in a binary format.
Core types
Digital objects can be composed of two core types and the dimension of time. By combining these core types into different combinations, a variety of base types can be created.
The core types are as follows:
- Image
- Audio
The use of the dimension of time is very fluid and varies based on how it is used. Video uses a very strict definition of time. Animated GIFs use a very simple time-based sequence, whereas heraldry uses a very loose definition of time. The use of time is covered later in greater detail.
When expanding this definition to handle three-dimensional objects, the concept of an artifact is introduced. This is an object created from physical materials, but created digitally. An example is a model created from a three-dimensional printer using resins and glue.
Subtypes
For each object type we introduce object subtypes. For example, we can define a photograph as a real world representation of a picture. A line drawing is a hand drawing. CGI is a computer-generated image.
A picture is a two-dimensional representation of anything. A picture can be viewed using all senses. A picture is defined as having a width and height assuming the picture is rectangular. For non-rectangular pictures the width and height describe the upper boundary lengths of the picture.
The following are examples of picture subtypes:
- Geo-raster
- Photo
- Art
- Line drawing
- Montage
- 3D view using a set of 2D pictures (but still 2D)
- Stereoscopic image
Audio is a time-based set of sounds. If we investigated hard enough, we could eventually equate a sound to a picture. This is not required and to keep things simple this is not going to be done.
The following are the examples of audio subtypes:
- Music
- Audio book
Creating new base types
When looking at these definitions it can be seen that the definitions of a document and a video can be expressed in terms of picture and audio. By adding a rule set we can create new digital image types. The two new rule definitions are as follows:
- Digital image types can be time based, a set of digital objects linked together using the dimension of time.
- A well defined character set. It is a set of pictures or icons grouped together. UTF8 and US7ASCII are character sets. Egyptian hieroglyphs can be grouped together to form a character set.
Using new rules we can create new digital object types based on the core picture and audio image types.
The following are examples of non-traditional digital object types:
A document is a set of pictures with each picture optionally representing a character from one or more well-defined character sets. Each picture can be classed as a font.
The example subtypes are as follows:
- Ephemera
- Structured documents (used for signaling)
- Forms
A video is a combination of an optional set of time-based pictures and an optional set of time-based sounds. A DVD is an example of a video subtype. A photo montage is not an example of a video subtype. In this case we have a set of pictures but because they are not time based, they can only be classified as type picture.
The example subtypes are as follows:
- Film
- TV
- Documentary
- Surveillance
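The rule separating a video from a picture can be sketched in code. The following is a minimal illustration only: the `DigitalObject` class, its fields, and the `base_type` function are hypothetical names introduced here, and the sketch simplifies the rules by treating a sound-only object as audio.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    """Hypothetical description of an object by its core components."""
    pictures: int = 0         # number of picture components
    sounds: int = 0           # number of audio components
    time_based: bool = False  # are the pictures linked by the dimension of time?


def base_type(obj: DigitalObject) -> str:
    """Classify an object using a simplified form of the rules in the text."""
    if obj.pictures and obj.time_based:
        return "video"    # time-based pictures, with or without sound
    if obj.pictures:
        return "picture"  # e.g. a photo montage: many pictures, no time
    if obj.sounds:
        return "audio"    # simplification: sound-only objects stay audio
    return "unknown"
```

Under these assumptions a photo montage of twelve pictures is classified as a picture, whereas the same twelve pictures sequenced over time would be classified as a video.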
Multimedia is a combination of one or more object types that is optionally time based. Usually, they are created to be interactive, such as an educational program or game.
The example subtypes are as follows:
- VRML
- SVG
- SMIL
- Macromedia Flash
- Java Applet
Data is a document that is perceived by its users as a collection of tables (and nothing but tables). This is a slight expansion on the original definition of relational. Relational data is treated as image data, as it can be transformed into a picture (creating a graph) or a document (creating a report) or even a video (creating a view using data mining analysis).
The example subtypes are as follows:
- Relational
- XML
- Object
- Metadata
A simulation is where we take the data, convert parts of it into well-defined objects, and then extend it over a well-defined period of time. A simulation can be converted into video. A simulation, which is given a set of tightly enforced rules, can be extended into a self-evolving artificial neural network with the resulting output being an enhanced pattern-matching algorithm. Such an algorithm can be subsequently used for transforming digital object subtypes.
Genealogy is a record of the descent of a person, family, or group from an ancestor or ancestors(17). It involves taking data, documents, photos, video, and audio and extending them over time.
The subtypes include the following:
- Heraldry: It is the study and classification of armorial bearings and the tracing of genealogies(18)
- Private record: It is a privately defined record hierarchy position(19)
Virtual digital object
It is possible for a digital object to be categorized into multiple object types. This is because the line on what actually constitutes a digital object can change depending on how it is delivered. For example, an MP3 file is classified as an audio type. If it is delivered using the Real Player server and streamed to a client, it is treated as a video type. Another example is an animated GIF, which is a time-related set of images enclosed within a repeating cycle. An animated GIF is by definition a video, yet for ease of delivery, it is delivered as a specialized GIF (that is, a type of static picture).
This means it is important to separate the storage of the digital object from the delivery mechanism. The delivery mechanism might involve a virtual change of the digital object. The digital object exists in two (or possibly more) states and it isn't until the object is delivered that its true state is determined. When this happens it is called a virtual digital object.
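The separation between the stored type and the delivered type can be sketched as a simple lookup. This is an illustration under stated assumptions: the rule table, the mechanism names (`"streamed"`, `"animated_gif"`), and both function names are hypothetical and only encode the two examples given above.

```python
# Hypothetical rules: (stored type, delivery mechanism) -> type as delivered.
DELIVERY_RULES = {
    ("audio", "streamed"): "video",        # MP3 streamed to a client
    ("video", "animated_gif"): "picture",  # animated GIF served as a GIF
}


def delivered_type(stored_type: str, mechanism: str) -> str:
    """Return the type the object takes on when delivered."""
    # With no matching rule, delivery does not change the type.
    return DELIVERY_RULES.get((stored_type, mechanism), stored_type)


def is_virtual(stored_type: str, mechanism: str) -> bool:
    """An object is virtual when delivery changes its apparent type."""
    return delivered_type(stored_type, mechanism) != stored_type
```

Keeping the rules in a table separate from the stored object reflects the point made above: the stored type stays fixed, while the delivered type is only settled once a delivery mechanism is chosen.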
In a perfect world there is no difference between a digital object and its delivery mechanism. But because of Internet standards that limit what can be seen (for example, browsers by default only view JPEG and GIF images and not TIFF ones) and due to limitations in network bandwidth and cost of delivery, virtual digital objects have had to be created to address these issues. These issues are subject to current environmental constraints and will change over time. HTML5 is attempting to define a set of supported video standards. This is covered further in Chapter 3, The Multimedia Warehouse.
Digital object delivery
One goal in this book is to describe how to deliver a digital object. This is covered in great detail in Chapter 6, Delivery Techniques. At the moment we have classified what a digital object is, but have not defined what it is to actually deliver it.
We expand on the original definition and add the following:
Only when that digital object has been successfully consumed by one of our senses can it be considered to be delivered.
This means a photograph viewed on the computer screen has been delivered. A DVD streamed to a computer terminal has been delivered and a document viewed and read has been delivered.
It is not important that money has been transacted when delivering a digital image. Buying a digital image is an optional part of the delivery process.
But what about the scenario where an audio file is cut to CD and then shipped to a customer? What if that customer does not listen to it? By the preceding definition it has not been consumed and therefore it has not been delivered. Common sense indicates that the image has been delivered. From a traditional consumer viewpoint it is on actual receipt of the digital image that the image can be considered to be delivered. This view is now starting to conflict with new e-commerce concepts appearing on the Internet. That is, consumers are now being charged for use of a digital image only when it has been consumed and not when they have received it.
So when defining consumption of an image and ensuring that definition is future proof, we have to be careful that our traditional viewpoint of commerce does not interfere with that definition. With e-commerce the rules are changing, consumer habits are changing, and new ideas for image delivery are being tried.
At this point we will leave the definition as it is. In Chapter 6, Delivery Techniques, we will explore this concept in greater detail. Here we will be looking at who the consumer is and who the producer is. With e-commerce our traditional perspectives need to be challenged.
Manipulating digital objects
This section is an introduction to the methods available for managing and manipulating a digital object. When working with digital objects it soon becomes apparent that techniques have to be utilized to view and understand what they actually are. The digital object itself can contain other digital objects and only by processing it can these other objects be discovered.
Conversion is when we change an object subtype into another object subtype. Major conversions occur when we convert between types, for example, when we go from a picture to a document. Minor conversions occur when we convert an image between subtypes. A common example is converting a JPEG image to a GIF image.
In converting a digital image the process might be irreversible, meaning once converted it cannot be converted back again. For example, in converting a video to a photograph, we cannot convert that photograph back to the same video.
The process might also lose information in the conversion. In converting a JPEG image to a GIF image and back to a JPEG image, color information is lost. Though the image might look like the original image, it is not the original image. This is a lossy conversion and is covered in greater detail in Chapter 2, Understanding Digital Objects. At the end of this chapter there is a chart detailing how it's possible to convert between all the major types.
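The loss of color information can be illustrated with a small sketch. A GIF image is limited to a palette of at most 256 colors, so a 24-bit color has to be mapped to one of 256 palette entries. The fixed 3-3-2 bit palette used here is an assumption made purely for illustration; real encoders typically build a palette adapted to the image, and JPEG introduces losses of its own that are not modeled.

```python
def to_palette_index(rgb):
    """Map a 24-bit (r, g, b) color to one of 256 palette indexes (3-3-2 bits)."""
    r, g, b = rgb
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)


def from_palette_index(index):
    """Expand a palette index back to an (r, g, b) color."""
    r = (index >> 5) & 0b111
    g = (index >> 2) & 0b111
    b = index & 0b11
    return (r << 5, g << 5, b << 6)


original = (200, 100, 50)
round_trip = from_palette_index(to_palette_index(original))
# round_trip is (192, 96, 0): close to the original color, but not the same.
```

Repeating the round trip on the converted color gives the same result again, which matches the point above: the lost information cannot be recovered by converting back.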
Transformation occurs when a digital object is modified. For example, we can rotate, watermark, or crop a photo. We can convert the bit rate of an audio file, change a Word document into an Adobe document, or add special effects to a video. Transforming does not change the object subtype.
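Two of the transformations mentioned, cropping and rotating, can be sketched on a picture held as a list of pixel rows. This representation and the function names are assumptions chosen for brevity; a real system would use an imaging library. In both cases the result is still a picture, so the object subtype is unchanged.

```python
def crop(pixels, left, top, width, height):
    """Return the rectangle of the given size starting at (left, top)."""
    return [row[left:left + width] for row in pixels[top:top + height]]


def rotate_clockwise(pixels):
    """Rotate the picture 90 degrees clockwise."""
    # Reverse the row order, then read the columns off as new rows.
    return [list(column) for column in zip(*pixels[::-1])]


picture = [
    [1, 2, 3],
    [4, 5, 6],
]
rotated = rotate_clockwise(picture)   # [[4, 1], [5, 2], [6, 3]]
cropped = crop(picture, 1, 0, 2, 1)   # [[2, 3]]
```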
A digital object can be composed of multiple digital objects. The extraction process involves unpacking those digital objects. For example, a DICOM image can be composed of multiple photographs and documents that are in turn digital objects themselves.
We live in a world where storage is limited. Storage includes not only the volume of space a digital object uses, but also the bandwidth required to deliver that object. Compression therefore becomes important, and for all digital objects we deal with either lossless or lossy compression.
With lossless compression the digital object is compressed (reduced in size) and when uncompressed the original digital object is reconstructed without the loss of any original information.
With lossy compression, the obtained object is not the same on reconstruction as the original object. This is covered in greater detail in Chapter 2, Understanding Digital Objects.
After the compression or conversion of an object, we may lose some information in the process. It therefore becomes important to be able to define whether that modified object is still the original object.
The technical definition is that two digital objects are classified as absolutely identical when, compared in an uncompressed format, each byte exactly matches the byte in the same corresponding position.
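The technical definition translates directly into a byte-by-byte comparison. The following is a minimal sketch, assuming both objects are already available as uncompressed byte strings; the function names are illustrative.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if there is none."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One object is a prefix of the other; they differ where it ends.
        return min(len(a), len(b))
    return None


def absolutely_identical(a: bytes, b: bytes) -> bool:
    """True when every byte matches the byte in the same position."""
    return first_difference(a, b) is None
```

In Python the comparison `a == b` on two `bytes` values gives the same answer; the explicit loop is shown only because it also reports where the objects first diverge.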
With a digital object, this definition does not match with real world expectations. For example, we can convert a WAV file to MP3 and then back to WAV. The technical definition says the two are different, but to the human ear listening to the original and the converted WAV file, there will be no difference.
In another example, it is possible to embed hidden watermarks in a JPEG image. A person viewing the original image and the modified image will not be able to tell them apart and will say they are the same digital image.
To address this we can then add a new definition: two digital objects are classified as observably identical when they are perceived to be identical.
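Perception cannot be computed directly, but a crude stand-in can illustrate the difference between the two definitions. The sketch below assumes that two objects look or sound the same when every sample (a pixel or audio value) differs by no more than a small tolerance. Both the approach and the default tolerance of 2 are assumptions made for illustration, not a model of human perception.

```python
def observably_identical(samples_a, samples_b, tolerance=2):
    """Approximate 'perceived to be identical' with a per-sample tolerance."""
    if len(samples_a) != len(samples_b):
        return False
    return all(abs(x - y) <= tolerance for x, y in zip(samples_a, samples_b))


original = [10, 200, 35]
watermarked = [11, 199, 35]  # tiny changes, such as a hidden watermark

# The two differ byte for byte, yet fall within the tolerance.
same_bytes = original == watermarked                       # False
same_to_observer = observably_identical(original, watermarked)  # True
```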
Now that we have defined digital object comparison, we can apply this to our compression definition as follows.
On compressing a digital object, if the object obtained by uncompressing it matches the original object, the compression is said to be lossless. If they do not match, the compression is lossy.
This in turn raises a new issue. If the resultant lossy image when viewed is not perceived to be identical to the original image, that image is termed as being badly compressed.
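This test can be sketched as a round trip: compress, uncompress, and compare with the original. The example uses `zlib` from the Python standard library as the lossless case. The lossy pair, which keeps every second byte and then duplicates each byte on reconstruction, is a hypothetical stand-in for a real lossy codec.

```python
import zlib


def is_lossless(compress, decompress, original: bytes) -> bool:
    """True when the round trip reproduces the original exactly."""
    return decompress(compress(original)) == original


def lossy_compress(data: bytes) -> bytes:
    """Hypothetical lossy step: keep only every second byte."""
    return data[::2]


def lossy_decompress(data: bytes) -> bytes:
    """Rebuild the original length by repeating each kept byte."""
    return bytes(value for kept in data for value in (kept, kept))


sample = b"unstructured data " * 50
zlib_is_lossless = is_lossless(zlib.compress, zlib.decompress, sample)
halving_is_lossless = is_lossless(lossy_compress, lossy_decompress, bytes(range(32)))
```

Here `zlib_is_lossless` is true and `halving_is_lossless` is false: both techniques reduce the size of the stored object, but only the first reconstructs it without losing information.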
The skill comes in balancing compression to reduce the size of the original digital object without it becoming noticeably badly compressed.
It should be understood that stating an object is identical just because it is perceived to be identical is highly dependent on the individual doing the comparison. This becomes a very gray area when it comes to image searching, which is discussed in more detail in Chapter 3, The Multimedia Warehouse, and Chapter 4, Searching the Multimedia Warehouse. It is an imprecise area that is not suited to traditional binary logic but is well suited to neural networks, pattern matching, and fuzzy logic.
A thumbnail is a digital object that has been transformed and/or converted into a format which uses less storage. The goal in creating thumbnails is to improve the performance of object delivery. It is not fair to classify a thumbnail as an index, for the simple reason that it is not transparent. In a relational database the data is perceived by the user as tables (and nothing but tables)(20). An index is an object designed to improve performance. It cannot be seen as a table, so the corollary is that it must be transparent. A thumbnail is seen and yet is designed to improve performance like an index, so it breaks the original relational rule. A thumbnail fits in the structure referred to as a pyramid index.
From an Oracle perspective the closest equivalent is to treat thumbnails as a form of materialized view. Multiple thumbnails of varying size can be created from an original image. Two types of thumbnails are the web quality thumbnail and the standard thumbnail. The standard thumbnail is the smallest size produced, whereas the web quality thumbnail is the largest size produced.
In the case of a Georaster Image (which is a very large digital photograph typically seen as a satellite image), hundreds of thumbnails can be created of varying sizes based on the original.
Thumbnails are optional and do not have to be produced.
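The idea of a series of thumbnails of decreasing size can be sketched as follows. The halving rule, the `smallest_side` limit, and the function name are assumptions made for illustration; the text does not prescribe how thumbnail sizes are chosen.

```python
def pyramid_sizes(width, height, smallest_side=64):
    """List thumbnail sizes by halving until the shorter side would drop
    below smallest_side. The original size itself is not included."""
    sizes = []
    w, h = width, height
    while min(w, h) // 2 >= smallest_side:
        w, h = w // 2, h // 2
        sizes.append((w, h))
    return sizes


sizes = pyramid_sizes(4000, 3000)
# [(2000, 1500), (1000, 750), (500, 375), (250, 187), (125, 93)]
```

In this sketch the first entry would play the role of the web quality thumbnail and the last entry that of the standard thumbnail. An image that is already small produces an empty list, which is consistent with thumbnails being optional.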
As will be shown throughout this book, a lot of traditional relational concepts are broken when applied to the world of unstructured data. Thumbnails and indexing are just one good example. This can be unsettling for those who have been trained and skilled in the relational database world. Unstructured data is seen as either a threat or an anomaly that is best treated by placing it into a blob field or insisting that it be stored externally and not in the database. The psychology behind this resistance to adopting and using unstructured data cannot be easily dismissed and must be factored in by the data designer, database administrator, and developer. The introduction to the market of multimedia-centric devices, such as the iPad or Android devices, is beginning to break down the notion of keeping all unstructured data outside the database, as users become better educated and more fluent in the use of multimedia and insist on greater use of and access to it in their applications.
Transposing is the act of combining multiple object subtypes into a new object subtype while still keeping each subtype separate and distinct.
The traditional example of this is mapping spatial data over an image. The data is separate and can be searched. For example, we can search for a grid reference point on a map. Another example is seen when attaching metadata to an object. We can add EXIF data to a camera picture. The metadata is a specialized case and will be looked at in more detail in Chapter 3, The Multimedia Warehouse.
Searching for a digital object is a complex topic and is covered in greater detail in Chapter 3, The Multimedia Warehouse, and Chapter 4, Searching the Multimedia Warehouse.
Due to the complexity in searching a digital object, the current method is to search within a transposed object, with data being transposed over the digital object. Searching against data is simpler than trying to search within the image. Searching using this technique is called Data Transpose Searching. For example, the standard search method for looking for images involves searching against metadata attached to the image.
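Searching against attached metadata rather than the object itself can be sketched with a small catalog. The catalog layout, the metadata keys, and the `metadata_search` function are hypothetical and are introduced only to illustrate the idea.

```python
def metadata_search(catalog, **criteria):
    """Return the ids of objects whose metadata matches every criterion."""
    return [
        object_id
        for object_id, metadata in catalog.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog: the images are never opened, only their metadata.
catalog = {
    "img_001.jpg": {"subject": "tree", "year": 2010},
    "img_002.jpg": {"subject": "sunset", "year": 2010},
    "img_003.jpg": {"subject": "tree", "year": 2011},
}

trees = metadata_search(catalog, subject="tree")  # img_001.jpg and img_003.jpg
```

The quality of such a search depends entirely on the quality of the metadata: an image of a tree that was never tagged as one will not be found.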
One key goal when searching is to search on the actual digital object. For example, find all photos with a tree in them, or find the audio file that contains a particular lyric, or find the video that has Elvis Presley singing the song "Always on My Mind". Currently, computer technology has not progressed to a stage where this is easily possible. A search using this method is called Actual Searching.
Another form of searching involves expanding on the concept of badly compressed objects and finding related or similar digital objects. We might want to find all pictures that have a sunset in them, using an existing photo as a base for the search engine to work from. This type of searching is called Similarity Searching, and the technology is now available to search on a variety of digital objects.
Similarity searching has the potential to be used in a number of fields, especially in fraud and copyright protection. For example, software is now available for universities where they can find all essays that are similar to ones submitted by students. By adjusting the similarity parameters a teacher can then compare two essays and determine with a high degree of certainty whether one is a copy of the other and has been slightly modified.
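The essay example can be sketched with `difflib.SequenceMatcher` from the Python standard library, which scores how much two sequences have in common. Comparing word by word and using a threshold of 0.8 are assumptions made for illustration; commercial similarity tools use their own techniques, which are not described here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Score two texts between 0.0 (nothing shared) and 1.0 (identical words)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def find_similar(candidate: str, corpus: dict, threshold: float = 0.8):
    """Return the names of corpus texts scoring at or above the threshold."""
    return [
        name
        for name, text in corpus.items()
        if similarity(candidate, text) >= threshold
    ]


submitted = "the quick brown fox jumps over the lazy dog"
corpus = {
    "essay_a": "storage is limited and bandwidth is costly",
    "essay_b": "the quick brown fox leaps over the lazy dog",
}
suspects = find_similar(submitted, corpus)  # only essay_b passes the threshold
```

Lowering the threshold widens the net and raises the number of false matches, which corresponds to adjusting the similarity parameters described above.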
A set of images linked together is referred to as a product group. This is not to be confused with the composite type discussed further in the chapter. A product group in an intelligence warehouse might be a set of digital images of a crime scene. In an electronic commerce system it might be a set of songs, videos, and digital booklets relating to an album.