2. Introduction (cont…)
• Corporate data include everything found in the corporation in the way of
data.
• The most basic division of corporate data is by structured data and
unstructured data.
• As a rule, there are far more unstructured data than structured data.
• Unstructured data have two basic divisions: repetitive data and nonrepetitive data.
• Big data is made up of unstructured data.
3. Introduction (cont…)
• Nonrepetitive big data has a fundamentally different form than repetitive
unstructured big data.
• The differences between nonrepetitive big data and repetitive big data are
so large that they can be called the boundaries of the “great divide.”
• As a rule, nonrepetitive big data has MUCH greater business value than
repetitive big data.
4. Data Architecture
• Data architecture is about the larger picture of data and how it fits together in a typical organization.
6. Structured Data
• Structured data is data in a standardized format that has a well-defined
structure, conforms to a data model, follows a persistent order, and is
easily accessed by humans and programs. This data type is generally
stored in a database.
• Examples: tables in an SQL database, Excel spreadsheets, or data in any other relational database.
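As a minimal sketch of what "conforms to a data model" can mean in practice, the example below uses Python's built-in sqlite3 module. The accounts table, its columns, and the sample rows are hypothetical and chosen only for illustration. The database rejects a row that does not fit the declared structure.

```python
import sqlite3


def build_accounts_db():
    """Create an in-memory database with one fixed-schema table (hypothetical example)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " account_id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " balance_cents INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts (account_id, owner, balance_cents) VALUES (?, ?, ?)",
        [(1, "Ada", 125000), (2, "Grace", 98050)],
    )
    return conn


def balance_of(conn, account_id):
    """Look up a balance by key; returns None when the account does not exist."""
    row = conn.execute(
        "SELECT balance_cents FROM accounts WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    return None if row is None else row[0]


def try_insert_without_owner(conn):
    """Attempt to store a row that violates the schema (owner is missing)."""
    try:
        conn.execute("INSERT INTO accounts (account_id, balance_cents) VALUES (3, 500)")
        return True
    except sqlite3.IntegrityError:
        # The NOT NULL constraint on owner rejects the incomplete row.
        return False
```

Because every row has the same columns, a program can retrieve a value with a single keyed query instead of searching through free-form content.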
7. Unstructured Data
Unstructured data is information that is not arranged according to a preset data model or schema, and
therefore cannot be stored in a traditional relational database or RDBMS. Text and multimedia are two
common types of unstructured content.
8. Repetitive Unstructured
• A typical form of repetitive unstructured data in the corporation might be the data generated by an
analog machine.
• For example, a farmer has a machine that reads the identification of railroad cars as the railroad
cars pass through the farmer's property. Trains pass through the property night and day. The
electronic eye reads and records the passage of each car on the track.
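The railroad-car example can be sketched in code to show why such data are called repetitive: every record has the same shape, so one small parser handles all of them. The record layout (an ISO timestamp, a comma, and a car identifier) and the sample values are assumptions made for illustration, not a format given in the source.

```python
from collections import Counter
from datetime import datetime

# Hypothetical readings from the electronic eye: "timestamp,car_id".
SAMPLE_LOG = [
    "2024-01-01T02:15:00,CAR-1001",
    "2024-01-01T02:15:04,CAR-1002",
    "2024-01-01T14:40:12,CAR-1001",
]


def parse_reading(line):
    """Split one reading into (timestamp, car_id); every line has the same layout."""
    timestamp_text, car_id = line.strip().split(",")
    return datetime.fromisoformat(timestamp_text), car_id


def count_passes(lines):
    """Count how many times each car was recorded, ignoring blank lines."""
    return Counter(parse_reading(line)[1] for line in lines if line.strip())
```

Calling count_passes(SAMPLE_LOG) tallies the passes per car. The same two functions would work unchanged on millions of readings, which is what makes repetitive data simple to process.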
9. Nonrepetitive Unstructured Data
• Nonrepetitive unstructured data are data whose records do not repeat in form or content, such as e-mails.
• Each e-mail can be long or short. The e-mail can be in English or Spanish (or some other
language). The author of the e-mail can say anything that he or she pleases. It is only a pure accident
if the contents of any e-mail are identical to the contents of any other e-mail.
• There are many other forms of nonrepetitive unstructured data: voice recordings, contracts,
customer feedback messages, etc.
11. The Great Divide of Data
It is hardly obvious why there should be this great divide of data.
But there are some very good reasons for the divide:
• Repetitive data usually have very limited business value, while
nonrepetitive data are rich in business value.
• Repetitive data can be handled one way; nonrepetitive data are
handled very differently.
• Repetitive data can be analyzed one way, while nonrepetitive
data can be analyzed in a very different manner.
Depicts e-mails, all transactions, telephone conversations, chats, etc.
There are many ways to subdivide the data shown in Fig. 1.1.1. The way that is shown is only one of many ways data can be understood.
One way to understand the data found in the corporation is to look at structured data and unstructured data. Fig. 1.1.2 shows this subdivision of data.
Structured data is highly specific and is stored in a predefined format, whereas unstructured data is a conglomeration of many varied types of data that are stored in their native formats. This means that structured data takes advantage of schema-on-write and unstructured data employs schema-on-read.
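The difference between schema-on-write and schema-on-read can be illustrated with a minimal sketch. The ORDER_SCHEMA layout, the function names, and the use of plain Python lists as stores are all hypothetical and used only for illustration. With schema-on-write, a record is checked before it is stored; with schema-on-read, anything is stored as-is and structure is applied only when the data are read.

```python
import json

# Hypothetical predefined format for a structured record.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def write_structured(store, record):
    """Schema-on-write: validate the record against the schema before storing it."""
    if set(record) != set(ORDER_SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected_type in ORDER_SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    store.append(record)


def write_raw(store, payload):
    """Schema-on-read, write side: store the payload in its native form, unchecked."""
    store.append(payload)


def read_order_ids(store):
    """Schema-on-read, read side: interpret each payload now, skipping what does not fit."""
    order_ids = []
    for payload in store:
        try:
            doc = json.loads(payload)
        except json.JSONDecodeError:
            continue  # free text such as an e-mail body is not JSON
        if isinstance(doc, dict) and "order_id" in doc:
            order_ids.append(doc["order_id"])
    return order_ids
```

The trade-off in this sketch is where the cost falls: write_structured rejects bad data early, while write_raw accepts everything and leaves the interpretation work to every reader.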
It is not obvious at all, but the dividing line within unstructured data between
repetitive data and nonrepetitive data is very significant. In fact, this dividing
line is so important that the division can be called the “great divide” of data.
Tools and techniques that work in one world simply are not applicable to the
other world and vice versa.
The basic divisions of data that are shown in Fig. 1.1.6 are important for a lot of reasons. Each of the divisions of data requires its own infrastructure, its own technology, and its own treatment. Even though all forms of data exist in the same corporation, each of the forms of data may as well exist on a different planet.
There is a very high degree of business value in structured data. For example, it is really important, both to the bank and to the customer, to have the correct bank account balance. Textual data contain even more highly valued business data: when customers talk to an agent of the company through a call center, everything the customer says is valuable. There is significantly less business value in nonrepetitive nontextual data and in unstructured repetitive data.