Structured and Unstructured Data

Previously, we discussed the essential skill sets that you might need as a data engineer. Most importantly, however, it is all about the data! And data can take various forms: structured or unstructured.

Structured data

It's the data format that we can easily recognize in our day-to-day activities: the train schedule from one station to another, or the price list in our favorite cafe. The price list is structured into rows and columns, where each row represents a coffee on the menu that we can order, and the columns say which coffee is sold at which price.

| menu | price |
| --- | --- |
| Cappuccino | 1.99 |
| Cafe latte | 2.99 |
| Espresso | 1.99 |
| Flat White | 1.99 |

Each row represents exactly one record and each column represents an attribute of that record. It is a highly organized format, which makes it easier to extract, read, and manipulate. However, as the data grows in either dimension (rows or columns), the data structure can get more and more complex.

| menu | price | store | store location |
| --- | --- | --- | --- |
| Cappuccino | 1.99 | Main St. | Downtown |
| Cafe latte | 2.99 | Main St. | Downtown |
| Espresso | 1.99 | Main St. | Downtown |
| Flat White | 1.99 | Main St. | Downtown |
| Green Tea Latte | 2.99 | Main St. | Downtown |
| Cappuccino | 1.99 | Hays St. | Downtown |
| Cafe latte | 2.99 | Hays St. | Downtown |
| Espresso | 1.99 | Hays St. | Downtown |
| Flat White | 1.99 | Hays St. | Downtown |
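To make the row-and-column idea concrete, here is a minimal Python sketch that reads the larger table as records. The CSV text and the helper name `load_rows` are assumptions made for illustration; the article does not say how the menu is actually stored.

```python
import csv
import io

# Hypothetical CSV rendering of the menu table (illustration only).
RAW = """menu,price,store,store location
Cappuccino,1.99,Main St.,Downtown
Cafe latte,2.99,Main St.,Downtown
Espresso,1.99,Main St.,Downtown
Flat White,1.99,Main St.,Downtown
Green Tea Latte,2.99,Main St.,Downtown
Cappuccino,1.99,Hays St.,Downtown
Cafe latte,2.99,Hays St.,Downtown
Espresso,1.99,Hays St.,Downtown
Flat White,1.99,Hays St.,Downtown
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(RAW)

# Each row is one record; each key is a column (an attribute of that record).
first = rows[0]
print(first["menu"], first["price"])

# The store details are repeated on every row that belongs to that store.
distinct_stores = {(r["store"], r["store location"]) for r in rows}
print(len(rows), "rows, but only", len(distinct_stores), "distinct stores")
```

Because every row carries its own copy of the store name and location, a change to one store would have to be repeated on each of its rows. That repetition is one reason a growing table becomes harder to maintain.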

In a database management system, as the data gets more and more complex, the data structure needs to be designed according to a certain model in order to store and access the data effectively, following normalization techniques.

For example, if we normalize the above table, it could look like this:

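| menu | price | store_id |
| --- | --- | --- |
| Cappuccino | 1.99 | 1 |
| Cafe latte | 2.99 | 1 |
| Espresso | 1.99 | 1 |
| Flat White | 1.99 | 1 |
| Green Tea Latte | 2.99 | 1 |
| Cappuccino | 1.99 | 2 |
| Cafe latte | 2.99 | 2 |
| Espresso | 1.99 | 2 |
| Flat White | 1.99 | 2 |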

| store_id | location | city | manager_id |
| --- | --- | --- | --- |
| 1 | Main St. | Downtown | 11 |
| 2 | Hays St. | Downtown | 21 |

| manager_id | name |
| --- | --- |
| 11 | Ashley |
| 21 | Kumar |

Now we have three tables: menu, stores, and managers. Each table represents one entity. Splitting the data into multiple tables makes it easier for us to identify each entity. It also enables the system to access and record the data for each entity more effectively.
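As a rough sketch of how the three tables work together, the example below loads them into an in-memory SQLite database using Python's built-in `sqlite3` module and joins them back together. The column types, the foreign-key references, and the helper name `find_item` are assumptions for illustration; the article does not prescribe a particular database or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the column names follow the tables above, the types are guesses.
conn.executescript("""
CREATE TABLE managers (manager_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stores (
    store_id   INTEGER PRIMARY KEY,
    location   TEXT,
    city       TEXT,
    manager_id INTEGER REFERENCES managers(manager_id)
);
CREATE TABLE menu (
    menu     TEXT,
    price    REAL,
    store_id INTEGER REFERENCES stores(store_id)
);
""")

conn.executemany("INSERT INTO managers VALUES (?, ?)",
                 [(11, "Ashley"), (21, "Kumar")])
conn.executemany("INSERT INTO stores VALUES (?, ?, ?, ?)",
                 [(1, "Main St.", "Downtown", 11),
                  (2, "Hays St.", "Downtown", 21)])
conn.executemany("INSERT INTO menu VALUES (?, ?, ?)", [
    ("Cappuccino", 1.99, 1), ("Cafe latte", 2.99, 1), ("Espresso", 1.99, 1),
    ("Flat White", 1.99, 1), ("Green Tea Latte", 2.99, 1),
    ("Cappuccino", 1.99, 2), ("Cafe latte", 2.99, 2), ("Espresso", 1.99, 2),
    ("Flat White", 1.99, 2),
])


def find_item(connection, item_name):
    """Return (menu, price, location, manager name) for each store selling the item."""
    query = """
        SELECT m.menu, m.price, s.location, g.name
        FROM menu AS m
        JOIN stores AS s ON s.store_id = m.store_id
        JOIN managers AS g ON g.manager_id = s.manager_id
        WHERE m.menu = ?
        ORDER BY s.store_id
    """
    return connection.execute(query, (item_name,)).fetchall()


for row in find_item(conn, "Cappuccino"):
    print(row)
```

Each store and each manager is now stored exactly once, and the joins rebuild the wide view only when a query needs it.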

More on entity relationships and normalization techniques will be discussed in the next articles. There's no one simple solution for normalization. It all depends on the process used to acquire the data, the real-world behavior of the data and what it represents, and how the data will be accessed.

Unstructured data

Believe it or not, we encounter unstructured data all the time. The coffee's price on a menu is not exactly unstructured data, but the picture of the coffee is! How it looks conveys information that leads us to decide, "I ought to buy this one."

The films that we watch contain data that we can't exactly measure programmatically, such as how a certain expression conveys a character's emotion, the tone of the actors' voices, the background sound effects, the music, and how all of that combined gives us the experience of floating into a story.

The music that we listen to is another example: one genre gets us dancing up and down on the floor, and another lets out all of our tears. Believe it or not, it's all data! Only that the data is made up of various elements, scattered without any specific structure, which can't be easily recognized programmatically -- not as easily as a tabular format, at least.

Hence, analyzing unstructured data usually relies on recognizing patterns or sentiments in the data, instead of looking at each individual element of it.
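As a toy illustration of sentiment recognition, the sketch below scores free-text coffee reviews by counting positive and negative words. Everything in it is hypothetical: the word lists, the sample reviews, and the function name `sentiment_score` are made up for this example, and real sentiment analysis typically relies on trained models rather than fixed word lists.

```python
import re

# Hypothetical word lists; a real system would usually learn these from data.
POSITIVE = {"love", "great", "smooth", "delicious"}
NEGATIVE = {"bitter", "cold", "bad", "burnt"}


def sentiment_score(text):
    """Return the count of positive words minus the count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


# Made-up reviews: free text has no rows or columns to query directly.
reviews = [
    "I love the flat white, so smooth and delicious!",
    "The espresso was cold and tasted burnt.",
]

for review in reviews:
    print(sentiment_score(review), review)
```

The score says nothing about any single word on its own; it summarizes a pattern across the whole text, which is the kind of signal that unstructured data tends to offer.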