Previously, we discussed the essential skill sets that you might need as a data engineer. Most importantly, however, it is all about the data! And data can take various forms: structured or unstructured.
Structured data
It's the data format that we can easily recognize in our day-to-day activities: the train schedule from one station to another, or the price list in our favorite cafe. Such data is structured into rows and columns, where each row represents a coffee on the menu that we can order, and the columns tell us which coffee is sold at which price.
menu | price |
Cappuccino | 1.99 |
Cafe latte | 2.99 |
Espresso | 1.99 |
Flat White | 1.99 |
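As a rough sketch, the small menu above could be held in Python as a list of rows, where every row has the same two columns. The variable names (`menu`, `cheapest`) are only illustrative and are not part of any particular system.

```python
# Each tuple is one row of the menu above: (coffee name, price).
menu = [
    ("Cappuccino", 1.99),
    ("Cafe latte", 2.99),
    ("Espresso", 1.99),
    ("Flat White", 1.99),
]

# Because every row has the same columns, a question such as
# "which coffees are the cheapest?" becomes a simple filter.
cheapest_price = min(price for _, price in menu)
cheapest = [name for name, price in menu if price == cheapest_price]

print(cheapest)  # ['Cappuccino', 'Espresso', 'Flat White']
```

The point is less about the code itself and more about how predictable the shape of the data is: we always know where to find the name and where to find the price.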
Each row represents exactly one record, and each column represents an attribute of that record. It is a highly organized format, which makes the data easier to extract, read, and manipulate. However, as the data grows in either dimension (rows or columns), the data structure can become more and more complex.
menu | price | store | store location |
Cappuccino | 1.99 | Main St. | Downtown |
Cafe latte | 2.99 | Main St. | Downtown |
Espresso | 1.99 | Main St. | Downtown |
Flat White | 1.99 | Main St. | Downtown |
Green Tea Latte | 2.99 | Main St. | Downtown |
Cappuccino | 1.99 | Hays St. | Downtown |
Cafe latte | 2.99 | Hays St. | Downtown |
Espresso | 1.99 | Hays St. | Downtown |
Flat White | 1.99 | Hays St. | Downtown |
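To see why the wider table above starts to feel heavy, here is a minimal sketch that counts how often the same store facts are repeated. The rows are copied from the table; the variable names are only illustrative.

```python
# Rows of the wide table above: (menu, price, store, store location).
rows = [
    ("Cappuccino", 1.99, "Main St.", "Downtown"),
    ("Cafe latte", 2.99, "Main St.", "Downtown"),
    ("Espresso", 1.99, "Main St.", "Downtown"),
    ("Flat White", 1.99, "Main St.", "Downtown"),
    ("Green Tea Latte", 2.99, "Main St.", "Downtown"),
    ("Cappuccino", 1.99, "Hays St.", "Downtown"),
    ("Cafe latte", 2.99, "Hays St.", "Downtown"),
    ("Espresso", 1.99, "Hays St.", "Downtown"),
    ("Flat White", 1.99, "Hays St.", "Downtown"),
]

# A set keeps only the distinct (store, location) pairs.
stores = {(store, location) for _, _, store, location in rows}

# Number of rows that would need editing if Main St. changed its location.
main_st_rows = sum(1 for row in rows if row[2] == "Main St.")

print(len(rows), len(stores), main_st_rows)  # 9 2 5
```

Nine rows describe only two distinct stores, so a single change to one store would have to be repeated on several rows.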
In a database management system, as the data gets more complex, the data structure needs to be designed around a certain model so that the data can be stored and accessed effectively. This is typically done by following normalization techniques.
For example, if we normalize the above table, it could look like this:
menu | price | store_id |
Cappuccino | 1.99 | 1 |
Cafe latte | 2.99 | 1 |
Espresso | 1.99 | 1 |
Flat White | 1.99 | 1 |
Green Tea Latte | 2.99 | 1 |
Cappuccino | 1.99 | 2 |
Cafe latte | 2.99 | 2 |
Espresso | 1.99 | 2 |
Flat White | 1.99 | 2 |
store_id | location | city | manager_id |
1 | Main St. | Downtown | 11 |
2 | Hays St. | Downtown | 21 |
manager_id | name |
11 | Ashley |
21 | Kumar |
Now we have three tables: menu, stores, and managers. Each table represents one entity. Splitting the data into multiple tables makes it easier for us to identify each entity. It also enables the system to access and record the data for each entity more effectively.
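The three tables above can be sketched with Python's built-in `sqlite3` module. The table and column names come from the tables above; the column types, the primary keys, and the `REFERENCES` clauses are assumptions added for illustration, not the only way to model this.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# REFERENCES documents how the tables link together. SQLite only
# enforces these links when PRAGMA foreign_keys is switched on.
conn.executescript("""
CREATE TABLE managers (manager_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stores (
    store_id   INTEGER PRIMARY KEY,
    location   TEXT,
    city       TEXT,
    manager_id INTEGER REFERENCES managers(manager_id)
);
CREATE TABLE menu (
    menu     TEXT,
    price    REAL,
    store_id INTEGER REFERENCES stores(store_id)
);
""")

conn.executemany("INSERT INTO managers VALUES (?, ?)",
                 [(11, "Ashley"), (21, "Kumar")])
conn.executemany("INSERT INTO stores VALUES (?, ?, ?, ?)",
                 [(1, "Main St.", "Downtown", 11),
                  (2, "Hays St.", "Downtown", 21)])
conn.executemany("INSERT INTO menu VALUES (?, ?, ?)", [
    ("Cappuccino", 1.99, 1), ("Cafe latte", 2.99, 1), ("Espresso", 1.99, 1),
    ("Flat White", 1.99, 1), ("Green Tea Latte", 2.99, 1),
    ("Cappuccino", 1.99, 2), ("Cafe latte", 2.99, 2), ("Espresso", 1.99, 2),
    ("Flat White", 1.99, 2),
])

# Joining the tables brings the entities back together when we need them.
query = """
SELECT m.menu, m.price, s.location, g.name
FROM menu AS m
JOIN stores AS s ON s.store_id = m.store_id
JOIN managers AS g ON g.manager_id = s.manager_id
WHERE m.menu = 'Espresso'
ORDER BY s.store_id
"""
result = conn.execute(query).fetchall()
print(result)
# [('Espresso', 1.99, 'Main St.', 'Ashley'), ('Espresso', 1.99, 'Hays St.', 'Kumar')]
conn.close()
```

Each store and each manager is now stored exactly once, and a join rebuilds the wide view only when a question actually needs it.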
Entity relationships and normalization techniques will be discussed in more detail in upcoming articles. There is no single simple solution when it comes to normalization. It all depends on the process used to acquire the data, the real-world behavior of the data and what it represents, and how the data will be accessed.
Unstructured data
Believe it or not, we encounter unstructured data all the time. It is not exactly the coffee's price on a menu, but the picture of the coffee is! How it looks conveys information that leads us to decide, "I ought to buy this one."
The films that we watch contain data that we can't exactly measure programmatically, such as how a certain expression conveys a character's emotion, the tone of the actors' voices, the background sound effects, the music, and how all of that combines to give us the experience of floating into a story.
The music that we listen to is another example: one genre gets us dancing up and down the floor, and another lets out all of our tears. Believe it or not, it's all data! It is just that the data is made up of various elements, scattered without any specific structure, so it can't be easily recognized programmatically, at least not as easily as a tabular format.
Hence, analyzing unstructured data usually relies on recognizing patterns or sentiments in the data, instead of looking at each individual element of the data.
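As a very rough illustration of looking at a pattern rather than at each element, imagine a tiny grayscale picture stored as a grid of pixel values. The 4x4 grid and the threshold of 128 below are made up for this sketch; a real photo would have millions of pixels, and real image analysis is far more involved.

```python
# Hypothetical 4x4 grayscale picture: 0 is black, 255 is white.
# Think of it as a dark coffee seen from above on a white table.
pixels = [
    [250, 248, 251, 249],
    [247,  60,  55, 250],
    [249,  52,  58, 248],
    [251, 250, 249, 252],
]

# No single pixel tells us anything on its own, so we flatten the
# grid and summarize all of the values together.
flat = [value for row in pixels for value in row]

# Share of pixels darker than the (assumed) threshold of 128.
dark_fraction = sum(value < 128 for value in flat) / len(flat)

print(dark_fraction)  # 0.25
```

There are no named columns to query here. Whatever meaning we get out of the picture comes from a summary over all of the values at once.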