Everything you need to get started with data engineering (part 1)

Not too long ago, data-driven companies started using the title "data engineer" as data grew in velocity, volume, value, variety, and veracity (also known as the 5 V's of big data). In a nutshell, that growth requires a broader focus on sophisticated engineering techniques than traditional data storage does.

Traditional data storage manages structured data, such as spreadsheets or relational database tables. We also have massive amounts of unstructured data to process and analyze, and that's when we start to dive deep into "big data."

Alongside structured data, there have been a lot of implementations for unstructured (or semi-structured) data use cases.

Regardless, the principle of data engineering remains the same: to build an efficient platform that stores, manages, and processes data effectively.

Back to the basics.

When I first learned about data engineering, I imagined that we were going to work with millions of streaming records and social media posts from all over the world. And I got intimidated. The truth is, some of us might still be working on less advanced projects, and they can still make a great impact on an organization.

If you are an expert (or at a certain level of fluency) with spreadsheets, you're off to a great start! Most of the time, we are still working with structured data. It could come from the Excel files where users store data and conduct analysis, or from the traditional database management systems where all transactions are recorded.

There is no denying that enterprises are still using relational database management systems, whether in the cloud or on-premises. The entity relationships can be modeled and designed to reflect the transactions and day-to-day operations. It is convenient to record, modify, and track structured data in a database management system, and we can extract the data with SQL queries.
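To make the idea of modeling entity relationships a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the sample rows are all made up for illustration; a real schema would follow your own business processes.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless asked to.
conn.execute("PRAGMA foreign_keys = ON")

# Two hypothetical entities: one customer can place many orders.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Budi")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Join the two tables to follow the relationship from each order to its customer.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)
```

The foreign key on `orders.customer_id` is what encodes the "one customer, many orders" relationship, and the join is how a query follows it.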

Understanding entity relationships, database modeling, and querying data with SQL is the first crucial step to getting into data, not only as an engineer but also as an analyst. From a simple `select` statement to the advanced use of common table expressions, all of these serve as the foundation for working with data.
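As a rough illustration of that range, the sketch below runs a plain `select` and then a common table expression (CTE) against a tiny in-memory SQLite database. The `sales` table and its values are hypothetical, and I run the SQL through Python's `sqlite3` module only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 30.0), ("south", 20.0)],
)

# A simple select: filter and sort rows directly.
big_sales = conn.execute(
    "SELECT region, amount FROM sales WHERE amount >= 50 ORDER BY amount DESC"
).fetchall()

# A common table expression: name an intermediate result, then query it.
above_average = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > (SELECT AVG(total) FROM region_totals)
""").fetchall()

print(big_sales)
print(above_average)
```

The CTE lets the query reuse `region_totals` twice (once to list totals, once to compute their average) without repeating the aggregation.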

Early in my career working with data, about 60 to 70 percent of my time went to database modeling and SQL queries. The rest went to using ETL tools (which will be discussed in another post) and understanding the business processes. And those skills still play a crucial role in the work that I do as a data engineer.

Choose your language.

"Choose" might not be the right term because, in the end, we have to keep up with demand. To get started, however, it might be a good idea to pick one programming language to focus on: Python, R, or perhaps Java if you find it interesting. I was introduced to Python when I joined my company. So yes, technically I did not make the choice, but it was one of the best things that ever happened to me!

Python is one of the most popular choices for data engineers. You can find a lot of resources and a community that can help you as you learn Python. In my personal experience, it is relatively easy to learn, and the code is clean and readable. I can't speak for the other programming languages.

In Python, the libraries I have used most frequently so far are pandas and NumPy, along with some other tools such as requests, psycopg2, SQLAlchemy, etc. pandas at its core is a powerful Python tool to manipulate and analyze data. In some cases, I also use NumPy for data analysis and some data manipulation as well, but technically NumPy is more focused on multidimensional array objects.
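Here is a small sketch of how the two libraries tend to divide the work, assuming both are installed. The order data and the 10% tax rate are invented for the example; in practice the DataFrame might come from a CSV file or a SQL query.

```python
import numpy as np
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame(
    {
        "customer": ["Ana", "Ana", "Budi", "Citra"],
        "amount": [25.0, 40.0, 15.0, 60.0],
    }
)

# pandas: label-based manipulation, here a group-by aggregation.
totals = orders.groupby("customer")["amount"].sum()

# NumPy: vectorized math on the underlying array, with no explicit loop.
amounts = orders["amount"].to_numpy()
with_tax = np.round(amounts * 1.1, 2)  # the 10% rate is made up for illustration

print(totals)
print(with_tax)
```

pandas handles the labeled, table-like part (columns, grouping), while NumPy does the array arithmetic underneath.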

Here's the caveat: the tools and libraries mentioned above are just a few of many. Since they are the most common, they are perhaps the perfect place to start. Find a few small projects that would enhance your Python skills with pandas, NumPy, or maybe a little bit of requests and SQLAlchemy.

There are a lot of tools and libraries out there, and depending on your use cases, you might need them! It is undeniably important to keep our minds open and to learn new things. It might sound challenging, but believe me, it won't be so hard, thanks to search engines!

Eventually, the more we learn, the better we'll get. Even the most expert programmers still have a lot to learn. And yes, experience is the best teacher there is. If you encounter some errors, it's okay to ask around!

Organize semi-structured data.

After a lot of practice and gaining enough confidence working with SQL and Python, great! Perhaps the next interesting step would be to get our hands dirty with less structured data, such as JSON and XML. XML uses tags to define the data for each attribute. Many are still using XML, but JSON is becoming more popular because it is more straightforward and easier to read.
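To see the difference side by side, the sketch below parses the same made-up customer record once from XML and once from JSON, using only Python's standard library (`xml.etree.ElementTree` and `json`). The record and its field names are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, first as XML and then as JSON.
xml_text = """
<customer id="1">
    <name>Ana</name>
    <city>Jakarta</city>
</customer>
""".strip()
json_text = '{"id": 1, "name": "Ana", "city": "Jakarta"}'

# XML: walk the tree; attributes and element text both arrive as strings.
root = ET.fromstring(xml_text)
from_xml = {
    "id": int(root.get("id")),
    "name": root.findtext("name"),
    "city": root.findtext("city"),
}

# JSON: maps straight onto Python dicts, lists, strings, and numbers.
from_json = json.loads(json_text)

print(from_xml)
print(from_json)
```

With XML we have to pick out each tag and convert types ourselves, while the JSON version is usable right after `json.loads`.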

JSON has been widely used for its flexibility: it allows multiple levels of nesting for complex data structures, and it is a readable and effective format for organizing data. Yet it can become tricky when we need to define and extract each grain of the data to perform our own calculations. For example, expect to see a lot of JSON when working with APIs, whether posting data to them or getting data from them.

Try writing some code that manipulates data from JSON or XML. There are a lot of resources out there, from simple JSON data to more complex structures with a lot of nested data. I would recommend extracting the data with the language of your choice, then creating another JSON file in a more complex format.
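One way that exercise could look in Python is sketched below. The nested payload is a hypothetical API response, and `flatten` is a helper I wrote for this example, not a library function. It only flattens nested objects and leaves lists alone, and the result is printed rather than written to a file to keep the sketch short.

```python
import json

# A hypothetical API response with nested objects and a list.
raw = """
{
  "customer": {"id": 1, "name": "Ana", "address": {"city": "Jakarta"}},
  "orders": [
    {"order_id": 10, "amount": 25.0},
    {"order_id": 11, "amount": 40.0}
  ]
}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as-is."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


data = json.loads(raw)

# Extract: one flat row per order, repeating the customer fields on each row.
customer = flatten(data["customer"], prefix="customer.")
rows = [{**customer, **order} for order in data["orders"]]

# Reshape: build a different nested structure and serialize it back to JSON.
summary = {
    "customer": data["customer"]["name"],
    "order_count": len(rows),
    "total_amount": sum(row["amount"] for row in rows),
    "orders": rows,
}
print(json.dumps(summary, indent=2))
```

Swapping `print` for `json.dump(summary, f, indent=2)` inside a `with open(...)` block would write the new structure to a file instead.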

Let's conclude.

Understanding database development with SQL, writing code (specifically with data engineering toolkits), and working with complex semi-structured data should provide the essential foundations of data engineering. In fact, mastering these skills is often sufficient to get things done as a data engineer.

There are plenty of tools out there, and information overload might give you a headache. In the end, not all of them will be relevant for a business. The good news is, no one needs to master all of them.

But it is important to keep our minds open and to learn continuously. One organization might use different tools and implement a different data strategy from another, which means another learning curve, and we might be expected to adapt quickly.

In Part 2, we'll discuss the most popular data engineering tools out there and the most common data engineering projects. Stay tuned!