Everything you need to get started with data engineering (part 1)

Not too long ago, data-driven companies started to use the phrase "data engineer" as data increased in velocity, volume, value, variety, and veracity (also known as the 5 V's of big data). In a nutshell, these shifts require a broader focus on sophisticated engineering techniques than traditional data storage does.

Traditional database storage holds and manages structured data, such as the tables in a relational database or a spreadsheet. We also have massive amounts of unstructured data to process and analyze, and that's when we start to dive deep into "big data."

The term is relatively new. Some data engineers work in organizations that rely solely on structured data, but many also make heavy use of unstructured (or semi-structured) data in conjunction with structured data. Regardless, the principle of data engineering remains the same: to build efficient platforms that store, manage, and process data effectively.

Back to the basics.

When I first learned about data engineering, I imagined millions of streaming records and social media posts from all over the world, all of which was fascinating but also intimidating. Working with complicated unstructured data might be part of the job description, but most of the time many of us work with structured data from database management systems, and that work can make a great impact on a business.

There is no denying that enterprises still use relational database management systems, whether in the cloud or on-premises. Entity relationships can be modeled and designed to reflect transactions and day-to-day operations. Structured data in a database management system is convenient to record, modify, and track, and we can extract it with SQL queries.
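To make the idea of modeled entity relationships a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all made up for illustration; a real schema would follow your own business processes.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless asked to.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical entities: one customer places many orders (one-to-many).
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")

# Record a few day-to-day transactions.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Budi")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# Extract the data by following the relationship with a join.
rows = conn.execute("""
SELECT c.name, o.order_id, o.amount
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)
```

The foreign key on orders.customer_id is what encodes the relationship: every order has to point at an existing customer.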

Understanding entity relationships, database modeling, and querying data with SQL is the first crucial step to getting into data, not only as an engineer but also as an analyst. From a simple SELECT query to the advanced use of common table expressions, all of it serves as a foundation and greatly benefits anyone who works with data, even before learning a programming language or any other data platform.
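As a rough illustration of that range, the sketch below runs a simple SELECT and then a common table expression against a small, made-up orders table (the table, column names, and values are assumptions for the example). It uses SQLite through Python only because it needs no setup; the SQL itself is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Ana", 25.0), ("Ana", 40.0), ("Budi", 15.0), ("Citra", 90.0)],
)

# A simple SELECT: filter rows by a condition.
simple = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 20 ORDER BY order_id"
).fetchall()

# A common table expression: name an intermediate result (total per customer),
# then query it like a table to keep customers above the average total.
cte_query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals)
ORDER BY total DESC
"""
above_average = conn.execute(cte_query).fetchall()

print(simple)
print(above_average)
```

The CTE keeps the per-customer totals readable and reusable within the query, instead of repeating the same subquery twice.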

Early in my career working with data, about 60 to 70 percent of my time went to database modeling and SQL queries. The rest went to using ETL tools (which we'll discuss in another post) and understanding the business processes. Those skills still play a crucial role in the work that I do as a data engineer.

Choose your language.

"Choose" might not be the right term because, in the end, we might have to keep up with demand. To get started, however, it might be a good idea to pick one language to focus on first. There are a number of languages out there: Python, R, or perhaps Java, if you find it interesting.

I was introduced to Python when I joined my company. So yes, technically I did not make the choice, but it was one of the best things that ever happened to me. Python is one of the most popular choices for data engineers. You can find a lot of resources and a community that can help you as you learn Python. In my personal experience, it is relatively easy to learn, and its code is clean and readable.

Unfortunately, I can't speak for the other programming languages. When I work with Python, the libraries I use most often are pandas and NumPy, along with others such as requests, psycopg2, and SQLAlchemy. pandas, at its core, is a powerful Python tool for manipulating and analyzing data. In some cases, I also use NumPy for data analysis and some data manipulation, although it is more focused on multidimensional array objects.
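Here is a small sketch of that difference in focus, assuming pandas and NumPy are installed. The sales table and its numbers are invented for the example: pandas handles the labeled, table-like data, while NumPy works on a plain multidimensional array.

```python
import numpy as np
import pandas as pd

# pandas: labeled, table-like data (hypothetical sales records).
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 4, 6, 8],
    "unit_price": [2.5, 3.0, 2.5, 3.0],
})

# Derive a new column, then aggregate it per region.
sales["revenue"] = sales["units"] * sales["unit_price"]
revenue_by_region = sales.groupby("region")["revenue"].sum()

# NumPy: a 2 x 3 array, summed down each column.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
column_totals = matrix.sum(axis=0)

print(revenue_by_region)
print(column_totals)
```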

Here's the caveat: the tools and libraries mentioned above are just a few of many, but they are commonly used in data engineering. Perhaps they would be a perfect start. Find a few small projects that would sharpen your Python skills with pandas, NumPy, or maybe a little bit of requests and SQLAlchemy.

There are a lot of tools and libraries out there, and depending on your use cases, you might need them. It is undeniably important to keep an open mind and to learn new things. It might sound challenging, but believe me, it won't be so hard, thanks to search engines!

Eventually, the more we learn, the better we get. Even the most expert programmers still have a lot to learn. And yes, experience is the best teacher there is. If you encounter errors, it's okay to ask around!

Organize semi-structured data.

After a lot of practice, once you have gained enough confidence working with SQL and Python, great! Perhaps the next interesting step is to get our hands dirty with less structured data, such as JSON and XML. XML uses tags to define the data for each attribute. Many still use XML, but JSON is becoming more popular because it is more straightforward and easier to read.
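To see the two formats side by side, here is a minimal sketch that reads the same made-up customer record from XML and from JSON using only Python's standard library. The record and its field names are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, once in XML and once in JSON.
xml_text = """
<customer id="1">
    <name>Ana</name>
    <city>Jakarta</city>
</customer>
""".strip()
json_text = '{"id": 1, "name": "Ana", "city": "Jakarta"}'

# XML: values live in tags and attributes, and everything arrives as text.
root = ET.fromstring(xml_text)
from_xml = {
    "id": int(root.get("id")),
    "name": root.findtext("name"),
    "city": root.findtext("city"),
}

# JSON: parses straight into Python dicts, lists, numbers, and strings.
from_json = json.loads(json_text)

print(from_xml)
print(from_json)
```

Note that the XML side needs an explicit int() conversion, while JSON carries the number type in the format itself.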

JSON is widely used for its flexibility: it allows multiple levels of nesting for complex data structures, and it is a readable and effective format for organizing data. Yet it can become tricky when we need to identify and extract each grain of the data to apply our own calculations. For example, expect a lot of JSON when working with APIs, whether posting data to them or getting data from them.
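As one possible way to get at each grain of a nested document, the sketch below flattens a made-up payload into one row per order item. The payload shape stands in for what an API might return; the keys and values are hypothetical.

```python
import json

# Hypothetical response body: a customer with nested orders and items.
payload = """
{
  "customer": {"id": 1, "name": "Ana"},
  "orders": [
    {"order_id": 10, "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": 11, "items": [{"sku": "A1", "qty": 5}]}
  ]
}
"""
data = json.loads(payload)

# Walk the nesting and emit one flat row per item (the finest grain here).
rows = [
    {
        "customer_id": data["customer"]["id"],
        "order_id": order["order_id"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for order in data["orders"]
    for item in order["items"]
]

# With flat rows, applying our own formula is straightforward.
total_qty = sum(row["qty"] for row in rows)

print(rows)
print(total_qty)
```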

Try writing some code that manipulates data from JSON or XML. There are a lot of resources out there, from simple JSON documents to more complex structures with a lot of nested data. I would recommend extracting the data with the language of your choice, then writing it back out as another JSON file in a more complex format.
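One way to approach the second half of that exercise is sketched below: it takes a few flat, invented records, regroups them into a nested structure, and writes the result to a JSON file. The field names and the orders_nested.json file name are placeholders, and the file goes to a temporary directory so nothing is left in your project.

```python
import json
import tempfile
from pathlib import Path

# Flat, hypothetical records, e.g. the result of an earlier extraction step.
flat_rows = [
    {"customer": "Ana", "order_id": 10, "sku": "A1", "qty": 2},
    {"customer": "Ana", "order_id": 10, "sku": "B7", "qty": 1},
    {"customer": "Budi", "order_id": 12, "sku": "A1", "qty": 5},
]

# Regroup into customer -> order -> list of items.
# Order ids become strings because JSON object keys are always strings.
nested = {}
for row in flat_rows:
    orders = nested.setdefault(row["customer"], {})
    items = orders.setdefault(str(row["order_id"]), [])
    items.append({"sku": row["sku"], "qty": row["qty"]})

# Write the nested structure to a file, then read it back to check it.
out_path = Path(tempfile.mkdtemp()) / "orders_nested.json"
out_path.write_text(json.dumps(nested, indent=2), encoding="utf-8")
reloaded = json.loads(out_path.read_text(encoding="utf-8"))

print(out_path.read_text(encoding="utf-8"))
```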

Let's conclude.

Understanding database development with SQL, writing code (specifically with data engineering toolkits), and working with complex semi-structured data should provide the essential foundations in data engineering. In fact, mastering these skills would be sufficient to get things done as a data engineer.

There are plenty of tools out there, and the information overload might give you a headache. In the end, not all of them will be relevant to a business. The good news is that no one needs to master all of them.

But it is important to keep an open mind and to learn continuously. Different organizations might use different tools and implement different data strategies, which means another learning curve, and we might be expected to adapt quickly.

In Part 2, we'll discuss the most popular data engineering tools out there and the most common data engineering projects. Stay tuned!