Data engineering is a fast-growing field with no shortage of cloud platforms to build projects on. But how do you choose between options like AWS, GCP, and Azure as a beginner?
In a recent podcast, host Mohammad Arshad sat down with data engineering expert Pooja Jain to get her insight. With over 5 years of experience across banking, healthcare, and ecommerce, Pooja has worked on everything from predictive modeling to building data pipelines.
Key Factors to Consider
When evaluating cloud platforms, Pooja explains that the most important factor is identifying the specific data challenges you need to solve:
“Our focus should be more towards the activities. Let’s say I want to do the orchestration, or I want to store the data, or I want to do cleaning, or I want to focus more on streaming data. Each platform has its own set of services.”
Some key considerations include:
- Type of data you need to process: batch, streaming, structured, unstructured, etc.
- Tools and services required for pipeline orchestration, ETL, data storage, etc.
- Cost optimization – pricing and free tier services differ across platforms
- Ease of use – GCP tends to have a gentler learning curve according to Pooja
- Team skills & experience – leverage existing knowledge
While Pooja has worked extensively with both AWS and GCP, she explains that:
“I feel more comfortable working with GCP; it is the one we are always comfortable with.”
But ultimately the right platform depends on the architecture and data infrastructure needed for your specific project.
Must-Have Skills for Aspiring Data Engineers
Besides just knowing Python and SQL, Pooja emphasizes the importance of conceptual knowledge:
“They have to understand the basic concepts of data engineering: the concepts of Big Data, how the Hadoop ecosystem has evolved, why we are moving towards cloud, and what challenges we are facing.”
Here are some key skills she recommends focusing on:
- Relational databases and SQL
- Non-relational (NoSQL) databases
- ETL vs ELT pipelines (see the sketch below)
- Data warehousing and data lakes
- Hadoop ecosystems
- Cloud migration challenges
Understanding these data engineering concepts will ensure you can design the right solutions.
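To make the ETL vs ELT distinction concrete, here is a minimal Python sketch. Everything in it is illustrative: the input file, the table names, and the `load_row` / `run_sql` helpers are hypothetical stand-ins for whatever storage and warehouse you use.

```python
import csv

# ETL: transform in your own code *before* loading into the warehouse.
def etl(path, load_row):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row["email"] = row["email"].strip().lower()  # transform first...
            load_row("users_clean", row)                 # ...then load

# ELT: load the raw rows as-is, then transform *inside* the warehouse,
# typically with SQL (this is where a tool like dbt fits in).
def elt(path, load_row, run_sql):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            load_row("users_raw", row)                   # load first...
    run_sql("""
        CREATE TABLE users_clean AS
        SELECT LOWER(TRIM(email)) AS email FROM users_raw
    """)                                                 # ...transform later
```

The trade-off to notice: ETL keeps only clean data in the warehouse but couples transformation to your pipeline code, while ELT keeps the raw history and pushes transformation to where the compute and SQL tooling live.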
Open Source Tools to Build Your Skills
While cloud platforms hide a lot of this complexity behind managed services, Pooja suggests leveraging open source tools like:
- Apache Spark for processing huge datasets (see the sketch after this list)
- Kafka for streaming pipelines
- Airflow for pipeline orchestration
- dbt for data transformation
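As a taste of what a first step with these tools looks like, here is a minimal batch aggregation in PySpark (assuming `pip install pyspark`; the CSV file and column names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("first-spark-job").getOrCreate()

# Read a batch dataset; Spark infers column types from the header and data.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Aggregate: total revenue per country, largest first.
(orders.groupBy("country")
       .agg(F.sum("amount").alias("revenue"))
       .orderBy(F.desc("revenue"))
       .show())
```

The same API scales from a CSV on your laptop to a multi-node cluster, which is exactly why Spark shows up so often in job descriptions.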
Open source tools give beginners:
- Exposure to common industry frameworks
- Ability to replicate real-world engineering challenges
- Flexibility to use across any platform or infrastructure
- Hands-on practice applying engineering concepts learned
Starting with these tools allows young professionals to showcase relevant skills when applying for roles.
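For the orchestration piece specifically, a first Airflow pipeline can be small. Below is a minimal sketch using the TaskFlow API from Airflow 2.4+; the task bodies are placeholders for real extract, transform, and load logic:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_pipeline():
    @task
    def extract():
        # Placeholder: pull rows from an API, file drop, or database.
        return [{"amount": "10.5"}, {"amount": "3.2"}]

    @task
    def transform(rows):
        # Placeholder: clean and aggregate the extracted rows.
        return sum(float(r["amount"]) for r in rows)

    @task
    def load(total):
        # Placeholder: write the result to a warehouse table.
        print(f"daily total: {total}")

    load(transform(extract()))  # Airflow infers task dependencies from this

daily_pipeline()
```

Dropping a file like this into Airflow’s `dags/` folder gives you scheduling, retries, and a monitoring UI for free, which is the kind of production concern interviewers like to see beginners understand.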
Getting Started with Data Engineering Projects
We all know working on projects is one of the fastest ways to skill up. So where should beginners focus their efforts?
Pooja outlines three key phases of any data engineering project (see the sketch after the list):
- Ingesting and collecting data (extract/load)
- Processing and transforming data (transform)
- Consuming clean data for reporting, analytics or ML models (visualize/predict)
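Those three phases map directly onto code. Here is a single-file sketch using pandas (assuming pandas and pyarrow are installed; the file and column names are hypothetical):

```python
import pandas as pd

# Phase 1: ingest -- land the source data in your raw zone unchanged.
raw = pd.read_csv("open_dataset.csv")  # placeholder source file
raw.to_parquet("events_raw.parquet")

# Phase 2: process/transform -- fix types, drop bad and duplicate rows.
df = pd.read_parquet("events_raw.parquet")
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
clean = df.dropna(subset=["ts"]).drop_duplicates()
clean.to_parquet("events_clean.parquet")

# Phase 3: consume -- reports, dashboards, or ML models read the clean data.
daily_counts = clean.set_index("ts").resample("D").size()
print(daily_counts.tail())
```

In a real portfolio project each phase would grow into its own orchestrated task, but the shape stays the same.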
Open datasets provide great fodder for mock projects. But Pooja offers an alternative idea for scenarios where additional data is needed:
“There are two ways: one is you can utilize the existing datasets and get them into your raw zone… the other is to just utilize the open source tools and technologies and then try to do [it].”
Rather than focusing on the analytical insights themselves, she suggests showcasing your ability to:
- Build a reliable, scalable pipeline
- Handle various data formats
- Perform ETL operations with open source tools
- Follow best practices around monitoring, testing, and validation (see the sketch below)
This develops expertise needed to shine in any data engineering interview.
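On that last point, even lightweight validation makes a portfolio pipeline stand out. Here is a minimal sketch in plain Python, with illustrative column names and thresholds, checking the cleaned file from the earlier sketch:

```python
import pandas as pd

def validate(df):
    """Return a list of human-readable data-quality failures."""
    failures = []
    if df.empty:
        failures.append("no rows loaded")
    if df.duplicated().any():
        failures.append("duplicate rows after cleaning")
    if "amount" in df.columns and (df["amount"] < 0).any():
        failures.append("negative amounts")
    null_rate = df["ts"].isna().mean()
    if null_rate > 0.01:  # illustrative 1% threshold
        failures.append(f"ts null rate {null_rate:.1%} exceeds 1%")
    return failures

failures = validate(pd.read_parquet("events_clean.parquet"))
if failures:
    raise ValueError("; ".join(failures))  # fail the run loudly, not silently
```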
Using Generative AI as an Asset
No conversation about the future of technology is complete without discussing red-hot trends like generative AI and ChatGPT. So what impact do tools like these have?
“Generative AI is not something separate. You can use it as a tool to do whatever you are doing in a better manner. That is the starting point.”
Mohammad points out that relying on AI alone eliminates the practice that is critical for building expertise. But Pooja sees it becoming an asset that eliminates rote tasks:
“It gives a very good, compact set of services that we can use. We don’t have to exclusively code or do everything ourselves.”
The key is finding opportunities for generative AI to augment human creativity rather than replace it completely.
Big Data Skills Still Needed for Cloud Migration
It’s easy to dismiss legacy ecosystems like Hadoop and MapReduce as dated. But Pooja argues that much of that skill set remains relevant today:
“If a person doesn’t know Big Data, the Hadoop systems, HDFS, Hive, and all those things, how are they going to move it to the cloud, to do the data migration? There used to be easy workflows at that time.”
Understanding these frameworks helps with challenges like:
- Knowing limitations of data pipelines
- Replicating data architecture in the cloud
- Optimization for specific use cases
So while the specifics of coding those algorithms are less relevant, the architectural concepts around handling big data are still enormously important.
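One concrete example of that carry-over: in Spark, moving a job from on-prem HDFS to cloud object storage is often little more than a change of storage URI (given the right connector is installed), while the partitioning and file-format decisions learned on Hadoop transfer as-is. The paths below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migration-demo").getOrCreate()

# On-prem Hadoop: the data lives in HDFS.
events = spark.read.parquet("hdfs://namenode:8020/warehouse/events/")

# After migration: the same job, with the storage URI swapped for the cloud.
# events = spark.read.parquet("gs://my-bucket/warehouse/events/")   # GCP (GCS)
# events = spark.read.parquet("s3a://my-bucket/warehouse/events/")  # AWS (S3)

# Architectural choices (partition column, columnar format) carry over intact.
events.write.mode("overwrite").partitionBy("event_date").parquet("out/events/")
```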
Growing Your Audience on LinkedIn
With over 25,000 followers, Pooja has seen great success using LinkedIn to share data engineering knowledge. So what lessons can help beginners find their own audience?
“It’s important to consume the content. It’s important to understand what is actually happening on the planet, what it is that the experienced professionals are actually trying to convey.”
She suggests three simple steps to get started:
- Identify relevant professionals to follow and engage with. Comment thoughtfully on their posts when you have something useful to contribute.
- Share articles or posts you come across that provide value to the community.
- Summarize key learnings or insights to start establishing your expertise.
Rather than contributing noise, focus on understanding community needs first.
The Importance of Patience and Consistency
Like any skill, building influence takes time and focused effort. Pooja closes with an important reminder for anyone starting their data journey:
“We have to give it some time and stay consistent. We cannot expect immediate results.”
Stay motivated by finding little ways to add value every single day, whether that means:
- Helping someone new learn concepts you know
- Testing out new tools for first-hand experience
- Building your online credibility through engagement
Analyzing data teaches the importance of systems thinking. Apply that mindset to your own career growth by prioritizing consistent progress over quick hacks.
Start Your Data Engineering Journey Today
Mohammad and Pooja covered several key concepts relevant to anyone getting started in data engineering. Hopefully their insights have piqued your interest!
Here are three simple action steps you can take right away:
- Join online communities centered around big data, cloud platforms, AI etc. Follow professionals like Pooja actively sharing their expertise.
- Identify an open dataset that aligns with your industry interests. Use the tools listed in this article to start building a portfolio project that showcases core data skills.
- Set up a LinkedIn profile to establish your personal brand as you continue learning.
What resonated with you most from today’s conversation? Share your key takeaways in the comments!