Role: Python & PySpark Data Engineer
Overview
We are looking for a Data Engineer with strong expertise in Python and PySpark to design, build, and optimize scalable data pipelines. You will work with large datasets, distributed systems, and cloud platforms to enable data-driven decision-making.
Key Responsibilities
1. Data Pipeline Development
- Design and build ETL/ELT pipelines using Python and PySpark (a minimal sketch follows this list)
- Process large-scale structured and unstructured data
- Ensure high performance and reliability of data workflows
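As a concrete illustration, the sketch below shows what a minimal batch ETL pipeline in PySpark might look like. The bucket paths, column names, and cleansing rules are hypothetical placeholders, not a prescribed design:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw CSV files (path and schema are placeholders)
raw = spark.read.option("header", True).csv("s3://raw-bucket/orders/")

# Transform: type the amount column, drop malformed rows, stamp a load date
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull())
       .withColumn("load_date", F.current_date())
)

# Load: write partitioned Parquet to a curated zone
clean.write.mode("overwrite").partitionBy("load_date").parquet("s3://curated-bucket/orders/")
```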
2. Big Data Processing
- Use Apache Spark (especially PySpark) for distributed data processing
- Optimize Spark jobs (partitioning, caching, joins, etc.) as shown in the tuning sketch below
- Handle batch and near real-time data processing
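In practice, these tuning levers often combine as in this sketch; the table names, relative sizes, and partition count are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning_demo").getOrCreate()

events = spark.read.parquet("s3://lake/events/")        # assumed large fact-like table
countries = spark.read.parquet("s3://lake/countries/")  # assumed small lookup table

# Repartition by the join key so the shuffle is evenly distributed
events = events.repartition(200, "country_code")

# Cache a DataFrame that several downstream actions will reuse
events.cache()

# Broadcast the small side of the join to avoid shuffling the large table
enriched = events.join(F.broadcast(countries), "country_code")

daily = enriched.groupBy("country_name", "event_date").count()
daily.explain()  # inspect the physical plan before running at scale
```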
3. Data Integration
- Ingest data from multiple sources: APIs, databases, flat files, streaming systems
- Work with tools like Apache Kafka for real-time pipelines (see the streaming sketch after this list)
- Ensure data consistency and integrity
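A hedged sketch of Kafka ingestion with Spark Structured Streaming follows. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, message schema, and output paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_ingest").getOrCreate()

# Assumed schema for the JSON message payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
         .option("subscribe", "orders")                     # placeholder topic
         .load()
)

# Kafka delivers values as bytes; parse the JSON payload into typed columns
parsed = (
    stream.select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*")
)

query = (
    parsed.writeStream.format("parquet")
          .option("path", "s3://lake/orders_stream/")
          .option("checkpointLocation", "s3://lake/_checkpoints/orders/")
          .start()
)
query.awaitTermination()  # block until the stream is stopped
```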
4. Data Modeling & Storage
- Design scalable data models (star/snowflake schemas); a star-schema query is sketched after this list
- Work with:
- Data lakes (e.g., Amazon S3)
- Data warehouses (e.g., Snowflake, Amazon Redshift)
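To make the modeling point concrete, here is a sketch of a typical star-schema query in PySpark: join the fact table to its dimensions on surrogate keys, then aggregate. Table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star_schema_demo").getOrCreate()

# Hypothetical star schema: one fact table keyed to two dimension tables
fact_sales = spark.read.parquet("s3://warehouse/fact_sales/")
dim_product = spark.read.parquet("s3://warehouse/dim_product/")
dim_date = spark.read.parquet("s3://warehouse/dim_date/")

# Join the fact to its dimensions, then aggregate revenue
revenue_by_category_year = (
    fact_sales.join(dim_product, "product_key")
              .join(dim_date, "date_key")
              .groupBy("category", "year")
              .sum("revenue")
)
revenue_by_category_year.show()
```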
5. Performance Optimization
- Tune SQL queries and Spark jobs
- Optimize memory usage and job execution time
- Implement efficient partitioning and indexing strategies
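One common optimization is partition pruning: lay the data out by a frequently filtered column so Spark reads only the matching directories. A minimal sketch, with hypothetical paths, column names, and dates:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition_pruning_demo").getOrCreate()

df = spark.read.parquet("s3://lake/events_raw/")  # assumes an event_date column

# Write the data physically partitioned by date so readers can prune partitions
df.write.mode("overwrite").partitionBy("event_date").parquet("s3://lake/events/")

# A filter on the partition column lets Spark scan only matching directories
recent = spark.read.parquet("s3://lake/events/").filter(F.col("event_date") == "2024-01-01")
recent.explain()  # the plan's PartitionFilters entry confirms the pruning
```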
6. Cloud & DevOps
- Work on cloud platforms like:
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- Build CI/CD pipelines for data workflows (a CI-friendly pipeline test is sketched below)
- Use containerization tools like Docker
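One way CI/CD applies to data workflows is unit-testing transformations against a local SparkSession inside the build container. A sketch, with a hypothetical transformation under test:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_total(df):
    # Hypothetical transformation under test: total = price * quantity
    return df.withColumn("total", F.col("price") * F.col("quantity"))

def test_add_total():
    # local[*] needs no cluster, so the test can run in a CI container
    spark = SparkSession.builder.master("local[*]").appName("ci_test").getOrCreate()
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_total(df).collect()[0]["total"] == 6.0
    spark.stop()

if __name__ == "__main__":
    test_add_total()
```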
7. Data Quality & Governance
- Implement validation checks and monitoring (see the sketch at the end of this list)
- Ensure data accuracy, lineage, and governance
- Work with logging and alerting systems
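Validation checks can be expressed in plain PySpark with standard logging, as in this sketch; the dataset path, key column, and row-count threshold are assumptions:

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq_checks")

spark = SparkSession.builder.appName("dq_demo").getOrCreate()
df = spark.read.parquet("s3://curated-bucket/orders/")  # placeholder path

# Check 1: the key column must not contain nulls
null_ids = df.filter(F.col("order_id").isNull()).count()
if null_ids > 0:
    log.error("%d rows have a null order_id", null_ids)
    raise ValueError("null primary keys detected")

# Check 2: row count must clear a hypothetical minimum threshold
row_count = df.count()
if row_count < 1000:
    log.warning("row count %d is below the expected minimum of 1000", row_count)

log.info("data quality checks passed: %d rows", row_count)
```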
Required Skills
Core Technical Skills
- Strong programming in Python
- Expertise in PySpark / Apache Spark
- Advanced SQL knowledge
- Experience with distributed computing
Big Data & Tools
- Hands-on experience with:
- Hadoop ecosystem
- Apache Hive
- Apache Airflow
Data Engineering Concepts
- ETL/ELT design
- Data warehousing & modeling
- Batch vs. streaming architectures
Cloud & Storage
- Experience with cloud data services (S3, BigQuery, ADLS, etc.)
- Understanding of data lake architecture
Preferred / Nice-to-Have Skills
- Real-time processing (Kafka, Spark Streaming)
- Knowledge of Delta Lake or Apache Iceberg
- Experience with Databricks
- Basic understanding of machine learning pipelines
- Familiarity with DevOps tools (CI/CD, Terraform)