Role: Python & PySpark Data Engineer
Overview
We are looking for a Data Engineer with strong expertise in Python and PySpark to design, build, and optimize scalable data pipelines. You will work with large datasets, distributed systems, and cloud platforms to enable data-driven decision-making.
Key Responsibilities
1. Data Pipeline Development
- Design and build ETL/ELT pipelines using Python and PySpark (a minimal sketch follows this list)
- Process large-scale structured and unstructured data
- Ensure high performance and reliability of data workflows
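As a concrete illustration, the sketch below shows what a minimal batch ETL pipeline in PySpark might look like. The bucket paths, column names, and cleansing rules are hypothetical placeholders, not a prescribed design:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw CSV files (path and schema are placeholders)
raw = spark.read.option("header", True).csv("s3://raw-bucket/orders/")

# Transform: type the amount column, drop malformed rows, stamp a load date
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull())
       .withColumn("load_date", F.current_date())
)

# Load: write partitioned Parquet to a curated zone
clean.write.mode("overwrite").partitionBy("load_date").parquet("s3://curated-bucket/orders/")
```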
2. Big Data Processing
- Use Apache Spark (especially PySpark) for distributed data processing
- Optimize Spark jobs (partitioning, caching, joins, etc.) as shown in the tuning sketch below
- Handle batch and near real-time data processing
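In practice, these tuning levers often combine as in this sketch; the table names, relative sizes, and partition count are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning_demo").getOrCreate()

events = spark.read.parquet("s3://lake/events/")        # assumed large fact-like table
countries = spark.read.parquet("s3://lake/countries/")  # assumed small lookup table

# Repartition by the join key so the shuffle is evenly distributed
events = events.repartition(200, "country_code")

# Cache a DataFrame that several downstream actions will reuse
events.cache()

# Broadcast the small side of the join to avoid shuffling the large table
enriched = events.join(F.broadcast(countries), "country_code")

daily = enriched.groupBy("country_name", "event_date").count()
daily.explain()  # inspect the physical plan before running at scale
```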
3. Data Integration
- Ingest data from multiple sources: APIs, databases, flat files, streaming systems
- Work with tools like Apache Kafka for real-time pipelines (see the streaming sketch after this list)
- Ensure data consistency and integrity
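A hedged sketch of Kafka ingestion with Spark Structured Streaming follows. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic, message schema, and output paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_ingest").getOrCreate()

# Assumed schema for the JSON message payload
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
         .option("subscribe", "orders")                     # placeholder topic
         .load()
)

# Kafka delivers values as bytes; parse the JSON payload into typed columns
parsed = (
    stream.select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*")
)

query = (
    parsed.writeStream.format("parquet")
          .option("path", "s3://lake/orders_stream/")
          .option("checkpointLocation", "s3://lake/_checkpoints/orders/")
          .start()
)
query.awaitTermination()  # block until the stream is stopped
```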
4. Data Modeling & Storage
- Design scalable data models (star/snowflake schemas); a star-schema query is sketched after this list
- Work with:
- Data lakes (e.g., Amazon S3)
- Data warehouses (e.g., Snowflake, Amazon Redshift)
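To make the modeling point concrete, here is a sketch of a typical star-schema query in PySpark: join the fact table to its dimensions on surrogate keys, then aggregate. Table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star_schema_demo").getOrCreate()

# Hypothetical star schema: one fact table keyed to two dimension tables
fact_sales = spark.read.parquet("s3://warehouse/fact_sales/")
dim_product = spark.read.parquet("s3://warehouse/dim_product/")
dim_date = spark.read.parquet("s3://warehouse/dim_date/")

# Join the fact to its dimensions, then aggregate revenue
revenue_by_category_year = (
    fact_sales.join(dim_product, "product_key")
              .join(dim_date, "date_key")
              .groupBy("category", "year")
              .sum("revenue")
)
revenue_by_category_year.show()
```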
5. Performance Optimization
- Tune SQL queries and Spark jobs
- Optimize memory usage and job execution time
- Implement efficient partitioning and indexing strategies
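One common optimization is partition pruning: lay the data out by a frequently filtered column so Spark reads only the matching directories. A minimal sketch, with hypothetical paths, column names, and dates:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition_pruning_demo").getOrCreate()

df = spark.read.parquet("s3://lake/events_raw/")  # assumes an event_date column

# Write the data physically partitioned by date so readers can prune partitions
df.write.mode("overwrite").partitionBy("event_date").parquet("s3://lake/events/")

# A filter on the partition column lets Spark scan only matching directories
recent = spark.read.parquet("s3://lake/events/").filter(F.col("event_date") == "2024-01-01")
recent.explain()  # the plan's PartitionFilters entry confirms the pruning
```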
6. Cloud & DevOps
- Work on cloud platforms like:
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- Build CI/CD pipelines for data workflows (a CI-friendly pipeline test is sketched below)
- Use containerization tools like Docker
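One way CI/CD applies to data workflows is unit-testing transformations against a local SparkSession inside the build container. A sketch, with a hypothetical transformation under test:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_total(df):
    # Hypothetical transformation under test: total = price * quantity
    return df.withColumn("total", F.col("price") * F.col("quantity"))

def test_add_total():
    # local[*] needs no cluster, so the test can run in a CI container
    spark = SparkSession.builder.master("local[*]").appName("ci_test").getOrCreate()
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_total(df).collect()[0]["total"] == 6.0
    spark.stop()

if __name__ == "__main__":
    test_add_total()
```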
7. Data Quality & Governance
- Implement validation checks and monitoring (see the sketch at the end of this list)
- Ensure data accuracy, lineage, and governance
- Work with logging and alerting systems
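Validation checks can be expressed in plain PySpark with standard logging, as in this sketch; the dataset path, key column, and row-count threshold are assumptions:

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq_checks")

spark = SparkSession.builder.appName("dq_demo").getOrCreate()
df = spark.read.parquet("s3://curated-bucket/orders/")  # placeholder path

# Check 1: the key column must not contain nulls
null_ids = df.filter(F.col("order_id").isNull()).count()
if null_ids > 0:
    log.error("%d rows have a null order_id", null_ids)
    raise ValueError("null primary keys detected")

# Check 2: row count must clear a hypothetical minimum threshold
row_count = df.count()
if row_count < 1000:
    log.warning("row count %d is below the expected minimum of 1000", row_count)

log.info("data quality checks passed: %d rows", row_count)
```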
Required Skills
Core Technical Skills
- Strong programming in Python
- Expertise in PySpark / Apache Spark
- Advanced SQL knowledge
- Experience with distributed computing
Big Data & Tools
- Hands-on experience with:
- Hadoop ecosystem
- Apache Hive
- Apache Airflow
Data Engineering Concepts
- ETL/ELT design
- Data warehousing & modeling
- Batch vs. streaming architectures
Cloud & Storage
- Experience with cloud data services (S3, BigQuery, ADLS, etc.)
- Understanding of data lake architecture
Preferred / Nice-to-Have Skills
- Real-time processing (Kafka, Spark Streaming)
- Knowledge of Delta Lake or Apache Iceberg
- Experience with Databricks
- Basic understanding of machine learning pipelines
- Familiarity with DevOps tools (CI/CD, Terraform)