Big Data (PySpark) Tech Lead

Long Finch Technologies

Irving, TX / Jacksonville, FL / Jersey City, NJ

Posted On: Dec 04, 2024

Job Overview

Job Type: Full-time
Experience: 10 - 25 Years
Salary: $120,000 - $150,000 Per Year
Work Arrangement: On-Site
Travel Requirement: 0%

Required Skills

  • Big Data
  • Hadoop
  • SQL
  • PySpark
  • ETL
  • Unix
  • Apache Kafka
Job Description
Roles and Responsibilities
  • Design, build, and unit test applications using the Spark framework with Python.
  • Develop PySpark-based applications for both batch and streaming data processing (a minimal batch sketch follows this list).
  • Optimize the performance of Spark applications on Hadoop by tuning the SparkContext, Spark SQL, DataFrames, and pair RDDs. Choose the right native Hadoop file formats (Avro, Parquet, ORC) and compression codecs for optimal data access.
  • Design and develop real-time data applications using Apache Kafka and Spark Streaming to support dynamic data processing needs (see the streaming sketch after this list).
  • Develop and execute data pipeline testing processes, validating business rules and ensuring data quality (see the testing sketch after this list).
  • Build integrated solutions using Unix shell scripting, RDBMS, Hive, HDFS File System, and HDFS file types. Implement data tokenization libraries for column-level obfuscation and integration with Hive and Spark.
  • Process and manage large volumes of structured and unstructured data, integrating data from multiple sources to create cohesive data solutions.
  • Create and maintain automated integration and regression testing frameworks using Jenkins, integrated with Bitbucket and/or Git repositories.
  • Participate actively in the Agile development process, communicate issues and bugs during scrum meetings, and document project developments.
  • Develop and review comprehensive technical documentation for all delivered artifacts.
  • Solve complex data-driven scenarios, troubleshoot defects, and address production issues effectively.
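As an illustration of the batch work described above, here is a minimal, hypothetical PySpark sketch that reads Parquet from HDFS, applies DataFrame transformations, and writes compressed, partitioned Parquet back out. The paths, column names, and configuration values are assumptions for illustration only, not details from this posting.

```python
# Hypothetical batch ETL sketch; paths and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("batch-etl-sketch")
    .config("spark.sql.shuffle.partitions", "200")  # tune for cluster size
    .getOrCreate()
)

# Read source data from HDFS (placeholder path)
raw = spark.read.parquet("hdfs:///data/raw/transactions")

# Example transformation: filter, derive a date column, aggregate
daily = (
    raw.filter(F.col("amount") > 0)
       .withColumn("txn_date", F.to_date("txn_ts"))
       .groupBy("txn_date", "account_id")
       .agg(F.sum("amount").alias("daily_total"))
)

# Write columnar output with a splittable compression codec
(daily.write
      .mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("txn_date")
      .parquet("hdfs:///data/curated/daily_totals"))

spark.stop()
```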
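For the real-time responsibility, here is a minimal Spark Structured Streaming sketch that consumes a Kafka topic, parses JSON payloads, and lands micro-batches in HDFS. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic name, schema, and paths are placeholders.

```python
# Hypothetical Kafka -> Spark Structured Streaming sketch; all names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

payload_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "transactions")
         .option("startingOffsets", "latest")
         .load()
)

# Kafka delivers key/value as binary; decode and parse the JSON value
parsed = (
    events.select(F.from_json(F.col("value").cast("string"), payload_schema).alias("e"))
          .select("e.*")
)

query = (
    parsed.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/streaming/transactions")
          .option("checkpointLocation", "hdfs:///checkpoints/transactions")
          .outputMode("append")
          .start()
)

query.awaitTermination()
```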
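And for the pipeline-testing responsibility, a small pytest-style data-quality check against a pipeline output, validating a business rule and a null-key constraint. The table path, columns, and rules are illustrative assumptions.

```python
# Hypothetical data-quality test sketch; path, columns, and rules are illustrative.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("dq-tests").getOrCreate()

def test_daily_totals_quality(spark):
    df = spark.read.parquet("hdfs:///data/curated/daily_totals")  # placeholder path

    # Business rule: aggregated totals must be non-negative
    assert df.filter("daily_total < 0").count() == 0

    # Data quality: grouping keys must never be null
    assert df.filter("account_id IS NULL OR txn_date IS NULL").count() == 0
```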

Position Requirements
  • 10+ years of experience in data management, data lakes, and data warehouse development.
  • 6+ years of experience with Hadoop, Hive, Sqoop, SQL, and Teradata.
  • 6+ years of hands-on experience with PySpark (Python and Spark) and Unix.
  • Knowledge of industry-leading ETL processes is a plus.
  • Experience in the banking domain is highly desirable.
  • Expertise in optimizing Spark applications and data access.
  • Proven experience in building real-time data solutions using Apache Kafka and Spark Streaming.
  • Ability to work with various data storage and processing technologies, including HDFS, Hive, and RDBMS.
  • Strong experience in creating automated testing frameworks and continuous integration using Jenkins, Bitbucket, and/or Git.
  • Demonstrated ability to triage complex data issues and production problems.
  • Strong written and verbal communication skills for documentation and collaboration with team members and stakeholders.

Job ID: LF240511


Posted By

Andy

HR Manager