The Data Engineer is a key member of the project team responsible for designing, implementing, and maintaining data pipelines, integrating diverse data sources, and optimizing data infrastructure for efficient and secure data management. This role focuses on ensuring data quality, availability, and accessibility, supporting the needs of data scientists, analysts, and other stakeholders throughout the project.
Responsibilities:
Data Pipeline Design: Design, develop, and maintain scalable data pipelines for processing, transformation, and integration of healthcare data from various sources. Ensure data pipelines are efficient, reliable, and capable of handling large-scale datasets.
Data Integration: Collaborate with the owners of internal and external data sources to establish data integration processes. Implement mechanisms for seamless data ingestion, cleansing, and transformation to ensure data quality.
Data Storage Management: Manage data storage solutions, including cloud-based and on-premises databases, data warehouses, and distributed file systems. Optimize data storage for performance, cost, and security.
ETL Processes: Develop and maintain ETL (Extract, Transform, Load) processes to facilitate the movement of data between systems. Ensure the ETL processes are efficient, accurate, and aligned with data requirements.
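As an illustration of the ETL responsibilities described above, a minimal extract/transform/load step might look like the following sketch (record fields such as `patient_id` and `visit_date` are invented for illustration; a real pipeline would read from and write to actual systems rather than in-memory lists):

```python
from datetime import datetime

def extract(raw_rows):
    """Extract: yield source rows as-is (here, from an in-memory list)."""
    yield from raw_rows

def transform(rows):
    """Transform: strip whitespace and normalize dates to ISO 8601."""
    for row in rows:
        yield {
            "patient_id": row["patient_id"].strip(),
            "visit_date": datetime.strptime(row["visit_date"], "%m/%d/%Y")
                          .date().isoformat(),
        }

def load(rows, target):
    """Load: append cleaned rows into the target store (a list here)."""
    for row in rows:
        target.append(row)
    return target

warehouse = []
load(transform(extract([{"patient_id": " P001 ", "visit_date": "07/04/2023"}])),
     warehouse)
```

Keeping extract, transform, and load as separate composable steps, as here, is what makes a pipeline testable and easy to reason about when requirements change.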
Data Governance: Establish and enforce data governance practices, including data lineage, metadata management, and data cataloging. Maintain documentation of data sources, transformations, and data flows.
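The lineage and cataloging practices mentioned above can be sketched with a toy in-memory catalog (dataset names are hypothetical; real projects would use a dedicated data catalog tool rather than a dictionary):

```python
# Toy metadata catalog: maps each dataset to its upstream source
# and a description of the transformation that produced it.
catalog = {}

def register_dataset(name, upstream, transform_desc):
    """Record where a dataset came from and how it was derived."""
    catalog[name] = {"upstream": upstream, "transform": transform_desc}

def lineage(name):
    """Walk upstream links to reconstruct a dataset's lineage chain."""
    chain = [name]
    entry = catalog.get(name)
    while entry:
        chain.append(entry["upstream"])
        entry = catalog.get(entry["upstream"])
    return chain

register_dataset("visits_clean", upstream="ehr_raw",
                 transform_desc="date normalization, whitespace stripping")
register_dataset("visits_agg", upstream="visits_clean",
                 transform_desc="weekly aggregation by clinic")
```

Calling `lineage("visits_agg")` then traces the chain back through `visits_clean` to the raw source, which is exactly the documentation of data flows this responsibility calls for.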
Collaboration with Data Team: Work closely with data scientists, analysts, and other members of the data team to understand data requirements, provide data access, and support their analytical needs.
Performance Optimization: Monitor data pipeline and database performance, identifying bottlenecks and areas for optimization. Implement performance enhancements to ensure fast and efficient data processing.
Security and Privacy: Collaborate with the Information Security Specialist to ensure data security and privacy measures are implemented. Apply access controls, encryption, and other security mechanisms to protect sensitive healthcare data.
Data Quality Assurance: Implement data quality checks, data validation, and data cleansing processes to ensure the accuracy and reliability of healthcare data.
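A data quality check of the kind described above can be as simple as a per-record validator; this sketch assumes hypothetical fields (`patient_id`, `age`) and an illustrative plausibility range:

```python
def validate_record(record, required_fields=("patient_id", "age")):
    """Return a list of data-quality issues found in one record."""
    issues = []
    # Completeness check: required fields must be present and non-empty.
    for field in required_fields:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # Plausibility check: ages outside 0-120 are flagged, not silently kept.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        issues.append(f"age out of range: {age}")
    return issues

clean = {"patient_id": "P001", "age": 42}
dirty = {"patient_id": "", "age": 150}
```

Here `validate_record(clean)` returns an empty list while `validate_record(dirty)` flags both issues; in practice such checks would run at ingestion time so bad records are quarantined before they reach analysts.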
Emerging Technologies: Stay up to date with emerging data engineering technologies, tools, and best practices. Continuously evaluate new technologies that can improve data processing and integration capabilities.
Qualifications:
Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
Proven experience (typically 3+ years) in data engineering roles, preferably with exposure to healthcare or large-scale data projects.
Proficiency in data integration, ETL processes, data transformation, and data pipeline design.
Strong programming skills in languages such as Python, Java, or Scala.
Experience with data storage solutions, such as SQL databases, NoSQL databases, and cloud-based data warehouses (e.g., Amazon Redshift, Google BigQuery).
Familiarity with distributed computing frameworks (e.g., Apache Spark, Hadoop) and data processing tools.
Understanding of data governance, metadata management, and data quality assurance practices.
Excellent problem-solving skills, capable of troubleshooting and optimizing data pipelines.
Knowledge of data privacy regulations (e.g., HIPAA) and ethical considerations in data engineering.
Strong collaboration skills to work effectively with cross-functional teams and stakeholders.
Ability to manage multiple tasks and prioritize responsibilities to meet project timelines.