Architected and optimized Amazon Redshift for BI reporting by implementing materialized views on datasets over 10 TB, using star-schema fact and dimension modeling via AWS Glue.
Orchestrated weekly refresh workflows with Airflow and configured Redshift query priorities, workload management (WLM), and access controls, reducing refresh time from 3 days to under 24 hours and delivering fresher data to Tableau dashboards.
Owned and managed 4+ scalable Airflow pipelines executing Spark, Kafka, and Hudi jobs on EMR clusters across multiple Airflow and Hadoop versions, while resolving compatibility issues, optimizing performance, and monitoring workflows via Kafdrop, CloudWatch Dashboards, and Spark History UI.
Designed and configured an end-to-end AWS ecosystem for data processing, including Hadoop clusters, job schedulers, metadata databases, data lakes, and disaster recovery solutions, while managing cost control, network security, access management, and data resiliency in line with best practices and Infrastructure as Code (IaC) frameworks.
Data Management Intern
Swiss Re American Holding Corporation
Fort Wayne
05.2022 - 10.2022
Gained first hands-on experience processing real industry datasets of over 300 million rows using PySpark DataFrames, RDDs, and Spark SQL.
Conducted comparative analysis of Pandas and PySpark on identical datasets to evaluate performance metrics, query execution plans, and suitability for different data processing scenarios.
Optimized distributed data workflows under tight resource constraints to extract insights from the data.
Evaluated actual vs. expected values using multiple analytical techniques while gaining hands-on experience with the Palantir Foundry platform.
Education
Master of Science - Computer Science
Purdue University
Fort Wayne, IN
Bachelor of Engineering - Computer Science and Engineering
College of Engineering, Guindy (CEG), Anna University
Chennai, TN, India
Skills
Spark
Python
Pandas
SQL
Kafka
Airflow
Tableau
AWS Cloud
Terraform
Ansible
R
Bash/shell scripting
Certification
AWS Certified Data Engineer - Associate
Data Analytics - Google Professional Career Certificate
Research Projects
RNA Analysis Pipeline for Retinal Disease Detection Using R and HPC
Used parallel processing frameworks in R (parallel, doParallel, foreach) to distribute a time-intensive RNA gene expression correlation algorithm across Purdue University’s Gilbreth GPU cluster, reducing runtime by 90%.
This significantly accelerated the diagnosis of Retinitis Pigmentosa (a retinal degenerative disease), aiding the research efforts of biomedical scientists.
Timeline
Data Engineer
CCC Intelligent Solutions
03.2023 - Current
Data Management Intern
Swiss Re American Holding Corporation
05.2022 - 10.2022
Master of Science - Computer Science
Purdue University
Bachelor of Engineering - Computer Science and Engineering
College of Engineering, Guindy (CEG), Anna University