Maya Bangar - Big Data Developer

Summary

Big Data Developer with 3+ years of experience designing, building, and optimizing cloud-based data pipelines using Apache Spark (Scala, PySpark, Spark SQL) on AWS and Azure platforms. Strong expertise in data lake and lakehouse architectures using Amazon S3, Delta Lake, and Parquet, with hands-on experience in ETL pipelines, performance optimization, and workflow orchestration using Apache Airflow. Proven ability to process large-scale structured and semi-structured datasets.

Overview

3 years of professional experience

Work History

Big Data Developer

Annotation Infotech

Mumbai

02.2023 - 01.2026

Designed and developed batch data pipelines using Apache Spark (Scala/PySpark) to process large-scale structured and semi-structured datasets (CSV, JSON, Avro, Parquet).
Built and optimized ETL workflows on AWS EMR, ingesting data from multiple sources and storing curated datasets in Amazon S3 using columnar formats (Parquet).
Improved Spark job performance through partition tuning, caching, persistence, and efficient file format selection, reducing processing time and EMR resource usage.
Created and managed Hive-managed and external tables, implementing partitioning and bucketing strategies to enable efficient analytical querying.
Developed Spark SQL-based transformations for complex aggregations, joins, and data enrichment required for downstream analytics use cases.
Orchestrated scheduled data workflows using Apache Airflow, managing task dependencies, retries, and failure handling to meet pipeline SLAs.
Provisioned and maintained AWS EC2 and EMR clusters for development, testing, and production workloads.
Implemented a data lake architecture on Amazon S3, ensuring an optimized storage layout and improved query performance.
Monitored Spark jobs and EMR clusters, identifying performance bottlenecks and applying proactive optimizations.
Collaborated with cross-functional teams to understand business requirements and deliver scalable data solutions.

Technologies: Apache Spark, Scala, PySpark, Spark SQL, Hadoop, Hive, Sqoop, AWS (S3, EC2, EMR), MySQL, and Airflow.

Spatial Data Specialist I

HERE Technologies

Mumbai

09.2021 - 01.2022

Performed data ingestion from relational databases into HDFS and Hive using Sqoop.
Created and managed Hive tables, including partitioned datasets for optimized query execution.
Executed analytical queries using Spark SQL to support internal data processing requirements.
Gained an understanding of Spark performance concepts, such as predicate pushdown and caching.

Technologies: Apache Spark, Hadoop, Hive, Sqoop, Spark SQL, HDFS.

Education

Master of Science - Information Technology

Pillai College of Arts, Commerce And Science

05.2021

Bachelor of Science - Information Technology

Pillai College of Arts, Commerce And Science

05.2019

Skills

AWS: S3, EC2, EMR

Azure: Azure Databricks, Azure Blob Storage

Apache Spark (Scala Spark, PySpark, Spark SQL)

Hadoop (HDFS), Hive, Sqoop

Python, SQL

Apache Airflow, AWS Lambda, GitHub

Parquet, Avro, ORC, JSON, CSV

Responsibilities

Utilized Sqoop to transfer data from MySQL to HDFS.
Employed Spark for data processing and transformation to ensure data quality and consistency.
Imported data into HDFS and Hive using Sqoop and managed data within the environment.
Loaded data into Hive from Spark RDDs and DataFrames for further processing.
Loaded and transformed large sets of semi-structured data in formats such as JSON, Avro, and Parquet.
Queried data using Spark SQL on top of the Spark engine for faster dataset processing.
Created multiple Hive tables, ran Hive queries on them, and implemented partitioning, dynamic partitioning, and bucketing in Hive for efficient data access.
Created EC2 instances and EMR clusters for development and testing.
Executed steps on EMR clusters to deploy jobs as per requirements.
Stored Spark-processed data in S3 in appropriate file formats, as per business requirements.
Created different types of tables, such as managed and external tables.
Skilled in using Spark SQL persistence and caching mechanisms to reduce data processing overhead and improve query performance.
Familiar with Spark SQL schema and data type operations, such as creating, modifying, and dropping tables and handling null values.
Knowledge of Spark optimization techniques, such as cost-based query optimization, column pruning, and predicate pushdown, and their impact on query performance and resource utilization.
Experience working with Spark in production environments and implementing performance monitoring and alerting systems to detect and resolve performance issues proactively.
Proficient in processing serialized data in Spark using formats such as Avro, Parquet, and ORC, and in working with their features.
Skilled in working with textual data formats in Spark, such as CSV, JSON, and XML, and their serialization using Spark DataFrames and RDDs.
Maintained and monitored Spark clusters on AWS EMR, ensuring high availability and fault tolerance.
Designed and implemented data lake architectures on Amazon S3 using columnar formats such as Parquet to optimize query performance.
Optimized Spark jobs and data processing workflows for scalability, performance, and cost efficiency using techniques such as partitioning and caching.
Designed and developed Spark applications to implement complex data transformations and aggregations for batch processing jobs, leveraging Spark SQL and DataFrames.
Loaded and transformed large sets of data.

Timeline