Current Role: PySpark Developer and Data Engineer
Client Base: FMCGs
Duration: April 2021 to Present
Team Size: 4
Environment & Tools: GCP, Python, Spark, BigQuery
Cloud Platforms:
• GCP: Compute Engine, Dataproc, Cloud Storage, Pub/Sub, Cloud Functions, BigQuery
• AWS: EC2, Lambda, S3, SQS, SNS, CodeCommit
Key Responsibilities:
• Work with Data Science teams to set up and streamline data frameworks.
• Improved delivery efficiency and reduced logistics costs by implementing a delivery optimization algorithm.
• Developed an algorithm to assign the top 5 nearest distributors to each store/shop across India (PAN India), based on road distances obtained from OpenStreetMap (OSM) data (see the sketch after this list).
• Developed a map visualization module to render the final vehicle delivery route on a map using the Python Folium library and OpenStreetMap (see the Folium sketch after this list).
• Developed an algorithm to score stores based on sales data.
• Developed a frontend page using Flask and automated the framework end to end, removing manual intervention.
• Architected an automated framework using AWS/GCP services that triggers processing as soon as input files land in the storage bucket (see the Cloud Function sketch after this list).
• Led cleaning, processing, analysis, and transformation of historical data using BigQuery, enhancing data quality and reliability.
• Leveraged SQL to extract meaningful insights and prepare data for further use.
• Led scaling of data solutions to handle 150 GB of data efficiently.
• Configured Spark based on input data size, available cores, and memory allocation per node.
• Lead data processing and analytics tasks.
• Worked with a few Python ML libraries and algorithms.
• Participate in GCP migration activities, ensuring smooth transitions and optimized performance.
• Enhance data processing workflows for better efficiency and scalability.
• Optimized data pipelines by implementing advanced ETL processes and streamlining data flow.
• Developed custom algorithms for efficient data processing, enabling faster insights and decision making.
• Reduced operational costs by automating manual processes and improving overall data management efficiency.
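Sketch of the distributor-assignment step referenced above (illustrative only): it assumes an OSRM routing service built on OpenStreetMap data is reachable at a local endpoint and that store/distributor coordinates are plain (lon, lat) tuples; the endpoint and function names are placeholders, not the production code.

    import requests

    OSRM_URL = "http://localhost:5000/table/v1/driving/{coords}"  # assumed OSRM instance over OSM data

    def road_distance_matrix(stores, distributors):
        """Return a stores x distributors matrix of road distances in metres."""
        points = stores + distributors                        # [(lon, lat), ...]
        coords = ";".join(f"{lon},{lat}" for lon, lat in points)
        sources = ";".join(str(i) for i in range(len(stores)))
        dests = ";".join(str(i + len(stores)) for i in range(len(distributors)))
        resp = requests.get(
            OSRM_URL.format(coords=coords),
            params={"sources": sources, "destinations": dests, "annotations": "distance"},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["distances"]

    def top_n_distributors(stores, distributors, n=5):
        """For each store, return indices of the n nearest distributors by road distance."""
        matrix = road_distance_matrix(stores, distributors)
        return [sorted(range(len(row)), key=lambda j: row[j])[:n] for row in matrix]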
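Minimal Folium sketch for the route-map module mentioned above; the coordinates and output file name are placeholders, while the real module renders the optimized delivery route.

    import folium

    def render_route(route_points, out_path="route.html"):
        """Draw a delivery route (list of (lat, lon) stops) on an OpenStreetMap base map."""
        m = folium.Map(location=route_points[0], zoom_start=12)   # default OSM tiles
        folium.PolyLine(route_points, weight=4).add_to(m)         # route line
        for i, point in enumerate(route_points):
            folium.Marker(point, popup=f"Stop {i}").add_to(m)     # numbered stops
        m.save(out_path)                                          # self-contained HTML map

    render_route([(19.0760, 72.8777), (19.2183, 72.9781)])        # placeholder example stops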
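The bucket-triggered automation above could look like the following GCS-triggered Cloud Function (1st-gen background-function signature); the function name, file filter, and downstream kickoff are assumptions for illustration.

    def on_input_arrival(event, context):
        """Triggered when a new object lands in the input bucket."""
        bucket = event["bucket"]          # bucket that received the file
        name = event["name"]              # object path within the bucket
        if not name.endswith(".csv"):     # assumed input convention
            return
        print(f"New input gs://{bucket}/{name}; starting processing")
        # e.g. submit the Dataproc/PySpark job or publish a Pub/Sub message here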
Achievements:
• Successfully led multiple data engineering projects, resulting in significant performance improvements.
• Implemented optimized data frameworks that reduced processing time by 40-50% and cut overall execution cost by 50%.
• Developed Python code to fully automate Oozie job scheduling, significantly reducing the monitoring time required from the development team.
• Introduced dynamic allocation in Spark, optimizing resource utilization within the cluster to its maximum capacity (see the configuration sketch below).
• Played a key role in migrating the project to GCP.
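An illustrative PySpark session configuration for the dynamic-allocation and executor-sizing work described above; the numbers below are placeholders, since the real values were derived from input data size, available cores, and node memory.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delivery-optimization")                   # hypothetical app name
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .config("spark.shuffle.service.enabled", "true")    # needed for dynamic allocation on YARN
        .config("spark.executor.cores", "4")                # sized to node CPUs
        .config("spark.executor.memory", "8g")              # sized to node memory
        .config("spark.sql.shuffle.partitions", "400")      # sized to input volume
        .getOrCreate()
    )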
Project Name: MDR Threat Hunting (Network Security)
Client Base: Indian Banks, UAE Banks
Duration: May 2019 to Present
Team Size: 4
Environment & Tools: Hortonworks, Hive, Scala, Spark, Pig, Oozie, Logstash, Python
MDR Threat Hunting is a "Managed Detection and Response" product that monitors and proactively searches for threats within endpoint, user, network, and application data. The core data processing and analytics are handled by Hadoop technologies. Here's a breakdown of the process:
Technologies Used:
• Spark Scala for analytics
• Python scripting to optimize job scheduling and automate model generation
• HDFS, Logstash, and Hive
Data Source and Handling:
• Data is sourced through a logger and NiFi, which send the current day's data over a TCP or UDP port in syslog format.
• Logstash receives the data and stores it in HDFS in parsed CSV or TSV format.
Data Volume:
• The total data received from all models is around 700-800 GB.
• After filtering, this reduces to approximately 200-300 GB (including data from 27 customers in India).
Data Processing and Storage:
• The parsed data is stored in CSV or TSV format.
• It is processed by Pig and Spark, then stored in Hive managed tables (see the PySpark sketch below).
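The production analytics run in Spark with Scala (and Pig); the following is a hedged PySpark equivalent of the read-filter-store step, with the HDFS path, filter rule, and table name as placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mdr-threat-hunting")                 # hypothetical app name
        .enableHiveSupport()                           # write into Hive managed tables
        .getOrCreate()
    )

    events = (
        spark.read
        .option("header", "true")
        .option("sep", "\t")                           # parsed TSV written by Logstash
        .option("inferSchema", "true")
        .csv("hdfs:///data/mdr/parsed/current_day/")   # hypothetical HDFS path
    )

    filtered = events.filter(events["severity"] >= 5)  # hypothetical filter rule

    (
        filtered.write
        .mode("append")
        .format("hive")                                # Hive managed table
        .saveAsTable("mdr.threat_events")              # hypothetical database.table
    )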
Usage:
• The final processed data is saved in a Hive table and used by the downstream team for analysis and UI representation.
Project Name: Log Management
Client Base: Indian Banks, UAE Banks, Mirae Asset Global Fund
Duration: May 2020 to Present
Team Size: 3
Environment & Tools: Druid, Kafka Streams with Scala
In this project, we are creating a data lake for system-generated CEF-formatted logs using NiFi for real-time and batch processing. Here's an overview:
Data Source and Handling:
• Data is received from the upstream team as raw CEF logs in an unstructured format via NiFi onto a Kafka topic.
Data Processing:
• The Kafka Streams API is used to index the raw CEF data.
• Each event is consumed, converted into a single-line JSON record, and produced to multiple output topics (see the sketch after this list).
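The production indexer uses the Kafka Streams API in Scala; below is a minimal Python equivalent of the consume, convert-to-JSON, and fan-out step using the confluent-kafka client, with topic names and the CEF parser as placeholders.

    import json
    from confluent_kafka import Consumer, Producer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "cef-indexer",
        "auto.offset.reset": "earliest",
    })
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    consumer.subscribe(["raw-cef-logs"])                      # hypothetical input topic

    def parse_cef(line):
        """Placeholder parser: split the pipe-delimited CEF header fields."""
        keys = ["version", "vendor", "product", "device_version",
                "signature", "name", "severity", "extension"]
        return dict(zip(keys, line.split("|", 7)))

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = parse_cef(msg.value().decode("utf-8"))
        payload = json.dumps(event).encode("utf-8")           # single-line JSON
        for topic in ("cef-json-druid", "cef-json-archive"):  # hypothetical output topics
            producer.produce(topic, value=payload)
        producer.poll(0)                                      # serve delivery callbacks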
Output and Utilization:
• These output topics are used by Druid for real-time indexing, with the Kafka topics as the input data source.
Capacity:
• The pipeline handled 10k-15k events per second (EPS) while building the data lake.
Project Name: Client Investment and Securities (CI&S)
Client: US Bank
Organization: Hexaware Technologies Limited
Duration: Sep 2016 to Mar 2019
Team size: 5
Environment: HDFS, Hive, Python
The project involved maintaining and supporting a management reporting application called CI&S, which handled account information like holdings, trades, transactions, and cash flow.
Key responsibilities:
• Data maintenance: loading data from staging tables into HDFS using Oozie and importing data from a relational database into HDFS.
• Report generation: using SQL (Hive) scripts and queries to analyze the dataset and generate reports such as statements of holdings, trades, transactions, and cash activity (see the sketch after this list).
• Data integration: exporting analyzed data using the Apache POI and iText APIs to produce XML files and PDF reports.
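For illustration, a Hive report query of the kind described above could be issued from Python via PyHive; the connection details, table, and column names are placeholders, not the actual CI&S schema.

    from pyhive import hive

    conn = hive.Connection(host="hive-server", port=10000, username="etl_user")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT account_id,
               SUM(quantity * price) AS holdings_value
        FROM   cis.holdings
        WHERE  as_of_date = '2018-12-31'
        GROUP  BY account_id
    """)
    for account_id, holdings_value in cursor.fetchall():
        print(account_id, holdings_value)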
Skills:
• Hadoop
• PySpark
• Spark with Scala
• Python
• SQL (BigQuery, Hive)
• GCP, AWS
• ETL development