
HIMANSHU VISHWAKARMA

SENIOR DATA ENGINEER
Pune

Summary

  • Big Data Engineer with 8 years of IT experience, including around 7 years in the Hadoop ecosystem and 6 years with Spark. Working as a PySpark and Python developer for the last 3 years; previously worked with Spark and Scala for 3 years.
  • Experienced with the Hadoop ecosystem on both cloud platforms (GCP, AWS) and on-premises (Hortonworks).
  • Skilled in building scalable cloud frameworks, with 4+ years of experience on cloud platforms: 2+ years on GCP (Compute Engine, Dataproc, Cloud Storage, Pub/Sub, Cloud Functions, and Airflow) and 2+ years on AWS (EC2, Lambda, S3, SNS, SQS, CodeCommit). Participated in GCP migration activities and have exposure to a few ML libraries and algorithms using Python and pandas.

Overview

8 years of professional experience

Work History

Senior Data Engineer

LTIMindtree
04.2021 - Current

Current Role: PySpark Developer and Data Engineer

Client Base: FMCGs

Duration: April 2021 to Present

Team Size: 4

Environment & Tools: GCP, Python, Spark, BigQuery


Cloud Platforms:
GCP: Compute Engine, Dataproc, Cloud Storage, Pub/Sub, Cloud Functions, BigQuery
AWS: EC2, Lambda, S3, SQS, SNS, CodeCommit


Key Responsibilities:
• Worked with data science teams to set up and streamline data frameworks.
• Improved delivery efficiency and reduced logistics costs by implementing a delivery-optimization algorithm.
• Developed an algorithm to assign the 5 nearest distributors to each store/shop across India, based on road distances obtained from OpenStreetMap (OSM) data.
• Developed a map visualization module that renders the final vehicle delivery route using the Python Folium library and OpenStreetMap (see the sketch after this list).
• Developed an algorithm to score stores based on sales data.
• Developed a frontend page using Flask and automated the framework to run without manual intervention.
• Architected an automated framework using AWS/GCP services that triggers processing as soon as inputs land in a bucket.
• Led cleaning, processing, analysis, and transformation of historical data using BigQuery, enhancing data quality and reliability.
• Leveraged SQL to extract meaningful insights and optimize data for further use.
• Led scaling of data solutions to handle 150 GB of data efficiently.
• Configured Spark based on input data size, available cores, and memory allocation per node.
• Led tasks related to data processing and analytics.
• Worked with several Python ML libraries and algorithms.
• Participated in GCP migration activities, ensuring smooth transitions and optimized performance.
• Enhanced data processing workflows for better efficiency and scalability.
• Optimized data pipelines by implementing advanced ETL processes and streamlining data flow.
• Developed custom algorithms for efficient data processing, enabling faster insights and decision-making.
• Reduced operational costs by automating manual processes and improving overall data management efficiency.
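
A minimal sketch of the Folium route-map idea mentioned above; the coordinates, stop labels, and output file name are illustrative placeholders, not project data:

  import folium

  # Ordered (lat, lon) stops produced by the delivery-optimization step
  # (hypothetical data for illustration).
  route = [(18.5204, 73.8567), (18.5310, 73.8446), (18.5530, 73.8295)]

  # Render the route on an OpenStreetMap base layer.
  m = folium.Map(location=route[0], zoom_start=13, tiles="OpenStreetMap")
  folium.PolyLine(route, weight=4, opacity=0.8).add_to(m)
  for i, stop in enumerate(route):
      folium.Marker(stop, tooltip=f"Stop {i + 1}").add_to(m)

  m.save("delivery_route.html")  # open in a browser to inspect the route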


Achievements:
• Successfully led multiple data engineering projects, delivering significant performance improvements.
• Implemented optimized data frameworks that reduced processing time by 40-50% and overall execution cost by 50%.
• Developed Python code to fully automate Oozie job scheduling, significantly reducing the monitoring time required of the development team.
• Introduced dynamic memory allocation in Spark, pushing cluster resource utilization to its maximum capacity (a configuration sketch follows this list).
• Played a key role in migrating the project to GCP.
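
A hedged configuration sketch of the dynamic-allocation setup described above; the executor counts and sizes are placeholders, not the project's actual values:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("dynamic-allocation-example")  # illustrative app name
      # Let Spark grow and shrink the executor pool with the workload.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      # The external shuffle service is required for dynamic allocation on YARN.
      .config("spark.shuffle.service.enabled", "true")
      # Executor sizing still depends on input volume and node capacity.
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .getOrCreate()
  )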

Data Engineer

Paladion Networks
05.2019 - 03.2021

Project Name: MDR Threat Hunting (Network Security)

Client Base: Indian Banks, UAE Banks

Duration: May 2019 to March 2021

Team Size: 4

Environment & Tools: Hortonworks, Hive, Scala, Spark, Pig, Oozie, Logstash, Python

MDR Threat Hunting is a "Managed Detection and Response" product that monitors and proactively searches for threats within endpoint, user, network, and application data. The core data-processing analytics are handled by Hadoop technologies. Here's a breakdown of the process:

Technologies Used:
• Spark with Scala for analytics
• Python scripting to optimize job scheduling and automate model generation
• HDFS, Logstash, and Hive

Data Source and Handling:
• Data is sourced through a logger and NiFi, which send the current day's data over a TCP or UDP port in syslog format.
• Logstash receives the data and stores it in HDFS in parsed CSV or TSV format.

Data Volume:
• The total data received from all models is around 700-800 GB.
• After filtering, this reduces to approximately 200-300 GB (including data from 27 customers in India).

Data Processing and Storage:
• The parsed data is stored in CSV or TSV format.
• It is processed by Pig and Spark, then stored in Hive managed tables (see the sketch below).

Usage:
• The final processed data is saved in Hive tables and used by the upstream team for analysis and UI representation.
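
The load path above used Pig and Spark with Scala; the following is a minimal PySpark approximation of the HDFS-to-Hive step, with the path, column name, and table name as illustrative assumptions:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("mdr-threat-hunting-load")
      .enableHiveSupport()  # needed to write Hive managed tables
      .getOrCreate()
  )

  # Logstash lands the parsed events in HDFS as TSV; read one day's data.
  events = (
      spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("hdfs:///data/mdr/parsed/current_day/")  # illustrative path
  )

  # Filtering step that reduces the raw 700-800 GB to roughly 200-300 GB.
  filtered = events.filter(events["event_type"].isNotNull())  # placeholder predicate

  # Persist into a Hive managed table for the upstream analytics/UI team.
  filtered.write.mode("append").saveAsTable("mdr.threat_events")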

Project Name: Log Management

Client Base: Indian Banks, UAE Banks, Mirae Asset Global Fund

Duration: May 2020 to March 2021

Team Size: 3

Environment & Tools: Druid, Kafka-Stream with Scala

In this project, we created a data lake for system-generated, CEF-formatted logs, using NiFi for real-time and batch processing. Here's an overview:

Data Source and Handling:
• Data is received from the upstream team as raw CEF logs in an unstructured format via NiFi onto a Kafka topic.

Data Processing:
• The Kafka Streams API is used to index the raw CEF data.
• Each event is consumed, converted into a single-line JSON record, and produced to multiple output topics (a Python sketch of this step follows below).

Output and Utilization:
• These output topics are used by Druid for real-time indexing, with the Kafka topics serving as the input data source.

Capacity:
• We were able to handle 10k-15k events per second (EPS) while building the data lake.
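
The indexing step above was built with the Kafka Streams API in Scala; as a rough Python approximation (using the kafka-python client, with topic names and CEF fields assumed for illustration), the consume-transform-produce loop looks like this:

  import json
  from kafka import KafkaConsumer, KafkaProducer

  consumer = KafkaConsumer("raw-cef-logs", bootstrap_servers="localhost:9092")
  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  def parse_cef(line):
      # Split a CEF record into its seven header fields plus the extension.
      keys = ["cef_version", "vendor", "product", "version",
              "signature_id", "name", "severity", "extension"]
      return dict(zip(keys, line.split("|", 7)))

  for message in consumer:
      event = parse_cef(message.value.decode("utf-8", errors="replace"))
      # Route each single-line JSON event to a per-product output topic,
      # which Druid then consumes as a real-time indexing source.
      topic = "indexed-" + event.get("product", "unknown").lower().replace(" ", "-")
      producer.send(topic, event)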

Software Engineer

Hexaware Technologies
01.2016 - 04.2019

Project Name: Client Investment and Securities (CI&S)

Client: US BANK

Organization: Hexaware Technologies Limited

Duration: Sep 2016 to Mar 2019

Team size: 5

Environment: HDFS, Hive, Python

The project involved maintaining and supporting a management reporting application called CI&S, which handled account information like holdings, trades, transactions, and cash flow.

Key responsibilities:
• Data maintenance: loading data from staging tables to HDFS using Oozie and importing data from a relational database into HDFS.
• Report generation: SQL (Hive) scripts and queries to analyze the dataset and generate reports such as statements of holdings, trades, transactions, and cash activity (see the sketch after this list).
• Data integration: exporting analyzed data using the Apache POI and iText APIs, producing XML files and PDF reports.
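
An illustrative sketch of the Hive-based report generation (PyHive is assumed here as the Python-to-Hive bridge; the table, columns, and date are hypothetical, not from the project):

  from pyhive import hive

  conn = hive.Connection(host="hive-server", port=10000, database="cis")
  cursor = conn.cursor()

  # Example holdings-statement query; the schema is a placeholder.
  cursor.execute("""
      SELECT account_id, security_id, SUM(quantity) AS total_holding
      FROM holdings
      WHERE as_of_date = '2018-12-31'
      GROUP BY account_id, security_id
  """)

  for account_id, security_id, total_holding in cursor.fetchall():
      print(account_id, security_id, total_holding)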

Education

B.Tech

Guru Nanak Institutions
Hyderabad, India
04.2001 -

10+2 - Science

Kendriya Vidyalaya
Hyderabad
04.2001 -

Skills

  • Hadoop

  • PySpark

  • Spark with Scala

  • Python

  • SQL (BigQuery, Hive)

  • GCP, AWS

  • ETL development

Timeline

Senior Data Engineer

LTIMindtree
04.2021 - Current

Data Engineer

Paladion Networks
05.2019 - 03.2021

Software Engineer

Hexaware Technologies
01.2016 - 04.2019

B.Tech

Guru Nanak Institutions
04.2001 -

10+2 - Science

Kendriya Vidyalaya
04.2001 -