Sandeep Erelli
13+ years building large-scale distributed systems and cloud-native platforms across Fortune 500 companies. Deep expertise in Java & Spring Boot and Apache Kafka event-driven architectures on Azure, AWS, and GCP, with systems processing millions of events per minute at 99.99% availability.
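// Enterprise Membership Platform · Production
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class MembershipConsumer {

    @Autowired
    CosmosDbRepository cosmosRepo;

    @Autowired
    RedisTemplate<String, MembershipStatus> cache;

    @KafkaListener(
        topics = "${kafka.topic.membership}",
        groupId = "membership-consumer-grp")
    public void onEvent(MembershipEvent e) {
        // Write to the system of record first, then refresh the cache entry.
        cosmosRepo.upsert(e.getMemberId(), e.getStatus());
        cache.opsForValue().set(e.getMemberId(), e.getStatus());
        log.info("Processed: {}", e.getMemberId());
    }
}
// Kafka · Azure CosmosDB · Redis · K8s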
13 Years of Enterprise Engineering
I'm a Staff Software Engineer with 13+ years designing and operating large-scale distributed systems on the JVM. My core expertise is Java & Spring Boot — building high-throughput microservices, cloud-native platforms, and event-driven architectures that process millions of transactions every day.
In my current role, I architect cloud-native solutions for an enterprise membership platform ensuring high availability, fault tolerance, and performance optimization at scale — backed by Azure CosmosDB, Apache Cassandra, and Redis.
A defining theme across my career has been async decoupling via Apache Kafka. At LivePerson, I led the migration from synchronous messaging to a fully asynchronous Kafka event system handling millions of events per minute across three core services — eliminating cascading failures and enabling independent deployment of every service in the mesh.
Earlier at eBay and JPMorgan Chase, I built checkout & cart APIs at global scale, real-time Spark Streaming pipelines for transactional analytics, and NoSQL data platforms on Cassandra, Redis, and MongoDB sustaining sub-millisecond latency under continuous high-throughput workloads.
Enterprise Engineering at Scale
Walmart+ Membership Platform — Cloud-Native at Scale
Apr 2023 – Present · Bellevue, WA · Enterprise Scale
Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization at enterprise scale. Architected event-driven microservices on Apache Kafka decoupling downstream services for independent deployment and scaling. Stabilized platform service metrics and deployment strategy across services, driving continuous improvement through new tools, frameworks, and methodologies.
Async Messaging Migration — Millions of Events / Minute
LivePerson Inc. · SDE III · Apr 2020 – Apr 2023 · Seattle, WA
Led the team in migrating a traditional synchronous messaging system to a fully asynchronous architecture powered by Apache Kafka — delivering a highly scalable system handling millions of events per minute. Stabilized three core messaging services, spearheaded operational and migration efforts of the entire Kafka ecosystem, and built a comprehensive monitoring stack with Prometheus and Grafana for performance, system, and business metrics. Recognized as one of the company's top engineers in Q2 2020, Q3 2021, and Q2 2022.
eBay Checkout & Cart at Global Scale
Built scalable, performant microservices in Java/Spring Boot to enhance the checkout experience for eBay users globally. Refactored and enhanced Checkout & Shopping Cart APIs with Cassandra, Play framework, and reactive programming, while collaborating with data science teams on behavioral data pipelines.
Real-Time Observability Stack
Designed and deployed monitoring infrastructure across multiple platforms using Prometheus time-series metrics and Grafana dashboards tracking performance, system, and business KPIs in real time — with proactive alerting reducing mean time to detection across all production services.
Real-Time Data Pipelines at JPMorgan
Developed a Kafka + Spark Streaming + Cassandra real-time data pipeline at JPMorgan Chase to analyze large volumes of transactional data for the Corporate and Investment Bank — delivering highly scalable, fault-tolerant distributed systems powering data scientist-facing and supplier-facing services.
Professional Journey
Staff Software Engineer
- Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization
- Architected event-driven microservices on Apache Kafka, replacing synchronous REST dependencies and enabling independent scaling and deployment of each downstream service
- Stabilized platform service metrics and deployment strategy across services through systematic observability improvements with Prometheus, Grafana, and Splunk
- Promoted a culture of continuous improvement — identifying automation opportunities, implementing best practices, and driving adoption of new frameworks and methodologies
- Collaborated directly with product owners and business stakeholders to translate requirements into scalable technical specifications aligned with business goals
Software Development Engineer III
- Led the migration from a traditional synchronous messaging system to a fully asynchronous Kafka-based architecture handling millions of events per minute
- Stabilized the messaging platform by improving service metrics and deployment strategy across three core services
- Spearheaded operational and migration efforts of the entire Kafka ecosystem — topics, consumer groups, Schema Registry, and Kafka Connect connectors
- Built a monitoring stack with Prometheus for time-series metrics and created Grafana dashboards and alerts tracking performance, system, and business KPIs
Senior Software Engineer, Backend
- Built scalable, performant, resilient microservices in Java/Spring Boot enhancing the checkout experience for eBay users globally
- Refactored and enhanced Checkout & Shopping Cart APIs using Play framework and reactive programming while maintaining system stability
- Collaborated with data science teams to build data management tools and ETL pipelines for processing user behavioral data
- Improved project quality through comprehensive testing — unit, functional, and performance tests — and supported services with monitoring and diagnostic tooling
Senior Software Engineer
- Built a shared data platform supporting Corporate and Investment Bank (CIB) client services — designed for high-scale fault-tolerant analysis of large transactional data volumes
- Developed a real-time data pipeline using Kafka, Spark Streaming, and Cassandra to power analytics over billions of financial transactions
- Delivered high-performance, scalable, resilient microservices in Java and Scala supporting data scientist-facing and supplier-facing services
Senior Software Engineer, Backend
- Designed and maintained applications supporting eBay's marketing platform — delivering personalized content, campaigns, and templates via email, mobile, and in-app notifications
- Built and owned ETL data pipelines for processing user behavioral data at scale using Kafka, Apache Spark, Sqoop, and HDFS
Earlier Experience
- Citigroup Inc., St. Louis, MO (2014–2015) — Java/J2EE Consultant. Implemented the CitiMortgage web application with high-performance SOAP web services and a Spring MVC translation layer.
- GP INFOTECH PVT LTD, Hyderabad (2010–2012) — Java/J2EE Developer. Built end-to-end full-stack solutions for educational institutions across India.
Technical Skills
Education
Featured Projects
Temporal Workflow Orchestration — 100M+ Executions/Day
Designed and scaled a multi-tenant workflow orchestration platform on open-source Temporal processing over 100 million executions per day, covering financial reconciliation, data synchronization, ML inference pipelines, and long-running business processes. Built durable, retryable Java/Spring Boot activities backed by Apache Cassandra for workflow state, Azure CosmosDB for multi-region active-active checkpoints, and OpenSearch for real-time SLA analytics. Scaled from 1M to 100M+ executions via zero-downtime rolling Kubernetes upgrades with independently sized worker pools per workflow type.
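To make "retryable activities" concrete, here is a minimal sketch of the retry half only: a hand-rolled retry loop with exponential backoff in plain Java. It is not the Temporal SDK and does not model durability or persisted workflow state. The names `RetrySketch` and `runWithRetry`, the attempt count, and the backoff values are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Illustrative only: a plain-Java retry loop, not the Temporal SDK or the production code.
final class RetrySketch {

    /** Runs the activity, retrying failures with a backoff that doubles after each attempt. */
    static <T> T runWithRetry(Callable<T> activity, int maxAttempts, Duration initialBackoff)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMillis = initialBackoff.toMillis();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return activity.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis); // wait before the next attempt
                    backoffMillis *= 2;          // exponential backoff
                }
            }
        }
        throw last; // every attempt failed; surface the final error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical activity that fails twice, then succeeds on the third call.
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "reconciled";
        }, 5, Duration.ofMillis(10));
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "reconciled after 3 attempts"
    }
}
```

In a workflow engine, the retry policy is normally declared as configuration and the engine persists progress between attempts; the loop above only shows the control flow a single activity sees.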
AI Multi-Agent Orchestration Platform
Architected a production multi-agent AI system in Java/Spring Boot where a Planner Agent decomposes high-level goals into discrete steps, routing work to specialized agents — Data, Search, Code, and Action — running as independently scalable microservices. Temporal workflows provide durable orchestration so every agent execution is retryable and fault-tolerant even across multi-hour task chains. Redis handles short-term agent memory; OpenSearch with kNN powers long-term RAG retrieval. A Synthesizer Agent aggregates all results into a grounded, structured response. Dispatched at scale via Kafka to absorb traffic spikes without dropping work.
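The planner-to-agent routing described above can be sketched in plain Java, under stated assumptions. The class `PlannerSketch`, the hard-coded plan, and the lambda agents are hypothetical stand-ins: a real planner would derive steps from the goal, the agents would run as separate services, and the orchestration would go through Temporal and Kafka rather than an in-process loop.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: in-process routing, not the production orchestration code.
final class PlannerSketch {

    enum AgentType { DATA, SEARCH, CODE, ACTION }

    /** One unit of work produced by the planner and routed to a single agent. */
    static final class Step {
        final AgentType agent;
        final String instruction;

        Step(AgentType agent, String instruction) {
            this.agent = agent;
            this.instruction = instruction;
        }
    }

    /** Hypothetical planner: the steps are hard-coded here for illustration. */
    static List<Step> plan(String goal) {
        return List.of(
                new Step(AgentType.SEARCH, "find sources for: " + goal),
                new Step(AgentType.DATA, "load records for: " + goal),
                new Step(AgentType.ACTION, "publish report for: " + goal));
    }

    /** Routes each step to its registered agent, then joins the results (the synthesizer role). */
    static String run(String goal, Map<AgentType, Function<String, String>> agents) {
        List<String> results = new ArrayList<>();
        for (Step step : plan(goal)) {
            Function<String, String> agent = agents.get(step.agent);
            if (agent == null) {
                throw new IllegalStateException("No agent registered for " + step.agent);
            }
            results.add(agent.apply(step.instruction));
        }
        return String.join("\n", results);
    }

    public static void main(String[] args) {
        Map<AgentType, Function<String, String>> agents = new EnumMap<>(AgentType.class);
        for (AgentType type : AgentType.values()) {
            // Stub agents that echo their instruction; real agents would do the work.
            agents.put(type, instruction -> "[" + type + "] done: " + instruction);
        }
        System.out.println(run("Q3 churn summary", agents));
    }
}
```

Keeping the agent registry keyed by type is what lets each agent scale or be replaced independently: the planner only needs to know which type a step requires, not how that agent is implemented.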
Enterprise Membership Platform
Designed and owned a cloud-native membership platform built on Kafka event streaming, Azure CosmosDB for globally distributed state, Cassandra for high-throughput data, and Redis for low-latency caching, running on Kubernetes with full observability via Prometheus, Grafana, and Splunk.
Async Messaging System — LivePerson
Led migration from synchronous to async Kafka event architecture on GCP, processing millions of events per minute across three core services. Built Prometheus + Grafana observability stack tracking performance, system, and business SLAs in real time.
eBay Checkout & Cart Platform
Scalable Java/Spring Boot microservices for eBay's global checkout flow using Play framework, reactive programming, and Cassandra for high-throughput cart state — with full unit, functional, and performance test coverage.
Real-Time Financial Pipeline — JPMorgan
Kafka + Spark Streaming + Cassandra pipeline for JPMorgan's Corporate & Investment Bank to analyze billions of financial transactions in real time, powering data scientist and supplier-facing analytics services.
Open to conversations
I'm open to conversations around senior/staff engineering roles, technical advisory, and architecture discussions in distributed systems, event-driven platforms, and cloud-native engineering at scale.