Sandeep Erelli
13+ years architecting and operating large-scale distributed systems and cloud-native platforms across Fortune 500 organizations. Specialized in designing resilient, high-throughput event-driven architectures, distributed workflow platforms, and mission-critical backend systems operating at enterprise scale.
Deep expertise across JVM ecosystems, Apache Kafka streaming platforms, cloud-native infrastructure, and multi-cloud environments including Azure, AWS, and GCP — building highly available systems processing millions of events per minute with 99.99% availability.
// Membership Platform · Kafka → Temporal Workflows
@Component
@Slf4j
public class MembershipEventConsumer {

    private static final String MEMBERSHIP_QUEUE = "membership-task-queue";

    @Autowired private WorkflowClient temporal;
    @Autowired private MeterRegistry metrics;

    @KafkaListener(
        topics = "${kafka.topic.membership}",
        groupId = "membership-consumer-grp",
        concurrency = "${kafka.concurrency:6}")
    public void onEvent(@Payload MembershipEvent event, Acknowledgment ack) {
        // Deterministic workflow ID makes redelivered records idempotent:
        // a duplicate start for the same member is rejected by Temporal.
        MembershipWorkflow wf = temporal.newWorkflowStub(
            MembershipWorkflow.class,
            WorkflowOptions.newBuilder()
                .setWorkflowId("mbr-" + event.getMemberId())
                .setTaskQueue(MEMBERSHIP_QUEUE)
                .setWorkflowExecutionTimeout(Duration.ofHours(24))
                .build());

        // Async start: returns once the workflow is durably recorded.
        WorkflowClient.start(wf::processLifecycle, event);

        metrics.counter("membership.events", "type", event.getType()).increment();
        ack.acknowledge(); // commit the offset only after the workflow is started
    }
}
// Kafka · Temporal.io · Micrometer · CosmosDB
Distributed Systems & AI Platform Architect
I'm a Staff Software Engineer with 13+ years of experience architecting and operating large-scale distributed systems, cloud-native platforms, and event-driven architectures at enterprise scale. I specialize in building resilient, fault-tolerant systems that power mission-critical workflows processing hundreds of millions of transactions and workflow executions daily across Azure, AWS, and GCP.
My expertise spans distributed workflow orchestration, asynchronous systems, streaming data platforms, AI-driven infrastructure, and operational scalability — focused on designing highly observable, durable, and recoverable platforms capable of operating reliably under real-world production scale and failure conditions.
My work has centered on durable execution and workflow orchestration using Temporal.io. I've architected Temporal-backed systems executing 100M+ workflow runs per day, leveraging workflow-as-code paradigms, activity retries, saga compensation patterns, dynamic signaling, and multi-region resiliency strategies to replace fragile cron-driven orchestration and custom state machines with highly reliable workflow platforms.
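Stripped of the Temporal SDK, the saga compensation pattern reduces to a small amount of plain Java: run each step, and if one fails, undo the completed steps in reverse order. The sketch below is illustrative only — step names are hypothetical, and in a real Temporal workflow the SDK's Saga helper and activity retries would do this work durably:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal saga-compensation sketch (illustrative, not the production code).
public class SagaSketch {

    interface Step {
        void execute();
        void compensate();
    }

    /** Runs steps in order; on failure, compensates the already-completed
     *  steps in LIFO order. Returns true only if every step committed. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(); // undo newest-first
                }
                return false;
            }
        }
        return true;
    }
}
```

The LIFO unwind is the essential property: compensations run in the exact reverse of the commit order, so each undo sees the state its step left behind.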
I'm building agentic AI systems powered by orchestrated multi-agent workflows — combining planner/synthesizer orchestration with specialized agents for semantic retrieval, reasoning, code execution, and action dispatch, coordinated through durable workflow infrastructure for recoverability and long-running execution management. The architecture integrates LLM function-calling, OpenSearch vector search, Redis-backed reasoning memory, and Kafka event ingestion to enable autonomous, self-correcting AI workflows operating at production scale.
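The planner → specialist agents → synthesizer flow can be sketched as a routing loop in plain Java. Everything here is an illustrative stand-in: the Task shape, the agent names, and the string-based synthesis replace the real LLM-backed agents and durable orchestration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy sketch of planner output being routed to registered agents,
// with results aggregated by a synthesizer step.
public class AgentOrchestratorSketch {

    record Task(String agent, String input) {}

    private final Map<String, Function<String, String>> agents = new LinkedHashMap<>();

    void register(String name, Function<String, String> agent) {
        agents.put(name, agent);
    }

    /** Routes each planned task to its agent, then synthesizes the results. */
    String run(List<Task> plan) {
        StringBuilder synthesis = new StringBuilder();
        for (Task t : plan) {
            Function<String, String> agent = agents.get(t.agent());
            if (agent == null) {
                throw new IllegalArgumentException("no agent registered: " + t.agent());
            }
            synthesis.append(agent.apply(t.input())).append('\n');
        }
        return synthesis.toString();
    }
}
```

In the real system each agent is an independently scalable microservice and each `agent.apply` call would be a retryable Temporal activity rather than an in-process function call.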
Extensive experience designing real-time streaming and large-scale data processing systems using Apache Kafka, Kafka Streams, Apache Spark, Spark Streaming, Kubernetes, Redis, Cassandra, and OpenSearch. These systems have powered fraud detection, transactional analytics, ML feature generation, financial reporting, and operational intelligence workloads across eBay, JPMorgan Chase, and LivePerson — processing tens of billions of records across Hadoop/YARN and AWS EMR.
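To give a flavor of the aggregation work these streaming systems perform (e.g. feature generation for fraud signals), here is a minimal standalone sketch of a tumbling-window count by key — the operation Kafka Streams expresses with `groupByKey().windowedBy(...)`. This plain-Java version is illustrative only, not the production pipeline:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone tumbling-window count sketch (illustrative).
public class WindowedCountSketch {

    record Event(String key, long timestampMs) {}

    /** Fixed-size tumbling windows: returns "key@windowStart" -> event count.
     *  Each event falls into exactly one window of width windowMs. */
    static Map<String, Long> countByWindow(List<Event> events, long windowMs) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Event e : events) {
            long windowStart = (e.timestampMs() / windowMs) * windowMs;
            counts.merge(e.key() + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```

Two events for key "a" at t=100 and t=200 land in the same 1-second window, while one at t=1100 starts the next window's count.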
A defining theme throughout my career has been enabling organizational scalability through asynchronous, event-driven architectures. At LivePerson, I led the migration from tightly coupled synchronous RPC to a Kafka-native event mesh handling millions of events per minute across critical platform services — significantly improving resiliency, eliminating cascading failures, and enabling fully independent deployment and scaling across the platform ecosystem.
I'm passionate about building platforms that are not only scalable and performant, but also observable, operable, resilient, and adaptable for long-term organizational growth.
Enterprise Engineering at Scale
Walmart+ Membership Platform — Cloud-Native at Scale
Apr 2023 – Present · Bellevue, WA · Enterprise Scale
Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization at enterprise scale. Architected event-driven microservices on Apache Kafka, decoupling downstream services for independent deployment and scaling. Stabilized service-level metrics and standardized the deployment strategy across services, driving continuous improvement through new tools, frameworks, and methodologies.
Async Messaging Migration — Millions of Events / Minute
LivePerson Inc. · SDE III · Apr 2020 – Apr 2023 · Seattle, WA
Led the team in migrating a traditional synchronous messaging system to a fully asynchronous architecture powered by Apache Kafka — delivering a highly scalable system handling millions of events per minute. Stabilized three core messaging services, spearheaded operational and migration efforts of the entire Kafka ecosystem, and built a comprehensive monitoring stack with Prometheus and Grafana for performance, system, and business metrics. Recognized as one of the company's top engineers in Q2 2020, Q3 2021, and Q2 2022.
eBay Checkout & Cart at Global Scale
Built scalable, performant microservices in Java/Spring Boot to enhance the checkout experience for eBay users globally. Refactored and enhanced Checkout & Shopping Cart APIs with Cassandra, the Play framework, and reactive programming — while collaborating with data science teams on behavioral data pipelines.
Real-Time Observability Stack
Designed and deployed monitoring infrastructure across multiple platforms using Prometheus time-series metrics and Grafana dashboards tracking performance, system, and business KPIs in real time — with proactive alerting reducing mean time to detection across all production services.
Real-Time Data Pipelines at JPMorgan
Developed a Kafka + Spark Streaming + Cassandra real-time data pipeline at JPMorgan Chase to analyze large volumes of transactional data for the Corporate and Investment Bank — delivering highly scalable, fault-tolerant distributed systems powering data scientist-facing and supplier-facing services.
Professional Journey
Staff Software Engineer
- Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization
- Architected event-driven microservices on Apache Kafka, replacing synchronous REST dependencies and enabling independent scaling and deployment of each downstream service
- Stabilized service-level metrics and standardized the deployment strategy across services through systematic observability improvements with Prometheus, Grafana, and Splunk
- Promoted a culture of continuous improvement — identifying automation opportunities, implementing best practices, and driving adoption of new frameworks and methodologies
- Collaborated directly with product owners and business stakeholders to translate requirements into scalable technical specifications aligned with business goals
Software Development Engineer III
- Led the migration from a traditional synchronous messaging system to a fully asynchronous Kafka-based architecture handling millions of events per minute
- Stabilized the messaging platform by improving service metrics and deployment strategy across three core services
- Spearheaded operational and migration efforts of the entire Kafka ecosystem — topics, consumer groups, Schema Registry, and Kafka Connect connectors
- Built a monitoring stack with Prometheus for time-series metrics and created Grafana dashboards and alerts tracking performance, system, and business KPIs
Senior Software Engineer, Backend
- Built scalable, performant, resilient microservices in Java/Spring Boot enhancing the checkout experience for eBay users globally
- Refactored and enhanced Checkout & Shopping Cart APIs using the Play framework and reactive programming while maintaining system stability
- Collaborated with data science teams to build data management tools and ETL pipelines for processing user behavioral data
- Improved project quality through comprehensive testing — unit, functional, and performance tests — and supported services with monitoring and diagnostic tooling
Senior Software Engineer
- Built a shared data platform supporting Corporate and Investment Bank (CIB) client services — designed for high-scale fault-tolerant analysis of large transactional data volumes
- Developed a real-time data pipeline using Kafka, Spark Streaming, and Cassandra to power analytics over billions of financial transactions
- Delivered high-performance, scalable, resilient microservices in Java and Scala supporting data scientist-facing and supplier-facing services
Senior Software Engineer, Backend
- Designed and maintained applications supporting eBay's marketing platform — delivering personalized content, campaigns, and templates via email, mobile, and in-app notifications
- Built and owned ETL data pipelines for processing user behavioral data at scale using Kafka, Apache Spark, Sqoop, and HDFS
Earlier Experience
- Citigroup Inc., Saint Louis MO (2014–2015) — Java/J2EE Consultant. Implemented the CitiMortgage web application with high-performance SOAP web services and a Spring MVC translation layer.
- GP INFOTECH PVT LTD, Hyderabad (2010–2012) — Java/J2EE Developer. Built end-to-end full-stack solutions for educational institutions across India.
Featured Projects
Temporal Workflow Orchestration — 100M+ Executions/Day
Designed and scaled a multi-tenant workflow orchestration platform on open-source Temporal processing over 100 million executions per day — covering financial reconciliation, data synchronization, ML inference pipelines, and long-running business processes. Built durable, retryable Java/Spring Boot activities backed by Apache Cassandra for workflow state, Azure CosmosDB for multi-region active-active checkpoints, and OpenSearch for real-time SLA analytics. Scaled from 1M to 100M+ daily executions via zero-downtime rolling Kubernetes upgrades with independently sized worker pools per workflow type.
AI Multi-Agent Orchestration Platform
Architected a production multi-agent AI system in Java/Spring Boot where a Planner Agent decomposes high-level goals into discrete steps, routing work to specialized agents — Data, Search, Code, and Action — running as independently scalable microservices. Temporal workflows provide durable orchestration so every agent execution is retryable and fault-tolerant even across multi-hour task chains. Redis handles short-term agent memory; OpenSearch with kNN powers long-term RAG retrieval. A Synthesizer Agent aggregates all results into a grounded, structured response. Dispatched at scale via Kafka to absorb traffic spikes without dropping work.
Enterprise Membership Platform
Designed and owned a cloud-native membership platform. Kafka event streaming, Azure CosmosDB for globally distributed state, Cassandra for high-throughput data, and Redis for low-latency caching on Kubernetes with full observability via Prometheus, Grafana, and Splunk.
Async Messaging System — LivePerson
Led migration from synchronous to async Kafka event architecture on GCP, processing millions of events per minute across three core services. Built Prometheus + Grafana observability stack tracking performance, system, and business SLAs in real time.
eBay Checkout & Cart Platform
Scalable Java/Spring Boot microservices for eBay's global checkout flow using Play framework, reactive programming, and Cassandra for high-throughput cart state — with full unit, functional, and performance test coverage.
Real-Time Financial Pipeline — JPMorgan
Kafka + Spark Streaming + Cassandra pipeline for JPMorgan's Corporate & Investment Bank to analyze billions of financial transactions in real time, powering data scientist and supplier-facing analytics services.
Open to conversations
I'm open to conversations around senior/staff engineering roles, technical advisory, and architecture discussions in distributed systems, event-driven platforms, and cloud-native engineering at scale.