13+ Years Engineering  ·  Seattle, WA

Sandeep Erelli

Staff Software Engineer

13+ years architecting and operating large-scale distributed systems and cloud-native platforms across Fortune 500 organizations. Specialized in designing resilient, high-throughput event-driven architectures, distributed workflow platforms, and mission-critical backend systems operating at enterprise scale.

Deep expertise across JVM ecosystems, Apache Kafka streaming platforms, cloud-native infrastructure, and multi-cloud environments including Azure, AWS, and GCP — building highly available systems processing millions of events per minute with 99.99% availability.

MembershipEventConsumer.java
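// Membership Platform · Kafka → Temporal Workflows
import io.micrometer.core.instrument.MeterRegistry;
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import java.time.Duration;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class MembershipEventConsumer {

  // Placeholder value: the original snippet does not show the task queue name.
  private static final String MEMBERSHIP_QUEUE = "membership-task-queue";

  @Autowired private WorkflowClient temporal;
  @Autowired private MeterRegistry  metrics;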

  @KafkaListener(
    topics      = "${kafka.topic.membership}",
    groupId     = "membership-consumer-grp",
    concurrency = "${kafka.concurrency:6}")
  public void onEvent(
      @Payload MembershipEvent event,
      Acknowledgment ack) {

    MembershipWorkflow wf = temporal
      .newWorkflowStub(MembershipWorkflow.class,
        WorkflowOptions.newBuilder()
          .setWorkflowId("mbr-"
            + event.getMemberId())
          .setTaskQueue(MEMBERSHIP_QUEUE)
          .setWorkflowExecutionTimeout(
            Duration.ofHours(24))
          .build());

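    // Start the workflow without blocking the listener thread on its result.
    WorkflowClient.start(
      wf::processLifecycle, event);

    metrics.counter("membership.events",
      "type", event.getType())
      .increment();
    // Commit the offset only after the workflow start has been accepted.
    ack.acknowledge();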
  }
}
// Kafka · Temporal.io · Micrometer · CosmosDB
Engineering experience at
Walmart · LivePerson · eBay · JPMorgan Chase · Citigroup

Distributed Systems & AI Platform Architect

I'm a Staff Software Engineer with 13+ years of experience architecting and operating large-scale distributed systems, cloud-native platforms, and event-driven architectures at enterprise scale. I specialize in building resilient, fault-tolerant systems that power mission-critical workflows processing hundreds of millions of transactions and workflow executions daily across Azure, AWS, and GCP.

My expertise spans distributed workflow orchestration, asynchronous systems, streaming data platforms, AI-driven infrastructure, and operational scalability — focused on designing highly observable, durable, and recoverable platforms capable of operating reliably under real-world production scale and failure conditions.

Workflow Orchestration & Durable Execution

My work centers on durable execution and workflow orchestration using Temporal.io. I've architected Temporal-backed systems executing 100M+ workflow runs per day, using workflow-as-code, activity retries, saga compensation patterns, dynamic signaling, and multi-region resiliency strategies to replace fragile cron-driven orchestration and custom state machines with highly reliable workflow platforms.
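The saga compensation pattern mentioned above can be sketched in plain Java. This is a minimal illustration only, not the Temporal SDK and not the production implementation: the `Saga` class, its `Step` record, and the step names in the usage note are all hypothetical. It runs steps in order and, when one fails, undoes the already completed steps in reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical minimal saga: run steps in order; on failure, undo completed steps in reverse. */
final class Saga {
    /** One saga step: a forward action plus the compensation that undoes it. */
    record Step(String name, Runnable action, Runnable compensation) {}

    private final List<Step> steps = new ArrayList<>();

    Saga step(String name, Runnable action, Runnable compensation) {
        steps.add(new Step(name, action, compensation));
        return this;
    }

    /** Returns true if every step succeeded, false if the saga was rolled back. */
    boolean run() {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step); // most recently completed step sits on top
            } catch (RuntimeException e) {
                // Compensate in reverse order of completion; the failed step itself
                // never completed, so its compensation is not invoked.
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```

For example, a saga with steps "reserve", "charge", and "activate" where "activate" throws would run the compensation for "charge" and then for "reserve", and `run()` would return `false`. A durable workflow engine adds what this sketch lacks: persisted progress, so compensation still happens after a process crash.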

Agentic AI & Multi-Agent Systems

I'm building agentic AI systems powered by orchestrated multi-agent workflows — combining planner/synthesizer orchestration with specialized agents for semantic retrieval, reasoning, code execution, and action dispatch, coordinated through durable workflow infrastructure for recoverability and long-running execution management. The architecture integrates LLM function-calling, OpenSearch vector search, Redis-backed reasoning memory, and Kafka event ingestion to enable autonomous, self-correcting AI workflows operating at production scale.
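As a rough sketch of the planner-to-agents-to-synthesizer flow described above, the following plain Java example routes each planned step to a named agent and joins the results. Everything here is an assumption made for illustration: the `Orchestrator` class, the `Step` record, and the agent names are hypothetical, the agents are simple functions rather than LLM-backed services, and the string join merely stands in for a real synthesizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical orchestrator: dispatches planned steps to registered agents. */
final class Orchestrator {
    /** A planned unit of work: which agent should run, and with what input. */
    record Step(String agent, String input) {}

    private final Map<String, Function<String, String>> agents = new LinkedHashMap<>();

    Orchestrator register(String name, Function<String, String> agent) {
        agents.put(name, agent);
        return this;
    }

    /** Runs each planned step on its agent, then joins the outputs into one answer. */
    String execute(List<Step> plan) {
        List<String> results = new ArrayList<>();
        for (Step step : plan) {
            Function<String, String> agent = agents.get(step.agent());
            if (agent == null) {
                throw new IllegalArgumentException("No agent registered: " + step.agent());
            }
            results.add(agent.apply(step.input()));
        }
        return String.join(" | ", results); // stand-in for an LLM-backed synthesizer
    }
}
```

In a durable-workflow setting, each `agent.apply` call would typically be a retried activity, so a failed step can be resumed without re-running the steps before it.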

Streaming & Large-Scale Data Processing

I have extensive experience designing real-time streaming and large-scale data processing systems using Apache Kafka, Kafka Streams, Apache Spark, Spark Streaming, Kubernetes, Redis, Cassandra, and OpenSearch. These systems have powered fraud detection, transactional analytics, ML feature generation, financial reporting, and operational intelligence workloads across eBay, JPMorgan Chase, and LivePerson, processing tens of billions of records across Hadoop/YARN and AWS EMR.

Event-Driven Architecture & Platform Resiliency

A defining theme throughout my career has been enabling organizational scalability through asynchronous, event-driven architectures. At LivePerson, I led the migration from tightly coupled synchronous RPC to a Kafka-native event mesh handling millions of events per minute across critical platform services — significantly improving resiliency, eliminating cascading failures, and enabling fully independent deployment and scaling across the platform ecosystem.
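To illustrate why a log-based event architecture decouples services, here is a minimal in-memory sketch in plain Java. It is not Kafka and not the LivePerson implementation: `EventLog`, the group names, and the event names in the usage note are hypothetical. The producer only appends to the log, and each consumer group reads at its own pace through its own offset, so a slow or failed consumer does not block the producer or other groups.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only log with an independent read offset per consumer group. */
final class EventLog {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>();

    /** Appends an event and returns its position; producers never wait on consumers. */
    synchronized int append(String event) {
        log.add(event);
        return log.size() - 1;
    }

    /** Returns up to max events this group has not yet read and advances its offset. */
    synchronized List<String> poll(String group, int max) {
        int from = offsets.getOrDefault(group, 0);
        int to = Math.min(log.size(), from + max);
        List<String> batch = new ArrayList<>(log.subList(from, to));
        offsets.put(group, to);
        return batch;
    }
}
```

With three appended events, a "billing" group that polls two at a time and a "search" group that polls ten at a time each see all three events eventually, and neither affects the other's position. A synchronous RPC chain has no such buffer, which is why one slow dependency there can stall every caller.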

I'm passionate about building platforms that are not only scalable and performant, but also observable, operable, resilient, and adaptable for long-term organizational growth.

Temporal.io · Agentic AI · Apache Kafka · Durable Execution · Multi-Agent Systems · LLM Orchestration · OpenSearch / kNN · Apache Spark · Kafka Streams · Spark Streaming · Hadoop / YARN / EMR · Event-Driven Architecture · Kubernetes · Java / Spring Boot
13+
Years of Engineering
100M+
Temporal Workflows / Day
10B+
Records / Batch Pipeline
3
Cloud Platforms (Azure · AWS · GCP)
M+/min
Kafka Events Processed

Enterprise Engineering at Scale

Walmart+ Membership Platform — Cloud-Native at Scale

Apr 2023 – Present  ·  Bellevue, WA  ·  Enterprise Scale

Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization at enterprise scale. Architected event-driven microservices on Apache Kafka decoupling downstream services for independent deployment and scaling. Stabilized platform service metrics and deployment strategy across services, driving continuous improvement through new tools, frameworks, and methodologies.

HA
High Availability
Azure
Primary Cloud
K8s
Container Orchestration
Staff
Engineering Level
Java · Spring Boot · Apache Kafka · Azure CosmosDB · Apache Cassandra · Redis · Kubernetes · Azure SQL · Prometheus · Grafana · Splunk

Async Messaging Migration — Millions of Events / Minute

LivePerson Inc.  ·  SDE III  ·  Apr 2020 – Apr 2023  ·  Seattle, WA

Led the team in migrating a traditional synchronous messaging system to a fully asynchronous architecture powered by Apache Kafka — delivering a highly scalable system handling millions of events per minute. Stabilized three core messaging services, spearheaded operational and migration efforts of the entire Kafka ecosystem, and built a comprehensive monitoring stack with Prometheus and Grafana for performance, system, and business metrics. Recognized as one of the company's top engineers in Q2 2020, Q3 2021, and Q2 2022.

M+/min
Events Processed
3
Core Services Stabilized
Annual Engineer Award
GCP
Cloud Platform
Apache Kafka · Java · Spring Boot · Couchbase · Redis · Elasticsearch · Prometheus · Grafana · GCP · Kubernetes

eBay Checkout & Cart at Global Scale

Built scalable, performant microservices in Java/Spring Boot to enhance the checkout experience for eBay users globally. Refactored and enhanced the Checkout & Shopping Cart APIs with Cassandra, Play framework, and reactive programming, while collaborating with data science teams on behavioral data pipelines.

Real-Time Observability Stack

Designed and deployed monitoring infrastructure across multiple platforms using Prometheus time-series metrics and Grafana dashboards tracking performance, system, and business KPIs in real time — with proactive alerting reducing mean time to detection across all production services.

Real-Time Data Pipelines at JPMorgan

Developed a Kafka + Spark Streaming + Cassandra real-time data pipeline at JPMorgan Chase to analyze large volumes of transactional data for the Corporate and Investment Bank — delivering highly scalable, fault-tolerant distributed systems powering data scientist-facing and supplier-facing services.
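One common building block of streaming analytics over transactional data is a tumbling-window aggregation. The sketch below shows the idea in plain Java under stated assumptions: it is not Spark Streaming code and not the JPMorgan pipeline, the `TumblingWindowSum` class and `Txn` record are hypothetical, and timestamps are assumed to be non-negative epoch milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tumbling-window aggregator over a batch of transactions. */
final class TumblingWindowSum {
    /** A transaction: account id, event time in epoch millis, amount in cents. */
    record Txn(String account, long epochMillis, long amountCents) {}

    /**
     * Sums amounts per (window start, account) using fixed, non-overlapping windows.
     * Keys have the form "windowStartMillis|account".
     */
    static Map<String, Long> aggregate(Iterable<Txn> txns, long windowMillis) {
        Map<String, Long> totals = new TreeMap<>();
        for (Txn t : txns) {
            // Integer division floors the timestamp to the start of its window.
            long windowStart = (t.epochMillis() / windowMillis) * windowMillis;
            totals.merge(windowStart + "|" + t.account(), t.amountCents(), Long::sum);
        }
        return totals;
    }
}
```

With a 60,000 ms window, transactions for account "a" at 1,000 ms and 59,000 ms land in the window starting at 0, while one at 61,000 ms lands in the window starting at 60,000. A streaming engine applies the same bucketing continuously and additionally handles late data and state checkpointing.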

Professional Journey

Staff Software Engineer

Walmart Inc., Bellevue WA · Current
Apr 2023 — Present
  • Led design and implementation of large-scale, cloud-native solutions for the Walmart+ membership platform — ensuring high availability, fault tolerance, and performance optimization
  • Architected event-driven microservices on Apache Kafka, replacing synchronous REST dependencies and enabling independent scaling and deployment of each downstream service
  • Stabilized platform service metrics and deployment strategy across services through systematic observability improvements with Prometheus, Grafana, and Splunk
  • Promoted a culture of continuous improvement — identifying automation opportunities, implementing best practices, and driving adoption of new frameworks and methodologies
  • Collaborated directly with product owners and business stakeholders to translate requirements into scalable technical specifications aligned with business goals
Java · Spring Boot · Kafka · Azure CosmosDB · Apache Cassandra · Redis · Azure SQL · Kubernetes · Docker · Splunk · Prometheus · Grafana · Azure

Software Development Engineer III

LivePerson Inc., Seattle WA
Apr 2020 — Apr 2023
  • Led the migration from a traditional synchronous messaging system to a fully asynchronous Kafka-based architecture handling millions of events per minute
  • Stabilized the messaging platform by improving service metrics and deployment strategy across three core services
  • Spearheaded operational and migration efforts of the entire Kafka ecosystem — topics, consumer groups, Schema Registry, and Kafka Connect connectors
  • Built a monitoring stack with Prometheus for time-series metrics and created Grafana dashboards and alerts tracking performance, system, and business KPIs
Java · Spring Boot · Kafka · Couchbase · Redis · Oracle · MySQL · Elasticsearch · Prometheus · Grafana · Kibana · GCP · Kubernetes

Senior Software Engineer, Backend

eBay Inc., Bellevue WA
Sep 2017 — Mar 2020
  • Built scalable, performant, resilient microservices in Java/Spring Boot enhancing the checkout experience for eBay users globally
  • Refactored and enhanced Checkout & Shopping Cart APIs using Play framework and reactive programming while maintaining system stability
  • Collaborated with data science teams to build data management tools and ETL pipelines for processing user behavioral data
  • Improved project quality through comprehensive testing — unit, functional, and performance tests — and supported services with monitoring and diagnostic tooling
Java · Scala · Spring Boot · Play Framework · Cassandra · Oracle · MySQL · Kubernetes · Docker · Helm

Senior Software Engineer

JPMorgan Chase & Co., Seattle WA / Columbus OH
Oct 2016 — Aug 2017
  • Built a shared data platform supporting Corporate and Investment Bank (CIB) client services — designed for high-scale fault-tolerant analysis of large transactional data volumes
  • Developed a real-time data pipeline using Kafka, Spark Streaming, and Cassandra to power analytics over billions of financial transactions
  • Delivered high-performance, scalable, resilient microservices in Java and Scala supporting data scientist-facing and supplier-facing services
Java · Scala · Kafka · Spark Streaming · Spark SQL · Cassandra · Hadoop · Hive · Elasticsearch · Spring Boot

Senior Software Engineer, Backend

eBay Inc., Bellevue WA
Oct 2015 — Sep 2016
  • Designed and maintained applications supporting eBay's marketing platform — delivering personalized content, campaigns, and templates via email, mobile, and in-app notifications
  • Built and owned ETL data pipelines for processing user behavioral data at scale using Kafka, Apache Spark, Sqoop, and HDFS
Java · Scala · Kafka · Apache Spark · Cassandra · Redis · PostgreSQL · HDFS · MapReduce · Teradata

Earlier Experience

Citigroup Inc.  ·  GP INFOTECH PVT LTD
2010 — 2015
  • Citigroup Inc., Saint Louis MO (2014–2015): Java/J2EE Consultant. Implemented the CitiMortgage web application with high-performance SOAP web services and a Spring MVC translation layer.
  • GP INFOTECH PVT LTD, Hyderabad (2010–2012) — Java/J2EE Developer. Built end-to-end full-stack solutions for educational institutions across India.
Java · J2EE · Spring MVC · Hibernate · Oracle · JSP · WebSphere

Technical Skills

Cloud Platforms
Microsoft Azure · AWS · Google Cloud (GCP) · Azure CosmosDB · Azure SQL · Azure Blob Storage · S3
Data Stores
Apache Cassandra · Azure CosmosDB · Redis · MongoDB · Couchbase Server · HBase · PostgreSQL · MySQL · Oracle
Big Data
Apache Spark · Spark Streaming · Spark SQL · Hadoop / HDFS · Apache Hive · Apache Impala · Apache Sqoop · MapReduce · Teradata
Search & Observability
Elasticsearch · Prometheus · Grafana · Splunk · Kibana · OpenSearch · OpenTelemetry
Infrastructure & DevOps
Kubernetes · Docker · Helm · Terraform · Linux · GitHub · Jenkins · Bamboo

Education

Master of Science
Computer Science
Texas A&M University-Kingsville, Texas, USA
2013 — 2014
Bachelor of Technology
Computer Science & Engineering
Jawaharlal Nehru Technological University, Hyderabad, India
2006 — 2010

Featured Projects

Production · Current

Enterprise Membership Platform

Designed and owned a cloud-native membership platform built on Kafka event streaming, Azure CosmosDB for globally distributed state, Cassandra for high-throughput data, and Redis for low-latency caching, running on Kubernetes with full observability via Prometheus, Grafana, and Splunk.

Java · Spring Boot · Kafka · Azure CosmosDB · Cassandra · Redis · Kubernetes · Azure

Async Messaging System — LivePerson

Led migration from synchronous to async Kafka event architecture on GCP, processing millions of events per minute across three core services. Built Prometheus + Grafana observability stack tracking performance, system, and business SLAs in real time.

Java · Kafka · Couchbase · Redis · Prometheus · Grafana · GCP

eBay Checkout & Cart Platform

Scalable Java/Spring Boot microservices for eBay's global checkout flow using Play framework, reactive programming, and Cassandra for high-throughput cart state — with full unit, functional, and performance test coverage.

Java · Scala · Play Framework · Cassandra · Kubernetes · Helm

Real-Time Financial Pipeline — JPMorgan

Kafka + Spark Streaming + Cassandra pipeline for JPMorgan's Corporate & Investment Bank to analyze billions of financial transactions in real time, powering data scientist and supplier-facing analytics services.

Java · Scala · Kafka · Spark Streaming · Cassandra · Hadoop · Elasticsearch

Open to conversations

Schedule a Conversation


I'm open to conversations around senior/staff engineering roles, technical advisory, and architecture discussions in distributed systems, event-driven platforms, and cloud-native engineering at scale.