Revolutionizing Sleep: Scaling IoT Telemetry to 30+ Billion Daily Events by Deepika Sikri & Vikas Talegaonkar

ScyllaDB · 17 slides · Mar 07, 2025

About This Presentation

Sleep Number processes 30B+ sensor events daily to enhance sleep technology. This talk details a journey in scaling an IoT telemetry pipeline to handle exponential data growth, real-time processing, and high reliability—advancing sleep science for millions.


Slide Content

A ScyllaDB Community
Revolutionizing Sleep:
Scaling IoT Telemetry to
30+ Billion Daily Events
Deepika Sikri
Head of Engineering
Vikas Talegaonkar
Director of Engineering

Deepika Sikri (She/Her)
Head of Engineering

Driving innovative solutions and
scalable technologies to enhance
sleep experiences

Vikas Talegaonkar (He/Him)
Director of Engineering

A leader with experience in building,
scaling, and transforming ideas from vision
to execution

Introduction
In a world powered by big data, Sleep Number is transforming sleep
technology—processing over 30 billion sensor events daily to improve
the lives of millions.

This talk explores our journey to scale an IoT telemetry pipeline to
meet the challenges of exponential data growth, a rapidly expanding
user base, and real-time processing demands, all while ensuring
unwavering reliability.

Discover how we turned data into dreams.

Scale or Fail

The Challenge
Scaling Sleep Number’s IoT cloud infrastructure wasn’t just
about handling more data but about rewriting the playbook
for cost-efficient, resilient growth. As millions of smart
beds come online, the stakes couldn’t have been higher:

■Data Explosion: Billions of sensor events daily,
demanding real-time insights.
■User Surge: An ever-expanding customer base expecting
seamless performance.
■Resilience Under Pressure: A pipeline that could
withstand failures without skipping a beat.

The objective was clear: SCALEX—scale smarter, not
harder. The goal wasn't just to keep up but to stay ahead,
balancing cost and performance while building for the
future.

Identifying the Critical Bottleneck
Our telemetry datastore was a ‘pet’—high-maintenance
and ill-suited for scaling to meet evolving demands.

■Frequent Incidents
■Complex Debugging
■Escalating Costs
■High Operational Overhead
■Dependence on Specialized Skills
■Growing Big Data
■Data Strategy Evolution

Path Forward

Art of Datastore Selection
Moving the telemetry datastore from ‘pet’ to ‘cattle’

■Horizontal & Dynamic Scaling
■Managed/Serverless
■Multi-Tier Storage (Hot, Warm, and Cold)
■Data Strategy Evolution – Data Lakes
■CAP theorem

Scaling for Immunity Against Failures
Proactive Protection (Like Vaccines)
■Sharding and multi-write
■Redundancy (see the sketch after this list)

Reacting to External Events (Like Antigens)
■ Event-Driven Architecture

On-Demand Protection (Like a Boosted Immune System)
■Elastic Scaling
■Data Replication
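
To make the multi-write idea concrete, here is a minimal Python sketch of redundant parallel writes; the endpoint names and the write call are hypothetical placeholders, not the production implementation:

```python
# Illustrative multi-write: each event is written to two independent
# clusters in parallel, so a single cluster failure loses no data.
# Endpoint names and the write call are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

PRIMARY = "scylla-primary.internal:9042"
SECONDARY = "scylla-secondary.internal:9042"

def write_to(endpoint: str, event: dict) -> bool:
    # Stand-in for a real driver call against `endpoint`.
    print(f"wrote {event} to {endpoint}")
    return True

def multi_write(event: dict) -> bool:
    """Write the event to both clusters concurrently; succeed if either lands."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(lambda ep: write_to(ep, event), [PRIMARY, SECONDARY])
        return any(results)

multi_write({"user_id": 12345, "sensor": "pressure", "value": 42})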

Proactive System Design
Building for Resilience

Scaling Write Efficiency and Consistency
■Data Partitioning and User Stickiness
We implement a user ID-based sharding strategy to distribute data across multiple clusters. Each
shard is responsible for a specific subset of data, reducing the likelihood of write conflicts. User
stickiness is ensured through a deterministic modulo mapping:

shard_to_write = user_id % shard_count

This mapping guarantees that a user's data always goes to the same shard, maintaining consistency and improving read/write performance (a short sketch follows this section).
■Write Distribution
Write operations are routed to the appropriate shard using the sharding key (user ID). This ensures
that writes for a particular data subset always go to the same cluster, minimizing cross-cluster
operations and potential conflicts.
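
A minimal Python sketch of this modulo-based routing; the shard count and cluster endpoints are hypothetical, not Sleep Number's actual configuration:

```python
# Minimal sketch of user-ID-based shard routing (names are illustrative).
SHARD_COUNT = 4
CLUSTERS = [f"scylla-cluster-{i}.internal:9042" for i in range(SHARD_COUNT)]

def shard_for_user(user_id: int) -> int:
    """The same user always maps to the same shard (user stickiness)."""
    return user_id % SHARD_COUNT

def route_write(user_id: int, event: dict) -> str:
    """Pick the cluster that owns this user's shard for the write."""
    endpoint = CLUSTERS[shard_for_user(user_id)]
    # A real pipeline would issue the write against `endpoint` here;
    # returning it is enough to show the routing decision.
    return endpoint

# Example: user 12345 maps to shard 12345 % 4 == 1, and always will.
assert route_write(12345, {"pressure": 42}) == CLUSTERS[1]
```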

Managing Increasing Demand of Data Streaming
The Power of Asynchronous Data Transfer in IoT Pipelines Using Kafka
Asynchronous data transfer in IoT pipelines using Kafka is a crucial element for efficiency. It enables
multiple devices to transmit data concurrently, reducing latency and ensuring timely data processing.
Kafka's robust message queuing system effectively manages high data volumes, maintaining system
responsiveness and guaranteeing reliable delivery. This approach enhances scalability and improves
the resilience of IoT systems, making it an ideal solution for meeting the growing demands of data
streaming in dynamic environments.
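
As a concrete illustration, here is a minimal producer sketch using the open-source kafka-python client; the broker address, topic name, and tuning values are assumptions for illustration, not the production configuration:

```python
# Minimal asynchronous publish path with the kafka-python client.
# Broker, topic, and tuning values are assumptions for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],
    # Keying by user_id keeps each user's events in one partition,
    # preserving per-user ordering.
    key_serializer=lambda k: str(k).encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",    # wait for in-sync replication for reliable delivery
    linger_ms=10,  # small batching window trades a little latency for throughput
)

def publish_event(user_id: int, event: dict) -> None:
    """Non-blocking send; the client batches and transmits in the background."""
    producer.send("telemetry-events", key=user_id, value=event)

publish_event(12345, {"sensor": "pressure", "value": 42})
producer.flush()  # block until all buffered events are delivered
```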

Read Replicas and Resource Optimization
■Read Replicas and Scalability
Each cluster can have multiple read replicas to handle read traffic, improving overall system performance and scalability. This allows us to scale read capacity independently of write capacity.
■Divide and Conquer Strategies
To optimize resource allocation and workload distribution, we employ the following formulas:
■Partitions per Shard: partitions_per_shard = partition_count / shard_count
■Consumers per Shard: consumers_per_shard = consumer_count / shard_count
■Partitions per Consumer: partitions_per_consumer = partition_count / consumer_count
These calculations help us balance the workload across shards and ensure efficient utilization of resources; a small worked example follows.
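
A small worked example of these ratios in Python, with hypothetical counts chosen so the divisions come out even:

```python
# Worked example of the workload-distribution ratios above.
# Counts are hypothetical and chosen so the divisions come out even;
# real deployments round and rebalance any remainder.
partition_count = 240  # total Kafka partitions on the telemetry topic
shard_count = 4        # number of data clusters
consumer_count = 48    # total consumer instances

partitions_per_shard = partition_count // shard_count        # 60
consumers_per_shard = consumer_count // shard_count          # 12
partitions_per_consumer = partition_count // consumer_count  # 5

print(partitions_per_shard, consumers_per_shard, partitions_per_consumer)
```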

Sharding & Partitioning in Action

Smooth Transition
■Multi-Stage Development and Release
■Gradual multi-phase release, followed by a final cutover (the Big Bang)
■Dual Pipeline
■Smooth transition by running old and new systems in parallel, enabling gradual migration and risk
mitigation.
■Feature Flag
■Enables dynamic control over feature activation, allowing gradual rollouts, testing, and risk-free rollbacks.
■Traffic Controller
■Gateway Traffic Controller
■Incrementally shifting traffic from the old to the new system at controlled rates
■Move traffic gradually from the old to the new pipeline in successive steps of 1, 5, 10, 25, 40, 80, 90, and 100 percent (see the sketch after this list).
■Confidence Building
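
A minimal sketch of percentage-based traffic shifting as described above; the hash-based bucketing and stage values are illustrative, not the actual gateway implementation:

```python
# Illustrative percentage-based traffic controller for the dual-pipeline cutover.
# Hashing the user ID into a stable 0-99 bucket keeps each user's routing
# sticky: anyone on the new pipeline at 25% stays there at 40%.
import hashlib

ROLLOUT_STAGES = [1, 5, 10, 25, 40, 80, 90, 100]  # percent on the new pipeline

def bucket(user_id: int) -> int:
    """Deterministic 0-99 bucket per user."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def route(user_id: int, rollout_percent: int) -> str:
    """Route to the new pipeline once the user's bucket falls under the dial."""
    return "new-pipeline" if bucket(user_id) < rollout_percent else "old-pipeline"

print(route(12345, ROLLOUT_STAGES[3]))  # at the 25% stage
```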

Conclusion
Our journey in scaling Sleep Number's IoT infrastructure to process over 30 billion
sensor events daily has been transformative. We've created a highly scalable,
reliable, and cost-effective solution by transitioning our data store and implementing
innovative strategies such as efficient data aggregation, asynchronous data transfer
with an event-driven architecture, smart sharding, and a dynamic consumer architecture.

Stay in Touch
Deepika Sikri
[email protected]
Vikas Talegaonkar
[email protected]