
Research

Research Mission

Our main research approach is to gain a deep understanding of target systems and applications' needs, and to rethink systems' core abstractions based on that understanding; we iterate between these two activities. We are motivated by real problems experienced by developers and operators in practice. With these first-hand experiences, we perform comprehensive measurements to pinpoint system bottlenecks, characterize application behaviors and their needs, and understand practitioners' pain points.

A systems solution often requires designing a new abstraction or adapting an existing one in a novel way (e.g., in a new scenario, or under new settings or assumptions). We take different routes when designing new systems solutions. On one hand, we enjoy the freedom of blue-sky research, openly questioning the fundamentals underlying distributed systems in the face of rapidly emerging needs and challenges. For example, we rethink the design of modern cloud storage systems and build the first cost-effective cloud caching system that exploits serverless functions as a novel storage medium. On the other hand, we seek practical and easily deployable solutions to otherwise sophisticated systems problems. Distributed systems often involve complex cross-component interactions to support cross-cutting tasks such as scheduling. This leads us to gravitate toward simple, general solutions that solve not one specific problem but serve a broad class of applications. For instance, we design a new function scheduler that bridges the divide between user-space scheduling and kernel scheduling while remaining transparent to the underlying serverless platform.


Research Projects

We are committed to open science. We make our research and its dissemination, including publications, datasets, and software artifacts, accessible to support use and development both in academia and industry.

The research artifacts are publicly available at github.com/ds2-lab.

Redesigning FaaS Platforms

Custom FaaS container support is gaining traction as it enables better control over OSes, versioning, and tooling for modernizing FaaS applications. Our research aims to build a scalable FaaS container platform that offers fast container provisioning for dependency-heavy FaaS applications.

Serverless Storage

We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching. We build InfiniCache, the first in-memory object caching system built and deployed atop ephemeral serverless functions. Stay tuned for more results.

Serverless Analytics

Running complex data analytics jobs on FaaS platforms is appealing but also poses challenges for serverless execution frameworks, which must rapidly scale and schedule tasks. Our research pioneers serverless data analytics frameworks that make life easier for data scientists.

Serverless OS Scheduling

The execution time of serverless functions is typically short and thus sensitive to resource contention. The CPU schedulers of today's mainstream operating systems are simply not designed for short-job-dominant FaaS workloads. Our research proposes new scheduling techniques to address this mismatch.

Federated Learning Systems

Federated learning enables training a shared model across many clients without violating their privacy requirements. This learning approach introduces interesting challenges across the whole system stack. Our research aims to address these challenges to enable scalable, efficient, and robust federated learning.

Graph Learning Systems

It is challenging to train a graph neural network (GNN) on large graphs, which are prevalent in today's applications such as social networks and recommender systems. Our research designs new distributed GNN training methods that are scalable and efficient.


Research Sponsors

We are grateful for the generous support from our sponsors, including the National Science Foundation, Adobe Research, CloudBank, Amazon Web Services, Google Cloud, and IBM Cloud.


SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications

  • Award Info: National Science Foundation Award CCF-1919075
  • PI: Yue Cheng
  • Funding Amount: $320,603

OAC Core: SMALL: DeepJIMU: Model-Parallelism Infrastructure for Large-scale Deep Learning by Gradient-Free Optimization

  • Award Info: National Science Foundation Award OAC-2007976
  • PIs: Liang Zhao (Emory), Yue Cheng
  • Funding Amount: $498,609

MRI: Acquisition of an Adaptive Computing Infrastructure to Support Compute- and Data-Intensive Multidisciplinary Research

  • Award Info: National Science Foundation Award MRI-2018631
  • PIs: Elise Miller-Hooks (GMU), Shobita Satyapal (GMU), Maria Emelianenko (GMU), Yue Cheng, Jayshree Sarma (GMU)
  • Funding Amount: $750,000

CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure

  • Award Info: National Science Foundation Award CNS-2045680
  • PI: Yue Cheng
  • Funding Amount: $572,897 + $16,000 REU

FMSG: Cyber: Federated Deep Learning for Future Ubiquitous Distributed Additive Manufacturing

  • Award Info: National Science Foundation Award CMMI-2134689
  • PIs: Jia Liu (Auburn), Nima Shamsaei (Auburn), Yue Cheng
  • Funding Amount: $498,762

Serverless Storage Management for Large-scale Analytics Workloads

  • Award Info: Adobe Research Gift
  • PI: Yue Cheng
  • Funding Amount: $40,000