Performance Summit III

How to be a Performance Badass

This talk centers on do's and don'ts, with a focus on individuals and how they can be successful in that world.

by Rico Mariani (Facebook)

Performance at a Macro Level

Better mobile app performance keeps more users engaged and helps achieve business goals. Often when we talk about performance, we look at the problem with a magnifying glass to find every little thing that might be contributing to the startup latency of an app. In this talk, we will cover how we approached making the UberEats app more performant, how we created a phased approach, and what we found. We will cover the entire stack, both mobile technologies and backend services, that helped us achieve our performance goals.

by Kayvan Najafzadeh (Uber)

Data Engineering at the Speed of Your Disk

Our best current disks can read data at gigabytes per second; the best networks are even faster. We should aim for data engineering tasks (data filtering, parsing, validation) to achieve similarly high speeds. Bottleneck tasks such as JSON ingestion can be much faster than they currently are.
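To make the comparison with disk speed concrete, here is a minimal sketch that measures how many gigabytes per second a parser gets through. It uses Python's standard `json` module on synthetic data as a stand-in; the function name, record layout, and record count are assumptions for illustration and are not taken from the talk.

```python
import json
import time


def json_parse_throughput(num_records: int = 20_000) -> float:
    """Return the throughput of json.loads in bytes per second."""
    # Synthetic records (hypothetical data, for illustration only).
    records = [
        {"id": i, "name": f"user{i}", "active": i % 2 == 0}
        for i in range(num_records)
    ]
    payload = json.dumps(records).encode("utf-8")

    # Time only the parsing step, not the construction of the payload.
    start = time.perf_counter()
    parsed = json.loads(payload)
    elapsed = time.perf_counter() - start

    # Sanity check that the whole document was parsed.
    assert len(parsed) == num_records
    return len(payload) / elapsed


throughput = json_parse_throughput()
print(f"json.loads: {throughput / 1e9:.3f} GB/s")
```

Setting the printed figure against the sequential read speed of the local disk gives a rough sense of whether parsing or I/O is the bottleneck on a given machine.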

Slides

by Daniel Lemire (Université du Québec)

Performance Testing for Firebase Cloud Messaging Backend

Firebase Cloud Messaging (FCM), formerly known as Google Cloud Messaging, is a cross-platform messaging solution for sending notifications to client apps. Performance testing for the messaging backend is challenging in several respects, including networking and authentication. In this talk, I will cover the challenges of the FCM performance testing infrastructure, the best practices applied to it, and how FCM developers use it for different testing purposes.

by Zijian Yao (Firebase/Google)

Understanding Kernel Scheduling Behavior with SchedViz

Kernel scheduling can be a significant source of latency problems: when a thread isn't running, it can't service requests or do anything else. SchedViz is a newly open-sourced tool that provides fine-grained visibility into kernel scheduling behavior and, increasingly, into other kernel phenomena as well. This talk will provide a brief walk-through of SchedViz, including how it works and what we have used it for.

by Lee Baugh (Google)

Solving Reliability Challenges with Blackbox

Blackbox is a mobile instrumentation framework designed to capture context leading up to an error site. In this talk, we discuss how Facebook is using Blackbox to tackle functional bugs and crashes in our apps.

by Phuong Nguyen (Facebook)

Visual Completion measurement on Web

Visual Completion is a new approach to measuring user-centric metrics for RUM (Real User Monitoring) logging. It can track the user-perceived visual performance of full page loads, in-app navigations, and user interactions. In contrast to most traditional latency measurements, which capture only the start and end time of a task, Visual Completion considers the display timing of elements at the pixel-count level in a progressive web-app rendering architecture. In addition to tracking visual performance, we also measure TTI (Time-To-Interactive) to collect performance signals for app responsiveness.

Slides

by Wooseok Jeong (Facebook)

Applying Statistics to Root-Cause Analysis

As systems get more complex, reasoning about performance gets more difficult. Telemetry data emitted by our services is noisy and usually unhelpful in stressful situations. Distributed Tracing, in particular, can provide rich, contextual data but root-cause analysis can still be convoluted. In this talk, I'll review a few statistics-based approaches we have applied to help quickly identify which properties of the system are correlated with performance issues. In order to support this type of aggregate trace analysis, we need data, but data isn't cheap. We want to gather only the relevant traces and bias towards traces that have abnormal behavior. I'll also talk about a few sampling approaches we use for analysis to minimize cost and overhead.
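As one illustration of what a statistics-based approach of this kind could look like, the sketch below ranks trace tags by how much more often they appear in slow traces than in fast ones. This is a deliberately simple stand-in, not the speaker's method: the trace layout (`latency_ms`, `tags`), the function name, and the fixed latency threshold are all assumptions made for the example.

```python
from collections import defaultdict


def rank_tags_by_slow_rate(traces, latency_threshold_ms):
    """Rank (key, value) tags by their share in slow traces minus fast traces."""
    slow_counts = defaultdict(int)
    fast_counts = defaultdict(int)
    slow_total = 0
    fast_total = 0

    for trace in traces:
        is_slow = trace["latency_ms"] > latency_threshold_ms
        if is_slow:
            slow_total += 1
        else:
            fast_total += 1
        counts = slow_counts if is_slow else fast_counts
        for tag in trace["tags"].items():
            counts[tag] += 1

    scores = []
    for tag in set(slow_counts) | set(fast_counts):
        slow_share = slow_counts[tag] / slow_total if slow_total else 0.0
        fast_share = fast_counts[tag] / fast_total if fast_total else 0.0
        scores.append((slow_share - fast_share, tag))

    # Highest score first: tags most over-represented among slow traces.
    return sorted(scores, key=lambda item: item[0], reverse=True)


# Hypothetical traces, for illustration only.
traces = [
    {"latency_ms": 40, "tags": {"region": "us-east", "version": "v1"}},
    {"latency_ms": 45, "tags": {"region": "us-west", "version": "v1"}},
    {"latency_ms": 900, "tags": {"region": "us-east", "version": "v2"}},
    {"latency_ms": 950, "tags": {"region": "us-west", "version": "v2"}},
    {"latency_ms": 50, "tags": {"region": "us-east", "version": "v1"}},
]
ranked = rank_tags_by_slow_rate(traces, latency_threshold_ms=500)
print(ranked[0])
```

In this toy data set every slow trace carries `version=v2` and no fast trace does, so that tag ranks first. Real telemetry would call for significance testing and for care around the sampling bias the abstract mentions.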

Slides

by Karthik Kumar (Lightstep)

Uplevelling Understanding with Transient Analysis

Performance on mobile devices is often heavily dependent on the efficient use of shared resources like network bandwidth and RAM and orchestration between disparate components that rely on them. Understanding the (often surprising) conditions that arise “in the wild,” their prevalence across your user population, and therefore how client code should optimally adapt to perform best under various transient conditions is very challenging. Transient Analysis is a methodology and toolset we’ve built to enable this type of understanding by modeling expected domain-specific behaviors, processing telemetry to characterize adherence to or divergence from these expectations at scale, and linking this analysis to actionable insights and visualizations of actual examples of problematic behavior. This session will walk through the development lifecycle of such an analysis and demo the tooling that enables it.

by George Hoffman (Facebook)

The Sociotechnical Path to High-Performing Teams (Begins With Observability)

"Observability" is everywhere these days, but what does it actually mean? Is it just a new marketing term for the same old monitoring we've always done? Are there three pillars, or no pillars? It's enough to make anyone cranky and cynical about the motives of those involved. I'll give a brief history of observability and control systems theory, make a pitch for the precise technical definition of observability, and explain how it differs from monitoring and other telemetry -- and why it has recently suddenly become so shudderingly relevant to us all. I will discuss the second-order technical implications and effects of the definition I espouse, and describe the characteristics of tools we must build to understand the systems of tomorrow. We are far behind where we should be as a profession when it comes to how much of our effort is wasted on crap that doesn't move the business forward, and this is in large part because our ability to understand our systems is so wretched -- and we don't even know it. Let's fix that.

by Charity Majors (Honeycomb)

Using BPF for lightweight Android profiling

BPF gives you the power to understand application performance in ways that were not possible before. It is the newest tool the Mobile Profilers team is using to understand application performance and detect regressions in Consumption Metrics on Android devices. In this talk, we will discuss the powers of BPF and how we are using it for lightweight and dynamic profiling.

Slides

by Riham Selim (Facebook)

FlameCommander: Netflix’s cloud profiler

Even under constant load, the behavior of a system is affected by variance, perturbations, single-threaded execution and other time-based issues, and is never completely uniform, making the analysis of these small variations a needle-in-a-haystack problem. FlameScope solved this problem by combining a subsecond-offset heatmap, for navigating a profile and visualizing these perturbations, with a flame graph for code-path analysis. This talk focuses on how FlameScope, the open-source profile visualization tool, evolved into FlameCommander, a full-fledged cloud profiling solution used by thousands of engineers at Netflix.
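The idea behind a subsecond-offset heatmap can be sketched in a few lines: each column is one second of the profile, each row is a slice of that second, and each cell counts the samples that landed there. The code below is an illustrative sketch under those assumptions, not FlameScope's implementation; the function name and the default row count are hypothetical.

```python
import math


def subsecond_offset_heatmap(timestamps, rows=50):
    """Bin sample timestamps (in seconds) into a rows x seconds grid of counts."""
    if not timestamps:
        return []

    first_second = math.floor(min(timestamps))
    last_second = math.floor(max(timestamps))
    cols = last_second - first_second + 1
    grid = [[0] * cols for _ in range(rows)]

    for ts in timestamps:
        whole = math.floor(ts)
        col = whole - first_second
        # Offset within the second, scaled to a row index and clamped.
        row = min(int((ts - whole) * rows), rows - 1)
        grid[row][col] += 1
    return grid


# Hypothetical sample timestamps, for illustration only.
samples = [10.05, 10.07, 10.95, 11.5, 12.51]
grid = subsecond_offset_heatmap(samples, rows=10)
for row in grid:
    print(row)
```

A burst of samples that recurs at the same offset in every second shows up as a horizontal band in the grid, which is the kind of perturbation that is hard to spot in an aggregate flame graph.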

Slides

by Martin Spier (Netflix)

Faster Data-Center Apps with BOLT

Code-layout optimizations are paramount for optimal performance of large data-center applications. In this talk, I will cover multiple approaches to improving the code layout of an application, introduce BOLT, an open-source binary optimization tool, and walk through the challenges of deploying it at Facebook scale. Lastly, I will share the plans for seamless integration of the binary optimization technology into the server application space.

by Maksim Panchenko (Facebook)
