Research scientist at Google
My mission is to build theory and tools that make the practice of machine learning across science and industry more reliable, transparent, and effective.
Email: m(at)mrtz(dot)org
Twitter: @mrtz
Blogs: Moody Rd, Off the convex path
Recent Activities:
– NIPS 2015 Workshop on Adaptive Data Analysis
– ICML 2015 Workshop on Fairness, Accountability and Transparency in Machine Learning
– NIPS 2014 Workshop on Fairness, Accountability and Transparency in Machine Learning
– Visiting Scientist at the Simons Institute for Theoretical Computer Science (Fall 2013)
Interns and long-term visitors: Chiyuan Zhang (MIT), Roy Frostig (Stanford), Tengyu Ma (Princeton), Eric Price (UT Austin), Mary Wootters (U Michigan)
Miscellaneous: I cycle. You can follow me on Strava.
Links: arXiv
Abstract: Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Links: arXiv
Abstract: An emerging design principle in deep learning is that each layer of a deep artificial neural network should be able to easily express the identity transformation. This idea not only motivated various normalization techniques, such as batch normalization, but was also key to the immense success of residual networks. In this work, we put the principle of identity parameterization on a more solid theoretical footing alongside further empirical progress. We first give a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima. The same result for linear feed-forward networks in their standard parameterization is substantially more delicate. Second, we show that residual networks with ReLU activations have universal finite-sample expressivity in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size. Directly inspired by our theory, we experiment with a radically simple residual architecture consisting of only residual convolutional layers and ReLU activations, but no batch normalization, dropout, or max pooling. Our model improves significantly on previous all-convolutional networks on the CIFAR10, CIFAR100, and ImageNet classification benchmarks.
Links: arXiv
Abstract: We propose a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features. Assuming data about the predictor, target, and membership in the protected group are available, we show how to optimally adjust any learned predictor so as to remove discrimination according to our definition. Our framework also improves incentives by shifting the cost of poor classification from disadvantaged groups to the decision maker, who can respond by improving the classification accuracy. In line with other studies, our notion is oblivious: it depends only on the joint statistics of the predictor, the target and the protected attribute, but not on interpretation of individual features. We study the inherent limits of defining and identifying biases based on such oblivious measures, outlining what can and cannot be inferred from different oblivious tests. We illustrate our notion using a case study of FICO credit scores.
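As a rough illustration of adjusting a learned predictor after training, the Python sketch below picks a separate score threshold for each group so that every group reaches at least a target true positive rate. It assumes the criterion is equality of true positive rates across groups, which the abstract does not spell out, and it is not the optimal construction from the paper. The names `true_positive_rate`, `group_thresholds`, and `target_tpr` are hypothetical.

```python
def true_positive_rate(scores, labels, threshold):
    """Fraction of positive examples (label 1) scored at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)


def group_thresholds(scores, labels, groups, target_tpr):
    """For each group, return the highest threshold whose TPR is at least target_tpr.

    Assumption: every group contains at least one positive example.
    """
    thresholds = {}
    for g in set(groups):
        g_scores = [s for s, a in zip(scores, groups) if a == g]
        g_labels = [y for y, a in zip(labels, groups) if a == g]
        # Lowering the threshold can only raise the TPR, so scan from the top
        # and stop at the first candidate that meets the target.
        for t in sorted(set(g_scores), reverse=True):
            if true_positive_rate(g_scores, g_labels, t) >= target_tpr:
                thresholds[g] = t
                break
    return thresholds
```

With deterministic thresholds the per-group rates generally only meet the target rather than match exactly; matching them exactly may require randomizing between thresholds, which this sketch leaves out.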
Links: arXiv
Abstract: We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.
Links: arXiv
Abstract: We show that any model trained by a stochastic gradient method with few iterations has vanishing generalization error. We prove this by showing the method is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. Our results apply to both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new explanations for why multiple epochs of stochastic gradient descent generalize well in practice. In the nonconvex case, we provide a new interpretation of common practices in neural networks, and provide a formal rationale for stability-promoting mechanisms in training large, deep models. Conceptually, our findings underscore the importance of reducing training time beyond its obvious benefit.
Links: arXiv
Abstract: Machine learning relies on the assumption that unseen test instances of a classification problem follow the same distribution as observed training data. However, this principle can break down when machine learning is used to make important decisions about the welfare (employment, education, health) of strategic individuals. Knowing information about the classifier, such individuals may manipulate their attributes in order to obtain a better classification outcome. As a result of this behavior, often referred to as gaming, the performance of the classifier may deteriorate sharply. Indeed, gaming is a well-known obstacle for using machine learning methods in practice; in financial policy-making, the problem is widely known as Goodhart's law. In this paper, we formalize the problem, and pursue algorithms for learning classifiers that are robust to gaming. We model classification as a sequential game between a player called "Jury" and a player called "Contestant". Jury designs a classifier, and Contestant receives an input to the classifier drawn from a distribution. Before being classified, Contestant may change his input based on Jury's classifier. However, Contestant incurs a cost for these changes according to a cost function. Jury's goal is to achieve high classification accuracy with respect to Contestant's original input and a known ideal classification function, assuming Contestant plays best response. Contestant's goal is to achieve a favorable classification outcome while taking into account the cost of achieving it. For a natural class of separable cost functions, and certain generalizations, we obtain computationally efficient learning algorithms which are near optimal, achieving a classification error that is arbitrarily close to the theoretical minimum.
Surprisingly, the only assumption we place on the target function is that it stems from a statistically learnable concept class, a minimal assumption even permitting concept classes that are computationally hard to learn. For general cost functions, designing an approximately optimal strategy-proof classifier, for an inverse polynomial approximation, is NP-hard.
Links: Springer
Abstract: Spectral analysis refers to a family of popular and effective methods that analyze an input matrix by exploiting information about its eigenvectors or singular vectors. Applications include principal component analysis, low-rank approximation, and spectral clustering. Many of these applications are commonly performed on data sets that feature sensitive information such as patient records in a medical study. In such cases privacy is a major concern. Differential privacy is a powerful general-purpose privacy definition. This entry explains how differential privacy may be applied to the task of approximately computing the top singular vectors of a matrix.
Links: DOI
Abstract: Misapplication of statistical data analysis is a common cause of spurious discoveries in scientific research. Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. In common practice, however, data analysis is an intrinsically adaptive process, with new analyses generated on the basis of data exploration, as well as the results of previous analyses on the same data. We demonstrate a new approach for addressing the challenges of adaptivity based on insights from privacy-preserving data analysis. As an application, we show how to safely reuse a holdout data set many times to validate the results of adaptively chosen analyses.
Links: arXiv
Abstract: Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions. In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment. We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach given in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce.
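The idea of reusing a holdout set can be sketched as a noisy comparison between the training estimate and the holdout estimate of a quantity. The Python sketch below is a simplified illustration in that spirit, not the exact algorithm from the paper: the threshold, the noise scales, and the names `thresholdout` and `laplace` are illustrative assumptions, and the budget bookkeeping a full method would need is omitted.

```python
import random


def laplace(rng, scale):
    """Sample Laplace noise as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout(train_value, holdout_value, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query given its average on the training set and on the holdout set.

    If the two estimates agree up to a noisy threshold, return the training
    estimate and reveal nothing new about the holdout. Otherwise return the
    holdout estimate with a small amount of added noise.
    """
    rng = rng or random.Random()
    gap = abs(holdout_value - train_value)
    if gap > threshold + laplace(rng, 4 * sigma):  # noise scale is an assumption
        return holdout_value + laplace(rng, sigma)
    return train_value
```

The design choice to highlight is that the holdout is only consulted "for real" when the analyst's training estimate has drifted away from it, which limits how much information about the holdout leaks through repeated queries.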
Links: Forthcoming.
Abstract: We investigate the problem of learning an unknown probability distribution over a discrete population from random samples. Our goal is to design efficient algorithms that simultaneously achieve low error in total variation norm while guaranteeing Differential Privacy to the individuals of the population.
We describe a general approach that yields near sample-optimal and computationally efficient differentially private estimators for a wide range of well-studied and natural distribution families. Our theoretical results show that for a wide variety of structured distributions there exist private estimation algorithms that are nearly as efficient, both in terms of sample size and running time, as their non-private counterparts. We complement our theoretical guarantees with an experimental evaluation. Our experiments illustrate the speed and accuracy of our private estimators on synthetic mixture models as well as on a large public data set.
Links: arXiv
Abstract: The organizer of a machine learning competition faces the problem of maintaining an accurate leaderboard that faithfully represents the quality of the best submission of each competing team. What makes this estimation problem particularly challenging is its sequential and adaptive nature. As participants are allowed to repeatedly evaluate their submissions on the leaderboard, they may begin to overfit to the holdout data that supports the leaderboard. Few theoretical results give actionable advice on how to design a reliable leaderboard. Existing approaches therefore often resort to poorly understood heuristics such as limiting the bit precision of answers and the rate of re-submission.
In this work, we introduce a notion of leaderboard accuracy tailored to the format of a competition. We introduce a natural algorithm called Ladder and demonstrate that it simultaneously supports strong theoretical guarantees in a fully adaptive model of estimation, withstands practical adversarial attacks, and achieves high utility on real submission files from an actual competition hosted by Kaggle.
Notably, we are able to sidestep a powerful recent hardness result for adaptive risk estimation that rules out algorithms such as ours under a seemingly very similar notion of accuracy. On a practical note, we provide a completely parameter-free variant of our algorithm that can be deployed in a real competition with no tuning required whatsoever.
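A minimal Python sketch of a ladder-style leaderboard follows. It assumes the core rule is to report a new score only when a submission beats the best previous one by more than a step size, and to round reported scores to that step size. The function name `ladder` and the parameter `step` are hypothetical, and the parameter-free variant mentioned above is not shown.

```python
import math


def ladder(losses, step=0.01):
    """Return the leaderboard value shown after each submission.

    `losses` holds the holdout loss of each submission in order (lower is
    better). The displayed value changes only on a clear improvement, so
    small fluctuations on the holdout set are never revealed.
    """
    best = math.inf
    reported = []
    for loss in losses:
        if loss < best - step:
            # Round to the nearest multiple of the step size before reporting.
            best = round(loss / step) * step
        reported.append(best)
    return reported
```

For example, `ladder([0.9, 0.8, 0.5, 0.45, 0.1], step=0.25)` updates the displayed value on the first, third, and fifth submissions only.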
Links: arXiv
Abstract: A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses.
In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of $m$ adaptively chosen functions on an unknown distribution given $n$ random samples.
We show that, surprisingly, there is a way to estimate an exponential in $n$ number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Abstract: We consider the problem of identifying the parameters of an unknown mixture of two arbitrary d-dimensional Gaussians from a sequence of random samples. Our main result is a computationally efficient moment-based estimator with an optimal convergence rate, thus resolving a problem introduced by Pearson (1894). Denoting by σ^{2} the variance of the unknown mixture, we prove that Θ(σ^{12}) samples are necessary and sufficient to estimate each parameter up to constant additive error when d=1. Our upper bound extends to arbitrary dimension d up to a (necessary) logarithmic loss in d using a novel yet simple dimensionality reduction technique.
Strikingly, our estimator turns out to be very similar to the one Pearson proposed in 1894, which reduces the one-dimensional problem to solving and analyzing a tractable system of polynomial equations. Our result greatly improves on the exponent in the sample size of the best previous estimator due to Kalai, Moitra and Valiant (2010).
Abstract: We provide a new robust convergence analysis of the well-known power method for computing the dominant singular vectors of a matrix that we call the noisy power method. Our result characterizes the convergence behavior of the algorithm when a significant amount of noise is introduced after each matrix-vector multiplication. The noisy power method can be seen as a meta-algorithm that has recently found a number of important applications in a broad range of machine learning problems including alternating minimization for matrix completion, streaming principal component analysis (PCA), and privacy-preserving spectral analysis. Our general analysis subsumes several existing ad-hoc convergence bounds and resolves a number of open problems in these multiple applications.
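The iteration described above can be sketched in a few lines of Python for the single-vector case: multiply by the matrix, add noise, renormalize. The sketch assumes a symmetric input matrix and independent Gaussian noise; the name `noisy_power_method` and its parameters are hypothetical, and the block version with orthonormalization is not shown.

```python
import math
import random


def noisy_power_method(A, iterations=50, noise_std=1e-3, seed=0):
    """Approximate the dominant eigenvector of a symmetric matrix A (list of rows).

    After every matrix-vector product, Gaussian noise with standard deviation
    `noise_std` is added to each coordinate before renormalizing.
    """
    rng = random.Random(seed)
    n = len(A)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    for _ in range(iterations):
        # Noisy matrix-vector multiplication.
        y = [
            sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std)
            for i in range(n)
        ]
        # Renormalize so the iterate stays on the unit sphere.
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x
```

When the noise is small relative to the gap between the top two eigenvalues, the iterate stays close to the dominant eigenvector, up to sign.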
Links: arXiv
Abstract: We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to n^{3+o(1)} adaptively chosen statistical queries. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is valid if it is "close" to the correct expectation over the distribution.
Our result stands in stark contrast to the well-known fact that exponentially many statistical queries can be answered validly and efficiently if the queries are chosen non-adaptively (no query may depend on the answers to previous queries). Moreover, Dwork et al. [DFH+14] showed how to accurately answer exponentially many adaptively chosen statistical queries via a computationally inefficient algorithm. They also gave an efficient algorithm that can answer O(n^2) adaptively chosen queries, which shows our result is almost quantitatively tight.
Conceptually, our result demonstrates that achieving statistical validity alone can be a source of computational intractability in adaptive settings. For example, in the modern large collaborative research environment, data analysts typically choose a particular approach based on previous findings. False discovery occurs if a research finding is supported by the data but not by the underlying distribution. While the study of preventing false discovery in Statistics is decades old, to the best of our knowledge our result is the first to demonstrate a computational barrier. In particular, our result suggests that the perceived difficulty of preventing false discovery in today's collaborative research environment may be inherent.
Links: arXiv
Abstract: Alternating Minimization is a widely used and empirically successful framework for Matrix Completion and related low-rank optimization problems. We give a new algorithm based on Alternating Minimization that provably recovers an unknown low-rank matrix from a random subsample of its entries under a standard incoherence assumption while achieving a linear convergence rate. Compared to previous work our results reduce the provable sample complexity requirements of the Alternating Minimization approach by at least a quartic factor in the rank and the condition number of the unknown matrix. These improvements apply when the matrix is exactly low-rank and when it is only close to low-rank in the Frobenius norm.
Underlying our work is a new robust convergence analysis of the well-known Subspace Iteration algorithm for computing the dominant singular vectors of a matrix also known as the Power Method. This viewpoint leads to a conceptually simple understanding of Alternating Minimization that we exploit. Additionally, we contribute a new technique for controlling the coherence of intermediate solutions arising in iterative algorithms. These techniques may be of interest beyond their application here.
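For orientation, the basic alternating minimization framework can be sketched in Python for the rank-one case: fix one factor, solve a least-squares problem for the other over the observed entries, and repeat. This is plain alternating least squares, not the modified algorithm analyzed in the paper, and the name `alternating_minimization` and the dictionary input format are assumptions made for the example.

```python
def alternating_minimization(observed, n_rows, n_cols, iterations=50):
    """Fit a rank-one model M[i][j] ~ u[i] * v[j] to the observed entries.

    `observed` maps (row, column) pairs to known values. Each half-step has a
    closed-form least-squares solution because the other factor is held fixed.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v and update every u[i] from the observed entries in row i.
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            if den > 0:
                u[i] = num / den
        # Fix u and update every v[j] from the observed entries in column j.
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            if den > 0:
                v[j] = num / den
    return u, v
```

Missing entries are then predicted as `u[i] * v[j]`. The individual factors are only determined up to a common rescaling, so only their product is meaningful.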
Links: JMLR (PDF)
Abstract: We give the first algorithm for Matrix Completion whose running time and sample complexity are polynomial in the rank of the unknown target matrix, linear in the dimension of the matrix, and logarithmic in the condition number of the matrix. To the best of our knowledge, all previous algorithms either incurred a quadratic dependence on the condition number of the unknown matrix or a quadratic dependence on the dimension of the matrix in the running time. Our algorithm is based on a novel extension of Alternating Minimization which we show has theoretical guarantees under standard assumptions even in the presence of noise.
Links: arXiv
Abstract: Matrix Completion is the problem of recovering an unknown real-valued low-rank matrix from a subsample of its entries. Important recent results show that the problem can be solved efficiently under the assumption that the unknown matrix is incoherent and the subsample is drawn uniformly at random. Are these assumptions necessary?
It is well known that Matrix Completion in its full generality is NP-hard. However, little is known if we make additional assumptions such as incoherence and permit the algorithm to output a matrix of slightly higher rank. In this paper we prove that Matrix Completion remains computationally intractable even if the unknown matrix has rank $4$ but we are allowed to output any constant rank matrix, and even if additionally we assume that the unknown matrix is incoherent and are shown $90\%$ of the entries. This result relies on the conjectured hardness of the $4$-Coloring problem. We also consider the positive semidefinite Matrix Completion problem. Here we show a similar hardness result under the standard assumption that $\mathrm{P}\ne \mathrm{NP}.$
Our results greatly narrow the gap between existing feasibility results and computational lower bounds. In particular, we believe that our results give the first complexity-theoretic justification for why distributional assumptions are needed beyond the incoherence assumption in order to obtain positive results. On the technical side, we contribute several new ideas on how to encode hard combinatorial problems in low-rank optimization problems. We hope that these techniques will be helpful in further understanding the computational limits of Matrix Completion and related problems.
Links: PDF, arXiv (the arXiv version is missing a plot; see the PDF instead)
Abstract: Linear sketches are powerful algorithmic tools that turn an n-dimensional input into a concise lower-dimensional representation via a linear transformation. Such sketches have seen a wide range of applications including norm estimation over data streams, compressed sensing, and distributed computing. In almost any realistic setting, however, a linear sketch faces the possibility that its inputs are correlated with previous evaluations of the sketch. Known techniques no longer guarantee the correctness of the output in the presence of such correlations. We therefore ask: Are linear sketches inherently non-robust to adaptively chosen inputs? We give a strong affirmative answer to this question. Specifically, we show that no linear sketch approximates the Euclidean norm of its input to within an arbitrary multiplicative approximation factor on a polynomial number of adaptively chosen inputs. The result remains true even if the dimension of the sketch is d = n - o(n) and the sketch is given unbounded computation time. Our result is based on an algorithm with running time polynomial in d that adaptively finds a distribution over inputs on which the sketch is incorrect with constant probability. Our result implies several corollaries for related problems including lp-norm estimation and compressed sensing. Notably, we resolve an open problem in compressed sensing regarding the feasibility of l2/l2-recovery guarantees in the presence of computationally bounded adversaries.
Links: arXiv
Abstract: We consider differentially private approximate singular vector computation. Known worst-case lower bounds show that the error of any differentially private algorithm must scale polynomially with the dimension of the singular vector. We are able to replace this dependence on the dimension by a natural parameter known as the coherence of the matrix that is often observed to be significantly smaller than the dimension both theoretically and empirically. We also prove a matching lower bound showing that our guarantee is nearly optimal for every setting of the coherence parameter. Notably, we achieve our bounds by giving a robust analysis of the well-known power iteration algorithm, which may be of independent interest. Our algorithm also leads to improvements in worst-case settings and to better low-rank approximations in the spectral norm.
Links: arXiv
Abstract: We consider a fundamental problem in unsupervised learning: given a collection of m points in R^n, if many but not necessarily all of these points are contained in a d-dimensional subspace T, can we find it? The points contained in T are called inliers and the remaining points are outliers. This problem has received considerable attention in computer science and in statistics. Yet efficient algorithms from computer science are not robust to adversarial outliers, and the estimators from robust statistics are hard to compute in high dimensions. This is a serious and persistent issue not just in this application, but for many other problems in unsupervised learning. Are there algorithms for linear regression that are both robust to outliers and efficient? We give an algorithm that finds T when it contains more than a d/n fraction of the points. Hence, for say d = n/2 this estimator is both easy to compute and well-behaved when there is a constant fraction of outliers. We prove that it is small set expansion hard to find T when the fraction of errors is any larger and so our estimator is an optimal compromise between efficiency and robustness. In fact, this basic problem has a surprising number of connections to other areas including small set expansion, matroid theory and functional analysis that we make use of here.
Links: arXiv
Abstract: We provide a new robust convergence analysis of the well-known subspace iteration algorithm for computing the dominant singular vectors of a matrix, also known as simultaneous iteration or power method. Our result characterizes the convergence behavior of the algorithm when a large amount of noise is introduced after each matrix-vector multiplication. While interesting in its own right, our main motivation comes from the problem of privacy-preserving spectral analysis where noise is added in order to achieve the privacy guarantee known as differential privacy. Our contributions here are twofold:
– We give nearly tight worst-case bounds for the problem of computing a differentially private low-rank approximation in the spectral norm. Our results extend to privacy-preserving principal component analysis and apply to several variants of differential privacy that have been considered in the past. The running time of our algorithm is nearly linear in the input sparsity leading to strong improvements in running time over previous work while almost matching existing worst-case lower bounds on the amount of error that is required by any differentially private algorithm.
– Complementing our worst-case bounds, we show that the error dependence of our algorithm on the matrix dimension can be replaced by an essentially tight dependence on the coherence of the matrix. This result resolves the main problem left open by previous results in this line of work. The coherence is always bounded by the matrix dimension but often substantially smaller thus leading to significant average-case improvements over our worst-case bounds.
Abstract: Data is only as good as the similarity metric used to compare it. The all-important notion of similarity allows us to leverage knowledge derived from prior observations to predict characteristics of new samples. In this paper we consider the problem of compiling a consistent and accurate view of similarity given its multiple incomplete and noisy approximations. We propose a new technique called Multiple Kernel Completion (MKC), which completes given similarity kernels as well as finds their best combination within a Support Vector Machine framework, so as to maximize the discrimination margin. We demonstrate the effectiveness of the proposed technique on datasets from the UCI Machine Learning repository as well as for the task of heart valve disease discrimination using CW Doppler images. Our empirical results establish that MKC consistently outperforms existing data completion methods like 0-imputation, mean-imputation and matrix completion across datasets and training set sizes.
Abstract: We present a new algorithm for differentially private data release, based on a simple combination of the Exponential Mechanism with the Multiplicative Weights update rule. Our MWEM algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques.
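The combination described above can be sketched in Python as a loop with three steps per round: pick a poorly answered query with the Exponential Mechanism, measure it with Laplace noise, and apply a Multiplicative Weights update to a synthetic histogram. This is a simplified sketch, not a vetted private implementation. The way the privacy budget is split across rounds, the update constant, and the names `mwem`, `evaluate`, and `laplace` are assumptions made for the example; queries are assumed to be 0/1 vectors over a small discrete domain.

```python
import math
import random


def laplace(rng, scale):
    """Sample Laplace noise as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def evaluate(query, hist):
    """Answer a linear query, given as a 0/1 vector, on a histogram."""
    return sum(q * h for q, h in zip(query, hist))


def mwem(true_hist, queries, rounds=10, epsilon=1.0, seed=0):
    """Return a synthetic histogram that approximately answers the queries."""
    rng = random.Random(seed)
    n = sum(true_hist)
    size = len(true_hist)
    approx = [n / size] * size           # start from the uniform histogram
    total = [0.0] * size
    eps_round = epsilon / (2 * rounds)   # assumed per-step budget split
    for _ in range(rounds):
        # Exponential Mechanism: favor queries with large current error.
        errors = [abs(evaluate(q, approx) - evaluate(q, true_hist)) for q in queries]
        top = max(errors)
        weights = [math.exp(eps_round * (e - top) / 2) for e in errors]
        chosen = rng.choices(queries, weights=weights, k=1)[0]
        # Laplace Mechanism: noisy measurement of the chosen query.
        measurement = evaluate(chosen, true_hist) + laplace(rng, 1.0 / eps_round)
        # Multiplicative Weights: nudge the histogram toward the measurement.
        gap = measurement - evaluate(chosen, approx)
        approx = [a * math.exp(q * gap / (2 * n)) for a, q in zip(approx, chosen)]
        scale = n / sum(approx)
        approx = [a * scale for a in approx]
        total = [t + a for t, a in zip(total, approx)]
    # Average the per-round histograms.
    return [t / rounds for t in total]
```

The returned histogram has the same total mass as the input and can be queried like the original data. Its accuracy depends on the budget, the number of rounds, and the query set.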
Links: arXiv, Proceedings (ACM)
Abstract: Computing accurate low rank approximations of large matrices is a fundamental data mining task. In many applications, however, the matrix contains sensitive information about individuals. In such cases we would like to release a low rank approximation that satisfies a strong privacy guarantee such as differential privacy. Unfortunately, to date the best known algorithm for this task that satisfies differential privacy is based on naive input perturbation or randomized response: Each entry of the matrix is perturbed independently by a sufficiently large random noise variable, a low rank approximation is then computed on the resulting matrix.
We give (the first) significant improvements in accuracy over randomized response under the natural and necessary assumption that the matrix has low coherence. Our algorithm is also very efficient and finds a constant rank approximation of an m x n matrix in time O(mn). Note that even generating the noise matrix required for randomized response already requires time O(mn).
Links: arXiv, Proceedings (SIAM), Short talk (PDF), Blog post at windowsontheory.org
Abstract: This work considers computationally efficient privacy-preserving data release. We study the task of analyzing a database containing sensitive information about individual participants. Given a set of statistical queries on the data, we want to release approximate answers to the queries while also guaranteeing differential privacy, protecting each participant's sensitive data.
Our focus is on computationally efficient data release algorithms; we seek algorithms whose running time is polynomial, or at least sub-exponential, in the data dimensionality. Our primary contribution is a computationally efficient reduction from differentially private data release for a class of counting queries, to learning thresholded sums of predicates from a related class.
We instantiate this general reduction with a variety of algorithms for learning thresholds. These instantiations yield several new results for differentially private data release. As two examples, taking {0,1}^d to be the data domain (of dimension d), we obtain differentially private algorithms for:
Several other instantiations yield further results for privacy-preserving data release. Of the two results highlighted above, the first learning algorithm uses techniques for representing thresholded sums of predicates as low-degree polynomial threshold functions. The second learning algorithm is based on Jackson's Harmonic Sieve algorithm [Jackson 1997]. It utilizes Fourier analysis of the database viewed as a function mapping queries to answers.
Links: arXiv
Abstract: We study fairness in classification, where individuals are classified, e.g., admitted to a university, and the goal is to prevent discrimination against individuals based on their membership in some group, while maintaining utility for the classifier (the university). The main conceptual contribution of this paper is a framework for fair classification comprising (1) a (hypothetical) task-specific metric for determining the degree to which individuals are similar with respect to the classification task at hand; (2) an algorithm for maximizing utility subject to the fairness constraint, that similar individuals are treated similarly. We also present an adaptation of our approach to achieve the complementary goal of "fair affirmative action," which guarantees statistical parity (i.e., the demographics of the set of individuals receiving any classification are the same as the demographics of the underlying population), while treating similar individuals as similarly as possible. Finally, we discuss the relationship of fairness to privacy: when fairness implies privacy, and how tools developed in the context of differential privacy may be applied to fairness.
Abstract: We initiate a principled study of graph densification. Given a graph G, the goal of graph densification is to come up with another graph H that has significantly more edges than G but nevertheless approximates G well with respect to some set of test functions. In this paper we focus on the case of cut and spectral approximations. As it turns out, graph densification exhibits rich connections to a set of interesting and sometimes seemingly unrelated questions in graph theory and metric embeddings. In particular, we show the following results:
Our results are mainly based on linear and semidefinite programs (and their duals) for computing the maximum weight densifier of a given graph. This also leads to efficient algorithms in the case of spectral densifiers and additive cut densifiers.
Links: PDF
Abstract: In this thesis we consider the challenges arising in the design of algorithms that interact with sensitive personal data, such as medical records, online tracking data, or financial records.
One important goal is to protect the privacy of those individuals whose personal information contributed to the data set. We consider algorithms that satisfy the strong privacy guarantee known as differential privacy. A wide range of computational tasks reduces to the setting in which a trusted database curator responds to a number of statistical queries posed by an untrusted data analyst. The basic question is how accurately and efficiently the curator can release approximate answers to the given queries while satisfying differential privacy. We make the following main contributions to differentially private data analysis:
Not all problems arising in the presence of sensitive data are a matter of privacy. In the final part of this thesis, we isolate fairness in classification as a formidable concern and thus initiate its formal study. The goal of fairness is to prevent discrimination against protected subgroups of the population in a classification system. We argue that fairness cannot be achieved by blindness to the attribute we would like to protect. Our main conceptual contribution is in asserting that fairness is achieved when similar individuals are treated similarly. Based on the goal of treating similar individuals similarly, we formalize and show how to achieve fairness in classification, given a similarity metric. We also observe that our notion of fairness can be seen as a generalization of differential privacy.
Links: arXiv
Abstract: Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better?
• We show that the number of statistical queries necessary and sufficient for this task is, up to polynomial factors, equal to the agnostic learning complexity of C in Kearns' statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern.
• We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. In doing so we also give a new learning algorithm for submodular functions that improves upon recent results in a different context.
While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis:
Here, our second result leads to the first algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents significant progress on a key open problem in privacy-preserving data analysis.
Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ model as a new barrier in the design of differentially private algorithms.
Short abstract: We consider the following seemingly unrelated questions:
• Is the MaxCut problem hard on random geometric graphs of the type considered by Feige and Schechtman (2002)?
• Is the value of a mathematical relaxation for a constraint-satisfaction problem (CSP) preserved when one passes from an instance P to a random induced sub-formula of P?
It turns out that the answer to the first question is "no" and in fact this is intimately related to the second question. The answer to the second question is much more subtle, and, in contrast to the case of the objective value of the CSP, the answer strongly depends on the type of relaxation and CSP.
Short abstract: We consider privacy-preserving statistical data analysis with online queries. In this setting a trusted curator maintains a database of sensitive information about individual participants, and releases privacy-preserving answers to queries as they arrive online. Our primary contribution is a new differentially private multiplicative weights mechanism for answering a large number of counting queries that arrive online and may be adaptively chosen. Our mechanism is the first to achieve worst-case accuracy guarantees (accuracy on every input database) for a large number of online queries together with a runtime that is polynomial in the data universe. The error is optimal in its dependence on the size of the database and depends only logarithmically on the number of queries being answered. The runtime is nearly linear in the size of the data universe.
Our main technical contributions are a new application of multiplicative weights techniques to the differential privacy setting, and a new privacy analysis for multiplicative weights algorithms.
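The following is a simplified sketch of how a multiplicative weights update can be combined with noisy query answers in the online setting described above. It is not the paper's exact mechanism: the class name, the parameters `noise_scale`, `threshold`, and `eta`, and the update rule's constants are illustrative assumptions, and the values used here are not calibrated to give any formal differential privacy guarantee.

```python
import math
import random


def laplace(scale, rng):
    """Sample Laplace noise with the given scale (difference of two exponentials)."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class PrivateMWSketch:
    """Online query answering with a multiplicative weights estimate (illustrative only)."""

    def __init__(self, true_hist, noise_scale, threshold, eta, seed=0):
        self.true_hist = list(true_hist)      # normalized histogram of the sensitive data
        n = len(self.true_hist)
        self.synthetic = [1.0 / n] * n        # public estimate, starts uniform
        self.noise_scale = noise_scale
        self.threshold = threshold
        self.eta = eta
        self.updates = 0                      # number of rounds that touched the estimate
        self.rng = random.Random(seed)

    def answer(self, query):
        """Answer one counting query, given as a vector with entries in [0, 1]."""
        noisy = dot(query, self.true_hist) + laplace(self.noise_scale, self.rng)
        estimate = dot(query, self.synthetic)
        if abs(noisy - estimate) <= self.threshold:
            # The current estimate is already good enough: answer from it.
            return estimate
        # Otherwise move the estimate toward the noisy answer and report the noisy answer.
        sign = 1.0 if noisy > estimate else -1.0
        weights = [w * math.exp(sign * self.eta * q)
                   for w, q in zip(self.synthetic, query)]
        total = sum(weights)
        self.synthetic = [w / total for w in weights]
        self.updates += 1
        return noisy


mech = PrivateMWSketch([0.7, 0.1, 0.1, 0.1], noise_scale=0.01, threshold=0.05, eta=0.5)
for q in [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0]] * 5:
    print(round(mech.answer(q), 3))
print("update rounds:", mech.updates)
```

The design point this is meant to convey is that the estimate is only modified in rounds where it disagrees noticeably with the noisy answer, so most queries can be answered from the public estimate alone.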
Links: PDF, arXiv, Proceedings (DOI)
Short abstract: We give a connection between convex geometry and differentially private data analysis. Specifically, we study the noise complexity of differentially private mechanisms in the setting where the data is represented by a histogram and the user asks a number of linear queries non-adaptively. We show that the amount of noise necessary and sufficient to achieve differential privacy is determined by two geometric parameters of a convex body associated with the set of queries. We use this connection to give tight upper and lower bounds for random linear queries. Assuming the truth of a deep conjecture from convex geometry, known as the Hyperplane conjecture, we can extend our results to arbitrary linear queries giving nearly matching upper and lower bounds.
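To make the setting concrete, here is a minimal sketch of answering a fixed set of linear queries on a histogram with the standard Laplace mechanism. This is a well-known baseline, not the geometric mechanism studied in the paper. The function names are hypothetical, and the sketch assumes that neighboring histograms differ by at most 1 in L1 norm and that at least one query has a nonzero coefficient.

```python
import random


def l1_sensitivity(queries):
    """Largest L1 norm of a column of the query matrix.

    queries[i][j] is the coefficient of histogram bin j in query i. Changing one
    bin count by 1 changes the vector of answers by one column of this matrix.
    """
    num_bins = len(queries[0])
    return max(sum(abs(row[j]) for row in queries) for j in range(num_bins))


def laplace_answers(queries, hist, epsilon, rng):
    """Answer all queries at once, adding Laplace noise of scale sensitivity / epsilon."""
    scale = l1_sensitivity(queries) / epsilon
    answers = []
    for row in queries:
        exact = sum(c * x for c, x in zip(row, hist))
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        answers.append(exact + noise)
    return answers


queries = [[1, 0, 1], [1, 1, 0]]   # two linear queries over three histogram bins
hist = [40, 25, 35]                # unnormalized bin counts (hypothetical data)
print(l1_sensitivity(queries))
print(laplace_answers(queries, hist, epsilon=1.0, rng=random.Random(0)))
```

The noise scale here depends on the query set only through a single number, its L1 sensitivity. The abstract's point is that the right amount of noise is instead governed by finer geometric properties of the convex body associated with the queries.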
Short abstract: We show an essentially tight connection between the semidefinite programming relaxation of Unique Games and their behavior under parallel repetition. Our results generalize and help to explain Raz's refutation of the Strong Parallel Repetition Conjecture. Indeed, we show that a Unique Game is a counterexample to strong parallel repetition whenever its SDP value is significantly larger than its integral value.