One autonomous taxi, please

If you don’t get seasick, an autonomous boat might be the right mode of transportation for you. 

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Senseable City Laboratory, together with Amsterdam Institute for Advanced Metropolitan Solutions (AMS Institute) in the Netherlands, have now created the final project in their self-navigating trilogy: a full-scale, fully autonomous robotic boat that’s ready to be deployed along the canals of Amsterdam. 

“Roboat” has come a long way since the team first started prototyping small vessels in the MIT pool in late 2015. Last year, the team released their half-scale, medium model that was 2 meters long and demonstrated promising navigational prowess. 

This year, two full-scale Roboats were launched, proving more than just proof-of-concept: these craft can comfortably carry up to five people, collect waste, deliver goods, and provide on-demand infrastructure. 

The boat looks futuristic — it’s a sleek combination of black and gray with two seats that face each other, with orange block letters on the sides that illustrate the makers’ namesakes. It’s a fully electrical boat with a battery that’s the size of a small chest, enabling up to 10 hours of operation and wireless charging capabilities. 

“We now have higher precision and robustness in the perception, navigation, and control systems, including new functions, such as close-proximity approach mode for latching capabilities, and improved dynamic positioning, so the boat can navigate real-world waters,” says Daniela Rus, MIT professor of electrical engineering and computer science and director of CSAIL. “Roboat’s control system is adaptive to the number of people in the boat.” 

To swiftly navigate the bustling waters of Amsterdam, Roboat needs a meticulous fusion of proper navigation, perception, and control software. 

Using GPS, the boat autonomously decides on a safe route from A to B, while continuously scanning the environment to  avoid collisions with objects, such as bridges, pillars, and other boats.

To autonomously determine a free path and avoid crashing into objects, Roboat uses lidar and a number of cameras to enable a 360-degree view. This bundle of sensors is referred to as the “perception kit” and lets Roboat understand its surroundings. When the perception picks up an unseen object, like a canoe, for example, the algorithm flags the item as “unknown.” When the team later looks at the collected data from the day, the object is manually selected and can be tagged as “canoe.” 

The control algorithms — similar to ones used for self-driving cars — function a little like a coxswain giving orders to rowers, by translating a given path into instructions toward the “thrusters,” which are the propellers that help the boat move.  

If you think the boat feels slightly futuristic, its latching mechanism is one of its most impressive feats: small cameras on the boat guide it to the docking station, or other boats, when they detect specific QR codes. “The system allows Roboat to connect to other boats, and to the docking station, to form temporary bridges to alleviate traffic, as well as floating stages and squares, which wasn’t possible with the last iteration,” says Carlo Ratti, professor of the practice in the MIT Department of Urban Studies and Planning (DUSP) and director of the Senseable City Lab. 

Roboat, by design, is also versatile. The team created a universal “hull” design — that’s the part of the boat that rides both in and on top of the water. While regular boats have unique hulls, designed for specific purposes, Roboat has a universal hull design where the base is the same, but the top decks can be switched out depending on the use case.

“As Roboat can perform its tasks 24/7, and without a skipper on board, it adds great value for a city. However, for safety reasons it is questionable if reaching level A autonomy is desirable,” says Fabio Duarte, a principal research scientist in DUSP and lead scientist on the project. “Just like a bridge keeper, an onshore operator will monitor Roboat remotely from a control center. One operator can monitor over 50 Roboat units, ensuring smooth operations.”

The next step for Roboat is to pilot the technology in the public domain. “The historic center of Amsterdam is the perfect place to start, with its capillary network of canals suffering from contemporary challenges, such as mobility and logistics,” says Stephan van Dijk, director of innovation at AMS Institute. 

Previous iterations of Roboat have been presented at the IEEE International Conference on Robotics and Automation. The boats will be unveiled on Oct. 28 in the waters of Amsterdam. 

Ratti, Rus, Duarte, and Dijk worked on the project alongside Andrew Whittle, MIT’s Edmund K Turner Professor in civil and environmental engineering; Dennis Frenchman, professor at MIT’s Department of Urban Studies and Planning; and Ynse Deinema of AMS Institute. The full team can be found at Roboat’s website. The project is a joint collaboration with AMS Institute. The City of Amsterdam is a project partner.

Read More

Artificial intelligence sheds light on how the brain processes language

In the past few years, artificial intelligence models of language have become very good at certain tasks. Most notably, they excel at predicting the next word in a string of text; this technology helps search engines and texting apps predict the next word you are going to type.

The most recent generation of predictive language models also appears to learn something about the underlying meaning of language. These models can not only predict the word that comes next, but also perform tasks that seem to require some degree of genuine understanding, such as question answering, document summarization, and story completion. 

Such models were designed to optimize performance for the specific function of predicting text, without attempting to mimic anything about how the human brain performs this task or understands language. But a new study from MIT neuroscientists suggests the underlying function of these models resembles the function of language-processing centers in the human brain.

Computer models that perform well on other types of language tasks do not show this similarity to the human brain, offering evidence that the human brain may use next-word prediction to drive language processing.

“The better the model is at predicting the next word, the more closely it fits the human brain,” says Nancy Kanwisher, the Walter A. Rosenblith Professor of Cognitive Neuroscience, a member of MIT’s McGovern Institute for Brain Research and Center for Brains, Minds, and Machines (CBMM), and an author of the new study. “It’s amazing that the models fit so well, and it very indirectly suggests that maybe what the human language system is doing is predicting what’s going to happen next.”

Joshua Tenenbaum, a professor of computational cognitive science at MIT and a member of CBMM and MIT’s Artificial Intelligence Laboratory (CSAIL); and Evelina Fedorenko, the Frederick A. and Carole J. Middleton Career Development Associate Professor of Neuroscience and a member of the McGovern Institute, are the senior authors of the study, which appears this week in the Proceedings of the National Academy of Sciences. Martin Schrimpf, an MIT graduate student who works in CBMM, is the first author of the paper.

Making predictions

The new, high-performing next-word prediction models belong to a class of models called deep neural networks. These networks contain computational “nodes” that form connections of varying strength, and layers that pass information between each other in prescribed ways.

Over the past decade, scientists have used deep neural networks to create models of vision that can recognize objects as well as the primate brain does. Research at MIT has also shown that the underlying function of visual object recognition models matches the organization of the primate visual cortex, even though those computer models were not specifically designed to mimic the brain.

In the new study, the MIT team used a similar approach to compare language-processing centers in the human brain with language-processing models. The researchers analyzed 43 different language models, including several that are optimized for next-word prediction. These include a model called GPT-3 (Generative Pre-trained Transformer 3), which, given a prompt, can generate text similar to what a human would produce. Other models were designed to perform different language tasks, such as filling in a blank in a sentence.

As each model was presented with a string of words, the researchers measured the activity of the nodes that make up the network. They then compared these patterns to activity in the human brain, measured in subjects performing three language tasks: listening to stories, reading sentences one at a time, and reading sentences in which one word is revealed at a time. These human datasets included functional magnetic resonance (fMRI) data and intracranial electrocorticographic measurements taken in people undergoing brain surgery for epilepsy.

They found that the best-performing next-word prediction models had activity patterns that very closely resembled those seen in the human brain. Activity in those same models was also highly correlated with measures of human behavioral measures such as how fast people were able to read the text.

“We found that the models that predict the neural responses well also tend to best predict human behavior responses, in the form of reading times. And then both of these are explained by the model performance on next-word prediction. This triangle really connects everything together,” Schrimpf says.

“A key takeaway from this work is that language processing is a highly constrained problem: The best solutions to it that AI engineers have created end up being similar, as this paper shows, to the solutions found by the evolutionary process that created the human brain. Since the AI network didn’t seek to mimic the brain directly — but does end up looking brain-like — this suggests that, in a sense, a kind of convergent evolution has occurred between AI and nature,” says Daniel Yamins, an assistant professor of psychology and computer science at Stanford University, who was not involved in the study.

Game changer

One of the key computational features of predictive models such as GPT-3 is an element known as a forward one-way predictive transformer. This kind of transformer is able to make predictions of what is going to come next, based on previous sequences. A significant feature of this transformer is that it can make predictions based on a very long prior context (hundreds of words), not just the last few words.

Scientists have not found any brain circuits or learning mechanisms that correspond to this type of processing, Tenenbaum says. However, the new findings are consistent with hypotheses that have been previously proposed that prediction is one of the key functions in language processing, he says.

“One of the challenges of language processing is the real-time aspect of it,” he says. “Language comes in, and you have to keep up with it and be able to make sense of it in real time.”

The researchers now plan to build variants of these language processing models to see how small changes in their architecture affect their performance and their ability to fit human neural data.

“For me, this result has been a game changer,” Fedorenko says. “It’s totally transforming my research program, because I would not have predicted that in my lifetime we would get to these computationally explicit models that capture enough about the brain so that we can actually leverage them in understanding how the brain works.”

The researchers also plan to try to combine these high-performing language models with some computer models Tenenbaum’s lab has previously developed that can perform other kinds of tasks such as constructing perceptual representations of the physical world.

“If we’re able to understand what these language models do and how they can connect to models which do things that are more like perceiving and thinking, then that can give us more integrative models of how things work in the brain,” Tenenbaum says. “This could take us toward better artificial intelligence models, as well as giving us better models of how more of the brain works and how general intelligence emerges, than we’ve had in the past.”

The research was funded by a Takeda Fellowship; the MIT Shoemaker Fellowship; the Semiconductor Research Corporation; the MIT Media Lab Consortia; the MIT Singleton Fellowship; the MIT Presidential Graduate Fellowship; the Friends of the McGovern Institute Fellowship; the MIT Center for Brains, Minds, and Machines, through the National Science Foundation; the National Institutes of Health; MIT’s Department of Brain and Cognitive Sciences; and the McGovern Institute.

Other authors of the paper are Idan Blank PhD ’16 and graduate students Greta Tuckute, Carina Kauf, and Eghbal Hosseini.

Read More

Deciding Which Tasks Should Train Together in Multi-Task Neural Networks

Posted by Christopher Fifty, Research Engineer, Google Research, Brain Team

Many machine learning (ML) models typically focus on learning one task at a time. For example, language models predict the probability of a next word given a context of past words, and object detection models identify the object(s) that are present in an image. However, there may be instances when learning from many related tasks at the same time would lead to better modeling performance. This is addressed in the domain of multi-task learning, a subfield of ML in which multiple objectives are trained within the same model at the same time.

Consider a real-world example: the game of ping-pong. When playing ping-pong, it is often advantageous to judge the distance, spin, and imminent trajectory of the ping-pong ball to adjust your body and line up a swing. While each of these tasks are unique — predicting the spin of a ping-pong ball is fundamentally distinct from predicting its location — improving your reasoning of the location and spin of the ball will likely help you better predict its trajectory and vice-versa. By analogy, within the realm of deep learning, training a model to predict three related tasks (i.e., the location, spin, and trajectory of a ping-pong ball) may result in improved performance over a model that only predicts a single objective.

Left: Three single-task networks, each of which uses the same input to predict the spin, distance, or trajectory of the ping-pong ball, respectively. Right: A single multi-task network that simultaneously predicts the spin, distance, and trajectory.

In “Efficiently Identifying Task Groupings in Multi-Task Learning”, a spotlight presentation at NeurIPS 2021, we describe a method called Task Affinity Groupings (TAG) that determines which tasks should be trained together in multi-task neural networks. Our approach attempts to divide a set of tasks into smaller subsets such that the performance across all tasks is maximized. To accomplish this goal, it trains all tasks together in a single multi-task model and measures the degree to which one task’s gradient update on the model’s parameters would affect the loss of the other tasks in the network. We denote this quantity as inter-task affinity. Our experimental findings indicate that selecting groups of tasks that maximize inter-task affinity correlates strongly with overall model performance.

Which Tasks Should Train Together?
In the ideal case, a multi-task learning model will apply the information it learns during training on one task to decrease the loss on other tasks included in training the network. This transfer of information leads to a single model that can not only make multiple predictions, but may also exhibit improved accuracy for those predictions when compared with the performance of training a different model for each task. On the other hand, training a single model on many tasks may lead to competition for model capacity and severely degrade performance. This latter scenario often occurs when tasks are unrelated. Returning to our ping-pong analogy, imagine trying to predict the location, spin, and trajectory of the ping-pong ball while simultaneously recounting the Fibonnaci sequence. Not a fun prospect, and most likely detrimental to your progression as a ping-pong player.

One direct approach to select the subset of tasks on which a model should train is to perform an exhaustive search over all possible combinations of multi-task networks for a set of tasks. However, the cost associated with this search can be prohibitive, especially when there are a large number of tasks, because the number of possible combinations increases exponentially with respect to the number of tasks in the set. This is further complicated by the fact that the set of tasks to which a model is applied may change throughout its lifetime. As tasks are added to or dropped from the set of all tasks, this costly analysis would need to be repeated to determine new groupings. Moreover, as the scale and complexity of models continues to increase, even approximate task grouping algorithms that evaluate only a subset of possible multi-task networks may become prohibitively costly and time-consuming to evaluate.

Building Task Affinity Groupings
In examining this challenge, we drew inspiration from meta-learning, a domain of machine learning that trains a neural network that can be quickly adapted to a new, and previously unseen task. One of the classic meta-learning algorithms, MAML, applies a gradient update to the models’ parameters for a collection of tasks and then updates its original set of parameters to minimize the loss for a subset of tasks in that collection computed at the updated parameter values. Using this method, MAML trains the model to learn representations that will not minimize the loss for its current set of weights, but rather for the weights after one or more steps of training. As a result, MAML trains a models’ parameters to have the capacity to quickly adapt to a previously unseen task because it optimizes for the future, not the present.

TAG employs a similar mechanism to gain insight into the training dynamics of multi-task neural networks. In particular, it updates the model’s parameters with respect only to a single task, looks at how this change would affect the other tasks in the multi-task neural network, and then undoes this update. This process is then repeated for every other task to gather information on how each task in the network would interact with any other task. Training then continues as normal by updating the model’s shared parameters with respect to every task in the network.

Collecting these statistics, and looking at their dynamics throughout training, reveals that certain tasks consistently exhibit beneficial relationships, while some are antagonistic towards each other. A network selection algorithm can leverage this data in order to group tasks together that maximize inter-task affinity, subject to a practitioner’s choice of how many multi-task networks can be used during inference.

Overview of TAG. First, tasks are trained together in the same network while computing inter-task affinities. Second, the network selection algorithm finds task groupings that maximize inter-task affinity. Third, the resulting multi-task networks are trained and deployed.

Results
Our experimental findings indicate that TAG can select very strong task groupings. On the CelebA and Taskonomy datasets, TAG is competitive with the prior state-of-the-art, while operating between 32x and 11.5x faster, respectively. On the Taskonomy dataset, this speedup translates to 2,008 fewer Tesla V100 GPU hours to find task groupings.

Conclusion
TAG is an efficient method to determine which tasks should train together in a single training run. The method looks at how tasks interact through training, notably, the effect that updating the model’s parameters when training on one task would have on the loss values of the other tasks in the network. We find that selecting groups of tasks to maximize this score correlates strongly with model performance.

Acknowledgements
We would like to thank Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn for their fundamental contributions to this work. We also recognize Tom Small for designing the animation, and Google Research as a whole for fostering a collaborative and uplifting research environment.

Read More

Saving seaweed with machine learning

Last year, Charlene Xia ’17, SM ’20 found herself at a crossroads. She was finishing up her master’s degree in media arts and sciences from the MIT Media Lab and had just submitted applications to doctoral degree programs. All Xia could do was sit and wait. In the meantime, she narrowed down her career options, regardless of whether she was accepted to any program.

“I had two thoughts: I’m either going to get a PhD to work on a project that protects our planet, or I’m going to start a restaurant,” recalls Xia.

Xia poured over her extensive cookbook collection, researching international cuisines as she anxiously awaited word about her graduate school applications. She even looked into the cost of a food truck permit in the Boston area. Just as she started hatching plans to open a plant-based skewer restaurant, Xia received word that she had been accepted into the mechanical engineering graduate program at MIT.

Shortly after starting her doctoral studies, Xia’s advisor, Professor David Wallace, approached her with an interesting opportunity. MathWorks, a software company known for developing the MATLAB computing platform, had announced a new seed funding program in MIT’s Department of Mechanical Engineering. The program encouraged collaborative research projects focused on the health of the planet.

“I saw this as a super-fun opportunity to combine my passion for food, my technical expertise in ocean engineering, and my interest in sustainably helping our planet,” says Xia.

Wallace knew Xia would be up to the task of taking an interdisciplinary approach to solve an issue related to the health of the planet. “Charlene is a remarkable student with extraordinary talent and deep thoughtfulness. She is pretty much fearless, embracing challenges in almost any domain with the well-founded belief that, with effort, she will become a master,” says Wallace.

Alongside Wallace and Associate Professor Stefanie Mueller, Xia proposed a project to predict and prevent the spread of diseases in aquaculture. The team focused on seaweed farms in particular.

Already popular in East Asian cuisines, seaweed holds tremendous potential as a sustainable food source for the world’s ever-growing population. In addition to its nutritive value, seaweed combats various environmental threats. It helps fight climate change by absorbing excess carbon dioxide in the atmosphere, and can also absorb fertilizer run-off, keeping coasts cleaner.

As with so much of marine life, seaweed is threatened by the very thing it helps mitigate against: climate change. Climate stressors like warm temperatures or minimal sunlight encourage the growth of harmful bacteria such as ice-ice disease. Within days, entire seaweed farms are decimated by unchecked bacterial growth.

To solve this problem, Xia turned to the microbiota present in these seaweed farms as a predictive indicator of any threat to the seaweed or livestock. “Our project is to develop a low-cost device that can detect and prevent diseases before they affect seaweed or livestock by monitoring the microbiome of the environment,” says Xia.

The team pairs old technology with the latest in computing. Using a submersible digital holographic microscope, they take a 2D image. They then use a machine learning system known as a neural network to convert the 2D image into a representation of the microbiome present in the 3D environment.

“Using a machine learning network, you can take a 2D image and reconstruct it almost in real time to get an idea of what the microbiome looks like in a 3D space,” says Xia.

The software can be run in a small Raspberry Pi that could be attached to the holographic microscope. To figure out how to communicate these data back to the research team, Xia drew upon her master’s degree research.

In that work, under the guidance of Professor Allan Adams and Professor Joseph Paradiso in the Media Lab, Xia focused on developing small underwater communication devices that can relay data about the ocean back to researchers. Rather than the usual $4,000, these devices were designed to cost less than $100, helping lower the cost barrier for those interested in uncovering the many mysteries of our oceans. The communication devices can be used to relay data about the ocean environment from the machine learning algorithms.

By combining these low-cost communication devices along with microscopic images and machine learning, Xia hopes to design a low-cost, real-time monitoring system that can be scaled to cover entire seaweed farms.

“It’s almost like having the ‘internet of things’ underwater,” adds Xia. “I’m developing this whole underwater camera system alongside the wireless communication I developed that can give me the data while I’m sitting on dry land.”

Armed with these data about the microbiome, Xia and her team can detect whether or not a disease is about to strike and jeopardize seaweed or livestock before it is too late.

While Xia still daydreams about opening a restaurant, she hopes the seaweed project will prompt people to rethink how they consider food production in general.

“We should think about farming and food production in terms of the entire ecosystem,” she says. “My meta-goal for this project would be to get people to think about food production in a more holistic and natural way.”

Read More

Practical Differentially Private Clustering

Posted by Alisa Chang, Software Engineer, Google Cloud and Pritish Kamath, Research Scientist, Google Research

Over the last several years, progress has been made on privacy-safe approaches for handling sensitive data, for example, while discovering insights into human mobility and through use of federated analytics such as RAPPOR. In 2019, we released an open source library to enable developers and organizations to use techniques that provide differential privacy, a strong and widely accepted mathematical notion of privacy. Differentially-private data analysis is a principled approach that enables organizations to learn and release insights from the bulk of their data while simultaneously providing a mathematical guarantee that those results do not allow any individual user’s data to be distinguished or re-identified.

In this post, we consider the following basic problem: Given a database containing several attributes about users, how can one create meaningful user groups and understand their characteristics? Importantly, if the database at hand contains sensitive user attributes, how can one reveal these group characteristics without compromising the privacy of individual users?

Such a task falls under the broad umbrella of clustering, a fundamental building block in unsupervised machine learning. A clustering method partitions the data points into groups and provides a way to assign any new data point to a group with which it is most similar. The k-means clustering algorithm has been one such influential clustering method. However, when working with sensitive datasets, it can potentially reveal significant information about individual data points, putting the privacy of the corresponding user at risk.

Today, we announce the addition of a new differentially private clustering algorithm to our differential privacy library, which is based on privately generating new representative data points. We evaluate its performance on multiple datasets and compare to existing baselines, finding competitive or better performance.

K-means Clustering
Given a set of data points, the goal of k-means clustering is to identify (at most) k points, called cluster centers, so as to minimize the loss given by the sum of squared distances of the data points from their closest cluster center. This partitions the set of data points into k groups. Moreover, any new data point can be assigned to a group based on its closest cluster center. However, releasing the set of cluster centers has the potential to leak information about particular users — for example, consider a scenario where a particular data point is significantly far from the rest of the points, so the standard k-means clustering algorithm returns a cluster center at this single point, thereby revealing sensitive information about this single point. To address this, we design a new algorithm for clustering with the k-means objective within the framework of differential privacy.

A Differentially Private Algorithm
In “Locally Private k-Means in One Round”, published at ICML 2021, we presented a differentially private algorithm for clustering data points. That algorithm had the advantage of being private in the local model, where the user’s privacy is protected even from the central server performing the clustering. However, any such approach necessarily incurs a significantly larger loss than approaches using models of privacy that require trusting a central server.1

Here, we present a similarly inspired clustering algorithm that works in the central model of differential privacy, where the central server is trusted to have complete access to the raw data, and the goal is to compute differentially private cluster centers, which do not leak information about individual data points. The central model is the standard model for differential privacy, and algorithms in the central model can be more easily substituted in place of their non-private counterparts since they do not require changes to the data collection process. The algorithm proceeds by first generating, in a differentially private manner, a core-set that consists of weighted points that “represent” the data points well. This is followed by executing any (non-private) clustering algorithm (e.g., k-means++) on this privately generated core-set.

At a high level, the algorithm generates the private core-set by first using random-projection–based locality-sensitive hashing (LSH) in a recursive manner2 to partition the points into “buckets” of similar points, and then replacing each bucket by a single weighted point that is the average of the points in the bucket, with a weight equal to the number of points in the same bucket. As described so far, though, this algorithm is not private. We make it private by performing each operation in a private manner by adding noise to both the counts and averages of points within a bucket.

This algorithm satisfies differential privacy because each point’s contributions to the bucket counts and the bucket averages are masked by the added noise, so the behavior of the algorithm does not reveal information about any individual point. There is a tradeoff with this approach: if the number of points in the buckets is too large, then individual points will not be well-represented by points in the core-set, whereas if the number of points in a bucket is too small, then the added noise (to the counts and averages) will become significant in comparison to the actual values, leading to poor quality of the core-set. This trade-off is realized with user-provided parameters in the algorithm that control the number of points that can be in a bucket.

Visual illustration of the algorithm.

Experimental Evaluation
We evaluated the algorithm on a few benchmark datasets, comparing its performance to that of the (non-private) k-means++ algorithm, as well as a few other algorithms with available implementations, namely diffprivlib and dp-clustering-icml17. We use the following benchmark datasets: (i) a synthetic dataset consisting of 100,000 data points in 100 dimensions sampled from a mixture of 64 Gaussians; (ii) neural representations for the MNIST dataset on handwritten digits obtained by training a LeNet model; (iii) the UC Irvine dataset on Letter Recognition; and (iv) the UC Irvine dataset on Gas Turbine CO and NOx Emissions.3

We analyze the normalized k-means loss (mean squared distance from data points to the nearest center) while varying the number of target centers (k) for these benchmark datasets.4 The described algorithm achieves a lower loss than the other private algorithms in three out of the four datasets we consider.

Normalized loss for varying k = number of target clusters (lower is better). The solid curves denote the mean over the 20 runs, and the shaded region denotes the 25-75th percentile range.

Moreover, for datasets with specified ground-truth labels (i.e., known groupings), we analyze the cluster label accuracy, which is the accuracy of the labeling obtained by assigning the most frequent ground-truth label in each cluster found by the clustering algorithm to all points in that cluster. Here, the described algorithm performs better than other private algorithms for all the datasets with pre-specified ground-truth labels that we consider.

Cluster label accuracy for varying k = number of target clusters (higher is better). The solid curves denote the mean over the 20 runs, and the shaded region denotes the 25-75th percentile range.

Limitations and Future Directions
There are a couple of limitations to consider when using this or any other library for private clustering.

  1. It is important to separately account for the privacy loss in any preprocessing (e.g., centering the data points or rescaling the different coordinates) done before using the private clustering algorithm. So, we hope to provide support for differentially private versions of commonly used preprocessing methods in the future and investigate changes so that the algorithm performs better with data that isn’t necessarily preprocessed.
  2. The algorithm described requires a user-provided radius, such that all data points lie within a sphere of that radius. This is used to determine the amount of noise that is added to the bucket averages. Note that this differs from diffprivlib and dp-clustering-icml17 which take in different notions of bounds of the dataset (e.g., a minimum and maximum of each coordinate). For the sake of our experimental evaluation, we calculated the relevant bounds non-privately for each dataset. However, when running the algorithms in practice, these bounds should generally be privately computed or provided without knowledge of the dataset (e.g., using the underlying range of the data). Although, note that in case of the algorithm described, the provided radius need not be exactly correct; any data points outside of the provided radius are replaced with the closest points that are within the sphere of that radius.

Conclusion
This work proposes a new algorithm for computing representative points (cluster centers) within the framework of differential privacy. With the rise in the amount of datasets collected around the world, we hope that our open source tool will help organizations obtain and share meaningful insights about their datasets, with the mathematical assurance of differential privacy.

Acknowledgements
We thank Christoph Dibak, Badih Ghazi, Miguel Guevara, Sasha Kulankhina, Ravi Kumar, Pasin Manurangsi, Jane Shapiro, Daniel Simmons-Marengo, Yurii Sushko, and Mirac Vuslat Basaran for their help.


1As shown by Uri Stemmer in Locally private k-means clustering (SODA 2020). 
2This is similar to work on LSH Forest, used in the context of similarity-search queries. 
3Datasets (iii) and (iv) were centered to have mean zero before evaluating the algorithms. 
4Evaluation done for fixed privacy parameters ε = 1.0 and δ = 1e-6. Note that dp-clustering-icml17 works in the pure differential privacy model (namely, with δ = 0); k-means++, of course, has no privacy parameters. 

Read More