Author: Rebekah Thompson
If you enjoyed our list of tools and infrastructure from Q2 of 2020, we're back with another round-up for you!
For those of you who are new to our Applied Machine Learning Quarterly, welcome! This is an ongoing curated list of interesting new open-source machine learning models, tools, and visualizations for 2020, updated roughly quarterly.
Choosing the best model for your machine learning project can feel daunting at times, but researchers at Google have developed a new, open-source Python framework that aims to ease that feeling: Model Search (MS).
Google’s Model Search GitHub page describes the framework as:
“Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale. It aims to help researchers speed up their exploration process for finding the right model architecture for their classification problems (i.e., DNNs with different types of layers).”
- Google’s Model Search GitHub Repository
A depiction of the MS model from Google AI’s blog post on Model Search is shown below. The researchers explained their model as follows:
“Model Search schematic illustrating the distributed search and ensembling. Each trainer runs independently to train and evaluate a given model. The results are shared with the search algorithm, which it stores. The search algorithm then invokes mutation over one of the best architectures and then sends the new model back to a trainer for the next iteration. S is the set of training and validation examples and A are all the candidates used during training and search.”
- Hanna Mazzawi and Xavi Gonzalvo (Authors of the Model Search article)
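To give a feel for the workflow, here is a rough sketch of the single-trainer example from the Model Search README at the time of writing. The module paths, CSV data provider arguments, and default search spec are reproduced from that example and may have changed, so treat this as approximate and check the repository for the current API.

```python
# Rough sketch of Model Search's single-trainer workflow (based on the README
# example at the time of writing; verify module paths and arguments against
# the repository before relying on this).
from model_search import constants
from model_search import single_trainer
from model_search.data import csv_data

trainer = single_trainer.SingleTrainer(
    data=csv_data.Provider(
        label_index=0,                # column holding the class label
        logits_dimension=2,           # number of classes
        record_defaults=[0, 0, 0, 0],
        filename="model_search/data/testdata/csv_random_data.csv"),
    spec=constants.DEFAULT_DNN)       # the search space of DNN blocks to explore

# Kick off the search: train and evaluate a budget of candidate architectures.
trainer.try_models(
    number_models=200,
    train_steps=1000,
    eval_steps=100,
    root_dir="/tmp/run_example",
    batch_size=32,
    experiment_name="example",
    experiment_owner="model_search_user")
```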
Based on the currently available version, this library enables you to:
Resources to get started with Model Search:
Zachary Teed and Jia Deng of Princeton University have introduced RAFT (Recurrent All-Pairs Field Transforms), a new end-to-end deep neural network architecture for optical flow, one of the long-standing problems in accurate video analysis. Optical flow is the task of estimating the motion, or direction, of objects between video frames, and it is often complicated by fast-moving objects, occlusion (a tracked object being blocked by another object), and motion blur.
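To make the idea of a flow field concrete, here is a tiny, purely illustrative NumPy example (not RAFT itself) showing how dense optical flow is typically represented as a per-pixel displacement:

```python
import numpy as np

# Dense optical flow is a per-pixel displacement field: for each pixel (y, x)
# in frame 1, flow[y, x] = (dx, dy) says where that pixel ends up in frame 2.
# This toy field is purely illustrative and is not RAFT itself.
height, width = 4, 6
flow = np.zeros((height, width, 2), dtype=np.float32)
flow[..., 0] = 1.0   # every pixel shifts one pixel to the right
flow[..., 1] = 0.5   # and half a pixel downward

y, x = 2, 3
dx, dy = flow[y, x]
print(f"Pixel ({y}, {x}) in frame 1 is expected at ({y + dy}, {x + dx}) in frame 2")
```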
RAFT consists of three main components:
Image From the Paper: "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow"
According to the results Teed and Deng report in the publication, including on the KITTI Vision Benchmark Suite, RAFT's strengths include the following:
Resources to get started with RAFT:
Are you looking for a comparison of different MLOps platforms? Or maybe you just want to discuss the pros and cons of operating an ML platform on the cloud vs. on-premise? Sign up for our MLOps Briefing -- it's completely free, and you can bring your own questions or set the agenda.
DALL·E, cleverly named after the artist Salvador Dalí and Pixar's WALL·E, is a neural network from OpenAI, implemented in PyTorch, that creates images from text captions supplied by the user.
According to OpenAI’s page introducing DALL·E:
“DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve [DALL·E developers] found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.”
- OpenAI's DALL·E Article
The captions are not just limited to a single word, such as “dog” or “cat.” With this network, the user can input a phrase such as “an illustration of a baby daikon radish in a tutu walking a dog” and get just that — an illustration of a baby daikon radish in a tutu walking a dog.
DALL·E is not limited to text prompts; it also works with combined image and text prompts. The image below shows DALL·E attempting to replicate an image of a cat as a sketch after receiving both the image and a text prompt based on it.
Resources to get started with DALL·E:
Jraph is a graph neural network library developed by DeepMind and one of the newest members of the JAX family, a machine learning framework developed by Google researchers. The developers of Jraph describe it as follows:
“Jraph (pronounced "giraffe") is a lightweight library for working with graph neural networks in jax. It provides a data structure for graphs, a set of utilities for working with graphs, and a 'zoo' of forkable graph neural network models.”
- Jraph GitHub Repository
Jraph takes inspiration from DeepMind's TensorFlow-based graph_nets library when defining its GraphsTuple data structure, a named tuple that holds one or more directed graphs.
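Here is a minimal example of building a GraphsTuple by hand; the feature sizes are arbitrary and just for illustration:

```python
import jax.numpy as jnp
import jraph

# A single directed graph with 3 nodes and 2 edges (0 -> 1 and 1 -> 2).
graph = jraph.GraphsTuple(
    nodes=jnp.ones((3, 4)),        # one 4-dimensional feature vector per node
    edges=jnp.ones((2, 5)),        # one 5-dimensional feature vector per edge
    senders=jnp.array([0, 1]),     # source node index of each edge
    receivers=jnp.array([1, 2]),   # destination node index of each edge
    n_node=jnp.array([3]),         # number of nodes in each graph in the batch
    n_edge=jnp.array([2]),         # number of edges in each graph in the batch
    globals=jnp.ones((1, 6)),      # one global feature vector per graph
)

# Utilities such as jraph.batch combine several graphs into one GraphsTuple.
batched = jraph.batch([graph, graph])
```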
Resources to get started with Jraph:
PettingZoo is a Python library created by Justin Terry, currently a Ph.D. student at the University of Maryland, for conducting and researching multi-agent reinforcement learning. It is similar to OpenAI's Gym library, but instead of focusing on a single agent, its environments are built around training multiple agents. Whether or not you have experience with Gym, if you want to try your hand at training multiple agents, give PettingZoo a try.
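If you're coming from Gym, the main difference is PettingZoo's agent-by-agent (AEC) loop. A minimal random-action sketch looks roughly like this; note that environment modules carry a version suffix that changes between releases, so check the docs for the current one:

```python
# Minimal random-action loop using PettingZoo's agent-by-agent (AEC) API.
# The version suffix on the environment module changes between releases.
from pettingzoo.butterfly import pistonball_v4

env = pistonball_v4.env()
env.reset()
for agent in env.agent_iter():
    observation, reward, done, info = env.last()   # what this agent just experienced
    action = None if done else env.action_spaces[agent].sample()  # done agents must step with None
    env.step(action)
env.close()
```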
PettingZoo offers six families of learning environments for training and testing your agents, including:
For more information, check out the links below, as well as this interview with Justin on the Synthetic Intelligence Forum's YouTube channel for a deeper dive into PettingZoo and how it can potentially help you with your projects.
Resources to get started with PettingZoo:
Transformers have been gaining popularity in image-based tasks long dominated by convolutional neural networks (CNNs). Imagining the possibilities this could bring to the world of image analysis, Google researchers have introduced the Vision Transformer (ViT), a vision model based as closely as possible on the Transformer architecture used in text-based models.
In their Google AI article presenting the new model, the authors explained ViT as the following:
“ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying Transformers to text, and directly predicts class labels for the image. ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources.”
- Google Research Scientists, ViT Article
Image Source: Google article on ViT
So, how does this model work? In the same article, the researchers go on to explain just that:
“ViT divides an image into a grid of square patches. Each patch is flattened into a single vector by concatenating all pixels’ channels in a patch and then linearly projecting it to the desired input dimension. Because Transformers are agnostic to the structure of the input elements we add learnable position embeddings to each patch, which allows the model to learn about the structure of the images. A priori, ViT does not know about the relative location of patches in the image, or even that the image has a 2D structure — it must learn such relevant information from the training data and encode structural information in the position embeddings.”
- Google Research Scientists, ViT Article
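As a rough illustration of that description, here is a small NumPy sketch of the patch-embedding step (random weights, no class token, not the actual ViT code):

```python
import numpy as np

# Illustrative sketch of ViT's patch embedding; the real model learns the
# projection and position embeddings and also prepends a class token.
image = np.random.rand(224, 224, 3)   # H x W x C input image
patch_size, embed_dim = 16, 768

# 1. Split the image into non-overlapping 16x16 patches and flatten each one.
grid = 224 // patch_size              # 14 patches per side
patches = image.reshape(grid, patch_size, grid, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)
print(patches.shape)                  # (196, 768): 14 x 14 patches, each flattened

# 2. Linearly project every flattened patch to the model dimension.
projection = np.random.rand(patch_size * patch_size * 3, embed_dim)
tokens = patches @ projection         # (196, embed_dim)

# 3. Add learnable position embeddings so the Transformer can recover the
#    2D structure it otherwise knows nothing about.
position_embeddings = np.random.rand(tokens.shape[0], embed_dim)
tokens = tokens + position_embeddings # the sequence fed to a standard Transformer encoder
```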
Resources to get started with Vision Transformer (ViT):
One of the latest datasets from Google Research Scientists is called “Room-Across-Room (RxR).” RxR is a multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments.
The RxR research scientists describe the dataset in their article as the following:
Image Source: Google Article on RxR
“the first multilingual dataset for VLN, containing 126,069 human-annotated navigation instructions in three typologically diverse languages — English, Hindi and Telugu. Each instruction describes a path through a photorealistic simulator populated with indoor environments from the Matterport3D dataset, which includes 3D captures of homes, offices and public buildings.”
For those who are unfamiliar with Matterport3D, it is a large, diverse RGB-D dataset for scene understanding that contains 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Google researchers built on this already large dataset to create something even bigger: RxR is roughly 10x the size of existing VLN datasets.
As you can see in the image below, the agent moves through the different rooms, and its path is depicted by differently colored “pose traces,” which are described in further detail in the RxR article. The agent then outputs a corresponding description of what it is seeing; if you compare the text to the agent's trajectory, the color of the text matches the colored path in the image.
Alongside the introduction of this new dataset, Google researchers also announced the RxR Challenge to keep track of progress in VLN. The RxR Challenge is “a competition that encourages the machine learning community to train and evaluate their instruction following agents on RxR instructions.” If you’re interested in learning more or participating in the competition, visit the link in the resources below.
Resources to get started with RxR:
Image Source: ExtendedSumm Paper
Full Paper: On Generating Extended Summaries of Long Documents
Do you wish there was a way to help summarize your latest paper or document instead of going back and forth trying to decide what is most important to include? Give ExtendedSumm a try.
Researchers Sajad Sotudeh, Arman Cohan, and Nazli Goharian from the IR Lab at Georgetown University created a method that expands upon previous research on generating high-level summaries of short documents. Their method generates a more in-depth, extended summary for longer documents such as research papers, legal documents, and books.
They describe their method as one that “aims at jointly learning to predict sentence importance and its corresponding section,” and they summarize their methodology as follows in the abstract of their publication:
“Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm.”
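To make the multi-task idea concrete, here is a conceptual PyTorch sketch (not the authors' code): a shared sentence encoder feeds two heads, one scoring sentence importance and one predicting the section a sentence belongs to, and the two losses are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch only (not the authors' code): a shared sentence encoder
# produces one embedding per sentence; two heads predict (a) how summary-worthy
# the sentence is and (b) which section it belongs to, trained jointly.
class MultiTaskExtractiveScorer(nn.Module):
    def __init__(self, hidden_dim=768, num_sections=5):
        super().__init__()
        self.importance_head = nn.Linear(hidden_dim, 1)
        self.section_head = nn.Linear(hidden_dim, num_sections)

    def forward(self, sentence_embeddings):            # (num_sentences, hidden_dim)
        importance_logits = self.importance_head(sentence_embeddings).squeeze(-1)
        section_logits = self.section_head(sentence_embeddings)
        return importance_logits, section_logits

def joint_loss(imp_logits, imp_labels, sec_logits, sec_labels, alpha=0.5):
    # Weighted sum of the two task losses; alpha is an illustrative weight.
    importance_loss = F.binary_cross_entropy_with_logits(imp_logits, imp_labels.float())
    section_loss = F.cross_entropy(sec_logits, sec_labels)
    return importance_loss + alpha * section_loss

# Example with 10 random "sentence embeddings":
scorer = MultiTaskExtractiveScorer()
imp_logits, sec_logits = scorer(torch.randn(10, 768))
loss = joint_loss(imp_logits, torch.randint(0, 2, (10,)),
                  sec_logits, torch.randint(0, 5, (10,)))
```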
The study showed that their method either matched or exceeded the BertSumExt baseline across summarization datasets whose summaries vary in length.
Image Source: ExtendedSumm Paper
Datasets used for this model:
Resources to get started with ExtendedSumm:
Are you looking for a comparison of different MLOps platforms? Or maybe you just want to discuss the pros and cons of operating an ML platform on the cloud vs. on-premise? Sign up for our MLOps Briefing -- it's completely free, and you can bring your own questions or set the agenda.
Khalid Salama, author of the original article featured on the Keras website, gives us a tutorial on how to build a dual-encoder neural network model to search for images using natural language.
“The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe.”
- Khalid Salama, Natural language image search with a Dual Encoder
We will give a high-level description of the steps involved in this method, along with a small sketch of the core idea below, but if you're interested in the full tutorial with screenshots of the code after reading our summary, you can find his full article here.
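As a taste of that core idea, here is a minimal TensorFlow sketch of a symmetric contrastive objective over a batch of image and caption embeddings. It assumes you already have two encoders (any image backbone and any text encoder) producing L2-normalized vectors of the same size, and it is a common formulation of the idea rather than an exact copy of the loss in Salama's tutorial.

```python
import tensorflow as tf

# Symmetric contrastive objective for a dual encoder: matching image/caption
# pairs should be more similar to each other than to any other item in the batch.
def contrastive_loss(image_embeddings, caption_embeddings, temperature=0.05):
    # Pairwise similarity between every caption and every image in the batch;
    # the matching pairs sit on the diagonal.
    logits = tf.matmul(caption_embeddings, image_embeddings, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    captions_to_images = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    images_to_captions = tf.keras.losses.sparse_categorical_crossentropy(
        labels, tf.transpose(logits), from_logits=True)
    # Train both directions symmetrically.
    return tf.reduce_mean(captions_to_images + images_to_captions) / 2.0

# Quick check with random embeddings for a batch of 8 image-caption pairs:
images = tf.math.l2_normalize(tf.random.normal((8, 256)), axis=1)
captions = tf.math.l2_normalize(tf.random.normal((8, 256)), axis=1)
print(contrastive_loss(images, captions).numpy())
```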
Requirements for initial setup:
Key steps used in this method:
If all goes according to plan, here's an example of the output you may receive, based on the results generated in Khalid Salama's article.
Resources to get started:
Imagine reading most, if not all, of the research papers from a conference full of innovative work that you may want to implement yourself. It would take a lot of time, right?
Luckily, Prabhu Prakash Kagitha has done the work for us in his articles NeurIPS 2020 Papers: Takeaways for a Deep Learning Engineer and NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer — Computer Vision. Featured on the blog Towards Data Science, he shares summaries from the 2020 Neural Information Processing Systems (NeurIPS) conference, with the first post covering general deep learning papers and the second dedicated specifically to papers related to computer vision.
The links below will take you to each of the featured papers, but I highly recommend reading his summaries first if you're short on time. He does a great job of summarizing each paper's main topic and results, and he provides a practical key takeaway for deep learning engineers who do not have the time to read through the entire paper.
The images below, created by Prabhu, show an overview of the topics covered in each section.
Papers featured in Prabhu's general deep learning post include:
Image Source: Prabhu's computer vision post
Papers featured in Prabhu's computer vision post include:
Instead of scouring the internet for the latest or most popular papers and models, it's always nice to have a curated list to make the search easier, right? Well, you're in luck!
In his featured article, Papers with Code 2020: A Year in Review, Ross Taylor lists the top 10 research papers, libraries, and benchmarks for 2020 from the Papers with Code blog. Below, we’ve listed five links from each section that we believe you will find interesting.
After you've finished reading the summaries below, you can click here to read the rest of Ross Taylor's article and see what else trended last year on Papers with Code.
Top Research Papers:
Top Libraries:
Top Benchmark Datasets:
Don't forget to take a look at other posts from our blog to see how we used some of these items in our own and our partners' projects!