Behzad Golshan

KDD 2020 Highlights

Note: I am not the sole author of this post! I have written this article in collaboration with my colleagues at Megagon Labs. See the original post.

The KDD (Knowledge Discovery and Data Mining) conference is the oldest and arguably the top data mining conference in the world. This year's conference, which was planned to take place in San Diego, became an all-virtual conference due to the COVID-19 pandemic. I was certainly looking forward to meeting old friends and making new ones at KDD, and while Zoom socials and messaging apps are helpful, they don't quite replace in-person interactions. Although we all missed the sun and the beach, we still got a well-organized KDD conference that was bigger than ever. There were roughly 210 accepted research papers, 32 workshops, and more than 40 tutorials, including one by our own wonderful colleague, Estevam Hruschka, on "Data-Driven Never-Ending Learning Question Answering Systems", which you can watch here. Given the magnitude of the conference, it's nearly impossible to summarize everything. Nevertheless, I'm going to try my best to share some of the latest trends and mention some of the papers and talks that I've really liked along the way.

KDD has always been a popular venue for papers in the areas of graph mining, recommender systems, and text mining. This year's KDD was by no means an exception: the majority of papers were on graph mining, a paper on evaluating recommender systems (which I highly recommend) won the best paper award, and there were multiple tutorials on text mining. While the practical problems that the KDD community cares about haven't changed significantly, there has been a significant shift in how these problems are addressed. There is a clear rise in machine-learning techniques, and specifically deep neural networks, being used to dig even deeper into many of the field's fundamental problems. To put things in perspective, 16% of papers in this year's KDD mention "deep" or "neural" in their title. The same number for papers in KDD 2016 (just four years ago) was only 3%.

Graph mining, with its numerous applications, has made rapid strides with the emergence of Graph Neural Networks (GNNs). A large number of papers in KDD 2020 explore how graph representation learning (through GNNs) can be improved to tackle classic problems such as node classification, link prediction, and graph classification (see PolicyGNN and TinyGNN). One of my favorites this year was a paper titled GPT-GNN, which shows how GNNs can be pre-trained through generation (of edges and node attributes). To me, the paper is a clear example of how fast ideas are being exchanged across different disciplines (e.g., vision and NLP). Similarly, techniques based on pre-training and transfer learning are becoming more and more common in text mining. This paper from Amazon and CMU is a good example of using transformer-based models for multi-label text classification. Given all this, it's not surprising that KDD has been organizing a Deep Learning Day since 2018.

Similar to last year's conference, KDD 2020 also organized an Earth Day as well as a Health Day to bring researchers and practitioners in these disciplines together. I very much liked this talk in the Earth Day event on how the number of unreported cases in a pandemic can be estimated. I have to admit that 2020's global pandemic and the California fires made me appreciate research in these areas more than ever. Speaking of events focused on applied research, KDD 2020 had a number of invited speakers from industry, including our own head of research, Wang-Chiew Tan, who talked about the power of subjective data. If you are interested, you can read more about subjective data here.

Although KDD’s virtual conference portal is no longer open, you can still gain a lot from this year’s KDD. A great number of talks and tutorials are available on YouTube and Vimeo, and you can browse all the papers and review their abstracts through PaperDigest. KDD 2021 is going to be in Singapore and I’m looking forward to seeing what’s coming next.

NAACL 2018 Talks (Summary)

A Scalable Neural Shortlisting-Reranking Approach for Large-Scale Domain Classification in Natural Language Understanding (from the Alexa team)

The talk explained how Alexa detects intents and performs slot-filling. A lot of what they talked about is directly related to FrameIt. There were two main points in the talk:

  • Similar to FrameIt, they first do intent classification and then slot-filling. But like us, they want to see if slots can help with intent classification. To achieve this, they first shortlist the top "K" intents according to the ML model. Then they carry out slot-filling for each of these intents, and pass the "K" intents along with the extracted slots to another classifier to see which one makes more sense. That classifier can use the signal from the extracted slots to make a better judgment (see the sketch after this list).
  • They mentioned that intent detection for them works in a hierarchical way. For instance, if you say "Play the thriller for me", they don't directly map it to an intent. They first decide that it's not a question but a command. Then they decide that it asks to play something back. Then they decide whether that something is the thriller movie, the song "Thriller", or something else. They mentioned that this hierarchical exploration is essential, as otherwise there would be too many intents to build a single classifier for.
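
Here is a minimal sketch of that shortlist-then-rerank pipeline as I understood it; the toy scoring functions and the slot-count reranker below are stand-ins I made up for illustration (the real system uses neural models):

```python
# Toy sketch of shortlist-then-rerank intent classification.
from typing import Dict, List, Tuple

K = 2  # number of intents to shortlist

def score_intents(utterance: str) -> Dict[str, float]:
    """Toy first-stage classifier: scores every intent."""
    return {"PlayMusic": 0.40, "PlayMovie": 0.35, "Weather": 0.05}

def fill_slots(utterance: str, intent: str) -> Dict[str, str]:
    """Toy slot-filler, run once per shortlisted intent."""
    if intent == "PlayMusic" and "thriller" in utterance.lower():
        return {"song": "Thriller"}
    if intent == "PlayMovie" and "thriller" in utterance.lower():
        return {"genre": "thriller"}
    return {}

def rerank_score(utterance: str, intent: str, score: float,
                 slots: Dict[str, str]) -> float:
    """Toy second-stage classifier: intent score plus a bonus per filled
    slot, so the extracted slots can overturn the first-stage ranking."""
    return score + 0.2 * len(slots)

def classify(utterance: str) -> Tuple[str, Dict[str, str]]:
    shortlist: List[Tuple[str, float]] = sorted(
        score_intents(utterance).items(), key=lambda kv: kv[1], reverse=True
    )[:K]
    enriched = [(i, s, fill_slots(utterance, i)) for i, s in shortlist]
    best = max(enriched, key=lambda c: rerank_score(utterance, *c))
    return best[0], best[2]

print(classify("Play the thriller for me"))  # ('PlayMusic', {'song': 'Thriller'})
```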

Self-Training for Jointly Learning to Ask and Answer Questions (from CMU)

The idea here was very simple and neat. They mentioned that people are working on QA and question generation as two separate problems, even though these problems are tightly connected. They presented how you can create new questions using existing question generation techniques and use them as data to improve your QA system, and how you can use a QA system to train a better question generation system. The technical part of the talk focused on making sure that this cycle is stable and avoids drifting into meaningless questions and answers.
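
Here is a rough sketch of what such a loop might look like; the qa_model/qg_model objects, their methods, and the confidence threshold are hypothetical stand-ins for the paper's actual components:

```python
# Sketch of a self-training loop between a QA model and a question
# generation (QG) model; both are assumed to expose generate/answer/train.

def self_training_loop(qa_model, qg_model, unlabeled_passages,
                       n_rounds: int = 3, min_confidence: float = 0.9):
    """Alternately let each model generate training data for the other.

    The confidence filter stands in for the paper's stability mechanism:
    low-confidence question/answer pairs are discarded so the cycle does
    not drift into meaningless questions and answers.
    """
    for _ in range(n_rounds):
        # 1. Question generation -> new training data for QA.
        synthetic_qa = []
        for passage in unlabeled_passages:
            question = qg_model.generate(passage)
            answer, conf = qa_model.answer(passage, question)
            if conf >= min_confidence:  # keep only pairs the QA model is sure of
                synthetic_qa.append((passage, question, answer))
        qa_model.train(synthetic_qa)

        # 2. QA answers -> new training data for question generation.
        qg_model.train([(p, a, q) for p, q, a in synthetic_qa])
    return qa_model, qg_model
```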

The Web as a Knowledge-base for Answering Complex Questions (from Tel Aviv University)

Their goal was to answer complex questions using the web, e.g., "What is the population of the richest country in Europe?". They mentioned that at the moment, Google does not handle such questions, but by decomposing them into logical forms and issuing each smaller chunk as a query to Google, you can answer them. Using an off-the-shelf semantic parser, they first figure out that they need to find the "richest country in Europe" (which Google can easily answer) and then ask "What is the population of Germany?", which again Google is good at answering.
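
Here is a toy sketch of the decompose-then-query idea; the decomposition is hard-coded for the running example (the paper learns it with a semantic parser), and the search function is a canned stand-in:

```python
# Toy illustration of answering a complex question by decomposition.

def search(query: str) -> str:
    """Stand-in for issuing a simple query to a search engine."""
    canned = {
        "richest country in Europe": "Germany",
        "population of Germany": "83 million",
    }
    return canned[query]

def answer_complex(question: str) -> str:
    # Step 1: the semantic parser splits the question into a chain of
    # simple sub-questions (hard-coded here for the running example).
    sub_q1 = "richest country in Europe"
    entity = search(sub_q1)                 # -> "Germany"
    # Step 2: substitute the intermediate answer into the next sub-question.
    sub_q2 = f"population of {entity}"
    return search(sub_q2)                   # -> "83 million"

print(answer_complex("What is the population of the richest country in Europe?"))
```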

Semantic Structural Evaluation for Text Simplification (from The Hebrew University of Jerusalem)

This was a 12-minute talk, so many parts remained unclear to me. They proposed a new evaluation method for text simplification. I found this interesting as we've also discussed simplifying happy moments and similar data in the lab. Their proposed method uses a semantic parser, namely UCCA, which is focused on extracting what they call a "Scene". For example, "walking" would be a scene in "I went for a walk". They use the notion of "Scene" to measure how well your simplified text conveys the same "Scenes" as the original. I'm skeptical about how well they can tell that "We had a great chat with coworkers over lunch" and "I chatted with coworkers" are similar. It all comes down to how they define scenes.

Specialising Word Vectors for Lexical Entailment (from University of Cambridge)

I'm not going to dive into details, but they've proposed a new technique for learning word vectors that also accounts for lexical entailment (i.e., knowing that "bird" IsA "animal"). Their proposed word embeddings capture the similarity of words through the cosine similarity of the vectors, which means they tune the direction of each vector in the space. For lexical entailment, they use the length of each vector, making sure that "animal" ends up with a larger norm than "bird". By combining the direction and the length of the vectors, they can capture both word similarity and entailment.
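
A small numeric illustration of these two signals with toy vectors of my own (the paper's actual objective and scoring function differ): direction encodes relatedness, and the norm encodes generality.

```python
# Toy vectors: direction ~ relatedness, norm ~ generality.
import numpy as np

vectors = {
    "bird":   np.array([1.0, 0.2]),   # short vector: specific term
    "animal": np.array([2.0, 0.4]),   # same direction, longer: general term
    "car":    np.array([0.1, 1.5]),   # different direction: unrelated
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def entails(hyponym: str, hypernym: str, sim_threshold: float = 0.8) -> bool:
    """Toy decision rule: similar direction AND the hypernym is longer."""
    u, v = vectors[hyponym], vectors[hypernym]
    return cosine(u, v) > sim_threshold and np.linalg.norm(v) > np.linalg.norm(u)

print(entails("bird", "animal"))  # True: same direction, larger norm
print(entails("animal", "bird"))  # False: entailment is asymmetric
print(entails("bird", "car"))     # False: directions differ
```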

Colorless Green Recurrent Networks Dream Hierarchically (from Facebook AI)

They studied a simple question: "Do RNNs learn syntax?". I found this interesting since we've talked about including POS tags as additional features many times. Well, they concluded that RNNs are good at learning syntax, which makes me believe there is really no point in providing syntactic features to these models (at least in the presence of enough training data). If you're wondering about the title, I should add that they wanted to make sure the RNN is learning the syntax and not inferring anything from the semantics of the words, so they used meaningless yet grammatically correct sentences such as the well-known "Colorless green ideas sleep furiously".

Sentiment Analysis: It's Complicated! (from McGill University)

This was a really cool study, which was more about crowdsourcing than sentiment analysis. They collected sentiment labels for tweets using the usual positive, negative, and neutral classes. Obviously, the workers did not always agree. They realized that if they only keep data points with at least a weak majority (3 out of 5 votes), they need to throw away 10% of their data, and if they enforce a stronger agreement (4 out of 5 votes), they need to throw out 35% of their data. Nevertheless, they trained sentiment classifiers using these two datasets after removing the disagreements. Then, they re-labeled the data with a new category called "complicated". Using this new data, they had two findings:

  • The classifier using these 4 categories performs much better on test data than the classifiers that use the 3 classes (and are not fed the disagreements).
  • They noticed that simply marking disagreements as "complicated" doesn't actually work: the points workers disagree on do not necessarily align with what people explicitly mark as complicated.

Based on this, they suggested that in most crowdsourcing tasks we should avoid dropping the problematic data, since keeping it under an explicit label (e.g., "complicated") can help improve the system.
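
Here is a small sketch of the vote-aggregation mechanics those numbers imply (my reconstruction, not the authors' code): with five votes per tweet, the agreement threshold decides how much data survives, and anything below it can be kept under an explicit extra label instead of being dropped.

```python
# Toy vote aggregation with an explicit low-agreement class.
from collections import Counter

def aggregate(votes, min_agreement):
    """votes: five labels from {'positive', 'negative', 'neutral'}.

    Returns the majority label if enough workers agree, and an
    explicit extra class otherwise (rather than discarding the item)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else "complicated"

votes = ["positive", "positive", "positive", "negative", "neutral"]
print(aggregate(votes, min_agreement=3))  # 'positive'    (kept under 3/5)
print(aggregate(votes, min_agreement=4))  # 'complicated' (kept, not dropped)
```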

Deep Contextualized Word Representations (from AI2) [Best Paper]

This paper is also known as the "ELMo" paper and received the best paper award. It is indeed a very influential paper, as they achieve a new state of the art on multiple NLP tasks. The talk was quite technical, and I need to consult the paper to make sure that I fully understand the details. At a high level, the paper proposes a new word vector model that does not assign a fixed vector to each word. Instead, the embedding of a word is dynamic and changes depending on the context it appears in. As a result, "stork" might have a different embedding depending on whether you are talking about wildlife vs. (delivering) babies! They've shown that plugging these embeddings into some of the competitive models yields state-of-the-art results. To compute (or rather update) the embeddings, they use LSTMs to summarize the context to the right and to the left of the word in that particular sentence.
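
To make the idea concrete, here is a conceptual PyTorch sketch (my own toy illustration with made-up layer sizes, not AI2's implementation): a bidirectional LSTM runs over the sentence, so the vector produced for each token depends on its neighbors.

```python
# Conceptual sketch of context-dependent embeddings in the spirit of ELMo.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN = 1000, 32, 64  # toy hyperparameters

static_emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)                    # fixed lookup
bilstm = nn.LSTM(EMB_DIM, HIDDEN, bidirectional=True, batch_first=True)

def contextual_embeddings(token_ids: torch.Tensor) -> torch.Tensor:
    """(batch, seq_len) -> (batch, seq_len, 2 * HIDDEN).

    Each token's output mixes the left context (forward LSTM) and the
    right context (backward LSTM), so the same word gets a different
    vector in different sentences."""
    x = static_emb(token_ids)
    out, _ = bilstm(x)
    return out

# The same token id (say, 42 for "stork") in two different contexts
# yields two different vectors:
sent_a = torch.tensor([[5, 42, 7, 9]])   # "stork" amid wildlife words
sent_b = torch.tensor([[3, 42, 8, 2]])   # "stork" amid baby-delivery words
vec_a = contextual_embeddings(sent_a)[0, 1]
vec_b = contextual_embeddings(sent_b)[0, 1]
print(torch.allclose(vec_a, vec_b))  # False: the embedding depends on context
```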

There was much more going on at NAACL, but I think this is a good point to stop 🙂

NAACL 2018 Keynotes (Summary)

Why 72? (by Charles Yang – University of Pennsylvania)

This was a great talk. He started with a simple question: "Why do Chinese-speaking kids (i.e., 2-3 year olds) learn to count faster than English-speaking kids?". Apparently, an English-speaking kid can count up to 100 once they know how to count to 72. The corresponding number is about 40 for Chinese. He explained that this has to do with how we learn language rules (and their exceptions). Obviously, exceptions make it difficult for any system to learn the rules (e.g., "fifteen" instead of "fiveteen"). His work provides (theoretical and experimental) evidence that humans learn a rule if there are at most "n / ln(n)" exceptions in "n" observed samples. Using this, you can explain the numbers 72 and 40. This is all cool, but why should we care? Well, he then focused on a few rules of language (for instance, the fact that both "a" and "the" can precede a noun), and showed that in a natural (large) corpus only 30% of nouns appear with both "a" and "the". He found this disturbing, since according to his model, humans should not be able to generalize that nouns can appear with either "a" or "the". So why do we generalize? It turns out that we learn this rule when we only know about 3,000 words, and this limited set of words commonly appears with both "a" and "the". Thus, when we only know 3,000 words, it's easier for us to figure out the rules. In other words, not knowing many words is a blessing. Based on this, he concluded the talk by suggesting that the NLP community use less (and more limited) data! He believes most of what we want (and need) to learn can be learned from simpler and more limited data.
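
For the curious, the threshold is easy to compute; here is a quick check of the "n / ln(n)" formula from the talk (the per-language exception counts are not something I noted down, so only the formula itself is from the talk):

```python
# Quick numerical check of the tolerance threshold n / ln(n).
import math

def tolerance_threshold(n: int) -> float:
    """Maximum number of exceptions a rule over n items can tolerate."""
    return n / math.log(n)

for n in (40, 72, 100):
    print(f"n = {n:3d}: rule survives up to {tolerance_threshold(n):.1f} exceptions")
# n =  40: rule survives up to 10.8 exceptions
# n =  72: rule survives up to 16.8 exceptions
# n = 100: rule survives up to 21.7 exceptions
```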

Building a SocialBot: Lessons Learned from 10M Conversations (by Mari Ostendorf – University of Washington):

Her lab was the winner of the Alexa SocialBot challenge, and this talk was basically a summary of what they learned during the challenge. She shared many funny examples of what may go wrong in these systems, and I promise to include one at the end. Here are some of the points from the talk:

  • Seq2seq models are really good at saying "I don't know", so use "rule-based" techniques, especially if you are launching a new system and have no clue how users might interact with it. Rule-based chatbots can help you gather the data you need to bootstrap.
  • People enjoy chatting if (1) you have something interesting to say, and (2) you show interest in them (even if it’s as subtle as acknowledging their response in a smarter way).
  • Doing (1) from the previous bullet is very challenging. You need to fetch interesting facts and information from relevant sources, understand them, and incorporate them appropriately into the conversation. 
  • It’s important to classify your user. Are you talking to a kid? An adversarial user? etc. 
  • In their experience, the best design for a dialogue manager consists of a "master dialogue manager" that delegates the conversation to more specialized dialogue systems (see the sketch after this list).
  • Finally, they mentioned that it is extremely hard to have deep conversations. Most exchanges are short and they have to move on to a different topic to keep the conversation going.
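
Here is a toy sketch of that master-manager design (my own illustration, not their system; the topic classifier and specialist bots are made-up stand-ins):

```python
# Toy "master dialogue manager" that routes turns to specialist bots.
from typing import Callable, Dict

def movies_bot(utterance: str) -> str:
    return "Speaking of movies, have you seen anything good lately?"

def news_bot(utterance: str) -> str:
    return "Here's something interesting I read in the news today..."

def fallback_bot(utterance: str) -> str:
    return "Tell me more about that!"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "movies": movies_bot,
    "news": news_bot,
}

def classify_topic(utterance: str) -> str:
    """Stand-in topic classifier; a real system would use an ML model."""
    for topic in SPECIALISTS:
        if topic in utterance.lower():
            return topic
    return "fallback"

def master_dialogue_manager(utterance: str) -> str:
    topic = classify_topic(utterance)
    handler = SPECIALISTS.get(topic, fallback_bot)  # delegate the turn
    return handler(utterance)

print(master_dialogue_manager("I love movies!"))
```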

Here is one of the funny examples she presented: a conversation between the bot and a small kid, from a kindergarten where they were demoing the bot. (The transcript appeared as an image in the original post.)

The Moment When the Future Fell Asleep (by Kevin Knight — USC but moving to DiDi):

This keynote was a bit all over the place (reviewing the speaker's past projects). It was quite an entertaining talk with a few serious moments. He talked about his work on deciphering text, and how he put together a team to decipher an encrypted message from the Zodiac killer as part of a reality TV show ("The Hunt for the Zodiac Killer"). He was hoping to become one of the few people (along with Natalie Portman, Kristen Stewart, and Carl Sagan) who have an Erdős–Bacon number. After this part, he pointed out that what we do in our community can be of interest to people in arts and entertainment, but the community normally ignores such applications and sources of funding. He also talked about his work on poem generation (which you might remember from EMNLP), and mentioned how difficult it was to evaluate such a system. He pointed out that the first papers on machine translation approached the problem without having proper methods to evaluate it. He encouraged the NLP community to work on influential problems even when there is a lack of evaluation metrics.

Building innovative startups, products, and services – personal insights (by Daniel Marcu – USC):

I think "How to have a successful career" would have been a better title for this talk, because that's what it was about. He shared some of his personal insights, which I'm going to list here:

  • Master the hype cycle. He believes it is important (both for researchers and businesses) to know at which stage of the hype cycle a technology is in order to decide how to focus on it. He showed the hype cycle from 2017 (and I'm not sure who makes this!), which appeared as an image in the original post.
  • Understand the space of competitors. Figure out what you can bring to the table that is unique or better. A successful research career follows the same principle.
  • Have a contribution-driven mindset. Work on problems where you can make major contributions, and don't let the complexity of the problem, prestige, or the ongoing hype be your guide when choosing what to work on.
  • Hack stuff! Build things, as most things tend to work with some hacking. When hacking fails, you end up with a concrete and valuable research problem.
  • Do not ignore scalability while focusing on quality. If your system takes 3 weeks longer to train and shows only a slight improvement over the baseline, it's not really worth it.
  • Understand that academic metrics do not translate well to the real world. In many cases, these numbers do not reflect what users want.
  • Note that other communities are usually a better judge of what is useful among the things our community builds.
  • And always be customer-oriented. Who your customers are is quite clear in industry, but in academia they are your colleagues, the people waiting for new ideas to build upon. Ask yourself whether others can build upon your work to advance the field.

Google Assistant or My Assistant? Towards Personalized Situated Conversational Agents (by Dilek Hakkani-Tur – Google AI):

The talk presented an overview of the NLP community's progress on building chatbots. She mentioned that "chit-chat" and "task-oriented" chatbots are eventually converging, as most task-oriented chatbots are learning to generalize to multiple domains, and chit-chat bots are learning that both context and structured knowledge are very important for a successful conversation. She then moved on to the particular architecture of an assistant bot they are developing at Google. She used various examples to convey that an assistant bot needs to understand the input text, remember the context, detect the state of the conversation, decide on an action, and generate appropriate responses (see the sketch after the list below). She claimed that, in their experience, it was easier to build this system using an end-to-end architecture. She mentioned that by breaking the system down into smaller independent units, they faced many challenges:

  • Improving individual components in many cases does not affect the overall performance of the system. They considered this wasted effort.
  • Debugging becomes a tedious task. If the system does something unexpected, it is hard to figure out which component is responsible for the error, while in an end-to-end architecture this is more straightforward.
    Note: I have to say that I'm not sure I understand the second point properly. To me, debugging an end-to-end system is quite difficult, perhaps more difficult than debugging a modular design.
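
For reference, here is a toy sketch of the modular pipeline she contrasted with the end-to-end approach (my own illustration, not Google's system): each stage is a separate, independently improvable unit.

```python
# Toy modular dialogue pipeline: understand -> remember -> decide -> generate.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueState:
    history: List[str] = field(default_factory=list)  # remembered context
    intent: str = ""                                   # current understanding

def understand(utterance: str) -> str:
    """Toy NLU: map the utterance to an intent."""
    return "greet" if "hello" in utterance.lower() else "chitchat"

def decide_action(state: DialogueState) -> str:
    """Toy policy: pick an action from the tracked state."""
    return "greet_back" if state.intent == "greet" else "ask_followup"

def generate(action: str) -> str:
    """Toy NLG: realize the chosen action as text."""
    return {"greet_back": "Hello there!",
            "ask_followup": "Tell me more."}[action]

def respond(state: DialogueState, utterance: str) -> str:
    state.history.append(utterance)        # remember the context
    state.intent = understand(utterance)   # understand the input
    action = decide_action(state)          # decide on an action
    return generate(action)                # generate a response

state = DialogueState()
print(respond(state, "Hello!"))  # "Hello there!"
```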

Finally, she presented the architecture that carries out all these tasks at once, but I’ll leave that part out since the technical papers explain the architecture more adequately. 

Final note: Alon mentioned that Google Assistant is not based on an end-to-end architecture, so what she was presenting might be part of their new efforts to carry out deeper conversations in Google Assistant.

Cheap Tricks and the Perils of Machine Learning (by Percy Liang – Stanford):

This keynote was actually part of the deep-learning workshop and not the main conference, but it should have been because it was awesome. The talk had two parts, namely “Harder Data” and “Harder Problems”, and I’m going to follow the same outline.

Harder Data: As you may know, Percy created the SQuAD dataset for question answering, which is widely used as a benchmark. Percy started by presenting the viewpoint of Hector Levesque (from the University of Toronto), who thinks that most of what our algorithms learn on datasets such as SQuAD is a collection of "cheap tricks". The canonical example is asking machines to resolve the coreference in the sentence "Sarah yelled at Paul because she was angry". Levesque thinks the pronoun "she" will be resolved to "Sarah" just based on gender. This works, but it means our networks are mostly learning such tricks rather than deeply figuring out how references work in language. To highlight this issue, the "Winograd Schema Challenge" was created. The task in the challenge is coreference resolution, but on sentences for which commonsense reasoning is required (e.g., "James yelled at Paul because he was angry"). It has been shown that most state-of-the-art models fail miserably on this dataset. Percy acknowledged this, but mentioned that there might be easier ways to get such harder data. As an example, he focused on QA and argued that QA is a super-problem: most NLP tasks, such as coreference resolution or slot-filling, can be viewed as question answering (see the sketch below). So he targeted his QA system at slot-filling and coreference resolution and found that it performs very poorly. He used this as evidence both to agree with Levesque's viewpoint and to argue that we don't necessarily need to start with datasets as hard as the Winograd Schema Challenge.
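
Here is a toy illustration (my own, not Percy's system) of that reformulation: wrapping a coreference instance as a (context, question) pair that any QA model can be probed with.

```python
# Toy "QA as a super-problem": coreference resolution posed as QA.
def coreference_as_qa(sentence: str, pronoun: str) -> dict:
    """Wrap a coreference instance as a (context, question) pair."""
    return {
        "context": sentence,
        "question": f'In the sentence above, who does "{pronoun}" refer to?',
    }

example = coreference_as_qa(
    "James yelled at Paul because he was angry.", "he"
)
print(example["question"])
# A gender cue resolves "she" -> Sarah in the easier variant, but here
# both candidates are male, so answering requires commonsense reasoning.
```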

Harder Problems: This part of the talk pointed out an important difference between how we model the world and how physicists model the world. The point was that physicists do not launch 10,000 rockets before learning to launch one. Why aren't we like that? He argued that this is because they fully study a phenomenon and then build systems, while we do that very rarely. He pointed out that NNs are not going to be enough unless we incorporate our knowledge of what we are trying to model. For instance, when we incorporated the fact that locality matters in images and text, we achieved much better results using CNNs (for images) and attention mechanisms (for text). He argued that we need to do this more often. Lastly, he pointed out that the basic assumption in ML is that your training and test data come from the same distribution. While this is an important assumption for many things we do, he argued that the real test of how much you have mastered a problem is to evaluate your system on a dataset with a different distribution; for instance, how well would systems tuned for QA on SQuAD perform on the Winograd Schema Challenge? He envisioned that in the future, an ML paper would report numbers on two test sets: one with a distribution similar to the training set, and one with a different distribution.