Behzad Golshan

Working on AI, LLMs & the inevitable 🤖 uprising

NAACL 2018 Keynotes (Summary)

Why 72? (by Charles Yang – University of Pennsylvania)

This was a great talk. He started with a simple question: “Why do Chinese-speaking kids (i.e., 2-3 year olds) learn to count faster than English-speaking kids?” Apparently, an English-speaking kid can count up to 100 once they know how to count to 72. The corresponding number is about 40 for Chinese. He explained that this has to do with how we learn language rules (and their exceptions). Obviously, exceptions make it difficult for any system to learn a rule (e.g., “fifteen” instead of “fiveteen”). His work provides (theoretical and experimental) evidence that humans learn a rule if there are at most “n / ln(n)” exceptions in “n” observed samples. Using this, you can explain the numbers 72 and 40. This is all cool, but why should we care? Well, he then focused on a few rules of language (for instance, the fact that both “a” and “the” can precede a noun), and showed that in a large natural corpus only 30% of nouns appear with both “a” and “the”. He found this disturbing since, according to his model, humans should not generalize that nouns can appear with either “a” or “the”. So why do we generalize? It turns out that we learn this rule when we only know about 3000 words, and this limited set of words commonly appears with both “a” and “the”. Thus, when we only know 3000 words, it’s easier for us to figure out the rule. In other words, not knowing many words is a blessing. Based on this, he concluded the talk by suggesting that the NLP community use less (and more limited) data! He believes most of what we want (and need) to learn can be learned from simpler and more limited data.
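To make the threshold concrete, here is a quick back-of-the-envelope check (my own, not from the talk) of how many exceptions “n / ln(n)” tolerates at the two numbers he mentioned:

```python
import math

def tolerance_threshold(n: int) -> float:
    """Yang's Tolerance Principle: a rule over n items remains
    learnable if it has at most n / ln(n) exceptions."""
    return n / math.log(n)

# The two counting thresholds mentioned in the talk.
for n in (40, 72):
    print(f"n = {n}: at most {tolerance_threshold(n):.1f} exceptions tolerated")

# Output:
# n = 40: at most 10.8 exceptions tolerated
# n = 72: at most 16.8 exceptions tolerated
```

So by the time an English-speaking kid can count to 72, the rule can absorb about 17 exceptions, presumably enough to cover English’s irregular number words (eleven, twelve, thirteen, fifteen, and so on); Chinese number words are far more regular, so the threshold is reached earlier, around 40.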

Building a SocialBot: Lessons Learned from 10M Conversations (by Mari Ostendorf – University of Washington):

Her lab was the winner of the Alexa SocialBot challenge, and this talk was basically a summary of what they learned during the challenge. She shared many funny examples of what may go wrong in these systems, and I promise to include one at the end. Here are some of the points from the talk:

  • Seq2seq models are really good at saying “I don’t know”, so use “rule-based” techniques especially if you are launching a new system and have no clue how users might interact with your system. Rule-based chatbots can help you gather the data you need to bootstrap. 
  • People enjoy chatting if (1) you have something interesting to say, and (2) you show interest in them (even if it’s as subtle as acknowledging their response in a smarter way).
  • Doing (1) from the previous bullet is very challenging. You need to fetch interesting facts and information from relevant sources, understand them, and incorporate them appropriately into the conversation. 
  • It’s important to classify your user. Are you talking to a kid? An adversarial user? etc. 
  • In their experience, the best design for a dialogue manager consists of a “master dialogue manager” that delegates the conversation to more specialized dialogue systems (see the sketch after this list).
  • Finally, they mentioned that it is extremely hard to have deep conversations. Most exchanges are short and they have to move on to a different topic to keep the conversation going.
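Here is a minimal sketch of that master-and-specialists design. The handler names and the keyword-based routing are my own illustration; a real system would route with trained classifiers over dialogue state:

```python
from typing import Callable, Dict

# Hypothetical specialized dialogue systems (stubs).
def movies_bot(utterance: str) -> str:
    return "Have you seen anything good lately?"

def news_bot(utterance: str) -> str:
    return "Here is something interesting I read today..."

def chitchat_bot(utterance: str) -> str:
    return "That's interesting, tell me more!"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "movie": movies_bot,
    "news": news_bot,
}

def master_dialogue_manager(utterance: str) -> str:
    """Delegate the turn to a specialized dialogue system; fall back
    to chit-chat when no specialist claims the topic."""
    for keyword, handler in SPECIALISTS.items():
        if keyword in utterance.lower():
            return handler(utterance)
    return chitchat_bot(utterance)

print(master_dialogue_manager("Any movie recommendations?"))
```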

Here is one of the funny examples she presented. This is a conversation between the bot and a small kid (at a kindergarten where they were demoing the bot):

The Moment When the Future Fell Asleep (by Kevin Knight — USC but moving to DiDi):

This keynote was a bit all over the place (reviewing the speaker’s past projects). It was quite an entertaining talk with a few serious moments. He talked about his work on deciphering text, and how he put together a team to decipher an encrypted message from Zodiac (the serial killer) as part of a reality TV show (“The Hunt for the Zodiac Killer”). He was hoping to become one of the few people (along with Natalie Portman, Kristen Stewart, and Carl Sagan) who have an Erdős-Bacon number. After this part, he pointed out that what we do in our community can be of interest to people in arts and entertainment, but the community normally ignores such applications and sources of funding. He also talked about his work on poem generation (which you might remember from EMNLP), and mentioned how difficult it was to evaluate such a system. He pointed out that the first papers on machine translation approached the problem without having proper methods to evaluate it. He encouraged the NLP community to work on influential problems even when proper evaluation metrics are lacking.

Building innovative startups, products, and services – personal insights (by Daniel Marcu – USC):

I think “How to have a successful career” would have been a better title for the talk, because that’s what it was. He shared some of his personal insights, which I’m going to list here:

  • Master the hype cycle. He believes it is important (both for researchers and businesses) to know at which stage of the hype cycle a technology is, in order to decide how much to focus on it. Here is the hype cycle from 2017 (and I’m not sure who makes this!):
  • Understand the space of competitors. Figure out what you can bring to the table that is unique or better. A successful career in research follows the same principle.
  • Have a contribution-driven mindset. Work on problems where you can make a major contribution, and don’t let the complexity of the problem, prestige, or the ongoing hype be your guide when choosing what to work on. 
  • Hack stuff! Build things, since most problems yield to hacking. When hacking fails, you end up with a concrete and valuable research problem.
  • Do not ignore scalability while focusing on quality. If your system takes 3 weeks longer to train and shows only a slight improvement over the baseline, it’s not really worth it.
  • Understand that academic metrics do not translate well to the real world. In many cases, these numbers do not translate to what users want. 
  • Note that other communities are usually a better judge of what is useful among the things we’ve built in our community. 
  • And always be customer-oriented. Who your customers are is quite clear in industry, but in academia they are your colleagues: people who are waiting for new ideas to build upon. Ask yourself whether others can build upon your work to advance the field.

Google Assistant or My Assistant? Towards Personalized Situated Conversational Agents (by Dilek Hakkani-Tur – Google AI):

The talk presented an overview of the NLP community’s progress on building chatbots. She mentioned that “chit-chat” and “task-oriented” chatbots are gradually converging, as most task-oriented chatbots are learning to generalize to multiple domains, and chit-chat bots are learning that both context and structured knowledge are very important for a successful conversation. She then moved on to the particular architecture of an assistant bot they are developing at Google. She used various examples to convey that an assistant bot needs to understand the input text, remember the context, detect the state of the conversation, decide on an action, and generate appropriate responses (a sketch of this pipeline follows the list below). She claimed that, in their experience, it was easier to build this system using an end-to-end architecture. She mentioned that by breaking the system down into smaller independent units, they faced many challenges:

  • Improving individual components in many cases does not affect the overall performance of the system. They considered this wasted effort.
  • Debugging becomes a tedious task. If the system does something unexpected, it is hard to figure out which component is responsible for the error, while in an end-to-end architecture this is more straightforward.
    Note: I have to say that I’m not sure if I understand the second point properly. To me, debugging an end-to-end system is quite difficult. Perhaps more difficult than debugging a modular design. 
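Here is a minimal sketch of the modular pipeline she described. The stage names follow the talk, but the stub implementations, intents, and templates are hypothetical:

```python
def understand(text: str) -> dict:
    """NLU: map the user's text to an intent and slots (stubbed)."""
    return {"intent": "ask_weather", "slots": {"city": "Seattle"}}

def track_state(state: dict, parse: dict) -> dict:
    """State tracking: fold the new parse into the dialogue context."""
    new_state = dict(state)
    new_state.update(parse["slots"])
    return new_state

def decide_action(state: dict) -> str:
    """Policy: choose the next system action given the state."""
    return "report_weather" if "city" in state else "ask_city"

def generate(action: str, state: dict) -> str:
    """NLG: realize the chosen action as text (template-based stub)."""
    templates = {
        "report_weather": f"It's raining in {state.get('city')} today.",
        "ask_city": "Which city do you mean?",
    }
    return templates[action]

state: dict = {}
parse = understand("What's the weather like?")
state = track_state(state, parse)
print(generate(decide_action(state), state))
```

The sketch also makes the first challenge concrete: improving `understand` in isolation changes nothing if `decide_action` never uses the extra information it extracts.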

Finally, she presented the architecture that carries out all these tasks at once, but I’ll leave that part out since the technical papers explain the architecture in more detail.

Final note: Alon mentioned that Google Assistant is not based on an end-to-end architecture, so what she was presenting might be part of their new efforts to carry out deeper conversations in Google Assistant.

Cheap Tricks and the Perils of Machine Learning (by Percy Liang – Stanford):

This keynote was actually part of the deep-learning workshop and not the main conference, but it should have been, because it was awesome. The talk had two parts, namely “Harder Data” and “Harder Problems”, and I’m going to follow the same outline.

Harder Data: As you know, Percy created the SQuAD dataset for question answering, which is widely used as a benchmark. He started with the viewpoint of Hector Levesque (from the University of Toronto), who thinks that most of what our algorithms learn on datasets such as SQuAD is a collection of “cheap tricks”. The canonical example was asking machines to resolve the coreference in the sentence “Sarah yelled at Paul because she was angry”. Levesque thinks the pronoun “she” will be resolved to “Sarah” just based on gender. This works, but it means our networks are mostly learning such tricks rather than figuring out (deeply) how references work in language. To highlight this issue, the “Winograd Schema Challenge” was created. The task in the challenge is still coreference resolution, but on sentences for which commonsense reasoning is required (e.g., “James yelled at Paul because he was angry”). It has been shown that most state-of-the-art models fail miserably on this dataset. Percy acknowledged this, but mentioned that there might be easier ways to get such harder data. As an example, he focused on QA systems and argued that QA is a super-problem: most NLP tasks, such as coreference resolution or slot-filling, can be viewed as question answering. So he retargeted his QA system to do slot-filling and coreference resolution, and found that it performs very poorly. He used this as evidence supporting Levesque’s viewpoint, while disagreeing that we need to construct datasets as hard as the Winograd Schema Challenge from scratch: reformulating existing tasks as QA already surfaces the weakness.
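Here is a sketch of that reduction: casting a coreference instance as a question for a generic QA system. The `qa_model` interface and the toy model below are my own stand-ins; the actual experiment used a trained SQuAD-style model:

```python
def coreference_as_qa(sentence: str, pronoun: str, qa_model) -> str:
    """Reduce coreference resolution to question answering by asking
    who the pronoun refers to, with the sentence as the context."""
    question = f'In "{sentence}", who does "{pronoun}" refer to?'
    return qa_model(context=sentence, question=question)

# Toy stand-in for a SQuAD-style model; it embodies exactly the kind
# of "cheap trick" the talk warned about (always guess an early name).
def toy_qa_model(context: str, question: str) -> str:
    return context.split()[0]

print(coreference_as_qa("Sarah yelled at Paul because she was angry",
                        "she", toy_qa_model))  # -> "Sarah"
```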

Harder Problems: This part of the talk pointed out an important difference between how we model the world and how physicists model the world. The point was that physicists do not launch 10,000 rockets before learning to launch one. Why aren’t we like that? He argued that this is because they fully study a phenomenon and then build systems, while we do that very rarely. He pointed out that NNs are not going to be enough unless we incorporate our knowledge of what we are trying to model. For instance, once we incorporated the fact that locality matters in images and text, we achieved much better results using CNNs (for images) and attention mechanisms (for text). He argued that we need to do this more often. Lastly, he pointed out that the basic assumption in ML is that your training and test data come from the same distribution. While this assumption underlies much of what we do, he argued that the real test of how much you have mastered a problem is to evaluate your system on a dataset with a different distribution; for instance, how well would a system tuned for QA on SQuAD perform on the Winograd Schema Challenge? He envisioned that in the future, an ML paper would report numbers on two test sets: one with a distribution similar to the training set, and one with a different distribution.
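A sketch of what that two-test-set report could look like; the toy model and datasets below are purely illustrative:

```python
def accuracy(model, dataset) -> float:
    """Fraction of (input, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)

def model(x: int) -> int:
    """Toy 'model' that has memorized the training distribution."""
    return x % 2

in_dist = [(i, i % 2) for i in range(100)]         # same distribution
out_dist = [(i, (i + 1) % 2) for i in range(100)]  # shifted distribution

# The two numbers he envisioned every ML paper reporting.
print(f"in-distribution accuracy:     {accuracy(model, in_dist):.2f}")
print(f"out-of-distribution accuracy: {accuracy(model, out_dist):.2f}")
```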