Corinna Cortes is a Danish computer scientist known for her contributions to machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award for her work on theoretical foundations of support vector machines.

Cortes received her M.S. degree in physics from Copenhagen University in 1989. In the same year she joined AT&T Bell Labs as a researcher and remained there for about ten years. She received her Ph.D. in computer science from the University of Rochester in 1993. She is an Editorial Board member of the journal Machine Learning.

Cortes’ research covers a wide range of topics in machine learning, including support vector machines and data mining. In 2008, she and Vladimir Vapnik jointly received the Paris Kanellakis Theory and Practice Award for the development of support vector machines (SVMs), a highly effective algorithm for supervised learning. Today, the SVM is one of the most frequently used algorithms in machine learning, with practical applications including medical diagnosis and weather forecasting.

Abstract Summary:

Harnessing Neural Networks:

Deep learning has demonstrated impressive performance gains in many machine learning applications. However, unveiling and realizing these gains is not always straightforward. Discovering the right network architecture is critical for accuracy and often requires a human in the loop. Some network architectures occasionally produce spurious outputs, and the outputs have to be restricted to meet the needs of an application. Finally, realizing the performance gain in a production system can be difficult because of long inference times.

In this talk we discuss methods for making neural networks efficient in production systems. We also discuss an efficient method for automatically learning the network architecture, called AdaNet. We provide theoretical arguments for the algorithm and present experimental evidence for its effectiveness.

- 1. Harnessing Neural Networks Corinna Cortes Google Research, NY
- 2. Harnessing the Power of Neural Networks Introduction How do we standardize the output? How do we speed up inference? How do we automatically find a good network architecture?
- 3. Google’s mission is to organize the world’s information and make it universally accessible and useful.
- 4. Google Translate
- 5. Smart reply in Inbox 10% of all responses sent on mobile
- 6. LSTM in Action
- 7. LSTMs and Extrapolation They daydream or hallucinate :-) Feature or bug?
- 8. DeepDream Art Auction and Symposium (A&MI)
- 9. Magenta A.I. Duet https://aiexperiments.withgoogle.com/ai-duet/view/
- 10. Harnessing the Power of Neural Networks Introduction How do we standardize the output? How do we speed up inference? How do we automatically find a good network architecture?
- 11. Restricting the Output. Smart Replies. http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf ● Ungrammatical and inappropriate answers ○ thanks hon!; Yup, got it thx; Leave me alone! ● Work with a Fixed Response Set ○ Sanitized answers are clustered in semantically similar answers using label propagation; ○ The answers in the clusters are used to filter the candidate set generated by the LSTM. Diversity is ensured by using top answers from different clusters. ● Efficient search via tries
- 12. Search Tree, Trie, for Valid Responses Tuesday Wednesday Tuesday? Wednesday? I can do Cluster responses How about . ! ! What time works for you? . What time works for you?
- 13. Computational Complexity ● Exhaustive: R x l R size of response set, l length of longest sentence ● Beam search: b x l Typical size of R ~ millions, typical size of b ~ 10-30
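The trie-restricted search from slides 11 to 13 can be sketched in a few lines of Python. This is a minimal illustration, not the Smart Reply implementation: the response set, the `END` marker, and `toy_score` (a stand-in for the LSTM's per-token log-probability) are all assumptions made for the example. Because each step only expands tokens that keep the prefix inside the trie, the number of scored prefixes grows with the beam width b and sentence length l rather than with the response set size R.

```python
from typing import Callable, List, Tuple

END = "<end>"  # hypothetical end-of-response marker, not from the talk


def build_trie(responses: List[List[str]]) -> dict:
    """Store every allowed response as a path of tokens in a nested dict."""
    root: dict = {}
    for tokens in responses:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: dict,
    score_fn: Callable[[List[str], str], float],
    beam_width: int = 2,
    max_len: int = 10,
) -> List[Tuple[float, List[str]]]:
    """Beam search that only follows token sequences present in the trie."""
    beams = [(0.0, [], trie)]  # (log-score, tokens so far, current trie node)
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len + 1):
        candidates = []
        for score, tokens, node in beams:
            for tok, child in node.items():
                if tok == END:
                    # The prefix is a complete, valid response.
                    finished.append((score, tokens))
                else:
                    new_score = score + score_fn(tokens, tok)
                    candidates.append((new_score, tokens + [tok], child))
        if not candidates:
            break
        # Keep only the best `beam_width` partial responses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return sorted(finished, key=lambda f: f[0], reverse=True)


# Toy response set and toy scores (both assumptions for illustration).
RESPONSES = [
    ["I", "can", "do", "Tuesday"],
    ["I", "can", "do", "Wednesday"],
    ["How", "about", "Tuesday", "?"],
    ["What", "time", "works", "for", "you", "?"],
]
TOY_LOGPROB = {"I": -0.1, "can": -0.1, "do": -0.1, "Tuesday": -0.5,
               "Wednesday": -1.0, "How": -2.0, "about": -0.1, "What": -3.0}


def toy_score(prefix: List[str], token: str) -> float:
    """Stand-in for the model: a context-free log-probability per token."""
    return TOY_LOGPROB.get(token, -5.0)


results = constrained_beam_search(build_trie(RESPONSES), toy_score)
```

Every entry in `results` is guaranteed to be a member of `RESPONSES`, which is the point of the constraint: the model ranks candidates but cannot emit anything outside the sanitized set.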
- 14. What if the Response Set is in Billions? ● A more elegant solution based on rules ○ Exploit rules to efficiently enlarge the response set: ■ “Can you do Monday?” “Yes, I can do Monday” ■ “Can you do Tuesday?” “Yes, I can do Tuesday” ■ ... “Can you do <time>?” “Yes, I can do <time>” or “No, I can do <time + 1>”
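The rule on slide 14 replaces one stored response per weekday with a single template. A small Python sketch of that idea follows; the regular expression, the `DAYS` list, and the reading of `<time + 1>` as "the following day" are assumptions for illustration, not details from the talk.

```python
import re
from typing import List

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical template for the "Can you do <time>?" rule.
PATTERN = re.compile(r"^Can you do (\w+)\?$")


def rule_responses(message: str) -> List[str]:
    """Instantiate the response templates for a matching message."""
    match = PATTERN.match(message)
    if not match or match.group(1) not in DAYS:
        return []  # the rule does not apply to this message
    day = match.group(1)
    # Assumption: "<time + 1>" means the next day, wrapping after Sunday.
    next_day = DAYS[(DAYS.index(day) + 1) % len(DAYS)]
    return [f"Yes, I can do {day}", f"No, I can do {next_day}"]
```

For example, `rule_responses("Can you do Monday?")` yields the two instantiated answers for Monday, while a message that does not match the template yields an empty list.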
- 15. Rules for Response Set Text Normalization for Text-to-Speech, TTS, Systems Navigation assistant
- 16. Text Normalization Richard Sproat, Navdeep Jaitly, Google: “RNN Approaches to Text Normalization: A Challenge” https://arxiv.org/pdf/1611.00068.pdf
- 17. Break the Task in Two ● Channel model ○ What are the possible normalizations of a token? Sequence of tokens to words. ○ Example: 123 ■ one hundred twenty three, one two three, one twenty three, ... ● Language model ○ Which normalization is appropriate in the given context? Words to words. ○ Example: 123 ■ 123 King Ave. - the correct reading in American English would normally be one twenty three.
- 18. Combining the Models One combined LSTM
- 19. Silly Mistakes
- 20. Add a Grammar to Constrain the Output Rule: <number> + <measurement abbreviation> => <number> + the possible verbalizations of the measure abbreviation. Instantiation: 24.2kg => twenty four point two kilogram, twenty four point two kilograms, twenty four point two kilo. Finite State Transducers: a finite state automaton which produces output as well as reading input, pattern matching, regular expressions.
- 21. Thrax Grammar MEASURE: <number> + <measurement abbreviation> -> <number> + measurement verbalizations Input: 5 kg -> five kilo/kilograms/kilogram MONEY: $ <number> -> <number> dollars Input composed with FSTs. The output of the FST is used to restrict the output of the LSTM.
- 22. TTS: RNN + FST Measure and Money restricted by grammar.
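The grammar constraint from slides 20 to 22 can be illustrated with a small Python sketch. It uses a regular expression and a dictionary as a stand-in for the Thrax grammar and finite state transducers named on the slides, so it shows the filtering idea only. The helper names, the 0 to 99 number range, and the fallback behavior are assumptions; the `kg` verbalizations are the ones listed on slide 20.

```python
import re
from typing import List, Optional

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Measure abbreviation -> allowed verbalizations (values from slide 20).
MEASURES = {"kg": ["kilogram", "kilograms", "kilo"]}

# <number> + <measurement abbreviation>, limited here to 0-99 for brevity.
MEASURE_RE = re.compile(r"^(\d{1,2}(?:\.\d+)?)\s*([a-zA-Z]+)$")


def say_int(n: int) -> str:
    """Verbalize an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def say_number(text: str) -> str:
    """Verbalize a number such as '24.2' as 'twenty four point two'."""
    whole, _, frac = text.partition(".")
    words = say_int(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words


def allowed_verbalizations(token: str) -> Optional[List[str]]:
    """Return the outputs the grammar permits, or None if no rule applies."""
    match = MEASURE_RE.match(token)
    if not match or match.group(2) not in MEASURES:
        return None
    number = say_number(match.group(1))
    return [f"{number} {unit}" for unit in MEASURES[match.group(2)]]


def constrain(token: str, ranked_candidates: List[str]) -> str:
    """Pick the model's best candidate that the grammar allows."""
    allowed = allowed_verbalizations(token)
    if allowed is None:
        return ranked_candidates[0]  # no rule applies: trust the model
    for candidate in ranked_candidates:
        if candidate in allowed:
            return candidate
    return allowed[0]  # assumption: fall back to the first grammar output
```

Here `ranked_candidates` plays the role of the RNN's ranked outputs. A candidate that the grammar does not license for a covered token is skipped even if the model ranks it first.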
- 23. Harnessing the Power of Neural Networks Introduction How do we standardize the output? How do we speed up inference? How do we automatically find a good network architecture?
- 24. Super-Multiclass Classification Problem One class per image type (horse, car, …), M classes. Last hidden layer: N units. Output layer: M units. Neural network inference: just to compute the last layer requires MN multiply adds.
- 25. Asymmetric Hashing W1 W2 W3 WM Weights to the output layer, partitioned into N/k chunks ● Represent each chunk with a set of cluster centers (256) using k-means. ● Save the coordinates of the centers, (ID, coordinates). ● Save each weight vector as a set of closest IDs, hashcode.
- 26. Asymmetric Hashing W1 W2 W3 WM Weights to the output layer, partitioned into N/k chunks ● Represent each chunk with a set of cluster centers (256) using k-means. ● Save the coordinates of the centers, (ID, coordinates). ● Save each weight vector as a set of closest IDs, hashcode. 78 184 15 12 63 192 56 82 72 201 37 51
- 27. Asymmetric Hashing, Searching ● For given activation u, divide it into its N/k chunks, uj : ○ Compute the 256 N/k distances to centers. 256N multiply adds, not MN. ○ Compute the distances to all hash codes: ● MN/k additions needed. ● The “Asymmetric” in “Asymmetric Hashing” refers to the fact that we hash the weight vectors but not the activation vector.
- 28. Asymmetric Hashing Incredible saving in inference time Sometimes also with a bit of improved accuracy
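The indexing and search steps from slides 25 to 27 can be sketched in plain Python. This is a toy illustration under stated assumptions, not the production system: the k-means routine is a bare-bones Lloyd's iteration, all function names are made up for the example, and the per-chunk score is an inner product (what an output layer computes), whereas the slide speaks of "distances". The cost structure matches the slides: one lookup table of (number of centers) x N/k entries per activation, then N/k additions per class.

```python
import random
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def chunks(v: Vector, k: int) -> List[Vector]:
    """Split a vector of length N into N/k chunks of length k."""
    return [v[i:i + k] for i in range(0, len(v), k)]


def nearest(p: Vector, centers: List[Vector]) -> int:
    """Index of the center closest to p in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((x - c) ** 2 for x, c in zip(p, centers[i])))


def kmeans(points: List[Vector], n_centers: int,
           iters: int = 10, seed: int = 0) -> List[Vector]:
    """Minimal Lloyd's k-means; requires n_centers <= len(points)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_centers)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # leave a center in place if it attracted no points
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def build_index(weights: List[Vector], k: int,
                n_centers: int) -> Tuple[List[List[Vector]], List[List[int]]]:
    """Learn one codebook per chunk position and hash each weight vector."""
    per_chunk = zip(*[chunks(w, k) for w in weights])
    codebooks = [kmeans(list(col), n_centers) for col in per_chunk]
    codes = [[nearest(c, codebooks[j]) for j, c in enumerate(chunks(w, k))]
             for w in weights]
    return codebooks, codes


def approx_scores(u: Vector, codebooks: List[List[Vector]],
                  codes: List[List[int]], k: int) -> List[float]:
    """Score every class using table lookups instead of full dot products."""
    # table[j][c] = inner product of activation chunk j with center c.
    table = [[dot(uj, c) for c in cb]
             for uj, cb in zip(chunks(u, k), codebooks)]
    # Each class costs N/k additions: one table lookup per chunk.
    return [sum(table[j][c] for j, c in enumerate(code)) for code in codes]
```

Only the weight vectors are replaced by hash codes; the activation `u` is used at full precision when the lookup table is built, which is the asymmetry the name refers to.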
- 29. Harnessing the Power of Neural Networks Introduction How do we standardize the output? How do we speed up inference? How do we automatically find a good network architecture?
- 30. “Learning to Learn” a.k.a. “Automated Hyperparameter Tuning” Google: AdaNet, Architecture Search with Reinforcement Learning. MIT: Designing Neural Network Architectures Using Reinforcement Learning. Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks. Genetic Algorithms, Reinforcement Learning, Boosting Algorithm
- 31. Modeling Challenges for ML The right model choice can significantly improve the performance. For Deep Learning it is particularly hard as the search space is huge and ● Difficult non-convex optimization ● Lack of sufficient theory Questions ● Can neural network architectures be learned together with their weights? ● Can this problem be solved efficiently and in a principled way? ● Can we capture the end-to-end process?
- 32. AdaNet ● Incremental construction: At each round, the algorithm adds a subnetwork to the existing neural network; ● Algorithm leverages embeddings previously learned; ● Adaptively grows network, balancing trade-off between empirical error and model complexity; ● Learning bound:
- 33. Experimental Results, AdaNet. CIFAR-10: 60,000 images, 10 classes. SD of all #’s: 0.01

  | Label Pair | AdaNet | Log. Reg. | NN |
  | --- | --- | --- | --- |
  | deer-truck | 0.94 | 0.90 | 0.92 |
  | deer-horse | 0.84 | 0.77 | 0.81 |
  | automobile-truck | 0.85 | 0.80 | 0.81 |
  | cat-dog | 0.69 | 0.67 | 0.66 |
  | dog-horse | 0.84 | 0.80 | 0.81 |
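The incremental construction described on the AdaNet slide can be illustrated with a deliberately simplified Python loop. This is not AdaNet itself: the "subnetworks" are fixed scalar functions, the loss is squared error, and the penalty form (`lam` times a hand-assigned complexity times the absolute weight) is an assumption chosen to show the trade-off between empirical error and model complexity. All names are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, function, complexity score); complexity values are hand-assigned.
Candidate = Tuple[str, Callable[[float], float], float]


def objective(preds: List[float], ys: List[float], penalty: float) -> float:
    """Empirical error (mean squared error) plus accumulated complexity penalty."""
    mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)
    return mse + penalty


def grow(xs: List[float], ys: List[float], candidates: List[Candidate],
         lam: float = 0.01, max_rounds: int = 5):
    """Greedily add the candidate that most lowers the penalized objective."""
    preds = [0.0] * len(xs)
    penalty = 0.0
    chosen: List[Tuple[str, float]] = []
    best_obj = objective(preds, ys, penalty)
    for _ in range(max_rounds):
        best = None
        for name, h, complexity in candidates:
            hs = [h(x) for x in xs]
            norm = sum(v * v for v in hs)
            if norm == 0:
                continue
            # Least-squares weight for this candidate on the current residual.
            w = sum((y - p) * v for y, p, v in zip(ys, preds, hs)) / norm
            new_preds = [p + w * v for p, v in zip(preds, hs)]
            new_penalty = penalty + lam * complexity * abs(w)
            obj = objective(new_preds, ys, new_penalty)
            if best is None or obj < best[0]:
                best = (obj, name, w, new_preds, new_penalty)
        # Stop when no candidate improves the penalized objective.
        if best is None or best[0] >= best_obj - 1e-12:
            break
        best_obj, name, w, preds, penalty = best
        chosen.append((name, w))
    return chosen, best_obj


# Toy data (y = 2x + 1) and toy candidates, for illustration only.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]
CANDIDATES: List[Candidate] = [
    ("const", lambda x: 1.0, 1.0),
    ("linear", lambda x: x, 2.0),
    ("cubic", lambda x: x ** 3, 8.0),
]

chosen, final_objective = grow(XS, YS, CANDIDATES)
```

The loop stops adding components once the extra complexity penalty outweighs the reduction in error, which mirrors the adaptive growth described on the slide in a much smaller setting.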
- 34. Neural Architecture Search with RL
- 35. Neural Architecture Search with RL. Error rates on CIFAR-10. Perplexity on Penn Treebank. Current accuracy of NAS on ImageNet: 78%. State of the art: 80.x%
- 36. “Learning to Learn” a.k.a. “Automated Hyperparameter Tuning” Google: AdaNet, Architecture Search with Reinforcement Learning. MIT: Designing Neural Network Architectures Using Reinforcement Learning. Harvard, Toronto, MIT, Intel: Scalable Bayesian Optimization Using Deep Neural Networks. Genetic Algorithms, Reinforcement Learning, Boosting Algorithm
- 37. Harnessing the Power of Neural Networks Introduction How do we standardize the output? How do we speed up inference? How do we automatically find a good network architecture?
