Speaker: Kenneth Heafield, Lecturer at the University of Edinburgh
Summary: The ParaCrawl project is mining a petabyte of the web for translations to release freely at https://paracrawl.eu/releases.html. But the web is a messy place, with a lot of data to sift through. To find translations, we translate everything into English or at least use a neural encoder. A related project makes machine translation inference more efficient by using optimizations ranging from assembly instructions to removal of bits of model architecture.
Website & Internet + Performance testing - Roman Ananev
A presentation about how a site works on the Internet and what happens when you open it in your browser: what goes on under the hood of the server and the browser.
How to measure the performance of the CS-Cart project simply and without technical knowledge :) And, of course, why all the online performance-testing services lie, or don't provide a clear view ;)
https://www.simtechdev.com/cloud-hosting
---
Cloud hosting for CS-Cart, Multi-Vendor, WordPress, and Magento
by Simtech Development - AWS and CS-Cart certified hosting provider
free installation & migration | free 24/7 server monitoring | free daily backups | free SSL | and more...
Living in a multilingual world: Internationalization for Web 2.0 Applications - Lars Trieloff
Lars Trieloff's presentation at Web 2.0 Expo Berlin covers the why and how-to of internationalization for web 2.0, consolidating i18n technology and enabling user-contributed translations.
Where are we, as front-end developers? This presentation walks through a short timeline of computer science, focusing on client-side development as a means to answer why and what has changed, and explores patterns and trends for the near future.
English article: https://medium.com/@caiovaccaro/javascript-state-of-the-union-2015-part-1-7ccff74813fa#.8x9y48ohk
Presented at 3|SHARE's EVOLVE'14 - The Adobe Experience Manager Community Summit on Wednesday November 19th, 2014 at the Hard Rock Hotel in San Diego, CA. evolve14.com
DevOps Fest 2019. Gianluca Arbezzano. DevOps never sleeps. What we learned fr... - DevOps_Fest
When a company with an open-source project announces that it is working on a new major release, there is always a lot of chatter in the community, because you never know how much it is going to break your system. Gianluca Arbezzano, SRE at InfluxData, will speak about the journey the company is facing, from a DevOps perspective, in moving from InfluxDB v1 to version 2, a fully integrated platform that builds on the strong background we gained running a database like InfluxDB at scale in our SaaS offering. This is not just a story about how a project evolved; it touches the whole company, in particular everything around Kubernetes, containers, and automation. It also covers how the SRE team managed the onboarding of 20 developers onto a cloud-based project where operating and observing the system is key to learning how to build a more solid and sustainable product.
SearchLeeds 2018 - Craig Campbell - How to fix the most common technical SEO ... - Branded3
Craig uses SEMRush’s auditing tool to demonstrate some of the most common issues found during a website audit. He shares actionable tips and advice on improving your technical SEO to boost your website's performance online, including site speed optimisation, time to first byte, using the correct SSL certificates, and much more.
6 site migration fails and how to avoid them - BrightonSEO September 2018 - J... - Oban International
A failed site migration can result in a loss of traffic, a drop in search rankings, and lower engagement on your site. Find out how to avoid the mistakes that lead to a failed site migration with James Brown, SEO Strategist at Oban International.
Globant Week Cali - Understanding front-end development in the modern world - Globant
A new JavaScript framework comes out every day, and it is hard to keep up with each one unless we find common ground among them. How do you build your next web application, and which tools should you use so that it is not obsolete before the product is finished? How do we do it at Globant? We will cover many topics, such as Web Components, ES6, TypeScript/Flow, CSS frameworks, unit testing, and even back-end development with JavaScript.
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F... - Flink Forward
Distributed tracing is used to analyze performance and error cases in service oriented architectures. The Observability team at Airbnb recently created Upshot, a data pipeline that uses Flink to analyze over 40 million trace events per minute. Summaries of the resulting data are sent to Druid, Datadog, and other downstream datastores. This talk will focus on how we use Flink and how we analyzed and addressed scaling issues we encountered while building Upshot.
Going Global 101: How to Manage Your Websites Worldwide Using Drupal - Acquia
Internet usage has exploded worldwide over the last decade. More than one third of today’s world population has internet access. In fact, as of 2014, the number of internet users worldwide was 2.92 billion, up from 2.71 billion in the previous year. This shows that the ability to provide fully translated content to users is more important than ever. Translated content can help you to better market your brand in new regions, and establish your website as an international authority on its subject matter.
Without proper preparation, various challenges can arise when translating websites into Português, Русский, Français, Italiano, Español, Deutsch, 中文, 日本語, 한국어, and other languages. In this webinar you will hear from Lingotek on how to be internationally savvy, and discover:
-Simple tips and lessons for developing localization-ready websites using Drupal
-Why translation and localization really matter to your digital strategy
-The 7 Elements of Localizability
-How Qualcomm, a leader in next-generation mobile technologies, is successfully developing international-ready websites using Drupal
Speaker: Vitalii Braslavskyi, Software Engineer at Grammarly
Summary:
Today, the dominant approach to software engineering is an imperative one — the best practices have been proven over time. But the world is always evolving, and in order to evolve with it and remain as productive as possible, we need to continue searching for better tools to solve problems of increasing complexity.
In this talk, we'll discuss the tools and techniques of the .NET ecosystem that can help us concentrate on the problem itself, not just on the intermediate steps (which have likely already been solved). We'll compare imperative and declarative approaches and assess solutions to problems.
We'll also offer examples of how engineers in Grammarly's Office Add-in team use these tools to improve the efficiency of our engineering and strengthen our solutions to the problems at hand.
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri... - Grammarly
Speaker: Elena Voita, a Ph.D. student at the University of Edinburgh and the University of Amsterdam
Summary: How can you know whether a model (e.g., ELMo, BERT) has learned to encode a linguistic property? The most popular approach to measure how well pretrained representations encode a linguistic property is to use the accuracy of a probing classifier (probe). However, such probes often fail to adequately reflect differences in representations, and they can show different results depending on probe hyperparameters. As an alternative to standard probing, we propose information-theoretic probing which measures minimum description length (MDL) of labels given representations. In addition to probe quality, the description length evaluates “the amount of effort” needed to achieve this quality. We show that (i) MDL can be easily evaluated on top of standard probe-training pipelines, and (ii) compared to standard probes, the results of MDL probing are more informative, stable, and sensible.
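As a rough illustration of how a description length could be computed on top of an ordinary probe-training loop, the sketch below uses an online (prequential) code: the first block of labels is sent with a uniform code, and each later block is encoded with a probe trained on everything seen so far. This is only a minimal sketch under assumptions, not the speaker's implementation. The names `online_codelength`, `block_ends`, and `laplace_trainer` are hypothetical, and the stand-in probe ignores the representations entirely; a real probe would be a classifier trained on them.

```python
import math
from typing import Callable, List, Sequence

# A probe maps one representation to a probability for each label.
Probe = Callable[[Sequence[float]], Sequence[float]]
# A trainer fits a probe on the (representation, label) pairs seen so far.
Trainer = Callable[[Sequence[Sequence[float]], Sequence[int]], Probe]


def online_codelength(
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    num_labels: int,
    block_ends: Sequence[int],
    train: Trainer,
) -> float:
    """Return the online code length, in bits, of labels ys given xs.

    block_ends holds increasing cut points t_1 < ... < t_S = len(ys).
    """
    assert block_ends and block_ends[-1] == len(ys)
    # The first block is transmitted with a uniform code over the labels.
    bits = block_ends[0] * math.log2(num_labels)
    for start, end in zip(block_ends[:-1], block_ends[1:]):
        # Train only on the data that has already been "transmitted".
        probe = train(xs[:start], ys[:start])
        for x, y in zip(xs[start:end], ys[start:end]):
            bits -= math.log2(probe(x)[y])
    return bits


def laplace_trainer(num_labels: int) -> Trainer:
    """Hypothetical stand-in probe: add-one smoothed label frequencies."""

    def train(xs: Sequence[Sequence[float]], ys: Sequence[int]) -> Probe:
        counts: List[int] = [1] * num_labels
        for y in ys:
            counts[y] += 1
        total = sum(counts)
        probs = [c / total for c in counts]
        return lambda x: probs  # ignores the representation x

    return train
```

A shorter code length means the labels are easier to extract from the representations; a probe that never does better than uniform guessing costs exactly `len(ys) * log2(num_labels)` bits.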
Similar to Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficient translation - Kenneth Heafield
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and... - Grammarly
Speaker: Nizar Habash is an Associate Professor of Computer Science at New York University Abu Dhabi (NYUAD). Professor Habash’s research includes extensive work on machine translation, morphological analysis, and computational modeling of Arabic and its dialects. Professor Habash has been a principal investigator or co-investigator on over 20 grants. He has over 200 publications including a book titled “Introduction to Arabic Natural Language Processing.” His website is www.nizarhabash.com. He is the director of the NYUAD Computational Approaches to Modeling Language (CAMeL) Lab (www.camel-lab.com).
Summary: The Arabic language presents a number of challenges to researchers and developers of language technologies. Arabic is both morphologically rich and highly ambiguous; and it has a number of dialects that vary widely amongst themselves and with Standard Arabic. The dialects have no official spelling standards, and spelling and grammar errors are common in unedited Standard Arabic. In this talk, we present some of these challenges in detail and cover some of the ongoing efforts to address them with creative language technologies.
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che... - Grammarly
Speaker: Artem Chernodub, Chief Scientist at Clikque Technology and Associate Professor at Ukrainian Catholic University
Summary: Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and provide the ablation study for them on the well-known NER 2003 shared task dataset.
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do... - Grammarly
Speaker: Natalia Grabar, NLP scientist
Summary: We propose a set of experiments with the general objective of ensuring a better understanding of technical health documents. Various experiments address different steps of this complex and ambitious process: (1) categorization of documents according to their complexity; (2) detection of complex passages within documents; (3) acquisition of resources for the lexical and semantic simplification of documents; (4) alignment of parallel sentences from comparable corpora for generating rules for syntactic transformation. According to the steps and tasks, various methods are exploited (rule-based, machine learning, with and without linguistic knowledge). In addition to text simplification, the results and resources can be used for other NLP applications and tasks (e.g., information retrieval and extraction, question-answering, textual entailment).
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa... - Grammarly
Speaker: Isabelle Augenstein, Assistant Professor, University of Copenhagen
Summary: The spread of misinformation and disinformation is growing, and it’s having a big impact on interpersonal communications, politics and even science.
Traditional methods, e.g., manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n... - Grammarly
Speaker: Marek Rei, Senior Research Associate, University of Cambridge
Summary: The number of people learning English around the world is currently estimated at 1.5 billion and is predicted to exceed 1.9 billion by 2020. The increasing need to communicate beyond borders has created a large unmet demand for qualified language teachers across the globe. Computational models for error detection and essay scoring can alleviate this issue by giving millions of people access to affordable learning resources. Successful systems for automated language teaching will need to analyse language at various levels of granularity and provide useful feedback to individual students. In this talk, we will explore some of the latest approaches to written language assessment, using neural architectures for composing the meaning of a sentence or text, and also discuss potential future directions in the field.
Grammarly Meetup: DevOps at Grammarly: Scaling 100x - Grammarly
Speaker: Dmitry Unkovsky, Software Engineer at Grammarly
Summary: We will tell the story of DevOps at Grammarly since 2013. We’ll talk about how we managed infrastructure growth while keeping up with the rapid pace of product development; what worked for us and what did not, and why; and what it’s like to make technical choices as an engineer at our company. We will share our current vision and future plans.
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv... - Grammarly
Tabular data is difficult to analyze and search through. There is a clear need for new tools and interfaces that would allow even non-tech-savvy users to gain insights from open datasets without resorting to specialized data analysis tools or even having to fully understand the dataset structure. We explore the End-To-End Memory Networks architecture (Sukhbaatar et al., 2015) in application to answering natural language questions from tabular data. This architecture was originally designed for the question-answering tasks from short natural language texts (bAbI tasks) (Weston et al., 2015), which include testing elements of inductive and deductive reasoning, co-reference resolution and time manipulation.
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo... - Grammarly
Speaker: Jordi Carrera Ventura, Artificial Intelligence technologist at Telefónica R&D
Summary: Chatbots (aka conversational agents, spoken dialogue systems) allow users to interface with computers using natural language by simply asking questions or issuing commands.
Given a query, the chatbot builds a semantic representation of the input, transforms it into a logical statement, and performs all the necessary actions to fulfill the user's intent. Sometimes this simply means calculating an exact answer or retrieving a fact from a database, whereas other times it means building a contextual model and running a full-fledged conversation flow while keeping track of anaphoras and cross-references.
Besides the direct applications of chatbots in IoT (Amazon’s Alexa, Apple's Siri) and IT (the historical field of Information Retrieval as a whole can be seen as a sub-problem of spoken dialogue systems), chatbots' main appeal for technologists is their location at the intersection of all major Natural Language Processing technologies and many of the deepest questions in Cognitive Science today: semantic parsing, entity recognition, knowledge representation, and coreference resolution.
In this talk, I will explore those questions in the context of an applied industry setting, and I will introduce a framework suitable for addressing them, together with an overview of the state-of-the-art in chatbot technology and some original techniques.
Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Langu... - Grammarly
Speaker: Tim Baldwin, Professor of Computer Science, University of Melbourne
Summary: Two forms of bias that are commonly associated with natural language processing (NLP) tasks are domain bias (implicit bias towards documents from a particular domain, with lower performance over other document types) and social bias (implicit bias towards documents authored by particular types of individuals, with lower performance over documents authored by other types of individuals). In this talk, I will discuss the importance of debiasing NLP models across these dimensions, and strategies that can be employed to achieve this. I will focus the talk on the task of language identification (i.e., identifying the language(s) a written document is authored in).
Speaker: Andriy Gryshchuk, Senior Research Engineer at Grammarly.
Summary: Paraphrase detection is a challenging NLP task since it requires both thorough syntactic and thorough semantic analysis to identify whether two phrases have the same intent. A few months ago, paraphrase identification became an objective of one of the most popular Kaggle competitions, Quora Question Pairs. In this talk, Yuriy Guts and Andriy Gryshchuk, silver medalists of the competition, will share their arsenal of statistical, linguistic, and Deep Learning approaches that helped them succeed in this challenge.
Speaker: Yuriy Guts, Machine Learning Engineer at DataRobot.
Natural Language Processing for biomedical text mining - Thierry Hamon - Grammarly
Speaker: Thierry Hamon, Associate Professor in Computer Science at Université Paris, Member of the LIMSI-CNRS research lab.
Summary: Among the large amounts of unstructured data generated across the world and available nowadays, textual data represent an important source of information. This is particularly true in the biomedical domain, where a constantly increasing demand for access to textual content is observed, notably for accessing and processing Electronic Health Records, online discussion forums, and scientific literature. Indeed, dealing with biomedical texts requires us to take into account a great variety of texts, languages, and users.
For several years now, a lot of NLP research has focused on mining and retrieving information (i.e., medical entities and domain-specific relations), which are relevant for biologists, physicians, terminologists, epidemiologists, and patients. We will propose an overview of the NLP methods used for tackling several such research problems through text mining applications. First, we will present the resources and rule-based approaches we designed for extracting drug-related information from clinical texts, and for acquiring domain-specific semantic relations from digital libraries. Then we will present the cross-lingual approach we are developing for building multilingual terminologies from a patient-centered Ukrainian corpus.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
7. The chair broke.
Le présidente a éclaté.
Problem ParaCrawl Browser Translation Conclusion
8. Project
mine web for translations
for free: paracrawl.eu
10. Projects
mine web for translations
for free: paracrawl.eu
Bergamot
firefox translation extension
client-side
in progress: browser.mt
11. Projects
mine web for translations
for free: paracrawl.eu
Bergamot
firefox translation extension
client-side
in progress: browser.mt
data
12. Projects
mine web for translations
for free: paracrawl.eu
Bergamot
firefox translation extension
client-side
in progress: browser.mt
data
fast translation
13. Projects
Part 1
mine web for translations
for free: paracrawl.eu
Bergamot
Part 2
firefox translation extension
client-side
in progress: browser.mt
data
fast translation
14. ParaCrawl: crawl the web for parallel corpora
All 26 EU + EEA official languages
+3 Spanish co-official languages
4–1,178 million words per language
510,482 Websites
1+ Petabyte of compressed web pages
15. Parallel Corpus Size
Language Words
French 1,178,317,233
German 929,818,868
Spanish 897,891,704
Italian 533,512,632
Portuguese 299,634,135
Dutch 233,087,345
Russian 157,061,045
Polish 145,802,939
Swedish 138,264,978
Czech 117,385,158
Danish 106,565,546
Hungarian 104,292,635
Language Words
Greek 88,669,279
Finnish 66,385,933
Romanian 62,189,306
Bulgarian 55,725,444
Slovak 45,636,383
Croatian 43,464,197
Slovenian 31,855,427
Estonian 30,858,140
Lithuanian 27,214,054
Latvian 23,656,140
Irish 21,909,039
Maltese 4,252,814
Words on English side, after filtering
16. Improving Quality
ParaCrawl BLEU Gain
From To Release 1 Release 4
English Finnish +0.0 +1.2
Finnish English +2.5 +4.6
English Latvian +0.7 +1.9
Latvian English +0.9 +2.5
English Romanian +0.6 +1.3
Romanian English +2.4 +4.0
English Czech -1.4 -0.1
Czech English +0.6 +1.1
English German -3.2 +1.2
German English -1.0 +3.1
Gains relative to WMT data without ParaCrawl.
17. Text Extraction
CommonCrawl Targeted Crawls
Language Detection
Identify Multilingual Sites
Target
Document and Sentence Alignment
Cleaning Evaluation
18. Site Crawling
95% of translations we find are not in CommonCrawl.
Because CommonCrawl is too shallow.
19. Site Crawling
95% of translations we find are not in CommonCrawl.
Because CommonCrawl is too shallow.
→ We directly crawl multilingual sites.
→ Use the Internet Archive.
20. Learn what pages to crawl/links to follow?
URL: domain, language code, etc.
Link context: text, XPath
Bandit learning problem
Reward: pages in both languages are found
Ongoing work by Hieu Hoang.
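The bandit framing can be sketched as follows. This is a minimal illustration, assuming an epsilon-greedy policy; the slides do not say which bandit algorithm is used. The arm names and success rates are hypothetical, and the reward is 1 when following a link of that kind yields a page found in both languages.

```python
import random


class EpsilonGreedyCrawler:
    """Toy epsilon-greedy bandit; each arm is a hypothetical kind of link."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.mean_reward = {arm: 0.0 for arm in arms}

    def choose(self):
        # Explore a random arm with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, arm, reward):
        # Incremental mean of the rewards observed for this arm.
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]


def simulate(steps=2000, seed=0):
    # Assumed (made-up) chance that each kind of link leads to a translated page.
    true_rates = {"lang_code_in_url": 0.6, "language_switcher_text": 0.4, "other": 0.02}
    crawler = EpsilonGreedyCrawler(true_rates, epsilon=0.1, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(steps):
        arm = crawler.choose()
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        crawler.update(arm, reward)
    return crawler
```

In a real crawler the arms would presumably be built from the URL and link-context features listed above rather than three fixed categories.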
21. Not Translated: wordpress.com
Blog hosting site
=⇒ multilingual, but few translations.
We blacklist large untranslated sites.
22. Language classification
Say you’re looking for isiXhosa translations:
English Do you have pets?
isiXhosa Unazo izilwanaya zasekhaya?
23. Language classification
Say you’re looking for isiXhosa translations:
English Do you have pets?
isiXhosa Unazo izilwanaya zasekhaya?
isiXhosa occurs 0.000008x as often as English on the web.
This is lower than error rate in language classification.
=⇒ Most of the “isiXhosa” was actually baseball statistics.
=⇒ Sometimes we need to build language models to filter.
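To see why a rare language drowns in classifier errors, here is a small worked calculation. The 0.000008 ratio is from the slide; the 0.1% false-positive rate and perfect recall are assumptions chosen only for illustration, and only English is counted as a source of confusion.

```python
def precision_of_rare_language(relative_frequency, false_positive_rate, recall=1.0):
    """Share of text labelled as the rare language that really is that language.

    relative_frequency: rare-language text per unit of common-language text.
    false_positive_rate: share of common-language text mislabelled as rare.
    """
    true_hits = relative_frequency * recall
    false_hits = false_positive_rate
    return true_hits / (true_hits + false_hits)


# Slide figure for isiXhosa vs English, with an assumed 0.1% false-positive rate.
precision = precision_of_rare_language(0.000008, 0.001)
print(f"{precision:.2%} of text labelled isiXhosa would be isiXhosa")
```

Under these assumptions, well under 1% of the text labelled isiXhosa is real, which is why a second filter such as a language model is needed.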
24. Matching
We have text. How do we find translations?
Language codes in URLs [Resnik and Smith, 2003]
Translate to English, match [Uszkoreit et al, 2010]
Neural network vectors [Schwenk, 2018]
26. Matching
Translate everything to English.
=⇒ Need translation system (can use dictionary)
=⇒ Need fast translation
Match pages by tf-idf in (translated) English.
Then match sentences with n-gram overlap.
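The two matching steps can be sketched in plain Python. This is a toy version, assuming whitespace tokenization, a smoothed idf, and greedy best-match selection; the function names are illustrative and not taken from the ParaCrawl code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf so that terms found in every document keep a small weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def match_pages(english_pages, translated_pages):
    """For each English page, return the index of the most similar translated page."""
    docs = [p.lower().split() for p in english_pages + translated_pages]
    vecs = tfidf_vectors(docs)
    en, tr = vecs[:len(english_pages)], vecs[len(english_pages):]
    return [max(range(len(tr)), key=lambda j: cosine(e, tr[j])) for e in en]


def ngram_overlap(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the shorter sentence."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ga, gb = grams(a.lower().split()), grams(b.lower().split())
    return len(ga & gb) / max(1, min(len(ga), len(gb)))
```

Here `translated_pages` stands for foreign pages already machine-translated into English, so both sides can be compared in the same vocabulary.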
27. Boilerplate: santander.co.uk
“Santander UK plc. Registered Office: 2 Triton Square, Regent’s Place,
London, NW1 3AN, United Kingdom. Registered Number 2294747.
Registered in England and Wales. www.santander.co.uk. Telephone 0800
389 7000. Calls may be recorded or monitored. Authorised by the
Prudential Regulation Authority and regulated by the Financial Conduct
Authority and the Prudential Regulation Authority. Our Financial
Services Register number is 106054. You can check this on the Financial
Services Register by visiting the FCA’s website www.fca.org.uk/register.
Santander and the flame logo are registered trademarks.”
=⇒ Match pages on boilerplate.
=⇒ Learn to translate boilerplate really well.
We use boilerpipe which tries to throw it out.
28. Templates: booking.com
“Solo travelers in particular like the location – they rated it 9.5 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 9,5
pour un séjour en solo.”
“Solo travelers in particular like the location – they rated it 8.9 for
a one-person stay.”
“Les voyageurs individuels apprécient particulièrement
l’emplacement de cet établissement. Ils lui donnent la note de 8,9
pour un séjour en solo.”
Corpus of repetitive sentences is less useful.
=⇒ Diversity cleaning.
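One simple way to approach diversity cleaning is to mask numbers and keep a single sentence pair per resulting template. This is only a sketch under that assumption; the slides do not describe the method ParaCrawl actually uses, and `template_key` and `diversity_filter` are hypothetical names.

```python
import re

# Matches integers and decimals written with either "." or "," (9.5, 9,5, 2019).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def template_key(source, target):
    """Lower-case both sides and replace every number with a placeholder."""
    return (NUMBER.sub("<num>", source.lower()), NUMBER.sub("<num>", target.lower()))


def diversity_filter(pairs):
    """Keep the first sentence pair seen for each masked template."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = template_key(source, target)
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept
```

Applied to the two booking.com pairs above, which differ only in the rating, this keeps one pair.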
29. Noise
Paid people to judge English–German sentences:
Okay 23%
Misaligned sentences 41%
Third language 3%
Both English 10%
Both German 10%
Untranslated sentences 4%
Short segments (≤2 tokens) 1%
Short segments (3–5 tokens) 5%
Non-linguistic characters 2%
[Koehn et al, 2018]
30. Cleaning
Supervised classifier trained on 50k good, 50k bad sentences
Handwritten patterns
Character-based language model
Test set attempts to have consistent cut-off across languages
31. Shared Task on Corpus Filtering
Common techniques from 2018 Conference on MT:
Aggressive language model filtering
Score from translation systems, both directions
Remove near-duplicates on source and target (not translated)
Partially implemented
32. Copyright
Remember: 510,482 websites.
Crawls follow robots.txt
Crawler leaves contact information.
A few sites have asked to be removed, and we have removed them.
Under GDPR, people have the right to correct information.
We hope they do!
33. Company that sells corpora spreads copyright fear:
The first word of
copyright is copy.
34. So I found them selling crawled corpora:
They took it down.
35. Summary
There’s training data for some languages.
Search engines have been mining the web for years.
Time for large open data.
36. Bergamot: Browser-based Machine Translation
browser.mt
This project has received funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement No 825303.
38. Motivation
Statoil (Norwegian state oil company) employment information and
contracts leaked on Translate.com –Norsk Rikskringkasting, 2017
Don’t trade your privacy for Google Translate.
40. Project Goals and Outline
Broad use as a Firefox extension + open platform
Fast on a desktop
Trustworthy
Support web forms
Domain adaptation
41. We’re Making a Public Product
=⇒ User Experience Work Package
45. Speed on Desktops
CPU version of Marian toolkit developed with Microsoft and Intel.
46. Speed Contest
[Figure: BLEU on newstest2014 (y-axis, 18.0 to 28.0) against million translated source tokens per USD (x-axis, 0 to 140) for 2018 and 2019 GPU and CPU systems. Labelled points: 2018 others GPU, 2018 others CPU, 2018 Marian GPU, 2018 Marian CPU, 2019 Marian CPU, 2019 Marian GPU.]
47. Some of the Optimizations
Tune model size,
1 Teacher-student
2 Greedy search
3 Simplify model structure
4 Integer arithmetic
48. Teacher-student
Option 1: Train a model directly.
Option 2: Teacher-student (Kim and Rush, 2016)
Teacher: Large high-quality translation model.
Teacher translates source-language sentences.
Student: model learns on output created by teacher.
Model GPU BLEU
1xTeacher, beam size 8 109.7 28.1
4xTeacher, beam size 8 410.8 29.0
1xStudent, beam size 4 52.0 28.4
1xStudent, beam size 1 19.9 28.2
Even models with same size improve slightly.
49. Greedy Search
Normally: keep competing translations and take the highest probability.
Beam size is the number of competing translations.
Model GPU BLEU
1xStudent, beam size 4 52.0 28.4
1xStudent, beam size 2 31.9 28.4
1xStudent, beam size 1 19.9 28.2
Computing probabilities is expensive because we need to normalize.
Greedy can just pick the highest number without normalizing.
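The point about normalization can be checked directly: softmax is monotonic, so the largest raw score is also the largest probability. A minimal sketch, with made-up scores:

```python
import math


def softmax(logits):
    """Normalized probabilities; needs an exp per entry plus a sum."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the highest raw score; no exp and no normalization."""
    return max(range(len(logits)), key=logits.__getitem__)


scores = [1.2, -0.3, 3.4, 0.0]  # hypothetical output-layer scores for one step
probabilities = softmax(scores)
best_by_probability = max(range(len(probabilities)), key=probabilities.__getitem__)
assert greedy_pick(scores) == best_by_probability
```

Beam search, by contrast, compares scores across hypotheses, so it needs the normalized probabilities.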
50. Simplify model structure
A transformer model generates sentences from left to right.
Each step consults all previous steps. → O(n²)
Zhang et al (2018): just average previous steps.
Update average on the fly → O(n).
Model GPU BLEU
Baseline transformer 12.8 27.6
Averaged transformer 7.2 27.6
Further work: simplified simple recurrent unit.
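The difference between consulting all previous steps and updating an average on the fly can be sketched with scalars. This covers only the cumulative average; the full model of Zhang et al (2018) has further components that are not shown here.

```python
def averages_quadratic(steps):
    """Recompute the average over all previous steps each time: O(n^2) overall."""
    return [sum(steps[: i + 1]) / (i + 1) for i in range(len(steps))]


def averages_running(steps):
    """Update the average on the fly: O(n) overall, constant state per step."""
    out, average = [], 0.0
    for count, value in enumerate(steps, start=1):
        average += (value - average) / count
        out.append(average)
    return out
```

Both functions return the same sequence; the running version only keeps one number of state between steps, which is what makes decoding cheaper.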
51. Integer Arithmetic
Why Integers
Benchmarks: Memory bandwidth is limiting factor
=⇒ Compress model.
More at once: P40 does 47 TOPS int8, 12 TOPS float.
Can do int8 with no quality loss [Quinn et al, 2018]
52. Fast 8-bit matrix multiplication
_mm512_maddubs_epi16 aka vpmaddubsw
The only 512-bit wide multiply of 8-bit integers on Intel.
Multiply signed by unsigned integers, then sum adjacent pairs into 16-bit.
Why signed * unsigned?!
New 8-bit VNNI instruction is also signed * unsigned.
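What the instruction computes can be emulated in a few lines of Python. This is a scalar sketch of the arithmetic only, not of its speed: the first operand is treated as unsigned bytes, the second as signed bytes, and each pair sum is saturated to 16 bits. The 512-bit form does this for 64 byte products at once, giving 32 results.

```python
def saturate_int16(x):
    """Clamp to the signed 16-bit range, as the instruction does."""
    return max(-32768, min(32767, x))


def maddubs(unsigned_bytes, signed_bytes):
    """Multiply unsigned by signed 8-bit values, then sum adjacent pairs."""
    assert len(unsigned_bytes) == len(signed_bytes) and len(unsigned_bytes) % 2 == 0
    assert all(0 <= u <= 255 for u in unsigned_bytes)
    assert all(-128 <= s <= 127 for s in signed_bytes)
    products = [u * s for u, s in zip(unsigned_bytes, signed_bytes)]
    return [saturate_int16(products[i] + products[i + 1]) for i in range(0, len(products), 2)]
```

Note that two large products can overflow 16 bits, in which case the result saturates rather than wraps.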
53. Working Around signed * unsigned
Skew
Add 128 to one of the arguments.
A ∗ B = A ∗ (128J + B) − A ∗ 128J
where 128J is a matrix full of 128.
Efficient if A is constant.
Normalize sign
Manually manipulate sign bits in the multiply.
=⇒ Extra instructions in hot loop.
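The skew identity can be checked with small integer matrices. In this sketch B is the operand that gets shifted into the unsigned range, which is an assumption; the slide only says "one of the arguments". The correction term A ∗ 128J depends only on A, so it can be precomputed when A is constant.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def skewed_matmul(a, b):
    """Compute A * B using only an unsigned version of B plus a correction."""
    rows, cols = len(b), len(b[0])
    shifted = [[128 + v for v in row] for row in b]          # values now in 0..255
    assert all(0 <= v <= 255 for row in shifted for v in row)
    correction = matmul(a, [[128] * cols for _ in range(rows)])  # A * 128J
    product = matmul(a, shifted)                              # A * (128J + B)
    return [[p - c for p, c in zip(p_row, c_row)] for p_row, c_row in zip(product, correction)]
```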
54. 4 bits?
Quantize log parameters (Miyashita et al, 2016).
Try quantizing a trained model.
3-bit 4-bit 5-bit 6-bit 7-bit 8-bit
0.72 28.92 35.08 35.60 35.69 35.67
5 bits is annoying to fit in registers
. . . so close to 4 bits!
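As a rough illustration of quantizing in log space, each value below is stored as a sign and a small integer exponent, relative to the largest magnitude. This is a simplified sketch under my own assumptions (one sign bit, power-of-two levels, zero mapped to the smallest level), not necessarily the exact scheme of Miyashita et al or the one used in the talk.

```python
import math


def log_quantize(values, bits=4):
    """Snap each value to sign * scale * 2**-k with k in 0 .. 2**(bits-1) - 1."""
    levels = 2 ** (bits - 1)              # one bit is spent on the sign
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return list(values)
    quantized = []
    for v in values:
        magnitude = abs(v)
        if magnitude == 0.0:
            exponent = levels - 1         # no exact zero: use the smallest level
        else:
            exponent = round(-math.log2(magnitude / scale))
            exponent = min(max(exponent, 0), levels - 1)
        quantized.append(math.copysign(scale * 2.0 ** -exponent, v))
    return quantized
```

With 4 bits this gives 8 magnitudes per sign, spaced by factors of two, so small parameters keep relative precision that a uniform grid would lose.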
55. Continued Training
First, train as normal with floats.
Then quantize parameters after every update.
Remember the rounding error so small changes can accumulate.
-0.19 BLEU with 4-bit quantization.
https://arxiv.org/abs/1909.06091 [Aji and Heafield, 2019]
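Why the rounding error has to be remembered can be shown with a single weight. This sketch assumes a uniform grid with step 0.25 purely to keep the arithmetic easy to follow; it is not the quantizer from the paper, and the function names are illustrative.

```python
def quantize(value, step=0.25):
    """Snap a value to the nearest multiple of step."""
    return round(value / step) * step


def train_with_error_feedback(weight, updates, step=0.25):
    """Quantize after every update, carrying the rounding error forward."""
    residual = 0.0
    for update in updates:
        full_precision = weight + residual + update
        weight = quantize(full_precision, step)
        residual = full_precision - weight
    return weight, residual


def train_without_feedback(weight, updates, step=0.25):
    """Quantize after every update and throw the rounding error away."""
    for update in updates:
        weight = quantize(weight + update, step)
    return weight


small_updates = [0.0625] * 8   # each update is smaller than half a grid step
print(train_without_feedback(0.0, small_updates))     # the weight never moves
print(train_with_error_feedback(0.0, small_updates))  # the updates accumulate
```

Without the residual, every small update is rounded away and the weight stays stuck; with it, the eight updates add up to the expected total.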
57. 144 Heads
Voita et al 2019: prune 50% after training.
Pruning before training doesn’t work.
58. 144 Heads
Voita et al 2019: prune 50% after training.
Pruning before training doesn’t work.
PhD student Maxi Behnke: prune during training?
59. Lottery ticket hypothesis
Some parameters are luckily initialized
Bigger models have more entries
Even if most can be discarded.
(Frankle and Carbin, 2018)
Remove entire unlucky heads?
62. Old Danish Ticket: Klippekort
No longer in use
Can apply for a refund
. . . via a form
Public domain image from Wikipedia.
63. Danish Ticket Refund Form
Expects answers in Danish
64. Danish Ticket Refund Form
Expects answers in Danish
So I traded mine for a beer with Dirk Hovy at EMNLP 2017
65. What if you don’t have Dirk Hovy?
Answer a Danish web form in Danish:
Be confident my answers are correct.
. . . Even though I don’t speak Danish.
=⇒ Browser will prompt to rephrase when uncertain.
66. What if you don’t have Dirk Hovy?
Answer a Danish web form in Danish:
Be confident my answers are correct.
. . . Even though I don’t speak Danish.
=⇒ Browser will prompt to rephrase when uncertain.
. . . And use all rephrasings to translate better.
67. We’re in the Browser
The browser knows your history (if you let it).
It knows what site you are on.
Adapt translations to the user and page.
68. We’re in the Browser
The browser knows your history (if you let it).
It knows what site you are on.
Adapt translations to the user and page.
Much less creepy when all processing is local.
69. Bergamot Summary
Privacy-preserving translation via local processing.
Coming as a Firefox extension.
Anybody want to help with Ukrainian?