Moses is at the core of most current developments in the machine translation market. Despite a thriving community, its open-source nature means that users need to find their own ways to adapt it and improve its features. Pangea v3, part of the EU's EXPERT project, provides hybrid features based on search-engine-style recall to improve translation output.
Pangea v3 - when engine search meets machine translation, Manuel Herranz, Pangeanic
4. ElasticTM
TM database built from TMX files
Based on a state-of-the-art full-text search engine
Extremely fast indexing, search and retrieval
Supports advanced text retrieval techniques (fuzzy match, regular expressions)
Easily scalable
Role-based security
5. ElasticTM - Search Engine
Considered Lucene-based search engines: Solr and ElasticSearch
Mature open source projects
Have similar capabilities & performance
ElasticSearch was picked mainly because of:
Out-of-the-box scalability
Powerful Query DSL (query language)
Role-based security (via plugin)
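As an illustration of the Query DSL mentioned above, the sketch below builds two request bodies as plain Python dictionaries: a `match` query with `fuzziness` and a `regexp` query. The field name `segment` and the helper names are hypothetical; the deck does not show ElasticTM's actual index layout or queries.

```python
import json


def fuzzy_segment_query(text, field="segment", size=5):
    """Build a match query that tolerates small edit-distance differences.

    `field` is a hypothetical field name, not ElasticTM's real mapping.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {"query": text, "fuzziness": "AUTO"}
            }
        },
    }


def regexp_segment_query(pattern, field="segment"):
    """Build a regexp query; it is evaluated against indexed terms."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


if __name__ == "__main__":
    # The resulting JSON could be sent as the body of a search request.
    print(json.dumps(fuzzy_segment_query("I have 2 cats"), indent=2))
    print(json.dumps(regexp_segment_query("cat.*"), indent=2))
```

Both bodies are ordinary JSON, so they can be inspected or logged before being sent to a search endpoint.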
6. ElasticTM - Design
[Diagram: ElasticTM comprising a Search Engine with indices EN, ES, FR, ..., NL and a Map DB with mappings EN <-> ES, FR <-> ES, FR <-> NL, ...]
7. ElasticTM - Design (cont’d)
• Monolingual indices
○ Memory-efficient
○ Implicit transitive language pairs
• Bilingual mappings
○ Fast bidirectional id <-> id mapping
• Role-based security system
○ Admin, project admin, user etc.
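One way to picture monolingual indices combined with bilingual id mappings is the toy in-memory model below. It is only a sketch under assumptions: the class name `TinyTM`, the dictionaries standing in for search indices, and the breadth-first lookup used to resolve implicit transitive pairs are all illustrative, since the deck does not describe how ElasticTM resolves them.

```python
from collections import defaultdict, deque


class TinyTM:
    """Toy model: one segment store per language plus id <-> id links."""

    def __init__(self):
        self.segments = defaultdict(dict)  # lang -> {segment id: text}
        self.links = defaultdict(set)      # (lang, id) -> linked (lang, id) nodes

    def add_segment(self, lang, seg_id, text):
        self.segments[lang][seg_id] = text

    def link(self, a, b):
        """Store a bilingual mapping in both directions."""
        self.links[a].add(b)
        self.links[b].add(a)

    def translate(self, lang, seg_id, target_lang):
        """Follow mappings breadth-first until a target-language segment is found."""
        start = (lang, seg_id)
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node != start and node[0] == target_lang:
                return self.segments[target_lang].get(node[1])
            for neighbour in self.links.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return None


if __name__ == "__main__":
    tm = TinyTM()
    tm.add_segment("en", 1, "I have a cat")
    tm.add_segment("es", 10, "Yo tengo un gato")
    tm.add_segment("fr", 20, "J'ai un chat")
    tm.link(("en", 1), ("es", 10))   # EN <-> ES mapping
    tm.link(("es", 10), ("fr", 20))  # FR <-> ES mapping
    print(tm.translate("en", 1, "fr"))  # EN -> FR was never stored explicitly
```

Because each language's text is stored once and pairs are only id links, adding a new language pair does not duplicate segments.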
8. ElasticTM - Map
• Mapping source language segments to a target language
• Bidirectional map (id to id)
• Supports quick bulk incremental updates
Alternatives?
• NoSQL key-value databases ✗
○ MongoDB, CouchDB, Redis, ElasticSearch, many others …
• SQL databases ✗
○ MySQL, PostgreSQL
Concerns: lack of upsert support for bulk updates, handling duplicate entries, scalability
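The requirements listed for the map (bidirectional id-to-id lookup, bulk incremental updates, duplicate handling) can be sketched with two dictionaries. This is a minimal in-memory illustration, not the store ElasticTM uses; the class and method names are hypothetical, and a strict one-to-one mapping is assumed for simplicity.

```python
class BidirectionalMap:
    """One-to-one id <-> id map with upsert semantics (illustrative only)."""

    def __init__(self):
        self._forward = {}   # source id -> target id
        self._backward = {}  # target id -> source id

    def upsert(self, src, tgt):
        """Insert a pair, replacing any stale pair that shares src or tgt."""
        old_tgt = self._forward.pop(src, None)
        if old_tgt is not None:
            self._backward.pop(old_tgt, None)
        old_src = self._backward.pop(tgt, None)
        if old_src is not None:
            self._forward.pop(old_src, None)
        self._forward[src] = tgt
        self._backward[tgt] = src

    def bulk_upsert(self, pairs):
        """Apply many pairs; duplicates overwrite instead of accumulating."""
        for src, tgt in pairs:
            self.upsert(src, tgt)

    def target_of(self, src):
        return self._forward.get(src)

    def source_of(self, tgt):
        return self._backward.get(tgt)

    def __len__(self):
        return len(self._forward)


if __name__ == "__main__":
    mapping = BidirectionalMap()
    mapping.bulk_upsert([
        ("en-1", "es-1"),
        ("en-2", "es-2"),
        ("en-1", "es-1"),  # exact duplicate: no extra entry
        ("en-2", "es-3"),  # update: replaces en-2 <-> es-2
    ])
    print(len(mapping), mapping.target_of("en-2"), mapping.source_of("es-2"))
```

Keeping both directions in sync inside `upsert` is what makes repeated bulk imports of the same TMX data safe to re-run.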
9. ElasticTM - Map - Benchmarks
[Charts: Time (s) and Memory (MB); the lower, the better]
12. Translation Memory (TM)
Pre-translations stored in a database and offered as suggestions
A matching algorithm proposes relevant translations
exact match and fuzzy match
segment similarity based on characters or tokens
NLP improves the matching algorithm
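To make the difference between character-based and token-based similarity concrete, here is a small sketch using Python's standard `difflib.SequenceMatcher`. It is a stand-in metric chosen for illustration: the deck does not specify the scoring formula, so these values are not expected to reproduce the percentages shown on the slides. The function names and the 0.7 threshold are assumptions.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Similarity in [0, 1] computed over characters."""
    return SequenceMatcher(None, a, b).ratio()


def token_similarity(a, b):
    """Similarity in [0, 1] computed over whitespace-separated tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(query, tm, scorer=token_similarity, threshold=0.7):
    """Return (score, source, target) for the best TM entry, or None.

    `tm` is a list of (source, target) pairs; `threshold` is an assumed cut-off.
    """
    scored = [(scorer(query, src), src, tgt) for src, tgt in tm]
    if not scored:
        return None
    best = max(scored, key=lambda item: item[0])
    return best if best[0] >= threshold else None


if __name__ == "__main__":
    memory = [
        ("I have 3 cats", "Yo tengo 3 gatos"),
        ("I am very happy", "Estoy muy feliz"),
    ]
    print(char_similarity("I have 2 cats", "I have 3 cats"))
    print(token_similarity("I have 2 cats", "I have 3 cats"))
    print(best_match("I have 2 cats", memory))
```

Token-based scoring penalises a single changed word more heavily than character-based scoring, which is one reason the choice of unit matters for fuzzy thresholds.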
13. Approach
• Statistical Machine Translation (SMT)
• Computer-Aided Translation (CAT) environment
Run maintenance
• Search and replace
• Update TM entries
• Import & export entries
[Diagram: Translation Memory, TM processing, ElasticTM]
Full-text search engine + NLP techniques
14. Basic examples of TM Matching & processing
Input segment
{
"input_source": "I have 2 cats",
"output_target": " "
}
Original TM: fuzzy match
{
"source_TM" : "I have 3 cats",
"target_TM" : "Yo tengo 3 gatos",
"score" : "80%"
}
TM after preprocessing: perfect match by substitution
{
"source_TM" : "I have <number> cats",
"target_TM" : "Yo tengo <number> gatos",
"score" : "100%"
}
• URLs
• Emails
• Dates
• Units
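The number example on this slide can be sketched as a two-step process: generalize the segment by replacing numbers with a `<number>` placeholder, look the template up in the TM, then put the original values back into the target. This is a minimal sketch with hypothetical function names; it assumes numbers appear in the same order in source and target, which will not hold for every language pair.

```python
import re

# Integers and simple decimals such as 3, 3.5 or 1,000
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER = "<number>"


def generalize(text):
    """Replace each number with a placeholder; return (template, values)."""
    values = NUMBER.findall(text)
    return NUMBER.sub(PLACEHOLDER, text), values


def fill(template, values):
    """Put the extracted values back into a template, left to right."""
    remaining = iter(values)
    return re.sub(re.escape(PLACEHOLDER), lambda _: next(remaining), template)


def translate_by_substitution(source, tm):
    """Exact lookup on the generalized template; None if no usable entry."""
    template, values = generalize(source)
    target_template = tm.get(template)
    if target_template is None:
        return None
    if target_template.count(PLACEHOLDER) != len(values):
        return None  # placeholder counts disagree; do not guess
    return fill(target_template, values)


if __name__ == "__main__":
    preprocessed_tm = {"I have <number> cats": "Yo tengo <number> gatos"}
    print(generalize("I have 3 cats"))
    print(translate_by_substitution("I have 2 cats", preprocessed_tm))
```

The same pattern extends to the other placeholder types on the slide (URLs, emails, dates, units) by adding one pattern per type.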
15. Basic examples of TM Matching & processing
Input segment
{
"input_source": "I have a cat",
"output_target": " "
}
Original TM: fuzzy match
{
"source_TM" : "I have a cat and I am very happy",
"target_TM" : "Yo tengo un gato y estoy muy feliz",
"score" : "44%"
}
TM after preprocessing: perfect match by substitution
{
"source_TM" : "I have a cat",
"target_TM" : "Yo tengo un gato",
"score" : "100%"
}
{
"source_TM" : "I am very happy",
"target_TM" : "Estoy muy feliz"
}
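The split shown on this slide can be approximated by cutting both sides of a TM entry at coordinating conjunctions and pairing the pieces by position. The sketch below is a deliberately naive illustration: the conjunction lists, the positional alignment, and the function names are assumptions, and a real system would need proper clause alignment.

```python
import re

# Small, hypothetical conjunction lists for the two languages in the example
CONJUNCTIONS = {
    "en": ("and", "but", "or"),
    "es": ("y", "pero", "o"),
}


def split_clauses(text, lang):
    """Split a segment at conjunctions surrounded by whitespace."""
    words = "|".join(CONJUNCTIONS[lang])
    parts = re.split(rf"\s+(?:{words})\s+", text)
    # Capitalize the first letter so each piece reads as its own segment
    return [p[:1].upper() + p[1:] for p in (part.strip() for part in parts) if p]


def split_pair(source, target, src_lang="en", tgt_lang="es"):
    """Split a TM entry into sub-entries, aligned by clause position.

    Falls back to the original pair when the clause counts differ.
    """
    src_parts = split_clauses(source, src_lang)
    tgt_parts = split_clauses(target, tgt_lang)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]
    return list(zip(src_parts, tgt_parts))


if __name__ == "__main__":
    pairs = split_pair(
        "I have a cat and I am very happy",
        "Yo tengo un gato y estoy muy feliz",
    )
    for src, tgt in pairs:
        print(src, "->", tgt)
```

After such a split, the input "I have a cat" finds an exact sub-segment instead of a low fuzzy score against the full sentence.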
16. Improving TM Matching
Several languages → maximise the reuse of existing human translations
Linguistic features → improved fuzzy matching
• string transformation
• segmentation rules
• POS tagger
• tokenizer
[Diagram: two columns of languages, each listing EN, ES, PT, JA, ..., FR]
17. Improving TM Matching
Linguistic approach to improve match
• Segment the text by sentence
○ Using delimiters like . ? ! , - :
○ Limit the total number of words
• Intra-sentence segmentation
○ Using conjunctions, adverbs, determiners, pronouns
○ Other approaches
• Replace segments
○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables.
• POS pattern string
• Named entity recognition
[Diagram: ElasticTM, TMX files, source text]
(Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)
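The first bullet (delimiter-based segmentation with a word limit) could look roughly like the sketch below. How the word limit is applied is not stated in the deck; here it is assumed that a sentence is first cut at strong delimiters (. ? ! :) and only cut further at weak ones (comma, spaced hyphen) when it exceeds `max_words`. The names and the default limit are hypothetical, and abbreviations such as "e.g." are not handled.

```python
import re

# Strong delimiters end a sentence; weak ones are used only for long sentences
STRONG = re.compile(r"(?<=[.?!:])\s+")
WEAK = re.compile(r"(?<=,)\s+|\s+-\s+")


def segment(text, max_words=12):
    """Split text into segments, subdividing sentences longer than max_words."""
    segments = []
    for sentence in STRONG.split(text.strip()):
        if not sentence:
            continue
        if len(sentence.split()) <= max_words:
            segments.append(sentence)
        else:
            segments.extend(part for part in WEAK.split(sentence) if part)
    return segments


if __name__ == "__main__":
    sample = (
        "I have a cat. Do you have one? "
        "Yes, I have two cats, three dogs and a very noisy parrot - "
        "all of them live here."
    )
    for piece in segment(sample, max_words=8):
        print(piece)
```

Shorter segments raise the chance of exact TM hits, at the cost of losing some context for the translator.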
18. Challenges
• Morphologically rich and non-Indo-European languages
• Go beyond statistics (ongoing work, part of EXPERT project)
Hybrid approaches improve certain language pairs: Japanese (R&D with Japanese partners), morphologically rich languages, Semitic languages.
• Continue building revenue streams on MT
MT allows Pangeanic to build other technologies (web, search, etc.), enhance and improve the solutions for its client portfolio, and offer new services.