This document provides an overview of the Tesseract OCR engine, including a brief history of OCR technology, details about Tesseract's development and architecture, and announcements about version 2.00. It describes how Tesseract uses techniques like text line finding, baseline approximation, and static/adaptive classifiers to recognize words despite challenges with spacing, italics, and other font variations. The document also notes areas where commercial OCR systems may have advantages over Tesseract, such as page layout analysis, language support, and accuracy.
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and how to orchestrate between them? While DevOps and GitOps have made huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
"Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems. In a typical machine learning application, practitioners must apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Automating the end-to-end process of applying machine learning offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform models that were designed by hand."
In this talk we will discuss how QuSandbox and the Model Analytics Studio can be used in the selection of machine learning models. We will also illustrate AutoML frameworks through demos and examples and show you how to get started
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train and serve ML models, and how to orchestrate between them? While DevOps and GitOps have made huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction for MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
"Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems. In a typical machine learning application, practitioners must apply the appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable for machine learning. Following those preprocessing steps, practitioners must then perform algorithm selection and hyperparameter optimization to maximize the predictive performance of their final machine learning model. As many of these steps are often beyond the abilities of non-experts, AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Automating the end-to-end process of applying machine learning offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform models that were designed by hand."
In this talk we will discuss how QuSandbox and the Model Analytics Studio can be used in the selection of machine learning models. We will also illustrate AutoML frameworks through demos and examples and show you how to get started
This talk is about how we applied deep learning techinques to achieve state-of-the-art results in various NLP tasks like sentiment analysis and aspect identification, and how we deployed these models at Flipkart
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...Databricks
If you enjoy Linear Algebra, Spark, and exceptionally bad puns, then this could be the talk for you! In this session, we will chronicle the adventures of developing a large-scale Spark system in Scala at Target to power a text-based similarity engine by using core Linear Algebra concepts. You will not hear about a shiny system and how awesome it is, but instead you will learn about everything that went wrong and all of the lessons that were learned along the way. We will cover concepts like Cosine Similarity, Spark's Distributed Matrix APIs, the Breeze numerical processing library under the hood that powers these APIs, among other things. We will embark on this system development journey together to understand what it took from beginning to end to pull a performant and scalable similarity engine together. Linear Algebra is often the backbone of many prominent machine learning algorithms, and the goal is that from this session, you will gain a deeper understanding into what gotchas exist and what is needed to design, tune, and scale these types of systems.
Speaker: Anna Holschuh
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...HostedbyConfluent
In any enterprise or cloud application, Task scheduling is a key requirement. A highly available and fault-tolerant task scheduling will help us to improve our business goals.
A classic task scheduling infrastructure is typically backed by databases. The instances/service that performs the scheduling, loads the task definitions from the database into memory and performs the task scheduling.
This kind of infrastructure creates issues like stateful services, inability to scale the services horizontally, being prone to frequent failures, etc., If the state of these kinds of services is not maintained well, it may lead to inconsistent and integrity issues.
To mitigate these issues, we will explore a high available and fault-tolerant task scheduling infrastructure using Kafka, Kafka Streams, and State Store.
Introduction to Natural Language ProcessingPranav Gupta
the presentation gives a gist about the major tasks and challenges involved in natural language processing. In the second part, it talks about one technique each for Part Of Speech Tagging and Automatic Text Summarization
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short Term Memory (LSTM), Convolution Neural Networks (CNN), Hidden Markov Models (HMM). It deals with several application areas such as Natural Language Processing, Image Understanding etc. This presentation provides the landscape.
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.
This talk is an introduction to the vector search engine Weaviate. You will learn how storing data using vectors enables semantic search and automatic data classification. Topics like the underlying vector storage mechanism and how the pre-trained language vectorization model enables this are touched. In addition, this presentation consists of live demos to show the power of Weaviate and how you can get started with your own datasets. No prior technical knowledge is required; all concepts are illustrated with real use case examples and live demos. Most of all data is unstructured. Additionally, data is often stored without context, meaning and relation to concepts in the real world. This means that all this data is difficult to index, classify and search through. While this is traditionally solved by manual effort or expensive machine learning models, Weaviate takes another approach to this problem. Weaviate is a vector search engine, which stores data as vectors and automatically adds context and meaning to new data. This enables to search through the data without using exact matching keywords. Moreover, data can be automatically classified. Weaviate is completely open source, has a built-in machine learning model, has a graph-like data model, completely API-based and is cloud-native. Weaviate uses a GraphQL API next to RESTful endpoints to interact with the data in an intuitive manner. Additionally, Python, Go and JavaScript clients are available to facilitate interaction between Weaviate and your applications. GraphQL and client examples will be shown in the presentation.
This talk is about how we applied deep learning techinques to achieve state-of-the-art results in various NLP tasks like sentiment analysis and aspect identification, and how we deployed these models at Flipkart
Lessons in Linear Algebra at Scale with Apache Spark : Let's Make the Sparse ...Databricks
If you enjoy Linear Algebra, Spark, and exceptionally bad puns, then this could be the talk for you! In this session, we will chronicle the adventures of developing a large-scale Spark system in Scala at Target to power a text-based similarity engine by using core Linear Algebra concepts. You will not hear about a shiny system and how awesome it is, but instead you will learn about everything that went wrong and all of the lessons that were learned along the way. We will cover concepts like Cosine Similarity, Spark's Distributed Matrix APIs, the Breeze numerical processing library under the hood that powers these APIs, among other things. We will embark on this system development journey together to understand what it took from beginning to end to pull a performant and scalable similarity engine together. Linear Algebra is often the backbone of many prominent machine learning algorithms, and the goal is that from this session, you will gain a deeper understanding into what gotchas exist and what is needed to design, tune, and scale these types of systems.
Speaker: Anna Holschuh
High Available Task Scheduling Design using Kafka and Kafka Streams | Naveen ...HostedbyConfluent
In any enterprise or cloud application, Task scheduling is a key requirement. A highly available and fault-tolerant task scheduling will help us to improve our business goals.
A classic task scheduling infrastructure is typically backed by databases. The instances/service that performs the scheduling, loads the task definitions from the database into memory and performs the task scheduling.
This kind of infrastructure creates issues like stateful services, inability to scale the services horizontally, being prone to frequent failures, etc., If the state of these kinds of services is not maintained well, it may lead to inconsistent and integrity issues.
To mitigate these issues, we will explore a high available and fault-tolerant task scheduling infrastructure using Kafka, Kafka Streams, and State Store.
Introduction to Natural Language ProcessingPranav Gupta
the presentation gives a gist about the major tasks and challenges involved in natural language processing. In the second part, it talks about one technique each for Part Of Speech Tagging and Automatic Text Summarization
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short Term Memory (LSTM), Convolution Neural Networks (CNN), Hidden Markov Models (HMM). It deals with several application areas such as Natural Language Processing, Image Understanding etc. This presentation provides the landscape.
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.
This talk is an introduction to the vector search engine Weaviate. You will learn how storing data using vectors enables semantic search and automatic data classification. Topics like the underlying vector storage mechanism and how the pre-trained language vectorization model enables this are touched. In addition, this presentation consists of live demos to show the power of Weaviate and how you can get started with your own datasets. No prior technical knowledge is required; all concepts are illustrated with real use case examples and live demos. Most of all data is unstructured. Additionally, data is often stored without context, meaning and relation to concepts in the real world. This means that all this data is difficult to index, classify and search through. While this is traditionally solved by manual effort or expensive machine learning models, Weaviate takes another approach to this problem. Weaviate is a vector search engine, which stores data as vectors and automatically adds context and meaning to new data. This enables to search through the data without using exact matching keywords. Moreover, data can be automatically classified. Weaviate is completely open source, has a built-in machine learning model, has a graph-like data model, completely API-based and is cloud-native. Weaviate uses a GraphQL API next to RESTful endpoints to interact with the data in an intuitive manner. Additionally, Python, Go and JavaScript clients are available to facilitate interaction between Weaviate and your applications. GraphQL and client examples will be shown in the presentation.
The lecture presents the open source project Tesseract - a free OCR engine written in C++. The lecture presents the strong and weak sides of tesseract and explains how to train it in a new language. The lecture demonstration materials are available at the authors's blog: http://www.nakov.com/blog
Python 101: Python for Absolute Beginners (PyTexas 2014)Paige Bailey
If you're absolutely new to Python, and to programming in general, this is the place to start!
Here's the breakdown: by the end of this workshop, you'll have Python downloaded onto your personal machine; have a general idea of what Python can help you do; be pointed in the direction of some excellent practice materials; and have a basic understanding of the syntax of the language.
Please don't forget to bring your laptop!
Audience: "Python 101" is geared toward individuals who are new to programming. If you've had some programming experience (shell scripting, MATLAB, Ruby, etc.), then you'll probably want to check out the more intermediate workshop, "Python 101++".
I have presented the power point presentation on Basics of the optical character recognition. Here i have focused to discuss about hoe OCR is used in scanning process and can it be used for document scanning and its uses.
General presentation on viscous-flow CFD code ReFRESCO. Theoretical, numerical, IT, developers and application range aspects are presented. Current status of the code @ January 2015.
In this talk we present our transition from an old legacy system built on top of MySQL and a bunch of PHP scripts to a system based on Hadoop and Hive, which allows us to process and keep stats of hundreds of thousands of ads several times a day and index them to be served to the end users. We also present our current undertaking to deliver the ads from our sources (property portals, job boards…) to our visitors in near real-time using Storm, HBase and Zookeeper.
In this talk I will introduce attendees to the basics of messaging queues, their goals and applications from CFML. Messaging enables software applications to connect and scale. Thus, providing applications to connect to each other as components of a larger application, or to user devices and data. Messaging is asynchronous, and can decouple your software concerns with ease. However, messaging is much more than the traditional publish/subscribe patterns but also the ability to create work queues, routing and much more.
Building a Neural Machine Translation System From ScratchNatasha Latysheva
Human languages are complex, diverse and riddled with exceptions – translating between different languages is therefore a highly challenging technical problem. Deep learning approaches have proved powerful in modelling the intricacies of language, and have surpassed all statistics-based methods for automated translation. This session begins with an introduction to the problem of machine translation and discusses the two dominant neural architectures for solving it – recurrent neural networks and transformers. A practical overview of the workflow involved in training, optimising and adapting a competitive neural machine translation system is provided. Attendees will gain an understanding of the internal workings and capabilities of state-of-the-art systems for automatic translation, as well as an appreciation of the key challenges and open problems in the field.
Deep learning is having a profound impact on AI applications. With the future of neural network-inspired computing in mind, re:Invent is hosting the first ever Deep Learning Summit. Designed for developers to learn about the latest in deep learning research and emerging trends, attendees will hear from industry thought leaders—members of the academic and venture capital communities—who will share their perspectives in 30-minute Lightning Talks.
The Summit will be held on Thursday, November 30th at the Venetian from 1-5pm.
The Deep Learning Revolution - Terrence Sejnowski, The Salk Institute for Biological Studies
Eye, Robot: Computer Vision and Autonomous Robotics - Aaron Ames & Pietro Perona, California Institute of Technology
Exploiting the Power of Language - Alexander Smola, Amazon Web Services
Reducing Supervision: Making More with Less - Martial Herbert, Carnegie Mellon University
Learning Where to Look in Video - Kristen Grauman, University of Texas
Look, Listen, Learn: The Intersection of Vision and Sound - Antonio Torralba, MIT
Investing in the Deep Learning Future - Matt Ocko, Data Collective Venture Capital
Introduction to libre « fulltext » technologyRobert Viseur
The presentation will be based on my personal experience on SQLite, MySQL and Zend Search ; on workshops I’ve attended (PostgreSQL) and on tests conducted under my supervision (PostgreSQL, MySQL, Sphinx, Lucene, Xapian). It will cover an exhaustive overview of existing techniques, from the most basic to the more advanced, and will lead to a comparative table of the existing technology.
Getting Deep on Orchestration - Nickoloff - DockerCon16allingeek
Orchestration platforms let us work with higher level ideas like services and jobs; but there is more to a platform than scheduling and service discovery. A platform is a collection of actors and APIs that work together and provide those higher level abstractions on a distributed system. In this session we'll go deep on the architecture of open source orchestration platforms, consider scaling pains, reveal extension points, and reflect on an orchestration platform at Amazon. We'll finish with a demo of a homemade abstraction for failure injection by policy.
Modern software architectures - PHP UK Conference 2015Ricard Clau
The web has changed. Users demand responsive, real-time interactive applications and companies need to store and analyze tons of data. Some years ago, monolithic code bases with a basic LAMP stack, some caching and perhaps a search engine were enough. These days everybody is talking about micro-services architectures, SOA, Erlang, Golang, message passing, queue systems and many more. PHP seems to not be cool anymore but... is this true? Should we all forget everything we know and just learn these new technologies? Do we really need all these things?
VoltDB and Erlang: two very promising beasts, made for the new parallel world, but still lingering in the wings. Not only are they addressing todays challenges but they are using parallel architectures as corner stone of their new and surprising approach to be faster and more productive. What are they good for? Why are we working to team them up?
Erlang promises faster implementation, way better maintenance and 4 times shorter code. VoltDB claims to be two orders of magnitude faster than its competitors. The two share many similarities: both are the result of scientific research and designed from scratch to address the new reality of parallel architectures with full force.
This talk presents the case for Erlang as server language, where it shines, how it looks, and how to get started. It details Erlang's secret sauce: microprocesses, actors, atoms, immutable variables, message passing and pattern matching. (Note: for a longer version of this treatment of Erlang only see: Why Erlang? http://www.slideshare.net/eonblast/why-erlang-gdc-online-2012)
VoltDB's inner workings are explained to understand why it can be so incredibly fast and still better than its NoSQL competitors. The well publicized Node.js benchmark clocking in at 695,000 transactions per second is described and the simple steps to get VoltDB up and running to see the prodigy from up close.
Source examples are presented that show Erlang and VoltDB in action.
The speaker is creator and maintainer of the Erlang VoltDB driver Erlvolt.
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
This is a presentation by Dada Robert in a Your Skill Boost masterclass organised by the Excellence Foundation for South Sudan (EFSS) on Saturday, the 25th and Sunday, the 26th of May 2024.
He discussed the concept of quality improvement, emphasizing its applicability to various aspects of life, including personal, project, and program improvements. He defined quality as doing the right thing at the right time in the right way to achieve the best possible results and discussed the concept of the "gap" between what we know and what we do, and how this gap represents the areas we need to improve. He explained the scientific approach to quality improvement, which involves systematic performance analysis, testing and learning, and implementing change ideas. He also highlighted the importance of client focus and a team approach to quality improvement.
How to Create Map Views in the Odoo 17 ERPCeline George
The map views are useful for providing a geographical representation of data. They allow users to visualize and analyze the data in a more intuitive manner.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
2. Contents
• Introduction & history of OCR
• Tesseract architecture & methods
• Announcing Tesseract 2.00
• Training Tesseract
• Future enhancements
3. A Brief History of OCR
• What is Optical Character Recognition?
• A Brief History of OCR
• OCR predates electronic computers!
4.
5. A Brief History of OCR
• 1929 – Digit recognition machine
• 1953 – Alphanumeric recognition machine
• 1965 – US Mail sorting
• 1965 – British banking system
• 1976 – Kurzweil reading machine
• 1985 – Hardware-assisted PC software
• 1988 – Software-only PC software
• 1994-2000 – Industry consolidation
6. Tesseract Background
• Developed on HP-UX at HP between 1985
• and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS
• in the 1995 UNLV test.
• (See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on:
• http://code.google.com/p/tesseract-ocr
• Highly portable.
7. Tesseract OCR Architecture
• Baselines are rarely perfectly straight
• Text Line Finding – skew independent –
• published at ICDAR’95 Montreal.
• (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines
• to account for skew and curl.
• Meanline, ascender and descender lines are a
• constant displacement from baseline.
• Critical value is the x-height.
8. Spaces between words are tricky too
• Italics, digits, punctuation all create special-case font-dependent
spacing.
• • Fully justified text in narrow columns can have vastly varying
spacing on different lines.
• Tesseract: Recognize Word
• Outline Approximation
• Polygonal approximation is a double-edged sword.
• Noise and some pertinent information are both lost.
9. Tesseract: Features and Matching
• Static classifier uses outline fragments as features. Broken characters
are easily recognizable by a small->large matching process in classifier.
(This is slow.)
• Adaptive classifier uses the same technique!
• (Apart from normalization method)
10. Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based
• languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at
• http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
11. Commercial OCR v Tesseract
• Page layout analysis.
• More languages.
• Improve accuracy.
• Add a UI.