The document provides an introduction to full-text search, including how it works through indexing documents, analyzing text, and building an inverted index. It discusses how documents are divided into words and how each word is stored together with the documents it appears in. It also covers how text analysis, through tokenization and filters, transforms raw text into indexable terms. The document uses examples to demonstrate how a basic full-text search index is constructed and how it can be queried.
2. About me: Full-time (mostly) Java developer. Part-time general technical/sysadmin/geeky guy. Interested in: hard problems, search, performance, parallelism, scalability
15. Deathy’s Tip Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)
16. First we need some documents, more specifically some text samples
17. Documents: Doc1: "The cow says moo"; Doc2: "The dog says woof"; Doc3: "The cow-dog says moof". "Stolen" from http://www.slideshare.net/tomdyson/being-google
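As a rough illustration of what an inverted index over the three sample documents could look like, here is a minimal Python sketch. It is not taken from the slides: the function names `build_inverted_index` and `search_all` are made up for this example, and the tokenization is an assumption (lowercase, split on whitespace only), so "cow-dog" stays a single term. A real analyzer may well split it differently.

```python
from collections import defaultdict

# The three sample documents from the slide.
docs = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it.

    Assumption: terms are lowercased, whitespace-separated words.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, *terms):
    """Return the documents that contain every query term (an AND query)."""
    postings = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(docs)
print(index["says"])                    # documents containing "says"
print(search_all(index, "the", "moo"))  # documents containing both terms
```

Looking up a term is a single dictionary access, and a multi-term query is an intersection of posting lists, which is why this structure is quick to search.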
44. Some more interesting documents Doc1: "The quick brown fox jumps over the lazy dog" Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!" Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
57. Lots of things you can do with filters: case normalization; removing unwanted/unneeded characters; transliteration/normalization of special characters; stopwords; synonyms
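The filter ideas listed above can be sketched as a small analysis chain: a tokenizer followed by filters applied in order. This is only an illustration in Python, not how Lucene or any other engine implements it. The stopword list, the synonym table and all function names are hypothetical and chosen just for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lists used only for illustration.
STOPWORDS = {"the", "and", "is", "no", "to"}
SYNONYMS = {"exterminate": ["exterminate", "destroy"]}


def tokenize(text):
    """Split on runs of non-word characters, which also drops punctuation."""
    return [token for token in re.split(r"\W+", text) if token]


def lowercase(tokens):
    """Case normalization."""
    return [token.lower() for token in tokens]


def fold_accents(tokens):
    """Transliteration: decompose characters and drop combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def remove_stopwords(tokens):
    """Drop very common words that carry little meaning for search."""
    return [token for token in tokens if token not in STOPWORDS]


def expand_synonyms(tokens):
    """Emit each token's synonyms (if any) in place of the token."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in (lowercase, fold_accents, remove_stopwords, expand_synonyms):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The quick brown fox jumps over the lazy dog"))
print(analyze("EXTERMINATE!!!"))
```

The order of the filters matters: here the stopword and synonym lookups assume lowercased, accent-folded input, so those filters run last.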
I won't delve into specifics or actual implementations. I'll try to present the main concepts, which come from Information Retrieval theory, and the essential components you should be aware of when dealing with any full-text search system. If there is interest, there could be a future presentation on actual implementations (Lucene in my case).
Java Web Developer-ish. For the last 4 years I have worked mostly on electronic publishing applications: processing/searching/displaying various content sets of various sizes. Passion for big data and lots of it. (Last weekend I was parallelizing indexing on an 800K document set so it uses as many cores as possible. On Friday I was indexing a data set of 5.8M documents...)
about fulltext search, or search in general
take your pick: lots of pictures, lots of friends, lots of blog posts
actually, scratch that..
much better..
fulltext search is usually VERY fast, and by adding your own custom search you can make it faster where your specific application needs it most.
Depending on your content and users you can have very specific relevance criteria. You can surprise your users with the quality of results.
various needs for various content
- bitch about imobiliare.ro not having search in text or very dynamic filters. Example: cannot search for apartments to rent with internet access...
- bitch about geekmeet.ro WordPress search not being able to filter based on category (Timisoara in this case)
"index" = where you add items which you want to find and where you search for them.
"document" = the basic unit of indexing/searching. Usually one row from the search results list. Could be a book, a chapter, a page, a URL, etc.
Observe the sorting. More on this later...
not quite boolean, but simple enough to understand..
actual implementations vary, and it usually shouldn't matter. Just remember that there are fields and documents, and that each term is indexed for a specific field.
I'm going Lucene here, but any good index/search API will let you customize this process. As many have found, this is a good way to structure the process.
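One way to picture "each term is indexed for a specific field" is an index keyed by (field, term) pairs. The Python sketch below is only an assumption about how such a structure could look, not a description of any particular engine. The two sample documents, the field names `title` and `body`, and the helpers `index_documents` and `lookup` are all invented for the example.

```python
from collections import defaultdict


def index_documents(documents):
    """Build a mapping of (field, term) -> set of document IDs.

    Assumption: terms are lowercased, whitespace-separated words.
    """
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


def lookup(index, field, term):
    """Return the IDs of documents whose given field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


# Hypothetical documents with two fields each.
books = {
    1: {"title": "Doctor Who", "body": "no TARDIS no screwdriver"},
    2: {"title": "Screwdriver basics", "body": "who needs a manual"},
}

index = index_documents(books)
print(lookup(index, "title", "who"))  # matches in the title field only
print(lookup(index, "body", "who"))   # matches in the body field only
```

Because the field is part of the key, the same word in a title and in a body ends up in separate posting lists, which is what makes it possible to search or weight fields independently.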
punctuation and various mixes of upper/lower-case in tokens.
Bitch about tokenizer/filter options (or lack thereof in Sphinx/MySQL)…