A talk about the (hidden) document processing capability built right into Apache Solr. We show you what it is, how to use it, and how to write your own plugins, and we suggest some future improvements.
2. What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion
8. Why document processing?
But what if you want to:
Add or remove fields?
Make decisions based on other fields?
We need a way to modify the Document
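One way to picture such a mechanism is a list of small steps that each receive the document and may change it before it is indexed. The sketch below is a plain-Java model under stated assumptions, not Solr code: `DocStep`, `MiniChain`, and the field names `internal_note`, `text`, and `size_class` are hypothetical, and a `Map` stands in for the Solr document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one processing step; not a Solr class. */
interface DocStep {
    void process(Map<String, Object> doc);
}

/** Hypothetical linear chain that hands the same mutable document to each step. */
class MiniChain {
    private final List<DocStep> steps;

    MiniChain(List<DocStep> steps) {
        this.steps = steps;
    }

    /** Runs every step in order and returns the modified document. */
    Map<String, Object> run(Map<String, Object> doc) {
        for (DocStep step : steps) {
            step.process(doc);
        }
        return doc;
    }

    /** Example chain: remove one field, then derive a new field from another. */
    static MiniChain example() {
        DocStep removeInternal = doc -> doc.remove("internal_note");
        DocStep classifyBySize = doc -> {
            Object text = doc.get("text");
            int length = text == null ? 0 : text.toString().length();
            // Decision based on another field: long texts get a different label.
            doc.put("size_class", length > 20 ? "long" : "short");
        };
        return new MiniChain(List.of(removeInternal, classifyBySize));
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("text", "Hello Solr");
        doc.put("internal_note", "do not index");
        System.out.println(example().run(doc));
    }
}
```

The point of the model is only the shape: each step sees the whole document, so it can add fields, remove fields, or branch on the value of other fields.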
23. Other examples
Entity extraction
The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.
Entity types: Company, Location, Date
29. Writing your own processor
•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
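The first, third, and fourth bullets can be illustrated with a small plain-Java sketch. It does not use Solr's API: `RegexReplaceStep`, the `regexreplace.` prefix, and the parameter names are hypothetical, chosen only to show a step that is generic (configured entirely by parameters), uses prefixed parameter names, and keeps its core logic in a method that can be tested without a document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical generic step: regex replace on a configurable field. Not a Solr class. */
class RegexReplaceStep {
    // All parameter names share one prefix so they cannot clash with other parameters.
    static final String PREFIX = "regexreplace.";

    private final String field;
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceStep(Map<String, String> params) {
        this.field = require(params, "field");
        this.pattern = Pattern.compile(require(params, "pattern"));
        this.replacement = params.getOrDefault(PREFIX + "replacement", "");
    }

    private static String require(Map<String, String> params, String name) {
        String value = params.get(PREFIX + name);
        if (value == null) {
            throw new IllegalArgumentException("Missing parameter " + PREFIX + name);
        }
        return value;
    }

    /** Pure helper, kept separate so it can be unit tested on plain strings. */
    String transform(String input) {
        return pattern.matcher(input).replaceAll(replacement);
    }

    /** Applies the transformation to the configured field, if present. */
    void process(Map<String, Object> doc) {
        Object value = doc.get(field);
        if (value != null) {
            doc.put(field, transform(value.toString()));
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PREFIX + "field", "title");
        params.put(PREFIX + "pattern", "\\s+");
        params.put(PREFIX + "replacement", " ");

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Apache   Solr    rocks");
        new RegexReplaceStep(params).process(doc);
        System.out.println(doc.get("title"));
    }
}
```

Because nothing about the field or the pattern is hard-coded, the same class could be configured several times in one chain with different parameters.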
39. Donations back to Apache
SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)
Many thanks for the donations!
43. Improvements
Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lacks native scripting language support
44. Improvements
Pain:
Potentially expensive initialization
StaticRankProcessor: read & parse 50,000 lines
Proposed cure:
Keep persistent state object in factory:
private final Map<Object,Object> sharedObjCache
new StaticRankProcessor(params, request,
response, nextProcessor, sharedObjCache);
Processor uses sharedObjCache for state
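The proposed cure can be sketched in plain Java as follows. This is a model of the idea, not Solr code: `StaticRankFactorySketch`, `StaticRankStep`, the cache key `ranks`, and the `loads` counter are hypothetical, and the rank file is replaced by a tiny in-memory table. The long-lived factory owns `sharedObjCache`; each short-lived step receives it and loads the expensive state only if it is not already there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical factory that outlives individual requests. Not a Solr class. */
class StaticRankFactorySketch {
    // Shared across every step instance this factory hands out.
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    // Counts how often the expensive load actually runs (for demonstration only).
    final AtomicInteger loads = new AtomicInteger();

    /** Called once per request; creating the step itself is cheap. */
    StaticRankStep getInstance() {
        return new StaticRankStep(sharedObjCache, this::loadRanks);
    }

    /** Stand-in for reading and parsing a large rank file. */
    private Map<String, Integer> loadRanks() {
        loads.incrementAndGet();
        Map<String, Integer> table = new HashMap<>();
        table.put("http://example.org/", 10);
        return table;
    }
}

/** Hypothetical per-request step that keeps its state in the shared cache. */
class StaticRankStep {
    private final Map<Object, Object> sharedObjCache;
    private final Supplier<Map<String, Integer>> loader;

    StaticRankStep(Map<Object, Object> sharedObjCache, Supplier<Map<String, Integer>> loader) {
        this.sharedObjCache = sharedObjCache;
        this.loader = loader;
    }

    @SuppressWarnings("unchecked")
    void process(Map<String, Object> doc) {
        // Loads the table on first use only; later steps reuse the cached copy.
        Map<String, Integer> ranks =
                (Map<String, Integer>) sharedObjCache.computeIfAbsent("ranks", key -> loader.get());
        doc.put("rank", ranks.getOrDefault(doc.get("url"), 0));
    }

    public static void main(String[] args) {
        StaticRankFactorySketch factory = new StaticRankFactorySketch();
        for (int request = 0; request < 3; request++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("url", "http://example.org/");
            factory.getInstance().process(doc);
        }
        System.out.println("loads = " + factory.loads.get());
    }
}
```

A concurrent map is used here because several requests may create steps at the same time; whether that is the right choice for a real implementation is an open design question.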
46. Improvements
Pain:
Multiple chains often need identical Processors
UiO’s two chains share 80% -> copy/paste
Proposed cure:
Allow sharing of named instances
Define:
<processor name="langid" class="..">
Refer:
<processor ref="langid" />
See SOLR-2823
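The define/refer idea can be modelled in a few lines of plain Java. This is only an illustration of the concept, not the design in SOLR-2823: `SharedProcessorRegistry` and the processor names used in `main` are hypothetical. Each processor is defined once under a name, and every chain is built from references to those names, so two chains that overlap share instances instead of copied configuration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical registry of named processing steps. Not a Solr class. */
class SharedProcessorRegistry {
    private final Map<String, Consumer<Map<String, Object>>> named = new HashMap<>();

    /** "Define": register one instance under a unique name. */
    void define(String name, Consumer<Map<String, Object>> step) {
        if (named.putIfAbsent(name, step) != null) {
            throw new IllegalArgumentException("Duplicate processor name: " + name);
        }
    }

    /** "Refer": build a chain from the names of previously defined steps. */
    Consumer<Map<String, Object>> chain(String... refs) {
        List<Consumer<Map<String, Object>>> steps = new ArrayList<>();
        for (String ref : refs) {
            Consumer<Map<String, Object>> step = named.get(ref);
            if (step == null) {
                throw new IllegalArgumentException("Unknown processor ref: " + ref);
            }
            steps.add(step);
        }
        return doc -> steps.forEach(step -> step.accept(doc));
    }

    public static void main(String[] args) {
        SharedProcessorRegistry registry = new SharedProcessorRegistry();
        registry.define("langid", doc -> doc.put("language", "en")); // placeholder logic
        registry.define("trim", doc -> doc.put("title", doc.get("title").toString().trim()));

        // Two chains reuse the same "langid" instance without repeating its definition.
        Consumer<Map<String, Object>> webChain = registry.chain("trim", "langid");
        Consumer<Map<String, Object>> fileChain = registry.chain("langid");

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "  Solr  ");
        webChain.accept(doc);
        fileChain.accept(doc);
        System.out.println(doc);
    }
}
```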
48. Improvements
Pain:
Chains are linear only
Hard to do branching, sub chains, conditional...
Proposed cure (SOLR-2841):
New scriptable Update Chain - alternative to XML
Script chain logic in solr/conf/updateproc.groovy
Full flexibility:
chain myChain {
if(doc.getFieldValue("type").equals("pdf"))
process(tikaproc)
}
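For comparison, the same conditional logic can be modelled in plain Java. This is not the proposed Groovy DSL and not Solr code: `BranchingChainSketch`, `when`, and the placeholder steps are hypothetical. The sketch shows a sub chain that only runs when a predicate on the document holds, followed by a step that always runs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical model of a non-linear chain. Not a Solr class. */
class BranchingChainSketch {

    /** Wraps a sub chain so that it only runs when the condition matches the document. */
    static Consumer<Map<String, Object>> when(Predicate<Map<String, Object>> condition,
                                              Consumer<Map<String, Object>> subChain) {
        return doc -> {
            if (condition.test(doc)) {
                subChain.accept(doc);
            }
        };
    }

    /** Mirrors the slide: run the text-extraction step only for documents of type "pdf". */
    static Consumer<Map<String, Object>> myChain() {
        // Placeholder for a real extraction step such as the slide's tikaproc.
        Consumer<Map<String, Object>> tikaLike =
                doc -> doc.put("text", "extracted from " + doc.get("id"));
        // Unconditional step that runs for every document.
        Consumer<Map<String, Object>> markDone = doc -> doc.put("processed", Boolean.TRUE);
        return when(doc -> "pdf".equals(doc.get("type")), tikaLike).andThen(markDone);
    }

    public static void main(String[] args) {
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("id", "a.pdf");
        pdf.put("type", "pdf");
        Map<String, Object> html = new HashMap<>();
        html.put("id", "b.html");
        html.put("type", "html");

        myChain().accept(pdf);
        myChain().accept(html);
        System.out.println(pdf);
        System.out.println(html);
    }
}
```

Note that `"pdf".equals(...)` is used instead of calling `equals` on the field value, so a document without a `type` field skips the branch rather than failing.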
50. Improvements
Pain:
Single threaded
Heavy processing not efficient
Proposed cure:
Local: Use multi threaded update requests
SolrCloud: Dedicated nodes, role=“processor” ?
Wrap an external pipeline in UpdateProcessor
Example: OpenPipelineUpdateProcessor ?
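The "local" option, running heavy processing on several threads, can be sketched with the standard `ExecutorService`. This is a plain-Java illustration and does not show how Solr handles update requests: `ParallelProcessingSketch`, `processAll`, and the word-count step are hypothetical. Each document is processed by exactly one worker, so the step itself needs no locking as long as it only touches its own document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper that fans documents out over a fixed thread pool. */
class ParallelProcessingSketch {

    /** Applies the chain to every document in parallel and returns them in input order. */
    static List<Map<String, Object>> processAll(List<Map<String, Object>> docs,
                                                Consumer<Map<String, Object>> chain,
                                                int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Object>>> pending = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                pending.add(pool.submit(() -> {
                    chain.accept(doc);
                    return doc;
                }));
            }
            List<Map<String, Object>> done = new ArrayList<>();
            for (Future<Map<String, Object>> future : pending) {
                done.add(future.get()); // waits for this document to finish
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while processing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A processing step failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (String text : new String[] {"one two", "three four five"}) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("text", text);
            docs.add(doc);
        }
        // Stand-in for a heavy step: count whitespace-separated words.
        Consumer<Map<String, Object>> countWords =
                doc -> doc.put("words", doc.get("text").toString().split("\\s+").length);
        System.out.println(processAll(docs, countWords, 2));
    }
}
```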
52. Improvements
Pain:
Not really a “problem” :-)
Nice to write processors in Python, Groovy, JS...
Proposed cure:
Now: Finish SOLR-1725: Script based Processor
Later: Make scripts first-class processors
<processor script="myScript.py" />
or
<processor ref="myScript" />
54. New standalone framework?
•The UpdateChain is Solr specific
•Interest in a pure pipeline framework
•Search engine independent
•Scalable
•Rich pool of processors
•Several existing candidates
•Some initial thoughts:
http://wiki.apache.org/solr/DocumentProcessing
56. Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!
59. Alternative pipelines
•OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache commons pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitalPebble)
•FindWise and TwigKit also have some technology
60. Calling out from UpdateChain
This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.
61. Scaling with external pipeline
Here is a more advanced, distributed case, where a Solr node is dedicated to processing, and the entry point Solr only dispatches the requests.