Mir is a Media Information Retrieval system I've developed; this is the talk I will give at the next YAPC::EU in Cluj.
Mir is domain-agnostic, highly configurable, scales easily, and is 100% pure Perl. It can be used as a core library to build a complete IR system, and can be extended to gather data from any source.
Mir - a Media Information Retrieval system
1. 8/22/16 MIR-A Media Information Retrieval system
MIR
A Media Information Retrieval
System
Marco Masetti
grubert65@gmail.com
2. Who am I?
● Perl programmer since 2000
● Working in the research department of an Italian SME (> 200 employees)
● Both research and commercial projects
● Attended ACT conferences:
– Italian Perl Workshop 2012
– YAPC::Europe 2010
– Italian Perl Workshop 2009
– Italian Perl Workshop 2008
– Italian Perl Workshop 2006
– Italian Perl Workshop 2005
– Italian Perl Workshop 2004
– YAPC::Europe 2003
3. A gentle introduction to IR
● Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) (cit. C.D. Manning, "Introduction to Information Retrieval", 2009)
● MIR is a platform you can leverage to build
distributed IR systems
● The MIR core comes as a single distribution
4. The MIR system
● MIR lets you configure and run acquisition campaigns
● A campaign is split into phases: Acq → Store → Index → Delivery
● Several campaigns can run in parallel
● MIR is domain-agnostic, highly configurable, can easily scale and is implemented in Perl
5. MIR main packages and scripts
● Mir::Acq : Acq processors (+ fetchers)
● Mir::Store : Metadata store
● Mir::IR : Indexing component
● Mir::Config : System configuration
● Mir::Util : Utility libraries
● mir-acq-fetcher.pl : to run a single fetcher
● mir-acq-scheduler.pl : to schedule the launch of an acq campaign
● mir-acq-processor.pl : to handle the running of campaign fetchers
● mir-ir.pl : indexing process (event-based)
● mir-ir-polling.pl : indexing process (polls for new docs to index)
8. MIR system deployments (1)
● From simple configurations...
[Diagram: Node 1 runs the Acq processor with its fetchers; Node 2 hosts the metadata Store and the Index]
MIR system deployments (2)
● ...to complex ones...
[diagram: three Acq nodes, each running an Acq processor with several fetchers; two Store nodes holding the metadata store; six Index nodes holding the index]
Configuring a MIR system (1)
● Configuration is split into sections. Sections can be used to
configure other sub-systems MIR interacts with.
● Configuration sections can be centralized in a single node or
distributed
● By default a MIR system configuration section is stored in
the system collection of the MIR database of the local
Mongo server.
● Configuration sections can also be stored as distinct JSON
files.
● MIR lets you configure acquisition campaigns. A campaign is defined by a process chain running from data acquisition through metadata extraction down to information delivery
Configuring a MIR system (2)
● The system section usually holds:
– one configuration item for each acquisition campaign
– an optional configuration item for the ElasticSearch engine.
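As a sketch, such a system section could look like the following JSON. The field names here are illustrative assumptions (the slides do not show the actual schema); only the overall shape, one item per campaign plus an optional ElasticSearch item, comes from the slide:

```json
{
  "campaigns": {
    "weather": {
      "fetchers": [ "OpenWeather", "WU" ],
      "params":   { "lat": 46.77, "lon": 23.6, "radius": 10 }
    }
  },
  "elasticsearch": {
    "nodes": [ "localhost:9200" ],
    "index": "mir"
  }
}
```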
Metadata handling and storage
● A metadata class should be a Moose class that extends Mir::Doc or at least consumes the Mir::R::Doc role.
● The Mir::Doc class consumes the DriverRole and the Mir::R::Doc
roles
● The DriverRole implements the driver/interface pattern and lets you
create a new metadata in this way:
my $doc = Mir::Doc->create(driver => 'Tweet');
● The Mir::R::Doc adds some common attributes (unique id, creation
date,...) and consumes the MooseX::Storage role. The
MooseX::Storage role provides persistency to Moose objects.
● A metadata object is usually created by an acq fetcher and passed through the whole process chain, being enriched as processing proceeds.
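The driver/interface pattern behind create(driver => ...) can be sketched in plain core Perl. This is a minimal illustration of the factory idea, not the actual DriverRole implementation; the My::Doc classes are made up for the example:

```perl
use strict;
use warnings;

package My::Doc;

# Factory: map a driver name to a concrete class and instantiate it.
# In Mir this mapping is handled by DriverRole; here it is derived
# directly from the driver name.
sub create {
    my ($class, %args) = @_;
    my $driver_class = $class . '::' . $args{driver};
    return $driver_class->new(%args);
}

package My::Doc::Tweet;
our @ISA = ('My::Doc');

sub new {
    my ($class, %args) = @_;
    return bless { text => $args{text} // '' }, $class;
}

sub text { $_[0]->{text} }

package main;

my $doc = My::Doc->create(driver => 'Tweet', text => 'hello');
print ref($doc), "\n";   # My::Doc::Tweet
```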
Fetcher design
● A fetcher is a tiny bit of logic that gathers data from a given source and creates metadata. Fetchers come as Perl distributions. A fetcher class should consume the Mir::R::Acq::Fetch role and belong to the Mir::Acq::Fetcher namespace.
1) First define your metadata...

package Mir::Doc::Tweet;
use Moose;
with 'Mir::R::Doc';
has 'lang'  => (is => 'rw', isa => 'Str');
has 'place' => (is => 'rw', isa => 'Str');
has 'text'  => (is => 'rw', isa => 'Str');
...

2) then implement the fetcher...

package Mir::Acq::Fetcher::Twitter;
use Moose;
with 'Mir::R::Acq::Fetch';

sub get_docs {
    # all the logic to interact with Twitter...
    my $tweet = Mir::Doc::Tweet->new(
        lang  => 'en',
        place => 'Cluj-Napoca',
        text  => 'Following this weird pres...',
        ...
    );
    $tweet->store();
}
Fetchers implemented
Mir::Acq::Fetcher::FS: traverses (in many ways) a local or remote file system; creates a metadata object for each new file found
Mir::Acq::Fetcher::Instagram: retrieves published media according to configured criteria (e.g. lat, lon, radius)
Mir::Acq::Fetcher::Netatmo: weather data from the Netatmo network of weather stations
Mir::Acq::Fetcher::OpenWeather: weather data using the OpenWeather API
Mir::Acq::Fetcher::WU: weather data using the WeatherUnderground API
Mir::Acq::Fetcher::RSS: news feeds from an RSS channel
Mir::Acq::Fetcher::Twitter: uses the Twitter REST API; retrieves tweets according to configured criteria
Mir::Acq::Fetcher::TwitterStream: uses the Twitter Streaming API to retrieve tweets in stream mode
These were implemented by my team at my company for commercial projects (not freely available...); the aim is to use CPAN to build a collection of freely available fetchers.
How fetchers are organized...
● This is quite complex, but the complexity gives you freedom in metadata acquisition
● Basically:
– Either run a single fetcher via mir-acq-fetcher.pl
– Or configure an acq campaign using:
● mir-acq-scheduler.pl to schedule which campaign should start when (using, for example, the cron daemon)
● mir-acq-processor.pl to define how many fetchers for each campaign should be launched in parallel
● The scheduler and the processor talk through a network queue
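As a purely hypothetical illustration (the scheduler's actual command-line options are not shown in these slides), a system crontab entry starting a "weather" campaign every hour might look like:

```
0 * * * * mir /usr/local/bin/mir-acq-scheduler.pl --campaign weather
```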
MIR-IR – text extraction
● Metadata can easily refer to files in some format
● Mir::Util::DocHandler is a class to extract text from a good range of file types, including PDF, MS Office, ODF, images,...
● Drivers are implemented for each supported file suffix (Office, doc, docx, html, pdf, pdf2, pdf3, rtf, xls, ppt, txt, image); you get a driver for the proper format in the usual way:
my $dh = Mir::Util::DocHandler->create( driver => $suffix );
$dh->open_doc("/path/to/doc.$suffix");
my $num_pages = $dh->pages();
for my $page (1 .. $num_pages) {
    my ( $text, $confidence ) = $dh->page_text($page);
}
● LUTs can be configured to map other suffixes to existing drivers (e.g. 'htm' => 'html')
● The mir-doc-handler.pl script can be used to test how text extraction works
on a given doc
MIR-IR – text indexing
● The text indexing process deals with text extraction and indexing into ElasticSearch
● One text indexing process is active for each campaign
● Text indexing can be driven in 2 ways:
– The indexing process gets metadata from an indexing queue (useful for streaming sources, like the TwitterStream API or MPeg7)
– The indexing process polls the metadata store directly for new metadata to process
● The mir-ir.pl and mir-ir-polling.pl scripts implement the 2 behaviours
# Mir indexing processes on node 2
mir@mir2:~$ ps -ef | grep perl
perl mir-ir-polling.pl --campaign DocIndex_commerciale --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_eccairs --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_cv --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_progetti --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_segreteria_pub --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_segreteria_priv --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_segreteria_ris --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign DocIndex_segreteria_usr --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign eccairsSRIS --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
perl mir-ir-polling.pl --campaign eccairsAVIATION --config_driver Mongo --config_params '{"host":"192.168.56.21"}'
What next...
● Improve documentation (tutorials,...)
● MIR system Monitoring GUI
● Use Vagrant or similar to ship appliances
Thank YOU!!!
Enjoy your YAPC!!!