This slideset presents the Nutch search engine (http://lucene.apache.org/nutch). A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.
Nutch as a Web data mining platform
1. Apache Nutch as a Web mining platform
Nutch – Berlin Buzzwords '10
the present and the future
Andrzej Białecki
ab@sigram.com
2. Intro
● Started using Lucene in 2003 (1.2-dev?)
● Created Luke – the Lucene Index Toolbox
● Nutch, Lucene committer, Lucene PMC member
● Nutch project lead
3. Agenda
● Nutch architecture overview
● Crawling in general – strategies and challenges
● Nutch workflow
● Web data mining with Nutch, with examples
● Nutch present and future
● Questions and answers
4. Apache Nutch project
● Founded in 2003 by Doug Cutting, the Lucene
creator, and Mike Cafarella
● Apache project since 2004 (sub-project of Lucene)
● Spin-offs:
– Map-Reduce and distributed FS → Hadoop
– Content type detection and parsing → Tika
● Many installations in operation, mostly vertical search
● Collections typically 1 mln - 200 mln documents
● Apache Top-Level Project since May
● Current release 1.1
5. What's in a search engine?
… a few things that may surprise you!
6. Search engine building blocks
[Diagram: building blocks – Injector, Scheduler, Crawler, Parser, Updater, Indexer and Searcher, arranged around a Web graph (page info, in/out links) and a Content repository, all subject to crawling frontier controls]
7. Nutch features at a glance
● Plugin-based, highly modular:
● Most behaviors can be changed via plugins
● Data repository:
– Page status database and link database (web graph)
– Content and parsed data database (shards)
● Multi-protocol, multi-threaded, distributed crawler
● Robust crawling frontier controls
● Scalable data processing framework
● Hadoop MapReduce processing
● Full-text indexer & search front-end
● Using Solr (or Lucene)
● Support for distributed search
● Flexible integration options
8. Search engine building blocks
[Building-blocks diagram repeated from slide 6]
10. Nutch data
CrawlDB maintains info on all known URL-s:
● Fetch schedule
● Fetch status
● Page signature
● Metadata
[Diagram: Injector, Generator, Fetcher, Parser, Link inverter, Updater, Indexer and Searcher operating on CrawlDB, LinkDB and shards (segments); URL filters & normalizers, parsing/indexing filters and scoring plugins apply throughout]
11. Nutch data
LinkDB keeps, for each target URL, info on incoming links, i.e. the list of source URL-s and their associated anchor text.
[Same workflow diagram as slide 10]
12. Nutch data
Shards (“segments”) keep:
● Raw page content
● Parsed content + discovered metadata + outlinks
● Plain text for indexing and snippets
[Same workflow diagram as slide 10]
13. Shard-based workflow
● Unit of work (batch) – easier to process massive datasets
● Convenience placeholder, using predefined directory names
● Unit of deployment to the search infrastructure
– Solr-based search may discard shards once indexed
● Once completed they are basically unmodifiable
– No in-place updates of content, or replacing of obsolete content
● Periodically phased-out by new, re-crawled shards
– Solr-based search can update Solr index in-place
A shard directory (named by timestamp) and the tools that write each part:
200904301234/
  crawl_generate/  ← Generator
  crawl_fetch/     ← Fetcher
  content/         ← Fetcher (“cached” view)
  crawl_parse/     ← Parser
  parse_data/      ← Parser
  parse_text/      ← Parser (plain text for snippets, input to Indexer)
14. Crawling frontier challenge
● No authoritative catalog of web pages
● Crawlers need to discover their view of web universe
● Start from a “seed list” & follow (walk) some (useful? interesting?) outlinks
● Many dangers of simply wandering around:
– explosion or collapse of the frontier; collecting unwanted content (spam, junk, offensive)
[Cartoon: “I need a few interesting items...”]
15. High-quality seed list
● Reference sites:
– Wikipedia, FreeBase, DMOZ seed + 1 hop
– Existing verticals
● Seeding from existing search engines
– Collect top-N URL-s for seed-characteristic keywords
● Seed URL-s plus 1:
– First hop usually retains high quality and focus
– Remove blatantly obvious junk
17. Wide vs. focused crawling
● Differences:
– Little technical difference in configuration
– Big difference in operations, maintenance and quality
● Wide crawling:
– (Almost) unlimited crawling frontier
– High risk of spamming and junk content
– “Politeness” a very important limiting factor
– Bandwidth & DNS considerations
● Focused (vertical or enterprise) crawling:
– Limited crawling frontier
– Bandwidth or politeness is often not an issue
– Low risk of spamming and junk content
18. Vertical & enterprise search
● Vertical search
– Range of selected “reference” sites
– Robust control of the crawling frontier
– Extensive content post-processing
– Business-driven decisions about ranking
● Enterprise search
– Variety of data sources and data formats
– Well-defined and limited crawling frontier
– Integration with in-house data sources
– Little danger of spam
– PageRank-like scoring usually works poorly
19. Face to face with Nutch
20. Installation & basic config
● http://nutch.apache.org
● Java 1.5+
● Single-node out of the box
– Also comes as a “job” jar to run on an existing Hadoop cluster
● File-based configuration: conf/
– Plugin list
– Per-plugin configuration
● … much, much more on this on the Wiki
21. Main Nutch workflow
Command-line: bin/nutch <command>
● inject – initial creation of CrawlDB
– Insert seed URLs
– Initial LinkDB is empty
● generate – generate a new shard's fetchlist
● fetch – fetch raw content
● parse – parse content (discovers outlinks)
● updatedb – update CrawlDB from shards
● invertlinks – update LinkDB from shards
● index / solrindex – index shards
(repeat)
29. Map-reduce indexing
● Map() just assembles all parts of documents
● Reduce() performs text analysis + indexing:
– Sends assembled documents to Solr
or
– Adds to a local Lucene index
● Other possible MR indexing models:
– Hadoop contrib/indexing model:
● analysis and indexing on map() side
● Index merging on reduce() side
– Modified Nutch model:
● Analysis on map() side
● Indexing on reduce() side
30. Nutch integration
● Nutch search & tools API
– Search via REST-style interaction, XML / JSON response
– Tools CLI and API to access bulk & single Nutch items
– Single-node, embedded, distributed (Hadoop cluster)
● Data-level integration: direct MapFile / SequenceFile reading
– More complicated (and still requires using Nutch classes)
– May be more efficient
– Future: native tools related to data stores (HBase, SQL, ...)
● Exporting Nutch data
– All data can be exported to plain text formats
– bin/nutch read*
● ...db – read CrawlDB and dump some/all records
● ...linkdb – read LinkDb and dump some/all records
● ...seg – read segments (shards) and dump some/all records
32. Nutch search
● Solr indexing and searching (preferred)
– Simple Lucene indexing / search available too
● Using Solr search:
– DisMax search over several fields (url, title, body, anchors)
– Faceted search
– Search results clustering
– SolrCloud:
● Automatic shard replication and load-balancing
● Hashing update handler to distribute docs to Solr shards
33. Search-based analytics
● Keyword search → crude topic mining
● Phrase search → crude collocation mining
● Anchor search → crude semantic enrichment
● Feedback loop from search results:
– Faceting and on-line clustering may discover latent topics
– Top-N results for reference queries may prioritize further crawling
● Example: question answering system
– Source documents from reference sites
– NLP document analysis: key-phrase detection, POS-tagging,
noun-verb / subject-predicate detection, enrichment from DBs
and semantic nets
– NLP query analysis: expected answer type (e.g. person, place,
date, activity, method, ...), key-phrases, synonyms
– Regular search
– Evaluation of raw results (further NLP analysis of each document)
34. Web as a corpus
● Examples:
– Source of raw text in a specific language
– Source of text on a given subject
● Selection by e.g. a presence of keywords, or full-blown NLP
● Add data from known reference sites (Wikipedia, Freebase) or databases (Medline) or semantic nets (WordNet, OpenCyc)
– Source of documents in a specific format (e.g. PDF)
● Nutch setup:
– URLFilters define the crawling frontier and content types
– Parse plugins determine the content extraction / processing
● e.g. language detection
● Nutch shards:
– Extracted text, metadata, outlinks / anchors
35. Web as a corpus (2)
● Concept mining
– Harvesting human-created concept descriptions and
associations
– “kind of”, “contains”, “includes”, “application of”
– Co-occurrence of concepts has some meaning too!
● Example: medical search engine
– Controlled vocabulary of diseases, symptoms, procedures
– Identifiable metadata: author, journal, publication date, etc.
– Nutch crawl of reference sites and DBs
● Co-occurrence of controlled vocabulary
– BloomFilter-s for quick trimming of map-side data
– Or Mahout collocation mining for uncontrolled concepts
● Cube of co-occurring (related) concepts
● Several dimensions to traverse
– “authors who publish most often together on treatment of myocardial infarction”
● 10 nodes, 100k phrases in vocabulary, 20 mln pages, ~300 bln phrases on map side → ~5 GB data cube
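The pair counting behind such a concept cube can be sketched single-node; the class, vocabulary and sample pages below are hypothetical, and a real Nutch/Hadoop job would emit (pair, 1) records from map() and sum them in reduce(), trimming candidates with BloomFilters first.

```java
import java.util.*;

public class ConceptCooccurrence {
    // Count co-occurrences of controlled-vocabulary terms within pages.
    public static Map<String, Integer> countPairs(List<String> pages, Set<String> vocabulary) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String page : pages) {
            // collect the vocabulary terms present in this page
            SortedSet<String> present = new TreeSet<>();
            for (String term : vocabulary)
                if (page.toLowerCase().contains(term)) present.add(term);
            // emit every unordered pair of co-occurring terms
            for (String a : present)
                for (String b : present.tailSet(a + "\0"))   // terms strictly after a
                    counts.merge(a + "|" + b, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("aspirin", "stroke", "infarction");
        List<String> pages = List.of(
            "Aspirin reduces risk of myocardial infarction",
            "Aspirin after stroke",
            "Stroke rehabilitation");
        System.out.println(countPairs(pages, vocab));
        // → {aspirin|infarction=1, aspirin|stroke=1}
    }
}
```

Keeping only sorted pairs (a < b) halves the emitted keys, which matters when the map side produces hundreds of billions of phrase pairs as in the example above.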
36. Web as a directed graph
● Nodes (vertices): URL-s as unique identifiers
● Edges (links): hyperlinks like <a href=”targetUrl”/>
● Edge labels: <a href=”..”>anchor text</a>
● Often represented as adjacency (neighbor) lists
● Inverted graph: LinkDB in Nutch
Straight (outlink) graph:
1 → 2a, 3b, 4c, 5d, 6e
5 → 6f, 9g
7 → 3h, 4i, 8j, 9k

Inverted (inlink) graph:
2 ← 1a
3 ← 1b, 7h
4 ← 1c, 7i
5 ← 1d
6 ← 1e, 5f
8 ← 7j
9 ← 5g, 7k
37. Link inversion
● Pages have outgoing links (outlinks)
… I know where I'm pointing to
● Question: who points to me?
… I don't know, there is no catalog of pages
… NOBODY knows for sure either!
● In-degree may indicate importance of the page
● Anchor text provides important semantic info
● Answer: invert the outlinks that I know about,
and group by target (Nutch 'invertlinks')
[Diagram: outlink records (src → tgt) regrouped by target into inlink records (tgt ← src)]
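The regrouping step can be sketched with plain collections; this is a single-node stand-in for the MapReduce shuffle that 'invertlinks' relies on, with illustrative class name and URLs.

```java
import java.util.*;

public class LinkInverter {
    // Invert (source → [target, anchor]) records into target → [source, anchor].
    // Each String[] holds {url, anchorText}.
    public static Map<String, List<String[]>> invert(Map<String, List<String[]>> outlinks) {
        Map<String, List<String[]>> inlinks = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> e : outlinks.entrySet()) {
            String src = e.getKey();
            for (String[] link : e.getValue())               // link = {targetUrl, anchorText}
                inlinks.computeIfAbsent(link[0], k -> new ArrayList<>())
                       .add(new String[] { src, link[1] });  // group by target
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> out = new HashMap<>();
        out.put("http://a.com", List.of(new String[] {"http://b.com", "b site"}));
        out.put("http://c.com", List.of(new String[] {"http://b.com", "best b"}));
        System.out.println(invert(out).get("http://b.com").size()); // → 2
    }
}
```

In the distributed version the shuffle does the grouping for free: map() emits (targetUrl, {sourceUrl, anchor}) and reduce() simply collects each target's list.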
38. Web as a recommender
● Links as recommendations:
– Link represents an association
– Anchor text represents a recommended topic
● … with some surrounding text of a hyperlink?
● Not all pages are created equal
– Recommendations from good pages are useful
– Recommendations from bad pages may be useless
– Merit / guilt by association:
● Links from good pages should improve the target's reputation
● Links from bad pages may compromise good pages' reputation
● Not all recommendations are trustworthy
– What links to trust, and to what degree?
– Social aspects: popularity, fashion, mobbing, fallacy of
“common belief”
39. Link analysis and scoring
● PageRank
– Query-independent page weight
– Based on the flow of weight along link paths
● Dampening factor α to stabilize the flow
● Weight from “dangling nodes” redistributed
[Figure: example graph with initial page weights of 1 converging through 1.25, 1.25, 0.75, 0.75 to 1.06, 1.31, 0.69, 0.94]
● Other models
– Hyperlink-Induced Topic Search (HITS)
● Query-dependent, local iterations, hub/authority
– TrustRank
● Propagation of “trust” based on human expert evaluation of seed sites
● Challenges
– Loops, link spam, cliques, loosely connected subgraphs, mobbing, etc
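The weight-flow model with damping factor α and dangling-node redistribution can be sketched as a power iteration; this is a generic PageRank sketch on a made-up 4-page graph, not Nutch's actual scoring plugin code.

```java
import java.util.*;

public class TinyPageRank {
    // outlinks[p] lists the pages that page p links to.
    public static double[] pageRank(int[][] outlinks, int n, double alpha, int iters) {
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // uniform starting weights
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - alpha) / n);      // teleportation term
            double dangling = 0;                     // weight of pages with no outlinks
            for (int p = 0; p < n; p++) {
                if (outlinks[p].length == 0) { dangling += rank[p]; continue; }
                double share = rank[p] / outlinks[p].length;
                for (int t : outlinks[p]) next[t] += alpha * share;
            }
            for (int p = 0; p < n; p++)
                next[p] += alpha * dangling / n;     // redistribute dangling weight
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 → 1,2 ; 1 → 2 ; 2 → 0 ; 3 → 2
        int[][] g = { {1, 2}, {2}, {0}, {2} };
        double[] r = pageRank(g, 4, 0.85, 100);
        System.out.printf("%.3f%n", r[2]);           // page 2 collects the most weight
    }
}
```

Note that the total weight stays constant across iterations (teleportation plus flow sums back to 1), which is why the iteration converges rather than exploding.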
40. Nutch link analysis tools
● Tools for PageRank calculation with loop detection
– LinkDb: source of anchor text (think “recommended topics”)
– Page in-degree ≈ popularity / importance / quality
– Scoring API (and plugins) to control the flow of page importance
along link paths
● Nutch shards:
– Source of outlinks → expanding the crawling frontier
– Page linked-ness vs. its content: hub or authority
● Example: porn / junk detection
– Links to “porn” pages poisonous to importance / quality
– Links from “porn” pages decrease the confidence in quality of the
target page
● Example: vertical crawl
– Expanding to pages “on topic” == with sufficient in-link support
from known on topic pages
41. Web of gossip and opinions
● General Web – not considering special-purpose
networks here...
● Example:
– Who / what is in the news?
– How often a name is mentioned?
● today Google yields 44,500 hits for ab@getopt.org
– What facts about me are publicly available?
– What is the sentiment associated with a name (person,
organization, trademark)?
● Nutch setup:
– Seed from a few reference news sites, blogs, Twitter, etc
– Use Nutch plugin for RSS/Atom crawling
– NLP parsing plugins (NER, classification, sentiment analysis)
● Nutch shards:
– Capture temporal aspect
42. Web as a source of … anything
● The data is there, just lost among irrelevant stuff
– Difficult to find → good seed list + crawling frontier controls
– Mixed with junk & irrelevant data → URL & content filtering
● Be creative – combine multiple strategies:
– Crawl for raw data, stay on topic – filter out junk early
– Use plain indexing & search as a crude analytic tool
– Use creative post-processing to filter and enhance the data
– Export data from Nutch and pipe it to other tools (Pig,
HBase, Mahout, ...)
43. Future of Nutch
● Nutch 2.0 re-design
– Refactoring, cleanup, better scale-up / scale-down
– Avoid code duplication
– Expected release ~Q4 2010
● Share code with other crawler projects →
crawler-commons
● Indexing & Search → Solr, SolrCloud
– Distributed and replicated search is difficult
– Initial integration needs significant improvement
– Shard management – SolrCloud / Zookeeper
● Web-graph & page repository → ORM layer
– Combine CrawlDB, LinkDB and shard storage
– Avoid tedious shard management
– Gora ORM mapping: HBase, SQL, Cassandra? BerkeleyDB?
– Benefit from native tools specific to storage → easier integration
44. Future of Nutch (2)
● What's left then?
– Crawling frontier management, discovery
– Re-crawl algorithms
– Spider trap handling
– Fetcher
– Ranking: enterprise-specific, user-feedback
– Duplicate detection, URL aliasing (mirror detection)
– Template detection and cleanup, pagelet-level crawling
– Spam & junk control
● Vision: à la carte toolkit, scalable from 1 to 1000s of nodes
– Easier setup for small 1-node installs
– Focus on a reliable, easy to integrate framework
45. Conclusions
(This overview is just the tip of the iceberg)
Nutch
● Implements all core search engine components
● Extremely configurable and modular
● Scales well
● A complete crawl & search platform – and a toolkit
● Easy to use as an input feed to data collecting and
data mining tools
46. Q&A
● Further information:
– http://nutch.apache.org/
– user@nutch.apache.org
● Contact author:
– ab@sigram.com