SPIRES is the largest bibliographic database for high-energy physics, arXiv is the largest full-text repository of high-energy physics papers, and INSPIRE is the digital library that merges the two.
Dr. Sarah Rainey discusses the concept of citizenship and who counts as a citizen. She notes that citizenship is influenced by race, gender, class, and other factors. Additionally, she explores how state actions and policies can have different impacts on men and women. Dr. Rainey also examines the historical connections between concepts of "civilization", masculinity, and femininity as well as the public and private spheres.
This document provides a training course on creating workbooks in Microsoft Excel 2003. It covers creating a new workbook, entering text and numbers into cells, editing data, and inserting and deleting columns and rows. The training includes lessons on meeting the workbook, entering data, and editing data and worksheets. It provides instructions, examples, and practice questions to teach students the basics of working in Excel.
Mobile marketing is becoming more common with new technologies. Some examples include interactive billboards that allow people to design shoes and see their design displayed, smartphone photo contests where people can win prizes by finding and photographing promotional items, and TV shows offering mobile wallpapers and games for additional revenue. Mobile engagement allows companies to promote brands, generate sales, and increase revenues across multiple platforms.
Parker Weber has Holland Codes of Realistic, Enterprising, and Investigative. They are drawn to work that involves athletic or mechanical skills, helping and influencing people, and problem solving. The ideal work environment would involve limited time behind a desk. Two potential career paths that align with their interests are anesthesiologist and lawyer. Anesthesiology involves monitoring patients during surgery and providing pain relief, while law involves legal research, advocacy, and advising clients. Both careers require extensive education but offer high earning potential. Potential employers for each path are listed.
Hannah Gibson outlines 25 green changes individuals can make, including unplugging unused items, using power bars and fluorescent light bulbs, buying organic food and riding bikes instead of driving, recycling various materials, taking shorter showers and only running major appliances with full loads, planting trees, and purchasing items with less packaging. The list aims to save energy, water, and reduce waste and environmental impact through more sustainable habits at home.
This is a movement that advocates opening humanity's eyes to a new, better, and more balanced world, free of most of the serious problems that exist today.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise boosts blood flow, releases endorphins, and promotes changes in the brain which help enhance one's emotional well-being and mental clarity.
The design influence is merely a reflection of our culture and expectations for user interfaces. Ideally, these trends represent favorable ideas in the web design community. However, designers will always have their own opinions when it comes to design terms, so take these ideas with a grain of salt.
This document discusses predicting defects in the system testing phase using a model based on a six sigma approach. The research aims to establish a defect prediction model to determine the number of defects to be found before testing begins. The model would help with resource planning, test coverage, and meeting deadlines. The research applies a define-measure-analyze-design-verify process to build the model using regression analysis on data from previous projects. Factors like requirements errors, design errors, and code errors are analyzed to determine their relationship to defects found during testing. The initial results found several significant factors that could be used to reliably predict defects.
Presentation for WOM Marketing Summit 2013 by Edelman Japan - Edelman Japan
This document discusses the importance of storytelling in modern marketing and communications. It argues that in today's multi-screen world with an infinite amount of content but limited attention spans, storytelling is essential to engage audiences. Stories help audiences remember key messages, as most people need to hear information more than three times to recall it. The document also provides tips for promoting stories through creating engaging content and measuring the success of storytelling initiatives.
The document lists the last 20 winners of the Nobel Peace Prize, from 1993 to 2013, including organizations such as the Organisation for the Prohibition of Chemical Weapons in 2013, the European Union in 2012, and the United Nations in 2001, as well as notable individuals such as Ellen Johnson Sirleaf, Leymah Gbowee, and Tawakkol Karman in 2011 and Barack Obama in 2009.
The document describes MontySolr, an extension that allows embedding CPython in Solr. It was created to connect Python and Java applications without compromises. MontySolr is robust, tested, and open source. It can be used for any Python application like Django or any C/C++ app that Python understands. The author provides context on their work at CERN and the challenges of connecting their Python-based digital library software Invenio to the Java search engine Solr. They evaluated different options for embedding Solr in non-Java applications before settling on their approach using MontySolr.
Keynote: from publisher to platform, How The Guardian Embraced the Internet ... - lucenerevolution
The Guardian embraced the internet by developing an open platform and open web principles. It moved from being solely a publisher to also being a platform, opening up its content through APIs to allow third-party developers to build applications. This helped drive significant traffic growth. To support its platform ambitions and developer partners, the Guardian evolved its technical architecture to be more scalable, reliable and high performing, adopting technologies like Solr, Memcached and cloud hosting.
How The Guardian Embraced the Internet using Content, Search, and Open Source - Lucidworks (Archived)
This talk will cover how The Guardian opened up its business, enriched it, and reached new markets with its Open Platform strategy. Stephen will cover the technical architecture, the implementation of Solr (the key technology powering the platform), and how The Guardian has used it to embrace disruption in the media space while finding new sources of revenue and innovation.
In the mid-1990s, the high-energy physics community (think Fermilab and CERN) started planning for the Large Hadron Collider. Managing the petabytes of data that would be generated by the facility and sharing it with the globally distributed community of over 10,000 researchers would be a major infrastructure and technology problem. This same community that brought us the web has now developed standards, software, and infrastructure for grid computing. In this seminar I'll present some of the exciting science that is being done on the Open Science Grid, the US national cyberinfrastructure linking 60 institutions (Harvard included) into a massive distributed computing and data processing system.
This document summarizes the progress of the Enhanced Publications (EP) Project. It discusses developments in creating enhanced digital publications, building a database of EP examples, disseminating information about EPs, and addressing challenges in preserving dynamic digital objects and convincing stakeholders of the value of EPs. The EP Project aims to innovate hybrid forms of scholarly publishing in the humanities and social sciences.
Metaheuristic Optimization: Algorithm Analysis and Open Problems - Xin-She Yang
The document discusses metaheuristic algorithms for optimization problems. It begins with introductions from two experts about computational science and the usefulness of models. It then provides an overview of different metaheuristic algorithms like simulated annealing, genetic algorithms, and particle swarm optimization. The document discusses how these algorithms generate new solutions through techniques like probabilistic moves, Markov chains, crossover and mutation. It provides examples and diagrams to illustrate how various metaheuristic algorithms work.
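To make the "probabilistic moves" idea in the summary above concrete, here is a minimal sketch of simulated annealing on a one-dimensional function. It is an illustration only, not code from the summarized deck: the function name `simulated_annealing`, the geometric cooling schedule, and all default parameter values are assumptions chosen for the example.

```python
import math
import random


def simulated_annealing(objective, start, step=0.5, t0=1.0,
                        cooling=0.995, iters=5000, seed=0):
    """Minimize a 1-D objective with a basic simulated annealing loop.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    temp = t0
    for _ in range(iters):
        # Propose a random neighbour of the current solution.
        candidate = current + rng.uniform(-step, step)
        cand_val = objective(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp) (the Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = current, current_val
        temp *= cooling  # geometric cooling (an assumed schedule)
    return best, best_val


if __name__ == "__main__":
    best, best_val = simulated_annealing(lambda x: (x - 3.0) ** 2, start=0.0)
    print(best, best_val)
```

Early on, the high temperature lets the search accept worse candidates and escape local minima; as the temperature falls, the loop behaves more and more like plain descent.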
This document is part 3 of a seminar on new energy for Vietnam. It will discuss the science behind new energy by answering 4 questions: (1) how we know zero point energy exists, (2) how its existence improves our understanding of nature, (3) how we can access it, and (4) what we can do with it. It will overview 12 key theories in physics that are important for developing new energy applications, including quantum mechanics, electrodynamics, and theories around low-energy nuclear reactions. The goal is to make information about discoveries in new energy physics available to Vietnamese scientists since this information is often excluded from textbooks by petroleum and nuclear power industries.
This document provides an overview of particle accelerators. It notes that there are over 30,000 accelerators worldwide used for research, medicine, and industry. Accelerators are used to produce beams of particles like electrons and ions that act as probes for scientific research. Examples of applications mentioned include medical isotopes and radiation therapy, material modification and analysis using synchrotron light sources, and particle physics research using large facilities like CERN. The document aims to give the reader a broad introduction to the field of accelerators and their diverse applications.
Citing and reading behaviours in high energy physics - Proyecto CeVALE2
This document analyzes citation and reading behaviors in the field of high-energy physics (HEP). It finds that:
1) There is a large citation advantage for HEP papers that are freely available as preprints online, as demonstrated by an analysis of citations in the SPIRES database.
2) No discernible citation advantage was found for publishing papers in open access journals compared to non-open access journals.
3) An analysis of usage logs in the SPIRES digital library shows that HEP scientists rarely read journals and prefer accessing preprints instead.
The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to our free online chemical compound database, ChemSpider. In this talk we will survey our recent efforts to extract all kinds of data – chemical structures, experimental and bibliographic data – from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions to host in our ChemSpider Reactions database and will discuss the potential applications of optical structure recognition technologies for converting structure images to structures as well as using similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.
This document summarizes a presentation on the philosophy of the web. It discusses the origins of studying the philosophy of the web and past events on this topic. It also examines how philosophical concepts like ontologies and proper names have been "artifactualized" through their implementation on the web. Finally, it discusses the idea of "philosophical engineering" proposed by Berners-Lee, where philosophies and protocols are used to build the technical infrastructure of the web.
The Large Hadron Collider (LHC) is the world's largest and most powerful particle collider located in a tunnel under the France-Switzerland border. Built by CERN between 1998 and 2008, the LHC was designed to collide beams of protons or heavy ions at very high energies to study the smallest known particles and discover new ones. Its primary goals are to test theories like supersymmetry and the existence of the Higgs boson, which was discovered in 2012. The LHC contains several large detectors that analyze collision data to advance understanding of fundamental physics.
Jean-Claude Bradley presents at the Science Commons Symposium on Feb 20, 2010 at the Microsoft Campus in Redmond. The talk covers doing Open Notebook Science using free and hosted tools, including new archiving protocols developed with Andrew Lang.
The document discusses the philosophy of the web and philosophical engineering. It describes early events on the philosophy of the web like IRW2006. It discusses how philosophical concepts like proper names have become "artifactualized" through their implementation as URIs. The philosophy of technology is relevant to understand how opposing philosophical positions have been united through technical artifacts. The document argues that philosophical engineering through building the technical protocols requires interpreting what was previously constructed.
The document introduces Yann Yu from Lucidworks and provides background on Lucidworks, Solr, and Hadoop. It discusses how Solr can be used to provide search capabilities for large amounts of both structured and unstructured data stored in Hadoop. Integrating Solr and Hadoop allows for fast search across big data stored in Hadoop along with real-time indexing and querying capabilities. Examples discussed include enabling enterprise-wide search of documents stored in Hadoop and using Flume to index log data from Hadoop into Solr for real-time analytics and search.
Couchbase Connect 2014: Lucidworks CEO Will Hayes takes you on a fantastic voyage through the hope and the hype of big data and why the future is search-centric.
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr - Lucidworks (Archived)
The document discusses integrating Hadoop and Solr to enable fast, ad-hoc search across structured and unstructured big data stored in Hadoop. It provides examples of how Hadoop can be used for large-scale storage and processing while Solr is used for real-time querying and search. Specifically, it describes how the Lucidworks HDFS connector can process documents from HDFS and index them into SolrCloud for search, and how log data can be ingested from Flume into HDFS for archiving and extracted fields can be indexed into Solr in real-time for search and analytics dashboards.
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business - Lucidworks (Archived)
Box uses the Solr search platform to power content search across its 25 million+ users. Some key aspects of Box's search implementation with Solr include:
1) The Solr index is split across multiple shards for high availability and scalability, with each file identifier mapped to a specific shard.
2) Search queries are handled by a front-end load balancer that distributes queries across multiple search head nodes for high availability.
3) Solr documents contain metadata like file owner, parent folders, and extracted text to support search by content, ownership, and folder structure.
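The mapping of file identifiers to shards mentioned in point 1 can be sketched as follows. The summary does not say how Box actually routes documents, so this hash-and-modulo scheme, the function name `shard_for`, and the choice of MD5 are all assumptions made purely for illustration.

```python
import hashlib


def shard_for(file_id: str, num_shards: int) -> int:
    """Map a file identifier to a shard index in [0, num_shards).

    Hypothetical routing rule: hash the identifier, then take the
    result modulo the shard count. Not Box's documented scheme.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # A stable hash (unlike Python's built-in hash()) gives the same
    # shard for the same identifier across processes and restarts.
    digest = hashlib.md5(file_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for fid in ("file-1", "file-2", "file-3"):
        print(fid, "->", shard_for(fid, 8))
```

A plain modulo mapping is easy to reason about, but changing the shard count moves most identifiers to a different shard, which is one reason real deployments often prefer hash ranges or consistent hashing.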
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance - Lucidworks (Archived)
The document discusses benchmarking the performance of SolrCloud clusters. It describes Timothy Potter's experience operating a large SolrCloud cluster at Dachis Group. It outlines a methodology for benchmarking indexing performance by varying the number of servers, shards, and replicas. Results show near-linear scalability as nodes are added. The document also introduces the Solr Scale Toolkit for deploying and managing SolrCloud clusters using Python and AWS. It demonstrates integrating Solr with tools like Logstash and Kibana for log aggregation and dashboards.
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine - Lucidworks (Archived)
The document discusses how search has evolved beyond traditional keyword search to include more complex tasks like recommendations, classifications, and analytics using distributed technologies like Hadoop. It provides an overview of new capabilities in Lucene/Solr like reduced memory usage, pluggable codecs, and spatial search upgrades. LucidWorks offers products like Solr and SiLK that integrate with Hadoop and provide search and analytics capabilities across distributed data.
This document discusses integrating search capabilities with Hadoop's big data analytics. It explains that Hadoop is well-suited for distributed storage and processing of large datasets, while search excels at free-text retrieval and indexing large amounts of text. The document outlines how the speaker's company integrated Hadoop and search using HBase replication to a search index, allowing results from Hadoop jobs to be searchable in near real-time. It provides an example use case of monitoring tweets for keywords and extracting mentioned URLs to visualize popular links.
Solr 4.7 and 4.8 include new features such as asynchronous execution of long-running actions, cursors for deep paging, document expiration, dynamic synonyms and stopwords, SSL support in SolrCloud, and improved collections API. Future versions will focus on ZooKeeper as the single source of truth, incremental field updates, multi-valued DocValues sorting, and removing legacy field types. The speaker also discussed related open source projects from LucidWorks for deploying Solr on AWS, log processing, and data quality.
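The cursor-based deep paging mentioned above can be sketched as a small client-side loop. In Solr, a request carries a `cursorMark` parameter (starting at `*`) and the response returns `nextCursorMark`; paging ends when the returned mark equals the one that was sent. The helper below is a hedged illustration: `iterate_with_cursor` and the `fetch` callable are hypothetical names, `fetch` stands in for whatever HTTP client actually sends the request, and the sort is assumed to include a unique key field called `id`.

```python
def iterate_with_cursor(fetch, query, sort="id asc", rows=100):
    """Yield every document matching `query` using cursor paging.

    `fetch` is a hypothetical callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # the initial cursor mark
    while True:
        response = fetch({
            "q": query,
            "sort": sort,  # assumed to include the unique key field
            "rows": rows,
            "cursorMark": cursor,
        })
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged mark means there are no more results
        cursor = next_cursor
```

Because each request resumes from a mark instead of skipping `start` rows, the cost of a page does not grow with how deep into the result set the client has gone.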
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr - Lucidworks (Archived)
This document discusses how Apache Solr can power ecommerce search and provides examples of companies using it. It outlines basic features for ecommerce like facets, highlighting, and boosting as well as advanced features like spatial search and analytics. The document also provides tips for ecommerce search like understanding user needs, debugging issues, and leveraging signals from user behavior to improve relevance.
Target transitioned from their previous search platform to using Solr. Some benefits they found included the speed of importing data into Solr and the ease of adding additional data signals to improve relevancy. However, they had to start from scratch on their relevancy strategy in Solr and found facets worked differently between the platforms. Target also discussed how they were able to improve relevancy by incorporating guest activity data on their website to surface more viewed and ordered items.
Exploration of multidimensional biomedical data in PubChem, Presented by Lia... - Lucidworks (Archived)
The document discusses the development of a new search system for PubChem to allow for exploration of multidimensional biomedical data. The new system was needed to address the challenges of handling large and heterogeneous datasets with many relationships between data types in a way that allows for fast querying. The system leverages Apache Solr to provide features like full-text search, faceting, molecule structure searching, and joining of related data. It includes backend components like Solr, SQL, and specialized search engines as well as web APIs and frontend interfaces like reusable widgets and a new search interface.
Unstructured Or: How I Learned to Stop Worrying and Love the XML, Presented... - Lucidworks (Archived)
This document discusses Solr, an open source search platform from the Apache Lucene project. It provides full-text search, faceted search, auto-suggest capabilities, and supports multiple file formats for document indexing. The document outlines Solr's architecture and components, provides usage examples from large government sites, and recommends related open source tools.
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ... - Lucidworks (Archived)
This document discusses building a lightweight discovery interface for Chinese patents. It describes using parsers and the cloud to ingest various patent file formats and metadata in order to build a search interface. It emphasizes spending adequate time on user experience design and sharing data with users and other applications.
Big Data Challenges, Presented by Wes Caldwell at SolrExchange DC - Lucidworks (Archived)
ISS is a software solutions company that provides big data management tools to Department of Defense and intelligence community customers. They have over 800 employees across several US offices. Their solutions are reusable, license-free for the US government, and scalable from single users to large networks with thousands of users. Customers have thousands of heterogeneous data sources that create data at an increasing rate, making effective search and analytics tools necessary to help analysts extract useful information and actionable intelligence from large amounts of unstructured data in tactical environments. ISS argues that search must be the cornerstone of an effective big data strategy, allowing normalization, indexing, and semantic search of content to help analysts focus their efforts and gain insights from large data sets.
What's New in Lucene/Solr, Presented by Grant Ingersoll at SolrExchange DC - Lucidworks (Archived)
Lucene and Solr 4.8 include improvements to speed, flexibility, and scalability. Key updates include native near real-time support in Lucene, faster indexing with document writer per thread, and improved fuzzy and wildcard query processing. Solr 4 offers new faceting, geospatial, and distributed capabilities. Both projects provide easier configuration and more pluggable scoring and indexing options to improve search relevance and performance.
This document summarizes Sean Timm's presentation on Solr and Lucene at AOL. It discusses AOL's history with search technologies including using Open Directory Project (ODP) and building search into AOL Server using their own retrieval model (CPL). It describes AOL's contributions to Solr/Lucene including the Data Import Handler. It provides recommendations for contributing to the Solr/Lucene community such as answering questions, improving documentation, and submitting patches. It highlights some of AOL's applications of Solr like search for MapQuest, AIM, Mail, and analyzing Sarah Palin's emails.
This document provides an introduction to SolrCloud, which enables horizontal scaling of a Solr search index using sharding and replication. Key terminology is defined, including ZooKeeper, nodes, collections, shards, replicas, and leaders. The document outlines the high-level SolrCloud architecture and discusses features like sharding, document routing, replication, distributed indexing and querying. Challenges around consistency and availability are also covered.
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
Doug discusses challenges with collaboration between search developers and content experts when optimizing search relevancy. The current process of developers making changes and experts having to wait a week for results is inefficient. Doug proposes applying test-driven development principles to search by having experts continuously test search results and provide feedback on changes in real-time. This allows developers to get immediate feedback and ensures changes are improving search quality. Doug's company built a tool called Quepid that implements this approach to enable better collaboration between experts and developers when optimizing search.
This document discusses building a data-driven log analysis application using LucidWorks SILK. It begins with an introduction to LucidWorks and discusses the continuum of search capabilities from enterprise search to big data search. It then describes how SILK can enable big data search across structured and unstructured data at massive scale. The solution components involve collecting log data from various sources using connectors, ingesting it into Solr, and building visualizations for analysis. It concludes with a demo and contact information.
Building a data driven search application with LucidWorks SiLK
Embedding CPython in Solr
1. MontySolr:
Embedding CPython in Solr
Roman Chyla, CERN
roman.chyla@cern.ch, May 26, 2011
Thursday, May 26, 2011
2. Why should I care?
- Our challenge is to connect Python and Java
- Without compromises
- We created MontySolr extension
- Robust, tested (will be used by our system)
- But works for any Python application (eg. Django)
- And for any C/C++ app that Python understands!
- Open source (GPL v2)
- Try it out!
- https://github.com/romanchyla/montysolr
3. Outline
‣ Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
4. CERN
- European Organization for Nuclear Research
- Switzerland, Geneva
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide
12. SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
- The first web outside Europe/CERN
- The first database on web
16. Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
- http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
- 3 000 authors per article
- Powerful search engine
- Incl. citation map analysis
- Written in Python (since 2001)
- 290 000 lines of code
17. Outline
- Context
‣ The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
- Wrap-up
18. The Challenge
- HEP scientific community
- Searches metadata oriented
- However fulltexts are changing the situation
- And we want to provide even better service
- Bigger volumes of data
- NLP processing
- Semantic search
29. The Challenge
Query: supersymmetry AND author:ellis
- Invenio: fulltext:supersymmetry
- IDs: 1;2;3;9.... (1-6M IDs)
- 1. only IDs, no score = no ranking
- 2. score merging difficult (if available)
- 3. push IDs? (eg. faceting)
30. What is the “best” solution?
- We love Python...
- ...and our applications are written in Python...
- But what if Solr is the master search engine?
- Merge results inside Solr?
- Typical size: 1-10 mil. IDs
- Expected latency: 1-2 s.
- What we want to achieve:
- Fast transfer of hits from Invenio to Solr
- Leverage the power of both (no compromises)
- Developer-friendly integration, simplicity
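The merge step described above can be illustrated in a few lines. This is a minimal sketch, not MontySolr's implementation: it assumes each engine returns a plain list of non-negative integer record IDs, and it packs them into a Python integer used as a bitset so that the AND of two large ID sets is a single bitwise operation. The function names are hypothetical.

```python
def ids_to_bitset(ids, max_id):
    """Pack non-negative record IDs into one int; bit i is set when ID i is a hit."""
    buf = bytearray(max_id // 8 + 1)
    for i in ids:
        buf[i >> 3] |= 1 << (i & 7)
    return int.from_bytes(buf, "little")


def bitset_to_ids(bits):
    """Unpack a bitset back into a sorted list of record IDs."""
    raw = bits.to_bytes((bits.bit_length() + 7) // 8, "little")
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(raw)
        if byte
        for bit in range(8)
        if (byte >> bit) & 1
    ]


def merge_hits(metadata_ids, fulltext_ids):
    """AND two hit lists, e.g. author:ellis (Invenio) with fulltext:supersymmetry."""
    max_id = max(max(metadata_ids, default=0), max(fulltext_ids, default=0))
    both = ids_to_bitset(metadata_ids, max_id) & ids_to_bitset(fulltext_ids, max_id)
    return bitset_to_ids(both)


merged = merge_hits([1, 2, 3, 9], [2, 9, 10])
```

A bitset carries no score, which is exactly the ranking limitation noted on the previous slide; it only answers which records match both clauses.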
31. Outline
- Context
- The Challenge
‣ Key components
- Available technologies
- Our approach
- Evaluation
- Demonstration
- Wrap-up
32. To embed Solr (in Java app)
- Your app simulates Java web container?
- use EmbeddedSolrServer
- It knows nothing about Java servlets?
- use DirectSolrConnection class
- Maybe we are too lazy?
- Embed the web container (in my case Jetty)
- Seemed strange (webserver inside webserver)
- ... but it worked well
37. To use Solr in non-Java app
- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
- Pyro, execnet, CORBA, SOAP...
- or simply pipes?
- Access Python from Java?
- Jython
- JEPP
- Access Java from Python?
- JPype
- JCC
38. Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded
- C modules will not work
- but see http://bit.ly/iTRYbb
- Slower than CPython
41. JEPP - Java Embedded Python
- Python code runs inside Python interpreter
- Embeds CPython interpreter via Java Native Interface (JNI) in Java
- http://jepp.sourceforge.net/
- recently updated (27-Jan)
- but JCC is more active
43. JCC
- Embeds JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: complete Python extension module
47. To use Solr in non-Java app
Feature              Jython  JCC  JEPP
Python C modules             ✓    ✓
Speed                        ✓    ?
No code changes              ✓    ✓
Access from Python   ✓       ✓
Access from Java     ✓       ...  ✓
48. The first try
Invenio
Solr
JCC
49. The devil is in the details...
50. GIL - Global Interpreter Lock
Unfortunately a Python webapp is not like Java...
We can have 200 threads, but only 4 will run at a time...
53. Fortunately a solution exists
- JCC can embed Python inside Java
- Special thanks to Andi Vajda! (JCC creator)
- We write ‘empty’ classes in Java ...
- ... and implement them in Python
[Diagram: Python /w Java inside; Java /w Python inside]
54. The second try
[Diagram: Invenio frontend; XML; Solr /w Invenio (backend); JCC]
55. Implementing the bridge
- Special Java class
- With method pythonExtension()
- Native method pythonDecRef()
- JCC provides its implementation
- And a number of other native methods
- These will be implemented using Python
- Like writing JNI Java/C code but without compilation...
56. MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
- Modules must be built in shared mode
- JCC dynamic library loaded and started from the main thread
- Simple mechanism of the Python bridge and message
- Configurable handlers on the Python side
- Secured dereferencing of the native objects
- Threading on the Java side
- Multiprocessing on the Python side
- Easy ant targets (compilation) ...
57. Hello World - Java part
public class MontySolrBridge extends BasicBridge implements PythonBridge {

    // Pointer to the Python object that implements this class (set by JCC)
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    // Drop the reference to the Python object when this one is collected
    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        try {
            receive_message(message);
        } finally {
            // always release the thread state, even if the Python side fails
            vm.releaseThreadState();
        }
    }

    // Implemented on the Python side
    public native void receive_message(PythonMessage message);
}
58. Hello World - Python part
from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):
    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        # read a parameter set on the Java side and send back a result
        query = message.getParam('query')
        message.setResults('Hello world!')
        print('Python received from Java: %s' % query)
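To try the handler logic without a JVM, one could exercise it against a stand-in message object. The sketch below is an illustration under assumptions: `FakeMessage` is a hypothetical substitute that mimics only the `getParam`, `setResults` and `getResults` calls that appear on these slides, and `receive_message` is written as a plain function rather than a method of the JCC-generated `MontySolrBridge` class.

```python
class FakeMessage:
    """Hypothetical stand-in for PythonMessage; mimics only the calls used on the slides."""

    def __init__(self, **params):
        self._params = dict(params)
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


def receive_message(message):
    """Same steps as SimpleBridge.receive_message, without the bridge base class."""
    query = message.getParam('query')
    message.setResults('Hello world!')
    return query


msg = FakeMessage(query='refersto:author:ellis')
received = receive_message(msg)
```

After the call, `msg.getResults()` holds the value that the Java side would read back from the real message.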
59. Example - running MontySolr
- Java side
- JRE (32/64 bit)
- Standard Solr/Lucene jars
- JCC dynamic library
- Python side
- Python interpreter (32/64 bit)
- 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
- First we load JCC
- Then start Python interpreter ...
- ... load Python handlers
60. Solr as search service
[Diagram: Invenio frontend; XML; Solr /w Invenio (backend); JCC]
62. Example
[Diagram: query refersto:author:ellis; Solr; MyCustom Handler]
63. Example - Solr custom handler
PythonMessage msg = MontySolrVM.INSTANCE
        .createMessage("perform_search")
        .setSender("Invenio")
        .setParam("query", "refersto:author:ellis");
// hand the message over to the Python side
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) result;
}
65. Example - JNI connection
[Diagram: query refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers]
66. Example - Python side
# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))

# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search', perform_search)
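The register-at-startup, dispatch-at-search-time pattern on this slide can be sketched in plain Python. This is an illustration under assumptions, not MontySolr's code: `HANDLERS`, `register`, `dispatch` and `Message` are hypothetical names, the 'sender:recipient' key format is inferred from the 'Invenio:perform_search' string on the slide, and a fixed list stands in for `call_real_search`.

```python
HANDLERS = {}


def register(target, func):
    """Make a handler visible under a 'sender:recipient' key."""
    HANDLERS[target] = func


class Message:
    """Hypothetical message carrying routing info, parameters and results."""

    def __init__(self, sender, recipient, **params):
        self.sender = sender
        self.recipient = recipient
        self.params = params
        self.results = None


def dispatch(message):
    """Look up the handler for this message and run it."""
    key = '%s:%s' % (message.sender, message.recipient)
    handler = HANDLERS.get(key)
    if handler is None:
        raise KeyError('no handler registered for %s' % key)
    handler(message)
    return message.results


def perform_search(message):
    # the query would be passed to the real search; this sketch ignores it
    query = message.params['query']
    message.results = [1, 2, 3, 9]


register('Invenio:perform_search', perform_search)
hits = dispatch(Message('Invenio', 'perform_search', query='refersto:author:ellis'))
```

Keeping the routing key as a plain string means the Java side needs no knowledge of the Python functions behind it.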
67. Example
[Diagram: query refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers; Invenio (x4)]
68. Example - Java side again
PythonMessage msg = MontySolrVM.INSTANCE
        .createMessage("perform_search")
        .setSender("Invenio")
        .setParam("query", "refersto:author:ellis");
// hand the message over to the Python side
MontySolrVM.INSTANCE.sendMessage(msg);
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) result;
}
69. Solr as search service
[Diagram: Apache webserver; XML; Solr /w Invenio (backend); Invenio; Invenio; JCC]
70. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
‣ Evaluation
- Wrap-up
74. Robust?
- Extensive siege tests show very good performance and stability under high load
- 100-200 users, complex searches
- 50 concurrent users, citation analysis
- JCC incurs small overhead
- We detected no memory leaks
- The same as dbpedia.org
- But watch out for errors in C
- An error in C module brings down the whole JVM
- (errors in pure Python module can be handled)
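The last point, that errors in pure Python modules can be handled, might look like the following on the Python side. This is a minimal sketch with hypothetical names (`safe_handler` and a dict-based message): an exception raised by a handler is caught and recorded as an error result instead of propagating towards the JVM. A crash inside a C extension cannot be caught this way.

```python
import functools
import traceback


def safe_handler(func):
    """Wrap a handler so that Python exceptions become error results."""

    @functools.wraps(func)
    def wrapper(message):
        try:
            func(message)
        except Exception as exc:
            message['error'] = '%s: %s' % (type(exc).__name__, exc)
            message['traceback'] = traceback.format_exc()
            message['results'] = None
        return message

    return wrapper


@safe_handler
def broken_search(message):
    raise ValueError('cannot parse query %r' % message['query'])


@safe_handler
def working_search(message):
    message['results'] = [1, 2, 3, 9]


failed = broken_search({'query': 'author:'})
ok = working_search({'query': 'author:ellis'})
```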
75. Easy to develop/maintain?
- Added complexity
- Java in the toolbox
- Need to compile C++ extensions
- Python/OS version dependencies
- For this we get
- Easy integration with Invenio
- The best of two applications
- A lot of features for free
- And we can control Solr from Python!
76. Outline
- Context
- The Challenge
- Key components
- Available technologies
- Our approach
- Problems solved
- Evaluation
‣ Wrap-up
77. Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
- So we had to plug Python into Solr
- And now our Solr knows citation analysis!
- We created MontySolr extension
- Robust, tested (will be used by INSPIRE)
- Works for any Python application (eg. Django)
- And for any C/C++ app that Python understands!
- Free software license
- Try it out! Help us make it better!
- https://github.com/romanchyla/montysolr
78. Questions?
- MontySolr
- https://github.com/romanchyla/montysolr
- Roman Chyla
- Fellow, CERN Scientific Information Service
- roman.chyla@cern.ch
- @rchyla
- https://svnweb.cern.ch/trac/rcarepo
80. Links
- Invenio platform
- http://invenio-software.org/
- INSPIRE Digital library
- http://inspirebeta.net/
- Diagrams of JCC and JEPP
- Andreas Schreiber : Mixing Java and Python
- http://www.slideshare.net/onyame/mixing-python-and-java
- On Jython C Extension API
- http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
- http://insdev01.cern.ch
81. #1 - How to embed Solr (standard)
- solr.client.solrj.embedded.EmbeddedSolrServer
82. #2 - How to embed Solr (simplified)
- solr.servlet.DirectSolrConnection
- like previous, but simpler
- all the queries are sent as strings, everything is just a string
- very flexible and probably suitable for quick integration
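Since everything passed to `DirectSolrConnection` is a string, the caller only has to assemble a path with URL-encoded parameters. The helper below is a hypothetical sketch of that assembly step: `build_request` is not part of Solr, and `/select` is assumed to be the request handler path.

```python
from urllib.parse import urlencode


def build_request(path, **params):
    """Join a handler path and URL-encoded parameters into one request string."""
    return '%s?%s' % (path, urlencode(sorted(params.items())))


request = build_request('/select', q='supersymmetry AND author:ellis', rows=10)
```

Sorting the parameters only makes the resulting string deterministic; the order carries no meaning.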
84. #3 - Example of a Solr custom handler
85. #4 - Example Python handler