War stories from building the Global Patent Search Network, and why Data folks need to think more about UX and Discovery, and UX folks need to think more about Data.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the first public application deployed in the cloud, and allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• We will cover some of the innovative methods for converting XML formatted data to usable information.
• Parsing through 5 TB of raw TIFF image data and converting them to modern web friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cellduring document extraction? Query performance collapsing? Then you've searching at Big Data scale. This talk will focus on the underlying principles of Big Data, and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
At Basis Technologies Open Source Search conference I talked about a project I did this past year, and talked about the lessons, both good and the bad that we learned.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next generation multilingual search platform for the USPTO. GPSN, http://gpsn.uspto.gov, was the first public application deployed in the cloud, and allowed a very small development team to build a discovery interface across millions of patents.
This case study will cover:
• How we leveraged Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
• We will cover some of the innovative methods for converting XML formatted data to usable information.
• Parsing through 5 TB of raw TIFF image data and converting them to modern web friendly format.
• Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
• How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.
Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cellduring document extraction? Query performance collapsing? Then you've searching at Big Data scale. This talk will focus on the underlying principles of Big Data, and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr.
At Basis Technologies Open Source Search conference I talked about a project I did this past year, and talked about the lessons, both good and the bad that we learned.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
DefCore: The Interoperability Standard for OpenStackMark Voelker
This presentation provides an introduction to the OpenStack DefCore Committee, which is working to create interoperability standards for OpenStack Powered clouds. You'll gain insight into the interoperability challenges of OpenStack clouds, and learn how DefCore creates it's Guidelines. Learn why the Technical Committee, Board of Directors, end users, and vendors have a seat at the table. You'll laugh, you'll cry, you'll immediately want to stop talking about cloud computing and go watch science fiction all night.
This talk was originally presented at the Triangle OpenStack Meetup Group's September 21, 2015 meeting in Durham, NC. A recording can be found here (this talk starts at the 46:10 mark): https://vmware.webex.com/vmware/lsr.php?RCID=a51f9e6882f54ccab8b715c8c0162484
A new revision with updates was given at a meeting of the China Open Source Cloud League on May 20, 2016 in Beijing. The slides here on Slideshare represent that presentation.
OpenStack: Toward a More Resilient CloudMark Voelker
Since it's inception over four years ago, OpenStack has become the most popular open source software for building many types of clouds in part due to the flexibility it provides. As more adoption increases, interest has increased in building OpenStack clouds on a highly available control plane infrastructure. In this talk we will provide an introduction to today's OpenStack community and software, then dive deeper into how to build more highly available, scalable OpenStack architectures. - See more at: http://www.percona.com/news-and-events/percona-university-smart-data-raleigh/openstack-toward-more-resilient-cloud#sthash.wicdUMdH.dpuf
Getting a Neural Network Up and Running with OpenLabMelvin Hillsman
Access to hardware for AI/ML for the everyday developer wanting to explore this field can be challenging to obtain and maintain for even the most rudimentary applications and testing. Needing to go beyond a single development machine running locally only increases this. OpenLab is curated infrastructure accessible to open source projects and individuals working within and on open source projects designed to help address this use case. Access to GPU, FPGA, IoT, and more, allows HPC, AI/ML, Deep Learning, or other testing and applications. In this presentation, we will walk through getting an account with OpenLab, obtaining resources, and getting a neural network up and running with an app that will bring back great childhood memories.
What's beyond Virtualization - The Future of Cloud PlatformsDerek Collison
My updated talk om the future of IT at QCon NY
What lies beyond virtualization? How do we start the journey to a secure, composeable, and trusted hybrid platform that truly delivers the business value and velocity we all want?
In the era of software-defined everything, one goal is to reach a fluid infrastructure that has the level of plasticity needed to self heal itself and provide higher level SLAs for applications and services. Adding value to existing applications and services in a transparent fashion requires a rethinking of core technologies in the platform space. In this talk we will take a look at some low level technologies and approaches to achieving this goal. Topics will range from Intelligent layer 7 SDN with semantic awareness, distributed scheduling algorithms, policy distribution and invalidation, health monitoring and management, self healing techniques, and the role of unsupervised deep machine learning and anomaly detection.
This is a talk given by Jason Hoffman at a workshop given by Joyent called "Scale With Rails" in 2006. There's quite a bit of prescience in this presentation, including the first documented use of ZFS in production ("Fsck you if you think ZFS isn't production") and of OS-based virtualization (zones) in the cloud (which, it must be said, was not called "cloud" in 2006).
[Presented at All Things Open 2015 in Raleigh, NC, USA]
OpenStack is one of the fastest-growing and exciting open source projects of our time. OpenStack has drawn together technologists from all over the world to create a cloud operating system and a huge, diverse community behind it. This talk will provide an introduction to OpenStack for newcomers to the project of those who just want to know more. We’ll take a brief look at OpenStack’s history, get a technical overview of the project, learn how to contribute, and check out a few emerging trends and hot topics in the OpenStack world.
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Georgia Tech cse6242 - Intro to Deep Learning and DL4JJosh Patterson
Introduction to deep learning and DL4J - http://deeplearning4j.org/ - a guest lecture by Josh Patterson at Georgia Tech for the cse6242 graduate class.
Trent Hornibrook gave a recent talk at the Infracoders meet-up playing a thought experiment with the audience on 'what would be your tech decisions if you were given a blank cheque at at startup'.
Trent, recently working for a start-up then shared what decisions he made, and why
Experienced executive with over 18 years of global experience and demonstrated success in driving IT value, business and IT, reducing overall costs, and quickly executing and delivering business solutions.
DefCore: The Interoperability Standard for OpenStackMark Voelker
This presentation provides an introduction to the OpenStack DefCore Committee, which is working to create interoperability standards for OpenStack Powered clouds. You'll gain insight into the interoperability challenges of OpenStack clouds, and learn how DefCore creates it's Guidelines. Learn why the Technical Committee, Board of Directors, end users, and vendors have a seat at the table. You'll laugh, you'll cry, you'll immediately want to stop talking about cloud computing and go watch science fiction all night.
This talk was originally presented at the Triangle OpenStack Meetup Group's September 21, 2015 meeting in Durham, NC. A recording can be found here (this talk starts at the 46:10 mark): https://vmware.webex.com/vmware/lsr.php?RCID=a51f9e6882f54ccab8b715c8c0162484
A new revision with updates was given at a meeting of the China Open Source Cloud League on May 20, 2016 in Beijing. The slides here on Slideshare represent that presentation.
OpenStack: Toward a More Resilient CloudMark Voelker
Since it's inception over four years ago, OpenStack has become the most popular open source software for building many types of clouds in part due to the flexibility it provides. As more adoption increases, interest has increased in building OpenStack clouds on a highly available control plane infrastructure. In this talk we will provide an introduction to today's OpenStack community and software, then dive deeper into how to build more highly available, scalable OpenStack architectures. - See more at: http://www.percona.com/news-and-events/percona-university-smart-data-raleigh/openstack-toward-more-resilient-cloud#sthash.wicdUMdH.dpuf
Getting a Neural Network Up and Running with OpenLabMelvin Hillsman
Access to hardware for AI/ML for the everyday developer wanting to explore this field can be challenging to obtain and maintain for even the most rudimentary applications and testing. Needing to go beyond a single development machine running locally only increases this. OpenLab is curated infrastructure accessible to open source projects and individuals working within and on open source projects designed to help address this use case. Access to GPU, FPGA, IoT, and more, allows HPC, AI/ML, Deep Learning, or other testing and applications. In this presentation, we will walk through getting an account with OpenLab, obtaining resources, and getting a neural network up and running with an app that will bring back great childhood memories.
What's beyond Virtualization - The Future of Cloud PlatformsDerek Collison
My updated talk om the future of IT at QCon NY
What lies beyond virtualization? How do we start the journey to a secure, composeable, and trusted hybrid platform that truly delivers the business value and velocity we all want?
In the era of software-defined everything, one goal is to reach a fluid infrastructure that has the level of plasticity needed to self heal itself and provide higher level SLAs for applications and services. Adding value to existing applications and services in a transparent fashion requires a rethinking of core technologies in the platform space. In this talk we will take a look at some low level technologies and approaches to achieving this goal. Topics will range from Intelligent layer 7 SDN with semantic awareness, distributed scheduling algorithms, policy distribution and invalidation, health monitoring and management, self healing techniques, and the role of unsupervised deep machine learning and anomaly detection.
This is a talk given by Jason Hoffman at a workshop given by Joyent called "Scale With Rails" in 2006. There's quite a bit of prescience in this presentation, including the first documented use of ZFS in production ("Fsck you if you think ZFS isn't production") and of OS-based virtualization (zones) in the cloud (which, it must be said, was not called "cloud" in 2006).
[Presented at All Things Open 2015 in Raleigh, NC, USA]
OpenStack is one of the fastest-growing and exciting open source projects of our time. OpenStack has drawn together technologists from all over the world to create a cloud operating system and a huge, diverse community behind it. This talk will provide an introduction to OpenStack for newcomers to the project of those who just want to know more. We’ll take a brief look at OpenStack’s history, get a technical overview of the project, learn how to contribute, and check out a few emerging trends and hot topics in the OpenStack world.
Wordnik's technical co-founder Tony Tam describes the reason for going NoSQL. During his talk Tony will discuss the selection criteria, testing + evaluation and successful, zero-downtime migration to MongoDB. Additionally details on Wordnik's speed and stability will be covered as well as how NoSQL technologies have changed the way Wordnik scales.
Georgia Tech cse6242 - Intro to Deep Learning and DL4JJosh Patterson
Introduction to deep learning and DL4J - http://deeplearning4j.org/ - a guest lecture by Josh Patterson at Georgia Tech for the cse6242 graduate class.
Trent Hornibrook gave a recent talk at the Infracoders meet-up playing a thought experiment with the audience on 'what would be your tech decisions if you were given a blank cheque at at startup'.
Trent, recently working for a start-up then shared what decisions he made, and why
Experienced executive with over 18 years of global experience and demonstrated success in driving IT value, business and IT, reducing overall costs, and quickly executing and delivering business solutions.
Ayudamos a nuestros clientes a construir negocios rentables, escuchando, consultando, aconsejando, implantando y entrenando.
Ayudamos en desarrollo de negocio, ventas y propuestas; asesoramos en outsourcing, gobierno de TI y procesos de negocio
Alfred Kärcher GmbH & Co. KG is a German family-owned company that operates worldwide and is known for its high-pressure cleaners, floor care equipment, parts cleaning systems, wash water treatment, military decontamination equipment and window vacuum cleaners.[1] Headquartered in Winnenden, Germany, it produces both cleaning equipment and full cleaning systems. The world market leader in cleaning technology, employs more than 10,000 people worldwide. In 2014 the company posted sales revenues of €2.12 billion ($2.84 billion) and sold 12.72 million machines. Kärcher has 100 subsidiaries in 60 countries.
Descárgate la última versión de la presentación corporativa de Abengoa, compañía internacional que aplica soluciones tecnológicas innovadoras para el desarrollo sostenible en los sectores de energía y agua.
Latest Corporate Presentation of Abengoa: The international company that applies innovative technology solutions for sustainable development in the energy and water sectors.
ERTMS Solutions is specialized in the development of innovative products for the railway signalling world, and for a majority of them compatible with the European railway signalization standard ERTMS/ETCS.
PJ Software, LLC is a multitudinous expert in providing web and mobile solutions for the whole volume of industry domains.
Expertise:
- by technologies and tools
OS's: Android, iOS, Windows, Linux, Mac OS
Languages: HTML5, CSS/CSS 3, JavaScript, PHP, Ruby on Rails, Groovy, Objective-C, Java, C/C++, XML
Databases: MySQL, Oracle, PostgreSQL, Mongo, JackRabbit
Technologies: AJAX, iPhone SDK, Cocoa Touch, Android SDK, Android Native Development Kit (NDK), Android DT (ADT)
Development platforms: Java SE/EE, LAMP PHP/MySQL, Rails, Grails
Applications and web servers: Apache, JBoss, TomCat, Oracle, Mongrel, Jetty, GlassFish
Development frameworks: jQuery, jQuery mobile, jQuery UI, Eclipse, BackBone, Zend, Symphony, Smarty, Spring, Grails, Velocity, Cactus, EJB, JMS, JPA, JSF, Tapestry, Scala, Seam, Perl, Rails, Sinatra
- by industry
Retail; Health care and life science; Information technology; Real estate; Telecommunications; Gambling; Hospitality, leisure & travel; Education and learning; Financial services; Agriculture
- by functional areas
Web, Mobile, Server
- by solution types
Web programing
Mobile app development
Ecommerce website development
Server/database installment and remote control
Web and application servers administration
Custom development and CMS
Ecommerce solutions
Web portals development
Corporate portals development
Application UI/UX
Web design
SEO
SMM
Contact us for details at: sales@pj-software.com
skype: phonejetsoftware
or
Leave us an inquiry with our Free Quote Form at http://pj-software.com/contacts/#Free quote form
This presentation was given by LinkedIn CEO, Jeff Weiner, at the Morgan Stanley Technology, Media & Telecom Conference 2014 in San Francisco, California on Monday, March 3, 2014. It outlines LinkedIn's value propositions for our members and customers, as well as detailing the vision of LinkedIn to create economic opportunity for all professionals in the world & the development of the Economic Graph.
Follow our new Economic Graph Showcase Page at: https://www.linkedin.com/company/linkedin-economic-graph
From a student to an apache committer practice of apache io tdbjixuan1989
This talk is introduce by Xiangdong Huang, who is a PPMC of Apache IoTDB (incubating) project, at Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays more and more important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and industrial Internet. Many companies have gradually increased their participation in the open source community. Developers with open source experience are increasingly valued and favored by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large number of valuable open source software and communities to the world.
The invited guests of this lecture are all from ASF community, including the chairman of the Apache Software Foundation, three Apache members, Top 5 Apache code committers (according to Apache annual report), the first Committer in the Hadoop project in China, several Apache project mentors or VPs, and many Apache Committers. They will tell you what the open source culture is, how to join the Apache open source community, and the Apache Way.
This presentation was given in one of the DSATL Mettups in March 2018 in partnership with Southern Data Science Conference 2018 (www.southerndatascience.com)
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
SharePoint Engage Phoenix 2017
Our SharePoint environment is a lot like many others – a SharePoint 2007 implementation that was used more as a file dump than a collaboration space. With minimal user adoption, we were never quite ready to implement 2010, with a pilot SharePoint 2010 implementation stalled out of the gate.
In the meantime, some content was put on Box and other services to address external collaboration needs. Business users needed more relevant search results, content databases had grown uncomfortably large, and access controls had become spaghetti. Fortunately, site sprawl wasn’t too bad… except that the reason for that was the low adoption.
SharePoint 2013 arrived to a perfect storm – business and technology needs to be addressed, content that needs to be brought back in-house, and user adoption that needs to be improved. Time to upgrade!
See how we approached the upgrade, the issues than needed to be addressed, and the questions that needed to be answered.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
Mining data requires a deep investment in people and time. How can you be sure you’re building the right models? What tools help you connect with the customer’s needs? With this hands-on presentation, you’ll learn a flexible toolset and methodology for building effective analytics applications. Agile Data (the book) shows you how to create an environment for exploring data, using lightweight tools such as Python, Apache Pig, and the D3.js (Data-Driven Documents) JavaScript library. You’ll learn an iterative approach that allows you to quickly change the kind of analysis you’re doing, as you discover what the data is telling you. All the example code in this book is available as working web applications. We will cover how to: * Build an application to mine your own email inbox * Use different data structures and algorithms to extract multiple features from a single dataset, and learn how different perspectives can yield insight * Rapidly boot your applications as simple front-ends to a document store * Add features driven by descriptive and inferential statistics, machine learning, and data visualization * Gather usage data and talk to real users to help guide your data-driven exploration
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and Javascript. We also review common scenarios and advanced Data Sciences scenarios.
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will be sharing our journey of transformation showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development, provisioning data through APIs, streams and deploying models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale, integrating analytics with our core product platforms to turn data into actionable insights.
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems
Presenter: John Andleman, Staff Database Engineer, Citrix
In this session, John will share some interesting use cases leveraging the HPCC Systems platform, including those beyond traditional big data uses. John will also share his roadmap of HPCC projects being planned for the next few months and why he feels HPCC Systems is a more suitable solution than Hadoop based on experiences and lessons learned.
NOTE: The video of this presentation is the 3rd one shown in the accompanying YouTube link.
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Similar to Searching Chinese Patents Presentation at Enterprise Data World (20)
Smarter search drives value to your business. Delivering search that matches users to the right content is what you care about. But organizations often get stuck getting there. It turns out that you need quite a number of very different ingredients to deliver tremendous search. It can make your head spin! To help you think through where your team is on its road to smarter search, Pugh introduces the maturity model used by OpenSource Connections and walks you through a very concrete method to inventory needed skills and translate that into search roles for your team. He shows how to measure your capabilities in key areas of search to drive better ROI from search.
The right path to making search relevant - Taxonomy Bootcamp London 2019OpenSource Connections
Three aspects of search quality; focusing on relevance; why this is not just a technology problem; measuring search maturity & relevance; open source tools and techniques; Solr and Elasticsearch
Payloads have been a powerful aspect of Lucene for a long time, but have only had limited exposure in Solr. The Tika project has only recently finished integrating the powerful Tesseract OCR library, bringing the prospect of OCR to the masses.
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlOpenSource Connections
Over the past year, the POLITICO team has developed a recommendation system for our users, which recommends not only news content to read but also news topics to subscribe to. This talk will discuss our development path, including dead-ends and performance trade-offs. In the end, the team produced a system based on search technology (in our case, Elasticsearch) and refined by machine learning techniques to achieve a balance between personalization and serendipity.
With the advent of deep learning and algorithms like word2vec and doc2vec, vectors-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, and not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, including LSH, vector quantization and k-means tree, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a lucene-based search engine such as Solr or Elastic Search.
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cased of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{ filter:["doc_type":"restaurant"], "query": { "boost": { "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)", "query": "bbq OR barbeque OR barbecue" } } }
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
For e-commerce applications, matching users with the items they want is the name of the game. If they can't find what they want then how can they buy anything?! Typically this functionality is provided through search and browse experience. Search allows users to type in text and match against the text of the items in the inventory. Browse allows users to select filters and slice-and-dice the inventory down to the subset they are interested in. But with the shift toward mobile devices, no one wants to type anymore - thus browse is becoming dominant in the e-commerce experience.
But there's a problem! What if your inventory is not categorized? Perhaps your inventory is user generated or generated by external providers who don't tag and categorize the inventory. No categories and no tags means no browse experience and missed sales. You could hire an army of taxonomists and curators to tag items - but training and curation will be expensive. You can demand that your providers tag their items and adhere to your taxonomy - but providers will buck this new requirement unless they see obvious and immediate benefit. Worse, providers might use tags to game the system - artificially placing themselves in the wrong category to drive more sales. Worst of all, creating the right taxonomy is hard. You have to structure a taxonomy to realistically represent how your customers think about the inventory.
Eventbrite is investigating a tantalizing alternative: using a combination of customer interactions and machine learning to automatically tag and categorize our inventory. As customers interact with our platform - as they search for events and click on and purchase events that interest them - we implicitly gather information about how our users think about our inventory. Search text effectively acts like a tag and a click on an event card is a vote for that clicked event is representative of that tag. We are able to use this stream of information as training data for a machine learning classification model; and as we receive new inventory, we can automatically tag it with the text that customers will likely use when searching for it. This makes it possible to better understand our inventory, our supply and demand, and most importantly this allows us to build the browse experience that customers demand.
In this talk I will explain in depth the problem space and Eventbrite's approach in solving the problem. I will describe how we gathered training data from our search and click logs, and how we built and refined the model. I will present the output of the model and discuss both the positive results of our work as well as the work left to be done. Those attending this talk will leave with some new ideas to take back to their own business.
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...OpenSource Connections
Recently Elasticsearch has introduced a number of ways to improve search relevance of your documents based on numeric features. In this talk I will present the newly introduced field types of "rank_feature", "rank_features" ,"dense_field", and "sparse_vector" and discuss in what situations and how they can be used to boost scores of your documents. I will also talk about the inner workings of queries based on these fields, and related performance considerations.
Haystack 2019 - Architectural considerations on search relevancy in the conte...OpenSource Connections
With an increasing amount of relevancy factors, relevancy fine-tuning becomes more complex as changing the impact of factors produces increasingly more unintended side effects. In recent years, there has been a lot of discussion about how learning algorithms can replace manual relevancy fine-tuning in order to manage this complexity. However, discussions about the challenge of relevancy should additionally consider architectural aspects. Especially microservice-based architectures provide many ways to encapsulate and to separate complexities of search solutions, which facilitates optimizing the search as well as locating and fixing problems.
Generally, relevancy factors can be assigned to three different groups, each handled at a different stage of the search request processing. The first group contains contextual factors that depend on certain characteristics of a query, such as query-related boosts lifting up top-sellers for queries or category-related boosts to distinguish products from their accessories. Such contextual factors can be handled as a step of the preprocessing of queries. The respective boosting information can simply be appended to the query before it is actually sent to the search engine. Ideally, the normalization of the query is done beforehand.
The second group contains factors that are considered for all queries in more or less the same way, e. g. a ranking function basing on keyword occurrences, product topicality or sales in total. Factors related to this group can be handled directly by configuring the search engine.
The third group contains situational factors. For instance, a certain product might be a good match for a certain query in general, but for situational circumstances it should not appear among the top five products (e. g. because it is out of stock). Such situational factors can be handled by resorting result sets, after they were returned by the search engine.
The handling of the different factors within successive stages of search request processing will be discussed from an architectural perspective. Implications for applying learning algorithms and the implementation of a personalized search will be considered.
Does your search application include a custom query syntax with various search operators such as Booleans, proximity, term or phrase frequency, capitalization, quoted text or as-is operator, and other advanced operators? Although most search applications offer a natural language-oriented search box, some advanced applications may also offer a custom query syntax for advanced users or automated tasks. The Lucene "classic" query operators that are supported by the Solr edismax query parser (Boolean, phrase with slop, wildcard, etc.) cover a good amount of use cases, but they only get you so far. In this talk, we will explore various strategies to support a custom and advanced query syntax in Solr, covering a spectrum of options from leveraging the out-of-the-box Solr query DSL, to a custom Solr query parser, and hybrid solutions in between. We will identify the options' pros and cons, discuss relevancy considerations, and illustrate the options in Java.
Haystack 2019 - Establishing a relevance focused culture in a large organizat...OpenSource Connections
For a relevance engineer one of the most difficult tasks in the tuning process is to convince others in the organization that this is a joint effort. Even the brightest search guru doesn't get very far when working in isolation, so establishing cross-collaboration through the organization is essential. But how to get there?
On top of that, in a large organization a relevance engineer often works on multiple seemingly unrelated search projects. The challenge is not to get drowned in building custom solutions for each project, but to design generic and re-usable strategies which solve many problems at once.
In this session we'll discuss how to build a widely supported basis for search quality improvements in an organization. It is full of practical tips and examples which could help you in establishing a cross-functional culture that is optimal for relevance tuning. It also zooms in on an holistic approach of solving multiple equivalent search issues at once.
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...OpenSource Connections
Relevance metrics like NDGC or ERR require graded judgements to evaluate query relevance performance. But what happens when we don't know what 'good' looks like ahead of time? This talk will look at using click modeling techniques to infer relevance judgements from user interaction logs.
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah ViaOpenSource Connections
The New York Times has had search for a long time but 2018 was the year in which the company engaged with relevance in a deep way. The aim of this talk is to share what we've learned as we've increased our search sophistication and some of the challenges we still face.
Some of the techniques we've adopted in this past year include offline metrics testing, reflective testing, and user engagement metrics. We now have a process in place to quickly get mappings changes out to production. As a team we now also have a vocabulary for talking about relevance and can use it to discuss trade-offs and goals in conjunction with our metrics.
We hope this talk is of use to those who've put off working on search relevance due to fear, uncertainty, or ambivalence. We will talk about how we went from working on everything but search relevance to finally pulling back the curtain on the search system. We hope what we've learned can help others get started.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Searching Chinese Patents Presentation at Enterprise Data World
1. Searching Chinese Patents:
Challenges and Solutions When Building
an Innovative Discovery Interface
ERIC PUGH | epugh@o19s.com | @dep4b
2. Who am I?
• Principal at OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary
• Member of Apache Software
Foundation
• SOLR-284 UpdateRichDocuments
(July 07)
8. Risks
• Cloud new at USPTO
• Discovery is tenuous concept
• Conflicting User Goals
• Fixed Budget: trade scope for
budget/quality
9.
10.
11.
12.
13.
14. Telling some stories
➡How to inject “Discovery” into your
app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
17. Grok data at gut level
Look for outliers
!
User Interviews
Surveys
Card Sorting
Scenarios/Personas
!
UX
Data
brainstorm
Mockups
Proof of concept
!
!
18. Where to spend time?
UX
Engine
Data
40%
!
20%
!
40%
!
40%
!
40%
!
20%
We spent
!
19. Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
21. Boy meets Girl Story
Metadata
Ingest
Pipeline
Discovery
UX
Content
Files
22. How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootsrap CSS)
Browsers
Mobile/
Tablet
Third Party
Application
Servers
S3 BucketSolr
23. Solr as a NoSQL
Datastore
• Used “atomic updates” to merge three
source datasets into single final dataset.
• All text displayed in application stored in
Solr.
• Dynamic schema supports many languages,
en, cn right now.
25. Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.
• We used S3Funnel which is “rsync like”
• We bought more network bandwidth for
our office
28. Think about DataVolume
• Started with older dataset, and tasks like TIFF -> PNG
conversion became progressively harder. Map/Reduce nice,
need more visibility into progress..
• Should have sharded our Search Index from the beginning
just to make indexing faster and cheaper process (500 gb
index!)
• 8 shards dropped time from 12 hours to 2 hours.
Merging took 5!
• We had too many steps in our pipeline
29. Building
a
Patents
Index
MachineCount
0
75
150
225
300
5 days 3 days 30 Minutes
1 5
300
32. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
35. Lot’s of File Types
• Sometimes in ZIP archives, sometimes not!
• multiple XML formats as well as CSV and
EDI
• Purplebook,Yellowbook,
Redbook,Greenbook, Questel, SIPO…
36. Tika as a pipeline!
• Auto detects content type
• Metadata structure has all the
key/value needed for Solr
• Allows us to scale up with
Behemoth project (and
others!).
38. Detector to pick File
public
class
GreenbookDetector
implements
Detector
{
!
private
static
Pattern
pattern
=
Pattern.compile("PATN");
@Override
public
MediaType
detect(InputStream
stream,
Metadata
metadata)
throws
IOException
{
!
MediaType
type
=
MediaType.OCTET_STREAM;
InputStream
lookahead
=
new
LookaheadInputStream(stream,
1024);
String
extract
=
org.apache.commons.io.IOUtils.toString(lookahead,
"UTF-‐8");
!
Matcher
matcher
=
pattern.matcher(extract);
!
if
(matcher.find())
{
type
=
GreenbookParser.MEDIA_TYPE;
}
!
lookahead.close();
return
type;
}
}
39. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
40. Your BigData solution
isn’t perfect
• Allow users to export data
• Most business users want to work in Excel.
Accept it!
• Allow other applications to build on top of
of your application.
41. GPSN has
• Lots of easy “Print to
PDF” options.
• Data stored in S3 as:
• individual patent files
• chunky downloads.
• Filtering to expand or
select specific data sets.
• Permalinks: simple, very
sharable URLs.
• Underlying Solr service
is exposed to public via
proxy. You can query
Solr yourself.
• Need advance querying?
Use Lucene syntax in
search bar.