My talk at the ACM Data Mining Unconference on 01 Nov 2009. How to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost effective, scalable and reliable web mining.
Rapid Application Development with SwiftUI and Firebase – Peter Friese
Firebase is Google's mobile development platform for Android, iOS, and the web. SwiftUI is Apple's user interface toolkit that lets developers design apps in a declarative way. In this session, we will bring the two together and take a look at how easy it is to develop a new application from scratch.
Slides for my talk at heise MacDev 2019 (https://heise-macdev.de/lecture.php?id=8509)
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ... – Mikael Svenson
The query framework of SharePoint 2013 is a vast one, and it takes time to learn and master. In this session, you will get an overview of the latent capabilities with query rules and learn how you can maximize the use of query rules when building search driven pages using the Content by Search web part. The session is built around my blog series “SharePoint Search Queries Explained”
http://techmikael.blogspot.com/2014/03/sharepoint-search-queries-explained.html
Building an unstructured data management solution with elastic search and ama... – mobiusservices
Learn how to manage unstructured data by building a document database with document and page indexing and retrieval solutions using Elasticsearch and Amazon Web Services
Introduction Presentation about NoSQL
Agenda:
- Why NoSQL
- What is NoSQL
- Distribution Models
- The CAP Theorem
- NoSQL Types
- NoSQL or Relational or Both
- Demo!
Consuming External Content and Enriching Content with Apache Camel – therealgaston
While AEM Solr Search provides a framework for indexing and searching content within AEM, it does not address other real-world use cases, such as indexing and searching content external to AEM (e.g. products). Second, it assumes that the final indexable AEM document will be produced entirely by AEM. This is often not the case, as advanced search applications typically need to enrich the document prior to indexing using external data sources.
In this talk we will extend the AEM Solr Search reference architecture to include document processing capabilities using Apache Camel. As an example, two real-world use cases will be provided: 1) ingesting an external product data set via Apache Camel into a shared Solr instance and delivering the results via AEM, and 2) enriching AEM content with analytics and ratings data for the purpose of applying popularity boosting.
These are the slides of my talk "Building a SPA in 30 min", given at NoSQL Matters CGN 2014.
It is about creating the backend for a single-page web application built in AngularJS. The backend is built with Foxx on top of ArangoDB, a framework for creating a RESTful backend with only a few lines of code.
How to migrate from any CMS (thru the front-door) – ICF CIRCUIT
Chris Rockwell, University of Michigan
Based on lessons learned, a presentation of some nifty techniques for expediting and automating content migration leveraging Ruby, Cucumber, Selenium, Capybara, CURB, and the SlingPostServlet
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
Do you need an external search platform for Adobe Experience Manager? – therealgaston
Experience Manager provides some basic search capabilities out of the box. In this talk, we'll explore an external search platform for implementing an Experience Manager powered, search-driven site. As an example, we will use Apache Solr as a reference implementation and describe best practices for indexing content, exposing non-Experience Manager content via search, delivering search-driven experiences, and deploying the solution in a production setting.
John Hammink's talk at Great Wide Open 2016. We discuss: 1) the need for data analytics infrastructure that can scale exponentially, 2) what such an infrastructure must contain, and finally 3) the need for the infrastructure to be able to handle un- and semi-structured data.
MongoDB and Hadoop: Driving Business Insights – MongoDB
MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...) – Stéphane Fréchette
How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not-so-familiar tools you can use today for moving and integrating Big Data, and will also outline the underlying technologies and platform (an introduction to Big Data, Hadoop, HDInsight, and tools). We will compare and outline options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.
PDF version (with notes) of my talk at the ACM Data Mining Unconference on 01 Nov 2009. How to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost effective, scalable and reliable web mining.
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic... – Cloudera, Inc.
Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data -- all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight into your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based... – Mohamed Zaki
Complexity surrounding the holistic nature of customer experience has made measuring customer perceptions of interactive service experiences challenging. At the same time, advances in technology and changes in methods for collecting explicit customer feedback are generating increasing volumes of unstructured textual data, making it difficult for managers to analyze and interpret this information. Consequently, text mining, a method enabling automatic extraction of information from textual data, is gaining in popularity. However, this method has performed below expectations in terms of depth of analysis of customer experience feedback and accuracy. In this study, we advance linguistics-based text mining modeling to inform the process of developing an improved framework. The proposed framework incorporates important elements of customer experience, service methodologies, and theories such as cocreation processes, interactions, and context. This more holistic approach for analyzing feedback facilitates a deeper analysis of customer feedback experiences, by encompassing three value creation elements: activities, resources, and context (ARC). Empirical results show that the ARC framework facilitates the development of a text mining model for analysis of customer textual feedback that enables companies to assess the impact of interactive service processes on customer experiences. The proposed text mining model shows high accuracy levels and provides flexibility through training. As such, it can evolve to account for changing contexts over time and be deployed across different (service) business domains; we term it an ‘‘open learning’’ model. The ability to timely assess customer experience feedback represents a prerequisite for successful cocreation processes in a service environment.
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage – Ontotext
Many issues are faced by scholars, book researchers, and museum directors who try to find the underlying connections between resources. Scholars in particular continuously emphasize the role of digital humanities and the value of linked data in cultural heritage information systems.
Best Practices for Large Scale Text Mining Processing – Ontotext
Q&A:
NOW facilitates semantic search by having annotations attached to search strings. How complex does that get, e.g. with wildcards between annotated strings?
NOW’s searchbox is quite basic at the moment, but still supports a few scenarios.
1. Pure concept/faceted search - search for all documents containing a concept, or where a set of concepts co-occur. Ranking is based on frequency of occurrence.
2. Concept/faceted + full-text search - search for both concepts and a particular textual term or phrase.
3. Full text search
With search, pretty much anything can be done to customise it. For the NOW showcase we’ve kept it fairly simple, as usually every client has a slightly different case and wants to tune search in a slightly different direction.
The search in NOW is faceted which means that you search with concepts (facets) and you retrieve all documents which contain mentions of the searched concept. If you search by more than one facet the engine retrieves documents which contain mentions of both concepts but there is no restriction that they occur next to each other.
Is the tagging service expandable (say, with custom ontologies)? Also, is it something you offer as a service? It is unclear to me from the website.
The TAG service is used for demonstration purposes only. The models behind it are trained for annotating news articles. The pipeline is customizable for every concrete scenario, different domains, and entities of interest. You can access several of our pipelines as a service through the S4 platform, or you can have them hosted as an on-premises solution. In some cases our clients want domain adaptation, improvements in a particular area, or to tag with their internal dataset - in this case we again offer an on-premises deployment and also a managed service hosted on our hardware.
Does your system accommodate cluster analysis using unsupervised keyword/phrase annotation for knowledge discovery?
Insofar as patterns of user behaviour are also considered knowledge discovery, we employ these for suggesting related reads. Apart from that, we have experience tailoring custom clustering pipelines which also rely on features like keywords and named entities.
For topic extraction, how many topics can we extract? From a Twitter corpus, what can we infer?
For topic extraction we have determined that we obtain the best results when suggesting 3 categories. These are taken from IPTC, but only the uppermost levels, of which there are fewer than 20.
The Twitter corpus example is from a project Ontotext participates in, called Pheme. The goal of the project is to detect rumours and to check their veracity, thus helping journalists in their hunt for attractive news.
Do you provide Processing Resources and JAPE rules for the GATE framework that can be used with GATE Embedded?
We are contributing to the GATE framework, and everything which has been wrapped up as PRs has been included in the corresponding GATE distributions.
Amazon subsidiary Alexa.com is leveling the search playing field. For the first time, developers looking to build the next "big thing" in search or an ultra custom search engine have access to the 300 terabytes of Alexa crawl data, along with the utilities to search, process, and publish their own custom subset of the data - all at a reasonable price.
Today organizations find themselves in a data rich world with a growing need for increased agility and accessibility of all this data for analysis and deriving keen insights to drive strategic decisions. Creating a data lake helps you to manage all the disparate sources of data you are collecting in their original format and extract value. In this session learn how to architect and implement an Analytics Data Lake. Hear customer examples of best practices and learn from their architectural blueprints.
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS... – Amazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Organisations involved in Big Data and Analytics spend a lot of time preparing data for analysis which often involves large-scale movement and transformation. In this session we will explore AWS Glue, a new service designed to assist with the process of cataloging, transforming and scheduling for your data pipeline.
Speaker: Cassandra Bonner, Solutions Architect, Amazon Web Services
These slides are from my 2009 Fundamentals of Search workshop at KMWorld. Please contact me for information about search engines, consulting, workshops and training.
PoolParty is a thesaurus management system and a SKOS editor for the Semantic Web, including text mining and linked data capabilities. The system helps to build and maintain multilingual thesauri, providing an easy-to-use interface. The PoolParty server provides semantic services to integrate semantic search or recommender systems into systems like CMS, DMS, CRM, or wikis.
Big Data Architectural Patterns and Best Practices on AWS – Amazon Web Services
by Dario Rivera, Solutions Architect, AWS
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
This is a presentation I gave at Hadoop Summit San Jose 2014, on doing fuzzy matching at large scale using combinations of Hadoop & Solr-based techniques.
Our client helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. They have been able to cut monthly costs by more than 50%, improve response time by 4x, and quickly add new features by switching from a traditional DB-centric approach to one based on Hadoop & Solr. This analysis is handled by a complex Hadoop-based workflow, where the end result is a set of unique, highly optimized Solr indexes. The data processing platform provided by Hadoop also enables scalable machine learning using Mahout. This presentation covers some of the unique challenges in switching the web site from relying on slow, expensive real-time analytics using database queries to fast, affordable batch analytics and search using Hadoop and Solr.
A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington DC this past November 2011. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/)
Presentation by Ken Krugler at the SDForum SAM SIG (Software Architecture & Modeling) meeting on Sept 22nd, 2010. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.
A tale of scale & speed: How the US Navy is enabling software delivery from l... – sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Communications Mining Series - Zero to Hero - Session 1 – DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GridMate - End to end testing is a critical piece to ensure quality and avoid... – ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... – SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
How to Get CNIC Information System with Paksim Ga.pptx – danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Removing Uninteresting Bytes in Software Fuzzing – Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 – Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... – Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Over the prior 4 years I had a startup called Krugle that provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the “programmer’s web” to find out information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. Several companies paid me to integrate Bixo into an existing data processing environment. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
This is the world that many of you live in: analyzing data to find important patterns. Here’s an example of output from the QlikView business intelligence tool. It was used to help analyze the relative prevalence of keywords on two competing web sites. Here you see two-word terms that often occur on McAfee’s site, but not on Symantec’s - which is very useful data for anybody who worries about search engine optimization.
You all know about analyzing data to find important patterns that get managers all worked up…
But how do you get to this point? How do you use the web as the source for the data that you’re analyzing? That’s what I’m going to be talking about here.
Quick intro to web mining, so we’re on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
It’s common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a “white list” of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages. Content comes in a variety of formats - typically HTML, but also PDF, Word, zip archives, etc. You need to parse these formats to extract key data - typically text, but it could be image data. Often the analyze step will include aspects of machine learning - classification, clustering. “Useful data” covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
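As a concrete illustration of the parse step, here is a minimal sketch using Apache Tika (the same parser library mentioned later in these notes for the mbox example). This is not code from the talk; it assumes Tika is on the classpath and that a fetched page has been saved to a local file passed as the first argument.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ParseStep {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format (HTML, PDF, Word, zip, ...) and extracts plain text.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler text = new BodyContentHandler(10 * 1024 * 1024); // cap extracted text at ~10MB
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, text, metadata, new ParseContext());
        }
        // The detected content type and the extracted text are the inputs to the analyze step.
        System.out.println("Content-Type: " + metadata.get("Content-Type"));
        System.out.println(text.toString());
    }
}
```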
What are the key differences between web mining and traditional data mining? I’m saying “traditional” because the face of data mining is clearly changing. But if you look at most vendor tools, the focus is on what I’d call “traditional data mining”. Scale - 10M is big for data mining, but not for web mining. Access - with DM, once you defeated Mongor, keeper of database access keys, you were golden. Web pages are typically public, but the web is a shared resource, so implicit rules apply - like “don’t bring my web site to its knees”. Mining for data (rather than for search) breaks the traditional implicit contract, so extra caution applies. The implicit contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with DM, there often isn’t an index as the end result. With mining DBs, there’s explicit structure, which is mostly lacking from web pages.
If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately want to process from the web. If you can’t create real workflows, it will never be reliable or efficient. If you don’t use specialized web crawling code, you’ll get blacklisted. Because you’re trying to distill down large data, there’s often some custom processing. If you don’t run it in a cloud environment, you’ll be wasting money - and I’ll explain why in a few slides.
I’m focusing on one particular solution to the challenges of web mining that I just described. It’s the “HECB” stack. I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop…but the acronym didn’t work as well.
At Krugle we ran two clusters: one of 11 servers, and a smaller 4-server cluster. In the end, our actual utilization ratio was probably < 20%. Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break-even would be 250 to 1000 servers (the same 50-200 range divided by the 20% utilization). Mining for search doesn’t work so well in this model - the cluster should be always crawling (ABC), so it’s not as bursty. And transferring raw content, parses, and the index will generate lots of transfer charges. But for web mining that’s focused on data mining, the data is distilled, so this isn’t an issue.
Map-reduce - how do you parallelize the processing of lots of data so that you can do the work on many servers? The answer is map-reduce. HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner, and how do you make sure the data (the big stuff) moves as little as possible during processing? The answer is the Hadoop distributed file system. It’s open source, so there’s lots of support, consultants, rapid bug fixes, etc. Large companies are using it, especially Yahoo. Elastic MapReduce is a special service built on top of EC2 where it’s easier to run Hadoop jobs, because you have access to pre-configured Hadoop clusters, special tools, etc.
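To make the map/reduce split concrete, here is the canonical word-count job against the Hadoop Java API - a minimal sketch, not code from the talk: the map half runs in parallel across servers and emits (word, 1) pairs, and the reduce half sees all pairs for a given word together and sums them.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Map: emit (word, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Reduce: all counts for the same word arrive together; sum them.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```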
If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex. Because you can build workflows out of a mix of pre-defined and custom pipes, it’s a real toolkit. Chris explains it as: MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) A key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds): it finds cases where fields aren’t available for operations. That solves a key problem we ran into when customizing Nutch at Krugle.
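For comparison, here is the same word count expressed as a Cascading flow - a minimal sketch assuming the classic Cascading 1.x API (taps, pipes, fields), not the actual workflow from the talk. Note there is no map or reduce in sight: you describe operations on tuples with named fields, and Cascading plans them into map-reduce jobs.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
    public static void main(String[] args) {
        // Taps say where tuples come from and where they go (HDFS paths here).
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        // The pipe assembly: split lines into words, group by word, count each group.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // Cascading checks the DAG (fields available to each operation) and then runs it.
        Flow flow = new FlowConnector(new Properties())
                .connect("word-count", source, sink, assembly);
        flow.complete();
    }
}
```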
Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. Polite yet efficient - there’s a tension between those two goals that’s hard to resolve. If you do a crawl of any reasonable size, you’ll run into lots of errors. Even if a web server says “I swear to you, I’m sending you a 20K HTML file in English”, it may turn out to be a 50K text file in Russian using the Cyrillic character set. And because it’s open source, you get the benefit of a community of users who contribute re-usable toolkit components.
Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious. Which, obviously, it’s not. And in fact the full workflow is a bit bigger, as I left out the second stage that covers more of the keyword analysis. But the key point is that the blue items are provided by Cascading, and the green items are provided by Bixo. So what’s left are the two yellow items, which represent the two points of customization.
There were two main pieces of custom code that needed to be written. One was some URL filtering to focus on the right content inside the web sites: avoiding non-English pages by specific URL patterns, and the same kind of thing for forums and such, since those pages weren’t part of what could easily be optimized. And if enough people need this type of support, then since Bixo is open source it will likely become part of the toolkit.
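The filtering code itself isn’t in the deck. Purely as an illustration, a URL filter along those lines might look like the following - the class name, the accept() method, and the patterns are all hypothetical, not Bixo’s actual filter API.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical URL filter of the kind described above: skip pages that are
 * unlikely to matter for English-language keyword analysis. Illustrative only.
 */
public class KeywordCrawlUrlFilter {
    // Illustrative patterns; real ones would be tuned per site.
    private static final Pattern NON_ENGLISH =
            Pattern.compile("/(de|fr|es|ru|jp)/", Pattern.CASE_INSENSITIVE);
    private static final Pattern FORUMS =
            Pattern.compile("/(forum|forums|community)/", Pattern.CASE_INSENSITIVE);

    /** Return true if the URL should be fetched and analyzed. */
    public boolean accept(String url) {
        if (NON_ENGLISH.matcher(url).find()) {
            return false; // language-specific section we don't want to analyze
        }
        if (FORUMS.matcher(url).find()) {
            return false; // user-generated forum pages, not optimizable site content
        }
        return true;
    }
}
```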
Finally we can actually use a traditional data mining tool to help make sense of the digested data. Many things we could do in addition: clustering of results to improve keyword analysis (larger sites have “areas of interest”); identifying broken links and typos; identifying personal data - email addresses, phone numbers.
I try to limit presentations to 20 slides - so I’ve hit that limit. In the spirit of the unconference, let me know what you’d like to do next.
Let’s use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation decides to go for a quantitative approach for the HUGMEE award.
How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
Parsing the mod_mbox page is simple with Tika’s HtmlParser. I cheated a bit when parsing emails - some users, like Owen, have many aliases - so I hand-generated an alias resolution table.
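A minimal sketch of that Tika HtmlParser approach for pulling mbox links out of a mod_mbox index page - not the actual Bixo operation from the talk, and the "mbox" substring check is an assumption about how the archive names its links.

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

public class MboxPageParser {
    public static void main(String[] args) throws Exception {
        // Parse the mod_mbox index page and collect every link it contains.
        LinkContentHandler links = new LinkContentHandler();
        try (InputStream in = new URL(args[0]).openStream()) {
            new HtmlParser().parse(in, links, new Metadata(), new ParseContext());
        }
        // Keep only the links that point at monthly mbox archives.
        for (Link link : links.getLinks()) {
            if (link.getUri().contains("mbox")) {
                System.out.println(link.getUri());
            }
        }
    }
}
```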
We need to ignore “thanks” in the “thanks in advance for doing my job for me” signoff. Generate two tuples for each email: one with messageId/name/address, and one with the reply-to messageId/score. The group/sum aspect is a classic reduce operation.
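A conceptual sketch of that tuple/group/sum step using plain Java collections - the message IDs, addresses, and scores below are made up, and the real job ran as a Cascading flow over the parsed emails rather than in-memory maps.

```java
import java.util.HashMap;
import java.util.Map;

public class HelpfulnessScore {
    public static void main(String[] args) {
        // Tuple 1 per email: messageId -> sender address (who wrote the message).
        Map<String, String> senderByMessageId = new HashMap<>();
        senderByMessageId.put("msg-1", "helpful.hadooper@example.com");

        // Tuple 2 per reply: the messageId being replied to, with a score
        // derived from key phrases in the reply ("thanks", "that worked", ...).
        Map<String, Integer> scoreByRepliedToId = new HashMap<>();
        scoreByRepliedToId.merge("msg-1", 2, Integer::sum);
        scoreByRepliedToId.merge("msg-1", 1, Integer::sum);

        // Join on messageId, then group by sender and sum - the classic reduce step.
        Map<String, Integer> scoreBySender = new HashMap<>();
        scoreByRepliedToId.forEach((messageId, score) -> {
            String sender = senderByMessageId.get(messageId);
            if (sender != null) {
                scoreBySender.merge(sender, score, Integer::sum);
            }
        });
        scoreBySender.forEach((sender, total) -> System.out.println(sender + "\t" + total));
    }
}
```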
I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… The key point is that only the purple is stuff that I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see the next page. But only two custom operations were actually needed - parsing the mbox page and calculating the score. Running took about 30 minutes - mostly waiting until it was OK to politely do another fetch. We downloaded 150MB of mbox files and found 409 unique email addresses with at least one positive reply.
Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. Don’t want to be writing tricky code here. Could optimize, but that would be a mistake…most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
Example of the top-level pages that were fetched in the first phase. These then needed to be parsed to extract links to mbox files.
Example of one of the two custom operations: parsing the mod_mbox page. It uses Tika to extract IDs and emits a tuple with the URL for each mbox ID.
The curve looks right - exponential decay. 409 unique email addresses got some love from somebody.
And the winner is…Ted Dunning I know - I should have colored the elephant yellow.
A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used… hmm.