1. NLP for the Web
Dr. Matthew Peters
@mattthemathman
(with thanks to Rutu Mulkar, Erin Renshaw, Dan Lecocq, Chris Whitten, Jay Leary and many others at Moz)
2. Moz is a SaaS company that sells SEO and Content Marketing software to professional marketers.
3. We crawl a lot: > 1 billion pages / day. We’d like to extract structured information from these pages.
16. Outline
• Motivation – what are some of the unique challenges and opportunities in doing NLP on the web?
• Main article extraction / page de-chroming
• Keyword extraction
• Author identification
• Conclusion
18. C1: Pages have clutter and unrelated text
[Screenshot callouts: navigation aids, ads, links to other articles]
19. C2: Text segments can confuse NLP components
Page text: Mike Lindblom Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker: “Mike Lindblom Bertha”, “State sues”, “the lawsuit”
20. C3: The web has a lot of cruft
• Many different standards and only partial adoption
• Wide variety of templates and sites
• Broken HTML
21. O1: Pages have additional structure
• Web pages have attributes other than the visual text: URL string, page title, meta description, etc.
• The HTML/XML has a tree structure we can use
24. General approach
1. Use an HTML parser to represent the page as a tree
2. Split the tree into small pieces and analyze each piece separately
3. Run NLP pipelines or other machine learning models on these small pieces to:
- focus attention on the important pieces
- extract structured information
- other task-dependent objectives
4. Need algorithms that efficiently process only the raw HTML (without JS, images, CSS, etc.)
27. Dragnet
• Combines diverse features with machine learning
• Open source: https://github.com/seomoz/dragnet
• v1 (2013): link/text density + CETR
blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/
paper: http://www2013.org/companion/p89.pdf
• v2 (2015): added Readability
blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/
28. 10,000-foot view
• Split the page into distinct visual elements called “blocks”
• Use machine learning to classify each block as content or non-content
Inspired by:
Kohlschütter et al., Boilerplate Detection using Shallow Text Features, WSDM ’10
Weninger et al., CETR – Content Extraction via Tag Ratios, WWW ’10
Readability (https://github.com/buriy/python-readability)
29. Splitting the text into blocks
• Use new lines in the HTML (\n) (CETR)
• Use subtrees (Readability)
• Flatten the HTML and break on <div>, <p>, <h1>, etc. (Kohlschütter et al. and us)
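A minimal Python sketch of the third strategy (flatten the HTML and break on block-level tags), assuming lxml is available; the tag set and function name are illustrative, not Dragnet's actual internals.

from lxml import html as lhtml

BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "li", "table"}  # illustrative subset
SKIP_TAGS = {"script", "style"}

def split_into_blocks(page):
    # Depth-first walk that starts a new block whenever a block-level tag opens.
    blocks = []
    def walk(el, current):
        if not isinstance(el.tag, str) or el.tag in SKIP_TAGS:
            return current  # skip comments, processing instructions, scripts
        if el.tag in BLOCK_TAGS and current:
            blocks.append(" ".join(current))
            current = []
        if el.text and el.text.strip():
            current.append(el.text.strip())
        for child in el:
            current = walk(child, current)
            if child.tail and child.tail.strip():
                current.append(child.tail.strip())
        return current
    leftover = walk(lhtml.fromstring(page), [])
    if leftover:
        blocks.append(" ".join(leftover))
    return blocks

split_into_blocks("<div><h1>Title</h1><p>Body text.</p>Trailing ad</div>")
# -> ['Title', 'Body text. Trailing ad']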
33. Text and link “density” are two important features. High text density and low link density → more likely to be content.
34. Compute the “smoothed tag ratio”: the ratio of # tags to # chars.
Compute the “smoothed absolute difference tag ratio”, dTR / dblock, which captures the intuition that main content occurs together.
Run k-means with 3 clusters on the blocks, with one centroid always pinned to (0, 0). Blocks in the (0, 0) cluster are non-content, the remainder content.
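A hedged sketch of that clustering step, taking the per-block tag ratios as given; the smoothing window, initialization and iteration count are assumptions, not values from CETR or Dragnet.

import numpy as np

def smooth(x, k=3):
    return np.convolve(x, np.ones(k) / k, mode="same")  # moving average

def label_blocks(tag_ratio, iters=20):
    tr = smooth(np.asarray(tag_ratio, dtype=float))
    dtr = smooth(np.abs(np.diff(tr, prepend=tr[0])))  # absolute difference ratio
    X = np.column_stack([tr, dtr])
    # k-means with 3 clusters; centroid 0 stays pinned at the origin
    centroids = np.vstack([[0.0, 0.0], X[np.random.choice(len(X), 2, replace=False)]])
    for _ in range(iters):
        labels = ((X[:, None, :] - centroids) ** 2).sum(-1).argmin(1)
        for j in (1, 2):  # update only the two free centroids
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels != 0  # True = content (outside the (0, 0) cluster)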
36. Use a simplified version of Readability:
• Compute a score for each subtree using:
- the parent’s id/class attributes
- the length of the text
• Find the subtree with the highest score
• Block feature = maximum subtree score over all subtrees containing the block
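A loose Python sketch of that scoring; the regexes and weights are illustrative guesses, not Readability's or Dragnet's actual values.

import re
from lxml import html as lhtml

POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|menu|nav|sidebar|widget", re.I)

def subtree_score(el):
    # score from id/class attributes plus (capped) text length
    attrs = (el.get("class") or "") + " " + (el.get("id") or "")
    score = 25 * bool(POSITIVE.search(attrs)) - 25 * bool(NEGATIVE.search(attrs))
    text = " ".join(el.itertext()).strip()
    return score + min(len(text), 1000) / 10.0

def best_subtree(page):
    root = lhtml.fromstring(page)
    return max(root.iter("div", "section", "article"), key=subtree_score, default=root)

# The per-block feature is then the maximum score over subtrees containing the block.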
42. Extract a ranked list of keywords from a page, with relevancy scores:
(91, 'bertha')
(61, 'stp')
(59, 'state sues')
(44, 'tunnel')
(37, 'wsdot')
(30, 'tunnel construction')
(28, 'the seattle times')
(17, 'seattle tunnel partners')
(13, 'repair bertha')
(10, 'transportation lawsuit')
43. Prior work
Many prior papers on similar tasks. Most use small data sets (hundreds of labeled examples) → unsupervised + supervised methods. Wide range of previous approaches, almost always tailored to a specific type of document (academic papers, etc.). Requirements for a “gold standard” are fuzzy.
Our approach:
Build a web-specific algorithm to leverage unique aspects of the domain
Combine many different features / approaches
Overcome data limitations and build a complex model by gathering lots of data automatically
45. Main article
Run Dragnet to extract the main article content. Keep track of individual blocks and process each separately.
46. Text & token normalization
Include web-specific logic in the tokenizer / normalizer. Example: a character that displays as a dash but is the Unicode character U+2014. Need to special-case @twitter, email@domain.com, dates, etc.
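A toy Python illustration of that web-specific handling: map U+2014 to a plain dash and shield @handles and e-mail addresses from naive whitespace splitting. The regexes and placeholder scheme are assumptions, not Moz's tokenizer.

import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
HANDLE = re.compile(r"(?<!\w)@\w+")

def tokenize(text):
    text = text.replace("\u2014", " - ")  # the character that displays as a dash
    protected = []
    def stash(match):  # replace special cases with numbered placeholders
        protected.append(match.group())
        return " __P%d__ " % (len(protected) - 1)
    text = HANDLE.sub(stash, EMAIL.sub(stash, text))
    return [protected[int(t[3:-2])] if t.startswith("__P") else t
            for t in text.split()]

tokenize("Email me@example.com \u2014 or ping @mattthemathman")
# -> ['Email', 'me@example.com', '-', 'or', 'ping', '@mattthemathman']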
47. Generate Candidates → Rank Candidates
[Pipeline diagram] Raw HTML → parse / dechroming → normalize → sentence/word tokenize → candidates → ranked topics
48. Processing individual blocks helps the NP chunker
Block 1: Mike Lindblom
Block 2: Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker: “Mike Lindblom”, “Bertha”, “State sues”, “the lawsuit”
50. Generate Candidates → Rank Candidates
[Pipeline diagram] Raw HTML → parse / dechroming → normalize → sentence/word tokenize → POS tag / noun phrase chunk + Wikipedia lookup → candidates → ranked topics
51. Generate Candidates → Rank Candidates
[Pipeline diagram] As above, with the ranking feature groups added: candidates → Shallow, Occurrence, TF, QDR, POS, URL → ranked topics
52. Ranking model features
Shallow: relative position in the document, number of tokens
Occurrence: does the candidate occur in the title, H1, meta description, etc.
Term frequency: count of occurrences, average token count, sum(in degree), etc.
QDR: information-retrieval-motivated “query-document relevance” ranking models: TF-IDF (term frequency × inverse document frequency), probabilistic approaches, language models
POS tags: is the keyword a proper noun, etc.
URL features: does the keyword appear in the URL
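A schematic of how those groups might be assembled per candidate; the field names and exact features are illustrative assumptions, not Moz's schema (QDR scores such as TF-IDF would be appended the same way).

import math

def candidate_features(cand, doc):
    # cand: dict with "tokens", "positions", "pos_tags"
    # doc: dict with "title", "h1", "meta_description", "url", "n_tokens"
    text = " ".join(cand["tokens"]).lower()
    return {
        # shallow
        "relative_position": min(cand["positions"]) / max(doc["n_tokens"], 1),
        "n_tokens": len(cand["tokens"]),
        # occurrence
        "in_title": text in doc["title"].lower(),
        "in_h1": text in doc["h1"].lower(),
        "in_meta": text in doc["meta_description"].lower(),
        # term frequency
        "tf": len(cand["positions"]),
        "log_tf": math.log1p(len(cand["positions"])),
        # POS tags
        "is_proper_noun": all(t == "NNP" for t in cand["pos_tags"]),
        # URL
        "in_url": text.replace(" ", "-") in doc["url"].lower(),
    }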
53. Generate Candidates → Rank Candidates
[Pipeline diagram] The full pipeline: the feature groups (Shallow, Occurrence, TF, QDR, POS, URL) feed a classifier that outputs a probability of relevance → ranked topics
54. Generating training data
[Flow diagram] List of high-volume keywords → commercial search engine → top 10 results → crawl pages → training data: HTML with a relevant keyword
55. PU learning
Learning classifiers from only positive and unlabeled data, Elkan and Noto, SIGKDD 2008
● Most ML classifiers have both positive and negative examples in the training data
● We only have one keyword per page that is relevant (“positive”) and many others that may or may not be positive
● Use the result from this paper applied to our data
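A compact Python sketch of the Elkan & Noto recipe: train positive vs. unlabeled, estimate c = p(labeled | positive) on held-out positives, then rescale scores. The scikit-learn calls are standard; the choice of classifier and the data wiring are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_pu(X, s, X_holdout_pos):
    # X: candidate features; s: 1 = labeled positive, 0 = unlabeled
    clf = RandomForestClassifier(n_estimators=100).fit(X, s)
    # c = p(s=1 | y=1), estimated as the mean score on known positives
    c = clf.predict_proba(X_holdout_pos)[:, 1].mean()
    return clf, c

def predict_pu(clf, c, X):
    # Elkan & Noto: p(y=1 | x) = p(s=1 | x) / c
    return np.clip(clf.predict_proba(X)[:, 1] / c, 0.0, 1.0)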
56. Generate Candidates → Rank Candidates
[Pipeline diagram] Recap of the complete pipeline: raw HTML → parse / dechroming → normalize → sentence/word tokenize → POS tag / noun phrase chunk + Wikipedia lookup → candidates → feature groups (Shallow, Occurrence, TF, QDR, POS, URL) → classifier (probability of relevance) → ranked topics
57. Keyword extraction review
The resulting algorithm is:
● Robust across different content types – worst case, it still extracts reasonable topics
● Reasonably fast, about 25 pages / second end-to-end
● Subjectively outperforms other commercial APIs (e.g. Alchemy)
● In production for a year+, having processed many millions of pages
59. Author
Extract a list of author names (or an empty list if no authors) from a given web page.
See: https://moz.com/devblog/web-page-author-extraction/
60. Do we need an ML algorithm for this? (Why isn’t this trivial?)
Heuristics do an adequate job:
● The microformat rel="author" attribute in link tags (a) is commonly used to specify the page author
● Some sites specify page authors with a meta author tag
● Many sites use names like “author” or “byline” for class attributes in their CSS
61. Sometimes heuristics work well
<div class="article-columnist-name vcard">
  <a class="author url fn" rel="author" href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
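The three heuristics above, sketched with BeautifulSoup; the selectors are the obvious ones, not an exhaustive production rule set.

from bs4 import BeautifulSoup

def heuristic_authors(page):
    soup = BeautifulSoup(page, "html.parser")
    # 1. rel="author" on link tags
    for a in soup.find_all("a", rel="author"):
        if a.get_text(strip=True):
            return [a.get_text(strip=True)]
    # 2. <meta name="author" content="...">
    meta = soup.find("meta", attrs={"name": "author"})
    if meta and meta.get("content"):
        return [meta["content"].strip()]
    # 3. class names like "author" or "byline"
    node = soup.find(class_=lambda c: c and c.lower() in ("author", "byline"))
    if node and node.get_text(strip=True):
        return [node.get_text(strip=True)]
    return []  # empty list when no author is found

heuristic_authors('<a class="author url fn" rel="author" '
                  'href="/author/mike-lindblom/">Mike Lindblom</a>')
# -> ['Mike Lindblom']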
67. [Architecture diagram] Page HTML → Dragnet HTML blockifier → block representation → block ranking model → ranked blocks → author chunker (tags for the highest ranked block)
68. Block ranking model
Combines NLP and web features:
• Tokens in the block text (similar to bag-of-words classification)
• Tokens in the block’s HTML tag attributes (e.g. class="byline")
• The HTML tags in the block (e.g. many author names are links)
• rel="author" and other markup-inspired features
All features go through a random forest classifier that predicts the probability a block contains an author.
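A schematic Python version: hash the token streams and markup flags into one sparse vector and train a random forest. The feature prefixes and dimensions are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**16, input_type="string")

def block_tokens(block):
    # block: dict with "text_tokens", "attr_tokens", "tags", "rel_author"
    return (["text=" + t for t in block["text_tokens"]] +
            ["attr=" + t for t in block["attr_tokens"]] +  # e.g. class="byline"
            ["tag=" + t for t in block["tags"]] +          # e.g. authors are often links
            (["rel_author"] if block["rel_author"] else []))

def train_block_model(blocks, labels):
    X = hasher.transform(block_tokens(b) for b in blocks)
    return RandomForestClassifier(n_estimators=200).fit(X, labels)

# model.predict_proba(X)[:, 1] then gives the probability a block contains the author.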
69. Block model performance
The overall block model is pretty good – it captures the intuition that “bylines are easy to spot”.
[Table] Precision@K: whether the top-ranked block actually contains the author’s name.
70. Author Chunker
A modified IOB tagger, similar to an NP chunker or POS tagger.
A 3-class classification problem (Beginning of name, Inside name, Outside).
To make a prediction at the next token, use:
uni-, bi- and tri-gram tokens from the previous/next few tokens
uni-, bi- and tri-gram POS tags
previously predicted IOB labels
HTML tags preceding and following the token
rel="author" and other markup-inspired features
Overall, 85.6% accurate when chunking the top block.
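A rough shape of that per-token feature set; the window size and feature templates are assumptions rather than the production chunker.

def token_features(tokens, pos_tags, html_before, prev_iob, i):
    # Features for predicting the IOB label of tokens[i]
    feats = {"bias": 1}
    for off in (-2, -1, 0, 1, 2):  # previous/next few tokens
        j = i + off
        if 0 <= j < len(tokens):
            feats["tok[%d]=%s" % (off, tokens[j].lower())] = 1
            feats["pos[%d]=%s" % (off, pos_tags[j])] = 1
    if i > 0:
        feats["bigram=%s_%s" % (tokens[i - 1].lower(), tokens[i].lower())] = 1
        feats["prev_iob=" + prev_iob[i - 1]] = 1  # previously predicted label
    feats["html_before=" + html_before[i]] = 1    # e.g. '<a rel="author">'
    return feats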
72. Author Chunker using HTML features
Tokens: by | John | Timmer | - | Jul | 1
POS:    IN | NNP  | NNP    | - | NN  | CD
IOB:    O  | ??
HTML:   <p class="byline"> <a rel="author"><span> </span></a>
To make the prediction here, we can use tokens, POS tags and the HTML structure between tokens.
73. Overall author model performance
[Chart] Overall accuracy on the test set is good, outperforming alternatives: heuristics, a commercial API, and an open-source Python library.