This document discusses text research and the Text-Fabric data model. It describes Text-Fabric as a data model for annotated text corpora, a query engine, a text weaver, and an API. The data model transforms TEI-XML into separate feature files to untangle annotations and enable better data logistics. Computational research involves gathering data from repositories, modeling and analyzing it, publishing results back to repositories, and discussing conclusions in notebooks. Publishing workflows include building websites that deliver research outputs to the general public more accessibly.
Dependency Parsing-based QA System for RDF and SPARQL (Fariz Darari)
This document describes a dependency parsing-based question answering system that uses RDF and SPARQL. It parses natural language facts and questions into typed dependencies, translates them into RDF and SPARQL, queries the populated RDF data to answer questions, and incorporates background knowledge from WordNet and DBpedia. The system handles negation, tenses, and passive voice, and the document provides examples of its question answering capabilities.
Full Video: https://www.youtube.com/watch?v=cOShsisEsC0
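The dependency-to-RDF translation described above can be sketched with hand-written typed dependencies (a toy illustration of the idea, not the system's actual pipeline; the relation names follow common dependency-grammar conventions):

```python
# Toy sketch: the subject and object of the verb in the typed dependencies
# become the subject and object of an RDF-style triple.
def triples_from_deps(deps):
    """deps: list of (relation, head, dependent) typed dependencies."""
    subj = next(dep for rel, head, dep in deps if rel == "nsubj")
    verb = next(head for rel, head, dep in deps if rel == "nsubj")
    obj = next(dep for rel, head, dep in deps if rel == "dobj" and head == verb)
    return [(subj, verb, obj)]

# "Einstein developed relativity" as typed dependencies:
deps = [("nsubj", "developed", "Einstein"), ("dobj", "developed", "relativity")]
facts = triples_from_deps(deps)
print(facts)  # [('Einstein', 'developed', 'relativity')]

# A question like "Who developed relativity?" would then translate into a
# SPARQL-style pattern with the unknown subject as the variable:
query = "SELECT ?who WHERE { ?who :developed :relativity }"
answers = [s for s, p, o in facts if p == "developed" and o == "relativity"]
print(answers)  # ['Einstein']
```

A real system would of course obtain the dependencies from a parser and evaluate the SPARQL against a triple store; the list comprehension above only mimics that pattern match.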
An overview of the relation and combination of three data processing paradigms that are becoming more relevant today. It introduces the essentials of graph, distributed, and stream computing, and beyond. Furthermore, it questions the fundamental problems we want to solve with data analysis, and the potential of eventually saving humankind in the next millennium by improving the state of the art of computation technologies, while we are busy answering first-world-problem questions. Crazy but possible.
Nicolas Pastorino: The Open-source roar in the eZ Community
This document summarizes Nicolas Pastorino's presentation to the eZ Day Paris 2011 conference. It discusses the growth of the open source eZ community in areas like pull requests, new extensions, forums and blogs. Tools that support community contributions are highlighted, including the eZ Publish code repository on GitHub and the issue tracker. Future plans like a community roadmap, event planning tool, and job board are outlined. The presentation emphasizes that an active, growing community benefits eZ Publish and calls for community members to spread awareness of the system.
Python 3.6 was released in December 2016, and it includes 16 new PEPs! In this talk, we will focus on PEP 498 - Literal String Interpolation, affectionately known as f-strings. Let's learn about f-strings. See some examples, and know the gotchas. You’ll want to upgrade to Python 3.6 just for this!
Presented at PyCon AU, August 5th 2017
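A few illustrative PEP 498 snippets in the spirit of the talk (these are my examples, not slides from the presentation):

```python
# f-string basics (Python 3.6+), plus a common gotcha.
name = "world"
value = 4 * 10

greeting = f"Hello, {name}!"   # arbitrary expressions are evaluated in place
padded = f"{value:>6}"         # format specs work just like str.format
nested = f"{value:{'>'}{6}}"   # even the format spec itself can be dynamic

# Gotcha: in Python 3.6, backslashes are not allowed inside the braces.
# f"{'\n'.join(items)}" is a SyntaxError; bind the separator to a name instead.
sep = "\n"
joined = f"{sep.join(['a', 'b'])}"

print(greeting)  # Hello, world!
print(padded)    # '    40'
```

Note that f-strings are evaluated at runtime in the enclosing scope, which is what makes them both convenient and occasionally surprising.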
MementoMap Framework for Flexible and Adaptive Web Archive Profiling (Sawood Alam)
In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple yet extensible file format suitable for MementoMaps. We used the complete index of arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings, and generated MementoMaps with varying amounts of detail from its HTML pages with an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a smaller one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives, and evaluated MementoMaps by measuring their accuracy using 3.3M unique URIs from those logs. We found that a MementoMap of less than 1.5% relative cost (compared to the comprehensive listing of all unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% recall (i.e., zero false negatives).
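The in-file binary search idea can be sketched in Python over a tiny hypothetical MementoMap (the keys and counts below are illustrative, not real arquivo.pt data, and a real MementoMap also supports wildcard/prefix matching, which is omitted here):

```python
import bisect

# A simplified MementoMap: rows of (SURT-style URI key, memento count),
# kept sorted so that membership lookups can binary-search the keys.
rows = sorted([
    ("com,example)/", 120),
    ("com,example)/about", 3),
    ("org,archive)/", 4500),
    ("pt,arquivo)/", 9000),
])
keys = [k for k, _ in rows]

def lookup(surt_key):
    """Binary-search the sorted keys; return the count if present, else 0."""
    i = bisect.bisect_left(keys, surt_key)
    if i < len(keys) and keys[i] == surt_key:
        return rows[i][1]
    return 0

print(lookup("org,archive)/"))  # 4500
print(lookup("net,missing)/"))  # 0
```

The on-disk variant in the paper applies the same idea directly to byte offsets within the file, avoiding the need to load the whole MementoMap into memory.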
This document provides an introduction to NoSQL and Neo4j. It discusses how Neo4j is a graph database that is well-suited for storing connected data. It then demonstrates how to query the graph database using the Cypher query language, which uses a declarative pattern matching approach. Examples of real-world uses of Neo4j by companies are also presented to illustrate how it has been adopted for applications such as social networking, fraud detection, and knowledge graphs.
OpenStack is IaaS provider software written in Python. As such, it provides a massively scalable operating system and services such as Image, Storage, Object, and Compute.
This talk aims to give the audience an overview of OpenStack: its capabilities, modules, coding styles, workflow, and organization.
As a successful case of community-driven development, it is definitely a good reference for anyone willing to take that road or to join existing projects.
20171017 3PL Machine Learning & AI in Transport & Logistics (Frank Salliau)
This document discusses machine learning and AI in transport and logistics. It provides an overview of machine learning concepts like supervised and unsupervised learning, as well as deep learning techniques. It also examines the growth of big data from various sources that enable machine learning applications. Examples are given of machine learning uses in transport and logistics, such as predicting demand and optimizing routes with IoT sensor data.
TypeScript and Flow: Introducing Static Typing to JavaScript Development (Heejong Ahn)
- The document discusses TypeScript and Flow, two type systems for JavaScript. It provides information on their history, design goals around soundness vs productivity, usage statistics, and comparisons of their type systems and tooling support.
- Key differences noted are that Flow focuses on soundness while TypeScript balances correctness and productivity, and that TypeScript has significantly more resources and adoption based on metrics like StackOverflow questions, GitHub stars, and npm downloads.
This document discusses open source software options for startups from the perspective of Victor Neo, a computer science student and CTO. It provides an introduction to Victor and his experience with startups and open source software. The document then discusses the cost savings and importance of people when using software. It provides an overview of common open source office, accounting, graphics, and development applications that startups can use. Examples are given of how Facebook, Twitter, and Victor's own company Pikaland have utilized open source software. The document encourages contributing back to open source communities and lists upcoming talks on related topics.
DataPlotly is a plugin for QGIS that lets you create D3-like plots from spatial data. It is built on top of Plotly, a JavaScript library that offers easy APIs for many languages such as Python, R, MATLAB, and Node.js.
The plugin was created back in 2017 for the upcoming QGIS 3 version: today the plugin has been downloaded more than 50,000 times.
Creating plots is outside the main scope of QGIS, but thanks to its simple Python API it is easy enough to create additional scripts and plugins. Thanks to these APIs, DataPlotly is today a well-maintained Python plugin with a growing community of developers, users, and testers.
DataPlotly plots are completely interactive so that plot elements are directly linked with map items; therefore the user is able to query map items from the main plot canvas.
Thanks to a crowdfunding campaign launched in March 2019 during the annual QGIS User Conference, the functionality of DataPlotly was extended: a complete refactoring of the code, more plot types, and especially the creation of plots in the layout composer.
More and more people are using the plugin to analyze data and to create complex output reports (e.g. during the Covid-19 pandemic).
A call to give back puppetlabs-corosync to the community (Julien Pivotto)
The document discusses Puppet Labs' corosync module, which manages a cluster using Pacemaker, Corosync, and CMAN. It was presented by Julien Pivotto at the 2015 Puppet Contributor Summit. The modules provide Puppet types and resources for configuring primitives, the cluster information base, messaging, and cluster membership and quorum services.
Google processes 400 petabytes of data every month, and that was way back in 2007! With users generating massive amounts of data on social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard disks going down, such data being made available to everyone, and the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
The introduction to my class on machine learning. The subjects covered in this class include:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are at the right level.
File Polyglottery; or, This Proof of Concept is Also a Picture of Cats (Evan Sultanik)
Evan Sultanik's BSides Philly 2017 talk on File Polyglots. Watch the video, here: https://www.youtube.com/watch?v=fdKPnsWp9ho
A polyglot is a file that can be interpreted as multiple different filetypes depending on how it is parsed. While polyglots serve the noble purpose of being a nifty parlor trick, they also have much more nefarious uses, e.g., hiding malicious printer firmware inside a document that subverts a printer when printed, or a document that displays completely different content depending on which viewer opens it. This talk does a deep dive into the technical details of how to create such special files, using examples from some of the recent issues of the International Journal of PoC||GTFO. Learn how we made a PDF that is also a valid NES ROM that, when emulated, displays the MD5 sum of the PDF. Learn how we created a PDF that is also a valid PostScript document that, when printed to a PostScript printer, produces a completely different document. Oh, and the PostScript also prints your /etc/passwd file, for good measure. Learn how to create a PDF that is also a valid Git repository containing its own LaTeX source code and a copy of itself. And many more!
These are the slides for the COSCUP[1] 2013 hands-on[2], "Learning Python from Data".
It uses examples to show the world of Python. I hope it will help you with learning Python.
[1] COSCUP: http://coscup.org/
[2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
The document discusses the Apriori algorithm for finding frequent itemsets in transactional data. The Apriori algorithm works in two phases:
1. It finds all frequent itemsets of length 1 by scanning the transaction database. It then generates candidate itemsets of length k from frequent itemsets of length k-1, and tests them against the database to determine which are frequent.
2. It uses the found frequent itemsets to generate association rules between items. The confidence and support of each rule is calculated to determine how interesting it is.
The algorithm efficiently finds all rules and itemsets that meet a minimum support threshold by generating candidates in a way that prunes any candidate with an infrequent subset, since such a candidate cannot itself be frequent. This allows the algorithm to avoid counting the vast majority of the exponentially many possible itemsets.
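The candidate-generation phase can be sketched in Python (a toy illustration using an absolute support count, not a tuned implementation; the transaction data is made up):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (as frozensets) with their support counts.

    transactions: list of sets of items; min_support: absolute count.
    """
    # k = 1: count single items by scanning the database.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into length-k candidates, pruning any
        # candidate that has an infrequent (k-1)-subset.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count the surviving candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(transactions, min_support=2)
print(freq[frozenset({"bread", "milk"})])  # 2
```

The rule-generation phase would then, for each frequent itemset, emit rules whose confidence (support of the whole itemset divided by support of the antecedent) exceeds a chosen threshold.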
Parallel computing in Python: Current state and recent advances (Pierre Glaser)
The document discusses parallel computing in Python. It provides an overview of parallelization libraries in Python like multiprocessing and concurrent.futures. It describes different approaches to parallelization using threads versus processes. Thread-based parallelism allows sharing memory but the Global Interpreter Lock limits parallel execution, while process-based parallelism ensures true parallel execution but requires data copying between processes. It also discusses challenges like serialization and portability across environments. Alternatives like loky provide a more robust process pool executor for data science tasks.
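The thread-versus-process trade-off described above can be sketched with the standard library's concurrent.futures (an illustrative snippet, not taken from the talk):

```python
# Thread pools share memory, but CPU-bound Python code is serialized by the
# GIL; process pools run truly in parallel but must pickle arguments and
# results between workers.
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # A stand-in for real work. With threads this cannot run in parallel
    # because of the GIL; I/O-bound tasks, by contrast, would overlap fine.
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_bound, [10, 100, 1000]))

print(results)  # [285, 328350, 332833500]
```

Swapping in ProcessPoolExecutor gives true parallelism for CPU-bound work, at the cost of serialization overhead, and on some platforms it requires the submitting code to live under an `if __name__ == "__main__":` guard.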
The UNESCO Internet website is the main tool used to disseminate information about the Organization and its programme of activities. A respected source of information, the UNESCO website is ranked among the top five of UN family websites and receives on average 1.8 million unique visitors (7 million page views) per month.
The Secretariat is located at its Paris headquarters and in 52 field offices around the world, and demands high availability of the website, a mission-critical working tool for the Secretariat and its communities.
In this talk, Chakir Piro (UNESCO) and Olivier Dobberkau (dkd) will give a short overview of the history of TYPO3 usage at www.unesco.org and of how more content is being migrated from an old CMS to TYPO3.
We will introduce the setup involved in deploying a multinational and multilingual website with TYPO3. Further on, we will describe the requirements of such a project, which deals with a large number of stakeholders, communication channels, and international events.
Chakir Piro will describe the role of the department he works in, which filters and aggregates the needs of the different sectors, field offices, and cluster offices of UNESCO.
We will give practical insights into how organizations can adopt a fast track to deliver daily content to their website visitors.
Making NumPy-style and Pandas-style code faster and able to run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for four years. This talk describes how Numba and Dask provide scaled Python today.
The document discusses storing high-energy physics data in the DAOS object storage system. It provides an overview of high-energy physics experiments and data, and the ROOT framework commonly used for data analysis. It then introduces RNTuple, a new data format being developed to replace the legacy TTree format used in ROOT. The document details how RNTuple maps data to DAOS objects and describes a C++ interface for interacting with DAOS. It notes that from the user perspective, working with RNTuple data in DAOS is similar to working with files on disk.
This document discusses OpenStreetMap (OSM) and the characteristics of a "NeoGeographer". Some key points:
- OSM is an open-source map of the world that anyone can edit or use. It aims to provide an alternative to proprietary map data from companies like Google Maps.
- A "NeoGeographer" is someone who contributes geospatial data to digital maps using modern tools, as opposed to traditional "Old Geographers" who worked in the past.
- OSM data is licensed under the Open Database License which allows for reuse and modification of the map data. In contrast, Google Maps does not allow for free secondary use or modification of its map content.
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ... (Data Con LA)
The Apriori Algorithm is an unsupervised learning technique for producing associative rules. This talk will explain the algorithm's implementation, explore how effective it can be when applied to big data, discuss how we use it at DataScience to do market basket analysis, and demonstrate some novel use cases involving the million song database, recipes, and other applications involving open data.
The document is a summary of the topics covered in a CSE 471 class over 15 weeks. It includes a list of chapters covered from the textbook as well as brief summaries of the material covered each week, including search techniques, planning, logical inference, probabilistic reasoning, learning methods, and conclusions about AI. It also notes that the take-home exam will be released this week and provides instructions for completing online course evaluations.
1. The document discusses various Yahoo products and services for startups including YQL for querying data from the web, BOSS for search, and YSlow for improving website speed.
2. It promotes these tools as helping startups with key phases of developing a business from coming up with an idea, building the product, refining it, and achieving profitability.
3. The tools are described as open, free to use, and able to scale easily with millions of users while requiring no purchases from developers.
Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
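The "annotated text as a graph" idea can be sketched minimally (the names below are hypothetical illustrations, not the TextPy or Text-Fabric API):

```python
# Word slots are numbered nodes; higher-level nodes (phrases, sentences)
# link to ranges of slots; annotations live in feature maps keyed by node.
words = ["In", "the", "beginning", "was", "the", "word"]
slots = list(range(len(words)))          # nodes 0..5 are word slots

edges = {10: slots[:4], 11: slots[4:]}   # nodes 10 and 11 are phrase nodes
features = {
    "text": {i: w for i, w in enumerate(words)},
    "type": {10: "clause", 11: "np"},
}

def text_of(node):
    """Concatenate the word slots that a node covers (a slot covers itself)."""
    span = edges.get(node, [node])
    return " ".join(features["text"][s] for s in span)

print(text_of(10))  # "In the beginning was"
print(text_of(11))  # "the word"
```

Keeping each feature in its own map is what makes the untangling efficient: a query touching only one annotation layer never has to load the others, much as Pandas operates column by column.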
20171017 3PL Machine Learning & AI in Transport & LogisticsFrank Salliau
This document discusses machine learning and AI in transport and logistics. It provides an overview of machine learning concepts like supervised and unsupervised learning, as well as deep learning techniques. It also examines the growth of big data from various sources that enable machine learning applications. Examples are given of machine learning uses in transport and logistics, such as predicting demand and optimizing routes with IoT sensor data.
TypeScript와 Flow: 자바스크립트 개발에 정적 타이핑 도입하기Heejong Ahn
- The document discusses TypeScript and Flow, two type systems for JavaScript. It provides information on their history, design goals around soundness vs productivity, usage statistics, and comparisons of their type systems and tooling support.
- Key differences noted are that Flow focuses on soundness while TypeScript balances correctness and productivity, and that TypeScript has significantly more resources and adoption based on metrics like StackOverflow questions, GitHub stars, and npm downloads.
This document discusses open source software options for startups from the perspective of Victor Neo, a computer science student and CTO. It provides an introduction to Victor and his experience with startups and open source software. The document then discusses the cost savings and importance of people when using software. It provides an overview of common open source office, accounting, graphics, and development applications that startups can use. Examples are given of how Facebook, Twitter, and Victor's own company Pikaland have utilized open source software. The document encourages contributing back to open source communities and lists upcoming talks on related topics.
DataPlotly is a plugin for QGIS that allows to create D3 like plots from spatial data. It is build on top of plotly, a javascript library which offers easy API for many languages such as Python, R, Matlab and NodeJS.
The plugin was created back in 2017 for the upcoming QGIS 3 version: today the plugin has been downloaded more than 50,000 times.
Creating plots is out of the main scopes of QGIS but thanks to the simple Python API it is easy enough to create additional scripts and plugins. Thanks to these APIs, DataPlotly is today a well maintained Python plugin with a growing community of developers, users and testers.
DataPlotly plots are completely interactive so that plot elements are directly linked with map items; therefore the user is able to query map items from the main plot canvas.
Thanks to a crowdfunding campaign launched in March 2019 during the annual QGIS User Conference, the functionalities of DataPlotly were extended: a complete refactoring of the code, more plots but especially the creation of plots in the layout composer.
More and more people are using the plugin to analyze the data and to create complex output reports of data (e.g. the Covid-19 pandemic
A call to give back puppetlabs-corosync to the communityJulien Pivotto
The document discusses Puppet Labs' corosync module, which manages a cluster using Pacemaker, Corosync, and CMAN. It was presented by Julien Pivotto at the 2015 Puppet Contributor Summit. The modules provide Puppet types and resources for configuring primitives, the cluster information base, messaging, and cluster membership and quorum services.
Google processes 400 petabytes of data every month and that was way back in 2007! With users generating massive amounts of data in social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard-disks going down, and such data being made available to everyone, and with the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
The introduction to my class on machine learning. The subjects covered in this class go from:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are at the level.
File Polyglottery; or This Proof of Concept is Also a Picture of CatsEvan Sultanik
Evan Sultanik's BSides Philly 2017 talk on File Polyglots. Watch the video, here: https://www.youtube.com/watch?v=fdKPnsWp9ho
A polyglot is a file that can be interpreted as multiple different filetypes depending on how it is parsed. While polyglots serve the noble purpose of being a nifty parlor trick, they also have much more nefarious uses, e.g., hiding malicious printer firmware inside a document that subverts a printer when printed, or a document that displays completely different content depending on which viewer opens it. This talk does a deep dive into the technical details of how to create such special files, using examples from some of the recent issues of the International Journal of PoC||GTFO. Learn how we made a PDF that is also a valid NES ROM that, when emulated, displays the MD5 sum of the PDF. Learn how we created a PDF that is also a valid PostScript document that, when printed to a PostScript printer, produces a completely different document. Oh, and the PostScript also prints your /etc/passwd file, for good measure. Learn how to create a PDF that is also a valid Git repository containing its own LaTeX source code and a copy of itself. And many more!
It is the slides for COSCUP[1] 2013 Hands-on[2], "Learning Python from Data".
It aims for using examples to show the world of Python. Hope it will help you with learning Python.
[1] COSCUP: http://coscup.org/
[2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
The document discusses the Apriori algorithm for finding frequent itemsets in transactional data. The Apriori algorithm works in two phases:
1. It finds all frequent itemsets of length 1 by scanning the transaction database. It then generates candidate itemsets of length k from frequent itemsets of length k-1, and tests them against the database to determine which are frequent.
2. It uses the found frequent itemsets to generate association rules between items. The confidence and support of each rule is calculated to determine how interesting it is.
The algorithm efficiently finds all rules and itemsets that meet a minimum support threshold by generating candidates in a way that prunes subsets that cannot be frequent. This allows
Parallel computing in Python: Current state and recent advancesPierre Glaser
The document discusses parallel computing in Python. It provides an overview of parallelization libraries in Python like multiprocessing and concurrent.futures. It describes different approaches to parallelization using threads versus processes. Thread-based parallelism allows sharing memory but the Global Interpreter Lock limits parallel execution, while process-based parallelism ensures true parallel execution but requires data copying between processes. It also discusses challenges like serialization and portability across environments. Alternatives like loky provide a more robust process pool executor for data science tasks.
The UNESCO Internet website is the main tool used to disseminate information about the Organization and its programme of activities. A respected source of information, the UNESCO website is ranked among the top five of UN family websites and receives on average 1.8 million unique visitors (7 million page views) per month.
The Secretariat is located in its Paris headquarters and in 52 field offices around the world, and demands the high availability of the website, a mission critical working tool for the Secretariat and its communities.
In this Talk Chakir Piro (UNESCO) and Olivier Dobberkau (dkd) will give a short overview of the history of the usage of TYPO3 at www.unesco.org and how we are migrating more content from an old cms to TYPO3.
We will introduce the setup involved to deploy a multinational and multilingual website with TYPO3. Further on we will describe the requirements of such a project dealing with a large amount of stakeholders, communication channels and international events.
Chakir Piro will describe the role the department he works in to filter and aggregate the needs of the different sectors, field and cluster offices of UNESCO.
We will give practical insights on how organizations can adopt a fast track to deliver daily content to its website visitors.
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
The document discusses storing high-energy physics data in the DAOS object storage system. It provides an overview of high-energy physics experiments and data, and the ROOT framework commonly used for data analysis. It then introduces RNTuple, a new data format being developed to replace the legacy TTree format used in ROOT. The document details how RNTuple maps data to DAOS objects and describes a C++ interface for interacting with DAOS. It notes that from the user perspective, working with RNTuple data in DAOS is similar to working with files on disk.
This document discusses OpenStreetMap (OSM) and the characteristics of a "NeoGeographer". Some key points:
- OSM is an open-source map of the world that anyone can edit or use. It aims to provide an alternative to proprietary map data from companies like Google Maps.
- A "NeoGeographer" is someone who contributes geospatial data to digital maps using modern tools, as opposed to traditional "Old Geographers" who worked in the past.
- OSM data is licensed under the Open Database License which allows for reuse and modification of the map data. In contrast, Google Maps does not allow for free secondary use or modification of its map content.
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...Data Con LA
The Apriori Algorithm is an unsupervised learning technique for producing associative rules. This talk will explain the algorithm's implementation, explore how effective it can be when applied to big data, discuss how we use it at DataScience to do market basket analysis, and demonstrate some novel use cases involving the million song database, recipes, and other applications involving open data.
The document is a summary of the topics covered in a CSE 471 class over 15 weeks. It includes a list of chapters covered from the textbook as well as brief summaries of the material covered each week, including search techniques, planning, logical inference, probabilistic reasoning, learning methods, and conclusions about AI. It also notes that the take-home exam will be released this week and provides instructions for completing online course evaluations.
1. The document discusses various Yahoo products and services for startups including YQL for querying data from the web, BOSS for search, and YSlow for improving website speed.
2. It promotes these tools as helping startups with key phases of developing a business from coming up with an idea, building the product, refining it, and achieving profitability.
3. The tools are described as open, free to use, and able to scale easily with millions of users while requiring no purchases from developers.
Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
We demonstrate how Text-Fabric can handle the display of text and annotations, even when chunks of text are not properly embedded in each other. This demo contains examples from the Hebrew Bible and the Old Babylonian Letters (cuneiform clay tablets).
This document discusses applying data analysis techniques used for ancient corpora to the Quran. It presents Text-Fabric (TF) as a graph database model for storing textual data in plain text files without XML or SQL. TF models text as nodes for words and phrases connected by edge relationships, and stores components like words, phrases, chapters and verses that can be uniquely identified. The document provides an example of a TF dataset containing parsed text from Iain M. Banks' novel "Consider Phlebas".
Researchers in ancient text corpora can take control over their data. We show a way to do so by means of Text-Fabric.
Co-production of Cody Kingham and Dirk Roorda
This document summarizes the history and current state of BHSA (Biblia Hebraica Stuttgartensia Amstelodamensis) tools. It describes early tools like EMDROS and SHEBANQ, as well as more recent projects like Text-Fabric that encode texts in a graph structure with minimal encoding. Text-Fabric files separate each feature of the data into individual files for easy processing and combination. The document outlines Text-Fabric data, sharing, starting with the tool, publishing with it, and available apps and corpora. It promotes Text-Fabric's concepts of transparent, contributor-friendly encodings and provides links to relevant GitHub repositories and tutorials.
Developing a tool for handling text with linguistic annotations. Text-Fabric is meant to support researchers who want to contribute portions of the data, and it weaves the contributions into a meaningful whole. Currently, it is primarily meant for working with the Hebrew Bible, based on the ETCBC (Amsterdam) linguistic database.
Conference presentation for 2016 annual meeting of the Society of Biblical Literature, San Antonio. (https://www.sbl-site.org).
Authors: Janet Dyk (linguistic ideas) and Dirk Roorda (computational implementation).
A verb organizes the elements in a sentence. Different patterns of constituents affect the meaning of a verb in a given context. The potential of a verb to combine with patterns of elements is known as its valence. A single set of questions, organized as a flow chart, selects the relevant building blocks within the context of a verb. The resulting pattern provides a particular significance for the verb in question. Because all contexts are submitted to the same flow chart, similarities and differences between verbs come to light. For example, verbs of movement in their causative formation manifest the same patterns as transitive verbs with an object that gets moved. We apply this approach to the whole Hebrew Bible, using the database of the Eep Talstra Centre for Bible and Computer (ETCBC), which contains the relevant linguistic annotations. This allows us to have a complete listing of all patterns for all verbs. It provides the basis for consistent proposals for the significance of specific patterns occurring with a particular verb. The valence results are made available in SHEBANQ, an online research tool based on the ETCBC database. It presents the basic data, text and linguistic features, together with annotations by researchers. The valence results consist of a set of algorithmically generated annotations which show up between the lines of the text. The algorithm itself and its documentation can be found at https://shebanq.ancient-data.org/tools?goto=valence. By using SHEBANQ we achieve several goals with respect to the scholarly workflow: (1) all our results are openly accessible online, and other researchers may comment on them; (2) all resources needed to reproduce this research are available online and can be downloaded (Open Access).
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
1. The document discusses layers of annotation for analyzing biblical Hebrew text, including the text itself, linguistic features, manually or automatically generated analyses, and queries for exegetical search.
2. It provides an overview of the Linguistic Annotation Framework (LAF) for representing annotated text and statistics on the annotation of one Hebrew text, with over 800,000 regions and 1.4 million nodes.
3. The document describes tools for querying the annotated text, including the SHEBANQ system and LAF-Fabric API, and the ability to work with the data in various formats like XML, binary files, and R.
20151111 utrecht ver theolbibliothecarissenDirk Roorda
DANS is an institute of the Royal Netherlands Academy of Arts and Sciences and the Netherlands Organization for Scientific Research that promotes permanent access to digital research data. It provides data archiving services including depositing datasets in its online repository EASY, which ensures the data is findable, referable, downloadable, usable, and supports scholarly communication through publication of data papers. DANS also works with research organizations using a front office-back office model to facilitate long-term preservation of research data.
Text as Data: processing the Hebrew BibleDirk Roorda
The merits of stand-off markup (LAF) versus inline markup (TEI) for processing text as data. Ideas applied to work with the Hebrew Bible, resulting in tools for researchers and end-users.
Datamanagement for Research: A Case StudyDirk Roorda
How practices of data sharing can help researchers to produce more science.
Session in the data management course organized by RDNL (Research Data in the Netherlands)
Hebrew Bible as Data: Laboratory, Sharing, LessonsDirk Roorda
The document discusses using the Hebrew Bible as a data source for research. It describes several databases and tools for querying and analyzing the data, including ETCBC, SHEBANQ, and LAF-Fabric. It provides an overview of how the data is created, archived, shared and disseminated through the research data cycle. Examples are given of using LAF-Fabric to count nodes, write plain text, and visualize annotations. The goal is to make the Hebrew Bible and linguistic annotations available as linked open data for various types of researchers.
LAF-Fabric: a tool to process the ETCBC Hebrew Text Database in Linguistic Annotation Framework.
How researchers in theology and linguistics can create workflows to analyse the text of the Hebrew Bible and extract data for visualization. Those workflows can be written in Python, and run conveniently in the IPython Notebook.
Joint work with Martijn Naaijer (VU University).
With the Hebrew Bible encoded in Linguistic Annotation Framework (LAF-ISO), and with a new LAF processing tool, we demonstrate how you can do practical data analysis. The tool, LAF-Fabric, integrates with the ipython notebook approach. Our example here is lexeme cooccurrence analysis of bible books. For now, the road from data to visualization is more important than the exact visualization.
The document describes the Linguistic Annotation Framework (LAF), which is an ISO standard for representing stand-off annotation of language resources. LAF allows for annotating text with linguistic information like part-of-speech tags or named entities in an XML format. Example annotated text corpora using LAF include the Open American National Corpus and a text database of the Hebrew Bible. The document then discusses challenges with existing LAF processors and introduces LAF-Fabric as a new tool that compiles LAF annotations into binary data for faster querying of linguistic features and running Python scripts against the data.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan's hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan's seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan's seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Compositions of iron-meteorite parent bodies constrain the structure of the pr...Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System, and they preserve information about conditions and planet-forming processes in the solar nebula. In this study, we include comprehensive elemental compositions and fractional-crystallization modeling for iron meteorites from the cores of five differentiated asteroids from the inner Solar System. Together with previous results of metallic cores from the outer Solar System, we conclude that asteroidal cores from the outer Solar System have smaller sizes, elevated siderophile-element abundances, and simpler crystallization processes than those from the inner Solar System. These differences are related to the formation locations of the parent asteroids because the solar protoplanetary disk varied in redox conditions, elemental distributions, and dynamics at different heliocentric distances. Using highly siderophile-element data from iron meteorites, we reconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across the protoplanetary disk within the first million years of Solar-System history. CAIs, the first solids to condense in the Solar System, formed close to the Sun. They were, however, concentrated within the outer disk and depleted within the inner disk. Future models of the structure and evolution of the protoplanetary disk should account for this distribution pattern of CAIs.
14–19. What is (computational) research?
• Gather your data from a repository
• Model it in a logical, abstract, tractable way
• Analyse it by means of a suite of well-chosen tools
• Produce results and deliver them again in a repository
• Discuss conclusions in a Jupyter notebook
• Publish and preserve everything in Zenodo/SHA, and/or on a website
22–25. Text-Fabric is ...
• a data model for text corpora with annotations
• a query engine
• a text weaver
• an API
• a python package: pip install text-fabric
35–39. Data model
• TEI-XML: fine for archiving, difficult for data science
• Text-Fabric: untangle
  • from inline to stand-off
  • from tags to features
  • from hierarchy to spatial relationships
  • from nested elements to tables of numbers
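The untangling step can be sketched in a few lines of Python: words become numbered slots, and each annotation becomes a stand-off feature, a plain mapping kept apart from the text. This is an illustration of the idea only, not actual Text-Fabric code; the names `slots` and `mark` are invented for this sketch, and the damage is approximated at word level.

```python
# Minimal illustration of stand-off features (not real Text-Fabric code).
# Words become numbered slots; each feature is a separate, independent mapping.

words = ["he", "asked", "Ethiopia", "for", "support"]
slots = {i + 1: w for i, w in enumerate(words)}   # slots are 1-based

damaged = {3, 4}   # word-level approximation of the damaged stretch
country = {3}      # "Ethiopia" is a country name

def mark(feature):
    """Render a row of 1s under the slots that carry the feature."""
    return " ".join(("1" if s in feature else " ") * len(slots[s])
                    for s in sorted(slots))

print(" ".join(words))
print(mark(damaged))
print(mark(country))
```

Because each feature lives in its own mapping, annotations that would overlap in inline markup are simply two independent rows here.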
43–49. This model solves problems
he asked Eth<damaged>iopia for</damaged> support
now try to mark Ethiopia as the name of a country:
* he asked <country>Eth<damaged>iopia</country> for</damaged> support
this is invalid XML!
• TEI is good for formulating encoding practices
• XML is bad for modelling the richness of text and annotations
• ... I long for a TEI without XML ...
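The claim that the overlapping markup is ill-formed is easy to verify with the standard library parser (a quick demonstration added here, not part of the original deck):

```python
# <country> opens inside the sentence but closes inside <damaged>, so the
# element boundaries cross. Any conforming XML parser must reject this.
import xml.etree.ElementTree as ET

overlapping = ("<s>he asked <country>Eth<damaged>iopia</country>"
               " for</damaged> support</s>")
try:
    ET.fromstring(overlapping)
    result = "parsed"
except ET.ParseError:
    result = "invalid XML"
print(result)   # → invalid XML
```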
51–58. ... but for now we just take a more abstract model ...
the Text-Fabric solution is:

        | he asked Ethiopia for support
        -------------------------------
damaged |              111111111
country |           11111111

• the data for damaged and country end up in separate files =>
• separation of concerns =>
• better data logistics
59–62. Data logistics
the whole corpus is just a bunch of separate files, each dealing with a well-defined aspect of the data
(140,000 lines)
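As a rough sketch of what such a per-feature file might look like: a few header lines, a blank line, then one node–value line per node. This is simplified for illustration; the real `.tf` format is more compact, and the functions `write_feature` and `read_feature` are invented for this sketch, not the Text-Fabric API.

```python
# Toy per-feature file, loosely inspired by Text-Fabric's .tf layout
# (simplified; not the official spec): "@key=value" header lines, a blank
# line, then "node<TAB>value" lines.
import os
import tempfile

def write_feature(path, meta, data):
    """Write one feature to its own plain-text file."""
    with open(path, "w") as f:
        for key, value in meta.items():
            f.write(f"@{key}={value}\n")
        f.write("\n")
        for node in sorted(data):
            f.write(f"{node}\t{data[node]}\n")

def read_feature(path):
    """Read a feature file back into a node -> value mapping."""
    data = {}
    with open(path) as f:
        lines = f.read().split("\n")
    body = lines[lines.index("") + 1:]        # skip the header block
    for line in body:
        if line:
            node, value = line.split("\t", 1)
            data[int(node)] = value
    return data

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "country.tf")
    write_feature(path, {"valueType": "str"}, {3: "Ethiopia"})
    print(read_feature(path))   # → {3: 'Ethiopia'}
```

Each feature being its own small file is what makes the data logistics easy: features can be versioned, shared, and combined independently.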
84–86. Named Entity Recognition
"record" a plain text from the TF dataset
and remember the original "coordinates", i.e. the nodes
87. CLARIAH/wp6-missieven
• Generale Missieven in TF => plain text with recorded positions
cltl/voc-missives
• Sophie Arnoult (CLTL): plain text => named entities
entities notebook
• named entities + recorded positions => back to TF features
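The round trip above can be sketched as follows. The code and names (`char2node`, `span_to_nodes`) are illustrative, not the actual CLARIAH/wp6-missieven implementation: while serializing the corpus to plain text, remember which node produced each character; afterwards, map the character spans reported by the NER tool back to nodes, which then receive new TF features.

```python
# Hedged sketch of "recording" plain text with node coordinates and mapping
# NER results back to nodes (illustrative, not the project's real code).

words = {1: "he", 2: "asked", 3: "Ethiopia", 4: "for", 5: "support"}

plain_parts, char2node = [], []
for node, w in words.items():
    plain_parts.append(w)
    char2node.extend([node] * len(w))   # each character remembers its node
    char2node.append(None)              # the separator between words
plain = " ".join(plain_parts)

def span_to_nodes(start, end):
    """Map a character span in the recorded plain text back to nodes."""
    return sorted({char2node[i] for i in range(start, end)
                   if char2node[i] is not None})

# Pretend the NER tool reported this span as a location entity:
start = plain.index("Ethiopia")
print(span_to_nodes(start, start + len("Ethiopia")))   # → [3]
```

The returned nodes are exactly where a new feature (say, an entity label) would be attached in the TF dataset.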
88. Reporting
• Jupyter notebooks are excellent to tell a computational story
• or to reason things out on the basis of data
• and to highlight the argument with visualisations
• and they fit nicely in a repo
• and repo releases can be archived and "DOI-ed"
91–92. Publishing workflow
• When delivering data in repos and writing articles in journals is not enough ...
• Build a website
• Infrastructure needed to do this efficiently for many corpora
113. team-text — production street to online
[diagram] source (TEI, PageXML, ASCII database) → pre-process (untangle) → back-end (TextRepo, AnnoRepo) → broker (Broccoli) → front-end (TextAnnoViz) → your web browser
• data science interface: researcher, working from a GitRepo in a Jupyter notebook
• general public interface: user, in a web browser
• can do large corpora, but also small ones
• corpus must fit in RAM; large corpora: by volume
• corpora: Globalise, Republic, Mondriaan, Suriano, Translatin, Hermans
117–120. Final remarks
• work in a repo: start-to-finish
• logic enables logistics
• stand-off annotations keep it clean
• let tools support repo operations
• just one example of how this can be done

pip install text-fabric
github.com/annotation/text-fabric
dirk.roorda@di.knaw.huc.nl
thank you