This document discusses web data engineering and processing large web archive datasets. It describes how web archives contain petabytes of archived web pages and metadata that must be efficiently accessed and analyzed at scale. Web data engineering techniques transform this raw data into useful information through methods like extracting graphs and indexes to enable search, while hiding complexity from end users.
Web Data Engineering - A Technical Perspective on Web Archives (Helge Holzmann)
This document summarizes a presentation on web data engineering and tools for working with large web archive datasets. It discusses the challenges of accessing and analyzing petabytes of archived web data and introduces ArchiveSpark, an open source library that provides efficient loading and processing of web archives using Apache Spark. Key techniques like using metadata indexes and attachments to enable flexible analysis are also outlined.
Continuing Education to Advance Web Archiving (CEDWARC) on Oct 28, 2019 at Gelman Library, George Washington University, 2130 H St NW, Washington, DC 20052.
The document discusses linking XML data to the web of linked data. It provides examples of converting XML content like tables and files into linked data formats like Turtle and JSON-LD. It also demonstrates querying linked data from XML files using SPARQL and XSLT transformations and serving linked data from XML using Apache Jena Fuseki. The document aims to help integrate linked data processing into existing XML tooling and workflows.
New tasks, new roles: Libraries in the tension between Digital Humanities, Re... (Stefan Schmunk)
This document summarizes Dr. Stefan Schmunk's presentation on new roles for libraries in relation to digital humanities, research data, and research infrastructures. The presentation discusses how digital humanities projects involving tasks like digital scholarly editions require new skills from libraries, such as expertise in XML encoding, long-term preservation of digital materials, and creation of virtual research environments. It also explores how libraries must adapt to help researchers with the growing importance of research data in the humanities by taking on roles like hosting data repositories, providing data management support and training, and building research data infrastructures.
This document introduces distributed computing and tools for processing large tabular data using the Big Data Cluster. It discusses how distributed computing allows tabular data to be replicated across nodes and computation to be parallelized. It then provides an overview of Hadoop and how the Big Data Cluster can be used with tools like Hue, Hive, and Pig to perform analytics on large datasets. Finally, it walks through an example of computing TF-IDF scores on a corpus of text documents from Project Gutenberg.
This presentation addresses the main issues of Linked Data and scalability. In particular, it provides details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the solutions can be transferred to the context of Linked Data.
This document provides an overview of web mining. It defines web mining as using data mining techniques to automatically discover and extract information from web documents and services. It discusses the differences between web mining and data mining, and covers the main topics in web mining including web graph analysis, structured data extraction, and web advertising. It also describes the different approaches of web content mining, web structure mining, and web usage mining.
Research institutions, governments and sometimes even the industry are promoting a way to publish data that conforms to principles of openness such as being Findable, Accessible, Interoperable and Reusable.
These principles can be adhered to in a multitude of ways: Linked Open Data is one of them; it is favoured by scientific communities, but its adoption is not limited to research contexts. In this talk I will provide an account of how my research projects enjoyed the benefits of being on either side of the FAIR data supply chain.
Data Management and Integration with d:swarm (Lightning talk, ELAG 2014) (Jan Polowinski)
d:swarm is a middleware for data integration and management currently developed by the Saxon State and University Library Dresden in cooperation with Avantgarde Labs.
d:swarm - A Library Data Management Platform Based on a Linked Open Data Appr... (Jens Mittelbach)
D:SWARM is a graphical web-based ETL modelling tool that serves to import data from heterogeneous sources in different formats, to map input to output schemata and design transformation workflows, and to load transformed data into a property graph database. Developed in a collaborative project by SLUB Dresden (www.slub-dresden.de) and Avantgarde Labs GmbH (www.avantgarde-labs.de), it features additional functionality such as exporting data models as RDF and sharing mappings and transformation workflows.
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity (Thamme Gowda)
The structural similarity of HTML pages is measured by using Tree Edit Distance measure on DOM trees. The stylistic similarity is measured by using Jaccard similarity on CSS class names. An aggregated similarity measure is computed by combining structural and stylistic measures. A clustering method is then applied to this aggregated similarity measure to group the documents.
Session 1.2: Improving access to digital content by semantic enrichment (semanticsconference)
This document discusses improving access to digital collections through semantic enrichment. It describes linking names and entities from text to knowledge bases like Wikidata to make the content more discoverable and usable. The process involves named entity recognition, entity linking using disambiguation algorithms, presenting enriched context, and enabling semantic search. User feedback is gathered to improve the linking algorithms through additional training. The goal is to increase trust in the links for research purposes. Overall, the approach aims to enrich text collections by connecting content to external information sources.
DBpedia past, present and future - Dimitris Kontokostas. Covers recent developments in the Linked Data and knowledge graphs field and how DBpedia progresses with Wikipedia data.
Clustering output of Apache Nutch using Apache Spark (Thamme Gowda)
This document discusses clustering the output of Apache Nutch web pages using Apache Spark. It presents structural and style similarity measures to group similar web pages based on their DOM structure and CSS styles. Shared near neighbor clustering is implemented on the Spark GraphX library to cluster the web pages based on a similarity matrix without prior knowledge of cluster sizes or shapes. A demo is provided to visualize the clustered results.
Linked Data from a Digital Object Management System (Uldis Bojars)
Lightning talk about generating Linked Data from a digital object management system at the National Library of Latvia. Conference: http://swib.org/swib12/programme.php
Introduction to Web Mining and Spatial Data Mining (AarshDhokai)
From the Data Warehousing and Mining subject offered at Gujarat Technological University in the Information Technology branch.
This topic is from Chapter 8, Advanced Topics.
Enterprise Knowledge Graphs allow organizations to integrate heterogeneous data from various sources and represent them semantically using common vocabularies and ontologies. This facilitates linking and querying of related information across organizational boundaries. Knowledge graphs provide a holistic view of enterprise data and support various applications through their use as a common background knowledge base. However, building and maintaining knowledge graphs at scale poses challenges regarding data quality, coherence, and evolution of the knowledge representation over time.
FAIRDOM data management support for ERACoBioTech Proposals (FAIRDOM)
This document provides information about a webinar from the FAIRDOM Consortium on data management for ERACoBioTech full proposals. It includes:
- Details on how to budget for and include a data management plan in proposals
- A checklist for developing a data management plan covering topics like the types and volumes of data, data sharing and reuse, and making data FAIR
- An overview of the FAIRDOM services and software platform that can help with project data management and stewardship
Repositories are systems mainly used to store and publish academic content. This presentation discusses why repository contents should be published as Linked (Open) Data and how repositories can be extended to do so.
A possible future role of schema.org for business reporting (sopekmir)
The presentation demonstrates a vision for the “reporting extension” that could enhance the processes related to business reporting and the role it could have for the SBR vision.
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT (Tony Ross-Hellauer)
OpenAIRE and EUDAT co-present this webinar which aims to introduce researchers and others to the concept of research data management (RDM). As well as presenting the benefits of taking an active approach to research data management – including increased speed and ease of access, efficiency (fund once, reuse many times), and improved quality and transparency of research – the webinar will advise on strategies for successful RDM, resources to help manage data effectively, choosing where to store and deposit data, the EC H2020 Open Data Pilot and the basics of data management, stewardship and archiving.
Webinar recording available: http://www.instantpresenter.com/eifl/EB57D6888147
Analytics and Access to the UK web archive (Lewis Crawford)
The document summarizes the background, purpose, and methods of the UK Web Archive. It discusses how the archive collects, stores, and provides access to snapshots of UK websites over time to preserve digital cultural heritage. It also describes challenges of scale due to the immense size of web content and techniques like full-text search and data analytics that are used to facilitate discovery of information within the archive.
2013 DataCite Summer Meeting - Elsevier's program to support research data (H... (datacite)
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
CLARIAH Toogdag 2018: A distributed network of digital heritage information (Enno Meijers)
Slides of my keynote at the CLARIAH Toogdag 2018 on 9 March at the National Library of the Netherlands. The main topics were the development of the distributed digital heritage network and the alignment to and cooperation with the CLARIAH infrastructure and data. It also points at some of the current limitations of the semantic web technology.
Medical Heritage Library (MHL) on ArchiveSpark (Helge Holzmann)
This presentation gives an introduction to ArchiveSpark and the recent extension to use it with any archival collection. The slides demonstrate how to set it up and use it for analyzing data from medical journals of the Medical Heritage Library (MHL).
This document provides an overview of the key concepts around big data and Hadoop. It discusses big data sources and challenges, including capturing, storing, searching, sharing, transferring, analyzing and presenting large amounts of data. It then describes how Hadoop provides a cost-effective solution for storage and processing of big data using a distributed architecture. Finally, the document outlines the core components of Hadoop including the Hadoop Distributed File System for storage and MapReduce for distributed processing.
Dec'2013 webinar from the EUCLID project on managing large volumes of Linked Data
webinar recording at https://vimeo.com/84126769 and https://vimeo.com/84126770
more info on EUCLID: http://euclid-project.eu/
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB... (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project.
"This talk will provide an overview of challenges in designing convergent HPC and BigData software stacks on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit HPC scheduler (SLURM), parallel file systems (Lustre) and NVM-based in-memory technology will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 85 countries). More than 518,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,900 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow framework are available from https://hidl.cse.ohio-state.edu.
Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Watch the video: https://youtu.be/1QEq0EUErKM
Learn more: http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document provides an overview of archival technologies presented at the 46th Annual Georgia Archives Institute on June 10-21, 2013. The presentation introduces various archival management tools like Archon and Archivists' Toolkit for managing archival collections. It also discusses digital collection management software such as CONTENTdm and Islandora. Emerging standards, formats and linked open data initiatives are also covered. The goal is to help archivists identify existing and new technologies that can help manage and provide access to archival materials.
Engineering patterns for implementing data science models on big data platforms (Hisham Arafat)
Discussion of practically implementing data science models on big data platforms from an engineering perspective, and an eye opener on the engineering factors associated with designing a working solution. We use a simple text mining example on social media analytics for brand marketing. At first it seems a simple solution; however, digging into the implementation aspects of even a simple analytics model reveals the degree of complexity in each part of the solution. An abstraction of the key Big Data advantages is very helpful for selecting appropriate Big Data technology components out of a very large landscape. Two referenced examples are given: one using the Lambda Architecture, and one showing an unusual way of doing image processing using the Big Data abstraction provided.
Introduction to ArchiveSpark, given at the WebSci' 2016 Hackathon: Exploring the Past of the Web: Alexandria & Archive-It Hackathon http://www.websci16.org/hackathon https://github.com/helgeho/ArchiveSpark
The document provides an overview of open data principles and techniques for publishing and linking structured data on the web. It discusses key open data principles like availability and universal participation. It then covers two main techniques: linked (open) data, which uses semantic web technologies like RDF and URIs to interlink data; and microdata, which embeds structured data in HTML using tags. The presenter provides examples of open government data and linked open datasets. The goal is to create a single globally connected data space on the web through open sharing of structured data using these techniques.
More and more organizations are moving their ETL workloads to a Hadoop based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Topics covered include the pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Minimizing the Complexities of Machine Learning with Data Virtualization (Denodo)
Watch full webinar here: https://buff.ly/309CZ1Y
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark and complex libraries for R, Python and Scala put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
*How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
*How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
*How you can use the Denodo Platform with large data volumes in an efficient way
*About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
This document discusses managing electronic resources using linked data. It proposes developing a technical solution to convert relevant data into linked data through a data management platform, aggregate the data in a unified triplestore, and access/manipulate it via a generic GUI built on OntoWiki. The application aims to provide a flexible electronic resource management system by reusing existing vocabularies and defining new properties following ERMI guidelines. It will be open source and developed through an official OntoWiki repository in partnership between Leipzig University Library, SLUB Dresden, AKSW, and InfAI.
Research Data (and Software) Management at Imperial: (Everything you need to ... (Sarah Anna Stewart)
A presentation on research data management tools, workflows and best practices at Imperial College London with a focus on software management. Presented at the 2017 session of the HPC Summer School (Dept. of Computing).
Wednesday 6 May: Hand me the data! What you should know as a humanities resea... (WARCnet)
Wednesday 6 May: Hand me the data! What you should know as a humanities researcher before asking for data from a web archive, Ulrich Have, NetLab/DIGHUMLAB, Aarhus University
Digital library services and the changing environment (John MacColl)
The document discusses the changing environment for digital library services. It argues that libraries need to both concentrate their resources at a network level and diffuse their data and services through open sharing on the web. This will allow libraries to better expose their collections to users where they search online. The document also advocates for mass digitization of collections and putting materials online at "web scale" to make previously undiscovered resources accessible.
This document provides an agenda and overview of a talk on big data and data science given by Peter Wang. The key points covered include:
- An honest perspective on big data trends and challenges over time.
- Architecting systems for data exploration and analysis using tools like Continuum Analytics' Blaze and Numba libraries.
- Python's role in data science for its ecosystem of libraries and accessibility to domain experts.
Linguistic Linked Open Data, Challenges, Approaches, Future Work (Sebastian Hellmann)
Hellmann keynote TKE (2016), Challenges, Approaches and Future Work for Linguistic Linked Open Data (LLOD)
While the Linguistic Linked Open Data (LLOD) Cloud (http://linguistic-lod.org/) has evolved beyond expectations - thanks to the effort of a vibrant community - overall progress has to be seen under a more scrutinizing light.
Initial challenges which have been formulated by Christian Chiarcos, Sebastian Nordhoff and me as early as 2011[1][2] have been discussed extensively in the LDL, MLODE and NLP & DBpedia workshop series and in several W3C community groups. In particular, the LIDER FP7 project (http://www.lider-project.eu/) - originally conceived to tackle these challenges and build a Linguistic Linked Open Data Cloud - rather gave them more shape and uncovered that there is yet quite a long road ahead to solve problems such as proper metadata, contextualisation of knowledge, data quality, hosting, open licensing and provenance, timely updated network links, knowledge integration and interoperability on the largest possible scale - the Web.
The invited talk attempts to give a full account of these abovementioned challenges and presents and critically evaluates pertinent efforts and approaches including evolving standards such as the NLP Interchange Format (NIF)[3][4], DataID[5], SHACL[6], lemon[7] and the LIDER guidelines[8] as well as practical services such as LingHub[9], LODVader[10], RDFUnit[11] (just to mention a few).
As a glimmer of hope, the talk will conclude with the recent efforts of the DBpedia community to coordinate the creation of a public data infrastructure for a large, multilingual, semantic knowledge graph, which is, of course, not a panacean golden hammer, but a potential step in the right direction to bridge the gap between language and knowledge.
________________
[1] Towards a Linguistic Linked Open Data cloud : The Open Linguistics Working Group (http://www.atala.org/IMG/pdf/Chiarcos-TAL52-3.pdf ) Christian Chiarcos, Sebastian Hellmann, and Sebastian Nordhoff. TAL 52(3):245 - 275 (2011)
[2] Linked Data in Linguistics. Representing Language Data and Metadata (http://www.springer.com/computer/ai/book/978-3-642-28248-5 ) Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (Eds.). Springer, Heidelberg, (2012)
[3] http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core
[4] https://www.w3.org/community/ld4lt/
[5] http://wiki.dbpedia.org/projects/dbpedia-dataid
[6] http://w3c.github.io/data-shapes/shacl/
[7] https://www.w3.org/2016/05/ontolex/
[8] http://www.lider-project.eu/guidelines
[9] http://linghub.lider-project.eu/
[10] http://lodvader.aksw.org/
[11] http://aksw.org/Projects/RDFUnit
Coping Strategies for the Death of Unlimited Storage (Globus)
Presented at GlobusWorld 2022 by a set of panelists moderated by Bob Flynn from Internet2. Panelists offer their perspectives on migrating between cloud storage providers.
INNOVATION AND RESEARCH (Digital Library Information Access) (Libcorpio)
Innovation and research, Digital Library Information Access, LIS Education, Library and Information Science, LIS Studies, Information Management, Education and Learning, Library science, Information science, Digital Libraries, Research on Digital Libraries, DL, Innovation in libraries and publishing, Areas of Research for DL, Information Discovery, Collection Management and Preservation, Interoperability, Economic, Social and Legal Issues, Core Topics In Digital Libraries, DL Research Around The World
Similar to Web Data Engineering - A Technical Perspective on Web Archives
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai... (Kaxil Naik)
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
A presentation that explains Power BI licensing.
Web Data Engineering - A Technical Perspective on Web Archives
1. Web Data Engineering:
A Technical Perspective on Web Archives
Dr. Helge Holzmann
Web Data Engineer
Internet Archive
helge@archive.org
Open Repositories 2019
Hamburg, Germany
June 12, 2019
2. What is a web archive?
• Web archives preserve our history as documented on the web…
• … in huge datasets, consisting of all kinds of web resources
• e.g., HTML pages, images, video, scripts, …
• … stored as big files in the standardized (W)ARC format
• along with metadata + request / response headers
• next to lightweight capture index files (CDX), as illustrated below
• … to provide access to webpages from the past
• for users through close reading
• replayed by the Wayback Machine
• for data analysis at scale through distant-reading
• enabled by Big Data processing methods, like Hadoop / Spark, …
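As an illustration of a CDX index entry: each capture is summarized by one line of metadata. The exact field layout varies between deployments, so this line is schematic only:

org,example)/page 20190612084512 http://example.org/page text/html 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 2153 843194211 EXAMPLE-20190612-00001.warc.gz

(SURT-formatted URL key, timestamp, original URL, MIME type, HTTP status, payload digest, redirect, robot flags, compressed record length, WARC file offset, WARC filename)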
5. Not today's topic …
http://blog.archive.org/2016/09/19/the-internet-archive-turns-20
6. The (archived) web…
• ... is a very valuable dataset to study the web (and the offline world)
• Access to very diverse knowledge from various disciplines (history, politics, …)
• The whole web at your fingertips / processable snapshots
• Adds a temporal dimension to the Web / captures dynamics
• ... is a widely unstructured collection of data
• Access and analysis at scale is challenging
• Processing petabytes of data is expensive and time-consuming
• Difficult to discover, identify, and extract records and the information they contain
• Potentially highly technical, complex access and parsing process
• Low-level details users / researchers / data scientists don't want to / can't deal with
• Data engineering is needed before the data can be used in downstream applications / studies
7. Different perspectives on web archives
• User-centric View
• (Temporal) Search / Information Retrieval
• Direct access / replaying archived pages
• Data-centric View
• (W)ARC and CDX (metadata) datasets
• Big data processing: Hadoop, Spark, …
• Content analysis, historical / evolution studies
• Graph-centric View
• Structural view on the dataset
• Graph algorithms / analysis, structured information
• Hyperlink and host graphs, entity / social networks, facts and more
[Helge Holzmann. Concepts and Tools for the Effective and Efficient Use of Web Archives. PhD thesis 2019]
8. Web (archives) as graph
• Foundational model for most downstream applications / analysis tasks
• E.g., Search index construction, term / entity co-occurrence studies, …
• Different ways / approaches to construct / extract (temporal) graphs
• (Temporal) hyperlinks (hosts vs. URLs), social networks, knowledge graphs, etc.
• Technical challenges that users don't want to / can't deal with:
• Efficient generation, effective representation, …
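A minimal sketch of one way to build a temporal host graph, assuming link triples (timestamp, source URL, target URL) have already been extracted from the captures; all names here are illustrative:

import java.net.URL

// Example input: (capture timestamp, source URL, target URL)
val links = Seq(("20190612084512", "http://example.org/page", "http://example.com/"))

def host(url: String): String = new URL(url).getHost

// Aggregate URL-level links into host-level edges,
// keeping the earliest capture time observed per edge
val hostEdges = links
  .map { case (ts, src, dst) => ((host(src), host(dst)), ts) }
  .groupBy(_._1)
  .map { case (edge, captures) => (edge, captures.map(_._2).min) }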
9. (Temporal) search in web archives
• Wanted: Enter a textual query, find relevant captures
• Challenges:
• Documents are temporal / consist of multiple versions
• New captures could be near-duplicates or contain relevant changes
• Temporal relevance in addition to textual relevance
• Relevance to the query is not always encoded in the content
• Information needs / query intents are different from traditional IR
• Mostly navigational: Under which URL can I find a specific resource?
• How to turn (temporal) graphs into a searchable index?
• Integrate full-text, titles, headlines, anchor texts, ...?
• Convert into a format supported by Information Retrieval systems, e.g. ElasticSearch
• Adaptation of existing retrieval models
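A minimal sketch of what a searchable capture document could look like before being fed into an IR system such as ElasticSearch; the field names are illustrative, not a prescribed schema:

case class CaptureDoc(
  url: String,          // canonical URL of the capture
  timestamp: String,    // 14-digit capture timestamp, e.g. "20190612084512"
  title: String,        // title extracted from the HTML
  fulltext: String,     // visible page text
  anchors: Seq[String]  // anchor texts of inbound links, taken from the graph
)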
10. Web Data Engineering
• Transforming data into useful information
• Making it usable for downstream applications
• Search, data science, digital humanities, content analysis, ...
• Regular users, researchers, data scientists / analysts, ...
• Enabling efficient and effective access through...
• ... infrastructures
• ... suitable data formats
• ... simple tools / APIs
• ... optimized indexes
• Technical considerations made by computer scientists
• to help users / researchers focus on their application / study / research
• to hide complexity / low-level details through flexible abstractions
11. Example: Language Analysis (1)
• Possible research questions:
• Which pages of a language exist outside the country's ccTLD?
• Which languages are used the most in a certain area / topic?
• How has a language evolved over time on the web?
• Requirements:
• Tools for (W)ARC access, HTML parsing, language detection
• Language-annotated pages / captures
• Challenges:
• Texts too short to detect a language / confidence scores
• Multiple languages on one page / filtering and weighting
• Slow and expensive processing due to large-scale content analysis (weeks)
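A minimal sketch of the filtering and weighting step from the challenges above, assuming a language detector has already produced per-language confidence scores for each capture (the detector, scores, and threshold are assumptions):

// Per-capture language scores, e.g. from a detection library
val detections = Map(
  "http://example.de/seite" -> Map("de" -> 0.92, "en" -> 0.05),
  "http://example.org/short" -> Map("en" -> 0.41)  // too short to be confident
)

val minConfidence = 0.8  // assumed threshold
// Keep only captures with one dominant, confidently detected language
val dominantLanguage = detections.flatMap { case (url, scores) =>
  val (lang, conf) = scores.maxBy(_._2)
  if (conf >= minConfidence) Some(url -> lang) else None
}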
15. Fatcat.wiki (big catalog)
• At-scale web harvesting of scholarly works
• with descriptive metadata and full-text
• linked with versions and secondary outputs
• API-first accessible / editable system
16. Challenge: the Internet Archive is big
• Web archive / Wayback Machine
• 20+ years of web
• 625+ library and other partners
• 753,932,022,000 (captured) URLs
• 362 billion web pages
• More than 5,000 URLs archived every second
• 40+ petabytes
• And there's more:
17. Challenge: web archives are Big Data
• Processing requires computing clusters
• i.e., Hadoop, YARN, Spark, …
• Web archive data is heterogeneous, may include text, video, images, …
• Common header / metadata format, but various / diverse payloads
• Requires cleaning, filtering, selection, extraction before processing
• MapReduce or variants
• Homogeneous data types / formats
• Distributed batch processing
• load → transform
• aggregate → write
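A minimal sketch of this load → transform → aggregate → write pattern in plain Spark, here counting captures per MIME type from CDX lines (paths and the field position are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mime-counts").getOrCreate()
// load: one CDX line per capture
val cdx = spark.sparkContext.textFile("hdfs:///path/to/cdx/*.cdx")
// transform: pull out the MIME type field (position depends on the CDX flavor)
val mimes = cdx.map(_.split(" ")).filter(_.length > 3).map(f => (f(3), 1L))
// aggregate: count captures per MIME type
val counts = mimes.reduceByKey(_ + _)
// write: persist the aggregates
counts.saveAsTextFile("hdfs:///path/to/mime-counts")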
18. Trade-off: data locality vs. random access
• Direct access allows for exploiting data locality
• Moving computations to the data / sequential scans
• Indirect access with selective random accesses
• Scanning sequentially when only a small selection is needed results in wasted reads (PB)
19. Efficient processing
• Indirect access via lightweight metadata (CDX)
• Basic operations on metadata before touching the archive (filter, group, sort)
• E.g., offline pages, data types (scripts, styles, images, ...), domains
• Enriching records with data from payload for downstream applications
• E.g., titles, headlines, links, part-of-speech, named entities, ...
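A minimal sketch of this indirect, metadata-first access pattern; the field positions follow one common 11-field CDX layout and are an assumption, as is the file name:

import scala.io.Source

// Step 1: select captures using the lightweight CDX metadata only
val selected = Source.fromFile("captures.cdx").getLines()
  .map(_.split(" "))
  .filter(f => f.length > 10 && f(4) == "200" && f(3) == "text/html")
  .map(f => (f(10), f(9).toLong))  // (WARC filename, byte offset) per match
  .toList
// Step 2: seek directly to these (file, offset) positions in the archive,
// so only the selected payloads are ever read from the petabytes of WARC data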
20. Sparkling data processing ☆
• (Internal) data processing library based on Apache Spark
• Goal to integrate all APIs to work with (temporal) web data in one library
• Continuous work in progress, growing with every new task
• Rich in features
• Efficient CDX / (W)ARC loading, parsing and storing from HDFS, Petabox, …
• Fast HTML processing without expensive DOM parsing (SAX-like)
• Internal PetaBox authentication / access features
• ATT / CDXA attachment loaders and writers
• Shell / Python integration for computing derivations
• Distributed budget-aware repartitioning (e.g., 1GB per partition / file)
• Advanced retry / timeout / failure handling
• Lots of utilities for logging, file handling, string operations, URL/SURT formatting, …
• Easily configurable, library-wide constants and settings
• …
21. ArchiveSpark
• Expressive and efficient data access and processing
• Declarative workflows, seamless two-step loading approach
• Open source
• Available on GitHub: https://github.com/helgeho/ArchiveSpark
• with documentation, docker image, and recipes for common tasks
• Modular / extensible
• Various DataSpecifications and EnrichFunctions
• ArchiveSpark-server: Web service API for ArchiveSpark
• https://github.com/helgeho/ArchiveSpark-server
• Generalizable for archival collections beyond Web archives
• …
[Helge Holzmann, Vinay Goel and Avishek Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. JCDL 2016]
[Helge Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. IEEE BigData 2017]
22. Simple and expressive interface
• Based on Spark, powered by Scala
• This does not mean you have to learn a new programming language!
• The interface is rather declarative / no deep Scala or Spark knowledge required
• Simple data accessors are included
• Provide simplified access to the underlying data model
• Easy extraction / enrichment mechanisms
• Customizable and extensible by advanced users
// Load WARC records via their CDX metadata index (two-step loading)
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))
// Keep only successful HTML captures, based on metadata alone
val onlineHtml = rdd.filter(r => r.status == 200 && r.mime == "text/html")
// Enrich the remaining records with named entities from their payloads
val entities = onlineHtml.enrich(Entities)
// Save the enriched records as gzipped JSON
entities.saveAsJson("entities.gz")
23. Familiar, readable, reusable output
• Nested JSON output encodes lineage of applied enrichments
• (Figure: nested JSON example with the fields title, text, entities / persons)
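A hypothetical sketch of such nested output; the exact field layout depends on the applied enrichments and the ArchiveSpark version, so this is illustrative only:

{
  "record": { "url": "http://example.org/", "timestamp": "20190612084512" },
  "title": "Example page",
  "text": {
    "entities": {
      "persons": ["…"]
    }
  }
}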
24. Benchmarks vs. Spark / HBase
• Three scenarios, from basic to more sophisticated:
a) Select one particular URL
b) Select all pages (MIME type text/html) under a specific domain
c) Select the latest successful capture (HTTP status 200) in a specific month
• Benchmarks do not include derivations
• Those are applied on top of all three methods and involve third-party libraries
25. New ArchiveSpark (3.0) very soon
• Major overhaul
• Streamlined dependencies and package structure
• Even more simplified API
• Lots of bug fixes and improvements
• Will be largely based on / include parts of Sparkling
• org.archive.archivespark.sparkling
• Will benefit from Sparkling fixes and updates
• Almost ready
• Please have a little patience and check back soon…
• Follow / star / watch on GitHub
• https://github.com/helgeho/ArchiveSpark
26. We're at your service!
• Archive-It Research Services (ARS)
• WAT (extended metadata files)
• LGA (temporal graphs)
• WANE (named entities)
• Special Seed Services (Artificial Zone Files)
• Language + GeoIP analysis
• Nation Wide Web (NWW) Search
• Customized / regional web + media search
• APIs
• WASAPI data-transfer API (Archive-It)
• Availability API + CDX Server (Wayback)
• More to come soon, stay tuned…
27. Thank you!
Helge Holzmann (helge@archive.org)
• archive.org
• archive-it.org
• fatcat.wiki
• github.com/helgeho/ArchiveSpark
Questions?
www.HelgeHolzmann.de
If interested in our work,
please get in touch!