This is the second part of a joint presentation I did with Jimmy Lin (University of Maryland) at the "Web Archiving Collaboration: New Tools and Models" conference at Columbia University, New York, NY, on 4 June 2015.
As of Drupal 7, RDFa markup ships in core. In this session I will:
- explain the implications of this and why it matters
- give a short introduction to the Semantic Web, RDF, RDFa and SPARQL in human language
- give a short overview of the RDF modules that are available in contrib
- talk about some of the potential use cases of all these magical technologies
This document summarizes the history and evolution of Apache Hadoop over the past 10 years. It discusses how Hadoop originated from Doug Cutting's work on Nutch in 2002. It grew to include HDFS for storage and MapReduce for processing. Yahoo was an early large-scale user. The community has expanded Hadoop to include over 25 components like Hive, HBase, Spark and more. The open source model and ability to adapt have helped Hadoop succeed and it will continue to evolve to handle new data sources and cloud deployments in the next 10 years.
The Mint is an authority control / vocabulary server designed to supply authority services to repositories. It is designed to be a practical tool for working towards Linked Data repositories, making it easy to build high-quality metadata collection and discovery systems.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
Slow-cooked data and APIs in the world of Big Data: the view from a city per... (Oscar Corcho)
The document discusses slow-cooked data and APIs from a city perspective. It draws an analogy between big data/fast food and slow-cooked/linked open data. It outlines six rules for slow-cooking data: 1) appropriately segment datasets, 2) annotate data with semantics, 3) provide multiple data formats, 4) engage children in data contribution and use, 5) use open data internally before publishing, and 6) leverage common data structures for interoperability like fast food franchises do. The goal is to cook open data in a way that is both useful and reusable.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) (Uwe Printz)
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
This document provides an overview of NoSQL schema design and examples using a document database like MongoDB or MapR-DB. It discusses how to model complex, flexible schemas to store object-oriented data like products, users, and music catalog information. Examples show how a music database could be reduced from over 200 tables to just a few collections by embedding objects and references. Flexible schemas in a document database more closely match object models and allow easy evolution of the data model.
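For readers who have not seen a document model before, here is a small, hypothetical sketch of the embedding idea using pymongo; the collection, field names, and example album are invented for illustration and are not taken from the summarized slides:

# Illustrative only: a single "albums" document embedding data that a
# relational design would spread across album, artist, and track tables.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
albums = client.music.albums

albums.insert_one({
    "title": "Kind of Blue",
    "artist": {"name": "Miles Davis", "genre": "Jazz"},   # embedded object
    "tracks": [                                           # embedded array
        {"no": 1, "title": "So What", "length_sec": 545},
        {"no": 2, "title": "Freddie Freeloader", "length_sec": 586},
    ],
})

# One query reaches into the embedded structure; no joins required.
print(albums.find_one({"tracks.title": "So What"}, {"title": 1, "artist.name": 1}))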
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
XSPARQL is a query language that allows querying of both XML and RDF data sources simultaneously. It extends the syntax of XQuery with a SPARQL-for clause to query RDF data and a CONSTRUCT clause to produce RDF output. XSPARQL 1.1 supports SPARQL 1.1 operators like aggregation, federation, negation and property paths. It also allows processing of JSON files. The XSPARQL evaluator takes an XSPARQL query, rewrites it, optimizes it, and executes it using XQuery and SPARQL engines to retrieve and combine data from different sources into a unified XML or RDF answer.
This document discusses linked spatial data and spatial data infrastructures. It provides examples of using URIs to represent spatial things and linking spatial datasets. Key points discussed include:
1. Using URIs and HTTP to identify spatial things like locations and allowing information about those things to be retrieved in different formats like RDF and GML.
2. Examples of using linked spatial data for tasks like looking up information, identifying locations, linking datasets, and querying spatial relationships between objects.
3. Initiatives to link spatial metadata standards like ISO19115 to open data schemas like DCAT-AP to make spatial data more accessible on the web.
4. Revenue models for linked data providers including public funding, advertisements, and
This document provides an introduction and overview of Apache Hive, including what it is, its architecture and components, how it is used in production, and performance considerations. Hive is an open source data warehouse system for Hadoop that allows users to query data using SQL-like language and scales to petabytes of data. It works by compiling queries into a directed acyclic graph of MapReduce jobs for execution. The document outlines Hive's architecture, components like the metastore and Thrift server, and how organizations use it for log processing, data mining and business intelligence tasks.
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ... (Data Con LA)
The Apriori Algorithm is an unsupervised learning technique for producing associative rules. This talk will explain the algorithm's implementation, explore how effective it can be when applied to big data, discuss how we use it at DataScience to do market basket analysis, and demonstrate some novel use cases involving the million song database, recipes, and other applications involving open data.
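As a rough illustration of the technique (not the DataScience implementation referred to above), a bare-bones frequent-itemset pass of Apriori can be written in a few lines of Python; the grocery baskets are invented sample data:

# Minimal Apriori sketch: frequent itemsets only, no rule generation.
from itertools import combinations

def apriori(transactions, min_support=0.5):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Start with candidate 1-itemsets.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Candidate (k+1)-itemsets built from unions of surviving k-itemsets.
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "beer"}, {"milk", "bread", "beer"}, {"bread"}]
print(apriori(baskets, min_support=0.5))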
Maintaining scholarly standards in the digital age: Publishing historical gaz... (Humphrey Southall)
This presentation: (1) Discusses why providing detailed attributions of individual contributions is essential to large-scale sharing of historical research data; (2) Provides a short introduction to Open Linked Data; (3) Introduces the PastPlace Gazetteer API (Applications Programming Interface), explaining components of the RDF it generates using the example of Oxford, UK; (4) Notes that most open data projects use the Creative Commons -- Must Acknowledge license (CC-BY) while not actually acknowledging contributors within their RDF, then shows how we do it; (5) Introduces the separate PastPlace Datafeed API, which implements the W3C Datacube Vocabulary.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) (Uwe Printz)
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Querying 1.8 billion reddit comments with python (Daniel Rodriguez)
The document is about querying over 1.6 billion Reddit comments using Python. It discusses:
1) Moving the Reddit comment data from S3 to HDFS and converting it to the Parquet format for efficiency.
2) Using the Blaze and Ibis Python libraries to query the data through Impala, allowing SQL-like queries with a Pandas-like API.
3) Examples of queries, like counting total comments or comments in specific subreddits, and plotting the daily frequency of comments in the /r/IAmA subreddit.
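For readers unfamiliar with Ibis, a query of the kind described above might look roughly like the sketch below; the host, database, table, and column names are placeholders rather than the setup used in the talk, and the classic ibis.impala backend is assumed:

# Hypothetical Ibis session against an Impala table of Reddit comments.
import ibis

con = ibis.impala.connect(host="impala-host", port=21050)       # assumed endpoint
comments = con.table("comments", database="reddit_parquet")     # assumed table

# Total number of comments.
print(comments.count().execute())

# Daily comment counts for /r/IAmA, returned as a pandas DataFrame.
iama = comments[comments.subreddit == "IAmA"]
daily = iama.group_by("day").aggregate(iama.count().name("n"))  # assumes a 'day' column
print(daily.sort_by("day").execute().head())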
Presentation on RDF Stream Processing models given at the SR4LD tutorial (ISWC 2013) -- updated version at: http://www.slideshare.net/dellaglio/rsp2014-01rspmodelsss
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
1. The document discusses using Elasticsearch for full text search in Python applications. It provides an overview of how Elasticsearch works, including inverted indexes and normalization.
2. Instructions are given on setting up Elasticsearch and integrating it with Django applications using Haystack. Haystack allows adding search functionality to models and provides a SearchQuerySet API.
3. The document covers topics like Elasticsearch settings, updating indexes, using signals for real-time updates, and the pros and cons of Haystack, such as its ORM-like interface but some loss of performance.
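As a point of reference, a typical Haystack index definition and query look roughly like the sketch below; the Note model and its fields are hypothetical, and the Elasticsearch backend is configured separately in Django settings:

# search_indexes.py -- illustrative Haystack index for a hypothetical Note model.
from haystack import indexes
from myapp.models import Note  # assumed Django model

class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # Main document field; its content is rendered from a template.
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr="author")
    pub_date = indexes.DateTimeField(model_attr="pub_date")

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

# Elsewhere (e.g. a view): querying through the SearchQuerySet API.
from haystack.query import SearchQuerySet
results = SearchQuerySet().models(Note).filter(content="elasticsearch")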
IPython Notebook as a Unified Data Science Interface for Hadoop (DataWorks Summit)
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
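A hedged sketch of what one such notebook cell could contain is shown below; it uses the current SparkSession API rather than the Spark 1.x contexts of the era, and the CSV path and column names are placeholders for the open-payments data:

# Sketch of a notebook cell mixing PySpark, Spark SQL, and inline plotting.
# File path and column names are placeholders, not the talk's actual data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-payments-demo").getOrCreate()

payments = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/open_payments.csv"))
payments.createOrReplaceTempView("payments")

top = spark.sql("""
    SELECT manufacturer, SUM(amount) AS total
    FROM payments
    GROUP BY manufacturer
    ORDER BY total DESC
    LIMIT 10
""").toPandas()

top.plot.barh(x="manufacturer", y="total")  # rendered inline in the notebook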
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji (Data Con LA)
Abstract: Of all the developer delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets - available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline its performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
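To ground the comparison, here is the same small aggregation written once against the RDD API and once against the DataFrame API (Datasets are JVM-only and have no direct PySpark equivalent); the word list is toy data, not an example from the talk:

# Average word length per initial letter, computed two ways; illustrative data only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

words = ["hadoop", "spark", "hive", "hbase", "pig"]

# RDD API: explicit functions over (key, value) pairs.
rdd_avg = (sc.parallelize(words)
             .map(lambda w: (w[0], (len(w), 1)))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda s: s[0] / s[1]))
print(rdd_avg.collect())

# DataFrame API: declarative columns, optimized by Catalyst.
df = spark.createDataFrame([(w,) for w in words], ["word"])
(df.withColumn("initial", F.substring("word", 1, 1))
   .groupBy("initial")
   .agg(F.avg(F.length("word")).alias("avg_len"))
   .show())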
This document discusses integrating R with Hadoop. It begins with an introduction to R and its uses for statistical analysis and data visualization. It then discusses how R can be used with Hadoop to analyze large datasets stored in Hadoop and to execute R code using Hadoop. Examples of R packages that interface with Hadoop components like HDFS, HBase, and MapReduce are provided. Guidelines are given for when it makes sense to integrate R and Hadoop versus using them separately.
Hadoop is an open-source distributed processing framework created by Doug Cutting in 2005 at Yahoo. It allows for the distributed processing of large datasets across clusters of computers using simple programming models. Hadoop addresses problems of velocity, volume and variety of big data by distributing storage and processing across clusters of low-cost commodity hardware. It provides reliable storage and processing of large datasets through its Hadoop Distributed File System and MapReduce programming model.
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what is Big Data, What are the limitations of the existing solutions for Big Data problem; How Hadoop solves the Big Data problem, What are the common Hadoop ecosystem components, Hadoop Architecture, HDFS and Map Reduce Framework, and Anatomy of File Write and Read.
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
Case study of Rujhaan.com (A social news app) (Rahul Jain)
Rujhaan.com is a news aggregation app that collects trending news and social media discussions from topics of interest to users. It uses various technologies including a crawler to collect data from social media, Apache Solr for search, MongoDB for storage, Redis for caching, and machine learning techniques like classification and clustering. The presenter discussed the technical architecture and challenges of building Rujhaan.com to provide fast, personalized news content to over 16,000 monthly users while scaling to growing traffic levels.
The document discusses open data and the CKAN open data catalog. It provides an overview of CKAN, including its data model and API. It also discusses open data initiatives like data.gov.uk and how CKAN is used to power open data portals around the world.
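For concreteness, CKAN's Action API can be exercised with a couple of plain HTTP calls, as in the sketch below; demo.ckan.org is just an example portal, and any CKAN instance exposes the same endpoints:

# Minimal CKAN Action API calls against an example portal.
import requests

base = "https://demo.ckan.org/api/3/action"   # example CKAN instance

# List dataset (package) names held by the catalog.
names = requests.get(f"{base}/package_list").json()["result"]
print(names[:5])

# Fetch full metadata, including resources, for one dataset.
if names:
    pkg = requests.get(f"{base}/package_show", params={"id": names[0]}).json()["result"]
    print(pkg["title"], len(pkg["resources"]))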
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. It also summarizes tools like Hive, HBase and Spark that can be used to analyze data stored on HDInsight clusters.
Schema.org: What It Means For You and Your Library (Richard Wallis)
This document summarizes a presentation about Schema.org given to the LITA Forum in Albuquerque, NM on November 7th, 2014. The presentation discussed what Schema.org is, the SchemaBibEx extension for bibliographic data, and examples of Schema.org being used. It also covered the challenges involved in mapping library metadata to Schema.org and proposals made by SchemaBibEx to address these challenges.
The document discusses Richard Wallis and his work extending Schema.org to better describe bibliographic data. Wallis is an independent consultant who chairs several W3C community groups focused on expanding Schema.org for bibliographic and archives data. He has worked with organizations like OCLC and Google to develop vocabularies that extend Schema.org to describe over 330 million bibliographic resources in linked data.
This document discusses YQL (Yahoo Query Language) which allows users to query and access data from various web services through a simple SQL-like syntax. It describes how YQL provides a standardized way to access data without having to read documentation for each individual API. The document provides examples of common data queries and lists some of the benefits of using YQL, such as consolidating multiple HTTP requests into a single request. It also notes that YQL simply rewrites queries into HTTP calls under the hood rather than using "voodoo magic".
Thesis Proposal: User Application Profiles for Publishing Linked Data in HTM... (Sean Petiya)
User Application Profiles for Publishing Linked Data in HTML/RDFa: Building a Semantic Web of Comic Book Metadata.
Kent State University - July 30, 2014
The objective is to present a case study for building a domain ontology and extending the usability and usage of that vocabulary by developing metadata application profiles for specific user groups. These objectives will be realized by a metadata vocabulary for the description of comic books and comic book collections, titled the Comic Book Ontology (CBO) and a series of schemata for encoding records using appropriate members of that ontology, specifically an XML schema and a corresponding minimal version. A set of metadata application profiles will also be developed to guide the publication of comic book data using the vocabulary by identified user groups, which include libraries, collectors, creators, retailers, and publishers, and will present recommended elements, guidelines, and examples of encoding data in the markup of existing hypertext systems using HTML5 and RDFa. The study then aims to extend the usability and usage of those schemata by presenting a methodology for building application profiles guided by the development of assumptive, data-driven personas. It will generate these personas through a review of systems used by each participant and an analysis of existing content. The study also seeks to demonstrate how an ontology can be applied to existing collaborative indexing projects, datasets, or research to enhance the visibility, reference, and utilization of those endeavors through their publication as Linked Data. The overall, and long-term, goal is to explore methods for bringing enhanced bibliographic control and organization to the comic book domain, allowing the creative and intellectual efforts of writers, artists, contributors, scholars, researchers, and collectors to be better combined and shared, and well represented in the Semantic Web.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
The document summarizes a datathon conducted using various COVID-19 datasets from different European web archives. The goals were to 1) create a sandbox for exploring the data, 2) conduct initial analysis to see what could be achieved, and 3) document the process. Different institutions provided different types of datasets, including seedlists, tweets, and derived datasets. Challenges included restrictions on sharing raw data and representing large collections. Preliminary analysis identified potential research questions and ways to study web archives, collections, and the pandemic response.
This document discusses Yahoo Query Language (YQL), which allows users to query and retrieve data from various web services through a simple SQL-like syntax. YQL acts as an API for services that may not otherwise have exposed data through APIs. The document provides examples of YQL queries to retrieve data from services like Google, Twitter, Foursquare and the New York Times. It highlights how YQL simplifies accessing web data by allowing complex operations to be performed with single HTTP requests.
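For flavour, a canonical YQL query of the era is shown below; note that Yahoo retired the public YQL endpoint in 2019, so treat this purely as an illustration of the SQL-like style the document describes:

# Illustrative only: the public YQL endpoint no longer returns data,
# but this shows how an SQL-like query was sent as a single HTTP request.
import requests

yql = ('select item.condition from weather.forecast '
       'where woeid in (select woeid from geo.places(1) where text="new york, ny")')
resp = requests.get(
    "https://query.yahooapis.com/v1/public/yql",   # historical endpoint, now retired
    params={"q": yql, "format": "json"},
)
print(resp.status_code, resp.json() if resp.ok else resp.text)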
Why do they call it Linked Data when they want to say...? (Oscar Corcho)
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing the goodness of Linked Data with outsiders but also when reviewing papers for the COLD workshop series, I find myself, on many occasions, going back to the principles to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to reach agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
Talk about Exploring the Semantic Web, and particularly Linked Data, and the Rhizomer approach. Presented August 14th 2012 at the SRI AIC Seminar Series, Menlo Park, CA
The document discusses exposing library holdings data on the web using linked data. It notes that OCLC has exposed over 300 million resources using Schema.org, RDFa, and links to controlled vocabularies. The data is available via various formats like RDF/XML, JSON-LD and Turtle. BIBFRAME is presented as the new standard for bibliographic description that allows library data to be shared as part of the web. Libraries are encouraged to make their resources discoverable on the web of data by linking to other institutions and authorities.
The document discusses open data and CKAN, an open source data portal. It provides an overview of CKAN, how it works, and its features for publishing and sharing open government data. It also discusses the use of CKAN on data.gov.uk and lessons learned from the project.
Linked data demystified: Practical efforts to transform CONTENTdm metadata int... (Cory Lampert)
This document outlines a presentation about transforming metadata from a CONTENTdm digital collection into linked data. It discusses the concepts of linked data, including defining linked data, linked data principles, technologies and standards. It then explains how these concepts can be applied to digital collection records, including anticipated challenges working with CONTENTdm. The document describes a linked data project at UNLV Libraries to transform collection records into linked data and publish it on the linked data cloud. It provides tips for creating metadata that is more suitable for linked data.
Big Data Analysis: Deciphering the haystack (Srinath Perera)
A primary outcome of big data is to derive useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like mean, max, and median, deriving an overall understanding of the data by building models, and finally deriving predictions from the data. In some cases we can afford to wait to collect and process the data, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill gaining ground, as well as realtime processing technologies like stream processing and complex event processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these different technologies, comparing and contrasting them.
Similar to Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Historian Use Case
Making Sense of Abundance: Opportunity and Challenges Across Three Web Archiv... (Ian Milligan)
These are the slides that I gave at the Canadian Society for Digital Humanities annual conference at the University of Ottawa, Ottawa, ON, on 3 June 2015.
Warcbase: Building a Scalable Platform on HBase and Hadoop - Part Two, Histor... (Ian Milligan)
This was the second part of a joint presentation I did with Jimmy Lin (Maryland) at the "Web Archiving Collaboration: New Tools and Models" conference at Columbia University, New York NY on 4 June 2015.
WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Thr... (Ian Milligan)
This is my presentation from the International Internet Preservation Consortium's 2015 annual meeting, held at Stanford University (Palo Alto, CA, USA).
Clustering Search to Navigate A Case Study of the Canadian World Wide Web as ... (Ian Milligan)
Here are the slides for the talk I gave at the Digital Humanities 2014 conference in Lausanne, Switzerland. The paper abstract is at http://dharchive.org/paper/DH2014/Paper-83.xml.
International Internet Preservation Consortium Research Slides from Ian Milligan
This document summarizes Ian Milligan's presentation titled "An Infinite Archive? Historical Explorations in the Internet Archive’s Wide Web Scrape". The presentation discusses how historians need new computational methods to make sense of the vast amount of information available in web archives like the Internet Archive's 80TB Wide Web Scrape from 2011. It proposes using named entity recognition and analyzing mentions of countries and provinces in different top-level domains to explore the content and scope of the archive. The presentation concludes by thanking the audience and providing Ian Milligan's contact information.
Historical Research Breakout Session Notes, WIRE 2014 (Ian Milligan)
This document summarizes a discussion around using web archives for historical research. Key points discussed include:
1) Historians are interested in using web archives for research but find it difficult due to technical challenges and a lack of necessary skills and resources.
2) Questions were raised around whether to create new research corpora from web archives or use existing ones, and how to ensure sources are properly contextualized.
3) An idea was proposed to focus historical research on the period of 1995-2000 using around 20TB of data from the Internet Archive along with other datasets, in order to better understand this formative early period of the web.
4) Developing tools and community resources could help address historians
Warcbase Building a Scalable Platform on HBase and Hadoop - Part Two: Historian Use Case
1. Warcbase: Building a Scalable Platform on HBase and Hadoop
Part Two: Historian Use Case
Jimmy Lin, University of Maryland, College Park, MD
Ian Milligan, University of Waterloo, Waterloo, ON, Canada
2. Why should a historian care?
The sheer amount of social, cultural, and political information generated every day presents new opportunities for historians.
5. Nightmare Scenario
• Wayback Machine won't be enough; we won't use that.
• Historians rely uncritically on date-ordered keyword search results, putting them at the mercy of search algorithms they do not understand;
• Historians are completely left out of post-1996 research, letting everybody else do the work (a la the Culturomics project/Nature magazine article);
• Our profession gets left behind…
6.
7. Unlocking an Archive-It Collection
• Archive-It has amazing collections of social, cultural, political, and economic records generated by everyday people, leaders, businesses, academics, and beyond.
• Stories waiting to be told.
• The data is there, but the problem is access.
8. Example Dataset
• Archive-It Collection 227, Canadian Political Parties and Political Interest Groups (University of Toronto)
• October 2005 - Present
• All major and minor political parties, as well as organized political interest groups (Council of Canadians, Coalition to Oppose the Arms Trade, Assembly of First Nations, etc.)
• Started by a now-retired librarian; hard to get details on the seed list
9. Two Main Approaches
• Warcbase
  • Link extraction and analytics
  • Full-text extraction and analytics
• Full-text faceted search
  • UK Web Archive's Shine Solr front end
11. Basic Link Statistics
• Count number of pages per domain
• Count number of links for each crawl so they can be normalized (very important)
• Run on the command line using relatively simple Pig scripts
12. Example Script (counting number of links for each crawl)

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractLinks org.warcbase.pig.piggybank.ExtractLinks();

raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader
      as (url: chararray, date: chararray, mime: chararray, content: bytearray);
a = filter raw by mime == 'text/html' and date is not null;
b = foreach a generate SUBSTRING(date, 0, 6) as date, url,
      FLATTEN(ExtractLinks((chararray) content, url));
c = group b by $0;
d = foreach c generate group, COUNT(b);
25. Text Analysis

register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';

DEFINE ArcLoader org.warcbase.pig.ArcLoader();
DEFINE ExtractRawText org.warcbase.pig.piggybank.ExtractRawText();
DEFINE ExtractTopLevelDomain org.warcbase.pig.piggybank.ExtractTopLevelDomain();

raw = load '/shared/collections/CanadianPoliticalParties/arc/' using ArcLoader
      as (url: chararray, date: chararray, mime: chararray, content: bytearray);
a = filter raw by mime == 'text/html' and date is not null;
b = foreach a generate SUBSTRING(date, 0, 6) as date,
      REPLACE(ExtractTopLevelDomain(url), '^s*www.', '') as url, content;
c = filter b by url == 'greenparty.ca';
d = foreach c generate date, url, ExtractRawText((chararray) content) as text;
store d into 'cpp.text-greenparty';
26. Text Analysis
• Now have a circumscribed corpus for the specified query (i.e. liberal.ca, ndp.ca, or conservative.ca)
• Can now use standard text analysis tools, etc. to extract meaning
• LDA (topic modeling)
• NER (named entity recognition); a rough sketch of this step follows below
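The slides themselves do not show the NER code, so the following is a minimal, hypothetical sketch of how the stored Pig output (tab-separated date, URL, and extracted text) could be fed into an off-the-shelf NER tagger and tallied per month; the NLTK pipeline and the cpp.text-greenparty path are assumptions, not the talk's actual tooling:

# Hypothetical post-processing of the Pig output above, copied out of HDFS to the
# local filesystem. Requires the NLTK punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, and words resources (nltk.download(...)).
import glob
from collections import Counter, defaultdict

import nltk
from nltk.tree import Tree

counts = defaultdict(Counter)  # e.g. {'200510': Counter({'Stephen Harper': ...})}

for part in glob.glob("cpp.text-greenparty/part-*"):
    with open(part, encoding="utf-8", errors="ignore") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            month, _url, text = fields[0], fields[1], fields[2]
            chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
            for node in chunks:
                if isinstance(node, Tree) and node.label() == "PERSON":
                    name = " ".join(tok for tok, _tag in node.leaves())
                    counts[month][name] += 1

for month, tally in sorted(counts.items()):
    print(month, tally.most_common(5))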
27. NER: October 2005
62476  Stephen Harper
30234  Michael Chong
30109  Gwynne Dyer
28011  ami Entrez
26238  Paul Martin
22303  Harper
28. NER: November 2008
3188  Stéphane Dion
2557  Stephen Harper
2471  Stephen HarperLaureen
2410  Dion
2356  Harper
30. Shine
• UK Web Archive's Shine (https://github.com/ukwa/shine)
• Indexing as bottleneck
• ~250GB of WARCs takes ~5 days on a single machine
• Hadoop indexer available if data is in HDFS
• ~90GB index size
32. Shine
• Advantages: accessible to the general public, easy to use, the interactive trend diagram allows digging down for context, and you can move down to the level of the document itself.
• Disadvantages: keyword searching requires you to know what to look for; random sampling is misleading when there are tens of thousands of records; etc.
• Doesn't take advantage of what makes web sources so powerful: hyperlinks