This document provides an overview of the project management software JIRA and its capabilities. It introduces JIRA concepts like projects, components, versions and issues. It describes how to customize workflows and filters in JIRA. The document also explains how to use the GreenHopper extension to manage projects using agile methodologies through planning boards, task boards and contexts. Overall, the document serves as an introduction to using JIRA for issue tracking and project management.
Presented by Adrien Grand, Software Engineer, Elasticsearch
Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries makes Lucene a perfect fit for analytics applications and, for some use cases, even a credible replacement for a primary data store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, give insights into how Lucene stores data on disk, and show how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
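To make that storage model concrete, here is a minimal Python sketch (with invented documents) of the inverted index idea at Lucene's core: instead of scanning rows like a traditional database, each term maps to a postings list of document IDs, and queries intersect those lists.

```python
from collections import defaultdict

# Invented sample corpus; real Lucene stores postings compressed on disk,
# inside immutable segments. This is illustration only.
docs = {
    0: "lucene makes data searchable",
    1: "lucene can power analytics",
    2: "databases store rows not postings",
}

# Build the inverted index: term -> sorted postings list of doc IDs.
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in sorted(set(text.split())):
        index[term].append(doc_id)

def search_and(*terms):
    """AND query: intersect postings lists instead of scanning all rows."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("lucene", "data"))  # -> [0]
```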
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval..., by Sease
1. The document discusses evaluating learning to rank models, including offline and online evaluation methods. Offline evaluation involves building a test set from labeled data and evaluating metrics like NDCG (see the sketch after this list), while online evaluation uses methods like A/B testing and interleaving to directly measure user behavior and business metrics.
2. Common mistakes in offline evaluation include having only one sample per query, single relevance labels per query, and unrepresentative test samples. While offline evaluation provides efficiency, online evaluation allows observing real user interactions and model impact on key metrics.
3. Recommendations are given to test models both offline and online, with online testing providing advantages like measuring actual business outcomes and interpreting model effects.
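For reference, here is a small sketch of the NDCG metric named in the offline-evaluation point above, computed from graded relevance labels; the labels and cutoff are invented for illustration.

```python
import math

def dcg_at_k(relevances, k):
    # Graded gain discounted by log2 of the (1-based) rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of documents in the order the model ranked them (invented).
ranked_labels = [3, 2, 3, 0, 1]
print(round(ndcg_at_k(ranked_labels, k=5), 4))
```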
Introduction to JIRA & Agile Project Management, by Dan Chuparkoff
This document provides an introduction to using JIRA for agile project management. It discusses key concepts like defining tasks, estimating task effort in story points, and using JIRA's agile tools like boards and burndowns. Screenshots show how to create and manage tasks in JIRA's different modes for Scrum and Kanban workflows.
InnoDB Scalability improvements in MySQL 8.0, by Mydbops
This document provides an overview of new InnoDB scalability improvements in MySQL 8.0, including improved read and write scalability. It discusses how the InnoDB architecture was updated to support read/write workloads on modern hardware more efficiently. The redo log was redesigned to be lock-less. Contention-aware transaction scheduling and other new features, like instant ALTER algorithms and temporary session tablespaces, were added to enhance performance.
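As one concrete touchpoint for these features, the sketch below runs an instant ALTER and inspects redo-log settings through the mysql-connector-python driver; ALGORITHM=INSTANT is documented MySQL 8.0 syntax, while the credentials and table name are placeholders.

```python
import mysql.connector  # pip install mysql-connector-python

# Placeholder connection details; adjust for your server.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="test")
cur = conn.cursor()

# MySQL 8.0 can add a column without rebuilding the table.
cur.execute("ALTER TABLE orders ADD COLUMN note VARCHAR(64), ALGORITHM=INSTANT")

# Inspect redo-log related settings touched by the lock-less redesign.
cur.execute("SHOW VARIABLES LIKE 'innodb_log%'")
for name, value in cur.fetchall():
    print(name, "=", value)

conn.close()
```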
This document discusses using NLP techniques like tokenization, feature extraction, classification, clustering, and anomaly detection to analyze log files. It provides examples of how each technique can be applied, including tokenizing log records, extracting features like n-grams and token shapes, classifying records by type or priority level, clustering records to find anomalies, and detecting outliers. The document also recommends tools like NLTK, Scikit-Learn, and Logpai, and references the author's own work at Insight Engines on log search and analysis products.
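A hedged sketch of that pipeline using the recommended scikit-learn stack: log records are vectorized into token n-grams and an unsupervised model flags outliers. The log lines and parameters are invented; this is one plausible arrangement, not the author's exact setup.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sample log records.
logs = [
    "INFO user=alice action=login status=ok",
    "INFO user=bob action=login status=ok",
    "INFO user=carol action=logout status=ok",
    "ERROR user=mallory action=login status=failed attempts=37",
]

# Tokenize records into word uni- and bi-grams, one of the feature types above.
vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vec.fit_transform(logs).toarray()

# Unsupervised anomaly detection over the vectorized records.
clf = IsolationForest(contamination=0.25, random_state=0)
labels = clf.fit_predict(X)  # -1 marks an outlier

for record, label in zip(logs, labels):
    print("ANOMALY" if label == -1 else "normal ", record)
```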
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl..., by Lucidworks
This document summarizes Bloomberg's use of machine learning for search ranking within their Solr implementation. It discusses how they process 8 million searches per day and need machine learning to automatically tune rankings over time as their index grows to 400 million documents. They use a Learning to Rank approach where features are extracted from queries and documents, training data is collected, and a ranking model is generated to optimize metrics like click-through rates. Their Solr Learning to Rank plugin allows this model to re-rank search results in Solr for improved relevance.
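To make the re-ranking step tangible, here is what a request to the Solr Learning to Rank plugin looks like from Python; the rq parameter and efi external-feature syntax follow the plugin's documented form, while the host, collection, and model name are placeholders.

```python
import requests

# Placeholder Solr endpoint, collection, and model name.
params = {
    "q": "apple stock price",
    # Re-rank the top 100 results with a previously uploaded LTR model.
    "rq": "{!ltr model=myRankingModel reRankDocs=100 "
          "efi.user_query='apple stock price'}",
    "fl": "id,score",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/news/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["score"])
```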
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
A Learning to Rank Project on a Daily Song Ranking Problem, by Sease
Ranking data, i.e., ordered lists of items, naturally appears in a wide variety of situations; understanding how to adapt a specific dataset and design the best approach to solve a ranking problem in a real-world scenario is thus crucial. This talk aims to illustrate how to set up and build a Learning to Rank (LTR) project starting from the available data, in our case a Spotify dataset (available on Kaggle) on the Worldwide Daily Song Ranking, and ending with the implementation of a ranking model. A step-by-step (phased) approach to this task using open source libraries will be presented. We will examine in depth the most important part of the pipeline, the data preprocessing, and in particular how to model and manipulate the features in order to create the proper input dataset, tailored to the machine learning algorithm's requirements.
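A small sketch of the preprocessing phase the talk emphasizes, using pandas; the column names and label bucketing below are assumptions made for illustration, loosely shaped like the Kaggle song-ranking data rather than its exact schema.

```python
import pandas as pd

# Hypothetical rows shaped like the daily song-ranking data; names assumed.
df = pd.DataFrame({
    "track":    ["Shape of You", "Despacito", "HUMBLE."],
    "position": [1, 2, 25],
    "streams":  [1200000, 1100000, 300000],
    "region":   ["us", "us", "gb"],
})

# Turn chart position into a graded relevance label (higher = more relevant),
# one simple modeling choice among many.
df["label"] = pd.cut(df["position"], bins=[0, 3, 10, 200],
                     labels=[2, 1, 0]).astype(int)

# Encode the categorical feature numerically, as LTR libraries expect vectors.
features = pd.get_dummies(df[["streams", "region"]], columns=["region"])
print(features.assign(label=df["label"]))
```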
Airflow at Lyft for Airflow Summit 2020 conference, by Tao Feng
1) Lyft uses Airflow for ETL workflows to move data from mobile apps and events to data warehouses (a minimal DAG sketch follows this list).
2) Lyft has customized Airflow with features like UI auditing, DAG dependency graphs, and integrating Amundsen for data lineage.
3) Current focuses at Lyft include an ETL expiration system, upgrading DAGs to Python 3, and leveraging new Airflow features in a multi-tenant cluster.
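The DAG sketch referenced in point 1, using Airflow 2.x-style imports; the DAG id, task bodies, and schedule are placeholders, not Lyft's pipelines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():  # placeholder ETL steps
    print("pull events from the app")

def load():
    print("write to the warehouse")

with DAG(dag_id="events_to_warehouse",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load  # the dependency edge shown in Airflow's graph view
```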
Successful implementation of JIRA and Confluence - tips and best practices, by chade_chr
How to make sure that your social enterprise platform becomes a success in the company. The focus is on Atlassian's JIRA/Confluence products, with an implementation checklist at the end.
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,..., by Mihai Criveti
- The document discusses automating data science pipelines with DevOps tools like Ansible, Packer, and Kubernetes.
- It covers obtaining data, exploring and modeling data, and how to automate infrastructure setup and deployment with tools like Packer to build machine images and Ansible for configuration management.
- The rise of DevOps and its cultural aspects are discussed, as well as how tools like Packer, Ansible, and Kubernetes can help automate infrastructure and deploy machine learning models at scale in production environments.
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#netflix2
Operating C* can involve a lot of manpower, complex automation, or both. Some of this complexity comes from operational/configuration activity of the underlying kernel and hardware, but much of it is operational complexity stemming from C* itself. Some examples of this complexity are restarting the database in a safe way, reliably backing up and restoring snapshots, monitoring the health of the datastore, and even ensuring eventual consistency through repair. As a result of these complexities, C* operators end up with complicated operational setups, which are expensive to build, manage, and monitor. As part of this talk, we will share lessons learned in managing such complexity via our Priam sidecar, including recent innovations in how our sidecar ensures the highest possible uptime and correctness of Cassandra. We then use this to motivate building the management sidecar directly into C* itself (CASSANDRA-14395).
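To illustrate the health-monitoring slice of that sidecar work: nodetool status is Cassandra's standard CLI health report, and a sidecar can poll it as sketched below. The parsing and alert hook are simplified placeholders, not Priam's actual implementation.

```python
import subprocess

def cassandra_nodes_up():
    # Lines for live, normal nodes in `nodetool status` start with "UN"
    # (Up/Normal); other two-letter codes mark down/joining/leaving nodes.
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True, check=True).stdout
    states = [line.split()[0] for line in out.splitlines()
              if line[:2] in ("UN", "DN", "UJ", "UL", "DJ", "DL")]
    return states.count("UN"), len(states)

up, total = cassandra_nodes_up()
if up < total:
    print(f"ALERT: only {up}/{total} nodes up")  # placeholder alert hook
```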
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots. (A small PySpark sketch follows the session outline below.)
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker
Ryan Blue, Software Engineer, Netflix
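The PySpark sketch referenced above: CREATE TABLE ... USING iceberg is the documented Spark SQL integration, but this assumes a Spark build with the Iceberg runtime on the classpath and a configured catalog, and the table name is a placeholder.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and an Iceberg catalog are configured.
spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Iceberg tables are created through Spark SQL with USING iceberg.
spark.sql("CREATE TABLE IF NOT EXISTS db.events "
          "(id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("INSERT INTO db.events VALUES (1, current_timestamp())")

# Readers always see a consistent table snapshot, the isolation property
# the talk highlights.
spark.table("db.events").show()
```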
This document discusses ONNX (Open Neural Network Exchange) and its integration with MLflow for model portability and deployment. It provides an overview of ONNX, describing how it allows models to be trained in one framework and deployed in another. It then discusses several companies that support ONNX, including Microsoft's use of ONNX Runtime to accelerate models across various products, AWS and Nvidia's support, and Facebook and Intel's contributions. The document ends by explaining how MLflow recently added support for the ONNX format, allowing models to be exported to and loaded from ONNX.
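A hedged end-to-end sketch of that portability story: train in scikit-learn, convert with skl2onnx, log through MLflow's ONNX flavor, and score with ONNX Runtime. The data is random and the APIs are used in their commonly documented forms; exact signatures can vary across library versions.

```python
import mlflow
import mlflow.onnx
import numpy as np
import onnxruntime as ort
from skl2onnx import to_onnx  # converts sklearn models to ONNX
from sklearn.linear_model import LogisticRegression

# Train in one framework (scikit-learn) on random toy data.
X = np.random.rand(20, 3).astype(np.float32)
y = (X.sum(axis=1) > 1.5).astype(int)
model = LogisticRegression().fit(X, y)

# Export to the ONNX interchange format.
onnx_model = to_onnx(model, X)

# Log the model with MLflow's ONNX flavor, mentioned in the summary above.
with mlflow.start_run():
    mlflow.onnx.log_model(onnx_model, artifact_path="model")

# Deploy anywhere ONNX Runtime runs.
sess = ort.InferenceSession(onnx_model.SerializeToString())
input_name = sess.get_inputs()[0].name
print(sess.run(None, {input_name: X[:2]}))
```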
Dense Retrieval with Apache Solr Neural Search.pdf, by Sease
This document provides an overview of dense retrieval with Apache Solr neural search. It discusses semantic search problems that neural search aims to address through vector-based representations of queries and documents. It then describes Apache Solr's implementation of neural search using dense vector fields and HNSW graphs to perform k-nearest neighbor retrieval. Functions are shown for indexing and searching vector data. The document also discusses using vector queries for filtering, re-ranking, and hybrid searches combining dense and sparse criteria.
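For orientation, here is a sketch of indexing and querying a dense vector field from Python; the {!knn} query parser and DenseVectorField type are Solr 9 neural-search features, while the collection, field name, and 4-dimensional toy vectors are placeholders.

```python
import requests

SOLR = "http://localhost:8983/solr/docs"  # placeholder collection

# Index a document with a dense vector field (assumes a schema with a
# DenseVectorField named "vector").
requests.post(f"{SOLR}/update?commit=true", json=[
    {"id": "1", "title": "neural search intro",
     "vector": [0.12, 0.4, 0.33, 0.9]},
])

# k-nearest-neighbour retrieval with the {!knn} query parser.
params = {
    "q": "{!knn f=vector topK=3}[0.1, 0.39, 0.3, 0.88]",
    "fl": "id,title,score",
}
resp = requests.get(f"{SOLR}/select", params=params)
print(resp.json()["response"]["docs"])
```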
How to Build your Training Set for a Learning To Rank Project, by Sease
Learning to rank (LTR from now on) is the application of machine learning techniques, typically supervised, in the formulation of ranking models for information retrieval systems.
With LTR becoming more and more popular (Apache Solr has supported it since Jan 2017), organisations struggle with the problem of how to collect and structure the relevance signals necessary to train their ranking models.
This talk is a technical guide to explore and master various techniques to generate your training set(s) correctly and efficiently.
Expect to learn how to :
– model and collect the necessary feedback from the users (implicit or explicit)
– calculate for each training sample a relevance label which is meaningful and not ambiguous (Click Through Rate, Sales Rate …)
– transform the raw data collected into an effective training set, in the numerical vector format most LTR training libraries expect (a labeling-and-export sketch follows below)
Join us as we explore real world scenarios and dos and don’ts from the e-commerce industry.
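Here is the labeling-and-export sketch referenced above: Click Through Rate computed from an invented click log, bucketed into graded labels, and written in the SVMLight-with-qid layout that many LTR libraries accept. Column names, bucket edges, and features are illustrative assumptions.

```python
import pandas as pd

# Invented click-log aggregates: one row per (query, document) pair.
log = pd.DataFrame({
    "qid":         [1, 1, 1, 2, 2],
    "doc":         ["a", "b", "c", "a", "d"],
    "impressions": [100, 100, 80, 50, 50],
    "clicks":      [30, 4, 1, 25, 2],
    "price":       [9.9, 19.9, 4.9, 9.9, 14.9],  # an example document feature
})

# Click-through rate as the implicit relevance signal, bucketed into grades
# so a single noisy click does not dominate the label.
log["ctr"] = log["clicks"] / log["impressions"]
log["label"] = pd.cut(log["ctr"], bins=[-1, 0.05, 0.2, 1],
                      labels=[0, 1, 2]).astype(int)

# Export in the SVMLight-with-qid layout:
#   <label> qid:<qid> 1:<feat1> 2:<feat2> ...
with open("train.txt", "w") as f:
    for _, r in log.iterrows():
        f.write(f"{r['label']} qid:{r['qid']} 1:{r['ctr']:.4f} 2:{r['price']}\n")
```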
Solr is a widely used, popular open source IR engine. It can be used as a simple sentiment analysis and sentiment retrieval tool. Its multi-language analyzers, together with the UIMA (Unstructured Information Management Architecture) framework, can be extended for sentiment extraction. Each sentence passes through a series of pluggable annotators. An entity and its associated polarity are detected for each sentence. The polarity of each sentence is stored in the Solr index. Persistent model files can be created from training data and accessed at run time.
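The talk's pipeline is UIMA-based; as a lighter-weight illustration of the same per-sentence flow, the sketch below scores polarity with NLTK's VADER analyzer and stores it in Solr via pysolr. The core name and field naming are placeholders, and this is not the UIMA annotator chain itself.

```python
import pysolr
from nltk.sentiment import SentimentIntensityAnalyzer  # needs vader_lexicon

solr = pysolr.Solr("http://localhost:8983/solr/sentiment")  # placeholder core
sia = SentimentIntensityAnalyzer()

sentences = [
    "The battery life is fantastic.",
    "The screen scratches far too easily.",
]

# Annotate each sentence with a polarity score and store it in the index,
# mirroring the per-sentence flow described above.
docs = []
for i, text in enumerate(sentences):
    polarity = sia.polarity_scores(text)["compound"]  # in [-1, 1]
    docs.append({"id": str(i), "text": text, "polarity_f": polarity})

solr.add(docs)
solr.commit()
```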
Putting Open Data on Open Source Platforms: Sharing Development Experience with the Open Source Data Portal CKAN, by Chengjen Lee
This document discusses using the open source CKAN platform for open data portals and sharing experiences customizing and implementing CKAN. It begins by introducing CKAN and its features for publishing, sharing, finding, and using open data. It then describes customizing CKAN for a Taiwanese open data portal including adding custom metadata fields, data validation, visualization plugins, localization, and harvesting from other data sources. The document concludes by discussing ways to contribute to CKAN such as through the development of new features, plugins, or translations.
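A quick sketch of talking to a CKAN portal from Python; package_search is a documented CKAN Action API endpoint, while the instance URL and query are placeholders.

```python
import requests

# CKAN exposes a JSON "Action API"; package_search is a documented action.
CKAN = "https://demo.ckan.org"  # placeholder instance

resp = requests.get(f"{CKAN}/api/3/action/package_search",
                    params={"q": "water quality", "rows": 5})
result = resp.json()["result"]

print(result["count"], "datasets found")
for pkg in result["results"]:
    print("-", pkg["title"])
```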
This document summarizes a presentation about the CKAN data management system. It introduces CKAN, describes its main features for publishing, finding, storing and managing datasets, engaging with users, and customizing the system. It provides examples of how CKAN is used by sites like data.gov.uk and discusses installation instructions, including required packages, configuring PostgreSQL, and deployment options.
CKANCon 2016 and IODC16 were conferences about open data. CKANCon 2016 was a one-day conference for CKAN developers that included case studies, lightning talks, and discussions around the CKAN roadmap and the move to the Flask framework. IODC16 was the 4th International Open Data Conference, which brought together the global open data community to discuss topics like data in education, agriculture, and more. Selected sessions included videos on tracking earthquake relief funds in Nepal and unlocking private sector data for public good.
OpenKM is a document management system that, thanks to its characteristics, can be used by big companies as well as small ones as a useful knowledge management tool, providing a more flexible and cost-effective alternative to proprietary applications.
This document discusses 12 common mistakes that rookie coaches make and how veteran coaches approach the same situations differently. The mistakes include: focusing on winning over sportsmanship; talking about the win-loss record rather than the players; allowing players to blame others rather than take responsibility; micro-managing rather than developing assistants; using intimidation rather than building relationships; benching players for mistakes rather than coaching them; focusing on starters over role players; emphasizing individual awards over team roles; hoping for leadership rather than teaching it; giving long, intense talks rather than short, focused ones; holding boring practices rather than energetic ones; and teaching only tactics rather than life lessons. The veteran coaches are presented as taking a more holistic, developmental approach focused on the well-being of their players.
This document advertises a program that claims to increase a person's vertical jump by 10 inches through a complete workout chart, nutrition plan, one-on-one training, weight room alternatives, training forum, and jump shoes. It repeatedly encourages clicking a link to discover more information.
Building a server to manage high concurrent connections is a non-trivial task. For those developers who use ActionScript 3 to build games on the client side, it means having a totally different skillset. Being able to use ActionScript 3 on the server to build MMOs, or to port client code to the server, allows developers to leverage their skills on the server.
By walking through a live game example with more than 15,000 concurrent connections running on a medium Amazon EC2 server, the presentation will:
1. Introduce Linux server configuration for high-concurrency usage.
2. Introduce a Socket class based on the libev library for high-concurrency connections (see the event-loop sketch after this list).
3. Introduce leveraging the Tamarin project for ActionScript 3 on the server.
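The event-loop sketch referenced above, written in Python with the standard selectors module rather than ActionScript: one thread multiplexes many sockets in a libev-style readiness loop, the pattern that makes high concurrent connection counts feasible.

```python
import selectors
import socket

# libev-style readiness loop: one thread multiplexes many connections.
sel = selectors.DefaultSelector()
server = socket.socket()
server.bind(("0.0.0.0", 9000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    for key, _events in sel.select():
        if key.fileobj is server:
            conn, _addr = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(4096)
            if data:
                key.fileobj.sendall(data)  # echo back to the client
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()
```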
This presentation was given by Will 保哥 on Thursday, June 25, 2015, at the invitation of SQL PASS Taiwan.
Live recording: http://www.microsoftvirtualacademy.com/training-courses/sql-server-realase-management?mtag=MVP4015686
[ Will 保哥's blog - The Will Will Web ]
http://blog.miniasp.com
[ Will 保哥's tech community hub ] (Facebook fan page)
https://www.facebook.com/will.fans
[ Will 保哥's Plurk ]
http://www.plurk.com/willh/invite
[ Will 保哥's Twitter ]
https://twitter.com/Will_Huang
[ Will 保哥's Google+ page ]
http://gplus.to/willh
Preserving Collaborative Documents in Contemporary Events, by Chengjen Lee
This document discusses preserving online collaborative documents from the 2014 Sunflower Student Movement in Taiwan. During the movement, students and activists occupied Taiwan's legislature to protest a trade agreement with China. They used online platforms like Hackpad and Hackfoldr to collaboratively edit documents. The researchers aim to faithfully archive these "born-digital" materials by recreating the operating systems, applications, and data needed to access the documents offline. However, challenges include rights issues, technical complexity, and the need to also archive related websites and media to fully preserve the online context of the movement.
“Open Data Web” – A Linked Open Data Repository Built with CKAN, by Chengjen Lee
This document summarizes the development of an open linked data repository called Open Data Web (ODW) built using CKAN. Key points:
- ODW publishes structured data from a Taiwanese archive catalog as linked open data using the RDF data model.
- It provides features for browsing and for spatial and temporal querying of the data through a SPARQL endpoint (see the query sketch below).
- The system was implemented by customizing CKAN using extensions to support linked data import/export, custom fields, spatial/temporal search.
- Future work includes improving import speed and providing native SPARQL queries in CKAN.
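The query sketch referenced in the list above, using the SPARQLWrapper library against a placeholder endpoint; the Dublin Core title predicate is shown as one example of the vocabularies such a repository exposes.

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

# Placeholder endpoint; ODW exposes its data through a SPARQL endpoint.
sparql = SPARQLWrapper("https://example.org/odw/sparql")
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?item ?title WHERE {
        ?item dcterms:title ?title .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], "-", row["title"]["value"])
```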
ckan 2.0 Introduction (20140618 updated), by Chengjen Lee
This document provides an overview and agenda for a presentation on CKAN 2.0, an open-source data management system. The presentation covers topics such as features for publishing and finding datasets, storing and managing data, customizing and extending CKAN, and how CKAN supports open data principles. It also provides examples of CKAN in use by government open data portals and discusses issues such as language support and extensions. Harvester extensions are introduced for harvesting metadata and datasets from remote CKAN instances and other data sources.
ckan 2.0 Introduction (20140522 updated), by Chengjen Lee
This document outlines an agenda and presentation on CKAN, an open-source data management system. The presentation covers an introduction to CKAN, a tour of its features for publishing, finding, and managing data, how it supports open data principles, examples of CKAN instances, issues, and installation and harvesting topics.
This report summarizes an imported dataset containing tags and groups. The presenter shares that the data includes package names with asterisks replaced, parentheses removed from tags, and content mapped to groups. Users are directed to the MoPad application to obtain login credentials needed to access the imported data.
The document summarizes a presentation on controlled vocabularies, access control, and other features in CKAN. It discusses using CKAN's tag vocabularies and a validator for place names. It describes the roles of admin, editor and user for organizations and their access levels. It also covers searching private datasets after logging in, the datastore and datapusher extensions, and statistics available through the stats extension.
This document summarizes a presentation on CKAN, an open-source data management system. It discusses CKAN's features for publishing, finding, and managing datasets. These include adding metadata and data, filtering datasets, previewing data types, and customizing CKAN. It also covers harvesting data from external sources, installing CKAN, and common issues. The goal of CKAN is to make data open and accessible on the web according to the 5 star open data model.
The document introduces Pelican, an open-source static site generator. It can convert reStructuredText and Markdown files into HTML files. The document covers how to install Pelican using virtualenv and pip install. It also discusses the basic folder structure and usage of Pelican, including generating HTML files from Markdown or reStructuredText content files organized into categories and tags.
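As a concrete reference point, Pelican's settings live in a plain Python module; the setting names below are Pelican's documented ones, and the values are placeholders. The site is then generated with "pelican content -o output -s pelicanconf.py".

```python
# pelicanconf.py -- Pelican reads its settings from a plain Python module.
AUTHOR = "Jane Doe"     # placeholder values throughout
SITENAME = "My Static Site"
SITEURL = ""            # left empty during local development
PATH = "content"        # where the Markdown/reST source files live
TIMEZONE = "Asia/Taipei"
DEFAULT_LANG = "en"
```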
ckan 2.0: Harvesting from other sources, by Chengjen Lee
This document summarizes Cheng-Jen Lee's presentation on CKAN 2.0 harvesting capabilities and linked data/RDF. It discusses manually and automatically harvesting from remote sources using harvesters, implementing a custom harvester, and issues with harvesting. It also covers the Resource Description Framework and using DCAT and Dublin Core vocabularies to retrieve RDF metadata from datasets.
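A hedged sketch of retrieving DCAT/Dublin Core metadata with rdflib; the dataset URL is a placeholder, assuming an instance that serves dataset metadata as RDF (as CKAN's dcat tooling typically does).

```python
from rdflib import RDF, Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
# Placeholder URL; assumes dataset metadata is served as RDF.
g.parse("https://example.org/dataset/air-quality.rdf")

# Walk the DCAT graph: datasets and their Dublin Core titles.
for ds in g.subjects(RDF.type, DCAT.Dataset):
    print(ds, "-", g.value(ds, DCT.title))
```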
This document summarizes a presentation on CKAN 2.0 features including:
- Support for Jetty 8/9 and Solr 4 for spatial searching
- Extensions for spatial metadata, searching, mapping and previews
- Import/export of datasets via the API and database dumps
- Methods for migrating CKAN including issues with PostgreSQL versions
- A shift in authentication away from OpenID toward UNIX user authentication