Sammy Fung discusses using open source tools like Python and Scrapy to extract open weather data from websites like the Hong Kong Observatory in order to make it more accessible. He created an open source project called hk0weather that scrapes current weather reports and exports the data to JSON format. The goal is to develop open source projects that can create open data in standard machine-readable formats to help citizens access public data more easily.
3. We want an easier way to
access the public data.
4. Agenda
● What is Open Data?
● Use of Open Source Software in web crawling.
● Starting a new Open Source project, hk0weather,
to create Open Weather Data.
5. Sammy Fung
● Software Developer
– uses and develops open source software.
– Perl → PHP → Python.
– interested in Data Mining / Web Crawling.
– works at internet service company 43 Global,
deploying OpenStack cloud services.
6. Sammy Fung
● Open Source Community Leader.
– Founding Chairman, Hong Kong Linux User Group.
– Community Manager, opensource.hk.
– GNOME Asia committee member.
– Mozilla Rep.
– Program committee member of COSCUP, the largest
Open Source conference in Taiwan.
● Blogger at sammy.hk.
8. Open Data
Three Laws of Open Government Data by David Eaves:
1. If it can't be spidered or indexed, it doesn't exist.
2. If it isn't available in open and machine-readable format, it can't engage.
3. If a legal framework doesn't allow it to be repurposed, it doesn't empower.
http://eaves.ca/2009/09/30/three-law-of-open-government-data/
9. Open Data
● Tim Berners-Lee, the inventor of the Web.
– 5stardata.info
– 5 star deployment scheme of Open Data.
10.-14. One Star to Five Stars - Open Data
The 5-star deployment scheme (5stardata.info by Tim Berners-Lee,
the inventor of the Web); each additional star adds one step:
1. * make your stuff available on the Web (whatever format) under an open license.
2. ** make it available as structured data (e.g., Excel instead of an image scan of a table).
3. *** use non-proprietary formats (e.g., CSV instead of Excel).
4. **** use URIs to denote things, so that people can point at your stuff.
5. ***** link your data to other data to provide context.
17. Weather Information in Hong Kong
● Hong Kong Observatory
– Hourly Hong Kong Weather Report
– Regional Weather in Hong Kong (10 min updates)
– Weather Forecast and Weekly Weather Forecast
– Typhoon Report and Forecast
20. Weather at Data.One
● My Chinese blog post 'Progress of Open
Government Data in Hong Kong' on 2013/1/17.
● Data.One released on 2011/3/31.
● Weather at Data.One provides 7 dataset URLs,
returning RSS (XML) format (Eng/TChi/SChi).
– In one word: useless.
– The Data.One dataset (RSS) is completely different from
HKO's own paid service (XML).
21. Weather at Data.One
● Example - current local weather report:
● Plain-text report in RSS.
● Difference in how the report content is quoted:
– Website: a pair of HTML tags, e.g. <PRE>....</PRE>.
– Data.One: a pair of RSS description tags,
<description>....</description>.
● Other weather data is missing, e.g. regional
temperature updates every 12 minutes.
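Pulling the report text out of the RSS description tag needs nothing beyond the standard library. The feed snippet below is a made-up minimal example for illustration, not the actual Data.One feed:

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet in the shape described above (invented for
# illustration; the real Data.One feed carries a full plain-text report).
rss = """<rss version="2.0"><channel><item>
<title>Current Weather Report</title>
<description>Air temperature: 23 degrees Celsius.</description>
</item></channel></rss>"""

root = ET.fromstring(rss)
# The report body sits between the <description>...</description> tags,
# just as it sits between <PRE>...</PRE> on the website.
report = root.find("./channel/item/description").text
print(report)
```

The program still has to parse the plain-text report itself afterwards; the RSS wrapper only tells it where the report starts and ends.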
22. Weather at Data.One
● Weather at Data.One is a 'report', not 'data'.
● The weather RSS was already released by HKO
before the launch of Data.One.
● Technically, JSON/XML formats are more easily
readable by computer programs.
25. Web Scraping
● a computer software technique for extracting
information from websites. (Wikipedia)
● used for business, hobby, and research purposes.
26. Web Scraping
● Look for the right URLs to scrape.
● Look for the right content in the webpages.
● Save the data into a data store.
● When should the web scraping program run?
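The first three steps can be sketched in plain Python. The HTML snippet below is invented (not the real HKO markup), and the HTTP fetch is replaced by that canned string so the sketch stays self-contained:

```python
import json
import re

# In a real scraper this HTML would come from an HTTP request, e.g.
# urllib.request.urlopen(url).read(); a canned snippet is used here.
html = "<pre>Air temperature: 23 degrees Celsius; Relative humidity: 78 per cent</pre>"

# Look for the right content: extract the numbers with regular expressions.
temperature = re.search(r"Air temperature:\s*(\d+)", html).group(1)
humidity = re.search(r"Relative humidity:\s*(\d+)", html).group(1)

# Save the data into a data store; here, a JSON string (a file or
# database would work the same way).
record = {"temperature": int(temperature), "humidity": int(humidity)}
print(json.dumps(record))
```

For the fourth step, such a script is typically run on a schedule (e.g. a cron job matching the report's update interval).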
27. Use of Open Source Software in
Web Crawling
● Use Open Source tools to collect useful and
meaningful machine-readable data.
● No need to wait for the provider to release data
in a machine-readable format.
28. Open Source Tools
● Python programming language
● with the Regular Expression library
● Scrapy web crawling framework
29. Why Python + Scrapy?
● Python: my favourite programming language
for the past few years.
● Scrapy: a web crawling framework written in
Python.
30. What is Scrapy?
● An open source web scraping framework for
Python.
● Scrapy is a fast, high-level screen scraping and
web crawling framework, used to crawl
websites and extract structured data from
their pages. It can be used for a wide range of
purposes, from data mining to monitoring
and automated testing.
31. Scrapy Features
● define the data you want to scrape
● write a spider to extract the data
● Built-in: selecting and extracting data from HTML
and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
33. Programme List of Paid TVs in 2004
● I wanted to know which channel was showing
the live football match.
● Paid TV web site = M$ + IIS + ASP + Flash
● Slow....... Very Slow...... Extremely Slow!
● Couldn't connect at any peak hours!
● Wrote my first web crawler in PHP in 2004.
34. Public Transportation in 2006-2010
● Kowloon Motor Bus (KMB)
– No map view for a bus route
● Public Transportation Enquiry System (PTES)
– Extremely poor, ugly (or much worse) map UI on
PTES.
35. HK Observatory and Joint Typhoon
Warning Center
● Is any typhoon coming to Hong Kong? And
when will it come?
● No easy data exchange format.
● No RSS nor ATOM.
● We don't check websites every day.
46. Starting new Open Source projects
to create Open Data
● Develop an open source project.
● Release data in a standard machine-readable
data format.
47. hk0weather
● https://github.com/sammyfung/hk0weather
● Open Source Hong Kong Weather Project.
● Converts HKO webpages to JSON data.
● Python + Scrapy
● 1st version: from the current weather report,
extracts temperature and humidity from 20+
weather stations and exports them in JSON format.
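That conversion step can be sketched as follows, with an invented two-station snippet standing in for the scraped HKO report (the real parsing lives in the hk0weather spiders on GitHub):

```python
import json
import re

# Invented sample lines standing in for the scraped current weather
# report; each line carries a station name, temperature and humidity.
report = """King's Park 23 78
Happy Valley 24 75"""

stations = []
for line in report.splitlines():
    # Station name, then temperature (deg C) and relative humidity (%).
    m = re.match(r"(.+?)\s+(\d+)\s+(\d+)$", line)
    stations.append({
        "station": m.group(1),
        "temperature": int(m.group(2)),
        "humidity": int(m.group(3)),
    })

# Export in JSON format, as the first hk0weather version did.
print(json.dumps(stations))
```

Once the data is in this shape, any program can consume it without scraping HKO's pages itself.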
56. Agenda
● What is Open Data?
● Use of Open Source Software in web crawling.
● Starting a new Open Source project, hk0weather,
to create Open Weather Data.
57. We want an easier way to
access the public data.