Project for the data engineering certificate at the University of Washington: streaming weather and traffic data from web APIs using Kafka to predict traffic conditions
Process and steps followed in creating a successful visualization, using the Encyclopedia of Life dataset and a Tableau visualization prototype as an example
A Knowledge Graph Framework for Detecting Traffic Events Using Stationary Cam... - RoopTeja Muppalla
Imagery-based Traffic Sensing Knowledge Graph (ITSKG) framework utilizes the stationary traffic camera information as sensors to understand the traffic patterns. This system extracts image-based features from traffic camera images, adds a semantic layer to the sensor data for traffic information, and then labels traffic imagery with semantic labels such as congestion. This framework adds a new dimension to existing traffic modeling systems by incorporating dynamic image-based features as well as creating a knowledge graph to add a layer of abstraction to understand and interpret concepts like congestion to the traffic event detection system.
This work was presented at the Industrial Knowledge workshop, co-located with the 9th International ACM Web Science Conference 2017, on 25 June 2017.
Visualising statistical Linked Data with Plone - Eau de Web
Presentation of a Plone-based tool that can create graphical visualisations of semantic statistical data expressed using the RDF Data Cube Vocabulary and queried using generated SPARQL statements. The tool was developed under a project funded by the European Commission and is publicly available at www.digital-agenda-data.eu
What to do with the existing spatial data in planning - Karel Charvat
Spatial planning operates across all levels of government, so planners face important challenges every day in developing territorial frameworks and concepts.
Spatial planning systems, the legal situation, and spatial planning data management differ completely and are fragmented across Europe.
Nevertheless, planning is a holistic activity: all tasks and processes must be solved comprehensively, with input from various sources.
Making these inputs interoperable is therefore necessary, as it allows users to search data from different sources, view them, download them, and use them with the help of geoinformation technologies (GIT).
From Simple Features to Moving Features and Beyond? at OGC Member Meeting, Se... - Anita Graser
Presentation of arxiv preprint https://arxiv.org/abs/2006.16900
Mobility data science lacks common data structures and analytical functions. This position paper assesses the current status and open issues towards a universal API for mobility data science. In particular, we look at standardization efforts revolving around the OGC Moving Features standard which, so far, has not attracted much attention within the mobility data science community. We discuss the hurdles any universal API for movement data has to overcome and propose key steps of a roadmap that would provide the foundation for the development of this API.
Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality - andimou
RDF dataset quality assessment is currently performed primarily after data is published. Incorporating its results, by applying corresponding adjustments to the dataset, happens manually and occurs rarely. In the case of (semi-)structured data (e.g., CSV, XML), the root of the violations often derives from the mappings that specify how the RDF dataset will be generated. Thus, we suggest shifting the quality assessment from the RDF dataset to the mapping definitions that generate it. The proposed test-driven approach for assessing mappings relies on RDFUnit test cases applied over mappings specified with RML. Our evaluation covers different cases, e.g., DBpedia, and indicates that the overall quality of an RDF dataset is quickly and significantly improved.
In Proceedings of the VLDB Endowment (PVLDB), Vol. 9, No. 11
www.vldb.org/pvldb/vol9/p888-asudeh.pdf
The ranked retrieval model has rapidly become the de facto way for search query processing in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions/mechanisms, many such databases in practice still fail to address the diverse and sometimes contradictory preferences of users on tuple ranking, perhaps (at least partially) due to the lack of expertise and/or motivation for the database owner to design truly effective ranking functions. This paper takes a different route on addressing the issue by defining a novel query reranking problem, i.e., we aim to design a third-party service that uses nothing but the public search interface of a client-server database to enable the on-the-fly processing of queries with any user-specified ranking functions (with or without selection conditions), whether or not the ranking function is supported by the database. We analyze the worst-case complexity of the problem and introduce a number of ideas, e.g., on-the-fly indexing, domination detection and virtual tuple pruning, to reduce the average-case cost of the query reranking algorithm. We also present extensive experimental results on real-world datasets, in both offline and live online systems, that demonstrate the effectiveness of our proposed techniques.
Presentation of the spatiotemporal RDF store Strabon at the Linked Data Europe Workshop, co-located with the European Data Forum in Athens, Greece (21 March 2014)
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency... - Sean Barbeau
Discusses an open-source tool that can sync GTFS datasets with OpenStreetMap to help small agencies manage their bus stop inventory via crowd-sourcing. Includes some actual results of crowd-sourcing bus stop location accuracy in Tampa, FL.
Abstract - In the present paper we describe a new, updated and refined dataset specifically tailored to train and evaluate machine learning based malware traffic analysis algorithms. To generate it, we started from the largest databases of network traffic captures available online, deriving a dataset with a set of widely-applicable features and then cleaning and preprocessing it to remove noise, handle missing data and keep its size as small as possible. The resulting dataset is not biased by any specific application (although specifically addressed to machine learning algorithms), and the entire process can run automatically to keep it updated.
Provenance Analytics at AAAI Human Computation Conference 2013 - T. Dong Huynh
Trung Dong Huynh presenting the paper entitled "Interpretation of Crowdsourced Activities using Provenance Network Analysis" - how analysing provenance graphs can help interpret crowdsourced activities in CollabMap
Understanding speed and travel-time dynamics in response to various city related events is an important and challenging problem. Sensor data (numerical) containing average speed of vehicles passing through a road link can be interpreted in terms of traffic related incident reports from city authorities and social media data (textual), providing a complementary understanding of traffic dynamics. State-of-the-art research is focused on either analyzing sensor observations or citizen observations; we seek to exploit both in a synergistic manner.
We demonstrate the role of domain knowledge in capturing the non-linearity of speed and travel-time dynamics by segmenting speed and travel-time observations into simpler components amenable to description using linear models such as Linear Dynamical System (LDS). Specifically, we propose Restricted Switching Linear Dynamical System (RSLDS) to model normal speed and travel time dynamics and thereby characterize anomalous dynamics. We utilize the city traffic events extracted from text to explain anomalous dynamics. We present a large scale evaluation of the proposed approach on a real-world traffic and twitter dataset collected over a year with promising results.
The data streaming processing paradigm and its use in modern fog architectures - Vincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
On-the-fly Integration of Static and Dynamic Linked Data - aharth
Slides of COLD 2013 Paper "On-the-fly Integration of Static and Dynamic Linked Data", Andreas Harth (KIT), Craig Knoblock (USC), Steffen Stadtmüller (KIT), Rudi Studer (KIT), Pedro Szekely (USC)
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R... - DataStax Academy
Do you want to expose a kick-ass REST API? Do you want interactions with that API to drive slick dashboards that answer hard questions? Do you want all of that in near real-time, distributed across a number of machines, and tolerant of system faults? Take one part Cassandra, stir in a bit of Paxos, and blend with Storm. Coat the rim of the glass with DropWizard. Sit back, relax, and enjoy the show. Come see how Health Market Science is using this mixology to fix our healthcare system.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general, and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grain in-memory workflow on extreme-scale systems, where it can sustain task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
From Complex Systems to Networks: Discovering and Modeling the Correct Network - diannepatricia
"From Complex Systems to Networks: Discovering and Modeling the Correct Network" by Nitesh Chawla as part of the Cognitive Systems Institute Speaker Series
Nitesh Chawla is the Frank M. Freimann Professor of Computer Science and Engineering, and director of the research center on network and data sciences (iCeNSA) at the University of Notre Dame.
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre - HPCC Systems
Data-Centric Approach: Our platform is built on the premise of absorbing data from multiple data sources and transforming them into highly intelligent social network graphs that can be processed to reveal non-obvious relationships.
Reflections on Almost Two Decades of Research into Stream Processing - Kyumars Sheykh Esmaili
This is the slide deck that I used during my tutorial presentation at the ACM DEBS Conference (http://www.debs2017.org/) that was held in Barcelona between June 19 and June 23, 2017.
The tutorial paper itself can be accessed here: http://dl.acm.org/citation.cfm?id=3095110
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... - Ian Foster
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to keep evolving, driven by institutional investment rotating out of offices and into work-from-home ("WFH") arrangements, and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next four years.
Whilst competitive headwinds remain, illustrated by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Streaming Weather Data from Web APIs to Jupyter through Kafka
1. Weather & Transportation
Streaming the Data, Finding Correlations
Provide capability to Data for Democracy (democratizing_weather_data)
University of Washington Professional & Continuing Education
BIG DATA 230 B Su 17: Emerging Technologies In Big Data
Team D-Hawks
John Bever, Karunakar Kotha, Leo Salemann, Shiva Vuppala, Wenfan Xu
2. Overview
Our "Client" and Their Mission: Data for Democracy
Learn More: www.datafordemocracy.org, https://github.com/Data4Democracy, democratizing_weather_data/streaming
Our Mission
● Provide a streaming capability to extract weather and traffic data from multiple Web APIs, and produce a clean merged dataframe suitable for Machine Learning and other Data Science analysis.
● Deliver code to D4D’s Github Repository
● Use vendor-neutral, open-source solutions, implemented in Python and Jupyter notebooks
3. Pipeline
• Kafka transport mechanism (vendor-neutral, open source)
• Message value is an entire JSON document
• One topic per source API guarantees a consistent schema
• Multiple JSON documents (sharing the same schema) are combined into a single dataframe
• Dataframe records joined based on space and time
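The bullets above can be sketched in a few lines of Python. This is a minimal sketch, not the team's actual code: it shows how an entire JSON document becomes a single Kafka message value; the kafka-python calls in the comments and the topic name "wsdot-traffic" are illustrative assumptions.

```python
import json

# Each Kafka message value is one entire JSON document; one topic per
# source API keeps every message on that topic under the same schema.
def serialize(doc):
    return json.dumps(doc).encode("utf-8")

def deserialize(raw):
    return json.loads(raw.decode("utf-8"))

# With kafka-python these would plug in as (hypothetical topic name):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                            value_serializer=serialize)
#   producer.send("wsdot-traffic", traffic_doc)
#   consumer = KafkaConsumer("wsdot-traffic", value_deserializer=deserialize)

traffic_doc = {"CurrentTime": "2017-08-21T10:05:00", "CurrentTravelTime": 12}
assert deserialize(serialize(traffic_doc)) == traffic_doc
```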
4. Web APIs
Postman
• Great tool for interacting with potential APIs.
• Friendly GUI for constructing requests and reading responses.
• Provided JSON files before the pipeline was completed, allowing analysis of data in parallel.
ProgrammableWeb.com
● A massive searchable directory of over 15,500 web APIs, updated daily
● Includes sample source code for APIs
7. Analysis
Load JSON file, normalize, save as dataframe.
Repeat for the next JSON file, appending to the prior.
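That load-normalize-append loop can be sketched with pandas; the field names below are invented for illustration and are not the actual WSDOT/Yahoo schema.

```python
import pandas as pd

# Two JSON documents sharing the same schema, as they would arrive from
# one Kafka topic (field names are illustrative only).
doc1 = {"readings": [{"station": "S1", "temp_f": 62,
                      "location": {"lat": 47.60, "lon": -122.33}}]}
doc2 = {"readings": [{"station": "S2", "temp_f": 64,
                      "location": {"lat": 47.65, "lon": -122.30}}]}

# Normalize each document into a flat dataframe, then append.
# Nested "location" fields become dotted columns (location.lat, location.lon).
frames = [pd.json_normalize(d["readings"]) for d in (doc1, doc2)]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # one row per reading, nested fields flattened
```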
7 days of data (includes the eclipse!), 30 minutes between readings
1 Merged Traffic/Weather Table (52,975 rows x 30 columns)
54 Weather JSON Files from Yahoo (54 rows x 31 columns)
394 Weather JSON Files from WSDOT (40,931 rows x 16 columns)
395 Traffic JSON Files from WSDOT (70,998 rows x 20 columns)
Merge WSDOT & Yahoo Weather Dataframes (use columns common to both)
Merge Traffic/Weather Dataframes. Each row has:
- Traffic data from a specific traffic dataframe row
- Weather data from a weather station within 20 miles and 30 minutes of the traffic reading.
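One way to sketch that space-and-time join with pandas, not necessarily how the team implemented it: merge_asof pairs each traffic row with the nearest weather reading within 30 minutes, and a haversine filter then drops stations more than 20 miles away. All column names and coordinates below are illustrative.

```python
import math
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

traffic = pd.DataFrame({
    "time": pd.to_datetime(["2017-08-21 10:05", "2017-08-21 10:35"]),
    "lat": [47.60, 47.61], "lon": [-122.33, -122.33],
    "current_travel_time": [12, 18],
}).sort_values("time")

weather = pd.DataFrame({
    "time": pd.to_datetime(["2017-08-21 10:00", "2017-08-21 10:30"]),
    "station_lat": [47.65, 47.65], "station_lon": [-122.30, -122.30],
    "temp_f": [61, 63],
}).sort_values("time")

# Nearest weather reading within 30 minutes of each traffic reading
merged = pd.merge_asof(traffic, weather, on="time",
                       tolerance=pd.Timedelta("30min"), direction="nearest")

# Keep only pairs where the station is within 20 miles of the road link
dist = merged.apply(lambda r: haversine_miles(r["lat"], r["lon"],
                                              r["station_lat"], r["station_lon"]),
                    axis=1)
merged = merged[dist <= 20]
```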
9. Analyzing the Merged Traffic/Weather Dataset
Scatterplot Matrix with Seaborn (10% random sample)
[Figure: scatterplot matrix of Average Travel Time, Current Travel Time, Wind Direction, Wind Speed, Temp., Humidity, and Barometer]
10. Wrapping Up ...
Key Takeaways
• Choose your python libraries carefully (2 lines of code for a fully-labeled lineplot vs. dozens)
• Spatial plots first, data-joins later (I-5 traffic data vs. statewide weather, also Portland)
• The fastest way to count records in a dataframe is df.shape[0]
Conclusion
• Data for Democracy has a repeatable way to extract weather and transportation data from WSDOT and Yahoo
• Jupyter Notebook provides a teaching/coding environment
• Bitnami provides low-cost simple Kafka infrastructure
Further Work
• Upload CSV and zipped JSONs to data.world
• Better parameters for Producer scripts (e.g., longitude, latitude, date, time)
• Config files for access keys
• More matrix plots, Data Science, Machine Learning
• Gather data for longer time frames (fewer readings per day?)
• Isolate matrix plots to specific locations and/or times.