The process and steps followed in creating a successful visualization, illustrated with Encyclopedia of Life data and a Tableau visualization prototype.
A Knowledge Graph Framework for Detecting Traffic Events Using Stationary Cam... - RoopTeja Muppalla
Imagery-based Traffic Sensing Knowledge Graph (ITSKG) framework utilizes the stationary traffic camera information as sensors to understand the traffic patterns. This system extracts image-based features from traffic camera images, adds a semantic layer to the sensor data for traffic information, and then labels traffic imagery with semantic labels such as congestion. This framework adds a new dimension to existing traffic modeling systems by incorporating dynamic image-based features as well as creating a knowledge graph to add a layer of abstraction to understand and interpret concepts like congestion to the traffic event detection system.
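The labeling step can be pictured with a tiny sketch. This is not the authors' ITSKG code; the feature names (`vehicle_count`, `avg_speed_kmh`) and the thresholds are hypothetical stand-ins for the image-based features the framework extracts.

```python
# Illustrative sketch (not the authors' code): mapping image-derived
# features from a traffic camera frame to a semantic label such as
# "congestion". Feature names and thresholds here are hypothetical.

def label_frame(features, vehicle_threshold=20, speed_threshold=15.0):
    """Attach a semantic label to one camera frame's extracted features."""
    if features["vehicle_count"] >= vehicle_threshold and \
       features["avg_speed_kmh"] <= speed_threshold:
        return "congestion"
    return "free_flow"

frame = {"camera_id": "cam-42", "vehicle_count": 31, "avg_speed_kmh": 9.5}
print(label_frame(frame))  # congestion
```

In the framework itself, such labels would then be attached to the sensor data as a semantic layer in the knowledge graph rather than returned as plain strings.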
This work was presented at the Industrial Knowledge workshop, co-located with the 9th International ACM Web Science Conference 2017, on 25 June 2017.
Visualising statistical Linked Data with Plone - Eau de Web
Presentation of a Plone-based tool that can create graphical visualisations of semantic statistical data expressed using the RDF Data Cube Vocabulary and queried using generated SPARQL statements. The tool was developed under a project funded by the European Commission and is publicly available at www.digital-agenda-data.eu
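To illustrate the kind of SPARQL the tool generates, here is a hedged Python sketch that assembles a query over RDF Data Cube observations; the dataset, dimension, and measure URIs are placeholders, not the tool's actual output.

```python
# Sketch of generating a SPARQL query over data expressed with the
# RDF Data Cube Vocabulary (qb:). The URIs passed in are illustrative
# placeholders, not the tool's real identifiers.

QB = "http://purl.org/linked-data/cube#"

def build_query(dataset_uri, dimension_uri, measure_uri):
    """Build a SELECT query pairing one dimension with one measure."""
    return f"""PREFIX qb: <{QB}>
SELECT ?dim ?value WHERE {{
  ?obs a qb:Observation ;
       qb:dataSet <{dataset_uri}> ;
       <{dimension_uri}> ?dim ;
       <{measure_uri}> ?value .
}}
ORDER BY ?dim"""

q = build_query("http://example.org/dataset/broadband",
                "http://example.org/dim/year",
                "http://example.org/measure/penetration")
print(q)
```

Each chart in such a tool boils down to a query of this shape: pick a dataset, one or more dimensions for the axes, and a measure for the values.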
What to do with the existing spatial data in planning - Karel Charvat
Spatial planning cuts across all levels of government, so planners face important challenges every day in developing territorial frameworks and concepts.
Spatial planning systems, the legal situation, and spatial planning data management are completely different and fragmented throughout Europe.
Nevertheless, planning is a holistic activity.
All tasks and processes must be solved comprehensively, with input from various sources.
Making these inputs interoperable is therefore essential: it allows users to search data from different sources, view them, download them, and use them with the help of geoinformation technologies (GIT).
From Simple Features to Moving Features and Beyond? at OGC Member Meeting, Se... - Anita Graser
Presentation of arxiv preprint https://arxiv.org/abs/2006.16900
Mobility data science lacks common data structures and analytical functions. This position paper assesses the current status and open issues towards a universal API for mobility data science. In particular, we look at standardization efforts revolving around the OGC Moving Features standard which, so far, has not attracted much attention within the mobility data science community. We discuss the hurdles any universal API for movement data has to overcome and propose key steps of a roadmap that would provide the foundation for the development of this API.
Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality - andimou
RDF dataset quality assessment is currently performed primarily after the data is published. Incorporating its results, by applying corresponding adjustments to the dataset, happens manually and occurs rarely. In the case of (semi-)structured data (e.g., CSV, XML), the root of the violations often lies in the mappings that specify how the RDF dataset will be generated. Thus, we suggest shifting the quality assessment from the RDF dataset to the mapping definitions that generate it. The proposed test-driven approach for assessing mappings relies on RDFUnit test cases applied over mappings specified with RML. Our evaluation covers different cases, e.g., DBpedia, and indicates that the overall quality of an RDF dataset is quickly and significantly improved.
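The shift from dataset-level to mapping-level testing can be illustrated with a minimal sketch. This is not RDFUnit or RML syntax; each rule below is a simplified stand-in for an [R2]RML predicate-object map, checked against sample source values before any RDF is generated.

```python
# Illustrative sketch of the test-driven idea: validate mapping rules
# against sample values *before* generating the dataset, instead of
# assessing the published RDF afterwards. Rule structure is made up.
import re

rules = [
    {"predicate": "ex:birthYear", "datatype": "xsd:gYear",  "pattern": r"\d{4}"},
    {"predicate": "ex:name",      "datatype": "xsd:string", "pattern": r".+"},
]

def assess_rule(rule, sample_values):
    """Return the sample values that would violate the rule's datatype."""
    return [v for v in sample_values if not re.fullmatch(rule["pattern"], v)]

violations = assess_rule(rules[0], ["1984", "19-84"])
print(violations)  # ['19-84']
```

Catching the malformed `19-84` at the mapping stage fixes every triple that rule would have produced, which is the efficiency argument the abstract makes.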
In Proceedings of the VLDB Endowment (PVLDB)
Vol. 9 No. 11
www.vldb.org/pvldb/vol9/p888-asudeh.pdf
--
The ranked retrieval model has rapidly become the de facto way for search query processing in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions and mechanisms, many such databases in practice still fail to address the diverse and sometimes contradicting preferences of users on tuple ranking, perhaps (at least partially) due to the lack of expertise and/or motivation for the database owner to design truly effective ranking functions. This paper takes a different route by defining a novel "query reranking problem": we aim to design a third-party service that uses nothing but the public search interface of a client-server database to enable on-the-fly processing of queries with any user-specified ranking function (with or without selection conditions), whether or not the ranking function is supported by the database. We analyze the worst-case complexity of the problem and introduce a number of ideas, e.g., on-the-fly indexing, domination detection, and virtual tuple pruning, to reduce the average-case cost of the query reranking algorithm. We also present extensive experimental results on real-world datasets, in both offline and live online systems, that demonstrate the effectiveness of our proposed techniques.
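The domination-detection idea can be sketched as follows (a hedged illustration, not the paper's implementation; the attribute names are made up): if tuple a is at least as good as tuple b on every ranking attribute and strictly better on at least one, then b can never outrank a under any monotone ranking function, so b can be pruned from reranking.

```python
# Sketch of Pareto-domination pruning for query reranking.
# Higher attribute values are assumed better; attributes are invented.

def dominates(a, b, attrs):
    """True if a is >= b on every attribute and > b on at least one."""
    ge = all(a[k] >= b[k] for k in attrs)
    gt = any(a[k] > b[k] for k in attrs)
    return ge and gt

def prune_dominated(tuples, attrs):
    """Keep only tuples that no other tuple dominates."""
    return [t for t in tuples
            if not any(dominates(o, t, attrs) for o in tuples if o is not t)]

hotels = [
    {"id": 1, "rating": 4.5, "cheapness": 0.8},
    {"id": 2, "rating": 4.0, "cheapness": 0.5},   # dominated by id 1
    {"id": 3, "rating": 4.9, "cheapness": 0.2},
]
print([t["id"] for t in prune_dominated(hotels, ["rating", "cheapness"])])  # [1, 3]
```

The pruned tuple can be discarded no matter which monotone user-specified ranking function is later applied, which is what makes domination a safe filter.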
Presentation of the spatiotemporal RDF store Strabon at the Linked Data Europe Workshop, co-located with the European Data Forum in Athens, Greece (21 March 2014)
2011 ITS World Congress - GO-Sync - A Framework to Synchronize Transit Agency... - Sean Barbeau
Discusses an open-source tool that can sync GTFS datasets with OpenStreetMap to help small agencies manage their bus stop inventory via crowd-sourcing. Includes some actual results of crowd-sourcing bus stop location accuracy in Tampa, FL.
Abstract - In the present paper we describe a new, updated and refined dataset specifically tailored to train and evaluate machine learning based malware traffic analysis algorithms. To generate it, we started from the largest databases of network traffic captures available online, deriving a dataset with a set of widely-applicable features and then cleaning and preprocessing it to remove noise, handle missing data and keep its size as small as possible. The resulting dataset is not biased by any specific application (although specifically addressed to machine learning algorithms), and the entire process can run automatically to keep it updated.
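The cleaning and preprocessing steps can be sketched minimally; the actual feature set and rules are the authors', and the field names below are placeholders.

```python
# Minimal sketch of the kind of cleaning described in the abstract:
# drop records with missing values and collapse exact duplicates, so
# the dataset stays denoised and as small as possible.

def clean(records, required):
    seen, out = set(), []
    for r in records:
        if any(r.get(k) in (None, "") for k in required):
            continue                      # handle missing data: drop row
        key = tuple(sorted(r.items()))
        if key in seen:                   # keep dataset size small: dedupe
            continue
        seen.add(key)
        out.append(r)
    return out

raw = [
    {"src_port": 443, "bytes": 1200},
    {"src_port": 443, "bytes": 1200},        # exact duplicate
    {"src_port": None, "bytes": 80},         # missing value
]
print(clean(raw, ["src_port", "bytes"]))  # [{'src_port': 443, 'bytes': 1200}]
```

In an automated refresh pipeline, a deterministic step like this is what lets the dataset be regenerated and kept updated without manual intervention.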
Provenance Analytics at AAAI Human Computation Conference 2013 - T Dong Huynh
Trung Dong Huynh presenting the paper entitled "Interpretation of Crowdsourced Activities using Provenance Network Analysis": how analysing provenance graphs can help interpret crowdsourced activities in CollabMap.
Understanding speed and travel-time dynamics in response to various city related events is an important and challenging problem. Sensor data (numerical) containing average speed of vehicles passing through a road link can be interpreted in terms of traffic related incident reports from city authorities and social media data (textual), providing a complementary understanding of traffic dynamics. State-of-the-art research is focused on either analyzing sensor observations or citizen observations; we seek to exploit both in a synergistic manner.
We demonstrate the role of domain knowledge in capturing the non-linearity of speed and travel-time dynamics by segmenting speed and travel-time observations into simpler components amenable to description using linear models such as the Linear Dynamical System (LDS). Specifically, we propose the Restricted Switching Linear Dynamical System (RSLDS) to model normal speed and travel-time dynamics and thereby characterize anomalous dynamics. We utilize city traffic events extracted from text to explain the anomalous dynamics. We present a large-scale evaluation of the proposed approach on a real-world traffic and Twitter dataset collected over a year, with promising results.
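A greatly simplified stand-in (not RSLDS itself) conveys the underlying idea: fit a linear model to "normal" dynamics and flag observations with large residuals as anomalous, to be explained afterwards by textual event reports.

```python
# Toy scalar stand-in for the linear-dynamics idea: fit x[t+1] ~ a*x[t]
# by least squares, then flag time steps whose residual is large.
# RSLDS is far richer (switching regimes, restrictions); this only
# illustrates "linear model for normal, residuals for anomalous".

def fit_ar1(series):
    """Least-squares estimate of a in x[t+1] = a * x[t]."""
    num = sum(series[t] * series[t + 1] for t in range(len(series) - 1))
    den = sum(x * x for x in series[:-1])
    return num / den

def anomalies(series, a, threshold):
    """Indices whose observation deviates from the linear prediction."""
    return [t + 1 for t in range(len(series) - 1)
            if abs(series[t + 1] - a * series[t]) > threshold]

speeds = [60, 59, 61, 60, 58, 25, 22, 59, 60]  # sudden drop ~ incident
a = fit_ar1(speeds)
print(anomalies(speeds, a, threshold=15))  # [5, 7]
```

Both the abrupt drop (index 5) and the abrupt recovery (index 7) break the linear model; in the paper's setting, traffic events extracted from tweets would be used to explain such flagged segments.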
The data streaming processing paradigm and its use in modern fog architectures - Vincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture covers (briefly) the data streaming processing paradigm, research challenges related to distributed, parallel, and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
On-the-fly Integration of Static and Dynamic Linked Data - aharth
Slides of COLD 2013 Paper "On-the-fly Integration of Static and Dynamic Linked Data", Andreas Harth (KIT), Craig Knoblock (USC), Steffen Stadtmüller (KIT), Rudi Studer (KIT), Pedro Szekely (USC)
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & R... - DataStax Academy
Do you want to expose a kick-ass REST API? Do you want interactions with that API to drive slick dashboards that answer hard questions? Do you want all of that in near real-time, distributed across a number of machines, and tolerant of system faults? Take one part Cassandra, stir in a bit of Paxos, and blend with Storm. Coat the rim of the glass with DropWizard. Sit back, relax, and enjoy the show. Come see how Health Market Science is using this mixology to fix our healthcare system.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general, and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It’s more recently been adapted to serve as a programming model for much finer-grain in-memory workflow on extreme scale systems, where it can perform task rates in the millions to billion-per-second.
In this talk, we describe the state of Swift’s implementation, present several Swift applications, and discuss ideas for of the future evolution of the programming model on which it’s based.
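Swift itself is not shown here; as a rough Python analogy under that caveat, the implicitly parallel dataflow style the talk describes can be imitated with futures, where calls with no data dependency run concurrently and results are awaited only when consumed.

```python
# Python analogy (not Swift code) for implicitly parallel functional
# dataflow: fan out many invocations of an ordinary function, then
# join on the results only where a downstream value needs them.
from concurrent.futures import ThreadPoolExecutor

def simulate(x):
    """Stands in for one invocation of an ordinary program."""
    return x * x

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(simulate, i) for i in range(8)]  # fan out
    total = sum(f.result() for f in futures)                # join
print(total)  # 140
```

In Swift, this fan-out/join structure is implicit in the language semantics rather than spelled out with an executor, which is what lets the same script scale from a laptop to a supercomputer.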
"From complex Systems to Networks: Discovering and Modeling the Correct Network" - diannepatricia
From complex Systems to Networks: Discovering and Modeling the Correct Network" by Nitesh Chawla as part of the Cognitive Systems Institute Speaker Series
Nitesh Chawla is the Frank M. Freimann Professor of Computer Science and Engineering, and director of the research center on network and data sciences (iCeNSA) at the University of Notre Dame.
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre - HPCC Systems
Data-centric approach: our platform is built on the premise of absorbing data from multiple data sources and transforming them into highly intelligent social network graphs that can be processed to reveal non-obvious relationships.
Reflections on Almost Two Decades of Research into Stream Processing - Kyumars Sheykh Esmaili
This is the slide deck that I used during my tutorial presentation at the ACM DEBS Conference (http://www.debs2017.org/) that was held in Barcelona between June 19 and June 23, 2017.
The tutorial paper itself can be accessed here: http://dl.acm.org/citation.cfm?id=3095110
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... - Ian Foster
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions), and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool that displays detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
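A toy sketch conveys the core idea (this is not DIAR's actual algorithm, and the `coverage` function below is a stand-in for running the instrumented target): drop seed bytes whose removal leaves the coverage fingerprint unchanged, so mutations concentrate on bytes that matter.

```python
# Toy illustration of seed shrinking: greedily delete bytes that do not
# change a coverage fingerprint. `coverage` fakes a target program whose
# behavior depends only on a header and one attribute marker.

def coverage(data: bytes) -> frozenset:
    feats = set()
    if data.startswith(b"<xml>"):
        feats.add("header")
    if b"name=" in data:
        feats.add("attr")
    return frozenset(feats)

def shrink_seed(seed: bytes) -> bytes:
    base = coverage(seed)
    out, i = seed, 0
    while i < len(out):
        candidate = out[:i] + out[i + 1:]
        if coverage(candidate) == base:   # byte was uninteresting: drop it
            out = candidate
        else:                             # byte matters: keep it, move on
            i += 1
    return out

seed = b"<xml>  padding  name=x trailing junk"
print(shrink_seed(seed))  # b'<xml>name='
```

Every byte surviving the shrink is one whose mutation can actually change program behavior, which is why lean seeds make each fuzzing mutation count.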
These are slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW) 2022.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Streaming Weather Data from Web APIs to Jupyter through Kafka
1. Weather & Transportation
Streaming the Data, Finding Correlations
Provide capability to Data for Democracy (democratizing_weather_data)
University of Washington Professional & Continuing Education
BIG DATA 230 B Su 17: Emerging Technologies In Big Data
Team D-Hawks
John Bever, Karunakar Kotha, Leo Salemann, Shiva Vuppala, Wenfan Xu
2. Overview
Our "Client": Data for Democracy
Learn more: www.datafordemocracy.org | https://github.com/Data4Democracy | democratizing_weather_data/streaming
Our Mission
● Provide a streaming capability to extract weather and traffic data from multiple Web APIs, and produce a clean, merged dataframe suitable for Machine Learning and other Data Science analysis.
● Deliver code to D4D's GitHub repository.
● Use vendor-neutral, open-source solutions, implemented in Python and Jupyter notebooks.
3. Pipeline
• Kafka transport mechanism (vendor-neutral, open source)
• Message value is an entire JSON document
• One topic per source API, guarantees consistent schema
• Multiple json documents (sharing same schema) combined into a single dataframe
• Dataframe records joined based on space and time
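The "multiple JSON documents into a single dataframe" step can be sketched with only the standard library; in the real pipeline the strings below would arrive as Kafka message values rather than inline literals.

```python
# Sketch of combining JSON documents that share a schema into one table.
# In the pipeline, each message value is an entire JSON document read
# from a per-API Kafka topic; here they are inline example strings.
import json

messages = [
    '{"station": "S1", "time": "2017-08-21T10:00", "temp_f": 71}',
    '{"station": "S2", "time": "2017-08-21T10:00", "temp_f": 68}',
]

# One topic per source API guarantees a consistent schema, so the
# parsed dicts stack directly into a table (list of rows).
rows = [json.loads(m) for m in messages]
columns = sorted(rows[0])
print(columns)    # ['station', 'temp_f', 'time']
print(len(rows))  # 2
```

The one-topic-per-API rule is what makes this safe: every message parsed from a topic yields the same columns, so no per-row schema reconciliation is needed.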
4. Web APIs
Postman
• Great tool for interacting with potential APIs.
• Friendly GUI for constructing requests and reading responses.
• Provided JSON files before the pipeline was completed, allowing analysis of the data in parallel.
ProgrammableWeb.com
● A massive searchable directory of over 15,500 web APIs, updated daily
● Includes sample source code for APIs
7. Analysis
Load each JSON file, normalize it, and save it as a dataframe; repeat for the next JSON file, appending to the prior.
7 days of data (includes the eclipse!), 30 minutes between readings:
• 54 weather JSON files from Yahoo (54 rows x 31 columns)
• 394 weather JSON files from WSDOT (40,931 rows x 16 columns)
• 395 traffic JSON files from WSDOT (70,998 rows x 20 columns)
Merge the WSDOT and Yahoo weather dataframes (using the columns common to both), then merge the traffic and weather dataframes. Each row has:
- Traffic data from a specific traffic dataframe row
- Weather data from a weather station within 20 miles and 30 minutes of the traffic reading
Result: 1 merged traffic/weather table (52,975 rows x 30 columns)
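The space-and-time join rule above can be sketched in plain Python (a hedged illustration, not the project's notebook code; the field names are assumptions):

```python
# Sketch of the space/time join: attach to each traffic reading a weather
# reading taken within 20 miles and 30 minutes of it.
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

def miles(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(h))   # Earth radius ~ 3959 miles

def match(traffic, weather, max_miles=20, max_minutes=30):
    out = []
    for t in traffic:
        for w in weather:
            close = miles(t["lat"], t["lon"], w["lat"], w["lon"]) <= max_miles
            recent = abs(t["time"] - w["time"]) <= timedelta(minutes=max_minutes)
            if close and recent:
                out.append({**t, **{"weather_" + k: v for k, v in w.items()}})
                break   # first qualifying station wins in this sketch
    return out

traffic = [{"lat": 47.61, "lon": -122.33,
            "time": datetime(2017, 8, 21, 10, 15), "speed": 42}]
weather = [{"lat": 47.45, "lon": -122.31,
            "time": datetime(2017, 8, 21, 10, 0), "temp_f": 71}]
print(len(match(traffic, weather)))  # 1
```

A real implementation would index stations spatially instead of scanning all pairs, but the acceptance test per pair (distance threshold plus time window) is the same.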
9. Analyzing the Merged Traffic/Weather Dataset
Scatterplot matrix with Seaborn (10% random sample) over: Average Travel Time, Current Travel Time, Wind Direction, Wind Speed, Temp., Humidity, Barometer
10. Wrapping Up ...
Key Takeaways
• Choose your python libraries carefully (2 lines of code for a fully-labeled lineplot vs. dozens)
• Spatial plots first, data-joins later (I-5 traffic data vs. statewide weather, also Portland)
• The fastest way to count records in a dataframe is df.shape[0]
Conclusion
• Data for Democracy has a repeatable way to extract weather and transportation data from WSDOT and Yahoo
• Jupyter Notebook provides a teaching/coding environment
• Bitnami provides low-cost simple Kafka infrastructure
Further Work
• Upload CSVs and zipped JSONs to data.world
• Better parameters for producer scripts (e.g., longitude, latitude, date, time)
• Config files for access keys
• More matrix plots, Data Science, Machine Learning
• Gather data for longer time frames (fewer readings per day?)
• Isolate matrix plots to specific locations and/or times