3. Introduction
● The success of modern enterprises and businesses relies heavily on how they process massive amounts of data
● “Drowning in data yet starving for knowledge”
● With the emergence of social media, the public has gained the potential to generate massive amounts of data
● But we still struggle to extract useful information from this data
4. Introduction
● Road traffic has become a major issue, mainly in developing countries
● It directly affects a country’s economy and development through wasted resources – fuel and time
● Using technology to find solutions has proven to be a success story in a number of cases
● This study focused on one such solution that emerged from the use of social media
● Twitter – Popular for dynamic content publishing
○ Users publish on different topics such as current affairs, news, politics and personal interests via 140-character messages called tweets
5. Background
● Road.lk – A website that provides localized traffic alerts from a Twitter feed
● Experiencing road traffic or have information on road traffic? Tweet about it!
● All users who follow @road_lk receive traffic alerts in near real-time
● Identified as a potential source for extracting information on road traffic in real-time
● Reliability is maintained by the high number of publishers
7. Background
● The potential is significant for a country like Sri Lanka, due to the unavailability of high-tech traffic monitoring systems
● Several limitations:
○ Connectivity requirement
○ No proper alert mechanism other than the Twitter feed or the road.lk website
● Notable limitation – Users use natural language to post traffic updates
● A fixed format could make processing tweets more straightforward, but it would reduce the flexibility of sharing updates
8. Solution & Methodology
● A prototype solution was implemented by combining NLP and CEP
tools
● Accommodates three use cases,
○ Real-time road traffic feed and geo location map
○ Traffic search within an area
○ Traffic alert subscription
● Developed an architecture for a these use cases
● Multiple tools were utilized to retrieve, process and present
information
8
10. Solution & Methodology – Feed
● Feed Retrieval – Access Twitter via its API
● Existing feed for model training dataset generation
○ REST API, Twitter4J
● Real-time feed stream for alert generation
○ Streaming API, WSO2 Enterprise Service Bus Twitter
connector
10
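A minimal Twitter4J sketch of the REST retrieval path above (the query string, page size, and the assumption that credentials live in twitter4j.properties are illustrative, not taken from the deck):

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class FeedFetcher {
    public static void main(String[] args) throws Exception {
        // Credentials are read from twitter4j.properties (assumption for this sketch)
        Twitter twitter = TwitterFactory.getSingleton();
        // Search for tweets mentioning the @road_lk feed
        Query query = new Query("@road_lk");
        query.setCount(100); // up to 100 tweets per page
        QueryResult result = twitter.search(query);
        for (Status status : result.getTweets()) {
            System.out.println(status.getCreatedAt() + " " + status.getText());
        }
    }
}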
11. Solution & Methodology – NLP
● @road_lk Twitter feed
○ Reliable data source to generate real-time traffic alerts
○ Constrained by natural language representation
● Transform this data into a machine readable representation – Can
use the full potential of this source for a better solution
● Proposed a NLP model to address this problem
● Extracted two entities from a tweet – location and traffic level
● Before extracting these two entities,
○ A tweet needed to be classified – Traffic alert or not?
○ Cleaning, preprocessing
11
12. Solution & Methodology – NLP
● NLP tasks required to classify and extract,
○ Tweet categorization
○ Location extraction
○ Traffic level extraction
● First task – Document categorization task
● Latter two – Name entity recognition (NER) tasks
● Apache OpenNLP toolkit was used
● Custom tokenizer for street names and city names
● Traffic level NER task – Predefined set of words selected to tag
● Had to consider factors – Spelling mistakes, informal language,
abbreviations 12
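A minimal sketch of the classify-then-extract pipeline, assuming OpenNLP 1.8+; the doccat model name, category label, and simplified whitespace tokenization are assumptions for illustration, while en-location.bin mirrors the model name used in the Siddhi queries later:

import java.io.FileInputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class TweetNlp {
    public static void main(String[] args) throws Exception {
        // Tokenization is simplified here; the deck mentions a custom tokenizer
        // for street names and city names.
        String[] tokens = "heavy traffic near Borella junction".split(" ");

        // Step 1: classify the tweet - traffic alert or not (model and label assumed)
        DoccatModel catModel = new DoccatModel(new FileInputStream("en-tweet-doccat.bin"));
        DocumentCategorizerME categorizer = new DocumentCategorizerME(catModel);
        String category = categorizer.getBestCategory(categorizer.categorize(tokens));

        if ("traffic".equals(category)) {
            // Step 2: extract location entities via NER
            TokenNameFinderModel locModel =
                    new TokenNameFinderModel(new FileInputStream("en-location.bin"));
            Span[] locations = new NameFinderME(locModel).find(tokens);
            for (String loc : Span.spansToStrings(locations, tokens)) {
                System.out.println("location: " + loc);
            }
        }
    }
}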
13. Solution & Methodology – CEP
● Another important property of this data source – Required to
process the Twitter feed in real-time
● Our approach was complex event processing (CEP)
● CEP is a field, concerned in processing data from multiple sources
in real-time
● Used WSO2 Complex Event Processor as the CEP tool to analyse
and process Twitter feed input stream
● Siddhi Query Language (SiddhiQL) is at the core of WSO2 CEP
● Designed to process event streams and identify complex event
occurrences
13
14. Solution & Methodology – Siddhi Queries
from classifiedStream#transform.nlp:getEntities(convertedText,4,true,"/_system/governance/en-location.bin")
select * insert into templocationStream;
from classifiedStream#transform.nlp:getEntities(convertedText,1,false,"/_system/governance/en-trafficlevel.bin")
select * insert into temptrafficlevelStream;
from S1=classifiedStream, S2=temptrafficlevelStream, S3=templocationStream
select S1.createdAt as time, S2.nameElement1 as trafficLevel, S3.nameElement1 as location1, S3.nameElement2 as
location2, S3.nameElement3 as location3, S3.nameElement4 as location4
insert into locationsStream;
from uiFeedStream#window.time(120 min) as trafficFeed join SearchEventStream as request
on (trafficFeed.latitude < request.latitude + 0.018 and trafficFeed.latitude > request.latitude - 0.018 and
trafficFeed.longitude < request.longitude + 0.027 and trafficFeed.longitude > request.longitude - 0.027)
select trafficFeed.formattedAddress, trafficFeed.latitude, trafficFeed.longitude, trafficFeed.level, trafficFeed.time
insert into searchResult;
14
15. Solution & Methodology – CEP
● Siddhi queries define how to process and combine existing event
streams to create new event streams
● SiddhiQL was extended with extensions for,
○ Tweet categorization
○ Name entity recognition
○ Geocoding
● Geocoding extension converts the locations into geo coordinates
● Searching functionality used a time-based Siddhi window
○ To retrieve traffic in nearby geo area within a predefined time
period
15
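The geocoding extension itself is not shown in the deck; as a minimal sketch of the idea, assuming a simple in-memory lookup table (a production extension might instead call an external geocoding service), the step could look like:

import java.util.HashMap;
import java.util.Map;

public class Geocoder {
    // Illustrative city -> {latitude, longitude} table; the real extension's
    // data source is an assumption here.
    private static final Map<String, double[]> COORDS = new HashMap<>();
    static {
        COORDS.put("colombo", new double[]{6.9271, 79.8612});
        COORDS.put("kandy", new double[]{7.2906, 80.6337});
    }

    // Returns {latitude, longitude}, or null when the location is unknown.
    public static double[] geocode(String location) {
        return COORDS.get(location.toLowerCase());
    }
}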
16. Results & Conclusion
● Implemented a web based interface to demonstrate the
functionalities
● Users can interact with this interface and make use of the use
cases
● Accuracy measures of NLP through OpenNLP evaluation APIs
● A solution to extract useful information from a crowdsourced social
networking service
● By utilizing a NLP/CEP combined approach
16
18. Results & Conclusion
● Results demonstrate the potential of such model
● To tackle an application of real-time natural language processing
task
● This model can be extended to tackle any real-time unstructured
data stream
● Transforming human readable data into machine readable format
enables deep processing of data to generate useful information and
insights
○ Trend analysis
○ Pattern detection and prediction
18