Content and talk by Giovanni Lanzani (GoDataDriven) at SEA Amsterdam in November 2014: real-time data-driven applications using Python and pandas as the backend.
Big Data is a term used in business analytics for datasets that we cannot manage with current methodologies or data mining software tools due to their size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary because of the volume, variability, and velocity of such data.
In this talk, we will focus on advanced techniques in Big Data mining in real time using evolving data stream techniques: using a small amount of time and memory resources, and being able to adapt to changes. We will discuss a social network application of data stream mining to compute user influence probabilities. And finally, we will present the MOA software framework with classification, regression, and frequent pattern methods, and the SAMOA distributed streaming software that runs on top of Storm, Samza and S4.
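The core constraint the abstract describes (a small, fixed budget of time and memory, while adapting to change) can be illustrated with a toy example. This is not MOA or SAMOA code, just a minimal stdlib sketch of a drift-adapting statistic; the class name and decay rate are invented for illustration.

```python
# Sketch: a constant-memory stream statistic that adapts to concept
# drift via exponential decay. Older observations fade out, so the
# estimate tracks the current regime of the stream.

class DecayingMean:
    """Tracks a running mean that forgets old data at rate `alpha`."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = float(x)
        else:
            # Newer observations dominate; memory use stays O(1).
            self.value = (1 - self.alpha) * self.value + self.alpha * x
        return self.value

m = DecayingMean(alpha=0.5)
for x in [0, 0, 0, 10, 10, 10]:
    m.update(x)
print(round(m.value, 2))  # 8.75: the mean has drifted toward the new regime
```

A batch mean over the same six values would report 5.0; the decayed mean sits near the recent values instead, which is the behaviour stream miners want when the data-generating process changes.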
TensorFrames: Google TensorFlow on Apache Spark - Databricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to do distributed computing on GPUs.
A lecture given for Stats 285 at Stanford on October 30, 2017. I discuss how OSS technology developed at Anaconda, Inc. has helped to scale Python to GPUs and Clusters.
Keynote talk at PyCon Estonia 2019 where I discuss how to extend CPython and how that has led to a robust ecosystem around Python. I then discuss the need to define and build a Python extension language I later propose as EPython on OpenTeams: https://openteams.com/initiatives/2
The Weather of the Century Part 3: Visualization - MongoDB
MongoDB natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. In this presentation, learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
The weather is everywhere and always. That makes for a lot of data. This talk will walk you through how you can use MongoDB to store and analyze worldwide weather data from the entire 20th century in a graphical application. We’ll discuss loading and indexing terabytes of data in a sharded cluster, and optimizing the schema design for interactive exploration. MongoDB also natively supports geospatial indexing and querying, and it integrates easily with open source visualization tools. You'll learn high-performance techniques for querying and retrieving geospatial data, and how to create a rich visual representation of global weather data using Python, Monary, and Matplotlib.
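As a taste of the geospatial querying mentioned above, here is a minimal sketch of a MongoDB `$geoWithin` query. It is built as plain Python dicts so the structure is visible without a running server; the polygon coordinates are illustrative. With pymongo you would first create the index with `db.weather.create_index([("position", "2dsphere")])` and then pass the dict to `db.weather.find(query)`.

```python
# GeoJSON polygon roughly covering the Netherlands (illustrative
# coordinates; longitude first, per the GeoJSON convention).
polygon = {
    "type": "Polygon",
    "coordinates": [[
        [3.3, 50.7], [7.2, 50.7], [7.2, 53.6], [3.3, 53.6], [3.3, 50.7],
    ]],
}

# Find all observations whose "position" field falls inside the polygon.
query = {"position": {"$geoWithin": {"$geometry": polygon}}}

print(sorted(query["position"].keys()))  # ['$geoWithin']
```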
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
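The blocked-computation idea behind Dask can be sketched with the standard library alone: split the data into chunks, reduce each chunk as an independent task, then combine the partial results. This is an illustration of the concept, not the Dask API, and the function names are invented.

```python
# Sketch of a blocked algorithm: per-chunk reductions run as
# independent tasks, then a cheap combine step merges the partials.

from concurrent.futures import ThreadPoolExecutor

def blocked_sum(data, chunk_size=4):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, chunks))  # one task per chunk
    return sum(partials)  # combine step

print(blocked_sum(list(range(10))))  # 45, same answer as sum(range(10))
```

Dask generalizes exactly this shape into a lazily built task graph over NumPy/pandas chunks, scheduled across threads, processes, or a cluster.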
Slides used in the sprint session "Tensorflow, Python 그리고 Apache Zeppelin" (TensorFlow, Python, and Apache Zeppelin) at PyCon Korea 2017.
This is a quick taste rather than a tutorial, so the explanations are very brief.
Hands-on notebook:
https://www.zepl.com/viewer/notebooks/bm90ZTovL2p1bi82YmI1ODFjMzZmOTA0YmZmOGQyYTEzNmI3MWQzODVhNy9ub3RlLmpzb24
R Data Visualization - Spatial data and Maps in R: Using R as a GIS - Dr. Volkan OBAN
R Data Visualization-Spatial data and Maps in R: Using R as a GIS
Reference: https://pakillo.github.io/R-GIS-tutorial/
Basic packages
library(sp) # classes for spatial data
library(raster) # grids, rasters
library(rasterVis) # raster visualization
library(maptools) # reading and handling spatial objects (e.g. shapefiles)
library(rgeos) # geometry operations (buffers, intersections, unions)
library(dismo) # species distribution modelling; Google Maps basemaps
library(googleVis) # interface to the Google Charts API
library(rworldmap) # world and country-level maps
library(RgoogleMaps) # Google Maps tiles as plot backgrounds
Talk given at the first OmniSci user conference, where I discuss cooperating with open-source communities to ensure you get useful answers quickly from your data. I also get a chance to introduce OpenTeams in this talk and discuss how it can help companies cooperate with communities.
Probabilistic data structures. Part 2: Cardinality - Andrii Gakhov
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I describe common data structures and algorithms for estimating the number of distinct elements in a set (its cardinality), such as Linear Counting, HyperLogLog, and HyperLogLog++. Each approach is presented with the math behind it and simple examples to clarify the theory.
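As a taste of the simplest estimator mentioned, here is a minimal sketch of Linear Counting: hash each element into one of m bit-buckets, then estimate the cardinality from the number of still-empty buckets V as n ≈ -m * ln(V/m). Parameter values are illustrative.

```python
# Linear Counting sketch: constant memory (m bits), one hash per item.

import hashlib
import math

def linear_count(items, m=1024):
    buckets = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        buckets[h % m] = 1
    empty = m - sum(buckets)
    return round(-m * math.log(empty / m))

# 500 distinct values seen with repeats; the estimate should land near 500.
stream = [i % 500 for i in range(5000)]
print(linear_count(stream))
```

HyperLogLog improves on this by keeping only the maximum number of leading zero bits per bucket, shrinking memory from m bits to a few kilobytes for billions of distinct items, at the cost of slightly more involved bias corrections.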
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB - Cody Ray
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Many startups collect and display stats and other time-series data for their users. A supposedly-simple NoSQL option such as MongoDB is often chosen to get started... which soon becomes 50 distributed replica sets as volume increases. This talk describes how we designed a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, it has some limitations: millisecond time granularity and the lack of atomic upsert operations make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. In this deep-dive talk, we explore how we've used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.
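The counting challenge described above is commonly attacked by pre-aggregating events into fixed-width time buckets before writing, so the store sees one write per (metric, bucket) instead of one per event. A minimal stdlib sketch of that rollup (not KairosDB or Storm code; metric names and bucket width are illustrative):

```python
# Roll raw events up into per-minute counters before they hit the store.

from collections import Counter

def rollup(events, bucket_ms=60_000):
    """events: iterable of (metric_name, timestamp_ms) pairs."""
    counts = Counter()
    for metric, ts in events:
        bucket = ts - ts % bucket_ms  # truncate to the bucket start
        counts[(metric, bucket)] += 1
    return counts

events = [("page.views", 5_000), ("page.views", 59_999), ("page.views", 61_000)]
print(rollup(events))  # two buckets: 0 and 60000
```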
Google Analytics vs. Omniture Comparative Guide - Jimmy Jay
The Google Analytics vs. Omniture comparative guide is a clear way to differentiate between the two available web analytics applications. The guide is based on both the basic and the more complex features of the two platforms.
Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL... - NoSQLmatters
Giovanni Lanzani – SQL & NoSQL databases for data driven applications
For data to be the fuel of the 21st century, and for data science to live up to its promise as a driver of innovation, their application should not be confined to dashboards and static analyses. Instead they should be the driver of real applications that support the organisations that own or generate the data. Most of these applications are web-based and require real-time access to the data. However, many Big Data analyses and tools are inherently batch-driven and not well suited for secure, real-time and performance-critical connections with applications. Trade-offs often become inevitable, especially when mixing multiple tools and data sources. In this talk we will describe our journey to build a data driven application at a large Dutch financial institution. We will dive into the issues we faced, our considerations and the technical choices we made in order to perform data analyses but also drive web-based, real-time applications. We considered and used Impala, HBase, and MongoDB, but also conventional SQL databases such as MySQL and PostgreSQL. Important aspects in our journey were, among others, the handling of geographical data, the access to hundreds of millions of records, and the real-time analysis of millions of data points.
Real time data driven applications (SQL vs NoSQL databases)GoDataDriven
Content and talk by Giovanni Lanzani (GoDataDriven) at the NoSQL matters conference in Dublin (September 2014).
Big Data: Everybody talks about it, nobody knows how to do it. Everyone else thinks someone else is doing it, so claims to be doing it.
Giovanni covers what real-time data-driven applications are, presents an app built for one of GoDataDriven's customers, the challenges that arose, and which database helped GoDataDriven achieve the level of performance they wanted.
Timeseries - data visualization in Grafana - OCoderFest
The presentation deals with proper monitoring of applications and resources. It covers the tools that help with the presentation layer (Grafana), storage (InfluxDB), and the transport of measurements to their destination (Telegraf).
Presented by Marek Szymeczko
State of Play: Data Science on Hadoop in 2015, by Sean Owen at Big Data Spain - Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine learning is not new. Big machine learning is qualitatively different: more data beats algorithm improvements, scale trumps noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
source{d} is building the open-source components to enable large-scale code analysis and machine learning on source code. Their powerful tools can ingest all of the world’s public git repositories turning code into ASTs ready for machine learning and other analyses, all exposed through a flexible and friendly API. Francesc will show you how to run machine learning on source code with a series of live demos.
Business Dashboards using Bonobo ETL, Grafana and Apache Airflow - Romain Dorgueil
A zero-to-one, hands-on introduction to building a business dashboard using Bonobo ETL, Apache Airflow, and a bit of Grafana (because graphs are cool). The talk is based on the early version of our tools to visualize the apercite.fr website: plan, implement, visualize, monitor, and iterate from there.
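The extract-transform-load chaining that an ETL graph like Bonobo's expresses can be sketched with plain generators. This is a stdlib illustration of the pattern, not the Bonobo API; the stage names and sample rows are invented.

```python
# Extract -> transform -> load as chained generators: each stage
# consumes rows lazily from the previous one.

def extract():
    yield from [{"page": "/home", "hits": "3"}, {"page": "/about", "hits": "7"}]

def transform(rows):
    for row in rows:
        yield {**row, "hits": int(row["hits"])}  # cast strings to ints

def load(rows, sink):
    for row in rows:
        sink.append(row)  # a real loader would write to a database

sink = []
load(transform(extract()), sink)
print(sum(r["hits"] for r in sink))  # 10
```

In Bonobo the same three stages would be nodes in a graph object executed by the framework, which adds parallelism and monitoring around this core shape.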
I am Shubham Sharma, a Computer Science and Engineering graduate of the Acropolis Institute of Technology. I have spent around two years in the field of machine learning and currently work as a data scientist at Reliance Industries Private Limited, Mumbai, mainly focused on problems related to data handling, data analysis, modeling, forecasting, statistics, machine learning, deep learning, computer vision, and natural language processing. My areas of interest are data analytics, machine learning, time series forecasting, web information retrieval, algorithms, data structures, design patterns, and OOAD.
Lens: Data exploration with Dask and Jupyter widgets - Víctor Zabalza
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will present Lens (https://github.com/asidatascience/lens), a Python package which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
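The per-column first pass that such a profiler performs can be sketched in a few lines of stdlib Python. This is an illustration of the kind of summary described above, not the Lens API; the function name and sample rows are invented.

```python
# Compute data-quality and distribution basics per column.

import statistics

def summarise(rows):
    """rows: list of dicts with identical keys; returns per-column stats."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        info = {"null_fraction": 1 - len(present) / len(values)}
        if all(isinstance(v, (int, float)) for v in present):
            info.update(min=min(present), max=max(present),
                        mean=statistics.mean(present))
        summary[col] = info
    return summary

rows = [{"age": 31, "city": "Utrecht"},
        {"age": None, "city": "Delft"},
        {"age": 45, "city": "Utrecht"}]
print(summarise(rows)["age"])
```

Lens additionally computes histograms, KDEs, pairwise distributions and correlations, and parallelises the per-column passes as tasks in a Dask graph rather than looping serially as above.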
Rdio's Alex Gaynor at Heroku's Waza 2013: Why Python, Ruby and JavaScript are Slow - Heroku
Rdio Software Engineer Alex Gaynor (@alex_gaynor) took to the #Waza 2013 stage (Heroku's developer conference) to talk about "Why Python, Ruby and JavaScript are Slow". Gaynor argues that developers should aim to make performance beautiful. For more from Gaynor, or to contact him, ping him at @alex_gaynor.
For more on Waza visit http://waza.heroku.com/2013.
For Waza videos stay tuned at http://blog.heroku.com or visit http://vimeo.com/herokuwaza
How to create a Devcontainer for your Python project - GoDataDriven
Prevent misaligned environments between developers, onboard new joiners faster, and reduce the time it takes to get your project to production. Sounds interesting? Devcontainers can help you with this. Devcontainers allow you to connect your IDE to a running Docker container and develop inside it, giving you all the benefits of reproducibility that Docker is known for. In this talk, I will walk you through what Devcontainers are, why they might be useful for you, and how to create one for your Python project using VSCode.
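A minimal `devcontainer.json`, placed in a `.devcontainer/` folder at the project root, might look like the following sketch; the image tag and extension id are illustrative examples rather than requirements:

```json
{
  "name": "my-python-project",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "pip install -r requirements.txt",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```

With this file in place, VSCode offers to reopen the folder inside the container, so every developer gets the same interpreter, dependencies, and tooling.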
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z... - GoDataDriven
Many machine learning models we use today have the core assumption that our data needs to be tabular, but how often is this truly the case? What if our data points are not independent? By ignoring the potential interrelatedness of our data, do we lose meaningful information that our models cannot leverage? In this talk, we shall explore graph neural networks and highlight how they can solve interesting problems in a way that is intractable when limiting ourselves to using tabular data. We will look at the limitations of common algorithms and highlight how some clever linear algebra enables us to incorporate more meaningful information into our models. Social network data is a popular example of where relationships are relevant but relationships exist in many types of data where it may not be so obvious. Whether it's e-commerce, logistics or molecular data, relationships within your data likely exist and making use of them can be incredibly powerful. This talk will hopefully spark your curiosity and provide you with a way of looking at problems from a new angle. It is intended for anyone with an interest in machine learning and will only lightly touch on some technical details.
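The core idea above, incorporating relationships instead of treating rows as independent, can be sketched as one round of neighbour aggregation, the building block of a graph convolution. This is a stdlib toy (no learned weights, invented data), not a full GNN layer.

```python
# One message-passing round: each node's new feature is the mean of its
# own feature and its neighbours' features, so information flows along
# edges instead of rows being treated independently.

def aggregate(features, edges):
    neighbours = {n: [] for n in features}
    for a, b in edges:  # undirected edges
        neighbours[a].append(b)
        neighbours[b].append(a)
    new = {}
    for node, feat in features.items():
        pooled = [feat] + [features[n] for n in neighbours[node]]
        new[node] = sum(pooled) / len(pooled)
    return new

features = {"a": 1.0, "b": 0.0, "c": 0.0}
edges = [("a", "b"), ("b", "c")]
print(aggregate(features, edges))  # "b" now carries signal from "a"
```

A real GNN layer multiplies the pooled vectors by learned weight matrices and applies a non-linearity; stacking k rounds lets each node see its k-hop neighbourhood.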
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022 - GoDataDriven
Time-series data is all around us: from logistics to digital marketing, from pricing to stock markets - it’s hard to imagine a modern business that has no time series data to forecast. However, mastering such forecasting is not an easy task. For this talk, we have collected a list of common time series issues that digital fortune tellers commonly run into. You will learn how to identify, understand and resolve them better. This will include stabilising divergent time series, handling outliers without anomaly propagation, reducing the impact of noise and more.
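One of the issues mentioned, handling outliers without the fix itself propagating as an anomaly, can be sketched with a rolling median computed over already-cleaned history. This is a stdlib illustration with invented numbers, not code from the talk.

```python
# Replace spikes with a rolling median of the cleaned values so the
# replacement never re-introduces the anomaly into later windows.

import statistics

def clean(series, window=3, threshold=5.0):
    out = list(series)
    for i in range(len(series)):
        lo = max(0, i - window)
        context = out[lo:i] or [series[i]]  # use already-cleaned history
        med = statistics.median(context)
        if abs(series[i] - med) > threshold:
            out[i] = med  # replace the spike, keep the trend
    return out

series = [10, 11, 10, 95, 11, 10]  # one sensor glitch at index 3
print(clean(series))  # [10, 11, 10, 10, 11, 10]
```

Because the window reads from `out` rather than `series`, the glitch at index 3 cannot contaminate the medians used for later points.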
MLOps CodeBreakfast on AWS - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
MLOps CodeBreakfast on Azure - GoDataFest 2022 - GoDataDriven
During the MLOps CodeBreakfast, we will be giving an introduction to MLOps. After this introduction, we will go into more detail on how to implement and deploy a Machine Learning pipeline on both Azure and AWS.
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022 - GoDataDriven
In this talk, we will compare the most widely used BI tools in the market from the perspective of a mature data organization. The focus of this talk WON’T be on flashy features or superficial sales talk. We will compare both tools in terms of how well they fit in with DataOps best practices, and how they rank in terms of speed of delivery, governance, robustness, and analytical capabilities.
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022 - GoDataDriven
Deploy your own modern data stack using open-source components and cloud-agnostic Terraform tooling. By leveraging open-source components you can deploy a state-of-the-art modern data platform in a day. What are the pros and cons of "build it yourself" in the data and analytics space?
AWS Well-Architected Webinar: Security - Ben de Haan - GoDataDriven
The security pillar encompasses the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. This presentation will provide in-depth, best-practice guidance for architecting secure systems on AWS.
The 7 Habits of Effective Data Driven Companies - GoDataDriven
1. Start searching for use cases with value & impact: without use cases, nobody will want to draft a data strategy.
Where do you want to go? Draft a clear Customer Experience that you want to create and think about the organization & data strategy to get there!
2. Get Tech (data scientists and engineers) and Business (Product Management & Commercial) at the same table: create a solid foundation.
3. Start with communities of practice to learn & experiment together and build the capability.
4. Stop talking about data. Start experimenting and doing.
5. Product Management needs to get real about data. (start training these capabilities)
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema... - GoDataDriven
The typical organizational model is that teams are in constant flux, are created for work, are only responsible for change, and are not empowered, or lack the trust, to run products. A high-performance organization model allows teams to take full responsibility for cost, compliance and security, and lets them own their own incidents. This improves quality, lowers change failure rates and costs, and leads to happier employees. DevOps is about creating with the end in mind, cross-functional autonomous teams and end-to-end responsibility: you build it, you run it; you break it, you fix it. This means you want to automate everything in a CI/CD pipeline, and roll forward rather than roll back. DevOps principles play an important role in a data-driven maturity model: continuous prototyping, and a data mindset and skills for everybody.

In a data science workflow, combining input data and deriving the model features usually requires most of the work, and many iterations before it's done. Implement features one by one, and start with a baseline model to compare against more complex models, to see if the additional complexity is worth the performance gain. The result of a data scientist's work is a trained model, which combines four components: input data, derived features, chosen model type and hyperparameters. A trained model is always the combination of data and code. So where do you run this trained model? Model management is versioning code but not the data; a model management server stores hyperparameters, performance metrics, metadata and trained models. In a data science pipeline, we have two components for deployment: the application and the trained model. So we split the pipeline into parts: a build pipeline, a train pipeline and a deploy pipeline. A complete pipeline mapped to Azure components would largely look like this: an Azure DevOps build pipeline, an Azure ML training pipeline and an Azure DevOps release pipeline.
Artificial intelligence in action: delivering a new experience to Formula 1 fans - GoDataDriven
At GoDataFest 2019, Guy Kfir presented how AI delivers a new experience to Formula 1 fans across the world, with AWS fueling the analytics through machine learning. Did you know a Formula 1 race car contains 120 sensors and generates 3 GB of data every race, at 1,500 data points per second? AWS developed several applications, including overtake possibility and pitstop advantage. How important is it for your company to invest in machine learning and AI? There are three scenarios for AI/ML success: automation, enrichment and invention. So, what are you waiting for: create the loop, advance your data strategy and organize for success. To get started, identify AI/ML use cases, educate yourself, start with AI services and move to Amazon SageMaker, engage with AWS, and consider the partner ecosystem (like GoDataDriven or Binx).
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof - GoDataDriven
During GoDataFest 2019, Rens Weijers, manager of data & strategy, and Peter van 't Hof, data engineer, share the story of how Vattenfall develops smart applications on Azure. Vattenfall has the ambition to transition to fossil-free living within one generation. But what about decentral energy solutions in the Customers & Solutions business unit? Data is key to helping customers reduce their CO2 footprint, and Azure enables Vattenfall to be personal and relevant towards customers.
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019 - GoDataDriven
Every company today is talking about AI/ML, but when most companies talk about AI/ML in their transformation journey, you hear terms like proof of concept, feasibility study, pilot, and A/B test. We are at the peak of AI's hype, but only 12% of enterprises have deployed AI in production. Google aims to make big data processing available to everyone; the possibilities of BigQuery ML are endless: marketing, retail, industrial and IoT, media, gaming, and so forth.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage expands alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next four years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools, so you can effortlessly explore, discover, and access the data you need and focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
4. Real-time, data driven app?
•No store and retrieve;
•Store, {transform, enrich, analyse} and retrieve;
•Real-time: retrieve is not a batch process;
•App: something your mother could use:
SELECT attendees
FROM pydataberlin2014
WHERE password = '1234';
21. Aside
•Memoize every recursive algorithm in Python!
from cytoolz import functoolz

@functoolz.memoize
def memo_levenshtein(s, t):
    if len(s) == 0: return len(t)
    if len(t) == 0: return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    # recurse through the memoized function, or the cache is never hit
    return min(memo_levenshtein(s[:-1], t) + 1,
               memo_levenshtein(s, t[:-1]) + 1,
               memo_levenshtein(s[:-1], t[:-1]) + cost)
22. Aside
•Memoize the algorithm if you're using Python!
a = "Marco Polo str."
b = "Marco Polo straat"
%timeit memo_levenshtein(a, b) # 213 ns
%timeit levenshtein(a, b) # 1.84 µs
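The same memoization is available from the standard library, with no third-party dependency; a minimal sketch using functools.lru_cache instead of cytoolz:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def levenshtein(s, t):
    # edit distance between strings s and t; lru_cache stores every
    # (s, t) subproblem, so each pair is computed only once
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    return min(levenshtein(s[:-1], t) + 1,      # deletion
               levenshtein(s, t[:-1]) + 1,      # insertion
               levenshtein(s[:-1], t[:-1]) + cost)  # substitution

distance = levenshtein("Marco Polo str.", "Marco Polo straat")  # → 3
```

Without the cache this recursion is exponential in the string lengths; with it, each distinct suffix pair is solved once.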
24. helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]
    # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()        # select sum(hits) from …
    delta_hits = sbi_df.delta.sum() # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
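A toy run of the helper above (restated here so the snippet is self-contained; the column names sbi, hits and delta follow the slide, the numbers are made up):

```python
import pandas as pd

def get_statistics(data, sbi):
    # pandas equivalent of a WHERE filter plus two SUM aggregates
    sbi_df = data[data.sbi == sbi]
    hits = sbi_df.hits.sum()
    delta_hits = sbi_df.delta.sum()
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}

data = pd.DataFrame({
    "sbi":   [47, 47, 56],
    "hits":  [120, 30, 10],
    "delta": [100, 25, 5],
})
stats = get_statistics(data, 47)
# total = 120 + 30 = 150, delta sum = 125, percentage = 25 / 125 = 0.2
```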
25. helper.py example
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    return df_sbi
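A minimal illustration of that group-by (toy data; in the actual app the frame holds hits per date, hour and sbi code):

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["2013-02-01", "2013-02-01", "2013-02-02"],
    "hour": [9, 9, 10],
    "sbi":  [47, 47, 47],
    "hits": [5, 7, 3],
})

# select sum(hits) from data group by date, hour, sbi
timeline = data.groupby(["date", "hour", "sbi"]).aggregate("sum")
```

The result is indexed by (date, hour, sbi), so the front-end can slice one group straight out of it with .loc.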
27. Who has my data?
•First iteration was a (pre)-POC, less data (3GB vs
500GB);
•Time constraints;
•Oops:
import pandas as pd
...
source_data = pd.read_csv("data.csv", …)
...
def get_data(postcodes, dates):
    result = filter_data(source_data, postcodes, dates)
    return result
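The "oops" is that the whole CSV is parsed at import time, before any request arrives. One way to defer and cache the load, sketched here with an in-memory CSV (io.StringIO stands in for data.csv; get_source_data is an illustrative name, not from the talk):

```python
import io
import pandas as pd

CSV = io.StringIO("postcode,hits\n1234,10\n5678,3\n")

_source_data = None

def get_source_data():
    # parse the CSV only on the first call; afterwards
    # return the cached dataframe
    global _source_data
    if _source_data is None:
        _source_data = pd.read_csv(CSV)
    return _source_data
```

Importing the module is now cheap; the cost is paid on first use, and every later call reuses the same frame.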
28. Advantage of “everything is a df”
Pro:
•Fast!!
•Use what you know
•NO DBA’s!
•We all love CSV’s!
Contra:
•Doesn’t scale;
•Huge startup time;
•NO DBA’s!
•We all hate CSV’s!
29. If you want to go down this path
•Set the dataframe index wisely;
•Align the data to the index:
source_data.sort_index(inplace=True)
•Beware of modifications of the original dataframe!
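A minimal sketch of why the index matters: on a sorted index, .loc lookups use binary search instead of a full scan, and range slices become possible (toy postcode index, made-up numbers):

```python
import pandas as pd

df = pd.DataFrame(
    {"hits": [3, 1, 4, 1, 5]},
    index=[1071, 1017, 1234, 1098, 1012],  # postcode digits as the index
)

df = df.sort_index()           # sorted index → fast lookups, sliceable
subset = df.loc[1012:1098]     # label slice; needs a sorted index
```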
30. If you want to go down this path
The reason pandas is faster is because I came
up with a better algorithm
31. New architecture
data = get_data(postcodes, dates) → data = get_data(db, postcodes, dates)
[Architecture diagram: AngularJS front-end ⇄ REST (JSON) ⇄ back-end: app.py → helper.py → database.py → psycopg2 → data]
33. Issues?!
•With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
•Index on date and postcode, but single queries still run for more than 20 minutes.
SELECT * FROM datapoints
WHERE
date IN date_array
AND
postcode IN postcode_array;
34. Postgres + Postgis (2.x)
PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries.
SELECT *
FROM datapoints
WHERE ST_DWithin(geom, ST_MakePoint(lon, lat)::geography, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
-- geom: the table's geography column
-- every point within 1.5km
-- from (lon, lat) on imaginary dates
36. Steps to solve it
1. Align data on disk by date;
2. Use the temporary table trick:
3. Lose precision: 1234AB→1234
CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp
JOIN datapoints d
  ON d.postcode = tmp.postcodes
WHERE
  d.dt IN dates_array;
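The temp-table trick above can be sketched end to end. This uses the standard library's sqlite3 in place of Postgres/psycopg2, with a throwaway schema and made-up rows, purely to illustrate the join-instead-of-huge-IN-list pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (postcode TEXT, dt TEXT, hits INT)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [("1234", "2013-02-01", 5),
     ("1234", "2013-02-02", 7),
     ("5678", "2013-02-01", 2)],
)

postcodes = ["1234"]        # in the talk: ~10k postcodes
dates = ["2013-02-01"]

# temp table + join instead of one enormous IN (...) list of postcodes
conn.execute(
    "CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
conn.executemany(
    "INSERT INTO tmp (postcodes) VALUES (?)", [(p,) for p in postcodes])

placeholders = ",".join("?" for _ in dates)
rows = conn.execute(
    f"SELECT d.postcode, d.dt, d.hits FROM tmp "
    f"JOIN datapoints d ON d.postcode = tmp.postcodes "
    f"WHERE d.dt IN ({placeholders})",
    dates,
).fetchall()
```

The planner can then use the index on postcode for the join, instead of evaluating a 10k-element IN list per row.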
37. Take home messages
1. Geospatial problems are “hard” and can kill your
queries;
2. Not everybody has infinite resources: be smart and
KISS!
3. SQL or NoSQL? (Size, schema)
38. GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani
giovannilanzani@godatadriven.com
Giovanni Lanzani
Data Whisperer