With all the hype around ML/AI, everyone is looking at it. There is a widespread perception that you need to know maths before you can do machine learning. In this session we share why that is not true.
Choosing data warehouse considerations, by Aseem Bansal
We recently chose a data warehouse after doing a basic POC of several candidates: AWS Redshift, AWS Athena, Snowflake, and Google BigQuery. In these slides I share the considerations unique to our business that led us to choose Snowflake, and the pros and cons of the various warehouses.
Yufeng Guo: Building machine learning systems for scale with Google Cloud AI... (Codemotion)
Ready to add AI to your app, website, or product? As these tools and their ecosystem grow, you can build faster and better. Come learn how you can leverage the power of Google Cloud to build and scale data and machine learning systems.
Personalization allows Stitch Fix to style its clients and provide recommendations to help them find what they love. To do this, the company gathers information about a client’s preferences up front when they sign up for the service and learns more about them as they become longer-term customers. This information is important for making recommendations but also must be protected and managed with care.
The data science team at Stitch Fix is the primary owner of the recommendation systems. Backing them up is the data platform team, who maintain the data infrastructure, data warehouse, and supporting tools and services. This data warehouse has several different data sources that read and write into it. This includes a logging pipeline for events, every Spark-based ETL, and daily snapshots of structured data from Stitch Fix applications.
Neelesh Srinivas Salian explains Stitch Fix’s process to better understand the movement and evolution of data within its data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh also details how Stitch Fix built a service that helps the company understand the lineage information that is associated with each table in the data warehouse. This service helps the company understand the source, parentage, and journey of all data in the warehouse. Although Stitch Fix makes sure to anonymize and filter out sensitive information from this data, the company needs a more flexible long-term solution as the business expands.
Misusing MLflow To Help Deduplicate Data At Scale (Databricks)
At Intuit, we have a lot of data – and a lot of duplicate data collected over decades. So we built a rule-based, self-serve tool to identify and merge duplicate records. It takes experimentation and iteration to get deduplication just right for 100s of millions of records, and spreadsheet-based tracking just wasn’t enough. We now use MLflow to automatically capture execution notes, rule settings, weights, key validation metrics, etc., all without requiring end-user action. In this talk, we’ll talk about our use case and why MLflow is useful outside its traditional ML Ops use cases.
Introduction to our data warehouse solution, BigQuery.
The Google Cloud Platform products are based on our internal systems, which power Google AdWords, Search, YouTube, and our leading research in the field of real-time data analysis.
You can get access to our free trial ($300 for 60 days) through google.com/cloud
Tor Hovland: Taking a swim in the big data lake (AnalyticsConf)
Are you curious about the possibilities enabled by Microsoft Azure and Cortana Analytics? Come and see how to handle data input from a large number of “Internet of Things” devices, how to work with all the data, how to scale big computations, how to make predictions, and how to build applications on top of it. There will be demos!
Consolidating MLOps at One of Europe’s Biggest Airports (Databricks)
At Schiphol airport we run a lot of mission-critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now, in times of Covid, it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics, and above all monitoring them actively to see if they still fit the current state of affairs.
To achieve this we rely on MLflow, but we have also integrated it with many of our other systems: we have written Airflow operators for MLflow to ease the retraining of our models, integrated MLflow deeply with our CI pipelines, and integrated it with our model monitoring tooling.
In this talk we will take you through the way we rely on MLflow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we achieve the same benefits and speed as a traditional software CI pipeline.
PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of environment. This talk is a practical demo on how to use PyCaret in your existing workflows and supercharge your data science team’s productivity.
Applied Data Science Course Part 1: Concepts & your first ML model (Dataiku)
In this first course of our Applied Data Science online course series, you'll learn about the mindset shift of going from small to big data, basic definitions and concepts, and an overview of the data science workflow.
Cloud computing is shaping the new normal, revolutionizing modern digital businesses.
In the words of Evgeny Morozov "Cloud computing is a great euphemism for centralization of computer services under one server".
To familiarize you with how Google Cloud works and the various resources it offers, GDSC MH has organized a session on 30 Days of Google Cloud.
Metabase is the free, easy, open source way for everyone in your company to ask questions and learn from data. Easily filter and group data to find what you are looking for. Explore connections between your data. Visualize results.
Herding Cats: Migrating Dozens of Oddball Analytics Systems to Apache Spark w... (Databricks)
HP ships millions of PCs, printers and other devices every year to customers in all market segments. Many of these systems have had various generations of data collection and reporting, going back as many as 16 years. That has led to a significant sprawl of custom data formats, specialized code and numerous brittle legacy systems collecting, analyzing and reporting data.
This session will focus on samples of HP’s journey to find, catalog and ultimately eliminate these systems by migrating to Apache Spark with Databricks in the cloud. Hear about HP’s challenges dealing with legacy systems (some even located under engineers’ desks) and how the power of AWS, Spark, and visualization tools has significantly simplified their migrations. You’ll also learn how the success of this endeavor is not just in counting the number of systems deprecated, but also how the process is evolving into building companywide shared Spark libraries, notebooks and web services that are accelerating future migrations and analysis using Spark.
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented at the Athens Big Data meetup in May 2017.
This talk is on how to become a data scientist, given at the 2nd annual event of the Pune Developer's Community. It focuses on the skill set required to become a data scientist, and on what you can become based on who you are.
Scaling Recommendations at Quora (RecSys talk 9/16/2016), by Nikhil Dandekar
A talk about scaling Quora's recommendations and ML systems, given at the ACM RecSys conference in Boston during the Large Scale Recommendation Systems (LSRS) workshop.
Megatrend: Serverless and Machine Learning
Build an application with google assistant and Cloud functions
Build a social wall completely Serverless with Firebase and GCP
Serverless machine learning at DYNO
District Data Labs Workshop
Current Workshop: August 23, 2014
Previous Workshops:
- April 5, 2014
Data products are usually software applications that derive their value from data by leveraging the data science pipeline and generate data through their operation. They aren’t apps with data, nor are they one time analyses that produce insights - they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”
These applications have been largely built with Python. Python is flexible enough to develop extremely quickly on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline including real time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.
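A minimal sketch of the heart of such a book recommender, item-to-item similarity over a toy ratings matrix (the data, book titles, and function names here are invented for illustration, not the workshop's actual code):

```python
# Minimal item-based recommender: recommend the book most similar
# (by cosine similarity over user ratings) to one the user liked.
import math

# rows: users, columns: books (toy rating matrix, 0 = unrated)
ratings = {
    "alice": {"dune": 5, "neuromancer": 4, "emma": 1},
    "bob":   {"dune": 4, "neuromancer": 5},
    "carol": {"emma": 5, "persuasion": 4, "dune": 1},
}

def column(book):
    """Ratings for one book across all users, in a fixed user order."""
    return [user_ratings.get(book, 0) for user_ratings in ratings.values()]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(book):
    """The other book whose rating column is closest to this one's."""
    books = {b for r in ratings.values() for b in r} - {book}
    return max(books, key=lambda other: cosine(column(book), column(other)))

print(most_similar("dune"))
```

A real data product would sit this logic behind an API and refresh the ratings matrix through the ingestion pipeline described above.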
Building machine learning muscle in your team and transitioning to doing machine learning at scale. We also discuss Spark and other relevant technologies.
Learn more about enterprise frameworks and why your technology business and you need to be thinking about your software application architecture at scale.
When setting up a new project we have some tips and tricks to help you do this in the best way possible, incl. infrastructure, database, standard attributes, logging, code alignment, and service center.
One of the most popular buzz words nowadays in the technology world is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years through automating and optimizing processes. This is leading many organizations to seek experts who can implement Machine Learning into their businesses.
The paper is written for statistical programmers who want to explore a Machine Learning career, add Machine Learning skills to their experience, or enter the Machine Learning field. It discusses my personal journey from statistical programmer to Machine Learning Engineer: what motivated me to start a Machine Learning career, how I started, and what I have learned and done to become a Machine Learning Engineer. In addition, the paper discusses the future of Machine Learning in the pharmaceutical industry, especially in Biometrics departments.
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques (VictorSzoltysek)
In the past six months, the AI landscape has undergone a massive transformation, ushering in a new era of productivity with the latest in Large Language Models (LLMs) and AI technology. This deep dive unlocks how to:
Create CustomGPT Models: No coding needed to tailor AI for your unique projects. Integrate your own data, including PDFs and Excel sheets, making information handling a breeze. Plus, discover how to call your own actions/integrations for even more personalized utility.
Navigate Advanced Prompting: Overcome AI's memory limits and utilize Retrieval-Augmented Generation for accessing your personalized data, streamlining how you interact with AI.
Stay Ahead with AI Trends: Peek into the evolving world of LLMs, featuring newcomers like Google Gemini, Anthropic Claude, Open Sora, and Twitter Grok, and understand what their advancements mean for your productivity.
Witness Real-Life Transformations: Through examples and prompt demonstrations, see firsthand how these AI strategies revolutionize routine tasks, from data analysis to content creation. Learn to leverage image output and input for advanced practical use cases, adding a new dimension to your productivity toolkit.
No previous coding or AI experience is needed for this talk. Stay ahead in the fast-evolving world of work. Embrace the AI revolution and transform your workflow with advanced LLM techniques. Join us to ensure you're not left behind in the productivity race.
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
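As a toy illustration of the RAG pattern such a copilot builds on (the documents, scoring, and prompt format below are invented for the example, not GetInData's implementation), retrieval can be as simple as picking the snippet with the largest word overlap with the question and prepending it to the prompt:

```python
# Toy Retrieval-Augmented Generation: score documentation snippets by
# word overlap with the question, then build an augmented prompt from
# the best match. A production system would use embeddings instead.
def retrieve(question, documents):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question, documents):
    context = retrieve(question, documents)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

docs = [
    "The orders table is partitioned by order_date and lives in the sales schema.",
    "Customer PII columns are masked for all roles except data_admin.",
    "Daily revenue is computed by the dbt model fct_daily_revenue.",
]
prompt = build_prompt("Which table holds orders and how is it partitioned?", docs)
print(prompt)
```

The assembled prompt is what gets sent to the LLM, so the model answers from company data it was never trained on.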
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting primitives for graphs: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives (i.e., sumAt, multiply) in sequential mode.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
1. Why you don't need Maths to get benefits of ML
- Aseem Bansal
2. Speaker
● Over 4 years of experience in web development on the JVM
● Dabbling in ML for over 4 years
● Working with ML in production for over 2 years
https://medium.com/@asmbansal2
https://www.linkedin.com/in/bansalaseem
https://twitter.com/AseemBansal2
3. Current Perspective
● Only for people who have done a degree in Data science/Machine learning
● Only for people who know Maths
8. Historical reasons for the perspective
● When companies like Google and Baidu wanted to do ML, they did not have the required talent
● They turned to universities, which had the talent
● From universities they got professors who had the required skills
● Academia is theory- and Maths-centric
● That theory- and Maths-centric approach spread from there to everywhere
9. What are the problems with the perspective
● Academia has always been theory and Maths centric for everything
● Industry has always been about implementations
10. What are the problems with the perspective
● Web developers don’t write Tomcat or Nginx from scratch to understand the basics of how to make a web application
● Cricketers don’t try to derive the equation of a parabola before they go out for a match
11. What are the problems with the perspective
● Doing practical stuff does not always require Maths
● Doing practical stuff does not always require theory
● To do practical stuff, you understand things to the depth that is required and just do it
● You don’t have to be an expert in the underlying layers; just know how to use them
● There are people who have done the lower-level stuff already
12. 3 levels of Machine learning as I see it
● Composing big black boxes to build solutions
● Cloud providers have production-ready APIs available
● No ML/Maths expertise required
13. 3 levels of Machine learning as I see it
● Using a wider variety of small boxes to build solutions
● There are production-ready open source libraries available which abstract away the Maths
● At this point you would need an understanding of ML concepts and ML libraries
14. 3 levels of Machine learning as I see it
● Build custom boxes to build your solutions
● In a lot of cases you won’t need to go to completely custom solutions
● When you do need to build custom solutions, then you would need an understanding of Maths
21. Using Pre-built Solutions
If you are doing any of the following, you should try out the cloud providers before you try to roll your own:
● Natural Language Processing
● Text to Text Translation
● Speech to Text
● Image Analysis
● Video Analysis
23. Using Pre-built Solutions
● It is a garbage-in, garbage-out problem: feed it bad data and you will get bad machine learning models, which will be useless
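The garbage-in, garbage-out effect is easy to demonstrate. In this illustrative sketch (not part of the talk's demo; dataset and model are assumptions), the same model is trained once on clean labels and once with most training labels randomized:

```python
# Garbage-in, garbage-out: the same model, trained on clean labels
# vs. mostly-randomized labels, evaluated on the same held-out set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# "Bad data": randomize 70% of the training labels.
rng = np.random.default_rng(0)
y_bad = y_train.copy()
idx = rng.choice(len(y_bad), size=int(0.7 * len(y_bad)), replace=False)
y_bad[idx] = rng.integers(0, 3, size=len(idx))

bad_acc = LogisticRegression(max_iter=1000).fit(X_train, y_bad).score(X_test, y_test)
print(f"clean labels: {clean_acc:.2f}, corrupted labels: {bad_acc:.2f}")
```

The point of the slide holds regardless of model choice: no amount of library sophistication recovers signal that the input data no longer carries.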
24. Using Pre-built Solutions
● If you are dealing with unstructured data like images, videos, etc., GPUs (aka graphics cards) are required, and they are not cheap
25. Demo: Use libraries to build your own solutions
Going to use
● Python - programming language
● Jupyter - browser-based code editing and execution environment
● Sklearn - Machine learning library
● Matplotlib - charting library
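The demo itself is not reproduced in the slides, but a minimal sketch of this kind of workflow with those tools might look like the following (the dataset and model choice are assumptions for illustration, not necessarily what the talk used; the optional matplotlib chart is omitted):

```python
# Train and evaluate a classifier in a few lines: the library
# handles all of the underlying maths.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # small built-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                        # training is one call
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Paste this into a Jupyter cell and run it; nothing here requires knowing the maths inside the decision tree.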
31. Demo: Use libraries to build your own solutions
Going to use
● Python - programming language
● Jupyter - browser-based code editing and execution environment
● Sklearn - Machine learning library
32. What we did not answer
● Did not show you how to choose an algorithm
● Did not teach you the libraries
● Installation of Python/Jupyter/libraries
● ML concepts
But a simple Google search can help you with those:
● Search “Anaconda 3 download” and you will find a good page
● Search “sklearn tutorial” and go to the first tutorial that comes up
● Check out fast.ai’s Machine learning course in their forums - quite practical
33. You don’t have to be a Maths PhD to get the benefits of Machine learning