This document provides an overview of an introductory data science course (IST 380). It discusses the course content which includes learning the R programming language, descriptive statistics, predictive modeling, and machine learning algorithms. It also covers course logistics like assignments, grading, and academic honesty policies. The goal of the course is to provide students with practical data science skills that can be applied to real-world problems and datasets.
This document provides an overview and introduction to IST 380, a data science course taught by Zach Dodds. The course covers topics like R programming, statistical analysis, machine learning algorithms, and a final project. Students will learn skills in data visualization, predictive modeling, and applying data science techniques to real-world datasets. The course emphasizes hands-on learning through weekly assignments completed in R.
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L... - Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
What to Expect Your First Year as an NAU Computer Science Major. Presented to the NAU ACM Club and the Computer Science Learning Community students. Tone is very casual.
Data Workflows for Machine Learning - Seattle DAML - Paco Nathan
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
OSCON 2014: Data Workflows for Machine Learning - Paco Nathan
This document provides examples of different frameworks that can be used for machine learning data workflows, including KNIME, Python, Julia, Summingbird, Scalding, and Cascalog. It describes features of each framework such as KNIME's large number of integrations and visual workflow editing, Python's broad ecosystem, Julia's performance and parallelism support, Summingbird's ability to switch between Storm and Scalding backends, and Scalding's implementation of the Scala collections API over Cascading for compact workflow code. The document aims to familiarize readers with options for building machine learning data workflows.
This document summarizes challenges in assembling large DNA sequence data sets and strategies to address them.
1. The cost to generate DNA sequence data is decreasing rapidly, creating data sets too large for most computers to assemble. Hundreds to thousands of such data sets are generated each year.
2. Techniques like streaming compression and low-memory probabilistic data structures allow assembly memory usage to scale linearly with the sample size rather than the total data, enabling assembly of larger datasets.
3. Benchmarking different computational platforms revealed that while some platforms have faster processors, the ability to store large amounts of data locally is also important for assembly tasks. Scaling algorithms, rather than just optimizing code, is key to addressing
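As a rough illustration of the "low-memory probabilistic data structures" mentioned in point 2, the sketch below uses a Bloom filter to remember which k-mers (length-k substrings of a read) have been seen. The choice of a Bloom filter, the class and function names, and the sizes are all assumptions made for illustration; the summary does not say which structure the authors actually use.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with possible false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Memory is fixed up front and does not grow with the number of items added.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(sequence, k):
    """Yield every length-k substring of a DNA sequence."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy read; real inputs would be streamed from sequence files.
seen = BloomFilter(num_bits=10_000, num_hashes=3)
for kmer in kmers("ACGTACGTGA", 4):
    seen.add(kmer)
```

The trade-off is that a lookup may occasionally report a k-mer that was never added, in exchange for a memory footprint that is chosen in advance.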
Breaking Through The Challenges of Scalable Deep Learning for Video Analytics - Jason Anderson
Meetup Link: https://www.meetup.com/Cognitive-Computing-Enthusiasts/events/250444108/
Recording Link: https://www.youtube.com/watch?v=4uXg1KTXdQc
When developing a machine learning system, the possibilities are limitless. However, with the recent explosion of Big Data and AI, there are more options than ever to filter through: which technologies to select, which model topologies to build, and which infrastructure to use for deployment, to name just a few. We have explored these options, along with their many roadblocks, for our faceted refinement system for video content (consisting of 100K+ videos). The three primary areas of focus are natural language processing, video frame sampling, and infrastructure deployment.
Rental Cars and Industrialized Learning to Rank with Sean Downes - Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
This document provides an overview of the CS760 Machine Learning course taught by David Page at the University of Wisconsin. The course will cover a broad survey of machine learning algorithms and applications over 30 class meetings. Topics will include both theoretical and practical aspects of supervised learning algorithms like naive Bayes, decision trees, neural networks, and support vector machines. Students will complete programming homework assignments applying various machine learning algorithms and a midterm exam. The primary goals of the course are to understand what learning systems should do and how existing systems work.
This document outlines the syllabus for a machine learning course. It introduces the instructor, teaching assistant, required textbook, and meeting schedule. It describes the course style as primarily algorithmic and experimental, covering many ML subfields. The goals are to understand what a learning system should do and how existing systems work. Background knowledge in languages, AI topics, and math is assumed, but no prior ML experience is needed. Requirements include biweekly programming homework, a midterm exam, and a final project. Grading will be based on homework, exam, project, and discussion participation. Policies on late homework and academic misconduct are also provided.
This presentation gives a brief overview of the International Collegiate Programming Contest (ICPC), which is organized by ACM and sponsored by IBM.
It was delivered at VB Siddardha Colleges, Vijayawada, on 10 March 2015. Indian participation has so far been limited, so I am encouraging Indian students to take part in this competition by delivering lectures like this one.
The document discusses aspects of being a professional including being highly educated, working autonomously on intellectually challenging tasks, defining technical terms, reading books, referring to references, thinking before working and complaining, and not being overly pedantic. It provides examples of some technical terms and concepts along with explanations to illustrate how to think like a professional.
The document discusses various approaches to data analytics and common pitfalls. It provides examples of recommendation systems at Netflix and Pandora that achieved success by focusing on the business goals rather than just the technology. It also warns against complexifying systems and architectures unnecessarily over time and refusing to remove outdated components. Overall it advocates embracing complexity but also avoiding duct tape solutions, and designing systems with the intended use and business goals in mind rather than getting attached to specific technologies.
DataMind interactive learning: Dublin R User Group: September 2013 - DataMind-slides
Presentation explaining the motivation for building DataMind.org and the technical tools that were used. We also looked at how you can create your own interactive R tutorials with the beta version. More info on http://www.DataMind.org
PyData 2015 Keynote: "A Systems View of Machine Learning" - Joshua Bloom
Despite the growing abundance of powerful tools, building and deploying machine-learning frameworks into production continues to be a major challenge, in both science and industry. I'll present some particular pain points and cautions for practitioners as well as recent work addressing some of the nagging issues. I advocate for a systems view, which, when expanded beyond the algorithms and codes to the organizational ecosystem, places some interesting constraints on the teams tasked with development and stewardship of ML products.
About: Dr. Joshua Bloom is an astronomy professor at the University of California, Berkeley, where he teaches high-energy astrophysics and Python for data scientists. He has published over 250 refereed articles, largely on time-domain transient events and telescope/insight automation. His book on gamma-ray bursts, a technical introduction for physical scientists, was published recently by Princeton University Press. He is also co-founder and CTO of wise.io, a startup based in Berkeley. Josh has been awarded the Pierce Prize from the American Astronomical Society; he is also a former Sloan Fellow, Junior Fellow at the Harvard Society, and Hertz Foundation Fellow. He holds a PhD from Caltech and degrees from Harvard and Cambridge University.
Spark Based Distributed Deep Learning Framework For Big Data Applications - Humoyun Ahmedov
Deep Learning architectures, such as deep neural networks, are currently among the hottest emerging areas of data science, especially in Big Data. Deep Learning could be effectively exploited to address some major issues of Big Data, such as fast information retrieval, data classification, semantic indexing and so on. In this work, we designed and implemented a framework to train deep neural networks using Spark, a fast and general data flow engine for large-scale data processing, which can utilize cluster computing to train large-scale deep networks. Training Deep Learning models requires extensive data and computation. Our proposed framework can accelerate training by distributing the model replicas, via stochastic gradient descent, among cluster nodes for data residing on HDFS.
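The idea of distributing model replicas and training them with stochastic gradient descent can be sketched in a few lines. The version below is only an illustration under stated assumptions: it uses synchronous parameter averaging, a one-parameter linear model, and a plain Python loop standing in for cluster nodes. The abstract does not say how the framework combines replicas, and all function names and data here are hypothetical.

```python
def local_sgd(weight, shard, learning_rate, epochs):
    """Run SGD on one data shard for the model y = weight * x (squared error)."""
    for _ in range(epochs):
        for x, y in shard:
            gradient = 2.0 * (weight * x - y) * x
            weight -= learning_rate * gradient
    return weight


def train_with_replicas(shards, rounds, learning_rate=0.01, epochs=1):
    """Each round: copy the model to every shard, train locally, average the results."""
    weight = 0.0
    for _ in range(rounds):
        # In a cluster setting each call would run on a different node;
        # here the list comprehension plays that role (an assumption).
        replicas = [
            local_sgd(weight, shard, learning_rate, epochs) for shard in shards
        ]
        weight = sum(replicas) / len(replicas)
    return weight


# Toy data generated from y = 3x, split into two shards.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
learned = train_with_replicas(shards, rounds=50)
```

Averaging after every round keeps the replicas in step at the cost of a synchronization point; asynchronous schemes relax that, but they are outside this sketch.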
Best Practices for Building and Deploying Data Pipelines in Apache Spark - Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
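Idempotency, one of the considerations named above, means that re-running a pipeline step for the same input leaves the output in the same state rather than appending duplicates. The sketch below shows one common way to get that property: write to a deterministic path and replace it atomically. It is not Waimak's API; the function name, the path layout, and the JSON format are assumptions chosen for illustration.

```python
import json
import os
import tempfile


def write_partition(output_dir, run_date, records):
    """Write one day's records so that repeating the call gives the same result."""
    os.makedirs(output_dir, exist_ok=True)
    # The target path depends only on the logical partition, not on when we run.
    final_path = os.path.join(output_dir, f"date={run_date}.json")
    # Write to a temporary file first, then swap it in, so readers never
    # observe a half-written partition.
    fd, tmp_path = tempfile.mkstemp(dir=output_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(sorted(records, key=lambda r: r["id"]), handle)
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as workdir:
    rows = [{"id": 2, "value": "b"}, {"id": 1, "value": "a"}]
    first = write_partition(workdir, "2024-01-01", rows)
    second = write_partition(workdir, "2024-01-01", rows)  # simulated re-run
    files_after_rerun = sorted(os.listdir(workdir))
```

Because the second call overwrites the same partition file, a retry after a failure does not create extra files or duplicate rows.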
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ... - Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvement, scale trumps noise and sample size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
The document provides an overview of the CS101 Introduction to Computing course. It discusses the course objectives to build an appreciation of fundamental computing concepts, proficiency in productivity software, and beginner web development skills. The course structure is outlined, covering these topics over 15 weeks through lectures, readings, and assignments culminating in a midterm and final exam. Assessment is based on assignments, a midterm exam, and a final exam. Key figures in early computing history and capabilities of modern computers are also introduced.
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t... - DevOpsDays Riga
Big data, data science, and machine learning are coming to a lot of companies. Everyone is used to creating ordinary software, but BD/DS/ML requires special care. Managers and developers may run into unfamiliar problems, and I want to tell you about them and their solutions, so that no money or nerves are wasted.
Everyone has heard of data science, machine learning and big data. Many companies are starting to build up teams and run projects. Everyone knows how to develop, deliver and deploy ordinary software, but data-driven software is a different animal. Scientists, developers and managers may not be familiar with the issues that may come up.
What is ATT&CK coverage, anyway? Breadth and depth analysis with Atomic Red Team - MITRE ATT&CK
From ATT&CKcon 3.0
By Brian Donohue, Red Canary
This presentation will highlight the Atomic Red Team project's efforts to define and increase the test coverage of MITRE ATT&CK techniques. We'll describe the challenges we encountered in defining what "coverage" means in the context of an ATT&CK-based framework, and how to use that definition to improve an open source project that's used by a diverse audience of practitioners to satisfy an equally diverse array of needs. The audience will learn how the Atomic Red Team maintainers standardize and categorize atomic tests, perform gap analysis to achieve deep technique-level coverage and broad matrix-level coverage, and quickly fill those gaps with new tests.
Data Science Challenge presentation given to the CinBITools Meetup Group - Doug Needham
The document describes the Cloudera Data Science Challenge, which involves solving three data science problems using large datasets. For the first problem, Smartfly, the goal is to predict flight delays using historical flight data and machine learning algorithms like logistic regression and SVM. The second problem, Almost Famous, involves statistical analysis of web log data and filtering for spam. The third problem, Winklr, requires social network analysis to recommend users to follow on a social media platform based on click data. The document discusses the approaches, tools, and algorithms used to solve each problem at scale using Apache Spark and Hadoop technologies.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
1. Welcome to IST 380 !
Data Science
Programming
an advocate of concrete computing – and HMC's mascot
When the course was over, I knew it was a good thing.
- New York Times Review of Courses
We don't have strong enough words to describe this class.
- US News and Course Report
We give this course two thumbs!
- Ebert and Roeper
3. About myself
Who: Zach Dodds
Where: Harvey Mudd College
What: Research includes robotics and computer vision
When: Mondays 7-10pm here in ACB 119
Contact Information: dodds@cs.hmc.edu, 909-607-0867
Office Hours: Friday mornings, 9-11 am, or set up a time...
HMC Beckman B111
6. IST 380 ~ the big picture
Data Science
Venn Diagram
Hmmm… where am I
on this diagram?
What is it?
7. Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" background?
(statistics, machine learning, CS)
Where?
10. Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
Where?
be sure to set up your login + profile for the submission site…
17. Make data easier to use ~ by using it!
It may be true that
Data Science isn't a
science – but that
doesn't mean it's
not useful!
18. IST 380 ~ the big picture
What? Why?
Data Science
Programming Data Rules
All of our insights – large and small, permanent and
ephemeral, natural and artificial – come about
through the integration of lots of data.
Data Science simply recognizes that the rules and
skills behind those insights are widely applicable…
19. A few examples…
Make3d
How is this being done?
Andrew Ng ~
Computers and
Thought award,
2009
… Data Science is at the heart of computer science
and how do we succeed?
20. A few examples…
… Data Science is at the heart of computer science
Stanford's
Autonomous
Vehicles project
(Thrun et al.)
Learning to
Powerslide
21. A few examples…
… Data Science is at the heart of computer science
"my summer was
finding that red line"
Learning ground
from obstacles
28. Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Netflix Prize
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = ??
Lord of the Rings = ??
Some films are difficult to predict…
29. Bob Bell, winner of the "Netflix prize"
(I don't know this guy)
Netflix Prize
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = .67
Lord of the Rings = .42
Some films are difficult to predict… and others are easier!
31. Why IST 380 ?
Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and machine learning algorithms
Broad background:
You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.
Final project ~ open-ended with datasets of your choice
33. Details
Web Page: http://www.cs.hmc.edu/~dodds/IST380
Assignments, online text, necessary files, lecture slides are linked
First week's assignment: Getting started with R
Programming: R
www.r-project.org/
Textbook: An introduction to Data Science
jsresearch.net/groups/teachdatascience/
Grab both of these now… freely available online
and many online resources…
35. Homework
Assignments
~ 2-5 problems/week
~ 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5 (1 week + 1 day…)
36. Homework
Working on programs:
On your own or in groups of 2.
Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab:
install software
ensure accounts are working
try out R - the first HW is officially due on 2/5
37. Outline
Weeks 1-5
using R
descriptive statistics
predictive statistics
probability distributions
Weeks 6-10
"Data Science"
"Machine Learning"
statistical modeling
support vector machines (SVMs)
random forests
k-means algorithm
nearest neighbors (NN)
Weeks 11-15
approximate!
Final Project
No breaks?!
38. Grading
Grades
Based on points percentage
~ 800 points for assignments
~ 400 points for the final project
if score >= 0.95: grade = "A"
if score >= 0.90: grade = "A-"
if score >= 0.86: grade = "B+"
see the course syllabus for the full list...
Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
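The grade cutoffs on this slide can be sketched as a small R function. This is only an illustration: the function name letter_grade and the example point totals are hypothetical, and only the three cutoffs shown on the slide are included (anything lower returns NA, since the slide defers to the syllabus for the full list).

```r
# Map a points fraction (between 0 and 1) to a letter grade using only
# the three cutoffs shown on the slide. Lower grades are left as NA
# because the full list lives in the course syllabus.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# Hypothetical totals: 730 of ~800 assignment points, 370 of ~400 project points
score <- (730 + 370) / (800 + 400)
letter_grade(score)
```

The branches are chained with else so that the first cutoff a score meets decides the grade; written as three independent if statements, a later, lower cutoff would overwrite an earlier one.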
39. Academic Honesty
This course operates under CGU's (and all of Claremont Schools')
Academic Honesty policies…
• Your work must be your own. This must be true for the whole team, if you're working in a pair.
• Consulting with people other than your team members or me is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.
• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.
42. Getting to know… R
http://lang-index.sourceforge.net/#categ
R is the programmer's toolkit for statistics; SAS, Stata,
SPSS are preferred by those in business intelligence
44. Getting to know… R
R is responsive, up-to-date, and flexible: Data Science vs. Statistics
45. Getting to know… R
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
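As a rough sketch of what these first commands do at the R prompt (the set.seed call and the variable name z are additions for illustration, not part of the slide):

```r
6 * 7            # the prompt works as a calculator

set.seed(380)    # assumption: fix the seed so the random draws are repeatable
z <- rnorm(10)   # ten draws from a standard normal distribution
length(z)        # number of values drawn
mean(z)          # near 0, though not exactly, for a sample this small

x <- 380         # <- assigns the value 380 to the name x
x + 1            # x can now be used in later expressions
```

Without set.seed, rnorm(10) prints different numbers each time it is run, which is the expected behavior.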
46. Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!
This is problem 1 this week
47. Saving your session
This is problem 1 this week
1) Create a folder named hw1, perhaps on your desktop
2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!
48. Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, login, change your password to something you know, and submit hw1.zip
5) You can submit again – all copies are saved…
You've completed Problem 1!
This webserver can be spacey -- I should know! troubles? email me!
56. NA
R uses NA to represent data that is "not available"
What is going on here?
The function is.na( ) tests for NA
This uses subsetting to remove NA values!
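The console session shown on this slide is not part of the transcript, so here is a minimal sketch of the same ideas. The vector states and its values are made up for illustration.

```r
# Hypothetical data: number of states visited, with two missing responses
states <- c(44, NA, 12, 30, NA)

mean(states)                     # NA: arithmetic that touches an NA yields NA
is.na(states)                    # TRUE exactly where a value is missing

# Subsetting with a logical vector keeps only the non-missing values
clean <- states[!is.na(states)]
clean
mean(clean)                      # now an ordinary number

mean(states, na.rm = TRUE)       # built-in shortcut that gives the same result
```

Many summary functions in R, including mean, sum, and max, accept na.rm = TRUE, but the explicit subsetting form is worth knowing because it works with any function.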
60. Lab…
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R
Statistical Package and submit the console sessions as
pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors, introduction to
data frames, and a look at the graphics that R supports.
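For orientation before starting the tutorial, here is a minimal sketch of how vectors and data frames relate. The names and values are hypothetical, loosely modeled on the in-class survey questions, and are not taken from the tutorial itself.

```r
# Hypothetical survey responses stored as three equal-length vectors
name   <- c("Ana", "Ben", "Cy")
states <- c(44, 12, 30)
works  <- c(TRUE, FALSE, TRUE)

# A data frame bundles equal-length vectors as named columns
survey <- data.frame(name, states, works, stringsAsFactors = FALSE)

survey$states                        # one column, returned as a vector
survey[survey$states > 20, ]         # only the rows where the condition holds
mean(survey$states[survey$works])    # average among those currently working
```

The same logical-subsetting idea applies to both plain vectors and data-frame rows, which is why the tutorial works as a reinforcement of vectors.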
61. Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data-
analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field
of data science
(submit your console session for both of these…)
63. CS vs. IS and IT ?
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
greater integration
system-wide issues
smaller details
machine specifics
69. The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
73. Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
Where?
be sure to set up your login + profile for the submission site…
This class is truly seminar-style: we're developing expertise in this field together.