The document provides an introduction to machine learning concepts including definitions of machine learning, supervised learning, unsupervised learning, and reinforcement learning. It discusses popular machine learning toolkits like Scikit-learn and gives an example of using Scikit-learn to perform linear regression on the Boston housing price dataset to predict median home values from features like crime rates, tax rates, and distances to employment centers. Key features of the Boston housing dataset are also described.
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...Spark Summit
Processing real-time analytics of big data streams from sensor data will continue to be an important task as embedded technology spreads and we continue to generate new types of data and new ways of analyzing it, particularly in regard to the Internet of Things (IoT). Robotics models many of these key challenges well and incorporates the possibility of high-throughput streams as well as complex online machine learning and analytics algorithms. These challenges make it an almost ideal candidate for in-depth analysis of real-time streaming analytics.
We look at a simultaneous localization and mapping (SLAM) problem, an ongoing research area in robotics for autonomous vehicles, and well recognized as a non-trivial problem space in both industry and research. We will use a new integrated framework on Kafka and Spark Streaming to explore a constrained SLAM problem using online algorithms to navigate and map a space in real time.
We present benchmarks of our open-source robot’s integration with Kafka and Spark Streaming for performance against other SLAM algorithms currently in use, explore some of the challenges we faced in our implementation, and make recommendations for improvement of performance and optimization on our framework.
Finally, new to this talk, we demo real-time usage of our implementation with the Turtlebot II and explore relevant benchmarks and their implications on the future of autonomous vehicles in the IoT and cloud analytics space.
IoT Meets Big Data: The Opportunities and Challenges by Syed Hoda of ParStreamgogo6
Download our special report, IoT Tech for the Manager: http://bit.ly/report1-slideshare
IoT Meets Big Data: The Opportunities and Challenges as presented at the IoT Inc Business' Eighth Meetup. See: http://www.iot-inc.com/iot-meets-big-data-the-opportunities-and-challenges/
In our eighth Meetup we have Syed Hoda, Chief Marketing Officer of ParStream presenting “IoT Meets Big Data: The Opportunities and Challenges”. Come meet other business leaders in the IoT ecosystem and discuss the business issues you face in the Internet of Things.
Presentation Abstract
The Internet of Things (IoT) and Big Data have each made press headlines and continue to be board-level priorities. The intersection of IoT and Big Data is a fascinating area of innovation with tremendous scope for business impact. From industrial sensors to vehicles to health monitors, a huge variety of devices connect to the Internet and share information. At the same time, the cost to store data has dropped dramatically while capabilities for analysis have made huge leaps forward. How can analytics drive business benefits from IoT projects? What are the challenges in storing and analyzing huge amounts of real-world information? How can companies generate more value from their data? We will address these questions and also share our perspectives on innovative technologies enabling new IoT use cases.
It is the best book on data mining so far, and I would definitely adopt it for my course. The book is very comprehensive and covers all of the topics and algorithms of which I am aware. The depth of coverage of each topic or method is exactly right and appropriate. Each algorithm is presented in pseudocode that is sufficient for any interested readers to develop a working implementation in a computer language of their choice.
-Michael H. Huhns, University of South Carolina
The discussion on distributed, parallel, and incremental algorithms is outstanding.
-Zoran Obradovic, Temple University
Margaret Dunham offers the experienced database professional or graduate-level Computer Science student an introduction to the full spectrum of Data Mining concepts and algorithms. Using a database perspective throughout, Professor Dunham examines algorithms, data structures, data types, and complexity of algorithms and space. This text emphasizes the use of data mining concepts in real-world applications with large database components.
KEY FEATURES:
• Covers advanced topics such as Web Mining and Spatial/Temporal mining
• Includes succinct coverage of Data Warehousing, OLAP, Multidimensional Data, and Preprocessing
• Provides case studies
• Offers clearly written algorithms to better understand techniques
• Includes a reference on how to use Prototypes and DM products
A look at what devices are online in the Channel Islands cyberspace provided by the telcos. There was a large array of devices, both secure and insecure, ranging from SCADA systems, seismographs, webcams, and DVRs to OWA installations.
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ...Provectus
In this presentation, the speaker will share his experiences from building successful IoT systems. He will also explain why many IoT systems fail to get traction and how Machine Learning can help in that. Finally, he will talk about the right system architecture and touch upon some of the ML algorithms for IoT systems.
#MSIgnite2019: all the metrics, by #SEOhashtag
Learn about the metrics and audience maps developed at #MSIgnite2019. #seohashtag monitored all the metrics in real time with #Metricool, and you can find out what the audiences talked about via #NodeXL.
https://vivianfrancos.com/msignite2019-conoce-todas-las-metricas/
Singapore's IoT Technical Standard Update at IoT Asia 2018Colin Koh (許国仁)
Singapore's vision to be the first Smart Nation in the world, and its position as an advanced manufacturing hub, have made good progress so far; IoT plays a major role in digitizing processes and generating more contextual data. Standards will be the driving factor for scaling up the implementation of many projects in the coming years.
Manifest Data S-1 Speculative Sensation Lab Duke Digital Studio Presentation ...Amanda Starling Gould
S-1’s Manifest Data translates personal digital data into physical sculpture to show us that, far from straightforward or entirely distinct, the systemic connections between users and the digital network are deeply interdependent.
Follow our tweets #manifestdata
The S-1 speculative sensation lab (http://www.s-1lab.org/) is a space for artistic experimentation with emerging digital technologies and their impact on sensory experience.
AI is Coming! Are You Ready? The story of “Self-Driving Datacenter”Sergey A. Razin
There is no doubt that Machine Learning finds space in every home today, but what do we know about it, and how do we apply it?
During this presentation I will cover:
What do we think ML is vs. what it really is?
Why is ML the right solution? (Using IT operations as an example)
How do you build a “self-driving datacenter” using the open tools near you?
Where do we go next?
Using Amazon Machine Learning to Identify Trends in IoT Data - Technical 201Amazon Web Services
The Internet of Things is creating a tidal wave of new data including events, correlations, business value, and much more. The proliferation of new data sets also introduces more potential issues, errors, and spurious values.
In this session, we will explore using Amazon Machine Learning to analyse and understand the new data collected within your IoT solution. In addition, we will learn how to discover patterns, trends, anomalies, and correlations by demonstrating the capabilities of Amazon Machine Learning and SparkML running on AWS Cloud.
Speaker: Simon Elisha, Solutions Architect, Amazon Web Services
In this presentation, Victor Gramm describes what he's learned as a 3D print enthusiast. Victor covers free, low-cost, and open-source solutions, as well as commercially available ones, for 3D scanning, photogrammetry, and 3D modeling. While not an expert on the topic, Victor employs his enthusiasm in an effort to gauge the level of interest in these domains in his area, share what he's learned, and elicit further dialogue on the topic.
Neotys organized its first Performance Advisory Council in Scotland on the 14th and 15th of November.
With 15 load testing experts from several countries (UK, France, New Zealand, Germany, USA, Australia, India…) we explored several themes around load testing, such as DevOps, shift right, and AI.
By discussing their experience, the methods they used, their data analysis, and their interpretation, we created a lot of high-value content that you can use to discover what the future of load testing will be.
Want to know more about this event? https://www.neotys.com/performance-advisory-council
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
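The component decomposition that Levelwise PageRank relies on can be sketched in plain Python. This shows only the decomposition step, on an invented toy graph, using Kosaraju's algorithm (which emits components in topological order of the block-graph); the report's actual distributed rank computation is not reproduced here.

```python
# Toy graph: {0,1,2} forms one strongly connected component that
# feeds into a second component {3,4,5} via the edge 2 -> 3.
graph = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [5], 5: [3]}

def sccs(g):
    """Kosaraju: components are numbered in topological order."""
    order, seen = [], set()
    def dfs1(u):                      # first pass: record finish order
        seen.add(u)
        for v in g[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in g:
        if u not in seen:
            dfs1(u)
    rg = {u: [] for u in g}           # reversed graph
    for u in g:
        for v in g[u]:
            rg[v].append(u)
    comp = {}
    def dfs2(u, c):                   # second pass: label components
        comp[u] = c
        for v in rg[u]:
            if v not in comp:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if u not in comp:
            dfs2(u, c)
            c += 1
    return comp

comp = sccs(graph)
groups = {}
for v, c in comp.items():
    groups.setdefault(c, []).append(v)
levels = [sorted(groups[c]) for c in sorted(groups)]
print(levels)   # [[0, 1, 2], [3, 4, 5]] - each level could be ranked in turn
```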
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
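As a rough Python analogue of the sequential-vs-parallel sum comparisons in the notes above, the sketch below times an element-by-element loop against numpy's vectorized reduction (numpy here merely stands in for the OpenMP/CUDA variants; the array size is arbitrary and timings are machine-dependent):

```python
import time
import numpy as np

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
s_loop = 0.0
for v in x:                      # sequential: one element at a time
    s_loop += v
t1 = time.perf_counter()

s_vec = float(np.sum(x))         # vectorized reduction over the whole array
t2 = time.perf_counter()

print(f"loop {t1 - t0:.3f}s  vectorized {t2 - t1:.3f}s")
```

Both variants compute the same sum (up to floating-point rounding); only the execution strategy differs, which is exactly what the benchmarks above measure.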
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, and faster batch ingestion.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
2. About me
• Email Security @ Symantec
• Doing Data Science to fight Spam and Malware
• Organizer for Python Data Science Group Singapore
• Monthly regular meet-ups for over a year
• http://meetup.com/pydata-sg (>1.8K members)
• https://www.facebook.com/groups/pydatasg/ (>1K members)
• https://twitter.com/pydatasg
• https://engineers.sg/organizations/118 (talks recorded and uploaded)
• Previously with CENSAM @ MIT
• Co-founded startup(s)
• NUS Alumni
• Some questions
• How many of you have heard about Machine Learning, or ML?
• How many of you know how to do ML?
• How many of you earn a living doing ML?
• What this talk offers
• Getting a foot in the door
• Grossly oversimplifying things
• How to learn ML from literature
• Relate to ML terms when thrown at you
• Types of ML
• Learning ML models and their coding (Scikit-learn, and why?)
• Linear Regression
• Logistic Regression
• Clustering
• Lessons from Practical ML
@ObaidTal
4. What is Data?
• Available data (internal)
• Health records
• Organizations
• Universities
• …
• Available data (external)
• www.data.gov.sg
• Publicly available corpuses
• Quality of data
• Trustworthy or not
• Missing data
• A huge challenge in the scientific community
• Other jargon
• Tiny Data: data from sensors
• Big Data: data on a massive scale
• Fast Data: hash-based lookup
@ObaidTal
6. When did we all start with Machine Learning?
• Take a look at the following (outputs) and guess the ?:
• 1, 2, 3, 4, 5, 6, ?, …, ?
• 2, 4, 6, 8, 10, 12, ?, …, ?
• 3, 6, 9, 12, ?, …, ?
• 1, 3, 9, 27, ?, …, ?
• 4, 7, 10, 13, ?, …, ?
• So how can I represent the above?
• Input -> [box] -> Output
• X -> [box] -> Y
• Call this box f()
• Output = f(Input) … in maths
• Y = f(X)
• Answers (assuming the input is 1, 2, 3, 4, 5, 6, …)
• Y = X
• Y = 2*X
• Y = 3*X + 0
• Y = 3^(X-1)
• Y = 3*X + 1
In school… Really, how? How to find ‘…, ?’ – A: Equation (single variable)
@ObaidTal
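The guessing game above can be mechanized for the linear cases: with inputs 1, 2, 3, … the slope m is the constant difference between consecutive outputs, and the intercept b follows from the first term. A small sketch (the function name is made up):

```python
def infer_linear_rule(outputs):
    """Recover m and b in Y = m*X + b from outputs at X = 1, 2, 3, ...
    Assumes the outputs really are linear (constant first difference)."""
    m = outputs[1] - outputs[0]   # slope = the constant difference
    b = outputs[0] - m * 1        # intercept from the first point, X = 1
    return m, b

print(infer_linear_rule([4, 7, 10, 13]))   # (3, 1), i.e. Y = 3*X + 1
print(infer_linear_rule([2, 4, 6, 8]))     # (2, 0), i.e. Y = 2*X
```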
7. Linear Regression – the statistical term
Y = mx + b … from the last example: b = 1 and m = 3.
Suppose the line Y = 3x + 1 is ‘surrounded’ by ‘+’-shaped points, which we had, i.e. the outputs 4, 7, 10, 13, ?, …, ? (Y) for the inputs 1, 2, 3, 4, 5, 6, … (x). Plotted with Input on the horizontal axis and Output on the vertical axis, the line Y = 3x + 1 kind of ‘fits’ these points, and that fit is what lets us find the ‘…, ?’.
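On the slide's noise-free points, ordinary least squares recovers m = 3 and b = 1 exactly; a sketch using the standard closed-form formulas for one-variable regression:

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [3 * x + 1 for x in xs]        # the '+' points: 4, 7, 10, 13, 16, 19

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Simple linear regression in closed form: m = cov(x, y) / var(x)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - m * mean_x
print(m, b)   # 3.0 1.0
```

This is the same calculation scikit-learn's LinearRegression performs (in matrix form) in the demo code later in the deck.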
10. Unsupervised Learning (Types of Machine Learning)
• Data is given, and structure must be inferred
• Clustering is one example of it
• Deep Learning is also considered here
• Examples: finding clusters in
– Gene data
– Image processing: grouping pixels together
– Social network analysis
– The cocktail party problem: lots of people talking; extract the voice of a single person, treating the voices of others as noise
– Text processing
• Independent Component Analysis (ICA) is one such algorithm
Ref. Andrew Y. Ng.
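To make "structure must be inferred" concrete, here is a tiny one-dimensional k-means sketch on invented data; scikit-learn's KMeans runs the same assign-then-update loop in many dimensions:

```python
# Minimal 1-D k-means on made-up data with two obvious groups.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers = [0.0, 10.0]                  # arbitrary starting guesses

for _ in range(10):                    # a few assignment/update rounds
    clusters = [[], []]
    for x in data:
        # assign each point to its nearest center
        nearest = min(range(2), key=lambda k: abs(x - centers[k]))
        clusters[nearest].append(x)
    # move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

print(centers)   # one center near the low group, one near the high group
```

No labels were given anywhere: the two groups emerge from the data alone, which is exactly what "structure must be inferred" means.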
11. Reinforcement Learning (Types of Machine Learning)
• A sequence of decisions is made over time
• Example: flying an autonomous helicopter
• Reward function
• Specify what you want to get done
• Specify good behavior and bad behavior in the reward function
• The learning algorithm will decide how to maximize good behavior and minimize bad behavior
Ref. Andrew Y. Ng.
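A drastically simplified sketch of "specify a reward function and let the algorithm maximize it": a two-action problem with an invented reward function, nowhere near a helicopter:

```python
import random

random.seed(0)

def reward(action):
    # Invented reward function: action 1 is the "good behavior"
    return 1.0 if action == 1 else 0.0

estimates = [0.0, 0.0]   # running estimate of each action's reward
counts = [0, 0]

for step in range(200):
    # explore 10% of the time, otherwise pick the best-looking action
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda k: estimates[k])
    r = reward(a)
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]   # incremental mean

best = max(range(2), key=lambda k: estimates[k])
print(best)   # the learner settles on the rewarded action
```

The learner is only told how good each outcome was, never which action was "correct"; maximizing the reward signal is what drives the behavior, just as the slide describes.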
12. Getting ready – some more terms…
• The data set/input is also called the training set, or observations
• The predictor is called the hypothesis for historical reasons; it is also called a classifier, estimator, or predictor
• Boston housing price problem (we’ll see more of it)
• We will train/learn and predict price
• Features, or input variables: the right side of Y = mx + b, i.e. x
• Price, i.e. Y: the output or target variable of Y = mx + b
• The linear equation Y = mx + b can be written as a predictor, where m is the slope and b is the intercept
• Cost function – which Y = mx + b is better (we will see more of it)
@ObaidTal
Let’s get coding…
• To remember
• Will expand on
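The "cost function – which Y = mx + b is better" point can be made concrete with mean squared error on the earlier slide's points; the second candidate line below is invented purely for comparison:

```python
xs = [1, 2, 3, 4]
ys = [4, 7, 10, 13]                  # the outputs from the earlier slide

def mse(m, b):
    """Mean squared error of the hypothesis Y = m*x + b over the points."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(mse(3, 1))   # 0.0 - this line fits the points exactly
print(mse(2, 2))   # 3.5 - a worse hypothesis, so a higher cost
```

Training a linear regressor amounts to searching for the (m, b) pair with the lowest cost; here that minimum is the exact fit m = 3, b = 1.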
13. Popular Machine Learning Tool Kits – Introduction
Project | Language | Highlight
R | R | A language for statistical analysis and ML
Octave | Octave | A language that simulates Matlab for numerical computations
Scikit-learn | Python | Documentation, examples, and tutorials available; general purpose with a simple API
TensorFlow | Python bindings | A library for numerical computation using data flow graphs
Orange | Python | General-purpose ML package
PyBrain | Python | Neural networks, unsupervised learning
MLlib | Python/Scala | Apache’s new library based within Spark
Mahout | Java | Apache’s framework based on Hadoop
Weka | Java | General-purpose ML package
GoLearn | Go | Machine learning in Go
Shogun | C++ | User interfaces to various languages
14. Machine Learning Kit – which to choose
• Factors to consider
• Language
• Performance (run speed)
• Scalability
Ref. T. Obaid & H. Zhang
• We choose Scikit-learn
• Language: Python
• Performance (run speed): good enough
• Scalability: not critical, and we can switch to MLlib in Spark for massive data
• Well documented, enough algorithms, clean API, robust, fast implementation, easy usage
Scikit-learn – Machine Learning in Python
• Simple and efficient tools for data mining and data analysis
• Accessible to everybody, and reusable in various contexts
• Built on NumPy, SciPy, and matplotlib
• Open source, commercially usable – BSD license
15. Scikit-learn – Examples
• A lot of sample code is in the source folder: scikit-learn-0.16.1/examples
• Boston housing prices (we will work with this example dataset)
• We will try features one by one (we test only 3 of them in this session; please try more)
• Excerpt of the data (how our data actually looks):
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2
Details about each feature of this data are coming next…
• To remember
• Please explore …
16. Features of Boston housing prices
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centers
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000s
Features and their details
Which of these features are
significant:
• All of them?
• A few of them?
• Another one, not among them?
Let’s observe these…
17. Scikit-learn – Demo code for Boston house price. Try it!
import matplotlib.pyplot as plt # for plotting
import numpy as np # for matrix/array operations
from sklearn import datasets, linear_model # dataset loader and estimator
boston = datasets.load_boston()
feature_index = 12 # for LSTAT it's 12, for PTRATIO it's 10, for RM it's 5 – try each one by one
boston_X_train = boston.data[:, feature_index:feature_index + 1] # keep the 2-D shape (n_samples, 1)
boston_y_train = boston.target
regr = linear_model.LinearRegression() # estimator
regr.fit(boston_X_train, boston_y_train) # train parameters
fig, ax = plt.subplots()
ax.scatter(boston_X_train, boston_y_train, color='black') # the raw data points
ax.plot(boston_X_train, regr.predict(boston_X_train), color='green', linewidth=3) # fitted line
ax.set_xlabel(boston.feature_names[feature_index])
ax.set_ylabel('Predicted')
plt.show()
Ref. T. Obaid & H. Zhang
• Important ...
• Good Feature?
• Not so Good Feature?
• Comments
18. Scikit-learn – Demo result for Boston house price
• Parameters
• Coefficient: -0.95692593
• Intercept: 34.7411998746244
• Feature: % lower status of the population
• y = -0.95692593 * LSTAT + 34.7411998746244
• Looks good!
1st try, with LSTAT (% lower status of the population)
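With these fitted parameters, a prediction is just the line evaluated at a chosen input; for instance (LSTAT = 10 is an arbitrary illustrative value):

```python
m, b = -0.95692593, 34.7411998746244  # fitted slope and intercept from above
lstat = 10.0                          # arbitrary example input
medv = m * lstat + b                  # predicted median home value, in $1000s
print(round(medv, 2))                 # → 25.17
```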
23. Scikit-learn – Usage
from sklearn import linear_model
X = [[...]] # source data, shape (n_samples, n_features)
Y = [...] # target values, shape (n_samples,)
clf = linear_model.LinearRegression() # estimator, or classifier
clf = clf.fit(X, Y) # learn parameters from existing data
Test = [[...]] # same shape as X
clf.predict(Test) # predict the target for each row in Test
Ref. T. Obaid & H. Zhang
The model program skeleton would look something like…
• Important
1. Model
2. Fit
3. Predict
• Comments
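The three steps above can be tried end to end on a tiny made-up dataset (a noiseless line y = 2x + 1), just to see the skeleton in action:

```python
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]  # one feature, four samples
Y = [3.0, 5.0, 7.0, 9.0]          # exactly y = 2x + 1

model = LinearRegression()        # 1. Model
model.fit(X, Y)                   # 2. Fit
pred = model.predict([[5.0]])     # 3. Predict
print(pred[0])                    # → 11.0 (up to floating point)
```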
24. Observations from code
• There is always a fit function call, i.e. learning/training on X to give Y.
• Likewise, there is a predict function call that, given X only, outputs Y.
• The pandas library (imported as pd) can alternatively be used for a relatively simpler display of the data.
• The train_test_split function call serves an important purpose: it shuffles the dataset so we don't have selection bias. If, for instance, the data is ordered by price ascending and split in half for training and half for testing, then the training data would contain only the lower-priced houses.
• To remember
• Subtleties
• Probable Issue
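As a sketch of that shuffling (assuming a recent scikit-learn, where train_test_split lives in sklearn.model_selection; the data here is a made-up, perfectly sorted series mimicking the price-ordered worst case):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Worst case from above: data sorted by the target value, ascending
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# shuffle=True is the default, so the ordering is broken before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Both halves now mix low- and high-valued samples
print(y_train.min(), y_train.max())
```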
25. Scikit-learn – Test Data
• Scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the Boston house prices dataset for regression.
• boston (Boston house prices), iris (iris flowers), mlcomp (20 newsgroups), svmlight_file/s, diabetes, lfw_pairs (labeled faces), sample_image/s (china and flower), digits (0–9 handwriting), lfw_people (labeled people), linnerud (for multivariate regression)
• scipy.misc.lena()
• Load test data … Try others!
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
Subset of learning datasets – just saw Boston housing prices
• … Seen so far
• Ahead …
• Please Explore …
26. Scikit-learn – Main Algorithms
• Supervised learning (most have both a classifier and a regressor)
• Linear models: LinearRegression, Lasso, Ridge, LogisticRegression, SGD
• SVM: LinearSVC, SVC, SVR
• Naïve Bayes: GaussianNB, MultinomialNB, BernoulliNB
• Decision Tree: DecisionTree (an optimized version of CART)
• Ensemble methods: RandomForest, AdaBoost, GradientBoosting (GBDT)
• Unsupervised learning
• Clustering: KMeans (k-means++, mini-batch), DBSCAN
• Manifold learning (dimensionality reduction): MDS, Isomap, LocallyLinearEmbedding
• Full algorithm list: http://scikit-learn.org/stable/modules/classes.html
Subset of supported algorithms – we just saw LinearRegression
• … Seen so far
• Ahead …
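One consequence of scikit-learn's design, worth noting alongside this list: all these estimators share the same fit/predict API, so swapping one algorithm for another is usually a one-line change. A minimal sketch with made-up data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # made-up training data
y = [0, 1, 1, 0]                      # made-up labels

for Model in (LogisticRegression, GaussianNB, DecisionTreeClassifier):
    clf = Model()  # only this line changes between algorithms
    clf.fit(X, y)
    print(Model.__name__, clf.predict([[1, 1]]))
```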
27. Logistic (Classification) Regression
• Regression is when our labels y can take any real (continuous) value. Examples include:
• Predicting the stock market.
• Predicting sales.
• Detecting the age of a person from a picture.
• Classification is when our labels y can only take a finite set of values (categories). Examples include:
• Handwritten digit recognition: x is an image with a handwritten digit, y is a digit between 0 and 9.
• Spam filtering: x is an e-mail, and y is 0 or 1 depending on whether that e-mail is spam or not.
Linear (Regression) vs Logistic (Classification)
29. Logistic Regression – with IRIS example
• Categorical output instead of continuous output
• Will use IRIS dataset – to classify 3 species of plants
• Number of Instances: 150 (50 in each of three classes)
• Number of Attributes: 4 numeric, predictive attributes and the class
• Attribute/Feature Information:
• sepal length in cm (will use this)
• sepal width in cm (will use this)
• petal length in cm
• petal width in cm
• Classes, i.e. targets:
• Iris-Setosa
• Iris-Versicolour
• Iris-Virginica
IRIS is a dataset of flower classes… it involves a little bit of botany
Setosa Versicolour Virginica
• The petal is the colored part of the flower
• The sepal is the green leaf below the petal
30. Let’s go code… Try it!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
iris = load_iris()
print("--- Keys ---\n", iris.keys())
print("--- Shape ---\n", iris.data.shape)
print("--- Feature Names ---\n", iris.feature_names)
print("--- Description ---\n", iris.DESCR)
print("--- Target ---\n", iris.target)
iri = pd.DataFrame(iris.data)
print("--- Pandas Head ---\n", iri.head())
iri.columns = iris.feature_names
print("--- Pandas Columns ---\n", iri.head())
logreg = LogisticRegression(C=1e5)
X = iris.data[:, :2] # we only take the first two features
Y = iris.target
print("--- X ---\n", X)
print("--- Y ---\n", Y)
# we create an instance of the classifier and fit the data
logreg.fit(X, Y) # again, the infamous fit method
Part 1 Part 2
• Preparation
• Important
• Debug
31. A little bit more… Try it!
# Plotting
h = .02 # step size in the mesh
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
# Prediction
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.show()
Part 3 Part 4
• Plotting
• Important
• Debug
33. Clustering
• Unsupervised learning
• Output unknown
• Grouping observations
K-Means
• One of the most popular "clustering" algorithms.
• Stores k centroids that it uses to define clusters: a point belongs to a cluster if it is closer to that cluster's centroid than to any other.
• Finds the best centroids by alternating between
• assigning data points to clusters based on the current centroids
• choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
34. Primitive clustering example
[Figure: the input data is sorted (34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145) and split into clusters wherever the gap to the next value exceeds a threshold (<=11, <=12, <=15)]
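The primitive gap-based clustering above can be written in a few lines: split the sorted data wherever the jump to the next value exceeds a threshold (max_gap = 11 here, matching the first threshold in the example):

```python
def gap_clusters(values, max_gap):
    """Sort the values, then start a new cluster whenever the gap
    between consecutive values exceeds max_gap."""
    data = sorted(values)
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)  # close enough: same cluster
        else:
            clusters.append([v])    # big jump: start a new cluster
    return clusters

data = [34, 43, 49, 58, 70, 81, 89, 101, 116, 121, 131, 145]
print(gap_clusters(data, 11))
```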
35. Let's code… Try it!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data # no use of Y here
est = KMeans() # we pick the number of clusters beforehand; the default is 8
est.fit(X) # NOTICE! no Y here – "unsupervised", yay!
labels = est.labels_
fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
Part 1 Part 2
• Preparation/Plotting
• Important
• Debug
36. Lessons learned!
• The dataset on which the model is run here is available and well-formatted, which is not always the case
• Data acquisition and preparation come prior to feature extraction
• Extracting the interesting features, "numerifying" them (converting to numbers, if not already numeric), and later normalizing them, comes prior to running the model
• Features or data columns can be categorical or inferential variables, or can cause singularity problems; these affect the performance of the data model and hence the residual cost
• Selection of model, linear or logistic, and observing cost to select appropriate features, can also be done in R; the conventional gold standard for p (the probability of incorrectly rejecting a true null hypothesis) is ~0.05 (though the actual rate of false positives can be at least 23%, and typically close to 50%)
• Cross validation (CV) is done by running training and testing a few times on different splits and measuring the difference
• A confusion matrix also provides visibility into how many predictions are right and wrong
From real-life Machine Learning
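The confusion matrix mentioned above is available directly in scikit-learn; a minimal sketch with made-up true and predicted labels (rows are true classes, columns are predicted classes, so correct predictions sit on the diagonal):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 2]  # made-up ground truth
y_pred = [0, 1, 1, 1, 2, 2]  # made-up model output

cm = confusion_matrix(y_true, y_pred)
print(cm)
```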
37. Lessons learned!
• If the data is a time series and there is missing data within the time window, then we can apply interpolation or extrapolation. Interpolation works well for archived data, whereas extrapolation suits live data
• Before applying any regression, we may first have to cluster the data and then apply regression per cluster. This helps control outliers, if any, which may hurt the model's performance. Outliers are not always noise in the data
• Selection bias happens when we train the model on data which is not a true representation of the real occurrences. For instance, splitting housing data ordered by price ascending and training on the first half would skip the higher-valued homes. To avoid it, the data should be shuffled to achieve an even distribution
• The curse of dimensionality arises when we are challenged with too many features. To deal with it, carefully remove the non-significant features, including dependent, categorical, or composite features, where applicable
… Continued
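For the interpolation case above (archived time-series data with a gap), NumPy's np.interp gives a quick linear fill; the timestamps and readings here are made up:

```python
import numpy as np

t_known = [0, 1, 3, 4]             # timestamps, with a missing reading at t = 2
v_known = [10.0, 12.0, 16.0, 18.0]

v_filled = np.interp(2, t_known, v_known)  # linear interpolation at the gap
print(v_filled)                            # → 14.0
```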
38. References
• Stanford’s CS229 by Prof Andrew Y. Ng – Highly recommended!
• https://www.youtube.com/watch?v=UzxYlbK2c7E
• Scikit-Learn tutorial
• http://scikit-learn.org/stable/
• http://scikit-learn.org/stable/install.html
• http://www.shogun-toolbox.org/page/features/
• http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/