Polong Lin (林伯龍) / How to approach data science problems from start to end (台灣資料科學年會, Taiwan Data Science Conference)
Polong Lin is a Data Scientist at IBM. He is a regular speaker on data science and develops content for free data education on bigdatauniversity.com using open data tools on datascientistworkbench.com. Polong earned his M.Sc. at the Univ. of Tsukuba.
Jane Hsu is a professor and department chair of Computer Science and Information Engineering at National Taiwan University. Her research interests include multi-agent systems, intelligent data analysis, commonsense knowledge, and context-aware computing. Prof. Hsu is the director of the Intel-NTU Connected Context Computing Center, featuring global research collaboration among NTU, Intel, and the National Science Council of Taiwan. She serves on the editorial board of Journal of Information Science and Engineering (2010-), International Journal of Service Oriented Computing and Applications (Springer, 2007-2009) and Intelligent Data Analysis (Elsevier/IOS Press, 1997-2002). She is actively involved in many key international AI conferences as organizers and members of the program committee. In addition to serving as the President of Taiwanese Association for Artificial Intelligence (2013-2014), Prof. Hsu has been a member of AAAI, IEEE, ACM, Phi Tau Phi, and an executive committee member of the IEEE Technical Committee on E-Commerce (2000) and TAAI (2004-current).
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research (AI Frontiers)
In this talk at the AI Frontiers conference, Jeff Dean discusses recent trends and developments in deep learning research. Jeff touches on the significant progress that this research has produced in a number of areas, including computer vision, language understanding, translation, healthcare, and robotics. These advances are driven both by new algorithmic approaches to some of these problems and by the ability to scale computation to train ever larger models on larger datasets. Finally, one of the reasons for the rapid spread of the ideas and techniques of deep learning has been the availability of open source libraries such as TensorFlow. He gives an overview of why these software libraries have an important role in making the benefits of machine learning available throughout the world.
MixTaiwan 20170222: Min Sun (NTHU EE), AI The Next Big Thing (Mix Taiwan)
Speaker bio:
Min Sun, Assistant Professor, Department of Electrical Engineering, National Tsing Hua University
Dr. Min Sun teaches in the Department of Electrical Engineering at National Tsing Hua University. After graduating from the Department of Electronics Engineering at National Chiao Tung University, he earned an M.S. in Electrical Engineering from Stanford, a Ph.D. in Electrical Engineering: Systems from the University of Michigan, Ann Arbor, and completed a postdoc in Computer Engineering at the University of Washington, Seattle. His research interests span computer vision, machine learning, and human-computer interaction. Building on recent deep-learning breakthroughs in computer vision, he develops systems that cross different subfields of AI, such as automatic video captioning (vision x natural language) and intelligent machines that interact with human behavior (vision x control).
The Unreasonable Benefits of Deep Learning (indico data)
Dan Kuster led a talk at Sentiment Analysis Symposium discussing why businesses should consider adopting deep learning solutions. Key takeaways include simplicity, accuracy, flexibility, and some hacks for working with the tech.
About the Session:
Machine learning is becoming the tool of choice for analyzing text and image data. While traditional text processing solutions rely on the ability of experts to encode domain knowledge, machine learning models learn this directly from the data. Deep learning is a branch of machine learning that, like the human brain, quickly learns hierarchical representations of concepts; it has been key to unlocking state-of-the-art results on a range of text and image classification tasks, such as sentiment analysis and beyond.
In this session, we will show the impact of a deep-learning-based approach over traditional NLP and machine learning methods for text analysis across key dimensions such as accuracy, flexibility, and the amount of required training data. Specifically, we will discuss how deep learning models are now setting the records for state-of-the-art accuracy in sentiment analysis. We will also demonstrate the flexibility of this approach by showing how the features learned by one model can be easily reused in different domains (e.g., handling additional languages, or predicting new categories) to drastically reduce the time to deployment. Finally, we will touch on the ability of this method to handle additional types of data beyond text, e.g., images, for maximum insight.
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ... (AI Frontiers)
Sequence to sequence learning is a powerful way to train deep networks not only for machine translation and various NLP tasks, but also for image generation and, recently, video and music generation. We will give a hands-on tutorial showing how to use the open-source Tensor2Tensor library to train state-of-the-art models for translation, image generation, and a task of your choice!
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un... (AI Frontiers)
In this tutorial I will introduce recent work applying weak supervision and reinforcement learning to Question Answering (QA) systems. Specifically, we discuss the semantic parsing task, in which natural language queries are converted into computation steps on knowledge graphs or data tables that produce the expected answers. State-of-the-art results can be achieved by novel memory structures for sequence models and improvements in reinforcement learning algorithms. Related code and experiment setup can be found at https://github.com/crazydonkey200/neural-symbolic-machines. Related paper: https://openreview.net/pdf?id=SyK00v5xx.
Currently, hundreds of tools promise to make artificial intelligence accessible to the masses: DataRobot, H2O Driverless AI, Amazon SageMaker, and Microsoft Azure Machine Learning Studio, among others.
These tools promise to accelerate the time-to-value of data science projects by simplifying model building.
In the workshop we will approach the topic of AI head-on!
What is AI? What can AI do today? What do I need to start my own project?
We do all this using Microsoft's Machine Learning Studio.
Trainer: Philipp von Loringhoven - Chef, Designer, Developer, Marketer - Data Nerd!
He has acquired a lot of expertise in marketing, business intelligence and product development during his time at the Rocket Internet startups (Wimdu, Lamudi) and Projekt-A (Tirendo).
Today he supports customers of the Austrian digitisation agency TOWA as Director of Data Consulting, helping them generate added value from their data.
These presentation slides were delivered by Imron Zuhri at the seminar and workshop "Introduction and Potential of Big Data & Machine Learning", organized by KUDO on 14 May 2016.
Ilya Sutskever at AI Frontiers: Progress towards the OpenAI mission (AI Frontiers)
I will present several advances in deep learning from OpenAI. First, I will present OpenAI Five, a neural network that learned to play on par with some of the strongest professional Dota 2 teams in the world in an 18-hero version of the game. Next, I will present Dactyl, a human-like robot hand trained entirely in simulation with reinforcement learning that has achieved unprecedented dexterity on a physical robot. I will also present our results on unsupervised learning in language, which show that pre-training and fine-tuning can achieve a significant improvement over the state of the art. Finally, I will present an overview of the historical progress in the field.
An introduction to machine/deep learning and artificial intelligence: how they differ from business intelligence, and how they relate to big data and data science/analytics.
Computer vision techniques can be seen in various aspects of our daily life, with tremendous impact. These slides aim at introducing basic concepts of computer vision and its applications for the general public.
Download link: https://uofi.box.com/shared/static/24vy7aule67o4g6djr83hzurf5a9lfp6.pptx
I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD 2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human in the Loop Data Analytics!
The Machine Learning Workflow with Azure (Ivo Andreev)
Machine learning is not black magic but a discipline that involves data analysis, data science and, of course, hard work. From finding patterns in data and applying algorithms to converting the results into usable predictions, you need background knowledge and appropriate tools. In this session, we will go through the major approaches to preparing data and building and deploying ML models in Azure (ML Studio, Data Science VM, Jupyter Notebook). Most importantly, based on examples from the real world, we will provide you with a workflow of best practices.
Domain-Driven Design (DDD) is a very useful set of tools for tackling complexity in software projects. However, many software developers have never heard of it, and most of those who have put too much emphasis on the technical implementation.
These slides explain what DDD is, why it matters, and what lies at its core.
Deep learning is an area of machine learning and one of the most talked-about trends in business and computer science today.
In this talk, I will give a review of deep learning, explaining what it is, what kinds of tasks it can do today, and what it probably could do in the future.
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 3 (Dr. Aparna Varde)
This is the third part of the tutorial on commonsense knowledge (CSK) at ACM WSDM 2021 by Simon Razniewski, Niket Tandon and Aparna Varde. It focuses on evaluation of the acquired knowledge, both intrinsic and extrinsic, along with highlights and an outlook, including a brief perspective on COVID and open issues for further research.
Abstract: Commonsense knowledge is a foundational cornerstone of artificial intelligence applications. Whereas information extraction and knowledge base construction for instance-oriented assertions, such as Brad Pitt’s birth date, or Angelina Jolie’s movie awards, has received much attention, commonsense knowledge on general concepts (politicians, bicycles, printers) and activities (eating pizza, fixing printers) has only been tackled recently. In this tutorial we present state-of-the-art methodologies towards the compilation and consolidation of such commonsense knowledge (CSK). We cover text-extraction-based, multi-modal and Transformer-based techniques, with special focus on the issues of web search and ranking, as of relevance to the WSDM community.
The Frontier of Deep Learning in 2020 and Beyond (NUS-ISS)
This talk will be a summary of the recent advances in deep learning research, current trends in the industry, and the opportunities that lie ahead.
We will discuss topics in research such as:
Transformers, GPT-3, BERT
Neural Architecture Search, Evolutionary Search
Distillation, self-learning
NeRF
Self-Attention
Also shifting industry trends such as:
The move to free data
Rising importance of 3D vision
Using synthetic data (Sim2Real)
Mobile vision & Federated Learning
Alexey Yashchenko and Yaroslav Voloshchuk, "False simplicity of front-end applications" (Fwdays)
It’s easy to underestimate a front-end project's complexity, which leads to shallow and thus incorrect implementation. Attempts to fix this problem result in uncontrolled complexity growth and undefined behavior in corner cases.
We'll discuss ways of revealing the inherent complexity of a problem and dealing with it both on theoretical and practical levels.
Best Practices in Recommender System ChallengesAlan Said
Recommender system challenges such as the Netflix Prize, the KDD Cup, etc. have contributed vastly to the development and adoption of recommender systems. Each year a number of challenges or contests are organized covering different aspects of recommendation. In this tutorial and panel, we present some of the factors involved in successfully organizing a challenge, whether for reasons purely related to research, for industrial challenges, or to widen the scope of recommender system applications.
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will present where data science projects go wrong, how you should think of data science projects, what constitutes success in data science and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it and even whether you need it then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the roles of engineers and data scientists
Understand your options for processes and software
As data science workloads grow, so does their need for infrastructure. But, is it fair to ask data scientists to also become infrastructure experts? If not the data scientists, then, who is responsible for spinning up and managing data science infrastructure? This talk will address the context in which ML infrastructure is emerging, walk through two examples of ML infrastructure tools for launching hyperparameter optimization jobs, and end with some thoughts for building better tools in the future.
Originally given as a talk at the PyData Ann Arbor meetup (https://www.meetup.com/PyData-Ann-Arbor/events/260380989/)
Adjusting primitives for graphs: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms such as PageRank; Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method Bias (2023240532)
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
2. About Me
Academic
• NTU CSIE BS/MS (2012/2013), Advisor: Prof. Hsuan-Tien Lin
• CMU MLD PhD (2014-), Advisors: Prof. Jeff Schneider and Prof. Barnabás Póczos
Competition
• KDD Cup 2011 Champions
• KDD Cup 2013 Champions, with Prof. Chih-Jen Lin, Prof. Hsuan-Tien Lin, Prof. Shou-De Lin, and many students
Working
• (2012 intern) (2015 intern)
3. What is Machine Learning?
• Learning: Existing Data → Machine (Algorithm) → Model
• Prediction: New Data → Model → Prediction
• Data: several length-d vectors
4. Data? Algorithm?
• In academia:
  • Assume we are given good-enough data (in d-dimensional form, of course)
  • Focus on designing better algorithms; sometimes complicated algorithms imply publications
• In practice:
  • Where is your good-enough data?
  • Or, how do you transform your data into a d-dimensional one?
5. From Zero to One: Create your features by your observations
7. More Fruits
• Method I: use the size of the picture, e.g., (640, 580)
• Method II: use the RGB average, e.g., (219, 156, 140), (243, 194, 113), (216, 156, 155)
• Many more powerful features have been developed in computer vision
8. Case Study (KDD Cup 2013)
• Determine whether a paper is written by a given author
• We are given the raw text of the papers and author records
• Data: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge
10. First Observation: Author Information
• Are these my (Chun-Liang Li) papers? (Easy! Check author names)
  1. Chun-Liang Li and Hsuan-Tien Lin. Condensed filter tree for cost-sensitive multi-label classification.
  2. Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification.
• Encode by name similarities (e.g., how many characters are the same)
• Are "Li, Chun-Liang" and "Chun-Liang Li" the same? Yes! Eastern and Western order
• How about "Li Chun-Liang"? (Calculate the similarity of the reverse order)
• Also take co-authors into account
• 29 features in total
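A few of these name-similarity features can be sketched with the Python standard library. The specific features and helper names below are illustrative, not the team's actual 29:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def name_features(a: str, b: str) -> dict:
    """Similarity under Western order, reversed (Eastern) order, and comma-stripped form."""
    reversed_b = " ".join(reversed(b.split()))
    return {
        "direct": sim(a, b),
        "reversed": sim(a, reversed_b),                # handles "Li Chun-Liang"
        "comma_stripped": sim(a, b.replace(",", "")),  # handles "Li, Chun-Liang"
    }

feats = name_features("Chun-Liang Li", "Li Chun-Liang")
print(feats)
```

Each such comparison becomes one numeric feature; a classifier then learns how to weigh them.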
11. Second Observation: Affiliations
• Are Dr. Chi-Jen Lu and Prof. Chih-Jen Lin the same?
  • Similar names: Chi-Jen Lu vs. Chih-Jen Lin
  • Shared co-author (me!)
• Take affiliations into account!
  • Academia Sinica vs. National Taiwan University
• 13 features in total
12. Last of KDD Cup 2013
• Many other features, including:
  • Can you live for more than 100 years? (At least I think I can't do research after 100 years)
  • More advanced: social network features
Summary: the 97 features designed by students won the competition
13. Furthermore
• If I can access the content, can I do better? Definitely
• Author: Robert Galbraith. Who is Robert Galbraith?
• “I thought it was by a very mature writer, and not a first-timer.” — Peter James
14. Writing Style?
• “I was testing things like word length, sentence length, paragraph length, frequency of particular words and the pattern of punctuation” — Peter Millican (University of Oxford)
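The quoted tests can be approximated in a few lines of standard-library Python. The specific statistics below are an illustrative guess at "word length, sentence length, ... pattern of punctuation":

```python
import re
from statistics import mean

def style_features(text: str) -> dict:
    """Crude stylometric features: word/sentence length and punctuation density."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punct = re.findall(r"[,;:\-]", text)
    return {
        "avg_word_len": mean(len(w) for w in words),
        "avg_sentence_len": len(words) / len(sentences),  # words per sentence
        "punct_per_word": len(punct) / len(words),
    }

sample = "I thought it was by a very mature writer, and not a first-timer."
print(style_features(sample))
```

Comparing such feature vectors across known texts by candidate authors is the core of stylometric attribution.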
17. Representation Learning
• Deep learning as learning hidden representations from raw data
• An active research topic in academia and industry
• Use the last layer to extract features (Krizhevsky et al., 2012)
(Check Prof. Lee's talk and go to the deep learning session later)
18. Use a Pre-trained Network
• You don't need to train a network by yourself
• Use an existing pre-trained network to extract features:
  • AlexNet
  • VGG
  • Word2Vec
Result: simply using deep-learning features achieves state-of-the-art performance in many applications
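The "extract features from the last hidden layer" idea can be illustrated with a toy two-layer network in NumPy. The random weights below are stand-ins for a genuinely pre-trained AlexNet/VGG, so only the mechanics are real:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: two fixed (already-trained) layers.
W1, b1 = rng.normal(size=(3072, 256)), np.zeros(256)
W2, b2 = rng.normal(size=(256, 10)), np.zeros(10)

def extract_features(x: np.ndarray) -> np.ndarray:
    """Return penultimate-layer activations instead of the final class scores."""
    h = np.maximum(0, x @ W1 + b1)  # hidden layer (ReLU)
    return h                        # skip the classifier layer (W2, b2)

images = rng.normal(size=(5, 3072))  # 5 fake CIFAR-10-sized inputs
feats = extract_features(images)
print(feats.shape)  # (5, 256)
```

The resulting 256-dimensional vectors can then be fed to a simple classifier such as a linear SVM.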
19. Successful Example
• The PASCAL Visual Object Classes Challenge
[Figure: mean average precision, 2005-2014. Slow progress on feature engineering (HoG features) and algorithms before deep learning; a large jump with the deep learning result (Girshick et al., 2014)]
21. The more, the better?
Practice: if we have 1,000,000 data points with 100,000 dimensions, how much memory do we need?
Ans: 10^6 × 10^5 × 8 = 8 × 10^11 (B) = 800 (GB)
Theory: without any assumption, you need O(1/ε^d) data to achieve error ε for d-dimensional data
Noisy features: is every feature useful? Redundancy?
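The answer on the slide is plain arithmetic, assuming each value is stored as an 8-byte float64:

```python
N, d = 1_000_000, 100_000
bytes_needed = N * d * 8           # 8 bytes per float64 value
print(bytes_needed)                # 800,000,000,000 bytes
print(bytes_needed / 1e9, "GB")    # 800.0 GB
```

Far more than fits in RAM on a typical machine, which motivates the feature selection and dimension reduction that follow.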
22. Feature Selection
• Select important features
• Reduce dimensions
• Explainable results
Commonly Used Tools
• LASSO (sparsity constraint)
• Random forests
• Many others
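One of the listed tools, LASSO, can be sketched in NumPy via coordinate descent: the L1 penalty drives the weights of irrelevant features to (near) zero, and the surviving features are "selected". This is a teaching sketch (fixed iteration count, no convergence check), not a production solver:

```python
import numpy as np

def lasso_cd(X, y, alpha=0.1, iters=200):
    """LASSO via coordinate descent.
    Objective: (1/2n) * ||y - Xw||^2 + alpha * ||w||_1
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: small correlations are zeroed out.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / z
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 features matter
w = lasso_cd(X, y)
selected = np.flatnonzero(np.abs(w) > 0.5)
print(selected)  # features 0 and 1 survive
```

Random-forest importances give a similar ranking without assuming linearity, which is why the KDD Cup team (next slide) used them.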
23. KDD Cup Again
• In KDD Cup 2013, we actually generated more than 200 features (some secrets you won't see in the paper)
• We used random forests to select only 97 features, since many features are unimportant and even harmful. But why?
24. Non-useful Features
• Duplicated features
  • Example I: country (Taiwan) vs. coordinates (121, 23.5)
  • Example II: date of birth (1990) vs. age (26)
• Noisy features
  • Noisy information (something wrong in your data)
  • Missing values (something missing in your data)
• What if we still have too many features?
25. Dimension Reduction
• Let's visualize the data (a perfect example)
• Non-perfect example in practice
Commonly Used Tools
• Principal Component Analysis (PCA)
[Figure: 2-D scatter plots before and after PCA; one dimension is enough]
Trade-off between information and space
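PCA itself is only a few lines of NumPy: center the data, take the SVD, and keep the top-k directions. A minimal sketch on synthetic data where, as in the figure, one dimension is indeed enough:

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k directions, shape (k, d)
    return Xc @ components.T, components     # (n x k) projection

rng = np.random.default_rng(0)
# Points spread along a single direction plus tiny noise.
t = rng.normal(size=300)
X = np.column_stack([t, 0.5 * t + 0.01 * rng.normal(size=300)])
Z, comps = pca(X, k=1)
print(Z.shape)  # (300, 1)
```

Here the first component captures nearly all the variance, so the 1-D projection loses almost no information.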
27. PCA — Intuition (cont.)
• We can use very few base faces to approximate (describe) the original faces
[Figure: the first nine eigenfaces]
http://comp435p.tk/
(Sirovich and Kirby, Low-dimensional procedure for the characterization of human faces)
28. PCA — Case Study
• CIFAR-10 image classification with raw pixels as features, using an approximated kernel SVM

Dimensions | Accuracy | Time
3072 (all) | 63.1% | ~2 hrs
100 (PCA) | 59.8% | 250 s

(Li and Póczos, Utilize Old Coordinates: Faster Doubly Stochastic Gradients for Kernel Methods, UAI 2016)
Trade-off between information, space and time
29. PCA in Practice
• Practical concerns:
  • Time complexity: O(Nd^2)
  • Space complexity: O(d^2)
• Remark: use fast approximations for large-scale problems (e.g., >100k dimensions)
  1. PCA with random projection (implemented in scikit-learn) (Halko et al., Finding Structure with Randomness, 2011)
  2. Stochastic algorithms (easy to implement from scratch) (Li et al., Rivalry of Two Families of Algorithms for Memory-Restricted Streaming PCA, AISTATS 2016)
Small problem? PCA takes <10 seconds on the CIFAR-10 dataset (d=3072) using 12 cores (E5-2620)
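The randomized approach (item 1 above) is itself short in NumPy: project onto a small random subspace, orthonormalize it, then run an exact SVD on the reduced matrix. A rough sketch of the Halko et al. range finder, with illustrative sizes:

```python
import numpy as np

def randomized_pca(X, k, oversample=10):
    """Approximate top-k principal directions via a randomized range finder."""
    rng = np.random.default_rng(0)
    Xc = X - X.mean(axis=0)
    # Random projection down to k + oversample dimensions, then orthonormalize.
    Omega = rng.normal(size=(Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ Omega)     # orthonormal basis of the sampled range
    # Exact SVD on the small projected matrix.
    B = Q.T @ Xc                        # (k + oversample) x d
    _, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k]                       # approximate top-k components

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 200))  # low-rank-ish data
V = randomized_pca(X, k=5)
print(V.shape)  # (5, 200)
```

The expensive SVD now runs on a (k + oversample) x d matrix instead of the full n x d one, which is where the speedup comes from.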
30. Conclusion
• Observe the data and encode them into meaningful features
• Deep learning is a powerful tool to use
• Reduce the number of features if necessary
  • Remove non-useful features
  • Mind computational concerns
Beginning: Existing Data → Machine (Algorithm)
Now: Existing Data → Features → (Simple) Algorithm