SlideShare a Scribd company logo
1 of 19
Data Science
Internship Presentation
PPT By: Ishan Ralli Admission No: 21SCSE1100071
Topics
• databases and data architectures
• databases in the real world
– scaling, data quality, distributed
• machine learning/data mining/statistics
• information retrieval
• Data Science is currently a popular interest of
employers
• our Industrial Affiliates Partners say there is high
demand for students trained in Data Science
– databases, warehousing, data architectures
– data analytics – statistics, machine learning
• Big Data – gigabytes/day or more
• Examples:
– Walmart, cable companies (ads linked to content,
viewer trends), airlines/Orbitz, HMOs, call centers,
Twitter (500M tweets/day), traffic surveillance
cameras, detecting fraud, identity theft...
• supports “Business Intelligence”
– quantitative decision-making and control
– finance, inventory, pricing/marketing, advertising
– need data for identifying risks, opportunities,
conducting “what-if” analyses
Data Architectures
• traditional databases
– tables, fields
– tuples = records or rows
• <yellowstone,WY,6000000 acres,geysers>
– key = field with unique values
• can be used as a reference from one table into another
• important for avoiding redundancy (normalization), which
risks inconsistency
– join – combining 2 tables using a key
– metadata – data about the data
• names of the fields, types (string, int, real, mpeg...)
• also things like source, date, size, completeness/sampling
Name HomeTown Grad school PhD teaches title
John Flaherty Houston, TX Rice 2005 CSCE 411 Design and Analysis of Algorithms
Susan
Jenkins Omaha, NE
Univ of
Michigan 2004 CSCE 121 Introduction to Computing in C++
Susan
Jenkins Omaha, NE
Univ of
Michigan 2004 CSCE 206 Programming in C
Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 314 Programming Languages
Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 206 Programming in C
Name teaches
John Flaherty CSCE 411
Susan
Jenkins CSCE 121
Susan
Jenkins CSCE 206
Bill Jones CSCE 314
Bill Jones CSCE 206
course title
CSCE 411 Design and Analysis of Algorithms
CSCE 121 Introduction to Computing in C++
CSCE 314 Programming Languages
CSCE 206 Programming in C
Name HomeTown Grad school PhD
John Flaherty Houston, TX Rice 2005
Susan
Jenkins Omaha, NE
Univ of
Michigan 2004
Bill Jones Pittsburgh, PA Carnegie Mellon 1999
Instructors:
TeachingAssignments:
Courses:
• SQL: Structured Query Language
>SELECT Name,HomeTown FROM Instructors WHERE PhD<2000;
Bill Jones Pittsburgh, PA
>SELECT Course,Title FROM Courses ORDER BY Course;
CSCE 121 Introduction to Computing in C++
CSCE 206 Programming in C
CSCE 314 Programming Languages
CSCE 411 Design and Analysis of Algorithms
can also compute sums, counts, means, etc.
example of JOIN: find all courses taught by someone from CMU:
>SELECT TeachingAssignments.Course
FROM Instructors JOIN TeachingAssignments
ON Instructors.Name=TeachingAssigmnents.Name
WHERE Instructor.PhD=“Carnegie Mellon”
CSCE 314
CSCE 206
because they were both taught by Bill Jones
• SQL servers
– centralized database, required for concurrent
access by multiple users
– ODBC: Open DataBase Connectivity – protocol
to connect to servers and do queries, updates
from languages like Java, C, Python
– Oracle, IBM DB2 - industrial strength SQL
databases
• Unstructured data
– raw text
• documents, digital libraries
• grep, substring indexing, regular expressions
– like find all instances of “[aA]g+ies” including “agggggies”
• Information Retrieval
• look for synonyms, similar words (like “car” and “auto”)
– tfIdf (term frequency/inverse doc frequency) – weighting for
important words
– LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to
‘canines’ because they are used similarly (both near ‘bark’
and ‘bite’)
• Natural Language parsing
– extracting requirements from jobs postings
• Unstructured data
– images, video (BLOBs=binary large objects)
• how to extract features? index them? search them?
• color histograms
• convolutions/transforms for pattern matching
• looking for ICBM missiles in aerial photos of Cuba
– streams
• sports ticker, radio, stock quotes...
– XML files
• with tags indicating field names
<course>
<name>CSCE 411</name>
<title>Design and Analysis of Algorithms</title>
</course>
• Object databases
CHEM 102
Intro to Chemistry
TR, 3:00-4:00
prereq: CHEM 101
Texas A&M
College Station, TX
Div 1A
53,299 students
Dr. Frank Smith
302 Miller St.
PhD, Cornell
13 years experience
ClassOfferedAt
TaughtBy
Instructor/Employee
In a database with millions of objects,
how do you efficiently do queries (i.e. follow pointers)
and retrieve information?
• Real-world issues with databases
– data warehousing:
• full database is stored in secure, off-site location
• slices, snapshots, or views are put on interactive
query servers for fast user access (“staging”)
– might be processed or summarized data
– databases are often distributed
• different parts of the data held in different sites
• some queries are local, others are “corporate-wide”
• how to do distributed queries?
• how to keep the databases synchronized?
• Distributed Object Programming
• OLAP: OnLine Analytical Processing
data warehouse:
every transaction
ever recorded
OLAP server
nightly updates
and summaries
http://technet.microsoft.com/en-us/
library/ms174587.aspx
– multi-dimensional tables of
aggregated sales in
different regions in recent
quarters, rather than “every
transaction”
– users can still look at
seasonal or geographic
trends in different product
categories
– project data onto 2D
spreadsheets, graphs
• data integrity
– missing values
• how to interpret? not available? 0? use the mean?
– duplicated values
• including partial matches (Jon Smith=John Smith?)
– inconsistency:
• multiple addresses for person
– out-of-date data
– inconsistent usage:
• does “destination” mean of first leg or whole flight?
– outliers:
• salaries that are negative, or in the trillions
– most database allow “integrity constraints” to be
defined that validate newly entered data
• “Data cleansing”
– filling in missing data (imputing values)
– detecting and removing outliers
– smoothing
• removing noise by averaging values together
– filtering, sampling
• keeping only selected representative values
– feature extraction
• e.g. in a photo database, which people are wearing
glasses? which have more than one person?
which are outdoors?
Data Mining/Data Analytics
• finding patterns in the data
• statistics
• machine learning (C
• Numerical data
– correlations
– multivariate regression
– fitting “models”
• predictive equations that fit the data
• from a real estate database of home sales, we get
• housing price = 100*SqFt - 6*DistanceToSchools +
0.1*AverageOfNeighborhood
– ANOVA for testing differences between
groups
– R is one of the most commonly used software
packages for doing statistical analysis
• can load a data table, calculate means and
correlations, fit distributions, estimate parameters,
test hypotheses, generate graphs and histograms
• clustering
– similar photos, documents, cases
– discovery of “structure” in the data
– example: accident database
• some clusters might be identified with “accidents
involving a tractor trailer” or “accidents at night”
– top-down vs. bottom-up clustering methods
– granularity: how many clusters?
• decision trees (classifiers)
– what factors, decisions, or treatments led to different
outcomes?
– recursive partitioning algorithms
– related methods
• “discriminant” analysis
– what factors lead to return of product?
• extract “association rules”
– boxers dogs tend to have congenital defects
– covers 5% of patients with 80% confidence
Veterinary database - dogs treated for disease
breed gender age drug sibsp outcome
terrier F 10 methotrexate 4.0 died
spaniel M 5 cytarabine 2.3 survived
doberman F 7 doxorubicin 0.1 died
Thank You
By: Ishan Ralli

More Related Content

Similar to Data Science Internship Presentation

Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Dios Kurniawan
 
Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basicNivaTripathy2
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Jeroen Rombouts
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Ch 2-introduction to dbms
Ch 2-introduction to dbmsCh 2-introduction to dbms
Ch 2-introduction to dbmsRupali Rana
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.pptbayhehua
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPlucenerevolution
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Perficient
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 

Similar to Data Science Internship Presentation (20)

Database Systems - Lecture Week 1
Database Systems - Lecture Week 1Database Systems - Lecture Week 1
Database Systems - Lecture Week 1
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basic
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Ch 2-introduction to dbms
Ch 2-introduction to dbmsCh 2-introduction to dbms
Ch 2-introduction to dbms
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
lecture1.ppt
lecture1.pptlecture1.ppt
lecture1.ppt
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLP
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Module I-L1.pptx
Module I-L1.pptxModule I-L1.pptx
Module I-L1.pptx
 
ch2 DS.pptx
ch2 DS.pptxch2 DS.pptx
ch2 DS.pptx
 

Recently uploaded

Dubai Call Girls Naija O525547819 Call Girls In Dubai Home Made
Dubai Call Girls Naija O525547819 Call Girls In Dubai Home MadeDubai Call Girls Naija O525547819 Call Girls In Dubai Home Made
Dubai Call Girls Naija O525547819 Call Girls In Dubai Home Madekojalkojal131
 
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...shivangimorya083
 
NPPE STUDY GUIDE - NOV2021_study_104040.pdf
NPPE STUDY GUIDE - NOV2021_study_104040.pdfNPPE STUDY GUIDE - NOV2021_study_104040.pdf
NPPE STUDY GUIDE - NOV2021_study_104040.pdfDivyeshPatel234692
 
Vip Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...
Vip  Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...Vip  Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...
Vip Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...shivangimorya083
 
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...Suhani Kapoor
 
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service Cuttack
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service CuttackVIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service Cuttack
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service CuttackSuhani Kapoor
 
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual serviceCALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual serviceanilsa9823
 
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士obuhobo
 
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...Suhani Kapoor
 
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
Call Girl in Low Price Delhi Punjabi Bagh  9711199012Call Girl in Low Price Delhi Punjabi Bagh  9711199012
Call Girl in Low Price Delhi Punjabi Bagh 9711199012sapnasaifi408
 
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girls
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call GirlsDelhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girls
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girlsshivangimorya083
 
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...Suhani Kapoor
 
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call GirlsSonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call GirlsNiya Khan
 
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Suhani Kapoor
 
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service Bhiwandi
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service BhiwandiVIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service Bhiwandi
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service BhiwandiSuhani Kapoor
 
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位obuhobo
 
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...shivangimorya083
 

Recently uploaded (20)

Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCeCall Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
Call Girls In Prashant Vihar꧁❤ 🔝 9953056974🔝❤꧂ Escort ServiCe
 
Dubai Call Girls Naija O525547819 Call Girls In Dubai Home Made
Dubai Call Girls Naija O525547819 Call Girls In Dubai Home MadeDubai Call Girls Naija O525547819 Call Girls In Dubai Home Made
Dubai Call Girls Naija O525547819 Call Girls In Dubai Home Made
 
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...
Full Masii Russian Call Girls In Dwarka (Delhi) 9711199012 💋✔💕😘We are availab...
 
NPPE STUDY GUIDE - NOV2021_study_104040.pdf
NPPE STUDY GUIDE - NOV2021_study_104040.pdfNPPE STUDY GUIDE - NOV2021_study_104040.pdf
NPPE STUDY GUIDE - NOV2021_study_104040.pdf
 
Vip Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...
Vip  Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...Vip  Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...
Vip Modals Call Girls (Delhi) Rohini 9711199171✔️ Full night Service for one...
 
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
VIP High Profile Call Girls Jamshedpur Aarushi 8250192130 Independent Escort ...
 
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service Cuttack
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service CuttackVIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service Cuttack
VIP Call Girl Cuttack Aashi 8250192130 Independent Escort Service Cuttack
 
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual serviceCALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual service
CALL ON ➥8923113531 🔝Call Girls Gosainganj Lucknow best sexual service
 
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士
内布拉斯加大学林肯分校毕业证录取书( 退学 )学位证书硕士
 
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...
VIP Call Girls Firozabad Aaradhya 8250192130 Independent Escort Service Firoz...
 
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
Call Girl in Low Price Delhi Punjabi Bagh  9711199012Call Girl in Low Price Delhi Punjabi Bagh  9711199012
Call Girl in Low Price Delhi Punjabi Bagh 9711199012
 
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girls
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call GirlsDelhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girls
Delhi Call Girls In Atta Market 9711199012 Book Your One night Stand Call Girls
 
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls South Ex 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
VIP Russian Call Girls Amravati Chhaya 8250192130 Independent Escort Service ...
 
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call GirlsSonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
Sonam +91-9537192988-Mind-blowing skills and techniques of Ahmedabad Call Girls
 
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
Low Rate Call Girls Gorakhpur Anika 8250192130 Independent Escort Service Gor...
 
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Greater Noida 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service Bhiwandi
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service BhiwandiVIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service Bhiwandi
VIP Call Girl Bhiwandi Aashi 8250192130 Independent Escort Service Bhiwandi
 
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位
加利福尼亚艺术学院毕业证文凭证书( 咨询 )证书双学位
 
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...
Delhi Call Girls Preet Vihar 9711199171 ☎✔👌✔ Whatsapp Body to body massage wi...
 

Data Science Internship Presentation

  • 1. Data Science Internship Presentation PPT By: Ishan Ralli Admission No: 21SCSE1100071
  • 2. Topics • databases and data architectures • databases in the real world – scaling, data quality, distributed • machine learning/data mining/statistics • information retrieval
  • 3. • Data Science is currently a popular interest of employers • our Industrial Affiliates Partners say there is high demand for students trained in Data Science – databases, warehousing, data architectures – data analytics – statistics, machine learning • Big Data – gigabytes/day or more • Examples: – Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft... • supports “Business Intelligence” – quantitative decision-making and control – finance, inventory, pricing/marketing, advertising – need data for identifying risks, opportunities, conducting “what-if” analyses
  • 4. Data Architectures • traditional databases – tables, fields – tuples = records or rows • <yellowstone,WY,6000000 acres,geysers> – key = field with unique values • can be used as a reference from one table into another • important for avoiding redundancy (normalization), which risks inconsistency – join – combining 2 tables using a key – metadata – data about the data • names of the fields, types (string, int, real, mpeg...) • also things like source, date, size, completeness/sampling
  • 5. Name HomeTown Grad school PhD teaches title John Flaherty Houston, TX Rice 2005 CSCE 411 Design and Analysis of Algorithms Susan Jenkins Omaha, NE Univ of Michigan 2004 CSCE 121 Introduction to Computing in C++ Susan Jenkins Omaha, NE Univ of Michigan 2004 CSCE 206 Programming in C Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 314 Programming Languages Bill Jones Pittsburgh, PA Carnegie Mellon 1999 CSCE 206 Programming in C Name teaches John Flaherty CSCE 411 Susan Jenkins CSCE 121 Susan Jenkins CSCE 206 Bill Jones CSCE 314 Bill Jones CSCE 206 course title CSCE 411 Design and Analysis of Algorithms CSCE 121 Introduction to Computing in C++ CSCE 314 Programming Languages CSCE 206 Programming in C Name HomeTown Grad school PhD John Flaherty Houston, TX Rice 2005 Susan Jenkins Omaha, NE Univ of Michigan 2004 Bill Jones Pittsburgh, PA Carnegie Mellon 1999 Instructors: TeachingAssignments: Courses:
  • 6. • SQL: Structured Query Language >SELECT Name,HomeTown FROM Instructors WHERE PhD<2000; Bill Jones Pittsburgh, PA >SELECT Course,Title FROM Courses ORDER BY Course; CSCE 121 Introduction to Computing in C++ CSCE 206 Programming in C CSCE 314 Programming Languages CSCE 411 Design and Analysis of Algorithms can also compute sums, counts, means, etc. example of JOIN: find all courses taught by someone from CMU: >SELECT TeachingAssignments.Course FROM Instructors JOIN TeachingAssignments ON Instructors.Name=TeachingAssigmnents.Name WHERE Instructor.PhD=“Carnegie Mellon” CSCE 314 CSCE 206 because they were both taught by Bill Jones
  • 7. • SQL servers – centralized database, required for concurrent access by multiple users – ODBC: Open DataBase Connectivity – protocol to connect to servers and do queries, updates from languages like Java, C, Python – Oracle, IBM DB2 - industrial strength SQL databases
  • 8. • Unstructured data – raw text • documents, digital libraries • grep, substring indexing, regular expressions – like find all instances of “[aA]g+ies” including “agggggies” • Information Retrieval • look for synonyms, similar words (like “car” and “auto”) – tfIdf (term frequency/inverse doc frequency) – weighting for important words – LSI (latent semantic indexing) – e.g. ‘dogs’ is similar to ‘canines’ because they are used similarly (both near ‘bark’ and ‘bite’) • Natural Language parsing – extracting requirements from jobs postings
  • 9. • Unstructured data – images, video (BLOBs=binary large objects) • how to extract features? index them? search them? • color histograms • convolutions/transforms for pattern matching • looking for ICBM missiles in aerial photos of Cuba – streams • sports ticker, radio, stock quotes... – XML files • with tags indicating field names <course> <name>CSCE 411</name> <title>Design and Analysis of Algorithms</title> </course>
  • 10. • Object databases CHEM 102 Intro to Chemistry TR, 3:00-4:00 prereq: CHEM 101 Texas A&M College Station, TX Div 1A 53,299 students Dr. Frank Smith 302 Miller St. PhD, Cornell 13 years experience ClassOfferedAt TaughtBy Instructor/Employee In a database with millions of objects, how do you efficiently do queries (i.e. follow pointers) and retrieve information?
  • 11. • Real-world issues with databases – data warehousing: • full database is stored in secure, off-site location • slices, snapshots, or views are put on interactive query servers for fast user access (“staging”) – might be processed or summarized data – databases are often distributed • different parts of the data held in different sites • some queries are local, others are “corporate-wide” • how to do distributed queries? • how to keep the databases synchronized? • Distributed Object Programming
  • 12. • OLAP: OnLine Analytical Processing data warehouse: every transaction ever recorded OLAP server nightly updates and summaries http://technet.microsoft.com/en-us/ library/ms174587.aspx – multi-dimensional tables of aggregated sales in different regions in recent quarters, rather than “every transaction” – users can still look at seasonal or geographic trends in different product categories – project data onto 2D spreadsheets, graphs
  • 13. • data integrity – missing values • how to interpret? not available? 0? use the mean? – duplicated values • including partial matches (Jon Smith=John Smith?) – inconsistency: • multiple addresses for person – out-of-date data – inconsistent usage: • does “destination” mean of first leg or whole flight? – outliers: • salaries that are negative, or in the trillions – most database allow “integrity constraints” to be defined that validate newly entered data
  • 14. • “Data cleansing” – filling in missing data (imputing values) – detecting and removing outliers – smoothing • removing noise by averaging values together – filtering, sampling • keeping only selected representative values – feature extraction • e.g. in a photo database, which people are wearing glasses? which have more than one person? which are outdoors?
  • 15. Data Mining/Data Analytics • finding patterns in the data • statistics • machine learning (C
  • 16. • Numerical data – correlations – multivariate regression – fitting “models” • predictive equations that fit the data • from a real estate database of home sales, we get • housing price = 100*SqFt - 6*DistanceToSchools + 0.1*AverageOfNeighborhood – ANOVA for testing differences between groups – R is one of the most commonly used software packages for doing statistical analysis • can load a data table, calculate means and correlations, fit distributions, estimate parameters, test hypotheses, generate graphs and histograms
  • 17. • clustering – similar photos, documents, cases – discovery of “structure” in the data – example: accident database • some clusters might be identified with “accidents involving a tractor trailer” or “accidents at night” – top-down vs. bottom-up clustering methods – granularity: how many clusters?
  • 18. • decision trees (classifiers) – what factors, decisions, or treatments led to different outcomes? – recursive partitioning algorithms – related methods • “discriminant” analysis – what factors lead to return of product? • extract “association rules” – boxers dogs tend to have congenital defects – covers 5% of patients with 80% confidence Veterinary database - dogs treated for disease breed gender age drug sibsp outcome terrier F 10 methotrexate 4.0 died spaniel M 5 cytarabine 2.3 survived doberman F 7 doxorubicin 0.1 died

Editor's Notes

  1. Created by-Ishaan Rally(21SCSE1100071)