SlideShare a Scribd company logo
1 of 44
Download to read offline
DATA MODELS
AND DATABASES
BILL HOWE, PHD
DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS
UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE
1
HOW DOWE STORE DATA?
2
HOW DOWE STORE DATA?
3
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein CobT (nicot
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN, SPT16 subu
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein CobT (nicot
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf family, trans
chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf family, trans
chr_24[160001-260000].65 3542
chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf family, trans
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydrolase
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and protein kina
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and protein kina
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
What is the data model?
WHAT IS A DATA MODEL?
4
Three components:
1.  Structures
2.  Constraints
3.  Operations
EXAMPLES
5
1.  Structures
•  rows and columns?
•  nodes and edges?
•  key-value pairs?
•  a sequence of bytes?
2.  Constraints
•  all rows must have the same number of columns
•  all values in one column must have the same type
•  a child cannot have two parents
3.  Operations
•  find the value of key x
•  find the rows where column “lastname” is “Jordan”
•  get the next N bytes
WHAT IS A DATABASE?
6
A collection of information
organized to afford efficient
retrieval
http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml
ANOTHERVIEW
7
“When people use the word database, fundamentally what
they are saying is that the data should be self-describing and
it should have a schema.That’s really all the word database
means.”
-- Jim Gray,“The Fourth Paradigm”
WHYWOULD IWANT A DATABASE?
8
1.  Sharing
Support concurrent access by multiple readers and writers
2.  Data Model Enforcement
Make sure all applications see clean, organized data
3.  Scale
Work with datasets too large to fit in memory
4.  Flexibility
Use the data in new, unanticipated ways
QUESTIONS TO CONSIDER
How is the data physically organized on disk?
What kinds of queries are efficiently supported by
this organization, and what kinds are not?
How hard is it update the data, or add new data?
What happens when I encounter new queries that I
didn’t anticipate? Do I reorganize the data? How hard
is that?
9
MOTIVATING
RELATIONAL
DATABASES
BILL HOWE, PHD
DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS
UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE
10
QUESTIONS TO CONSIDER
How is the data physically organized on disk?
What kinds of queries are efficiently supported by this
organization, and what kinds are not?
How hard is it update the data, or add new data?
What happens when I encounter new queries that I didn’t
anticipate? Do I reorganize the data? How hard is that?
11
HISTORICAL EXAMPLE:NETWORK DATABASES
12
Database:A collection of information
organized to afford efficient retrieval
Orderer
Customer
Screw
Nut
Washer
Contact Rep
HISTORICAL EXAMPLE:HIERARCHICAL DATABASES
13
Orderer
Customer
Screw
Nut
Nail
Contact Rep
Orderer Screw
Nut
Washer
master
detail
detail
Works great if you want to find all
orders for a particular customer.
But what if you want to find all
Customers who ordered a Nail?
ONEVIEW
14
“Relational Database Management Systems were invented
to let you use one set of data in multiple ways, including
ways that are unforeseen at the time the database is built
and the 1st applications are written.”
- Curt Monash, analyst/blogger
RELATIONAL DATABASES (CODD 1970)
Everything is a table
Every row in a table has the same columns
Relationships are implicit: no pointers
15
Course Student Id
CSE 344 223…
CSE 344 244…
CSE 514 255..
CSE 514 244…
Student Id Student
Name
223… Jane
244… Joe
255.. Susan
DATABASE PHILOSOPHY
16
God made the integers;
all else is the work of
man.
- Leopold Kronecker, 19th Century
Mathematician
slide src: Mike Franklin
Codd made relations;
all else is the work of man.
- Raghu Ramakrishnan, DB text book author
RELATIONAL DATABASE HISTORY
17
Pre-Relational: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of
the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.”
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
- Codd 1979
KEY IDEA:“PHYSICAL DATA INDEPENDENCE”
18
physical data independence
files and
pointers
relations
SELECT seq
FROM ncbi_sequences
WHERE seq =
‘GATTACGATATTA’;
f = fopen(‘table_file’);
fseek(10030440);
while (True) {
fread(&buf, 1, 8192, f);
if (buf == GATTACGATATTA) {
. . .
KEY IDEA:AN ALGEBRA OF TABLES
19
select
project
join join
Other operators: aggregate,union,difference,cross product
KEY IDEA:ALGEBRAIC OPTIMIZATION
20
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
EQUIVALENT LOGICAL EXPRESSIONS;
DIFFERENT COSTS
21
σp=knows(R) o=s (σp=holdsAccount(R) o=s σp=accountHomepage(R))
(σp=knows(R) o=s σp=holdsAccount(R)) o=s σp=accountHomepage(R)
σp1=knows & p2=holdsAccount & p3=accountHomepage (R x R x R)
right associative
left associative
cross product
SAME LOGICAL EXPRESSION,
DIFFERENT PHYSICAL ALGORITHMS
22
A = select(p=knows)
B = select(p=holdsAccount)
C = select(p=accountWebpage)
hA = hash(A)
AB = probe hA with B
hC = hash(C)
ABC = probe hC with AB
A = select(p=knows)
B = select(p=holdsAccount)
C = select(p=accountWebpage)
hB = hash(B)
AB = probe hB with A
hC = hash(C)
ABC = probe hC with AB
Which is faster?
ALGEBRAIC OPTIMIZATION MATTERS
23
BTC 2010 Dataset
3B quads
623 GB processed
PICASSO QUERY PLAN DIAGRAMS
24
Haritsa,VLDB 2010
KEY IDEA:DECLARATIVE LANGUAGES
25
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
join
select
scan scan
date = today()
o.item = i.item
Order oItem i
Find all orders from today,along with the items ordered
SQL IS THE“WHAT”NOT THE“HOW”
26
SELECT DISTINCT x.name, z.name
FROM Product x, Purchase y, Customer z
WHERE x.pid = y.pid and y.cid = y.cid and
x.price > 100 and z.city = ‘Seattle’
It’s clear WHAT we want, unclear HOW to get it
Product(pid, name, price)
Purchase(pid, cid, store)
Customer(cid, name, city)
RELATIONAL ALGEBRA
27
Product Purchase
pid=pid
price>100 and city=‘Seattle’
x.name,z.name
δ
cid=cid
Customer
Π
σ
T1(pid,name,price,pid,cid,store)
T2( . . . .)
T4(name,name)
Final answer
T3(. . . )
Execution order is
now clearly
specified
Product(pid, name, price)
Purchase(pid, cid, store)
Customer(cid, name, city)
But a lot of
physical details
are still left open!
ANOTHER EXAMPLE
28
left.object = right.subject
σ
predicate=knows
SELECT r1.subject
FROM R r1, R r2, R r3
WHERE r1.predicate = ‘knows’
AND r2.predicate = ‘holdsAccount’
AND r3.predicate = ‘accountHomepage’
AND r1.object = r2.subject
AND r2.object = r3.subject
σ
predicate=holdsAccount
left.object = right.subject
σ
predicate=
accountHomepage
R(subject, predicate,
object)
π
r1.subject
KEY IDEA:“LOGICAL DATA INDEPENDENCE”
29
physical data independence
logical data independence
files and
pointers
relations
views
SELECT *
FROM my_sequences
SELECT seq
FROM ncbi_sequences
WHERE seq =
‘GATTACGATATTA’;
f = fopen(‘table_file’);
fseek(10030440);
while (True) {
fread(&buf, 1, 8192, f);
if (buf == GATTACGATATTA) {
. . .
WHAT AREVIEWS?
A view is just a query with a name
We can use the view just like a real table
30
Why can we do
this?
Because we know that every
query returns a relation:
We say that the language is
“algebraically closed”
VIEW EXAMPLE
A view is a relation defined by a query
31
CREATE VIEW StorePrice AS
SELECT x.store, y.price
FROM Purchase x, Product y
WHERE x.pid = y.pid
This is like a new table
StorePrice(store,price)
Purchase(customer, product, store)
Product(pname, price)
StorePrice(store, price)
HOW TO USE AVIEW?
32
A "high end" store is a store that sold some product
over 1000. For each customer, find all the high end
stores that they visit. Return a set of (customer-
name, high-end-store) pairs.
SELECT DISTINCT z.name, u.store
FROM Customer z, Purchase u, StorePrice v
WHERE z.cid = u.customer
AND u.store = v.store
AND v.price > 1000
StorePrice(store, price)
Customer(cid, name, city)
Purchase(customer, product, store)
Product(pname, price)
KEY IDEA:INDEXES
33
Databases are especially, but not exclusively, effective
at“Needle in Haystack” problems:
•  Extracting small results from big datasets
•  Transparently provide “old style” scalability
•  Your query will always* finish, regardless of dataset size.
•  Indexes are easily built and automatically used when
appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = ‘GATTACGATATTA’;
*almost
IN-DATABASE
ANALYTICS
BILL HOWE, PHD
DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS
UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE
34
35
“There is no point in bringing data ... into the data warehouse
environment without integrating it. If the data arrives at the data
warehouse in an unintegrated state, it cannot be used to support a
corporate view of data. And a corporate view of data is one of the
essences of the architected environment.”
- Inmon, 2005,“Building the Data Warehouse”
Not a good fit for analytics!
MATRIX ADDITION,VECTOR REPRESENTATION
36
SELECT A.row_number, A.vector + B.vector
FROM A, B
WHERE A.row_number = B.row_number;
A(row_number,row float[])
B(row_number,row float[])
MATRIX MULTIPLICATION
37
MATRIX MULTIPLICATION,VECTOR
REPRESENTATION
38
SELECT 1, array_accum(row_number, vector*v) FROM A;
A(row_number,row float[])
MATRIX TRANSPOSE,VECTOR
REPRESENTATION
39
SELECT S.col_number, array_accum(A.row_number, A.vector[S.col_number])
FROM A, generate_series(1,3) AS S(col_number)
GROUP BY S.col_number;
generate_series is a table function
A(row_number,row float[])
MATRIX MULTIPLICATION,SPARSE
REPRESENTATION
40
SELECT A.row_number, B.column_number, SUM(A.value * B.value)
FROM A, B
WHERE A.column_number = B.row_number
GROUP BY A.row_number, B.column_number
A(row_number,column_number,value)
B(row_number,column_number,value)
ASIDE:USER-DEFINED FUNCTIONS AND
TYPES
41
Scalar functions
SELECT myfunc(r.a, r.b)…
Aggregate functions
SELECT concat(r.s) …
Table functions
SELECT … FROM tablefunc(a,b)
EXPERIMENT DESIGN
42
CREATE VIEW trials AS
SELECT d.trial_id, AVG(a.values) AS avg_value
FROM design d,T
WHERE d.row_id = T.row_id
GROUP BY d.trial_id
CREATE VIEW design AS
SELECT a.trial_id,
floor (100 * random()) AS row_id
FROM generate_series(1,10000) AS a (trial_id),
generate_series(1,3) AS b (subsample_id)
src: Cohen et al.,VLDB 2009
43
Prior to the implementation of this functionality within the
DBMS, one Greenplum customer was accustomed to
calculating the OLS by exporting data and importing the
data into R for calculation, a process that took several
hours to complete.They reported signicant performance
improvement when they moved to running the regression
within the DBMS. Most of the benet derived from running
the analysis in parallel close to the data with minimal
data movement.
src: Cohen et al.,VLDB 2009
SLIDES CAN BE FOUND AT:
TEACHINGDATASCIENCE.ORG

More Related Content

Similar to Bill howe 2_databases

Why Your Database Queries Stink -SeaGl.org November 11th, 2016
Why Your Database Queries Stink -SeaGl.org November 11th, 2016Why Your Database Queries Stink -SeaGl.org November 11th, 2016
Why Your Database Queries Stink -SeaGl.org November 11th, 2016Dave Stokes
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresSteven Johnson
 
Relational Database Design Bootcamp
Relational Database Design BootcampRelational Database Design Bootcamp
Relational Database Design BootcampMark Niebergall
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Vivian S. Zhang
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanPhilippe Rocca-Serra
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really DoingDave Stokes
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph MiningSabri Skhiri
 
Visualizations using Visualbox
Visualizations using VisualboxVisualizations using Visualbox
Visualizations using VisualboxAlvaro Graves
 
Kudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docxKudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docxDIPESH30
 
polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfRim Moussa
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_uploadProf. Wim Van Criekinge
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 

Similar to Bill howe 2_databases (20)

Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Why Your Database Queries Stink -SeaGl.org November 11th, 2016
Why Your Database Queries Stink -SeaGl.org November 11th, 2016Why Your Database Queries Stink -SeaGl.org November 11th, 2016
Why Your Database Queries Stink -SeaGl.org November 11th, 2016
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Relational Database Design Bootcamp
Relational Database Design BootcampRelational Database Design Bootcamp
Relational Database Design Bootcamp
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Kdd by Mr.Sameer Kumar Das
Kdd by Mr.Sameer Kumar DasKdd by Mr.Sameer Kumar Das
Kdd by Mr.Sameer Kumar Das
 
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...Data Science Academy Student Demo day--Michael blecher,the importance of clea...
Data Science Academy Student Demo day--Michael blecher,the importance of clea...
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, JapanISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
ISA-Tab Standards at Metabolomics Society Meeting, Tsuruoka 2014, Japan
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really Doing
 
Computer Scientists Retrieval - PDF Report
Computer Scientists Retrieval - PDF ReportComputer Scientists Retrieval - PDF Report
Computer Scientists Retrieval - PDF Report
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph Mining
 
Visualizations using Visualbox
Visualizations using VisualboxVisualizations using Visualbox
Visualizations using Visualbox
 
Kudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docxKudler has plenty of room to increase sales while controlling cost.docx
Kudler has plenty of room to increase sales while controlling cost.docx
 
polystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdfpolystore_NYC_inrae_sysinfo2021-1.pdf
polystore_NYC_inrae_sysinfo2021-1.pdf
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 

More from Mahammad Valiyev

More from Mahammad Valiyev (6)

Bill howe 8_graphs
Bill howe 8_graphsBill howe 8_graphs
Bill howe 8_graphs
 
Bill howe 7_machinelearning_2
Bill howe 7_machinelearning_2Bill howe 7_machinelearning_2
Bill howe 7_machinelearning_2
 
Bill howe 6_machinelearning_1
Bill howe 6_machinelearning_1Bill howe 6_machinelearning_1
Bill howe 6_machinelearning_1
 
Bill howe 5_statistics
Bill howe 5_statisticsBill howe 5_statistics
Bill howe 5_statistics
 
Bill howe 4_bigdatasystems
Bill howe 4_bigdatasystemsBill howe 4_bigdatasystems
Bill howe 4_bigdatasystems
 
Bill howe 3_mapreduce
Bill howe 3_mapreduceBill howe 3_mapreduce
Bill howe 3_mapreduce
 

Recently uploaded

Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 

Recently uploaded (20)

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 

Bill howe 2_databases

  • 1. DATA MODELS AND DATABASES BILL HOWE, PHD DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE 1
  • 2. HOW DOWE STORE DATA? 2
  • 3. HOW DOWE STORE DATA? 3 ###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1 chr_4[480001-580000].287 4500 chr_4[560001-660000].1 3556 chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein CobT (nicot chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN, SPT16 subu chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein CobT (nicot chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf family, trans chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf family, trans chr_24[160001-260000].65 3542 chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf family, trans chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydrolase chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and protein kina chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and protein kina chr_11[1-100000].70 2886 chr_11[80001-180000].100 1523 ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome What is the data model?
  • 4. WHAT IS A DATA MODEL? 4 Three components: 1.  Structures 2.  Constraints 3.  Operations
  • 5. EXAMPLES 5 1.  Structures •  rows and columns? •  nodes and edges? •  key-value pairs? •  a sequence of bytes? 2.  Constraints •  all rows must have the same number of columns •  all values in one column must have the same type •  a child cannot have two parents 3.  Operations •  find the value of key x •  find the rows where column “lastname” is “Jordan” •  get the next N bytes
  • 6. WHAT IS A DATABASE? 6 A collection of information organized to afford efficient retrieval http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml
  • 7. ANOTHERVIEW 7 “When people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema.That’s really all the word database means.” -- Jim Gray,“The Fourth Paradigm”
  • 8. WHYWOULD IWANT A DATABASE? 8 1.  Sharing Support concurrent access by multiple readers and writers 2.  Data Model Enforcement Make sure all applications see clean, organized data 3.  Scale Work with datasets too large to fit in memory 4.  Flexibility Use the data in new, unanticipated ways
  • 9. QUESTIONS TO CONSIDER How is the data physically organized on disk? What kinds of queries are efficiently supported by this organization, and what kinds are not? How hard is it update the data, or add new data? What happens when I encounter new queries that I didn’t anticipate? Do I reorganize the data? How hard is that? 9
  • 10. MOTIVATING RELATIONAL DATABASES BILL HOWE, PHD DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE 10
  • 11. QUESTIONS TO CONSIDER How is the data physically organized on disk? What kinds of queries are efficiently supported by this organization, and what kinds are not? How hard is it update the data, or add new data? What happens when I encounter new queries that I didn’t anticipate? Do I reorganize the data? How hard is that? 11
  • 12. HISTORICAL EXAMPLE:NETWORK DATABASES 12 Database:A collection of information organized to afford efficient retrieval Orderer Customer Screw Nut Washer Contact Rep
  • 13. HISTORICAL EXAMPLE:HIERARCHICAL DATABASES 13 Orderer Customer Screw Nut Nail Contact Rep Orderer Screw Nut Washer master detail detail Works great if you want to find all orders for a particular customer. But what if you want to find all Customers who ordered a Nail?
  • 14. ONEVIEW 14 “Relational Database Management Systems were invented to let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” - Curt Monash, analyst/blogger
  • 15. RELATIONAL DATABASES (CODD 1970) Everything is a table Every row in a table has the same columns Relationships are implicit: no pointers 15 Course Student Id CSE 344 223… CSE 344 244… CSE 514 255.. CSE 514 244… Student Id Student Name 223… Jane 244… Joe 255.. Susan
  • 16. DATABASE PHILOSOPHY 16 God made the integers; all else is the work of man. - Leopold Kronecker, 19th Century Mathematician slide src: Mike Franklin Codd made relations; all else is the work of man. - Raghu Ramakrishnan, DB text book author
  • 17. RELATIONAL DATABASE HISTORY 17 Pre-Relational: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation - Codd 1979
  • 18. KEY IDEA:“PHYSICAL DATA INDEPENDENCE” 18 physical data independence files and pointers relations SELECT seq FROM ncbi_sequences WHERE seq = ‘GATTACGATATTA’; f = fopen(‘table_file’); fseek(10030440); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
  • 19. KEY IDEA:AN ALGEBRA OF TABLES 19 select project join join Other operators: aggregate,union,difference,cross product
  • 20. KEY IDEA:ALGEBRAIC OPTIMIZATION 20 N = ((z*2)+((z*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x Apply rules 1, 3, 4, 2: N = (2+3)*z two operations instead of five, no division operator Same idea works with the Relational Algebra!
  • 21. EQUIVALENT LOGICAL EXPRESSIONS; DIFFERENT COSTS 21 σp=knows(R) o=s (σp=holdsAccount(R) o=s σp=accountHomepage(R)) (σp=knows(R) o=s σp=holdsAccount(R)) o=s σp=accountHomepage(R) σp1=knows & p2=holdsAccount & p3=accountHomepage (R x R x R) right associative left associative cross product
  • 22. SAME LOGICAL EXPRESSION, DIFFERENT PHYSICAL ALGORITHMS 22 A = select(p=knows) B = select(p=holdsAccount) C = select(p=accountWebpage) hA = hash(A) AB = probe hA with B hC = hash(C) ABC = probe hC with AB A = select(p=knows) B = select(p=holdsAccount) C = select(p=accountWebpage) hB = hash(B) AB = probe hB with A hC = hash(C) ABC = probe hC with AB Which is faster?
  • 23. ALGEBRAIC OPTIMIZATION MATTERS 23 BTC 2010 Dataset 3B quads 623 GB processed
  • 24. PICASSO QUERY PLAN DIAGRAMS 24 Haritsa,VLDB 2010
  • 25. KEY IDEA:DECLARATIVE LANGUAGES 25 SELECT * FROM Order o, Item i WHERE o.item = i.item AND o.date = today() join select scan scan date = today() o.item = i.item Order oItem i Find all orders from today,along with the items ordered
  • 26. SQL IS THE“WHAT”NOT THE“HOW” 26 SELECT DISTINCT x.name, z.name FROM Product x, Purchase y, Customer z WHERE x.pid = y.pid and y.cid = y.cid and x.price > 100 and z.city = ‘Seattle’ It’s clear WHAT we want, unclear HOW to get it Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city)
  • 27. RELATIONAL ALGEBRA 27 Product Purchase pid=pid price>100 and city=‘Seattle’ x.name,z.name δ cid=cid Customer Π σ T1(pid,name,price,pid,cid,store) T2( . . . .) T4(name,name) Final answer T3(. . . ) Execution order is now clearly specified Product(pid, name, price) Purchase(pid, cid, store) Customer(cid, name, city) But a lot of physical details are still left open!
  • 28. ANOTHER EXAMPLE 28 left.object = right.subject σ predicate=knows SELECT r1.subject FROM R r1, R r2, R r3 WHERE r1.predicate = ‘knows’ AND r2.predicate = ‘holdsAccount’ AND r3.predicate = ‘accountHomepage’ AND r1.object = r2.subject AND r2.object = r3.subject σ predicate=holdsAccount left.object = right.subject σ predicate= accountHomepage R(subject, predicate, object) π r1.subject
  • 29. KEY IDEA:“LOGICAL DATA INDEPENDENCE” 29 physical data independence logical data independence files and pointers relations views SELECT * FROM my_sequences SELECT seq FROM ncbi_sequences WHERE seq = ‘GATTACGATATTA’; f = fopen(‘table_file’); fseek(10030440); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
  • 30. WHAT AREVIEWS? A view is just a query with a name We can use the view just like a real table 30 Why can we do this? Because we know that every query returns a relation: We say that the language is “algebraically closed”
  • 31. VIEW EXAMPLE A view is a relation defined by a query 31 CREATE VIEW StorePrice AS SELECT x.store, y.price FROM Purchase x, Product y WHERE x.pid = y.pid This is like a new table StorePrice(store,price) Purchase(customer, product, store) Product(pname, price) StorePrice(store, price)
  • 32. HOW TO USE AVIEW? 32 A "high end" store is a store that sold some product over 1000. For each customer, find all the high end stores that they visit. Return a set of (customer- name, high-end-store) pairs. SELECT DISTINCT z.name, u.store FROM Customer z, Purchase u, StorePrice v WHERE z.cid = u.customer AND u.store = v.store AND v.price > 1000 StorePrice(store, price) Customer(cid, name, city) Purchase(customer, product, store) Product(pname, price)
  • 33. KEY IDEA:INDEXES 33 Databases are especially, but not exclusively, effective at“Needle in Haystack” problems: •  Extracting small results from big datasets •  Transparently provide “old style” scalability •  Your query will always* finish, regardless of dataset size. •  Indexes are easily built and automatically used when appropriate CREATE INDEX seq_idx ON sequence(seq); SELECT seq FROM sequence WHERE seq = ‘GATTACGATATTA’; *almost
  • 34. IN-DATABASE ANALYTICS BILL HOWE, PHD DIRECTOR OF RESEARCH, SCALABLE DATA ANALYTICS UNIVERSITY OF WASHINGTON ESCIENCE INSTITUTE 34
  • 35. 35 “There is no point in bringing data ... into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment.” - Inmon, 2005,“Building the Data Warehouse” Not a good fit for analytics!
  • 36. MATRIX ADDITION,VECTOR REPRESENTATION 36 SELECT A.row_number, A.vector + B.vector FROM A, B WHERE A.row_number = B.row_number; A(row_number,row float[]) B(row_number,row float[])
  • 38. MATRIX MULTIPLICATION,VECTOR REPRESENTATION 38 SELECT 1, array_accum(row_number, vector*v) FROM A; A(row_number,row float[])
  • 39. MATRIX TRANSPOSE,VECTOR REPRESENTATION 39 SELECT S.col_number, array_accum(A.row_number, A.vector[S.col_number]) FROM A, generate_series(1,3) AS S(col_number) GROUP BY S.col_number; generate_series is a table function A(row_number,row float[])
  • 40. MATRIX MULTIPLICATION,SPARSE REPRESENTATION 40 SELECT A.row_number, B.column_number, SUM(A.value * B.value) FROM A, B WHERE A.column_number = B.row_number GROUP BY A.row_number, B.column_number A(row_number,column_number,value) B(row_number,column_number,value)
  • 41. ASIDE:USER-DEFINED FUNCTIONS AND TYPES 41 Scalar functions SELECT myfunc(r.a, r.b)… Aggregate functions SELECT concat(r.s) … Table functions SELECT … FROM tablefunc(a,b)
  • 42. EXPERIMENT DESIGN 42 CREATE VIEW trials AS SELECT d.trial_id, AVG(a.values) AS avg_value FROM design d,T WHERE d.row_id = T.row_id GROUP BY d.trial_id CREATE VIEW design AS SELECT a.trial_id, floor (100 * random()) AS row_id FROM generate_series(1,10000) AS a (trial_id), generate_series(1,3) AS b (subsample_id) src: Cohen et al.,VLDB 2009
  • 43. 43 Prior to the implementation of this functionality within the DBMS, one Greenplum customer was accustomed to calculating the OLS by exporting data and importing the data into R for calculation, a process that took several hours to complete.They reported signicant performance improvement when they moved to running the regression within the DBMS. Most of the benet derived from running the analysis in parallel close to the data with minimal data movement. src: Cohen et al.,VLDB 2009
  • 44. SLIDES CAN BE FOUND AT: TEACHINGDATASCIENCE.ORG