Introduction to Data Science
Frank Kienle
High level introduction to Data Bases
Big Data Landscape
06.09.17 Frank Kienle p. 2
Overview of data sources
•  http://www.knuggets.com/datasets/index.html
Machine learning data
•  UCI Machine Learning Repository: archive.ics.uci.edu
Data Shop: the world’s largest repository of learning interaction data
•  https://pslcdatashop.web.cmu.edu
Getting data is not the problem
- A very large variety of data sources
•  Formally, a "database" refers to a set of related data and the way it is organized.
•  A database manages data efficiently and allows users to perform multiple tasks
with ease. Efficient access to the data is usually provided by a "database
management system" (DBMS).
•  A database management system stores, organizes and manages a large amount
of information within a single software application.
•  Use of this system increases efficiency of business operations and reduces
overall costs.
•  Different database systems exist, which are designed with respect to:
   •  the data to be stored in the database
   •  the relationships between the different data elements, i.e. dependencies within the data which can
      be modeled by mathematical relations
   •  the logical structure imposed on the data on the basis of these relationships. The goal is to arrange
      the data into a logical structure which can then be mapped onto the storage objects
Database
Databases overview
Scale up: using more and more main memory
Scale out: using more and more computers
Definition (m: complexity order):
Scalability: for N data items, an algorithm scales with N^m
(e.g. polynomial complexity).
Parallelize it (k nodes): the algorithm scales with N^m/k.
Goal: find algorithms with complexity N log(N), which relates e.g. to trees (one
touch).
Scalability in big data
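As a rough illustration of the scaling definition on this slide, the sketch below compares operation counts for N^m, N^m/k and N log(N). The function names and the sample values are illustrative assumptions, and constant factors as well as communication overhead between nodes are ignored.

```python
import math


def polynomial_cost(n: int, m: int) -> float:
    """Operation count of an algorithm that scales with N^m."""
    return float(n) ** m


def parallel_cost(n: int, m: int, k: int) -> float:
    """Idealized cost on k nodes: N^m / k (no communication overhead assumed)."""
    return polynomial_cost(n, m) / k


def n_log_n_cost(n: int) -> float:
    """Operation count of an algorithm that scales with N log2(N)."""
    return n * math.log2(n)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of data items
    print(f"N^2        : {polynomial_cost(n, 2):.3e}")
    print(f"N^2 / 1000 : {parallel_cost(n, 2, 1000):.3e}")
    print(f"N log2(N)  : {n_log_n_cost(n):.3e}")
```

Under these assumptions, even 1000 nodes do not bring a quadratic algorithm close to an N log(N) algorithm on a single node, which is why the slide names N log(N) as the goal.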
CAP theorem
C: consistency
(do all applications see the same data?)
Any data written to the database must be valid
according to all defined rules
A: availability
(can I interact with the system
in the presence of failures?)
P: partition tolerance
If two sections of your system cannot talk to each
other, can they make forward progress on their own?
-  If not, you sacrifice availability
-  If so, you might have to sacrifice consistency
Example systems: Dynamo, Riak, Voldemort, Cassandra, CouchDB, Bigtable, HBase, Hypertable, Megastore, Spanner, Accumulo, RDBMS
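The trade-off under a partition can be sketched with a toy model of two replicas. This is not a real replication protocol: the class names, the `mode` flag and the synchronous copy are assumptions made only to show the two choices described above.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}


class TinyCluster:
    """Two replicas; `mode` is 'CP' or 'AP' (toy model, hypothetical API)."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica: Replica, key: str, value: str) -> bool:
        """Return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            replica.data[key] = value
            other.data[key] = value  # both sides can talk: replicate at once
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent, but not available
        replica.data[key] = value  # accept locally: available, but divergent
        return True

    def consistent(self) -> bool:
        return self.a.data == self.b.data


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        cluster = TinyCluster(mode)
        cluster.write(cluster.a, "price", "10")
        cluster.partitioned = True
        accepted = cluster.write(cluster.b, "price", "12")
        print(mode, "accepted:", accepted, "consistent:", cluster.consistent())
```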
Relational Data Bases
Relational databases, key idea:
•  Storage and retrieval of large quantities of related data.
•  When creating a database you should think about which tables are needed and
what relationships exist between the data in your tables.
•  Relational algebra
•  Physical/logical data independence
Think about the design in advance
Relational Data Bases
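To make "think about tables and relationships in advance" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sensor` and `reading` tables and their columns are hypothetical and not taken from the lecture; the point is only that a declared relationship lets the database reject inconsistent rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE sensor (
        sensor_id INTEGER PRIMARY KEY,
        location  TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE reading (
        reading_id  INTEGER PRIMARY KEY,
        sensor_id   INTEGER NOT NULL REFERENCES sensor(sensor_id),
        timestamp_s INTEGER NOT NULL,
        value       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO sensor (sensor_id, location) VALUES (1, 'room')")
conn.execute("INSERT INTO reading (sensor_id, timestamp_s, value) VALUES (1, 1, 30.0)")


def insert_orphan_reading(connection: sqlite3.Connection) -> bool:
    """Try to store a reading for a sensor that does not exist."""
    try:
        connection.execute(
            "INSERT INTO reading (sensor_id, timestamp_s, value) VALUES (99, 2, 25.0)"
        )
    except sqlite3.IntegrityError:
        return False  # rejected by the foreign-key relationship
    return True


orphan_accepted = insert_orphan_reading(conn)
print("orphan reading accepted:", orphan_accepted)
```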
A database is created for the storage and retrieval of data.
We want to be able to INSERT data into the database and we want to be able to
SELECT data from the database.
A database query language was invented for these tasks: the Structured
Query Language (SQL).
Structured query language (SQL)
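A minimal sketch of the two operations named above, INSERT and SELECT, using Python's built-in `sqlite3` module. The `measurement` table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (timestamp_s INTEGER, value REAL)")

# INSERT: store data in the database.
rows = [(1, 30.0), (2, 25.0), (5, 12.0)]
conn.executemany("INSERT INTO measurement (timestamp_s, value) VALUES (?, ?)", rows)
conn.commit()

# SELECT: retrieve the rows that match a condition.
selected = conn.execute(
    "SELECT timestamp_s, value FROM measurement WHERE value > ? ORDER BY timestamp_s",
    (20.0,),
).fetchall()
print(selected)
```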
When you can do JOINs, it is good for analytics
When a database does not provide joins, the work is left entirely to the users
(the work has to be done on the client side)
Fundamentals of data exploration (joins)
Outer Relational Join (on time stamp)

Time stamp [s]   Value Room [Wa2]
1                30
2                25
5                12

Time stamp [s]   Value Home [Wa2]
1                100
2                78
3                99
4                70

Time stamp [s]   Value Room [Wa2]   Value Home [Wa2]
1                30                 100
2                25                 78
3                NaN                99
4                NaN                70
5                12                 NaN
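The outer join above can be reproduced on the client side with plain Python, which is roughly the work a user has to do when the data store offers no joins. The function name and the dictionary representation are assumptions for this sketch; the numbers are the ones from the slide.

```python
room = {1: 30, 2: 25, 5: 12}
home = {1: 100, 2: 78, 3: 99, 4: 70}


def outer_join(left: dict, right: dict) -> list:
    """Full outer join of two {time stamp: value} mappings on the time stamp."""
    nan = float("nan")  # marks a value missing on one side
    return [
        (ts, left.get(ts, nan), right.get(ts, nan))
        for ts in sorted(left.keys() | right.keys())
    ]


joined = outer_join(room, home)
for row in joined:
    print(row)
```

Every time stamp that occurs in either input appears once in the result, and the side without a measurement is filled with NaN.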
Left Join (on time stamp)

Time stamp [s]   Value Room [Wa2]
1                30
2                25
5                12

Time stamp [s]   Value Home [Wa2]
1                100
2                78
3                99
4                70

Time stamp [s]   Value Room [Wa2]   Value Home [Wa2]
1                30                 100
2                25                 78
5                12                 NaN
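When the database supports joins, the same left join is a single query. The sketch below uses Python's built-in `sqlite3` module with the values from the slide; the table and column names are assumptions. SQL reports the missing value as NULL (`None` in Python) where the slide shows NaN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE room (timestamp_s INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE home (timestamp_s INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO room VALUES (?, ?)", [(1, 30), (2, 25), (5, 12)])
conn.executemany("INSERT INTO home VALUES (?, ?)", [(1, 100), (2, 78), (3, 99), (4, 70)])

# Keep every row of the left table (room); attach home values where they exist.
left_joined = conn.execute("""
    SELECT room.timestamp_s, room.value, home.value
    FROM room
    LEFT JOIN home ON home.timestamp_s = room.timestamp_s
    ORDER BY room.timestamp_s
""").fetchall()
print(left_joined)
```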
Storing data efficiently is all about the application
schema-less vs. schema
write-centric vs. read-centric
transactional vs. analytical
batch vs. stream
Key-Value object
•  A set of key-value pairs
Extensible record (XML or JSON)
•  Families of attributes have a schema
•  New attributes may be added
•  Many predictive analytics tasks will require some kind of record
•  Many REST APIs deliver JSON (or YAML, XML) structures
•  Example: Twitter feeds
Key-value stores (a document store might be a subset)
•  No schema, no exposed nesting
•  Often raw data (scalable to petabytes)
•  Simple analytics tasks on top
Different data structure
key      value
45777    Frank Kienle, Germany
Ux_78    Please learn
321-87   Random data
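The difference between an opaque key-value entry and an extensible record can be sketched with Python's built-in `dict` and `json` modules. All keys, values and attribute names below are invented for illustration.

```python
import json

# Key-value store: values are opaque and addressed only by their key.
kv_store = {}
kv_store["user:1"] = b"\x00\x01raw-bytes"
kv_store["user:2"] = "any string"

# Extensible record: attributes follow a loose schema and may be nested.
record = {"id": 1, "user": {"name": "example", "country": "DE"}, "text": "hello"}
record["lang"] = "en"  # a new attribute is added without any schema change

# Records like this are what a REST API would typically deliver as JSON.
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)
print(encoded)
```

The store knows nothing about the inside of `kv_store["user:1"]`, while the nested fields of the record stay visible and can be queried after decoding.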
JSON Example
Example JSON Twitter feed
The ability to replicate and partition data over many servers
•  Sharding: horizontal partitioning of the data set
No query language: a simple API is defined
Ability to scale operations over many servers
•  Throughput increase
•  Due to the missing query (language) layer, each operation has to be designed against the API
Operations are often restricted by data locality
New features can be added dynamically to data records (no fixed schema)
Consistency model is often weak (no modeling of transactions)
(Typical) NoSQL database features
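A minimal sketch of sharding behind a simple put/get API, assuming hash-based placement of keys. The `ShardedStore` class is hypothetical, and each "server" is just an in-process dictionary; real systems add replication, rebalancing and failure handling.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that hash-partitions keys over several shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for one server holding a horizontal slice of the data.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value: dict) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedStore(num_shards=4)
store.put("user:1", {"name": "a"})
store.put("user:2", {"name": "b", "lang": "en"})  # extra field, no fixed schema
print(store.get("user:2"))
```

There is no query language here: every operation goes through `put` and `get`, and anything more complex has to be built by the caller on top of that API.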
In-memory database
•  primarily relies on main memory for computer data storage
•  main purpose is faster analytics on data
•  relational or unstructured data structure
•  memory optimized data structures
Main memory database system (MMDB)
Advantage Column-oriented:
•  Reading efficiency: more efficient when an aggregate needs to be computed over
many rows but only for a notably smaller subset of all columns of data
SELECT col_1, col_2 FROM my_table WHERE col_2 > 5 AND col_2 < 45;
•  Writing efficiency: more efficient when new values of a column are supplied for
all rows at once
Advantage row-oriented:
•  Reading efficiency: more efficient when many columns of a single row are
required at the same time, and when row-size is relatively small
•  Writing efficiency: more efficient when writing a new row if all of the row data is
supplied at the same time, as the entire row can be written with a single disk
seek.
Row vs. Column data stores
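The two layouts can be mimicked in memory to show which access pattern each one favors. This is only a sketch with invented column names and values; real stores differ in how data is laid out on disk, which plain Python lists do not capture.

```python
# Row-oriented: one record per row, all columns of a row stored together.
rows = [
    {"col_1": 1, "col_2": 10, "col_3": "a"},
    {"col_1": 2, "col_2": 50, "col_3": "b"},
    {"col_1": 3, "col_2": 30, "col_3": "c"},
]

# Column-oriented: one contiguous list per column.
columns = {
    "col_1": [1, 2, 3],
    "col_2": [10, 50, 30],
    "col_3": ["a", "b", "c"],
}

# Aggregate over one column: the column layout touches only col_2 ...
total_from_columns = sum(columns["col_2"])
# ... while the row layout has to visit every full record.
total_from_rows = sum(row["col_2"] for row in rows)

# Fetch one complete record: direct for rows, one lookup per column otherwise.
record_from_rows = rows[1]
record_from_columns = {name: values[1] for name, values in columns.items()}

print(total_from_columns, total_from_rows, record_from_columns)
```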
Processing types
OLTP: On-line Transaction Processing
e.g. Business transactions
(insert, update, delete)
OLAP: On-line Analytical Processing
e.g. complex analytics
(aggregating of historical data)
For data analytics, a column-oriented
in-memory database is a must-have
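The two processing types can be contrasted on one small table with Python's built-in `sqlite3` module. The `orders` table and its contents are assumptions for this sketch, and SQLite is a row-oriented engine used here only to show the shape of the two workloads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style work: small transactions that insert, update or delete single rows.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("INSERT INTO orders VALUES (1, 'alice', 20.0)")
    conn.execute("INSERT INTO orders VALUES (2, 'bob', 35.0)")
    conn.execute("INSERT INTO orders VALUES (3, 'alice', 15.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 40.0 WHERE order_id = 2")
    conn.execute("DELETE FROM orders WHERE order_id = 3")

# OLAP-style work: an aggregate over the accumulated (historical) data.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```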
Spanner idea: planet-scale database system
... we believe it is better to have application programmers deal with performance
problems due to overuse of transactions as bottlenecks arise, rather than always coding
around the lack of transactions ...
Loose consistency for predictive analytics is horrible
Loose consistency is a no-go for prescriptive analytics (dynamic pricing)
Systems should always be designed for usability
Many trends in databases are going back to data
consistency
Data Bases - Introduction to data science

  • 1. Introduction to Data Science Frank Kienle High level introduction to Data Bases
  • 2. Big Data Landscape 06.09.17 Frank Kienle p. 2
  • 3. Overview of data sources •  http://www.knuggets.com/datasets/index.html Machine learning data •  UCI Machine Learning Repository: archive.ics.uci.edu Data Shop: the world’s largest repository of learning interaction data •  https://pslcdatashop.web.cmu.edu Getting Data is not the problem - Very large flavor of Data Sources 06.09.17 Frank Kienle 3
  • 4. •  Formally, a "database" refers to a set of related data and the way it is organized. •  A database manages data efficiently and allows users to perform multiple tasks with ease. The efficient access to the data is usually provided by a "database management system" (DBMS) •  A database management system stores, organizes and manages a large amount of information within a single software application. •  Use of this system increases efficiency of business operations and reduces overall costs. •  Different database systems exist which are designed with respect to: •  the data to be stored in the database •  the relationships between the different data elements. Dependencies within the data which can be modeled by mathematical relations •  the logical structure upon the data on the basis of these relationships. The goal is to arrange the data into a logical structure which can then be mapped into the storage objects Database 06.09.17 Frank Kienle p. 4
  • 6. Scale up: using more and more main memory Scale out: using more and more computers Definition (m complexity order): Scalability for N data items an algorithms scales with Nm. E.g polynomial complexity Parallelize it (k nodes): The algorithm scales with Nm/k Goal find algorithms with complexity: N log(N) which relates e.g. with trees (one touch) Scalability in big data 06.09.17 6Frank Kienle
  • 7. CAP theorem 06.09.17 Frank Kienle p. 7 C: consistency (do all applications see all the same data) Any data written to the database must be valid According to all defined rules A: availability (can I interact with the system In the presence of failures) P: partitioning If two sections of your system cannot talk to each Other, can they make forward progress on their own -  If not you sacrifice availability -  If so, you might have to sacrifice consistency Dynamo Riak Voldemort Cassandra CouchDB Bigtable Hbase Hypertable Megastore Spanner Accumulo RDBMS
  • 9. Relational Data Bases
       Key idea:
       §  Storage and retrieval of large quantities of related data.
       §  When creating a database, you should think about which tables are needed and what relationships exist between the data in your tables.
       §  Relational algebra
       §  Physical/logical data independence
       Think about the design in advance.
  • 10. Structured query language (SQL)
        A database is created for the storage and retrieval of data. We want to be able to INSERT data into the database and to SELECT data from it. A database query language was invented for these tasks: the Structured Query Language (SQL).
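As a minimal sketch of the two operations named on this slide, the following uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and values are assumptions chosen for illustration.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (time_stamp INTEGER, value REAL)")

# INSERT: store rows in the database.
conn.executemany(
    "INSERT INTO measurements (time_stamp, value) VALUES (?, ?)",
    [(1, 30.0), (2, 25.0), (5, 12.0)],
)

# SELECT: retrieve only the rows that match a condition.
rows = conn.execute(
    "SELECT time_stamp, value FROM measurements "
    "WHERE value > ? ORDER BY time_stamp",
    (20.0,),
).fetchall()
print(rows)
conn.close()
```

The `?` placeholders let the driver bind the values, so the query text never has to be assembled by string concatenation.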
  • 11. Fundamentals of data exploration (joins)
        When you can do JOINs, it is good for analytics.
        When a database does not provide joins, all of that work is left to the users (on the client side).
  • 12. Outer Relational Join (on time stamp)

        Input 1:
        Time stamp [s]   Value Room [Wa2]
        1                30
        2                25
        5                12

        Input 2:
        Time stamp [s]   Value Home [Wa2]
        1                100
        2                78
        3                99
        4                70

        Result:
        Time stamp [s]   Value Room [Wa2]   Value Home [Wa2]
        1                30                 100
        2                25                 78
        3                NaN                99
        4                NaN                70
        5                12                 NaN
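The outer join on this slide can be sketched in plain Python. The sketch assumes each input is held as a mapping from time stamp to value; the helper name `outer_join` is hypothetical, and missing entries are filled with NaN as on the slide.

```python
import math

# Values copied from the two input tables on the slide (time stamp -> value).
room = {1: 30, 2: 25, 5: 12}
home = {1: 100, 2: 78, 3: 99, 4: 70}

NAN = float("nan")


def outer_join(left: dict, right: dict) -> list:
    """Full outer join on the key: keep every key found in either input."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k, NAN), right.get(k, NAN)) for k in keys]


for time_stamp, room_value, home_value in outer_join(room, home):
    print(time_stamp, room_value, home_value)
```

Taking the union of both key sets is what makes this an outer join: no time stamp from either input is dropped.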
  • 13. Left Join (on time stamp)

        Input 1 (left):
        Time stamp [s]   Value Room [Wa2]
        1                30
        2                25
        5                12

        Input 2 (right):
        Time stamp [s]   Value Home [Wa2]
        1                100
        2                78
        3                99
        4                70

        Result:
        Time stamp [s]   Value Room [Wa2]   Value Home [Wa2]
        1                30                 100
        2                25                 78
        5                12                 NaN
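The same left join can be expressed in SQL. The sketch below runs it through Python's built-in sqlite3 module; the table and column names are assumptions, and SQL reports the missing value as NULL (None in Python) where the slide shows NaN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE room (time_stamp INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE home (time_stamp INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO room VALUES (?, ?)", [(1, 30), (2, 25), (5, 12)])
conn.executemany(
    "INSERT INTO home VALUES (?, ?)", [(1, 100), (2, 78), (3, 99), (4, 70)]
)

# LEFT JOIN keeps every row of the left table (room) and attaches the
# matching row of the right table (home) when one exists.
rows = conn.execute(
    """
    SELECT room.time_stamp, room.value, home.value
    FROM room
    LEFT JOIN home ON home.time_stamp = room.time_stamp
    ORDER BY room.time_stamp
    """
).fetchall()
print(rows)
conn.close()
```

Time stamps 3 and 4 exist only in the right table, so they do not appear in the result.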
  • 14. Storing data efficiently is all about the application:
        •  schema-less vs. schema
        •  writing-centric vs. reading-centric
        •  transactional vs. analytics
        •  batch vs. stream
  • 15. Different data structures
        Key-value object
        •  A set of key-value pairs
        Extensible record (XML or JSON)
        •  Families of attributes have a schema
        •  New attributes may be added
        •  Many predictive analytics tasks will require a kind of record
        •  Many REST APIs will deliver JSON (or YAML, XML) structures
        •  Example: Twitter feeds
        Key-value stores (document stores might be a subset)
        •  No schema, no exposed nesting
        •  Often raw data (scalable to petabytes)
        •  Simple analytics tasks on top
        Illustration on the slide:
        •  keys: 45777, Ux_78, 321-87
        •  values: "Frank Kienle, Germany", "Please learn", "Random data"
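A small sketch of how a key-value store and an extensible record fit together, using a plain Python dict as a stand-in for the store and JSON as the record format. The key is taken from the slide's illustration; the field names and the added attribute are hypothetical.

```python
import json

# A key-value store modeled as a plain dict: the store sees only
# opaque keys and opaque (string) values.
store = {}

# An extensible record serialized as JSON; field names are hypothetical.
record = {"name": "Frank Kienle", "country": "Germany"}
store["45777"] = json.dumps(record)

# New attributes may be added later without changing any schema.
loaded = json.loads(store["45777"])
loaded["topic"] = "data science"
store["45777"] = json.dumps(loaded)

print(json.loads(store["45777"]))
```

The store itself never inspects the value, which matches the "no schema, no exposed nesting" property: any structure lives inside the serialized record and is interpreted by the client.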
  • 17. Example JSON Twitter feed
  • 18. (Typical) NoSQL data base features
        The ability to replicate and partition data over many servers
        •  Sharding: horizontal partitioning of the data set
        No query language: a simple API is defined instead
        Ability to scale operations over many servers
        •  Throughput increases
        •  Because there is no query (language) layer, each operation has to be designed against the API
        Operations are often restricted by data locality
        New features can be added dynamically to data records (no fixed schema)
        Consistency model is often weak (no modeling of transactions)
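Sharding and the "simple API instead of a query language" idea can be sketched as follows. This is a toy model, not how any particular NoSQL system works: the shards are in-process dicts standing in for servers, keys are placed by a simple hash modulo the shard count, and the names `shard_for` and `ShardedStore` are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys over several in-process 'servers'."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

# Number of keys that ended up on each shard.
print([len(shard) for shard in store.shards])
```

Each `put` or `get` touches exactly one shard, which is what lets throughput grow with the number of servers. Anything that spans shards, such as a join, has to be built by the client on top of this API.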
  • 19. Main memory database system (MMDB)
        In-memory database
        •  primarily relies on main memory for data storage
        •  main purpose is faster analytics on data
        •  relational or unstructured data structure
        •  memory-optimized data structures
  • 20. Row vs. Column data stores
        Advantage column-oriented:
        •  Reading efficiency: more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data. Example query:
           select col_1,col_2 from table where col_2>5 and col_2<45;
        •  Writing efficiency: more efficient when new values of a column are supplied for all rows at once
        Advantage row-oriented:
        •  Reading efficiency: more efficient when many columns of a single row are required at the same time, and when row-size is relatively small
        •  Writing efficiency: more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
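The difference in access patterns can be sketched with two in-memory layouts of the same table. Python containers do not reproduce the on-disk layout of a real database, so this illustrates only which data each layout has to touch. The data values and function names are made up for the example; the filter mirrors the query on the slide.

```python
# The same three-column table in two layouts; the data values are made up.
row_store = [
    {"col_1": "a", "col_2": 3, "col_3": 1.5},
    {"col_1": "b", "col_2": 10, "col_3": 2.5},
    {"col_1": "c", "col_2": 44, "col_3": 3.5},
    {"col_1": "d", "col_2": 60, "col_3": 4.5},
]
column_store = {
    "col_1": ["a", "b", "c", "d"],
    "col_2": [3, 10, 44, 60],
    "col_3": [1.5, 2.5, 3.5, 4.5],
}


def query_rows(rows):
    """Row layout: every full row is visited, including the unused col_3."""
    return [(r["col_1"], r["col_2"]) for r in rows if 5 < r["col_2"] < 45]


def query_columns(cols):
    """Column layout: only col_1 and col_2 are read; col_3 is never touched."""
    return [(c1, c2) for c1, c2 in zip(cols["col_1"], cols["col_2"]) if 5 < c2 < 45]


def append_row(rows, cols, new_row):
    """Writing one row: a single append for rows, one append per column otherwise."""
    rows.append(new_row)
    for name, value in new_row.items():
        cols[name].append(value)


print(query_rows(row_store))
print(query_columns(column_store))
```

Both queries return the same result. The column layout reads two of the three columns, while appending a single row costs it one write per column.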
  • 21. Processing types
        OLTP: On-line Transaction Processing, e.g. business transactions (insert, update, delete)
        OLAP: On-line Analytical Processing, e.g. complex analytics (aggregation of historical data)
  • 22. For data analytics, a column-oriented in-memory database is a must-have.
  • 23. Many trends in data bases are going back to data consistency
        Spanner idea: planet-scale data base system
        "… we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions …"
        •  Loose consistency for predictive analytics is horrible
        •  Loose consistency is a no-go for prescriptive analytics (dynamic pricing)
        •  Systems should always be designed for usability