Lecture: Introduction to Data Science
Given in 2017 at the Technical University of Kaiserslautern, Germany
Lecturer: Frank Kienle, Head of AI and Data Science, Camelot ITLab
Topic: introduction to databases
This document discusses enterprise data science and its role in extracting value from data. It defines data science as finding valuable insights from big data. Data science involves substantive expertise, hacking skills, and math/statistics knowledge. The document outlines how data science can support business processes and decisions at various points along a company's value chain, from upstream supply to downstream customer service. It emphasizes that data science work should aim to contribute to a company's top and bottom lines by enabling new revenue opportunities or optimizing operations. The goal is to help businesses make more effective, efficient, and data-driven decisions across strategic, tactical, and operational levels.
Great Expectations is an open-source Python library that helps validate, document, and profile data to maintain quality. It allows users to define expectations about data that are used to validate new data and generate documentation. Key features include automated data profiling, predefined and custom validation rules, and scalability. It is used by companies like Vimeo and Heineken in their data pipelines. While helpful for testing data, it is not intended as a data cleaning or versioning tool. A demo shows how to initialize a project, validate sample taxi data, and view results.
Viet-Trung Tran presents information on big data and cloud computing. The document discusses key concepts like what constitutes big data, popular big data management systems like Hadoop and NoSQL databases, and how cloud computing can enable big data processing by providing scalable infrastructure. Some benefits of running big data analytics on the cloud include cost reduction, rapid provisioning, and flexibility/scalability. However, big data may not always be suitable for the cloud due to issues like data security, latency requirements, and multi-tenancy overhead.
The document discusses architectures for big data processing from Hadoop to Spark. It describes the evolution from Hadoop/MapReduce to Spark, including distributed storage systems like HDFS, distributed computational models, and distributed execution engines. Spark improved on MapReduce by being more flexible, efficient, and supporting a wider variety of applications like SQL, machine learning, graphs, and streaming through its simple APIs. Resource managers have also evolved from YARN to include Mesos and Kubernetes.
This document provides an introduction to big data, including what it is, sources of big data, and how it is used. It discusses key concepts like volume, velocity, variety, and veracity of big data. It also describes the Hadoop ecosystem for distributed storage and processing of large datasets, including components like HDFS, MapReduce, Hive, HBase and ecosystem players like Cloudera and Hortonworks. The document outlines common big data use cases and how organizations are deploying Hadoop solutions in both on-premise and cloud environments.
This document summarizes a summer training seminar on BigData Hadoop that was attended. The training was provided by LinuxWorld Informatics Pvt Ltd, which offers open source and commercial training programs. The attendee learned about Hadoop, MapReduce, single and multi-node clusters, Docker, and Ansible. Big data challenges related to volume, variety, velocity, and veracity of data were also covered. Hadoop and its core components HDFS and MapReduce were explained as solutions for storing and processing large datasets in a distributed manner across commodity hardware. Docker containers were introduced as a lightweight alternative to virtual machines.
Everyone is awash in the new buzzword, Big Data, and it seems as if you can’t escape it wherever you go. But there are real companies with real use cases creating real value for their businesses by using big data. This talk will discuss some of the more compelling current or recent projects, their architecture & systems used, and successful outcomes.
This document introduces big data concepts and Microsoft's solutions for big data. It defines big data as large, complex datasets that are difficult to process using traditional systems. It also describes the 3Vs of big data: volume, velocity, and variety. The document then outlines Microsoft's offerings for big data including HDInsight, .NET SDK for Hadoop, ODBC driver for Hive, and integrations with Excel, SharePoint, and SQL Server. It provides overviews of Hadoop, HDFS, MapReduce, and the Hadoop ecosystem.
Very basic Introduction to Big Data. Touches on what it is, characteristics, some examples of Big Data frameworks. Hadoop 2.0 example - Yarn, HDFS and Map-Reduce with Zookeeper.
This document introduces big data by defining it as large, complex datasets that cannot be processed by traditional methods due to their size. It explains that big data comes from sources like online activity, social media, science, and IoT devices. Examples are given of the massive scales of data produced each day. The challenges of processing big data with traditional databases and software are illustrated through a fictional startup example. The document argues that new tools and approaches are needed to handle automatic scaling, replication, and fault tolerance. It presents Apache Hadoop and Spark as open-source big data tools that can process petabytes of data across thousands of nodes through distributed and scalable architectures.
This document provides an overview of big data storage technologies and their role in the big data value chain. It identifies key insights about data storage, including that scalable storage technologies have enabled virtually unbounded data storage and advanced analytics across sectors. However, lack of standards and challenges in distributing graph-based data limit interoperability and scalability. The document also notes the social and economic impacts of big data storage in enabling a data-driven society and transforming sectors like health and media through consolidated data analysis.
Big Data Streams Architectures. Why? What? How? (Anton Nazaruk)
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency big data analysis requirements. Apache Kafka and the Kappa Architecture in particular are attracting more and more attention compared with the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams are proving to be a synergy in the big data world.
This document provides an agenda for a Big Data summer training session presented by Amrit Chhetri. The agenda includes modules on Big Data analytics with Apache Hadoop, installing Apache Hadoop on Ubuntu, using HBase, advanced Python techniques, and performing ETL with tools like Sqoop and Talend. Amrit introduces himself and his background before delving into the topics to be covered in the training.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional software tools due to their size and complexity. It describes the characteristics of big data using the original 3Vs model (volume, velocity, variety) as well as additional attributes. The text then explains the architecture and components of Hadoop, the open-source framework for distributed storage and processing of big data, including HDFS, MapReduce, and other related tools. It provides an overview of how Hadoop addresses the challenges of big data through scalable and fault-tolerant distributed processing of data across commodity hardware.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types such as structured/unstructured and streaming/batch, and different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low-latency. And it has one or more of the following characteristics – high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media - much of it generated in real time and in a very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independent or together with their existing enterprise data to gain new insights resulting in significantly better and faster decisions.
My other computer is a datacentre - 2012 edition (Steve Loughran)
An updated version of the "my other computer is a datacentre" talk, presented at the Bristol University HPC talk.
Because it is targeted at universities, it emphasises some of the interesting problems - the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading, and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
The core idea behind Hadoop is to distribute both the data and user software on individual shards within the cluster. The Bigdata Replay method is drastically different in that it packs user software into batches on a single multicore machine and uses circuit emulation to maximize throughput when bringing data shards in for replay. The effect from hotspots, defined as drastically higher access frequency to a small portion of (popular) data, is different in the two platforms. This paper models the difference numerically but in a relative form, which makes it possible to compare the two platforms.
An overview of several technologies which contribute to the landscape of Big Data.
An intro to the technology challenges of Big Data, followed by key open-source components which help in dealing with various big data aspects such as OLAP, real-time online analytics, and machine learning on MapReduce. I conclude with an enumeration of the key areas where those technologies are most likely to unleash new opportunities for various businesses.
This document presents an overview of big data. It defines big data as large, diverse data that requires new techniques to manage and extract value from. It discusses the 3 V's of big data - volume, velocity and variety. Examples of big data sources include social media, sensors, photos and business transactions. Challenges of big data include storage, transfer, processing, privacy and data sharing. Past solutions discussed include data sharding, while modern solutions include Hadoop, MapReduce, HDFS and RDF.
Bigdata.
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[2] Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics simulations, biology and environmental research.[5]
Data sets grow rapidly - in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks.[6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data are generated.[9] One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.[10]
Relational database management systems and desktop statistics- and visualization-packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers".[11] What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
Introduction to Big Data Technologies & Applications (Nguyen Cao)
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
This document discusses big data tools and management at large scales. It introduces Hadoop, an open-source software framework for distributed storage and processing of large datasets using MapReduce. Hadoop allows parallel processing of data across thousands of nodes and has been adopted by large companies like Yahoo!, Facebook, and Baidu to manage petabytes of data and perform tasks like sorting terabytes of data in hours.
Recently, in the fields of Business Intelligence and Data Management, everybody is talking about data science, machine learning, predictive analytics and many other “clever” terms with promises to turn your data into gold. In these slides, we present the big picture of data science and machine learning. First, we define the context for data mining from a BI perspective, and try to clarify various buzzwords in this field. Then we give an overview of the machine learning paradigms. After that, we are going to discuss - at a high level - the various data mining tasks, techniques and applications. Next, we will have a quick tour through the Knowledge Discovery Process. Screenshots from demos will be shown, and finally we conclude with some takeaway points.
This document discusses NoSQL databases and compares them to relational databases. It provides information on different types of NoSQL databases, including key-value stores, document databases, wide-column stores, and graph databases. The document outlines some use cases for each type and discusses concepts like eventual consistency, CAP theorem, and polyglot persistence. It also covers database architectures like replication and sharding that provide high availability and scalability.
This document discusses NoSQL databases and compares MongoDB and Cassandra. It begins with an introduction to NoSQL databases and why they were created. It then describes the key features and data models of NoSQL databases including key-value, column-oriented, document, and graph databases. Specific details are provided about MongoDB and Cassandra, including their data structure, query operations, examples of usage, and enhancements. The document provides an in-depth overview of NoSQL databases and a side-by-side comparison of MongoDB and Cassandra.
Analysis and evaluation of Riak KV cluster environment using Basho Bench (StevenChike)
This document analyzes and evaluates the performance of the Riak KV NoSQL database cluster using the Basho-bench benchmark tool. Experiments were conducted on a 5-node Riak KV cluster to test throughput and latency under different workloads, data sizes, and operations (read, write, update). The results found that Riak KV can handle large volumes of data and various workloads effectively with good throughput, though latency increased with larger data sizes. Overall, Riak KV is suitable for distributed big data environments where high availability, scalability and fault tolerance are important.
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco (Dave Stokes)
This document discusses several new features in MySQL 8 including:
1. A new transactional data dictionary that stores metadata instead of files for improved simplicity and crash safety.
2. The addition of histograms to help the query optimizer understand data distributions without indexes for better query planning.
3. Resource groups that allow assigning threads to groups with specific CPU and memory limits to control resource usage.
4. Enhancements to JSON support like in-place updates and new functions for improved flexibility with semi-structured data.
The document discusses NoSQL databases as an alternative to traditional SQL databases. It provides an overview of NoSQL databases, including their key features, data models, and popular examples like MongoDB and Cassandra. Some key points:
- NoSQL databases were developed to overcome limitations of SQL databases in handling large, unstructured datasets and high volumes of read/write operations.
- NoSQL databases come in various data models like key-value, column-oriented, and document-oriented. Popular examples discussed are MongoDB and Cassandra.
- MongoDB is a document database that stores data as JSON-like documents. It supports flexible querying. Cassandra is a column-oriented database developed by Facebook that is highly scalable
Survey on implementation of column-oriented NoSQL data stores (Bigtable & Ca...) (IJCERT JOURNAL)
NoSQL databases provide a mechanism for the storage and retrieval of data that is modeled for the huge amounts of data used in big data and cloud computing. NoSQL systems are also called "Not only SQL" to emphasize that they may support SQL-like query languages. A basic classification of NoSQL is based on the data model, e.g. column, document, key-value. The objective of this paper is to study and compare the implementation of various column-oriented data stores like Bigtable and Cassandra.
The document discusses MongoDB and how it compares to relational database management systems (RDBMS). It provides examples of how data can be modeled and stored differently in MongoDB compared to SQL databases. Specifically, it discusses how MongoDB allows for flexible, dynamic schemas as each document can have a different structure. This enables complex data like product catalogs with varying attributes for different items to be stored easily in a single collection. The document also provides examples of common operations like insert, update and delete in MongoDB compared to SQL.
NoSQL databases have a distributed data structure that provides high availability and scalability compared to relational databases. NoSQL databases are categorized as key-value stores, document stores, extensible record stores, or graph stores depending on how data is stored and accessed. The right NoSQL database choice depends on factors like performance needs, scalability, flexibility, and whether transactions or analytics are more important for a given use case.
This document provides an overview of NoSQL databases. It discusses that NoSQL databases are non-relational and do not follow the RDBMS principles. It describes some of the main types of NoSQL databases including document stores, key-value stores, column-oriented stores, and graph databases. It also discusses how NoSQL databases are designed for massive scalability and do not guarantee ACID properties, instead following a BASE model of Basically Available, Soft state, and Eventually Consistent.
This document discusses how Cassandra can help optimize key performance indicators like velocity, security, availability, and performance. It explains Cassandra's peer-to-peer architecture with no single point of failure, how it scales horizontally by adding nodes, and its log-structured storage format that makes writes fast with no overhead for read-before-write. It also covers Cassandra's replication across multiple nodes for high availability and data distribution across regions.
The document provides an overview of Big Data technology landscape, specifically focusing on NoSQL databases and Hadoop. It defines NoSQL as a non-relational database used for dealing with big data. It describes four main types of NoSQL databases - key-value stores, document databases, column-oriented databases, and graph databases - and provides examples of databases that fall under each type. It also discusses why NoSQL and Hadoop are useful technologies for storing and processing big data, how they work, and how companies are using them.
The document provides an introduction to NoSQL databases. It discusses the issues with scaling relational databases, defines what NoSQL is, and covers some of the major NoSQL databases including key-value, document, and column-based databases. It also discusses the CAP theorem and how NoSQL databases provide more flexibility and horizontal scaling compared to relational databases.
The document discusses the history of database management and database models through 6 generations from 1900 to present. It describes the evolution from early manual record keeping systems to current big data technologies. Key database models discussed include hierarchical, network, relational, object-oriented, and dimensional models. The document also covers topics like data warehousing and data mining.
The document discusses data warehousing and OLAP (online analytical processing). It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. The document outlines common data warehouse architectures like star schemas and snowflake schemas and discusses how data is modeled and organized in multidimensional data cubes. It also describes typical OLAP operations for analyzing and exploring cube data like roll-up, drill-down, slice and dice.
This document discusses relational and non-relational databases. It begins by introducing NoSQL databases and some of their key characteristics like not requiring a fixed schema and avoiding joins. It then discusses why NoSQL databases became popular for companies dealing with huge data volumes due to limitations of scaling relational databases. The document covers different types of NoSQL databases like key-value, column-oriented, graph and document-oriented databases. It also discusses concepts like eventual consistency, ACID properties, and the CAP theorem in relation to NoSQL databases.
An overview of various database technologies and their underlying mechanisms over time.
Presentation delivered at Alliander internally to inspire the use of and forster the interest in new (NOSQL) technologies. 18 September 2012
The document discusses NoSQL technologies including Cassandra, MongoDB, and ElasticSearch. It provides an overview of each technology, describing their data models, key features, and comparing them. Example documents and queries are shown for MongoDB and ElasticSearch. Popular use cases for each are also listed.
Columnar databases store data by columns rather than rows. This column-oriented approach keeps all attribute information together, improving query performance for analytics workloads that retrieve subsets of columns. However, it increases overhead for write operations like inserts due to needing to modify all columns for each row. Columnar databases are well-suited for analytical workloads with many reads and few writes, like data warehousing.
Similar to "Data Bases - Introduction to Data Science"
This document summarizes a lecture on using data and artificial intelligence for good. It discusses setting goals for AI, such as the United Nations' 17 sustainable development goals. It also discusses challenges around data like clickbait and how bots can generate content. Finally, it talks about how AI may impact jobs and the need to focus on augmenting rather than automating tasks.
Machine Learning part 3 - Introduction to Data Science (Frank Kienle)
Lecture: Introduction to Data Science
Given in 2017 at the Technical University of Kaiserslautern, Germany
Topic: part 3 machine learning, link to data science practice
Business Models - Introduction to Data Science (Frank Kienle)
This document discusses data science and business models. It notes that understanding the business problem and delivering value is key for data scientists. Different types of cloud services and business models using machine learning are described, including everything-as-a-service, infrastructure as a service, and software as a service. Standardization is important but data science problems often depend on unique business contexts. Data sources and platforms from AWS, Microsoft, Google, and Dell are also mentioned.
Lecture summary: architectures for baseband signal processing of wireless com... (Frank Kienle)
The problem with this parallel processing of the interleaver is that it requires random access to the memory locations storing the interleaved addresses. However, achieving random access to multiple memory locations in parallel is difficult and inefficient in hardware implementations. It is better to generate the interleaved addresses sequentially rather than requiring parallel random access.
Monte Carlo methods rely on repeated random sampling to compute results. They generate random samples from a population according to a probability distribution and use them to obtain numerical results. The founders of the Monte Carlo method were J. von Neumann and S. Ulam during the Manhattan Project in the 1940s. Monte Carlo methods can be used to solve multidimensional integrals and have better convergence than classical numerical integration methods for dimensions greater than 4. The variance of Monte Carlo estimates decreases as 1/N, where N is the number of samples, resulting in slow convergence. Variance reduction techniques can improve the convergence rate.
Data scientist: the sexiest job of the 21st century (Frank Kienle)
Invited talk describing the exciting work at Blue Yonder (www.blue-yonder.com), given at the 'congress smart services - new business models' in Aachen, Germany, 2015.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Natural Language Processing (NLP), RAG and its applications.pptx (fkyes25)
In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
3. Overview of data sources
• KDnuggets dataset index: http://www.kdnuggets.com/datasets/index.html
• Machine learning data - UCI Machine Learning Repository: archive.ics.uci.edu
• DataShop, the world's largest repository of learning interaction data: https://pslcdatashop.web.cmu.edu
Getting data is not the problem - there is a very large variety of data sources.
4. Database
• Formally, a "database" refers to a set of related data and the way it is organized.
• A database manages data efficiently and allows users to perform multiple tasks with ease. Efficient access to the data is usually provided by a "database management system" (DBMS).
• A database management system stores, organizes and manages a large amount of information within a single software application.
• Use of such a system increases the efficiency of business operations and reduces overall costs.
• Different database systems exist, designed with respect to:
  • the data to be stored in the database
  • the relationships between the different data elements - dependencies within the data which can be modeled by mathematical relations
  • the logical structure imposed on the data on the basis of these relationships; the goal is to arrange the data into a logical structure which can then be mapped onto storage objects
6. Scalability in big data
• Scale up: using more and more main memory
• Scale out: using more and more computers
• Definition (complexity order m): for N data items, an algorithm scales with N^m, e.g. polynomial complexity.
• Parallelizing over k nodes: the algorithm scales with N^m / k.
• Goal: find algorithms with complexity N log(N), which relates e.g. to trees (each item is touched only once).
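As a rough illustration of these scaling formulas (not from the slides; the values of N, m and k below are arbitrary assumptions), a minimal Python sketch shows why an N log N algorithm eventually beats even a perfectly parallelized polynomial one:

```python
import math

def polynomial_cost(n: int, m: int, k: int = 1) -> float:
    """Work of an O(N^m) algorithm, ideally parallelized over k nodes."""
    return n ** m / k

def nlogn_cost(n: int) -> float:
    """Work of an O(N log N) algorithm (e.g. tree-based, one touch per item)."""
    return n * math.log2(n)

for n in (1_000, 1_000_000):  # arbitrary example sizes
    print(f"N={n:>9,}  N^2={polynomial_cost(n, 2):.2e}  "
          f"N^2 on 100 nodes={polynomial_cost(n, 2, 100):.2e}  "
          f"N log N={nlogn_cost(n):.2e}")
# Scaling out by k only divides the polynomial cost by a constant,
# while the N log N algorithm grows far more slowly with N.
```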
7. CAP theorem
• C: consistency - do all applications see the same data? Any data written to the database must be valid according to all defined rules.
• A: availability - can I interact with the system in the presence of failures?
• P: partition tolerance - if two sections of your system cannot talk to each other, can they make forward progress on their own?
  - If not, you sacrifice availability.
  - If so, you might have to sacrifice consistency.
Example systems placed on the CAP triangle (as commonly grouped): Dynamo, Riak, Voldemort, Cassandra, CouchDB (availability + partition tolerance); Bigtable, HBase, Hypertable, Megastore, Spanner, Accumulo (consistency + partition tolerance); classic RDBMS (consistency + availability).
9. Relational databases
Key ideas:
• storage and retrieval of large quantities of related data
• when creating a database you should think about which tables are needed and what relationships exist between the data in your tables
• relational algebra
• physical/logical data independence
Think about the design in advance.
10. Structured Query Language (SQL)
• A database is created for the storage and retrieval of data: we want to be able to INSERT data into the database and we want to be able to SELECT data from it.
• A database query language, the Structured Query Language (SQL), was invented for these tasks.
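To make the INSERT/SELECT idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table name and columns are illustrative assumptions, not something prescribed by the slides:

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (ts INTEGER PRIMARY KEY, value REAL)")

# INSERT data into the database.
conn.executemany(
    "INSERT INTO measurements (ts, value) VALUES (?, ?)",
    [(1, 30.0), (2, 25.0), (5, 12.0)],
)

# SELECT data from the database.
for ts, value in conn.execute("SELECT ts, value FROM measurements WHERE value > 20"):
    print(ts, value)

conn.close()
```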
11. Fundamentals of data exploration (joins)
• When you can do JOINs, that is good for analytics.
• When a database does not provide joins, all of that work is left to the users (it stays on the client side).
12. Outer relational join (on time stamp)

Room table:
Time stamp [s] | Value room [Wa2]
1              | 30
2              | 25
5              | 12

Home table:
Time stamp [s] | Value home [Wa2]
1              | 100
2              | 78
3              | 99
4              | 70

Outer join result:
Time stamp [s] | Value room [Wa2] | Value home [Wa2]
1              | 30               | 100
2              | 25               | 78
3              | NaN              | 99
4              | NaN              | 70
5              | 12               | NaN
13. Left join (on time stamp)

Room table:
Time stamp [s] | Value room [Wa2]
1              | 30
2              | 25
5              | 12

Home table:
Time stamp [s] | Value home [Wa2]
1              | 100
2              | 78
3              | 99
4              | 70

Left join result (keep all rows of the room table):
Time stamp [s] | Value room [Wa2] | Value home [Wa2]
1              | 30               | 100
2              | 25               | 78
5              | 12               | NaN
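The two example tables above can be reproduced with pandas, a natural tool for such joins in a data science setting (a sketch assuming pandas is installed; the column names simply mirror the tables above):

```python
import pandas as pd

room = pd.DataFrame({"ts": [1, 2, 5], "value_room": [30, 25, 12]})
home = pd.DataFrame({"ts": [1, 2, 3, 4], "value_home": [100, 78, 99, 70]})

# Outer join: keep all timestamps from both tables; missing values become NaN.
outer = room.merge(home, on="ts", how="outer").sort_values("ts")
print(outer)

# Left join: keep only the timestamps of the room table.
left = room.merge(home, on="ts", how="left")
print(left)
```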
14. Storing data efficiently is all about the application
• schema-less vs. schema
• write-centric vs. read-centric
• transactional vs. analytics
• batch vs. stream
15. Different data structures
Key-value object
• A set of key-value pairs
Extensible record (XML or JSON)
• Families of attributes have a schema
• New attributes may be added
• Many predictive analytics tasks will require this kind of record
• Many REST APIs deliver JSON (or YAML, XML) structures
• Example: Twitter feeds
Key-value stores (document stores might be seen as a subset)
• No schema, no exposed nesting
• Often raw data (scalable to petabytes)
• Simple analytics tasks on top

Example key-value pairs from the slide:
key    | value
45777  | Frank Kienle, Germany
Ux_78  | Please learn
321-87 | Random data
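A minimal Python sketch of the two structures (the dictionary keys mirror the toy key-value pairs above; the JSON record fields are made-up assumptions):

```python
import json

# Key-value store view: opaque values addressed only by key.
kv_store = {
    "45777": "Frank Kienle, Germany",
    "Ux_78": "Please learn",
    "321-87": "Random data",
}
print(kv_store["Ux_78"])

# Extensible record view: a JSON document where new attributes can be
# added per record without a fixed global schema.
record = {"user": "Frank Kienle", "country": "Germany"}
record["followers"] = 42  # a new attribute added later
print(json.dumps(record))
```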
18. (Typical) NoSQL database features
• The ability to replicate and partition data over many servers
  • Sharding: horizontal partitioning of the data set (see the sketch below)
• No query language: only a simple API is defined
• The ability to scale operations over many servers
  • Throughput increases
  • Due to the missing (language) query layer, each operation has to be designed against the API
• Operations often have restrictions with respect to data locality
• New features can be added dynamically to data records (no fixed schema)
• The consistency model is often weak (no modeling of transactions)
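A minimal sketch of hash-based sharding with a simple put/get API instead of a query language (the number of shards and the routing function are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 4  # assumed number of servers/partitions
shards = {i: {} for i in range(NUM_SHARDS)}

def shard_for(key: str) -> int:
    """Route a record key to one of NUM_SHARDS partitions via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value: object) -> None:
    shards[shard_for(key)][key] = value

def get(key: str) -> object:
    return shards[shard_for(key)].get(key)

put("45777", "Frank Kienle, Germany")
put("Ux_78", "Please learn")
print(get("Ux_78"), "is stored on shard", shard_for("Ux_78"))
```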
19. Main memory database system (MMDB)
In-memory database:
• primarily relies on main memory for data storage
• its main purpose is faster analytics on data
• relational or unstructured data structures
• memory-optimized data structures
20. Row vs. column data stores
Advantages of column-oriented storage:
• Reading efficiency: more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns, e.g.
  select col_1, col_2 from table where col_2 > 5 and col_2 < 45;
• Writing efficiency: more efficient when new values of a column are supplied for all rows at once.
Advantages of row-oriented storage:
• Reading efficiency: more efficient when many columns of a single row are required at the same time, and when the row size is relatively small.
• Writing efficiency: more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
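To illustrate the layout difference with made-up data (not from the slides), the same small table can be held row-wise as a list of records or column-wise as one array per column; the columnar layout only has to touch the columns the query actually needs:

```python
# Row-oriented layout: one record per row, all columns stored together.
rows = [
    {"col_1": "a", "col_2": 10, "col_3": 1.5},
    {"col_1": "b", "col_2": 50, "col_3": 2.5},
    {"col_1": "c", "col_2": 30, "col_3": 3.5},
]

# Column-oriented layout: one array per column.
columns = {
    "col_1": ["a", "b", "c"],
    "col_2": [10, 50, 30],
    "col_3": [1.5, 2.5, 3.5],
}

# "select col_1, col_2 from table where col_2 > 5 and col_2 < 45"
# on the columnar layout reads only col_1 and col_2; col_3 is never touched.
result = [(c1, c2)
          for c1, c2 in zip(columns["col_1"], columns["col_2"])
          if 5 < c2 < 45]
print(result)  # [('a', 10), ('c', 30)]
```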
21. Processing types
• OLTP: On-Line Transaction Processing, e.g. business transactions (insert, update, delete)
• OLAP: On-Line Analytical Processing, e.g. complex analytics (aggregation over historical data)
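A rough illustration of the two workload styles, again with Python's sqlite3 on a made-up table (neither the schema nor the queries come from the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, year INTEGER)")

# OLTP style: short transactions that insert/update/delete individual rows.
with conn:  # commits (or rolls back) the transaction automatically
    conn.execute("INSERT INTO orders (customer, amount, year) VALUES (?, ?, ?)",
                 ("ACME", 120.0, 2017))
    conn.execute("UPDATE orders SET amount = ? WHERE customer = ?", (130.0, "ACME"))

# OLAP style: read-mostly aggregation over (historical) data.
for year, total in conn.execute("SELECT year, SUM(amount) FROM orders GROUP BY year"):
    print(year, total)

conn.close()
```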
22. For data analytics, a column-oriented in-memory database is a must-have.
23. Many trends in databases are going back to data consistency
Spanner idea: a planet-scale database system.
"... we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions ..."
• Loose consistency for predictive analytics is horrible.
• Loose consistency is a no-go for prescriptive analytics (e.g. dynamic pricing).
• Systems should always be designed for usability.