SYSTEMS AND METHODS FOR BIG AND UNSTRUCTURED DATA
Data Architecture Concepts
Marco Brambilla
marco.brambilla@polimi.it
@marcobrambi
Schema
The data schema
• Typing
• Coherence
• Uniformity
Transactions
• The relational world
• Multi-user
• Distributed systems
Definition of Transaction
• An elementary unit of work performed by an application
• Each transaction is encapsulated within two commands:
• begin transaction (bot) and end transaction (eot)
• Within a transaction one of the commands below is executed (exactly once):
• commit work (commit) and rollback work (abort)
• Transactional System (OLTP): a system capable of providing the definition
and execution of transactions on behalf of multiple, concurrent applications
• As opposed to OLAP
Application and Transaction
begin T1
begin T2
end T1
end T2
Transaction
T2
Transaction
T1
Application
program
Actions
Actions
Transaction: Example
begin transaction;
update Account
set Balance = Balance + 10 where AccNum
= 12202;
update Account
set Balance = Balance – 10 where AccNum
= 42177;
commit work;
end transaction;
Transaction: Example with Alternative
begin transaction;
update Account
set Balance = Balance + 10 where AccNum =
12202;
update Account
set Balance = Balance – 10 where AccNum =
42177;
select Balance into A from Account
where AccNum = 42177;
if (A>=0) then commit work
else rollback work;
end transaction;
Well-formed Transactions
• begin transaction
• code for data manipulation (reads and writes)
• commit work – rollback work
• no data manipulation
• end transaction
S0 S1
W1
S2
W2
S3
W3
Sr
W4
Partitioning and
Replication
Horiz. Vs. Vert. Partitioning
• The distinction of horizontal vs vertical comes from the traditional
tabular view of a database.
Data Partitioning
Aim: scalability, distribution.
Partitioning splits the data in the database and partitions pieces of it to
different storage nodes.
Databases can be sharded horizontally (by rows) or vertically (by columns).
See also: Sharding (as horizontal partitioning).
C
B
A
SPLIT
C
B
A
Replication
Aim: fault-tolerance, backup
Replication copies the entire database across all nodes in the
distributed system.
COPY
Partitioning + Replication
Possible?
COPY
C
B
A
SPLIT
C
B
A
C
B
A
SPLIT
C
B
A
C
B
A
Pros / Cons of Each
Partitioning Replication
Pros
Fast data writing /
reading. Low memory
overhead.
Fast data reading.
High data reliability.
Cons
Potential data loss High network
overhead. High
memory overhead.
Scale and Ingestion
Scalability
• How big is big?
• Not only scaling up
• Elasticity
Data Ingestion
• The process of importing,
transferring and loading data for
storage and later use
• It involves loading data from a
variety of sources
• It can involve altering and
modification of individual files to fit
into a format that optimizes the
storage
• For instance, in Big Data small files
are concatenated to form files of
100s of MBs and large files are
broken down in files of 100s of MB.
Data
source #1
bus
Data
source #2
Data
source #n
[…]
Integration services
Transportation management, message
routing, message brokering,
transaction management, security, and
transformation.
Data Wrangling
The process of cleansing "raw" data
and transforming raw it into data that
can be analysed to generate valid
actionable insights.
It includes understanding, cleansing,
augmenting and shaping data.
The results is data in the best format
(e.g., columnar) for the analysis to
perform.
Understand
What types of data are
there, qualitative,
quantitative, categorical?
Is there any outliner?
Is any data missing?
Cleanse
remove outliners
adding missing values
transform quantitative
values in categorical ones
Augment
Aggregate data within one
data source
Add data from other data
sources based on
exact/fuzzy matches
Shape
Give data the best format
for the analysis to perform
SYSTEMS AND METHODS FOR BIG AND UNSTRUCTURED DATA
Data Architecture Concepts
Marco Brambilla
marco.brambilla@polimi.it
@marcobrambi

Essential concepts of data architectures

  • 1.
    SYSTEMS AND METHODSFOR BIG AND UNSTRUCTURED DATA Data Architecture Concepts Marco Brambilla marco.brambilla@polimi.it @marcobrambi
  • 2.
  • 3.
    The data schema •Typing • Coherence • Uniformity
  • 4.
    Transactions • The relationalworld • Multi-user • Distributed systems
  • 5.
    Definition of Transaction •An elementary unit of work performed by an application • Each transaction is encapsulated within two commands: • begin transaction (bot) and end transaction (eot) • Within a transaction one of the commands below is executed (exactly once): • commit work (commit) and rollback work (abort) • Transactional System (OLTP): a system capable of providing the definition and execution of transactions on behalf of multiple, concurrent applications • As opposed to OLAP
  • 6.
    Application and Transaction beginT1 begin T2 end T1 end T2 Transaction T2 Transaction T1 Application program Actions Actions
  • 7.
    Transaction: Example begin transaction; updateAccount set Balance = Balance + 10 where AccNum = 12202; update Account set Balance = Balance – 10 where AccNum = 42177; commit work; end transaction;
  • 8.
    Transaction: Example withAlternative begin transaction; update Account set Balance = Balance + 10 where AccNum = 12202; update Account set Balance = Balance – 10 where AccNum = 42177; select Balance into A from Account where AccNum = 42177; if (A>=0) then commit work else rollback work; end transaction;
  • 9.
    Well-formed Transactions • begintransaction • code for data manipulation (reads and writes) • commit work – rollback work • no data manipulation • end transaction S0 S1 W1 S2 W2 S3 W3 Sr W4
  • 10.
  • 11.
    Horiz. Vs. Vert.Partitioning • The distinction of horizontal vs vertical comes from the traditional tabular view of a database.
  • 12.
    Data Partitioning Aim: scalability,distribution. Partitioning splits the data in the database and partitions pieces of it to different storage nodes. Databases can be sharded horizontally (by rows) or vertically (by columns). See also: Sharding (as horizontal partitioning). C B A SPLIT C B A
  • 13.
    Replication Aim: fault-tolerance, backup Replicationcopies the entire database across all nodes in the distributed system. COPY
  • 14.
  • 15.
    Pros / Consof Each Partitioning Replication Pros Fast data writing / reading. Low memory overhead. Fast data reading. High data reliability. Cons Potential data loss High network overhead. High memory overhead.
  • 16.
  • 17.
    Scalability • How bigis big? • Not only scaling up • Elasticity
  • 18.
    Data Ingestion • Theprocess of importing, transferring and loading data for storage and later use • It involves loading data from a variety of sources • It can involve altering and modification of individual files to fit into a format that optimizes the storage • For instance, in Big Data small files are concatenated to form files of 100s of MBs and large files are broken down in files of 100s of MB. Data source #1 bus Data source #2 Data source #n […] Integration services Transportation management, message routing, message brokering, transaction management, security, and transformation.
  • 19.
    Data Wrangling The processof cleansing "raw" data and transforming raw it into data that can be analysed to generate valid actionable insights. It includes understanding, cleansing, augmenting and shaping data. The results is data in the best format (e.g., columnar) for the analysis to perform. Understand What types of data are there, qualitative, quantitative, categorical? Is there any outliner? Is any data missing? Cleanse remove outliners adding missing values transform quantitative values in categorical ones Augment Aggregate data within one data source Add data from other data sources based on exact/fuzzy matches Shape Give data the best format for the analysis to perform
  • 20.
    SYSTEMS AND METHODSFOR BIG AND UNSTRUCTURED DATA Data Architecture Concepts Marco Brambilla marco.brambilla@polimi.it @marcobrambi