The document provides an overview of software architecture considerations for data applications. It discusses sample data system components like Memcached, Redis, Elasticsearch, and Solr. It covers topics such as service level objectives, data models, query languages, graph models, data warehousing, machine learning pipelines, and distributed systems. Specific frameworks and technologies mentioned include Spark, Kafka, Neo4j, PostgreSQL, and ZooKeeper. The document aims to help understand architectural tradeoffs and guide the design of scalable, performant, and robust data systems.
A Sample Data System
Memcached
Redis
ElastiCache
Elasticsearch
Solr
Service Level Objectives (SLOs)
Service Level Agreements (SLAs)
A service may be required to be up at least
99.9% of the time.
A service may be considered up if:
• Median response time < 200 ms
• 99th percentile < 1s
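As a rough illustration of the SLO above, here is a minimal sketch (assuming NumPy is available; the latency data is simulated, and the thresholds are the ones on this slide):

```python
# Minimal sketch: check simulated response times against the example SLO above.
# Thresholds (median < 200 ms, p99 < 1 s) come from the slide; data is synthetic.
import numpy as np

def meets_slo(response_times_ms):
    """Return True if median < 200 ms and 99th percentile < 1000 ms."""
    median = np.percentile(response_times_ms, 50)
    p99 = np.percentile(response_times_ms, 99)
    return median < 200 and p99 < 1000

# Example: 10,000 simulated request latencies in milliseconds
latencies = np.random.lognormal(mean=4.5, sigma=0.5, size=10_000)
print(meets_slo(latencies))
```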
Elastic Service:
Automatically adds computing resources when a load increase is detected
Operability: Making life easy for operations
Simplicity: Managing Complexity
Evolvability: Making Change Easy
Kinesis
Kafka
RabbitMQ
Data Models
Relational Model (SQL)
• Pro: join
• Con: locality
• Schema-on-write
Document Model (NoSQL)
• Tree structure
• Pro: locality
• Con: join
• Schema-on-read
JSON support in relational databases: PostgreSQL, MySQL, SQL Server, Oracle
Query Languages
Imperative Language
tells the computer to perform certain operations in a certain order
Declarative Language (SQL)
just specify the pattern of the data you want, but not how to achieve that goal
(declarative queries often lend themselves to parallel execution)
The query optimizer (e.g., in PostgreSQL) automatically decides which parts of the
query to execute in which order, and which indexes to use
Declarative Queries on the Web
CSS:
all <p> elements whose direct parent is an
<li> element with a CSS class of selected
XSL (XPath):
Imperative Approach in JavaScript
Graph Models & Query Language
Find people who emigrated from the US to Europe:
Cypher Query Language (Neo4j)
Using Recursive Common Table Expressions in SQL:
follow a chain of outgoing WITHIN edges until eventually you reach a vertex
of type Location, whose name property is equal to "United States".
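The recursive-CTE idea above can be sketched in plain SQL. The following is a minimal, hypothetical illustration using SQLite from Python: the `vertices`, `within_edges`, and `born_in` tables and the sample rows are made up, and it only shows the "follow WITHIN edges until reaching the Location named United States" part, not the full emigration query.

```python
import sqlite3

# Hypothetical graph tables; only illustrates the recursive traversal idea.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE within_edges (child INTEGER, parent INTEGER);
CREATE TABLE born_in (person INTEGER, location INTEGER);

INSERT INTO vertices VALUES
  (1, 'Location', 'United States'),
  (2, 'Location', 'Idaho'),
  (3, 'Location', 'Boise'),
  (4, 'Person',   'Lucy');
INSERT INTO within_edges VALUES (3, 2), (2, 1);  -- Boise -> Idaho -> United States
INSERT INTO born_in VALUES (4, 3);               -- Lucy was born in Boise
""")

# Recursively expand every location contained (directly or transitively)
# within the United States, then find people born in any of them.
rows = conn.execute("""
WITH RECURSIVE in_usa(vertex_id) AS (
  SELECT id FROM vertices WHERE type = 'Location' AND name = 'United States'
  UNION
  SELECT e.child FROM within_edges e JOIN in_usa u ON e.parent = u.vertex_id
)
SELECT v.name
FROM born_in b
JOIN in_usa   u ON b.location = u.vertex_id
JOIN vertices v ON v.id = b.person;
""").fetchall()
print(rows)  # [('Lucy',)]
```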
Data Warehouse and Star Schema
OLTP: Online Transaction Processing
OLAP: Online Analytic Processing
Column-Oriented Storage
Although fact tables are often over 100 columns wide, a typical data
warehouse query only accesses a few of them at one time
Column Compression (when values in a column are not unique)
Examples: Amazon Redshift (column-oriented data warehouse), Apache Parquet (columnar file format)
If the fact table has 10 million rows but only 100 distinct products, the product column achieves a good compression ratio.
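A minimal sketch of why such a low-cardinality column compresses well, using dictionary and run-length encoding; the data sizes and helper names are illustrative only, not a real column-store implementation:

```python
# Dictionary-encode the product column, then run-length encode the codes,
# as a column store might do within a sorted partition.
import random
from itertools import groupby

products = [f"product_{i}" for i in range(100)]            # 100 distinct values
column = [random.choice(products) for _ in range(10_000)]  # stand-in for 10M rows

# Dictionary encoding: store each distinct string once, keep small integer codes.
dictionary = {p: code for code, p in enumerate(sorted(set(column)))}
codes = [dictionary[p] for p in column]

# Run-length encoding of the sorted codes: at most 100 runs remain.
runs = [(code, len(list(group))) for code, group in groupby(sorted(codes))]
print(len(column), "values ->", len(runs), "runs +", len(dictionary), "dictionary entries")
```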
Partitioning (Sharding)
Combining Replication and Partitioning
Partitioning by Key Range
If using (update_time) as Key:
• Range scan is easy
• Can lead to hot spots
If using (user_id, update_time) as key:
• Different users may be stored on different partitions
• Within each user, the updates are stored ordered by
timestamp on a single partition
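A minimal sketch of the compound-key idea above: route records by user_id so one user never spans partitions, and keep each partition ordered by update_time so per-user range scans stay cheap. The partition count, hashing choice, and data are hypothetical.

```python
import bisect
from hashlib import md5

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each holds (update_time, user_id, payload)

def partition_for(user_id: str) -> int:
    # Hash only the first part of the key, so one user's updates stay on one partition.
    return int(md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def write(user_id: str, update_time: int, payload: str) -> None:
    p = partitions[partition_for(user_id)]
    bisect.insort(p, (update_time, user_id, payload))  # keep time-ordered within the partition

def range_scan(user_id: str, start: int, end: int):
    p = partitions[partition_for(user_id)]
    return [r for r in p if r[1] == user_id and start <= r[0] <= end]

write("alice", 100, "login")
write("alice", 105, "click")
write("bob",   101, "login")
print(range_scan("alice", 100, 104))  # [(100, 'alice', 'login')]
```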
Request Routing
ZooKeeper: keep track of cluster metadata (used by Kafka, SolrCloud, HBase)
Transaction And ACID
Transaction: group several reads and writes together into a logical unit.
All the reads and writes in a transaction are executed as one operation:
either the entire transaction succeeds (commit) or it fails (abort, rollback).
To be reliable, a system must deal with various faults and ensure
that they don’t cause catastrophic failure of the entire system
ATOMICITY
If the transaction cannot be completed (committed) due to a fault,
the database must discard or undo any writes it has made so far.
CONSISTENCY
Certain statements about your data (invariants) that must always be
true—for example, in an accounting system, credits and debits across
all accounts must always be balanced.
ISOLATION
Concurrently executing transactions are isolated from each other:
they cannot step on each other’s toes.
DURABILITY
Durability is the promise that once a transaction has committed
successfully, any data it has written will not be forgotten, even if there
is a hardware fault or the database crashes.
Serializable isolation has a performance cost, so many systems use weaker levels of
isolation; Read Committed is a basic one.
Dirty Read: reading uncommitted writes
user 2: the mailbox listing shows an unread message, but the counter shows zero unread messages
Dirty Write: overwriting an uncommitted value
The sale’s listing is assigned to “Bob”, but the invoice recipient is “Alice”
Read Committed & Repeatable Read
No Dirty Read: only committed data can be read
user 2: sees the new value for x only after user 1’s transaction has committed
No Dirty Write: only committed data can be overwritten
Read Committed
Nonrepeatable Read Issue:
Alice sees a total of $900 across her accounts, instead of $1,000.
Repeatable Read (Snapshot Isolation)
Each transaction reads from a consistent snapshot of the database—
that is, the transaction sees all the data that was committed in the
database at the start of the transaction.
Distributed System: Lock and Fencing Token
Lock and Issue:
• Only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrently writing to it and corrupting it.
• Issue: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.
Fencing Token (a number that increases every time a lock is granted):
• Every time a client sends a write request to the storage service, it must include its current fencing token.
• Client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.
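A minimal sketch of the storage-side check that makes the 33-vs-34 example work; the lock service and clients are not modeled, and the class and method names are made up.

```python
# Storage service that rejects writes carrying a fencing token older than
# one it has already processed.
class StorageService:
    def __init__(self):
        self.max_token_seen = 0
        self.files = {}

    def write(self, token: int, filename: str, data: str) -> bool:
        # Reject any write whose token is not newer than one already processed.
        if token <= self.max_token_seen:
            print(f"rejected write with stale token {token}")
            return False
        self.max_token_seen = token
        self.files[filename] = data
        return True

storage = StorageService()
storage.write(33, "file1", "from client 1")   # accepted (token 33)
storage.write(34, "file1", "from client 2")   # accepted (token 34)
storage.write(33, "file1", "late write")      # rejected: 33 <= 34
```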
Packets can be lost, reordered, duplicated, or arbitrarily delayed in
the network; clocks are approximate at best; and nodes can pause
(e.g., due to garbage collection) or crash at any time.
Batch Processing with Unix Tools, MapReduce, and Spark
• Services [HTTP/REST-based APIs] (online systems)
• Batch Processing Systems (offline systems)
• Stream Processing Systems (near-real-time systems)
Batch Processing Web Logs with Unix Tools
MapReduce and Distributed Filesystems
• Read a set of input files and break it up into records.
• Call the mapper function to extract a key and value from each input record.
awk '{print $7}': it extracts the URL ($7) as the key and leaves the value empty.
• Sort all the key-value pairs by key.
This is done by the first sort command.
• Call the reducer function to iterate over the sorted key-value pairs.
uniq -c, which counts the number of adjacent records with the same key.
• Read the log file
• Split each line and extract the requested URL (field $7)
• Sort alphabetically by URL
• Count occurrences of each distinct URL
• Sort by count in descending order
• Output the first 5 lines
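The steps above correspond to a pipeline along the lines of `awk '{print $7}' | sort | uniq -c | sort -r -n | head -n 5` over the access log. A rough Python equivalent of the same computation (the log file name is hypothetical):

```python
# Extract the URL (7th whitespace-separated field), count occurrences,
# and print the five most common, mirroring the Unix pipeline steps above.
from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1   # $7 in awk == index 6 in Python

for url, n in counts.most_common(5):
    print(n, url)
```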
Downsides of MapReduce
• A MapReduce job can only start when all tasks in the preceding jobs (that generate its
inputs) have completed
• Mappers are often redundant: they just read back the same file that was just written
by a reducer and prepare it for the next stage of partitioning and sorting.
• Storing intermediate state in a distributed filesystem means those files are replicated
across several nodes, which is often overkill for such temporary data.
Solution: Dataflow Engines (such as Spark)
• Explicitly model the flow of data through several processing stages
• Parallelize work by partitioning inputs, and they copy the output of one function over
the network to become the input to another function
• All joins and data dependencies in a workflow are explicitly declared, the scheduler
has an overview of what data is required where, so it can make locality optimizations
Stream Processing
Load Balancing (to one consumer) & Fan-out (to all consumers)
Acknowledgments and Redelivery
Partition (can be on different machines) & Topic (same type of messages)
Uses of Stream Processing
• Complex Event Processing (CEP)
Describe the patterns of events that should be detected.
• Stream Analytics
Aggregations and statistical metrics of events (rate, rolling average)
• Search on Streams
Search for individual events (mentioning of companies, products, topics)
Keeping Systems in Sync & Stream Joins
Change Data Capture (CDC)
Dual Writes Issue
Log Compaction: the storage engine periodically looks for log records with the same
key, throws away any duplicates, and keeps only the most recent update for each key
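A minimal sketch of that compaction step; the record format (key, value) and the tombstone convention are simplifying assumptions, not a real storage-engine implementation.

```python
# Log compaction: scan the log in order and keep only the most recent record
# for each key. A value of None stands for a deletion (tombstone).
def compact(log):
    latest = {}
    for key, value in log:          # later records overwrite earlier ones
        latest[key] = value
    # Drop tombstones and keep one record per key.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user:1", "alice"), ("user:2", "bob"),
       ("user:1", "alicia"), ("user:2", None)]
print(compact(log))   # [('user:1', 'alicia')]
```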
Stream Joins
Stream-Stream Joins (Window Join)
Capture the click events corresponding to onsite search events to compute click-through rate (CTR);
matched by session ID.
Stream-Table Joins (Stream Enrichment)
Join user activity stream events and user profiles in database.
Can load a copy of the database into the stream processor.
Table-Table Joins (Materialized View Maintenance)
Twitter timeline cache
• When user u sends a new tweet, it is added to the timeline of every user who is following u.
• When a user deletes a tweet, it is removed from all users’ timelines.
• When user u1 starts following user u2, recent tweets by u2 are added to u1’s timeline.
• When user u1 unfollows user u2, tweets by u2 are removed from u1’s timeline.
Microbatching
Break the stream into small blocks, treat each block like a miniature batch process
The batch size is typically around one second
Apache Spark: A Unified Engine for Large-scale Distributed Data Processing
The central thrust of the Spark project was to bring in ideas borrowed from Hadoop MapReduce, but to enhance the system: make it highly fault tolerant and parallel, support in-memory storage for intermediate results between iterative and interactive map and reduce computations, offer easy and composable APIs in multiple languages as a programming model, and support other workloads in a unified manner.
Speed
• Take advantage of efficient multithreading and parallel processing
• Directed Acyclic Graph (DAG): efficient computational graph across workers on the cluster
• Generate compact code for execution
Ease of Use
• Resilient Distributed Dataset (RDD)
• a fundamental abstraction of a simple logical data structure upon which DataFrames and Datasets are constructed
Modularity
• Components: Spark SQL, Spark Structured Streaming, Spark MLlib, GraphX
• API: Scala, SQL, Python, Java, R
Extensibility
• Decouple the storage and compute
• Spark’s DataFrameReaders and DataFrameWriters can also be extended to
read data from other sources, such as Apache Kafka, Kinesis, Azure Storage,
and Amazon S3, into its logical data abstraction, on which it can operate.
Distributed Data and Partitions
A distributed scheme of breaking up data into chunks or partitions allows Spark executors to process only data that is close to them, minimizing network bandwidth. That is, each executor’s core is assigned its own data partition to work on.
Apache Spark’s Distributed Execution
Spark driver: requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
Spark Job/Stage/Task, Transformation/Action
The Spark driver converts a Spark application into one or more Spark jobs. It then transforms each job into a DAG, Spark’s execution plan.
As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel.
Each task maps to a single core and works on a single partition of data.
Transformations, Actions, and Lazy Evaluation
• All transformations are evaluated lazily.
• Lazy evaluation lets Spark optimize transformations into stages for more efficient execution.
• Nothing in a query plan is executed until an action is invoked.
Narrow and Wide Transformations
Narrow: a single output partition can be computed from a single input partition, e.g., filter(), contains()
Wide: data from other partitions is read in, combined, and written to disk, e.g., groupBy(), orderBy()
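A hedged PySpark sketch of the ideas on this slide: filter() is a narrow transformation, groupBy().count() is a wide one, and nothing runs until an action is invoked. The column names and rows are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NarrowVsWide").getOrCreate()

df = spark.createDataFrame(
    [("CA", "red", 20), ("TX", "blue", 10), ("CA", "blue", 5)],
    ["state", "color", "count"],
)

narrow = df.filter(df.state == "CA")          # narrow: per-partition, no shuffle
wide = narrow.groupBy("color").count()        # wide: requires a shuffle

wide.show()                                   # action: triggers the actual job
spark.stop()
```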
Sample Code: Counting M&Ms for the Cookie Monster
A Spark program that reads a file with over 100,000 entries (where each row or line has a <state, mnm_color, count>) and computes and aggregates the counts for each color and state.
The Spark program distributes the work across partitions and returns the final aggregated counts.
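A hedged sketch of such a program, assuming a CSV file with State, Color, and Count columns; the file path is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("MnMCount").getOrCreate()

# Read the M&M dataset with a header and inferred schema.
mnm_df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("data/mnm_dataset.csv"))

# Aggregate counts per state and color, highest totals first.
count_mnm_df = (mnm_df
                .select("State", "Color", "Count")
                .groupBy("State", "Color")
                .agg(count("Count").alias("Total"))
                .orderBy("Total", ascending=False))

count_mnm_df.show(10, truncate=False)
spark.stop()
```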
Apache Spark’s Structured APIs
The RDD is the most basic abstraction in Spark
• Dependencies
instructs Spark how an RDD is constructed with its inputs
• Partitions (with some locality information)
ability to parallelize computation on partitions across executors
• Compute function: Partition => Iterator[T]
produces an Iterator[T] for the data that will be stored in the RDD
Basic Python data types in Spark
Python structured data types in Spark
Schemas and Creating DataFrames
A schema defines the column names and associated data types for a DataFrame.
schema = "author STRING, title STRING, pages INT"
blogs_df = spark.createDataFrame(data, schema)
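A self-contained version of the two lines above, as a minimal sketch; the sample rows are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaExample").getOrCreate()

# DDL-style schema string defining column names and types.
schema = "author STRING, title STRING, pages INT"
data = [("Jules", "Spark Notes", 120), ("Brooke", "Streaming Notes", 95)]

blogs_df = spark.createDataFrame(data, schema)
blogs_df.printSchema()
blogs_df.show()
```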
Spark SQL and the Underlying Engine
• Unifies Spark components and permits abstraction to DataFrames/Datasets in Java, Scala, Python, and R, which simplifies working with structured data sets.
• Connects to the Apache Hive metastore and tables.
• Reads and writes structured data with a specific schema from structured file formats (JSON, CSV, Text, Avro, Parquet, ORC, etc.) and converts data into temporary tables.
• Offers an interactive Spark SQL shell for quick data exploration.
• Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors.
• Generates optimized query plans and compact code for the JVM, for final execution.
Spark SQL and DataFrames
Create SQL Database
Create Table
Create View
Views disappear after Spark application terminates
Global: visible across all SparkSessions on a given cluster
Session-scoped: visible only to a single SparkSession
Viewing the Metadata
Reading Tables into DataFrames
Reading from Database using JDBC
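A hedged sketch touching each item on this slide: create a database and a table, create session-scoped and global views, inspect the catalog metadata, read a table back into a DataFrame, and read from an external database over JDBC. All names, table definitions, and connection settings are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLTables").getOrCreate()

# Create a database and a managed table.
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("""CREATE TABLE IF NOT EXISTS flights
             (date STRING, delay INT, origin STRING, destination STRING)""")

df = spark.table("flights")                        # read a table into a DataFrame
df.createOrReplaceTempView("flights_view")         # session-scoped view
df.createOrReplaceGlobalTempView("flights_gview")  # global view: query as global_temp.flights_gview

print(spark.catalog.listDatabases())               # viewing the metadata
print(spark.catalog.listTables())

# Reading from an external database using JDBC.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/mydb")
           .option("dbtable", "public.us_delay_flights")
           .option("user", "username")
           .option("password", "password")
           .load())
```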
Higher-Order Functions in Spark SQL and DataFrames
transform()
filter()
exists()
Union two different DataFrames with the same schema together
Join
Pivot
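A hedged sketch of the higher-order functions named above, applied to an array column in Spark SQL (union, join, and pivot are not shown); the temperature data is made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HigherOrderFns").getOrCreate()

# One row per reading set; "celsius" is an array column.
spark.createDataFrame(
    [([35, 36, 32],), ([40, 42],)], ["celsius"]
).createOrReplaceTempView("tC")

spark.sql("""
  SELECT celsius,
         transform(celsius, t -> ((t * 9) div 5) + 32) AS fahrenheit,  -- map each element
         filter(celsius, t -> t > 38)                   AS high_temps,  -- keep matching elements
         exists(celsius, t -> t = 38)                   AS has_38       -- does any element match?
  FROM tC
""").show(truncate=False)
```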
Structured Streaming
Data stream as an unbounded table
Append mode: only append new rows
Update mode: update rows changed since last trigger
Complete mode: write entire result table
Structured Streaming Query
Define input source
Transform data
Define output sink & mode
Specify processing details
Triggering details
• Default: next micro-batch is triggered as soon as the previous one has completed
• Processing time with trigger interval: will trigger micro-batches at fixed interval
• Once: will process all available data in a single micro-batch and then stop (useful when the query is started on a custom schedule)
• Continuous: will process data continuously instead of in micro-batches, experimental
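A hedged sketch of the four steps listed above, using the built-in "rate" test source and the console sink; the grouping column, trigger interval, and checkpoint path are arbitrary choices.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# 1. Define input source
lines = (spark.readStream.format("rate")
         .option("rowsPerSecond", 10)
         .load())

# 2. Transform data
counts = lines.groupBy((lines.value % 10).alias("bucket")).count()

# 3 & 4. Define output sink, output mode, and processing details
query = (counts.writeStream
         .format("console")
         .outputMode("complete")
         .trigger(processingTime="10 seconds")
         .option("checkpointLocation", "/tmp/checkpoints/streaming_sketch")
         .start())

query.awaitTermination()
```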
Streaming Data Sources and Sinks
Files
Each file must appear in the directory listing atomically; that is, the whole file must be available at once for reading, and once it is available, the file cannot be updated or modified.
Structured Streaming will process the file when the engine finds it (using directory listing) and internally mark it as processed. Any changes to that file will not be processed.
Apache Kafka
Fields of returned DataFrame: key, value, topic, partition,
offset, timestamp, timestampType
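A hedged sketch of reading from Kafka with Structured Streaming; it assumes the spark-sql-kafka connector package is on the classpath, and the broker address and topic name are placeholders. The returned DataFrame carries the fields listed above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaSource").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "host1:9092")
          .option("subscribe", "events")
          .load())

# key and value arrive as binary; cast them to strings before processing.
decoded = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
    "topic", "partition", "offset", "timestamp")

query = (decoded.writeStream
         .format("console")
         .outputMode("append")
         .start())
```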
Stateful Streaming Aggregations
Stream-Stream Join
Aggregations with Event-Time Windows
A watermark defines how long the engine will wait for late data to arrive.
With these time constraints for each event, the processing engine can automatically calculate how long events need to be buffered to generate correct results, and when the events can be dropped from the state.
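A hedged sketch of an event-time windowed aggregation with a watermark; the 10-minute watermark, 5-minute window, and the rate-source stand-in for a real event stream are arbitrary choices.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()

# Use the built-in rate source as a stand-in for a real event stream.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .select(col("timestamp").alias("eventTime"),
                  (col("value") % 3).alias("sensor_id")))

windowed_counts = (events
    .withWatermark("eventTime", "10 minutes")              # wait up to 10 min for late data
    .groupBy(window("eventTime", "5 minutes"), "sensor_id")
    .count())

query = (windowed_counts.writeStream
         .format("console")
         .outputMode("update")
         .start())
```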
Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Joblib
Joblib can be used for hyperparameter tuning as it automatically broadcasts a copy of your
data to all your workers, which then create their own models with different hyperparameters
on their copies of the data.
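A hedged sketch of that pattern, assuming the optional joblibspark package and scikit-learn are installed; the estimator, parameter grid, and synthetic data are illustrative only.

```python
from joblib import parallel_backend
from joblibspark import register_spark        # registers the "spark" backend with joblib
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

register_spark()

X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3)

with parallel_backend("spark", n_jobs=3):
    search.fit(X, y)      # each candidate model is trained on a Spark worker

print(search.best_params_)
```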
Koalas
Pandas is a very popular data analysis and manipulation library in Python, but it is limited to
running on a single machine. Koalas is an open-source library that implements the Pandas
DataFrame API on top of Apache Spark, easing the transition from Pandas to Spark.
AWS SageMaker
Azure ML